My blog on my research on Twitter

Some of you have seen the very interesting article from the Facebook Data team about the echo chamber of social networks. Slate magazine has a nice review of this recent paper by Bakshy and Adamic. They come to an illustrative conclusion for someone with 100 weak ties and 10 strong ties:

**“The amount of information spread due to weak and strong ties would be 100*0.15 = 15, and 10*0.50 = 5 respectively, so in total, people would end up sharing more from their weak tie friends.”**
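The arithmetic in the quote can be spelled out in a few lines. The tie counts and the per-tie sharing probabilities (0.15 and 0.50) are the illustrative numbers from the quote, not measurements.

```python
# Illustrative numbers taken from the quote, not measured values
weak_ties, p_share_weak = 100, 0.15
strong_ties, p_share_strong = 10, 0.50

# Expected number of shared items = number of ties * sharing probability per tie
shared_via_weak = weak_ties * p_share_weak
shared_via_strong = strong_ties * p_share_strong

print(shared_via_weak, shared_via_strong)
```

Even though each strong tie is more than three times as likely to pass something on, the weak ties win on sheer numbers.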

So I thought this would be an interesting thing to revisit on Twitter. Being quite new to networkX, I also thought it would make a good research example.

I use the communities of topological specialists that I computed earlier (see last post) as my data. So I have a community of 100 people who have very often been tagged with the word “publishing”. The edgelists were precomputed in Ruby by analyzing the 3200 tweets of each person (and their retweets) and saved to disk in an edgelist format.

    import networkx as nx

    # project_name1 is the file name prefix of the community being analyzed
    AT = nx.read_edgelist('%s_AT.edgelist' % project_name1, nodetype=str,
                          data=(('weight', float),), create_using=nx.DiGraph())
    RT = nx.read_edgelist('%s_RT.edgelist' % project_name1, nodetype=str,
                          data=(('weight', float),), create_using=nx.DiGraph())

The AT network is the network where edges between Twitter users correspond to @replies that are not retweets. It serves as a proxy for tie strength: if a person addresses another person 5 times, their tie strength is 5.

The RT network is the network where edges between Twitter users correspond to retweets of a user's material. If I retweet somebody 10 times (when exactly does not matter here), this tie has a bandwidth, or simply a strength, of 10.
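To make the two networks concrete, here is a minimal sketch of how weighted edgelists map to tie strengths. It assumes one `source target weight` triple per line; the user names and weights are made up, and a plain nested dictionary stands in for the networkx `DiGraph`.

```python
def parse_weighted_edgelist(lines):
    """Turn 'source target weight' lines into graph[source][target] = weight."""
    graph = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target, weight = line.split()
        graph.setdefault(source, {})[target] = float(weight)
    return graph

# Made-up example data
at_lines = ["alice bob 5", "bob alice 1", "carol alice 2"]
rt_lines = ["alice bob 10", "carol bob 3"]

AT_toy = parse_weighted_edgelist(at_lines)
RT_toy = parse_weighted_edgelist(rt_lines)

# alice addressed bob five times, so that tie has strength 5 ...
print(AT_toy["alice"]["bob"])
# ... and she retweeted him ten times over the same tie
print(RT_toy["alice"]["bob"])
```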

So the question is: if I look at all of the AT ties that have strength 1, how many retweets were transmitted through those ties? If you do this consistently for every tie strength, you end up with a chart of how many ties of a certain strength there are and how much was transferred over them.

The code that does this in networkX is the following:

    import math

    # How much information do the ties carry according to their strength?
    result = []
    thresholds = list(range(1, 21))  # tie strengths 1..20
    for threshold in thresholds:
        # Collect all AT ties with exactly this strength as (from_node, to_node, strength)
        at_edges = []
        # networkx >= 2.0; older releases called this adjacency_iter()
        for n, nbrs in AT.adjacency():
            for nbr, eattr in nbrs.items():
                data = eattr['weight']
                if data == threshold:
                    at_edges.append((n, nbr, data))
        # Look up the same pair of nodes in the RT graph and capture how many
        # retweets have been exchanged over this tie
        rt_edges = []
        for edge in at_edges:
            try:
                value = RT[edge[0]][edge[1]]['weight']
            except KeyError:
                value = 0  # no retweets over this tie
            if value > 0:
                rt_edges.append((edge[0], edge[1], value))
        # Save the number of AT ties, the summed retweets and the tie strength
        result.append([len(at_edges), math.fsum(x[2] for x in rt_edges), threshold])

You end up with one row per tie strength holding the total number of AT ties and the total number of retweets. You can plot this using matplotlib:
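The same table can also be built in a single pass by grouping the AT ties by strength instead of looping over the thresholds. This is only a sketch of an alternative, not the code used for the charts: the nested dictionaries stand in for the `AT` and `RT` graphs, and the weights are invented.

```python
from collections import defaultdict

def retweets_by_tie_strength(at, rt):
    """Return {strength: [number of AT ties, retweets over those ties]}."""
    table = defaultdict(lambda: [0, 0.0])
    for source, neighbours in at.items():
        for target, strength in neighbours.items():
            table[strength][0] += 1
            # A missing RT edge means no retweets travelled over this tie
            table[strength][1] += rt.get(source, {}).get(target, 0.0)
    return dict(table)

# Invented toy graphs: graph[source][target] = weight
at_toy = {"a": {"b": 1, "c": 2}, "b": {"c": 1}, "c": {"a": 2}}
rt_toy = {"a": {"c": 4}, "b": {"c": 1}, "d": {"a": 7}}

table = retweets_by_tie_strength(at_toy, rt_toy)
for strength in sorted(table):
    n_ties, n_retweets = table[strength]
    print(strength, n_ties, n_retweets)
```

Retweets between people who never exchanged an @reply (here `d` retweeting `a`) do not show up in this table at all.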

    import matplotlib.pyplot as plt

    plt.plot(thresholds, [row[0] for row in result], 'b-', label='# of AT ties with strength x')
    plt.plot(thresholds, [row[1] for row in result], 'g-', label='# of retweets flowing through these ties')
    plt.legend()
    plt.show()

So if it is true that

**“The amount of information spread due to weak and strong ties would be 100*0.15 = 15, and 10*0.50 = 5 respectively, so in total, people would end up sharing more from their weak tie friends.”**

It is indeed the case that the stronger the ties get, the fewer of them we have.

But then we should also see the majority of retweets being transmitted over ties of strength 1. Instead, the majority of retweets gets transmitted through ties of strength 2 and 3.
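Which tie strength carries the most retweets can be read off the `result` list directly. The sketch below assumes the `[number of ties, retweets, strength]` row layout; the numbers are placeholders chosen only to show the mechanics, not the values behind the chart.

```python
# Placeholder rows in the layout [number of AT ties, retweets, tie strength]
example_result = [
    [120, 40.0, 1],
    [60, 75.0, 2],
    [30, 55.0, 3],
    [10, 12.0, 4],
]

total_retweets = sum(row[1] for row in example_result)

# Strength whose ties carried the most retweets in absolute terms
busiest = max(example_result, key=lambda row: row[1])
share = busiest[1] / total_retweets

print("strength", busiest[2], "carries", round(100 * share, 1), "% of retweets")
```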

I think this is interesting. I will now run this on 100 of those communities and plot the average. What do you think about this graph and the approach? Is it valid?
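Averaging over many communities could look like the sketch below. It assumes one `result` list per community, all built with the same thresholds; the helper name `average_results` and the two tiny communities are hypothetical.

```python
def average_results(results_per_community):
    """Average tie counts and retweet counts per tie strength across communities.

    Each element is a list of [number of ties, retweets, strength] rows, and
    every community is assumed to use the same strengths in the same order.
    """
    n = len(results_per_community)
    averaged = []
    for rows in zip(*results_per_community):
        strength = rows[0][2]
        mean_ties = sum(row[0] for row in rows) / n
        mean_retweets = sum(row[1] for row in rows) / n
        averaged.append([mean_ties, mean_retweets, strength])
    return averaged

# Two hypothetical communities with thresholds 1 and 2
community_a = [[10, 4.0, 1], [6, 8.0, 2]]
community_b = [[20, 2.0, 1], [2, 4.0, 2]]

averaged = average_results([community_a, community_b])
print(averaged)
```

A plain mean lets large communities dominate; normalizing each community by its own totals first would be another reasonable choice.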

I also thought there needs to be a tie strength lower than 1, which for me is a retweet happening without any prior AT interaction.

    import numpy as np

    # Get the "0" strength tie: a retweet happened although there is no AT tie (in either direction)
    AT_undir = nx.Graph(AT)  # undirected, so we find @replies in both directions
    rt_0_edges = []
    for n, nbrs in RT.adjacency():
        for nbr, eattr in nbrs.items():
            data = eattr['weight']
            if not AT_undir.has_edge(n, nbr):
                rt_0_edges.append((n, nbr, data))

    # Insert our results as the first data point (there are no AT ties of strength 0)
    result.insert(0, [0, math.fsum(x[2] for x in rt_0_edges), 0])
    thresholds.insert(0, 0)

    # Retweets per AT tie; left as NaN wherever a strength has no AT ties (e.g. strength 0)
    rt_counts = np.array([row[1] for row in result], dtype="float32")
    at_counts = np.array([row[0] for row in result], dtype="float32")
    percentages = np.divide(rt_counts, at_counts,
                            out=np.full_like(rt_counts, np.nan), where=at_counts > 0)

    # Plot it
    fig = plt.figure()
    ax1 = fig.add_subplot(111)
    ax1.plot(thresholds, [row[0] for row in result], 'b-', label='# of AT ties with strength x')
    ax1.plot(thresholds, [row[1] for row in result], 'g-', label='# of retweets flowing through these ties')
    ax1.legend(loc=2)
    ax2 = ax1.twinx()
    ax2.plot(thresholds, percentages, 'r-', label='% of #RT/#AT')
    ax2.legend()

So if we add this to the graphic we end up with a different picture:

So what we see right now looks more like the “strength of NO ties” (would make a nice paper title :)). People retweet each other even if they have had no interaction before. It will be interesting to explore whether this assumption holds for all of the other 100 datasets.

P.S. I thought about adding another tie strength of e.g. 0.5. This would correspond to a person FOLLOWING the other person, which is definitely less than writing an @reply to this person but more than nothing. (We all know following each other on Twitter is cheap.)
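One way to fold following into the tie strength, as a sketch: use the @reply count when there is one, fall back to 0.5 when there is only a follow relationship, and 0 otherwise. The function name, the dictionary-based graphs and the value 0.5 are assumptions for illustration. Whether the follow edge should be checked in one direction or both is a modelling choice this sketch does not settle; it only checks source to target.

```python
FOLLOW_STRENGTH = 0.5  # assumed weight for "follows but never addressed"

def tie_strength(source, target, at, ff):
    """Strength of the tie source -> target.

    at[source][target] is the number of @replies, ff[source] is the set of
    accounts that source follows.
    """
    replies = at.get(source, {}).get(target, 0)
    if replies > 0:
        return replies
    if target in ff.get(source, set()):
        return FOLLOW_STRENGTH
    return 0

# Invented toy data
at_toy = {"alice": {"bob": 3}}
ff_toy = {"alice": {"bob", "carol"}, "bob": {"alice"}}

print(tie_strength("alice", "bob", at_toy, ff_toy))    # has @replies
print(tie_strength("alice", "carol", at_toy, ff_toy))  # only follows
print(tie_strength("carol", "alice", at_toy, ff_toy))  # no relationship
```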

Cheers

Thomas

We have all heard about the importance of so-called opinion leaders, mavens, influencers or simply central people. I have covered them in a recent article on my blog.

Using the same approach, in this blog article I will try to find out how much such opinion leadership differs across communities and which factors predict it best. For this task I collected 100 interest-based communities. An overview of the communities and attempts can be found here in my recent blogpost and is also shown below. The table shows an overview of the dataset of interest-based communities. These contain 100 people each. They have been chosen because they have been listed very often for the given topic. As we can see in the table, the number of tweets and exchanged retweets differs slightly across the communities.

Given this data I can now ask the question:

How much does a structural opinion leader position in such a community affect the number of retweets you receive in this community?

This question is quite interesting since it has been posed as a hypothesis in a myriad of studies, not only in the social media context but also in medicine or advertising. It corresponds to the famous saying “the messenger is the message” (as Thomas Valente likes to put it).

In this context I recently stumbled across a survey of Twitter users who were asked why they retweet information (see below). As you can see, 92% of users state that it depends on the content, but a striking 84% say it is about the personal connection to the person. From my point of view this means nothing other than the person’s position in the network. So I will surely retweet somebody who is highly respected and embedded in my interest-based community, but I won’t see much value in retweeting people who are on the periphery of my community. Therefore central opinion leaders should be able to generate more retweets. We will check this assumption across the 100 interest-based topic communities.

(P.S. The question of whether the message itself has an influence on whether a person gets retweeted will be covered in another blogpost. This task is somewhat tricky because we have to come up with an idea of how to measure whether content was “interesting”. Ideas are welcome :))

To check which centrality metrics are good at predicting retweets in our network, I have chosen the standard ones and then computed a Pearson correlation between those and the number of retweets a person received from the community.

I have generated three types of networks for each of these communities: a friend-and-follower network, basically capturing the attention people pay to each other; an interaction network computed from the @replies that people exchange with each other; and an information diffusion network computed from the retweets people exchange with each other. To read those into networkX I used this code:

    # The FF file name is assumed to follow the same pattern as the AT and RT files
    FF = nx.read_edgelist('%s_FF.edgelist' % project, nodetype=str,
                          data=(('weight', float),), create_using=nx.DiGraph())
    AT = nx.read_edgelist('%s_AT.edgelist' % project, nodetype=str,
                          data=(('weight', float),), create_using=nx.DiGraph())
    RT = nx.read_edgelist('%s_RT.edgelist' % project, nodetype=str,
                          data=(('weight', float),), create_using=nx.DiGraph())

To determine central people I have used the standard network measures already implemented in networkX:

    # AT network
    dAT = nx.degree_centrality(AT)
    dAT_in = nx.in_degree_centrality(AT)
    dAT_out = nx.out_degree_centrality(AT)
    dAT_closeness = nx.closeness_centrality(AT)
    dAT_pagerank = nx.pagerank(AT)

    # FF network
    dFF = nx.degree_centrality(FF)
    dFF_in = nx.in_degree_centrality(FF)
    dFF_out = nx.out_degree_centrality(FF)
    dFF_closeness = nx.closeness_centrality(FF)
    dFF_pagerank = nx.pagerank(FF)  # needed for the FF Pagerank comparison

    # RT network
    dRT = nx.degree_centrality(RT)
    dRT_in = nx.in_degree_centrality(RT)
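For orientation, in-degree centrality as networkx computes it is the number of distinct incoming neighbours divided by n - 1; edge weights are not used. In the RT network it therefore reflects how many different people retweeted someone, not the total number of retweets. The sketch below reproduces both quantities by hand on an invented toy graph stored as nested dictionaries, so the two can be compared.

```python
def in_degree_centrality(graph, nodes):
    """Distinct incoming neighbours divided by (n - 1), ignoring weights."""
    scale = 1.0 / (len(nodes) - 1)
    counts = {node: 0 for node in nodes}
    for neighbours in graph.values():
        for target in neighbours:
            counts[target] += 1
    return {node: count * scale for node, count in counts.items()}

def weighted_in_degree(graph, nodes):
    """Sum of incoming edge weights, e.g. total retweets received."""
    totals = {node: 0.0 for node in nodes}
    for neighbours in graph.values():
        for target, weight in neighbours.items():
            totals[target] += weight
    return totals

# Invented retweet graph: rt_toy[retweeter][author] = number of retweets
rt_toy = {"a": {"c": 5}, "b": {"c": 1, "a": 2}}
nodes = ["a", "b", "c"]

centrality = in_degree_centrality(rt_toy, nodes)
retweets = weighted_in_degree(rt_toy, nodes)
print(centrality)
print(retweets)
```

If the total number of retweets is the quantity of interest, the weighted in-degree is the closer match.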

To see how well the centrality measures in the FF and AT networks correlate with the number of retweets received (this is the dRT_in value in our retweet network), I computed the Pearson correlations for each of those thematic communities. Using the 4 centrality metrics for the AT network and 4 centrality metrics for the FF network, we have a total of 8 different combinations:

1. AT Indegree vs. Retweet Indegree - The more I am mentioned ...

2. AT Outdegree vs. Retweet Indegree - The more I mention others ...

3. AT Closeness vs. Retweet Indegree - The closer I am in the network to others ...

4. AT Pagerank vs. Retweet Indegree - The more authority I possess ...

5. FF Indegree vs. Retweet Indegree - The more people follow me ...

6. FF Outdegree vs. Retweet Indegree - The more people I follow ...

7. FF Closeness vs. Retweet Indegree - ~ The more information I consume ...

8. FF Pagerank vs. Retweet Indegree - The more authority I possess ...

... the more my tweets are retweeted by others in the community.

To compute the correlation I used the scipy stats module. For example, to compute the correlation between the FF_in network and the RT_in network I used this code:

    from scipy import stats

    values = match_values(dFF_in, dRT_in)
    output = stats.pearsonr(values[0], values[1])

The output contains r and p as a simple pair.
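For readers who want to see what r measures, the coefficient itself can be computed by hand as below. This sketch covers only r, not the p-value that `pearsonr` also returns, and the two input lists are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up centrality values and retweet counts for five people
followers_centrality = [0.1, 0.2, 0.4, 0.5, 0.8]
retweets_received = [1, 3, 4, 6, 9]

r = pearson_r(followers_centrality, retweets_received)
print(r)
```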

As you can see above, I also used a function called match_values. This function makes sure that the two vectors have the same size. For example, if a person was not retweeted even once, this person won’t show up in the retweet network, and therefore I won’t be able to compute how many retweets this person has received. (I could set it to zero, but I preferred to skip these cases.)
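The post does not show `match_values`, so the following is only a guess at what such a helper could look like: it keeps the people that appear in both dictionaries and returns two lists in the same order. The toy dictionaries are invented.

```python
def match_values(first, second):
    """Align two {person: value} dicts on the people present in both.

    Returns a pair of lists in matching order; people missing from either
    dict (e.g. absent from the retweet network) are skipped, not set to zero.
    """
    common = sorted(set(first) & set(second))
    return [first[p] for p in common], [second[p] for p in common]

# Invented centrality values; "carol" does not appear in the retweet network
dFF_in_toy = {"alice": 0.6, "bob": 0.3, "carol": 0.1}
dRT_in_toy = {"alice": 0.5, "bob": 0.2}

values = match_values(dFF_in_toy, dRT_in_toy)
print(values)
```

Skipping people instead of assigning them zero shrinks the sample, so the number of matched people is worth reporting next to each correlation.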

Open the correlations in google docs

As you can see in the table above, four centrality metrics in particular yielded the most significant correlations (p):

1. AT Indegree vs. Retweet Indegree - The more I am mentioned

5. FF Indegree vs. Retweet Indegree - The more people follow me

8. FF Pagerank vs. Retweet Indegree - The more authority I possess

2. AT Outdegree vs. Retweet Indegree - The more I mention others

The closeness values, and pagerank in the AT network, did not do so well when correlated with the number of retweets that the person received. (There might be a problem here, because the Pearson correlation assumes normally distributed data but our centrality values are highly skewed. I will have to investigate this.)

So what did we learn from this? When trying to capture opinion leadership in a community, it seems to matter
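One common option for skewed data is a rank correlation such as Spearman's rho, which is the Pearson correlation of the ranks (scipy ships it as `scipy.stats.spearmanr`). This is not what the analysis above used; the sketch below, with made-up numbers, only shows the idea. Tied values get the average of their ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Made-up, heavily skewed values: one person dominates both measures
centrality = [0.01, 0.02, 0.03, 0.05, 0.90]
retweets = [0, 1, 1, 2, 400]

rho = spearman_rho(centrality, retweets)
print(rho)
```

Because only the ordering matters, the single extreme person no longer dominates the coefficient the way it would with raw values.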

- How often I am mentioned by others
- How many people follow me
- What my pagerank in the friend and follower network is and
- How often I mention others (which I think is a bit surprising)

If we were to create a “how-to-be-retweeted” document, I would recommend that people interact with others in their community (and hope that they get mentioned sometimes, too), try to follow interesting people from the community (and hope that they follow back), and so hope to achieve a somewhat central position in this community. Of course this is easier said than done, since in the end it is also about what I write. This dualism of content and structure is indeed an interesting one, since we can speculate that the fact that those people have become central in the community is also a result of their interesting content or of an authority that goes beyond what we can measure on Twitter.

In the next blogpost I will try to use what we found, namely the most promising independent variables, and see if we can build a linear model that predicts the number of retweets I receive. It could turn out that the factors I found are highly correlated and load onto the same factor, and thus measure the same thing.

Cheers

Thomas