
Extended network flow model for Twitter

On Twitter the network between users is multiplex, meaning that people can hold several kinds of ties with each other: users can a) follow each other, b) interact with each other or c) retweet each other. These three types of ties manifest themselves in three different networks that can be laid on top of each other. This idea got me thinking. I stumbled upon a very interesting book chapter by Stephen Borgatti, who introduced a network flow model that in my eyes fits the Twitter network perfectly. The network model from his work is depicted below:

Network flow model

Borgatti divides his model into two kinds of phenomena, which are called backcloth and traffic in the original work of Atkin. By adapting this model to Twitter we can explain how and why the three types of ties that we have on Twitter can be laid on top of each other and how they influence information diffusion on Twitter. I have therefore made a version that shows how the concepts map onto Twitter, which is depicted below.

Extended network flow model for Twitter

The backcloth is the infrastructure that enables the traffic and the traffic consists of information flowing through the network. In the case of Twitter the backcloth corresponds to the cognitive similarities among Twitter users (see below), and to their friend and follower connections. The traffic layer consists of the interactions and flow of information that takes place on top of these phenomena.

Borgatti describes the four categories as follows: “The similarities category refers to physical proximity, co-membership in social categories and sharing of behaviors, attitudes and beliefs. Generally we do not see these items as social ties, but we do often see them as increasing the probabilities of certain relations and dyadic events.” This definition corresponds to the notion of implicit ties (ties we cannot directly see) between Twitter users. These implicit ties can be a shared interest, a shared location, a shared demographic, a shared audience and so on: basically every type of attribute that makes two Twitter users similar to each other. The idea that people with the same attributes tend to flock together is known as homophily, and the general process of people forming ties with similar people is called the selection mechanism (e.g. think of people who smoke becoming friends with other smokers).

The next three types of phenomena take place on so-called explicit ties, because these types of ties can actually be seen or measured explicitly on Twitter. Borgatti defines the social relations category as “the classic kinds of social ties that are ubiquitous [(which friend and follower ties are in Twitter)] and serve either as role-based or cognitive/affective ties. Role-based includes kinships and role-relations such as boss of, teacher of and friend of.[(In Twitter: follower of)] They can easily be non-symmetric [(which friend and follower ties are)].” It is apparent that these types of ties correspond exactly to the explicit follower ties, which share the same attributes and characteristics. When we think about the reasons why people become friends other than being similar, we stumble upon all the network effects that are a core part of the network literature. I have therefore indicated those with the back and forth arrow above social relations. On Twitter these types of processes take place every day: people follow prominent outlets, e.g. CNN (preferential attachment), become friends with friends of friends (triadic closure) or simply follow back a person that just followed them (reciprocity). There are many more such effects, but instead of going into detail here we will look at the next type of ties.

Borgatti describes the interactions category as “discrete and separate events that may occur frequently but then stop, such as talking with, fighting with, or having lunch with”. This category translates into the interactional (@mention) ties on Twitter, which have exactly these behavioral traits: people intentionally mention each other in tweets, but might also stop doing so for certain reasons. Depending on when one looks at two users on Twitter, this interactional connection might or might not exist at that point in time. According to the network flow model, the first reason why I would interact with someone is that I follow them, which makes perfect sense for Twitter. Now there are more reasons why people might interact with each other, and a number of those reasons are already covered in various information diffusion theories. One example is that people like to interact with others whom they perceive as opinion leaders for a topic. Another example is brokerage theory, which says that brokers tend to profit from interacting with two different groups. A third family of theories are the threshold models, in which people are believed to be lured into interaction or adoption once a certain threshold of their friends talks about a certain topic. Processes like these could easily be taking place on Twitter too.

Finally the flows category is described by Borgatti as “things such as resources, information and diseases that move from node to node. They may transfer (being only at one place at a time) and duplicate (as in information).” This definition translates directly into the explicit retweet ties that always exist when information is transferred from one actor to another. The final network layer follows the same reasoning as the one before: the first reason why I would retweet someone is that I follow that person and have already interacted with that person. The reasoning about information diffusion theories applies here too.

Finally I thought it would be nice to add the influence mechanism to this model, which basically means people becoming more similar to each other because of the networks that they already have. All three types of networks (friend and follower ties, @interactions and retweets) might have that effect. The classic influence example, non-smokers being friends with smokers and then starting to smoke, might be imaginable on Twitter too. Yet there are strong indications that this effect is much smaller than people believe it to be.

Using the network flow model, we came up with a nice ordering of the different concepts that surround network science and sociology and could connect these pieces to the Twitter network. I hope this extended network flow model was useful for you and I hope to hear some comments on it.



On the weakness of weak ties

A few months ago I wrote a blog post (https://twitterresearcher.wordpress.com/2012/01/17/the-strength-of-ties-revisited/) investigating tie strengths on Twitter and their influence on retweets. Well, it turns out that my analysis was lacking a lot of detail, so I redid it, considering more aspects than before. So let’s get started.


The data that I am using for this analysis is the following: each group consists of 100 people who have been highly listed for a given topic on Twitter, e.g. snowboarding or comedy or any other topical interest that people have on Twitter. There are 170 such groups, each consisting of exactly 100 members. (You can read how I created such groups in my recent blog posts here https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/ and here https://twitterresearcher.wordpress.com/2012/06/12/how-to-generate-interest-based-communities-part-2/.) In an abstract way you can imagine the structure of the network to look something like this:

The graphic above indicates that we only have the friend-follower ties on Twitter between those people. But indeed there are quite a few more ties between people, resulting in a multiplex network between them. This network consists of three layers:

  1. The friend-follower ties
  2. The @interaction ties (whenever a user mentions another user this corresponds to a tie)
  3. And finally the retweet ties (whenever a user retweets another user this corresponds to a tie)

Schematically this looks something like this:


Now when we think about ties between those people, especially with regard to tie strength, we can come up with a couple of different definitions of ties (I mentioned a couple of those in my blog post here https://twitterresearcher.wordpress.com/2012/05/24/tie-strength-in-twitter/):


  • No Tie: There are no ties between those people, neither in the friend and follower network nor in the @interaction network.
  • Non-reciprocated-friend-follower-tie: Person A follows a person B in the friend and follower network. Person B does not follow person A.
  • Reciprocated-friend-follower-tie: Person A follows person B. Person B follows person A.
  • Non-reciprocated-@-interaction-tie: Person A mentions person B EXACTLY one time. Person B does not mention person A.
  • Reciprocated-@-interaction-tie: Person A mentions person B EXACTLY one time. Person B mentions person A at least one time.

Valued ties:

  • Interaction tie with strength x: Person A mentions person B EXACTLY X times. (e.g. tie of strength 10 would mean person A has mentioned person B 10 times)

Bridging vs. bonding ties:

  • Bridging ties: We call bridging ties all of those ties that are BETWEEN groups (see schematic network graphic above the ties in red)
  • Bonding ties: We call bonding ties all of those ties that are INSIDE groups (see schematic network graphic above the ties in black)
  • Notice that our definition of bridging and bonding ties might differ a bit from the pure network perspective, where by definition bonding ties might have to have a certain strength, reciprocity and so on. Here we instead rely on the underlying groups, which we created artificially but which nicely represent users that strongly share a certain interest.
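
As a minimal sketch of the bridging/bonding distinction above, the following Python snippet labels a tie by comparing the groups of its two endpoints and sums retweets per label. The user names, group names and the assumption that every user belongs to exactly one group are made up for illustration; this is not the code used for the analysis.

```python
from collections import Counter


def tie_kind(source, target, group_of):
    """Return 'bonding' if both users are in the same group, else 'bridging'."""
    return "bonding" if group_of[source] == group_of[target] else "bridging"


def count_retweets_by_kind(retweets, group_of):
    """Sum retweet counts separately for bonding and bridging ties.

    retweets: iterable of (retweeter, original_author, count) triples.
    """
    totals = Counter()
    for source, target, count in retweets:
        totals[tie_kind(source, target, group_of)] += count
    return totals


# Hypothetical toy data: two interest groups with two members each.
group_of = {"anna": "tennis", "ben": "tennis", "carl": "comedy", "dana": "comedy"}
retweets = [("anna", "ben", 3), ("ben", "carl", 1), ("dana", "carl", 2)]

totals = count_retweets_by_kind(retweets, group_of)
print(totals["bonding"], totals["bridging"])  # prints: 5 1
```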

Research Question:

Having all those definitions of ties we can now come up with a number of observations regarding the information diffusion between those people. The information diffusion is captured in the retweet network (see the third layer in the schematic graphic) and the corresponding ties. In general we want to look at how the different tie types affect the information diffused (retweets) between those people.

Analysis per Group:

To get an overview of the data I will first have a look at how many retweets in total have been exchanged between the analyzed groups. I count how many retweets took place inside the group (blue) and between the groups (red). Each of the 170 groups is shown below:

Approximately 214.000 retweets took place between groups (red) and 414.000 retweets took place inside the groups (blue). In the graphic above we can clearly see the differences between the different interest groups. I’ve ordered the groups in ascending order of retweets inside the community, which shows that there are some groups that focus mostly on retweets inside the group (e.g. tennis or astronomy_physics) while other groups get most of their retweets from outside of their own group and do not retweet each other so much inside the group (e.g. poltics_news or liberal). Although we cannot clearly say that the group has an influence if it gets retweeted from outside the group, we can say that the members of the group at least have the choice to retweet other members of the group. If these members do not retweet each other there might be a reason, about which you are free to speculate (or which I will try to answer in the next blog post).

On the influence of types of ties on retweets

Given the different types of ties described above we can now ask the most important question:

How do the different non-valued bridging ties differ from the bonding ties in regard to their influence on the information diffused through those ties?

What do I mean by that? Having all retweets between the persons in the sample, I want to find out through which ties these retweets have flown. So for example, given that A has retweeted B three times, I ask which ties (that A and B already have in the friend and follower network or the interaction network) were “responsible” for this flow of information between those actors.

EXAMPLE: If two people have mentioned each other at least once, I will assume (according to the definition above) that they hold a reciprocated interaction tie. I will then assume that this tie was “responsible” for the retweet between them. NOTICE: This is a simplifying assumption, because I assume that if there is a stronger tie it was always responsible for the retweet, and not the underlying weaker tie (e.g. in the form of a friend and follower tie).

The assumption that I make here is therefore:

  • >  means this connection is supposed to be stronger
  • AT_reciprocated_tie > AT_directed_tie_with_strength_1
  • AT_directed_tie_with_strength_1 > FF_reciprocated_tie
  • FF_reciprocated_tie > FF_non_reciprocated_tie
  • FF_non_reciprocated_tie > No Tie
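
The ordering above can be sketched as a function that returns the strongest tie type that applies to a pair. This is only an illustration under my own assumptions: the `follows` and `mentions` structures are hypothetical, only the direction from a to b is checked for directed ties, and unreciprocated ties with more than one mention are reported as `AT_valued_tie` so that they can be handled with the valued ties.

```python
def strongest_tie(a, b, follows, mentions):
    """Classify the tie from a to b by the strongest tie type that applies.

    follows:  set of (follower, followed) pairs (hypothetical structure).
    mentions: dict mapping (author, mentioned_user) to a mention count.
    """
    a_to_b = mentions.get((a, b), 0)
    b_to_a = mentions.get((b, a), 0)
    if a_to_b >= 1 and b_to_a >= 1:
        return "AT_reciprocated_tie"
    if a_to_b == 1:
        return "AT_directed_tie_with_strength_1"
    if a_to_b > 1:
        # Assumption: stronger one-way ties are left to the valued analysis.
        return "AT_valued_tie"
    if (a, b) in follows and (b, a) in follows:
        return "FF_reciprocated_tie"
    if (a, b) in follows:
        return "FF_non_reciprocated_tie"
    return "no_tie"


# Hypothetical toy data.
follows = {("a", "b"), ("b", "a"), ("a", "c"), ("d", "e"), ("e", "d")}
mentions = {("a", "b"): 1, ("b", "a"): 2, ("c", "d"): 1}

for pair in [("a", "b"), ("c", "d"), ("d", "e"), ("a", "c"), ("b", "c")]:
    print(pair, strongest_tie(*pair, follows, mentions))
```

Because the checks run from strongest to weakest, a pair that holds both an interaction tie and a follower tie is always attributed to the interaction tie, which is exactly the simplifying assumption stated above.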

In order to compute which kinds of ties were most successful at transmitting retweets, I take the number of ties of a given TYPE (e.g. ff_reciprocated_ties) that had retweets flow through them and divide it by the number of ties of the same type that had no retweets (e.g. ff_reciprocated_ties between people where no retweet was exchanged). So if I have a total of 10.000 reciprocated ties, and a retweet took place over 2000 of them while no retweets were transmitted over the remaining 8000, the ratio for this type of tie is 2000 / 8000 = 0.25.
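
The ratio described above can be written down in a few lines of Python. This is a sketch with hypothetical function names and input format (one `(tie_type, retweet_count)` pair per tie), not the original analysis code.

```python
from collections import defaultdict


def retweet_ratio(ties_with_retweets, ties_without_retweets):
    """Ties that carried at least one retweet divided by ties that carried none."""
    if ties_without_retweets == 0:
        return float("inf")
    return ties_with_retweets / ties_without_retweets


def ratios_by_tie_type(ties):
    """Compute the ratio per tie type from (tie_type, retweet_count) pairs."""
    with_rt = defaultdict(int)
    without_rt = defaultdict(int)
    for tie_type, retweet_count in ties:
        if retweet_count > 0:
            with_rt[tie_type] += 1
        else:
            without_rt[tie_type] += 1
    tie_types = set(with_rt) | set(without_rt)
    return {t: retweet_ratio(with_rt[t], without_rt[t]) for t in tie_types}


# The worked example from the text: 2000 ties with retweets, 8000 without.
print(retweet_ratio(2000, 8000))  # prints: 0.25
```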


I have summarized the results in the table below. The std. deviation reports the deviation in the different retweet ties that belong to a certain edge type. (In the case of no_tie we have no data for no retweets because here we would have to count all the ties that are not present, which seems a bit unrealistic, given the structure of social networks)

As you can see in the table I have first of all differentiated if a tie belongs to a bridging tie or a bonding tie. Remember that bonding ties are between people who hold the same interest while bridging ties are between people who belong to different groups and thus share different interests.

No ties

As you can see, first of all there are a number of retweets that have taken place between people despite those people not holding any ties. In the case of bridging ties we see a few more retweets than in the case of bonding ties. Yet given the total of almost 660.000 retweets, the approximately 73.000 retweets that took place without a tie are only around 10% of the total information diffusion. (So my apologies: the blog post on the importance of no ties overstated their importance, given this new interpretation.)

Friend and follower ties

What is more interesting are the friend and follower ties. We can see that in both cases holding a reciprocated tie with a person results in a higher chance of getting retweeted by this person. However, for the bonding ties this chance is almost 4 times as high, while for the bridging ties our chances improve by less than 10%. When we compare the bonding with the bridging ties we clearly see that the reciprocated bonding ties have a roughly 10 times higher chance of leading to a retweet than the bridging ties. This is very interesting. So despite the fact that bridging ties are of course important because they lead to a diffusion of information outside of the interest group, they are much more difficult to activate than ties between people who share the same interest. From my point of view this fact shows exactly the weakness of weak ties. By weak ties I mean the bridging ties that link different topic interest communities together. We see not only that the weaker the tie, the lower the chance of it carrying a retweet, but also that if the tie is a bridging tie the chances drop significantly.

Additionally we can also see that the reciprocated friend and follower ties correspond to the majority of the bandwidth of information exchanged. This is also an interesting fact since the stronger the ties get the higher the chance of obtaining a retweet through this tie, but at the same time the total amount of retweets flowing through these ties drops dramatically (we will also see this when we take a look at the valued at-interaction ties). Just by adding up the numbers we see that almost 3/4ths of all retweets inside the group have flown through the reciprocated friend and follower ties. So although those ties have only a ratio of 0.8 of retweets / no retweets they are the ties that are mostly responsible for the whole information diffusion inside the group.

Interaction ties

When we analyze the interaction ties we find a similar pattern. We see that the bonding ties have a much higher chance of resulting in a retweet than their bridging counterparts, although the difference is not as dramatic. In general we also notice that the reciprocated at_ties have the higher chance of leading to retweets. Actually the ratio is higher than one for the reciprocated bonding ties. By the definition of the ratio above, this means that more of these ties carried a retweet than did not. From a tie “maintenance perspective” it would seem smart to maintain such ties with your followers, because on average they lead to the highest “earnings” in retweets. We shouldn’t jump the gun here though, because up till now we have analyzed the rather “weak” ties. Why weak? Well, having had a reciprocated conversation with a person is great, but having received 10 or 50 @replies from that person is definitely a stronger tie, and might lead to a higher chance of getting retweeted by this person.

Valued ties

If we look at the valued ties we could replicate the table above and go through each tie strength separately, but it’s more fun to do this in a graphical way. I have therefore plotted the tie strength between two persons on the X-axis and the ratio (ties that had retweets flow through this type of tie / same type of ties that had no retweet) on the Y-axis (make sure to click on the graphic to see it in full resolution).

So what do we see? Well, first of all the red line marks the ratio of 1; above it, more ties of this type carried retweets than did not. Anything above one is awesome ;). You also notice that there is quite a lot of variance in the retweets, which is indicated by the error bars (std. deviation). As the ties get stronger I would say that the standard deviation also gets higher (due to higher but fewer values in the retweets).

Bridging ties vs. bonding ties

What we notice is that both the bridging and the bonding ties tend to have a higher chance of retweets flowing through them the stronger they get. I would say this holds up to a certain point, maybe a strength of 40? After this the curve starts to fluctuate so much that we can’t really tell whether this behavior looks like this simply by chance (notice the high error bars). What we also see is that the bridging ties clearly have a lower chance of resulting in retweets than their bonding counterparts (compare the green curve with the blue one). This is an observation that we have also made before. So again here it is, the weakness of weak ties. Weaker ties lead to a lower chance of resulting in retweets, and the typical weak bridging ties are also much harder to activate than their bonding counterparts. What is not shown in this graph is the total number of retweets that have flown through those strong ties. Those are ~ 29000 retweets for bridging ties and ~ 37000 for bonding ties. Compared to the other tie types this is only a fraction of the total of exchanged retweets. Yet these strong ties in comparison have a very high chance of leading to retweets, sometimes with ratios higher than 3 (i.e. three times as many of these ties carried retweets as did not).

Well that was it for today. I will update this blog post tomorrow with the reverse direction of ties, where I will have a look at the influence of outgoing ties on the incoming retweets. But don’t expect any surprises ;). Plus I will post the code that I used to generate this type of analysis.



Problems when working on (kind of) big data to create networks between people

There is an abundant discussion about big data, including about its definition (e.g. http://whatsthebigdata.com/2012/06/06/a-very-short-history-of-big-data/). For me big data is when the data becomes so big that you need to shard your databases and create distributed solutions for computationally heavy routines on multiple machines, e.g. using Mahout, Pig or some other map/reduce approach (http://de.wikipedia.org/wiki/MapReduce).

In comparison to big data, my data is rather small (20.000 Twitter users, 50 million tweets, and ~ 50 million x 100 retweets). It fits on one machine and yet creates a lot of problems when dealing with it. I thought I’d write up some of the solutions I have found when approaching these social-network-specific data problems.

Generating, Storing and analyzing networks between people

One of the key routines of my work is extracting networks among people. The easiest networks are the friend and follower connections; storing and retrieving those is a problem of its own (which I will cover in another blog post). I will show you why storing ~ 100.000 ties per person in a MySQL database is a bad idea.

Solution one: Generating @-Networks from Tweets

The next relevant ties are the @-connections, which correspond to one person mentioning another person in a tweet. These ties are more interesting since they indicate a stronger relationship between people. But extracting them is also a bit harder. Why? Well, if we have 20.000 persons among whom we want to create a network of @-mentions, this also means that we have at most 20.000 x 3200 tweets in our database (3200 being the maximum number of tweets we can extract for a person using the Twitter API). In my case this means around 50 million tweets, where each tweet has to be searched for the occurrence of one of the 20.000 usernames. This leads to algorithm #1:

Suppose that project holds our 20.000 people, among whom we want to analyze the network. In usernames we store the names that we want to match each tweet against. The algorithm is simple: we read the tweets of every person and check:

  • Does the tweet mention one of the other 20.000 persons?
  • Does the tweet not contain “RT” (e.g. “RT @user have you seen xyz”)?
  • Has the tweet not been retweeted by others? Here we assume that @conversations are tweets that are not retweeted but mention another user.

If the criteria are met we add an edge in the form [From, To, strength] to our network, which we store in values. Each mention has a strength of one. At the end we aggregate those ties by summing the strengths of all edges between the same pair of users. The result is a network containing the @interactions. Great. But we have a problem, which is the time that it takes to compute this. Why? Well, I’ve created a toy sample to show you. It contains 57 people and ~ 120.000 tweets with up to 100 retweets for each tweet. The time it takes to generate the network between them is almost 32 seconds.
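
The three checks and the aggregation step can be sketched as follows. This is a plain Python illustration, not the original implementation; the tweet dictionaries with the keys `author`, `text` and `retweet_count` are a hypothetical data layout, and matching is case-sensitive for simplicity.

```python
from collections import Counter


def tokens_of(text):
    """Split a tweet on whitespace and strip trailing punctuation from tokens."""
    return {token.strip(".,:;!?") for token in text.split()}


def mention_edges(tweets, usernames):
    """Return a Counter mapping (author, mentioned_user) to the tie strength."""
    edges = Counter()
    for tweet in tweets:
        tokens = tokens_of(tweet["text"])
        if "RT" in tokens:               # skip retweets like "RT @user ..."
            continue
        if tweet["retweet_count"] > 0:   # conversations are assumed not retweeted
            continue
        # Inner loop over all usernames: this is the expensive part.
        for name in usernames:
            if name != tweet["author"] and "@" + name in tokens:
                edges[(tweet["author"], name)] += 1
    return edges


# Hypothetical toy data.
usernames = ["anna", "ben", "carl"]
tweets = [
    {"author": "anna", "text": "@ben nice run today!", "retweet_count": 0},
    {"author": "anna", "text": "thanks @ben, see you", "retweet_count": 0},
    {"author": "ben", "text": "RT @anna nice run today", "retweet_count": 0},
    {"author": "ben", "text": "@anna have you seen this?", "retweet_count": 4},
    {"author": "carl", "text": "@anna hello", "retweet_count": 0},
]

edges = mention_edges(tweets, usernames)
print(dict(edges))
```

The work grows with the number of tweets times the number of usernames, which is why the approach that is fine for 57 people becomes painful for 20.000.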

This looks good, but if we start to match each tweet against 20.000 people instead of 57 people our performance goes down drastically, from around 0.5 seconds per person to almost 60-80 seconds per person. If we now extrapolate from this: (60 seconds/person * 20.000 persons) / (3600 * 24 seconds/day) ≈ 14 days! It will take around two weeks to generate this rather small network of 20k people, plus we can never be sure that this process won’t crash because we have used up all the memory of the machine. What to do?

Solution two: Use multiple workers to get the job done

I have mentioned delayed_job (https://github.com/collectiveidea/delayed_job) before; it is a great gem for creating tons of small jobs which can then be processed in parallel by a multitude of workers. We will create a job for each person, write the results of each job to a csv file and then at the end aggregate the results of all jobs. This results in algorithm #2:

I’ve created three methods. The first one creates the jobs, one for each person. The second one aggregates the jobs’ results and is called when all jobs have been processed. The last one is the actual job itself, which is very similar to algorithm #1 except that it saves the output to a csv file instead of an array in memory. This approach is kind of similar to map/reduce, since we compute the networks for each person in parallel (map) and then aggregate the results (reduce). Additionally I use a method that queries the db periodically to see if the delayed jobs have finished their work:
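
The job-per-person pattern can be sketched in Python as well. To be clear about the assumptions: this is not delayed_job, a thread pool from the standard library stands in for the worker processes, the function names are made up, and the “RT” and retweet filters from algorithm #1 are left out to keep the sketch short.

```python
import csv
import os
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def mentions_job(person, tweets, usernames, out_dir):
    """One job: write the @-mention edges of a single person to a csv file."""
    path = os.path.join(out_dir, person + ".csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for text in tweets:
            tokens = {token.strip(".,:;!?") for token in text.split()}
            for name in usernames:
                if name != person and "@" + name in tokens:
                    writer.writerow([person, name, 1])
    return path


def aggregate(paths):
    """Sum up the edge strengths found in all job result files."""
    edges = Counter()
    for path in paths:
        with open(path, newline="") as handle:
            for source, target, strength in csv.reader(handle):
                edges[(source, target)] += int(strength)
    return edges


def build_network(tweets_by_person, workers=4):
    """Create one job per person, wait for all of them, then aggregate."""
    usernames = list(tweets_by_person)
    with tempfile.TemporaryDirectory() as out_dir:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(mentions_job, person, tweets, usernames, out_dir)
                for person, tweets in tweets_by_person.items()
            ]
            paths = [future.result() for future in futures]
        return aggregate(paths)


# Hypothetical toy data.
tweets_by_person = {
    "anna": ["@ben hi", "@ben again"],
    "ben": ["@anna hello"],
    "carl": [],
}
network = build_network(tweets_by_person)
print(dict(network))
```

Threads keep the sketch simple and portable; for real CPU-bound matching a process pool or a job queue with separate worker processes would be the closer equivalent.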

What about the results? For the toy network the jobs finish in around 21 seconds. We have improved quite a lot, but how about the 20.000-person network? Well, sadly the performance did not improve much because the bottleneck is still the same: each job has to go through each person’s tweets and find the ones that contain the username. Thus despite now being able to use multiple cores we are stuck with the db bottleneck. What to do?

Solution three: Use Lucene / Solr, an enterprise solution for indexed full-text search

To solve the problem of the slow lookup time, we will use a full-fledged search engine called Lucene (http://de.wikipedia.org/wiki/Lucene), which is accessed through the Java-based Solr servlet (http://en.wikipedia.org/wiki/Solr). Since we want to use it in Rails we will additionally use the sunspot gem (http://sunspot.github.com/), which makes things even more elegant. OK, what is this about? Well, basically we add a server that indexes the tweets in the database and provides an ultra-fast search on this corpus. To make our tweets searchable we have to add this description to the model to tell Solr what to index:

In this case we want to index the tweet text and all of the corresponding retweet ids. After this all that is left is to start the Solr server (after you have installed the gems etc.) with rake sunspot:solr:start and do a full reindexing of our tweets with rake sunspot:solr:reindex. This might take a while, even up to a day if your database is big. Once we are done we can use the third algorithm:

It is similar to the ones we have seen before, yet different in that we are no longer using two nested loops. Instead, for each person we fetch the tweets that mention this person with a full-text search for “@person.username”, which returns all the tweets in which this person was mentioned with an at sign. Then for these we double-check that the author of the tweet is not the same person (a loop) and that the tweets don’t include “RT” and have no retweets. If they match these criteria we create a tie as before, and as before we aggregate these ties at the end. What about the performance of this algorithm? For the toy project it finishes in around 2 seconds. And for the 20.000-person network I’ve displayed some of the per-person times below:

As you can see, even when we are analyzing 20.000 people at the same time, per person we get results that are often under one second and go up to 10 seconds in peaks, when the person has been mentioned a lot and we need time to filter those results. One final thing: I’ve noticed that the standard tokenizer in Solr strips the @ sign from the tokens, which is why for the search engine the terms “@facebook” and “facebook” mean the same (see my post on stackoverflow http://stackoverflow.com/questions/11153155/solr-sunspot-excact-search-for-words). But in this case I actually care about this difference: in the former the person is addressing the @facebook account on Twitter, while in the latter the person might only be saying something about facebook without addressing this particular account. So if we change the tokenizer to the whitespace tokenizer, which doesn’t remove these @ signs, we are actually able to search for both.
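
To illustrate why an index helps and why the tokenizer matters, here is a tiny inverted index in plain Python. It is only a stand-in for what Lucene/Solr does at scale and the tweets are made up; the point is that splitting on whitespace keeps “@facebook” and “facebook” as different tokens, and that a lookup is a single dictionary access instead of a scan over all tweets.

```python
from collections import defaultdict


def build_index(tweets):
    """Map each whitespace-separated token to the ids of tweets containing it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for token in text.lower().split():
            # Strip trailing punctuation but keep a leading @ sign.
            index[token.strip(".,:;!?")].add(tweet_id)
    return index


# Hypothetical toy corpus.
tweets = {
    1: "@facebook please fix the login",
    2: "I deleted facebook yesterday",
    3: "thanks @facebook!",
}

index = build_index(tweets)
print(sorted(index["@facebook"]))  # prints: [1, 3]
print(sorted(index["facebook"]))   # prints: [2]
```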


Well that is it for today. The lesson: it matters a lot how you represent and store your data. In some cases you might end up with terrible performance and wait weeks for the result, while with slight changes you might speed up the process by 10x or 100x. Although big data focuses on algorithms that run on big data sets and on distributed computations, in the end it might often be easier to shrink the big data into small data by aggregating it in some form and then processing it. Or to use what we already have, namely smart solutions such as Lucene for existing problems like text search. The most important outcome of this approach is that you gain the flexibility to experiment with your data, re-running experiments or algorithms and seeing what happens, instead of waiting for two weeks. Plus you can rely on already existing solutions instead of creating your own ones, which might often be buggy.

P.P.S. I am planning to write a similar story on how I decided to store the friendship ties in a key-value store instead of a relational database, and then finally write about how I processed the resulting networks with networkX.


Tie strength in Twitter

Is it the case that the more strong ties a group has, the more information gets diffused between its members? Well, according to Granovetter, who says that information among people with strong ties tends to diffuse faster, this should be the case. But if we want to study this phenomenon on Twitter we have to come up with a definition of what a strong tie is. I have come up with at least four definitions using the follow relationship and the @reply relationship.

  • Weak ties: A follows B, B does not follow A. This is the cheapest tie on Twitter, where a person simply follows another person.
  • Weak-Strong ties: A follows B, B does not follow A but @replies A. In this case A followed B and B greeted A so at least acknowledged A’s existence.
  • Strong ties: A follows B, B follows A. This tie is reciprocated, so it would satisfy some definitions of a strong tie.
  • Strongest ties: A @replies B, B @replies A. These are the strongest ties, since both interact at least once with each other.

Beyond that we can try to compute an “average tie strength” between a number of people by summing all of the tie strengths, which are counted by how many @replies were exchanged, and then calculating an average for a group. So for example if the group consists of 3 people A, B, C, where A @replies B 3 times and B @replies C 4 times, the average is (3 + 4) / 2 = 3.5 (strengths added up / number of ties). To do this with networkX is pretty easy, given you have a graph (D) which holds the tie strengths in “weight”. This measure is somewhat problematic though, as I can imagine cases where among 100 people no-one talks to each other apart from two persons exchanging 100 @replies. Averaged over all 100 members the tie strength would be 1 then (and averaged per tie it would be far higher), although 98 people never interact.
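
As a minimal illustration of the per-tie average, here is a plain Python version that does not depend on networkX. The dictionary of weighted ties is a hypothetical stand-in for the edge weights of the graph D.

```python
def average_tie_strength(weights):
    """Average number of @replies per tie.

    weights: dict mapping (sender, receiver) to the number of @replies.
    """
    if not weights:
        return 0.0
    return sum(weights.values()) / len(weights)


# The worked example from the text: A @replies B 3 times, B @replies C 4 times.
example = {("A", "B"): 3, ("B", "C"): 4}
print(average_tie_strength(example))  # prints: 3.5
```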

If we want to be rather conservative about tie strength we can use the reciprocated ties definition. Computing it with networkX is similarly easy. We can call this measure reciprocity, as it measures the proportion of reciprocated ties among all ties. See http://www.faculty.ucr.edu/~hanneman/nettext/C8_Embedding.html. UCINET has a similar routine under Network>Network Properties>Reciprocity. I don’t know why networkX does not have one.
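
A reciprocity measure along these lines can be sketched in a few lines of plain Python. Note the assumption: this is the arc-based variant (share of directed ties whose reverse tie also exists); a dyad-based variant would count pairs instead. As far as I know, newer NetworkX releases also ship a built-in `overall_reciprocity` function that could be used instead.

```python
def reciprocity(edges):
    """Share of directed ties (u, v) for which the reverse tie (v, u) exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    reciprocated = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return reciprocated / len(edge_set)


# Hypothetical follower graph: A and B follow each other, the rest is one-way.
ff_edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
print(reciprocity(ff_edges))  # prints: 0.5
```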

Given that we have three different graphs:

  • The FF graph, holding the following relationships
  • The AT (@) graph, holding the interaction relationships
  • And the RT graph, holding the actual diffusion between people

If we input the FF graph into reciprocity, we find out how many of the follower relationships are reciprocated (see above “strong ties”). If we enter the AT graph we get the “strongest ties” (see above). So we end up having at least three operationalizable definitions of strong ties (reciprocated ties in FF, reciprocated ties in AT, and average tie strength measured by the average interaction inside the group). We will now see whether, for 100 groups, the more strong ties a group has, the more information is diffused inside the group.

To measure the information diffusion I will use two measures. One is the density of the RT network: the higher the density, the more information diffusion ties there are between those people. The second measure is the total volume of information exchanged in the group. We get it by adding up all the retweet ties with their respective weights for a group and then dividing by the number of people in the group (see the method total_edge_weight above, divided by the number of nodes in RT). So for example if our network had only two nodes A and B, and A retweeted B 3 times, the total volume of information exchanged would be 3 / 2 = 1.5, and the density would be 0.5.
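Both diffusion measures can be sketched as follows. The function name diffusion_measures and the toy graph are assumptions for illustration; the only requirement is a directed retweet graph with the retweet counts stored in “weight”.

```python
import networkx as nx

def diffusion_measures(RT, weight="weight"):
    """Return (density, volume per person) of a retweet graph."""
    n_people = RT.number_of_nodes()
    if n_people == 0:
        return 0.0, 0.0
    # density: ties present divided by ties possible (directed)
    density = nx.density(RT)
    # volume: all retweets exchanged, divided by the group size
    volume = RT.size(weight=weight) / n_people
    return density, volume

# Two-node example from the text: A retweeted B 3 times
RT_toy = nx.DiGraph()
RT_toy.add_edge("A", "B", weight=3)
density, volume = diffusion_measures(RT_toy)
```

For the two-node example this reproduces the numbers from the text: one of two possible directed ties is present, and 3 retweets are spread over 2 people.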

Using simple regression you get the following results:

The covariance between FF_reciprocity and AT_reciprocity is not shown.
Result of Regression in SPSS

So in general by counting how many strong ties the group has we can explain about 22% of the variance in the diffusion, as measured by density. If we do the same regression and measure the diffusion by volume, we get the following result:

So the strong ties defined by AT_reciprocity do not seem to contribute to the explanation of the volume, and we can explain barely 8% of the variance. I will maybe have to rethink my measure of information diffusion as measured in volume. It might suffer from the fact that the average volume seems high for a group, but is only produced by a small number of people who retweet each other all the time. I will create some histograms of the RT volume for each group to see what is going on.
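A quick way to check this suspicion, even before drawing histograms, is to ask what share of a group's retweet volume is carried by its few heaviest ties. The sketch below does that; top_tie_share, the parameter k and the toy numbers are invented for illustration and are not taken from the data.

```python
import networkx as nx

def top_tie_share(RT, k=1, weight="weight"):
    """Share of the total retweet volume carried by the k heaviest ties."""
    weights = sorted(
        (d.get(weight, 0) for _, _, d in RT.edges(data=True)),
        reverse=True,
    )
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(weights[:k]) / total

# Made-up group: A and B retweet each other constantly, the rest barely
RT_toy = nx.DiGraph()
RT_toy.add_weighted_edges_from(
    [("A", "B", 50), ("B", "A", 45), ("C", "D", 3), ("D", "E", 2)]
)
share = top_tie_share(RT_toy, k=2)  # (50 + 45) / 100
```

A share close to 1 for a small k would suggest that the volume measure mostly reflects a handful of people rather than the group as a whole.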



The strength of ties (revisited)

Some of you have seen the very interesting article from the Facebook Data team about the echo chamber of social networks. Slate magazine has a nice review of this recent Bakshy and Adamic paper. They come to an illustrative conclusion for the case that you had 100 weak ties and 10 strong ties:

“The amount of information spread due to weak and strong ties would be 100*0.15 = 15, and 10*0.50 = 5 respectively, so in total, people would end up sharing more from their weak tie friends.”

So I thought this would be an interesting thing to revisit on Twitter. Being quite new to networkX, I thought it might also make an interesting research example.

Having computed communities of topological specialists (see last post), I use them as my data basis. So I have a community of 100 people who have very often been tagged with the word “publishing”. The edgelists have been precomputed in Ruby by analyzing the 3200 tweets of each person (and their retweets) and saved to disk in an edgelist format.

import networkx as nx

AT = nx.read_edgelist('%s_AT.edgelist' % project_name1, nodetype=str, data=(('weight', float),), create_using=nx.DiGraph())
RT = nx.read_edgelist('%s_RT.edgelist' % project_name1, nodetype=str, data=(('weight', float),), create_using=nx.DiGraph())

The AT network is the network where edges between Twitter users correspond to @replies of users that are not retweets. It serves as a proxy for tie strength. So if a person addresses another person 5 times, their tie strength is 5.

The RT network is the network where edges between Twitter users correspond to retweets of users' material. If I retweet somebody 10 times (time is not important here), this tie has the bandwidth, or simply strength, of 10.

So the question is: if I look at all of the AT ties that have the strength 1, how many retweets were transmitted through those ties? If you do this consistently for every given tie strength, you will come up with a chart of how many ties of a certain strength are out there, and how much was transferred over those ties.
The code that does this in networkX is the following:

import math

# How much information do the ties carry according to their strength?
result = []
# Tie strengths to look at (1 to 20 @replies)
thresholds = list(range(1, 21))
for threshold in thresholds:
    at_edges = []
    # adjacency() replaces adjacency_iter(), which was removed in networkX 2.0
    for n, nbrs in AT.adjacency():
        for nbr, eattr in nbrs.items():
            if eattr['weight'] == threshold:  # the tie has exactly this strength
                at_edges.append((n, nbr, eattr['weight']))  # tuple of from_node, to_node, strength

    rt_edges = []
    for edge in at_edges:
        try:
            # if the same pair of nodes exists in the RT graph, capture how many retweets were exchanged
            value = RT[edge[0]][edge[1]]['weight']
        except KeyError:
            value = 0
        if value > 0:
            rt_edges.append((edge[0], edge[1], value))  # retweets were exchanged between these actors
    # save: number of AT ties, sum of the retweets over those ties, tie strength
    result.append([len(at_edges), math.fsum(x[2] for x in rt_edges), threshold])

You end up with an array that holds, for each tie strength, the total number of AT ties and the total number of retweets. You can plot this using matplotlib:

import matplotlib.pyplot as plt
plt.plot(thresholds, [at[0] for at in result], 'b-', label='# of AT ties with strength x')
plt.plot(thresholds, [at[1] for at in result], 'g-', label='# of retweets flowing through these ties')
plt.legend()  # show the labels defined above

So if it is true that

“The amount of information spread due to weak and strong ties would be 100*0.15 = 15, and 10*0.50 = 5 respectively, so in total, people would end up sharing more from their weak tie friends.”

then it is indeed the case that the stronger the ties get, the fewer of them we have.
But we should then also see that the majority of retweets are transmitted over ties with a strength of 1. Yet the majority of retweets gets transmitted through ties with a strength of 2 and 3.

I think this is interesting. I will now run this on 100 of those communities and plot the average. What do you think about this graph and the approach? Is it valid?

Bonus Update

I also thought there needs to be a tie strength lower than 1, which for me is a retweet happening without there having been any AT interaction before.

import numpy as np

# Get the "0" strength ties: a retweet happened although there is no AT tie (in either direction)
AT_undir = nx.Graph(AT)  # we make it undirected to search for @replies in both directions
rt_0_edges = []
for n, nbrs in RT.adjacency():  # adjacency() replaces adjacency_iter(), removed in networkX 2.0
    for nbr, eattr in nbrs.items():
        if not AT_undir.has_edge(n, nbr):
            # no @reply between this pair: the retweets flowed over a "0" tie
            rt_0_edges.append((n, nbr, eattr['weight']))

# Insert our results into the array as the first data point: [# of AT ties, # of retweets, tie strength]
result.insert(0, [0, math.fsum(x[2] for x in rt_0_edges), 0])
strengths = [r[2] for r in result]  # x axis, now starting at strength 0
rt_counts = np.array([r[1] for r in result], dtype="float32")
at_counts = np.array([r[0] for r in result], dtype="float32")
# Retweets per AT tie; left as NaN where there are no AT ties to divide by (strength 0)
percentages = np.divide(rt_counts, at_counts, out=np.full_like(rt_counts, np.nan), where=at_counts > 0)

# Plot it
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(strengths, at_counts, 'b-', label='# of AT ties with strength x')
ax1.plot(strengths, rt_counts, 'g-', label='# of retweets flowing through these ties')
ax2 = ax1.twinx()
ax2.plot(strengths, percentages, 'r-', label='% of #RT/#AT')

So if we add this to the graphic we end up with another picture:

So what we see right now looks more like the “strength of NO ties”. (Would make a nice paper title :)) People retweet each other even if they have had no interaction before. It will be interesting to explore if this assumption holds for all of the other 100 datasets.

P.S. I thought about adding another tie strength, e.g. 0.5, which would correspond to a person FOLLOWING the other person. This is definitely less than writing an @reply to this person but more than nothing. (We all know following each other on Twitter is cheap.)
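The idea from the P.S. could be sketched like this. Everything here is hypothetical: the helper tie_strength, the constant FOLLOW_STRENGTH and the toy graphs are only meant to show how the FF graph could fill the gap between “no tie” and one @reply.

```python
import networkx as nx

FOLLOW_STRENGTH = 0.5  # assumed value from the P.S., not an empirical estimate

def tie_strength(u, v, AT, FF, weight="weight"):
    """Strength of the tie u -> v: @reply count, else 0.5 for a follow, else 0."""
    if AT.has_edge(u, v):
        return AT[u][v].get(weight, 1)
    if FF.has_edge(u, v):
        return FOLLOW_STRENGTH
    return 0

# Toy data: A replied to B 5 times; A follows B and C
AT_toy = nx.DiGraph()
AT_toy.add_edge("A", "B", weight=5)
FF_toy = nx.DiGraph()
FF_toy.add_edges_from([("A", "B"), ("A", "C")])

strength_ab = tie_strength("A", "B", AT_toy, FF_toy)  # @replies win over the follow
strength_ac = tie_strength("A", "C", AT_toy, FF_toy)  # follow only
strength_ca = tie_strength("C", "A", AT_toy, FF_toy)  # nothing in this direction
```

The sketch treats ties as directed. To look for @replies or follows in either direction, as in the bonus update, undirected copies of both graphs could be passed in instead.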