Some of you have probably read the very popular TechCrunch article on “The rise of interest based social network”, covering services such as Pinterest, Instagram, Thumb, Foodspotting and so on. While this seems like a new phenomenon, I think that this kind of interest-based social network has existed for a long time on Twitter. There are many theories about why people form ties on Twitter, such as following big stars (preferential attachment), local proximity, or closing triangles (a friend of a friend), but the most obvious reason is actually provided by Twitter itself: “Follow your interests”. At the time of writing this blog article I re-checked Twitter and saw that it now says “Find out what’s happening, right now, with the people and organizations you care about.”
As you can see on the left in the picture, we suspect that there are a number of interest groups whose members mainly interact with each other and don’t care so much about other interest groups. It seems only natural that the principle of homophily (people like people who are similar to them) fosters such groups. To investigate such interest-based social networks on Twitter, I gathered 100 groups of 100 people each, where every group is based on a particular interest captured by a keyword.
To gather the people that best represent a given interest or keyword, I used the Twitter list feature, which shows for each person the topics under which that person is listed. Using a rather sophisticated approach (I will cover it in another blog post), I made sure to collect those people who are highly listed for a given keyword.
For each person I captured the number of tweets the person wrote and the number of retweets those tweets received. This corpus of 100 communities of 100 persons each allows us to study “topic based social capital”, which can be defined on a number of levels.
This data allows us to analyze whether there is such a thing as “topic based social capital”: the social capital acquired by individuals or groups that are centered on a certain topic. On the individual level it means that the more embedded (or central) I am in such a group, the higher the chances that I am important for this group and have some sort of influence on its members. When talking about influence here, I simply measure the number of retweets that a person received from this group. On the group level we can think of “topic based social capital” as a group feature: the better connected the group is and the more interaction takes place between its members, the higher the chances that those people actually exchange more information with each other. This is measured by the total number of retweets exchanged between the members of the group. We will start the investigation with the group-level version of social capital and cover the others later.
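To make the individual-level measure a bit more concrete, here is a minimal sketch of counting the retweets each person received from inside their group. The function name `retweets_received`, the `(retweeter, author, count)` tuple layout and the sample names are assumptions made for illustration; they are not the format of the actual dataset.

```python
from collections import Counter


def retweets_received(retweet_edges):
    """Sum up, per author, the retweets they received from group members."""
    received = Counter()
    for retweeter, author, count in retweet_edges:
        received[author] += count
    return received


# Hypothetical in-group retweets: (who retweeted, whose tweet, how often).
sample_edges = [
    ("bob", "alice", 3),
    ("carol", "alice", 1),
    ("alice", "bob", 2),
]

influence = retweets_received(sample_edges)
# List people from most to least retweeted inside the group.
for person, count in influence.most_common():
    print(person, count)
```

In network terms this is the weighted in-degree of a person in the retweet network, under the assumption that ties point from the retweeter to the original author.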
In the overview we see that each community is based on a certain keyword and contains 100 persons, along with quite a number of tweets and retweets. In total we are analyzing approximately 10,000 persons who have produced somewhere around 24 million tweets and 95 million retweets. I stored this data in a database so that I can access those communities at any time.
The first thing I noticed from an aggregate view is that sorting the communities by the number of retweets they managed to produce gives interesting results. (Note that in this table the number of retweets is the total number of retweets, not the retweets produced inside their own community. We will come to this later.) Groups like celebrities, news, musicians, comedy and politicians managed to spark the highest number of retweets. That is not very surprising, since we all know what popular TV and magazines are made of. In this sense Twitter only reflects what we are used to seeing every day. Sorting by the ratio of retweets per tweet gives the same kind of ordering, meaning that those categories were the most successful at sparking retweets. If we took into account the number of followers these communities have, we might end up with a different result though (we will also cover that in another blog post). Back to the research question:
Regarding the social capital of a group, we are interested in whether a higher social capital goes along with more retweets flowing through the group's network.
To operationalize this question I used NetworkX to compute three kinds of networks for those communities:
- A follower network, which captures all friend and follower ties between those 100 persons.
- An interaction network, which captures all of the @replies between those persons.
- A retweet network, which captures all of the retweets that those persons have exchanged with each other.
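As a rough illustration of how such a network can be prepared, the sketch below counts directed ties inside one community and formats them as `source target weight` lines, the kind of three-column edgelist that NetworkX can read back with a `weight` attribute. The helper names, the member names and the sample events are hypothetical, and the direction of a tie (retweeter first, original author second) is an assumption.

```python
from collections import Counter


def weighted_edges(pairs, members):
    """Count directed (source, target) pairs that stay inside the community."""
    return Counter(
        (source, target)
        for source, target in pairs
        if source in members and target in members and source != target
    )


def to_edgelist_lines(edge_counts):
    """Format edges as 'source target weight' lines, one tie per line."""
    return [
        "%s %s %.1f" % (source, target, weight)
        for (source, target), weight in sorted(edge_counts.items())
    ]


# Hypothetical community and retweet events (retweeter, original author).
members = {"alice", "bob", "carol"}
retweet_events = [
    ("alice", "bob"), ("alice", "bob"), ("bob", "carol"),
    ("alice", "dave"),  # dave is outside the community, so this tie is dropped
]

rt_edges = weighted_edges(retweet_events, members)
rt_lines = to_edgelist_lines(rt_edges)
print("\n".join(rt_lines))
```

The same two helpers would work for follower ties and @replies; only the list of pairs changes.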
Now, to measure the social capital of the group, we can compute the density of either the follower network or the interaction network. These densities serve as a proxy for the social capital the group has as a whole: the denser the network, the more embedded those people are with each other and the higher the total social capital of the group. To measure the resulting information diffusion I also compute the density of the retweet network. The more retweets those people have exchanged with each other INSIDE the community, the more information diffusion took place.
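For a directed network with n people and m ties, density is m / (n * (n - 1)), the share of all possible ties that actually exist. The sketch below computes this by hand; the function name `directed_density` and the sample ties are made up for illustration. One thing that may be worth checking in an analysis like this: an edgelist only contains people who have at least one tie, so a network read back from an edgelist can have fewer nodes than the community has members, which pushes the density up. The optional `n_nodes` argument shows the difference.

```python
def directed_density(edges, n_nodes=None):
    """Density of a directed network: ties divided by n * (n - 1) possible ties."""
    # Self-loops are ignored in this sketch.
    ties = {(source, target) for source, target in edges if source != target}
    if n_nodes is None:
        # Fall back to the people that appear in at least one tie.
        n_nodes = len({person for tie in ties for person in tie})
    if n_nodes < 2:
        return 0.0
    return len(ties) / (n_nodes * (n_nodes - 1))


# Hypothetical ties between three people out of a community of 100.
ties = [("alice", "bob"), ("bob", "alice"), ("bob", "carol")]

density_seen = directed_density(ties)               # only people with ties count
density_full = directed_density(ties, n_nodes=100)  # all 100 members count
print(density_seen, density_full)
```

Since every community here has the same size, using the full membership as n keeps the densities comparable across communities.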
We have a “the MORE of something, the MORE of something else” relationship. Therefore a regression is a nice way of analyzing it.
To compute the densities for all of those communities I dumped the resulting networks in an edgelist format and then used NetworkX to compute the density of each network. I then saved them in CSV format to import them into a statistics program like SPSS (we could also use NumPy or SciPy to compute those regressions).
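    import networkx as nx

    # `communities` (the keyword names) and `csv_writer` (a csv.writer)
    # are assumed to be set up earlier in the script.
    for project in communities:
        print("")
        print("############ Calculating Project %s ############### " % project)
        print("")

        # Load the follower (FF), @reply (AT) and retweet (RT) networks
        # as weighted directed graphs.
        FF = nx.read_edgelist('data/%s_FF.edgelist' % project, nodetype=str,
                              data=(('weight', float),), create_using=nx.DiGraph())
        AT = nx.read_edgelist('data/%s_AT.edgelist' % project, nodetype=str,
                              data=(('weight', float),), create_using=nx.DiGraph())
        RT = nx.read_edgelist('data/%s_RT.edgelist' % project, nodetype=str,
                              data=(('weight', float),), create_using=nx.DiGraph())

        # Density: existing ties divided by the number of possible ties.
        FF_density = nx.density(FF)
        AT_density = nx.density(AT)
        RT_density = nx.density(RT)

        # One row per community: keyword plus the three densities.
        csv_writer.writerow([project, FF_density, AT_density, RT_density])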
We end up with a datafile like this:
The last step is to see which independent variable, the social capital measured by friend and follower ties or the social capital measured by @reply (AT) ties, better captures the information diffusion that is happening in the network.
For this task I ran a simple regression in SPSS, using the backward stepwise method to exclude factors. The model starts by including both the FF densities and the AT densities and then checks whether it makes sense to drop one predictor because it does not explain enough of the variance in the data.
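The idea behind dropping a predictor can be sketched without SPSS. The example below is not the SPSS backward procedure (which decides removal with a significance test); it swaps in a simpler comparison: it computes the standardized betas of the two-predictor model from the pairwise correlations and compares the explained variance (R²) of the full model with that of each single-predictor model. The function names and the five data points are invented for illustration and are NOT the real densities.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def compare_models(ff, at, rt):
    """Standardized betas of the full model and R^2 of full and reduced models."""
    r_ff, r_at, r_ff_at = pearson(ff, rt), pearson(at, rt), pearson(ff, at)
    shared = 1 - r_ff_at ** 2  # shrinks as the two predictors overlap more
    beta_ff = (r_ff - r_at * r_ff_at) / shared
    beta_at = (r_at - r_ff * r_ff_at) / shared
    return {
        "beta_ff": beta_ff,
        "beta_at": beta_at,
        "r2_full": beta_ff * r_ff + beta_at * r_at,
        "r2_ff_only": r_ff ** 2,
        "r2_at_only": r_at ** 2,
    }


# Made-up densities for five communities (NOT the real data).
ff_density = [0.10, 0.20, 0.30, 0.40, 0.50]
at_density = [0.02, 0.01, 0.04, 0.03, 0.06]
rt_density = [0.04, 0.02, 0.08, 0.06, 0.12]  # constructed as 2 * at_density

result = compare_models(ff_density, at_density, rt_density)
for name, value in sorted(result.items()):
    print("%-11s %.3f" % (name, value))
```

If leaving out one predictor barely lowers R² compared with the full model, that predictor adds little on its own, which is the situation a backward method is designed to detect when two predictors are strongly correlated.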
So we see since the ff-density and the at-density are quite correlated .742 p
Looking at the beta coefficients, we see that the group bonding social capital, as captured by the simple density of the interaction network, is able to explain 81% of the information diffusion in the group. We can thus conclude that the more “bonding” social capital the group possesses, as captured by the interactions its members are having, the more information is exchanged between the members of the group.
The conclusion is not much of a surprise, but we have seen an easy way of investigating this question.
If we look at the groups with the highest social capital, we see that other groups now lead. The groups with the highest social capital are programming languages such as “php”, “python” or “ruby”, some sports-based groups such as “motorcross”, and a German political party called the “piraten”. The “celebrities” group, for example, only ranks somewhere around 80th place. This means that celebrities are good at sparking a lot of retweets, but they don’t really care about each other as a group: they don’t interact much with each other and they don’t retweet each other as a community.
If you have any questions, comments or ideas about this approach let me know.