A short presentation I gave at the DGPUK 2014.
Abstract (translated from German): The public sphere created by classic mass media such as television, radio, and newspapers has undergone profound change in recent years. The digitization of distribution channels has not only multiplied the available offerings but has also enabled new forms of use (e.g. time-shifted and mobile television). While these developments suggest an increasing segmentation of the audience, the digitization of follow-up communication has also created new ways of connecting these partial publics. For television, follow-up communication, which used to be confined to the family, the workplace or school, or the regulars' table, now takes place via Facebook, Twitter, and dedicated social TV applications. The remarkable number of tweets sent during TV broadcasts such as the American Super Bowl, the Oscars, or the German chancellor debate has recently attracted broader attention. Alongside the established currency of reach, experts as well as the broader public seem increasingly interested in new forms of participation and are looking for metrics that quantify the activity of the audience. So far, however, it remains unclear whether this constitutes a parallel currency that is determined solely by reach, or whether new aspects of TV use are being expressed. This research project takes up this open question and examines the relationship between audience ratings and audience activity on social media channels.
It’s been a while since the last update. But here is the gist: I finished my PhD and am now working at the University of Zürich and at the IaKom Institute. I will try to provide interesting insights from my new work.
Thinking about social capital can be a bit tricky because it has been reviewed from so many perspectives and dimensions. In this blog post I will introduce the individual and group perspectives on social capital and the bonding and bridging dimensions. I will then show how Stephen Borgatti unites these views in a matrix (here is his paper on it) and then try to extend his ideas.
Generally, the first problem I had with social capital is that it can be found and analyzed at different hierarchical levels: at an individual and a collective level, and at a micro, meso, and macro level. At the micro level, relations between individuals, households, or neighborhoods are analyzed. The meso level contains municipalities, institutions, and organizations. The macro level deals with whole regions and nations.
This multi-level view leads to confusion: theorists debate whether social capital is a private good, in which individuals invest by forming relationships so they can access the resources of others, or a public good, such that everybody belonging to a social group with social capital may enjoy its benefits. While Putnam’s work (Bowling Alone), for example, describes social capital as a quality of groups or societies, Burt’s and Lin’s work (Brokerage and Closure) describes it as a property of individuals. It is thus often unclear whether social capital is a good of an individual or of a group.
The second problem is that some authors highlight the bonding nature of social capital while others focus on its bridging attributes. Much of the discussion around these two perspectives of bonding and bridging social capital has already been captured in the views of Coleman and Burt (Brokerage and Closure), who highlighted different sources of social capital.
While in Coleman’s view social capital mostly results from internal group closure, in Burt’s view it results from exactly the opposite group mechanic, namely structural holes. While closure corresponds to creating ties with group members, structural holes theory corresponds to creating ties to members outside the group. So what is social capital? Well, we can either decide that we don’t like this theory anymore, or try to somehow combine these dimensions into one framework. This is what Stephen Borgatti did.
Borgatti’s Social Capital Matrix
Borgatti says that the discussion on social capital mostly suffers from divergent perspectives on the group concept. In most cases the group “has been implicitly conceived as a universe, nothing outside the group is considered.”[p.3] Adler and Kwon in their review come to similar insights on theoretical views on social capital, stating that “external ties at a given level of analysis become internal ties at the higher levels of analysis and conversely, internal ties become external at the lower levels”[p.35].
Consequently, we have to accept the duality of bonding and bridging social capital and see groups as actors embedded in their own social environments. The logical conclusion is to combine the individual vs. group dimension with the inside (bonding) vs. outside (bridging) dimension. This can be demonstrated nicely with the example of a work department: looking at the group level internally, one would analyze the working relationships among the members of the department; looking at the individual level, one would analyze the individual ties of members of the group. Looking beyond the department, one would analyze the relationships that the department has with other departments. In network terms, an internal view looks at the group’s relationships within the group, while an external view looks at the structure of the group’s relationships to outsiders. This leads to the four-fold classification matrix suggested by Borgatti.
Social Capital Matrix of Borgatti
Using Borgatti’s original interpretation, quadrant A should be left empty, since there is no way to analyze an individual’s internal ties; this would correspond to analyzing the inner workings of an individual, such as the networks formed in their brain. I argue that the internal focus might also be used to describe an individual’s internal-group focus. This means defining internal focus as a focus inside the group and external focus as a focus outside of the group. Using this logic, the social capital that can be harvested in quadrant A depends on the individual’s ties within the group. This type of thinking corresponds to popular concepts postulated e.g. in the works of Coleman, where bonding social capital in school classes benefits weak pupils.
Quadrant B corresponds to the individual social capital a person acquires by maintaining external ties outside of the group. This view has predominantly been described in the works of Ronald Burt, who showed that managers benefit from brokerage between departments. Quadrant C describes the collective-good idea of social capital as put forward by Putnam, who treats social capital as a public good (e.g. the more the people of a country are connected, the better for everybody). Finally, quadrant D describes the potential social capital that a whole group can acquire by maintaining ties to other groups. Mainly due to the survey-based data collection methods available in the past, this concept has rarely been operationalized and only marginally been explored in the social capital literature.
If we now take this matrix a bit further and think of different network “zoom levels”, we can create a sort of recursive definition of Borgatti’s matrix, where quadrant D at the inner level becomes quadrant A in the next version of the matrix when we “zoom out”. I have depicted this concept in the figure below and illustrated it with a company example. (Note that I have rotated the original matrix by 90°.) At the highest zoom level (the top of the figure) the individual is the person, and the department is the group.
Extended social capital matrix, with three different zoom levels on the example of a company.
The top left cell describes individual bonding social capital and deals with the position of the person in the department. The top right cell describes the group’s bonding social capital as a whole. The lower left cell describes how an individual creates bridging social capital by connecting two departments. Finally the lower right cell describes how the department as a whole creates bridging social capital, by being centrally connected to other departments in the company.
This brings us to the next “zoom level”, where the department becomes the individual unit of analysis and the company becomes the group concept, i.e. the boundary of the system. At this zoom level we study how different departments are connected with each other inside the company.
If we zoom out again, the whole company becomes the unit of analysis and, for example, the country becomes the group concept. This brings us to studies that examine how different companies connect with each other and how this benefits them from a social capital perspective. Finally, if we were to zoom out once more, the whole country would become the unit of analysis, bringing us to studies on a global level that analyze, for example, how different countries trade with each other.
As a conclusion, I think the social capital matrix is very handy when we try to conceptualize a network perspective on things. It helps to unite the bridging and bonding dimensions and the individual and group concepts of social capital, and it reminds us that the “group” concept repeats over and over on higher levels, while the questions remain the same.
If you liked that post please forward it on Twitter, reblog it or leave a comment. I would love to hear what you think about it.
On Twitter we have the situation that the network between users is multiplex (people can hold numerous ties with each other): users can a) follow each other, b) interact with each other, or c) retweet each other. The three types of ties manifest themselves in three different networks that can be laid on top of each other. This idea got me thinking. I stumbled upon a very interesting book chapter by Stephen Borgatti, who introduced a network flow model that in my eyes fits the Twitter network perfectly. The network model from his paper is depicted below:
Borgatti describes the model as consisting of two kinds of phenomena, called backcloth and traffic in the original work of Atkin. By adapting this model for Twitter we can explain how and why the three types of ties we have on Twitter can be laid on top of each other and how they influence information diffusion on Twitter. I have therefore made a version that shows how the concepts map onto Twitter, depicted below.
The backcloth is the infrastructure that enables the traffic and the traffic consists of information flowing through the network. In the case of Twitter the backcloth corresponds to the cognitive similarities among Twitter users (see below), and to their friend and follower connections. The traffic layer consists of the interactions and flow of information that takes place on top of these phenomena.
Borgatti describes the four categories as follows: “The similarities category refers to physical proximity, co-membership in social categories and sharing of behaviors, attitudes and beliefs. Generally we do not see these items as social ties, but we do often see them as increasing the probabilities of certain relations and dyadic events.” This definition corresponds to the notion of implicit ties (ties we cannot directly see) between Twitter users: an implicit tie can be a shared interest, a shared location, a shared demographic, a shared audience, and so on; basically, every type of attribute that makes two Twitter users similar to each other. The idea that similar people tend to flock together is known as homophily, and the general process of people forming ties with similar people is called the selection mechanism (e.g. think of people who smoke becoming friends with other smokers).
The next three types of phenomena take place on so-called explicit ties, because these types of ties can actually be seen and measured explicitly on Twitter. Borgatti defines the social relations category as “the classic kinds of social ties that are ubiquitous [as friend and follower ties on Twitter are] and serve either as role-based or cognitive/affective ties. Role-based includes kinships and role-relations such as boss of, teacher of and friend of. [On Twitter: follower of.] They can easily be non-symmetric [which friend and follower ties are].” It is apparent that these characteristics relate exactly to the explicit follower ties, which share the same attributes. When we think about reasons other than similarity why people become friends, we stumble upon all the network effects that are a core part of the network literature. I have therefore indicated those with the back-and-forth arrow above social relations. On Twitter these processes take place every day: people follow prominent outlets, e.g. CNN (preferential attachment), become friends with friends of friends (triadic closure), or simply follow back a person who just followed them (reciprocity). There are many more such effects, but we won’t go into detail here and instead look at the next type of ties.
Borgatti describes the interactions category as “discrete and separate events that may occur frequently but then stop, such as talking with, fighting with, or having lunch with”. This category translates into the interactional (@mention) ties on Twitter, which have exactly these behavioral traits: people intentionally mention each other in tweets, but also might stop doing so for certain reasons. Depending on when one looks at two users on Twitter, this interactional connection might exist at that point in time or not. The first reason, according to the network flow model, why I would interact with someone is that I follow them, which makes perfect sense for Twitter. Now, there are more reasons why people might interact with each other, and a number of those are already covered in various information diffusion theories. One example is that people like to interact with others whom they perceive as opinion leaders for a topic. Another example is brokerage theory, which says that brokers tend to profit from interaction with two different groups. The third family are the threshold models, where people are believed to be lured into interaction or adoption once a certain threshold of their friends talks about a certain topic. Processes like this could easily be taking place on Twitter too.
Finally, the flows category is described by Borgatti as “things such as resources, information and diseases that move from node to node. They may transfer (being only at one place at a time) and duplicate (as in information).” This definition translates directly into the explicit retweet ties, which exist whenever information is transferred from one actor to another. The final network layer follows the same reasoning as the one before: the first reason why I would retweet someone is that I follow that person and have already interacted with them. The reasoning about information diffusion theories applies here too.
Finally, I thought it would be nice to add the influence mechanism to this model, which is basically people becoming more similar to each other because of the networks they already have. All three types of networks (friend and follower ties, @interactions, and retweets) might have that effect. The classic influence example, non-smokers who are friends with smokers starting to smoke themselves, might be imaginable on Twitter too. Yet there are strong indications that this effect is much smaller than people believe it to be.
Using the network flow model, we came up with a nice ordering of the different concepts that surround network science and sociology, and could connect these pieces to the Twitter network. I hope this extended network flow model was useful for you, and I hope to hear some comments on it.
Great blog entry for anyone working at the intersection of social science, networks and software development.
Graph theory and network science are two related academic fields that have found application in numerous commercial industries. The terms ‘graph’ and ‘network’ are synonymous and one or the other is favored depending on the domain of application. A Rosetta Stone of terminology is provided below to help ground the academic terms to familiar, real-world structures.
Graph theory is a branch of discrete mathematics concerned with proving theorems and developing algorithms for arbitrary graphs (e.g. random graphs, lattices, hierarchies). For example, can a graph with four vertices, seven edges, and structured according to the landmasses and bridges of Königsberg have its edges traversed once and only once? From such problems, the field of graph theory has developed numerous algorithms that can be applied to any graphical…
A few months ago I made a blog post (https://twitterresearcher.wordpress.com/2012/01/17/the-strength-of-ties-revisited/) investigating tie strengths on Twitter and their influence on retweets. Well, it turns out that my analysis was lacking a lot of detail, so I re-did it, considering more aspects than before. So let’s get started.
The data that I am using for this analysis is the following: each group of people consists of 100 people who have been highly listed for a given topic on Twitter, e.g. snowboarding or comedy or any other topical interest that people have on Twitter. There are 170 such groups, each consisting of exactly 100 members. (You can read how I created these groups in my recent blog posts here https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/ and here https://twitterresearcher.wordpress.com/2012/06/12/how-to-generate-interest-based-communities-part-2/) In an abstract way, you can imagine the structure of the network to look something like this:
The graphic above shows only the friend and follower ties between those people. But there are in fact quite a few more ties between them, resulting in a multiplex network. This network consists of three layers:
Schematically this looks something like this:
Now, when we think about ties between those people, especially in regard to tie strength, we can come up with a couple of different definitions of ties (I mentioned a couple of those in my blog post here: https://twitterresearcher.wordpress.com/2012/05/24/tie-strength-in-twitter/)
Bridging vs. bonding ties:
Having all those definitions of ties, we can now make a number of observations regarding the information diffusion between those people. The information diffusion is captured in the retweet network (see the third layer in the schematic graphic) and its corresponding ties. In general, we want to look at how the different tie types affect the information diffused (retweets) between those people.
To get an overview of the data, I will first look at how many retweets in total have been exchanged in the analyzed groups. I count how many retweets took place inside each group (blue) and between groups (red). Each of the 170 groups is shown below:
Approximately 214,000 retweets took place between groups (red) and 414,000 retweets took place inside the groups (blue). In the graphic above we can clearly see the differences between the different interest groups. I have ordered the groups in ascending order of retweets inside the community, which makes it easy to see that some groups focus mostly on retweets inside the group (e.g. tennis or astronomy_physics), while other groups mostly get retweets from outside of their own group and do not retweet each other so much internally (e.g. politics_news or liberal). Although we cannot clearly say whether the group has an influence on being retweeted from outside the group, we can say that the members of the group at least have the choice to retweet other members of the group. If these members do not retweet each other, there might be a reason, about which you are free to speculate (or which I will try to answer in the next blog post).
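The inside/between counting sketched above fits in a few lines of Python. The group map and retweet list here are made-up stand-ins for the real data, which obviously comes from the database:

```python
from collections import Counter

# Hypothetical data layout: which group each user belongs to,
# and one (source, retweeter) pair per observed retweet.
group_of = {"alice": "tennis", "bob": "tennis", "carol": "politics_news"}
retweets = [("alice", "bob"), ("alice", "carol"), ("bob", "alice")]

inside, between = Counter(), Counter()
for source, retweeter in retweets:
    if group_of[source] == group_of[retweeter]:
        inside[group_of[source]] += 1    # retweet stays within the group (blue)
    else:
        between[group_of[source]] += 1   # retweet crosses a group boundary (red)

print(dict(inside))   # {'tennis': 2}
print(dict(between))  # {'tennis': 1}
```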
Given the different types of ties described above we can now ask the most important question:
How do the different non-valued bridging ties differ from the bonding ties in regard to their influence on the information diffused through those ties?
What do I mean by that? Having all retweets between the persons in the sample, I want to find out through which ties these retweets have flowed. So, for example, given that A has retweeted B three times, I ask which of the ties that A and B already hold (in the friend and follower network or the interaction network) were “responsible” for this flow of information between those actors.
EXAMPLE: If two people have mentioned each other at least once, I will assume (according to the definition above) that they hold a reciprocated interaction tie. I will then assume that this tie was “responsible” for the retweet between them. NOTICE: This is a simplifying assumption, because I assume that if a stronger tie exists, it was always the one responsible for the retweet, not a possibly underlying weaker tie (such as a friend and follower tie).
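As a sketch of this attribution rule: given the set of tie types a pair of users holds, the retweet is credited to the strongest one. The ordering below is modeled after the tie types discussed in the text, but the exact hierarchy and names are my own assumption:

```python
# Hypothetical ordering from weakest to strongest tie type.
TIE_STRENGTH_ORDER = [
    "ff_oneway", "ff_reciprocated", "at_oneway", "at_reciprocated",
]

def responsible_tie(ties_between_pair):
    """Credit a retweet to the strongest tie the pair holds."""
    present = [t for t in TIE_STRENGTH_ORDER if t in ties_between_pair]
    return present[-1] if present else "no_tie"

# A and B follow each other, and one has mentioned the other once:
print(responsible_tie({"ff_reciprocated", "at_oneway"}))  # at_oneway
print(responsible_tie(set()))                             # no_tie
```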
The assumption that I make here is therefore:
In order to compute which kinds of ties were most successful at transmitting retweets, I compute, for each TYPE of tie (e.g. ff_reciprocated_ties), the ratio of ties of that type through which retweets have flowed to ties of the same type through which no retweets have flowed (e.g. ff_reciprocated_ties between people who exchanged no retweets). So if I have a total of 10,000 reciprocated ties, and over 2,000 of them a retweet took place while over the remaining 8,000 no retweets were transmitted, the ratio for this type of tie is 2,000/8,000 = 0.25.
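The ratio computation itself is a one-liner; a minimal sketch, using the 10,000-tie example from the text:

```python
def retweet_ratio(ties_with_retweets, ties_without_retweets):
    """Ratio of ties of one type that carried retweets to those that did not."""
    return ties_with_retweets / ties_without_retweets

# The example from the text: of 10,000 reciprocated ties,
# 2,000 carried at least one retweet and 8,000 carried none.
print(retweet_ratio(2000, 8000))  # 0.25
```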
I have summarized the results in the table below. The std. deviation reports the deviation in the retweet counts that belong to a certain edge type. (In the case of no_tie we have no data for “no retweets”, because we would have to count all the ties that are not present, which seems unrealistic given the structure of social networks.)
As you can see in the table, I have first of all differentiated whether a tie is a bridging tie or a bonding tie. Remember that bonding ties connect people who hold the same interest, while bridging ties connect people who belong to different groups and thus have different interests.
As you can see, first of all, a number of retweets took place between people despite those people not holding any ties at all. In the case of bridging ties we see a few more such retweets than in the case of bonding ties. Yet compared to the total of almost 660,000 retweets, the approximately 73,000 retweets that took place without a tie amount to only about 10% of the total information diffusion. (So my apologies: my earlier blog post on the importance of no ties was overstating their importance, given this new interpretation.)
Friend and follower ties
What is more interesting are the friend and follower ties. We can see that in both cases holding a reciprocated tie with a person results in a higher chance of getting retweeted by this person, although for the bonding ties this chance is almost 4 times as high, while for the bridging ties our chances improve by less than 10%. When we compare the bonding with the bridging ties, we clearly see that the reciprocated bonding ties have an order of magnitude higher chance of leading to a retweet than the bridging ties. This is very interesting. So despite the fact that bridging ties are important because they lead to a diffusion of information outside of the interest group, they are much more difficult to activate than ties between people who share the same interest. From my point of view, this shows exactly the weakness of weak ties. By weak ties I mean the bridging ties that link different topical interest communities together. We see that not only does the chance of a tie carrying a retweet drop the weaker it gets, but if the tie is a bridging tie, the chances drop significantly as well.
Additionally, we can see that the reciprocated friend and follower ties carry the majority of the bandwidth of information exchanged. This is also interesting: the stronger the ties get, the higher the chance of obtaining a retweet through them, but at the same time the total amount of retweets flowing through these ties drops dramatically (we will also see this when we take a look at the valued at-interaction ties). Just by adding up the numbers we see that almost three quarters of all retweets inside the group have flowed through the reciprocated friend and follower ties. So although these ties have a ratio of only 0.8 retweets / no retweets, they are the ties that are mostly responsible for the whole information diffusion inside the group.
When we analyze the interaction ties, we find a similar pattern. We see that the bonding ties have a much higher chance of resulting in a retweet than their bridging counterparts, although the difference is not as dramatic. In general, we also notice that the reciprocated at-ties have the higher chance of leading to retweets. The ratio is actually higher than one for the reciprocated bonding ties, which means that per tie we obtain more than one retweet. From a tie “maintenance” perspective it would seem smart to maintain such ties with your followers, because on average they lead to the highest “earnings” in retweets. But we shouldn’t jump the gun too early here, because up till now we have only analyzed the rather “weak” ties. Why weak? Well, having had one reciprocated conversation with a person is nice, but having received 10 or 50 @-replies from that person is definitely a stronger tie, and might lead to a higher chance of getting retweeted by this person.
If we look at the valued ties, we could replicate the table above and go through each tie strength separately, but it’s more fun to do this in a graphical way. I have therefore plotted the tie strength between two persons on the X-axis and the ratio (ties that had retweets flow through this type of tie / same type of ties that had no retweets) on the Y-axis (make sure to click on the graphic to see it in full resolution).
So what do we see? First of all, the red line marks the ratio of 1, above which a tie type carries more retweets than not. Anything above one is awesome ;). You also notice that there is quite a lot of variance in the retweets, indicated by the error bars (std deviation). As the ties get stronger, the standard deviation also increases, since there are fewer ties at high strengths and thus noisier estimates.
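For reference, a plot of this shape can be produced with matplotlib. The curves below are placeholders standing in for the measured per-strength ratios and standard deviations, not the real data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

strength = np.arange(1, 51)      # valued tie strength (x-axis)
# Placeholder curves: ratios rising with strength, bonding above bridging.
bonding = 0.05 * strength
bridging = 0.03 * strength
spread = 0.02 * strength         # std deviation grows with tie strength

plt.errorbar(strength, bonding, yerr=spread, label="bonding ties")
plt.errorbar(strength, bridging, yerr=spread, label="bridging ties")
plt.axhline(1.0, color="red")    # ratio of 1: as many ties with retweets as without
plt.xlabel("tie strength")
plt.ylabel("retweets / no-retweets ratio")
plt.legend()
plt.savefig("tie_strength_ratio.png")
```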
Bridging ties vs. bonding ties
What we notice is that for both the bridging and the bonding ties, the chance of retweets flowing through a tie tends to increase the stronger the tie gets. I would say this holds up to a certain point, maybe a strength of 40? After this the curve starts to fluctuate so much that we can’t really tell whether the behavior arises simply by chance (notice the high error bars). We also see that the bridging ties clearly have a lower chance of resulting in retweets than their bonding counterparts (compare the green curve with the blue one), an observation we have made before. So again, here it is: the weakness of weak ties. Weaker ties have a lower chance of resulting in retweets, and the typically weak bridging ties are also much harder to activate than their bonding counterparts. What is not shown in this graph is the total number of retweets that have flowed through those strong ties: ~29,000 retweets for bridging ties and ~37,000 for bonding ties. Compared to the other tie types this is only a fraction of the total of exchanged retweets. Yet these strong ties have a very high chance of leading to retweets, sometimes with ratios higher than 3 (i.e. three times as many ties of this type carried retweets as carried none).
Well, that was it for today. I will update this blog post with the reverse direction of ties tomorrow, where I will look at the influence of outgoing ties on the incoming retweets. But don’t expect any surprises ;). Plus I will post the code that I used to generate this type of analysis.
A lot of recommendation algorithms these days suffer from the so-called cold-start problem. Usually this problem is tackled by having the user fill out some initial forms or provide some initial ratings, e.g. for movies, in order to give the algorithm something to work on. Another idea is to use what is already out there, namely the information encoded in the friend and follower graph on Twitter.
I thought it would be fun to use my recent corpus of 16,000 Twitter users (who have been categorized by how people list them using the list feature) to determine what an arbitrary user is interested in. If this user follows one of these people, he might also be interested in the area they represent (see the schematic figure below). The approach is really quite simple: collect all the friend edges of a user, go through them, and see if we can find each friend in our pre-tagged set of users. The more users we find from one category, the more this user seems to be interested in that topic.
Below is all that is needed to perform this user interest aggregation:
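A minimal sketch of the idea looks like this. The category map and friend list are made-up stand-ins for the pre-tagged corpus and the Twitter API call:

```python
from collections import Counter

# Hypothetical pre-tagged corpus: username -> interest category.
category_of = {
    "cnn": "news", "bbcworld": "news", "kdnuggets": "data_science",
}

def infer_interests(friend_names):
    """Count how many of a user's friends fall into each tagged category."""
    counts = Counter(
        category_of[f] for f in friend_names if f in category_of
    )
    return counts.most_common()  # categories sorted by match count

print(infer_interests(["cnn", "bbcworld", "kdnuggets", "some_random_user"]))
# [('news', 2), ('data_science', 1)]
```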
The final partitions file in the code is simply the output of a task I performed in my last blog post. I think the results of this very simple idea are quite satisfactory. But see for yourself. I have pre-computed the results for some people that I follow and am thinking of putting this online somewhere so you can also check for yourself. Below is the sample YAML output for the user zephoria (danah boyd). The second number next to each person in a category lists how highly this person is ranked in that category.
Here are some shortened results (omitting the individual persons) of people I follow on twitter . If you like, you can tell me in the comments how well this approach actually captured your interests.
Again, I’d like to note that in order to find out about a user’s interests using this method, there is no need to study their tweets. Their friend ties already reveal quite a lot. The first couple of interests are often not that surprising, but some of the later interests reveal things about people that I was not aware of.
There is an abundant discussion about big data, including about its definition (e.g. http://whatsthebigdata.com/2012/06/06/a-very-short-history-of-big-data/). I would say that, for me, data becomes “big” when it is so big that you need to shard your databases and distribute computationally heavy routines across multiple machines, e.g. using Mahout, Pig, or some other map/reduce approach (http://de.wikipedia.org/wiki/MapReduce).
In comparison to big data, my data is rather small (20,000 Twitter users, 50 million tweets, and ~50 million × 100 retweets). It fits on one machine, and yet it creates a lot of problems when dealing with it. I thought I’d write up some of the solutions I have found when approaching these social-network-specific data problems.
One of the key routines of my work is extracting networks between people. The easiest network is the friend and follower connections; storing and retrieving those is a problem of its own (which I will cover in another blog post). I will show you why storing ~100,000 ties per person in a MySQL database is a bad idea.
The next relevant ties are the @-connections, which correspond to one person mentioning another in a tweet. These ties are more interesting since they indicate a stronger relationship between people, but extracting them is also a bit harder. Why? Well, if we want to build a network of @-mentions between 20,000 people, we have at most 20,000 x 3,200 tweets in our database (3,200 being the maximum number of tweets we can fetch per person using the Twitter API). That is around 50 million tweets, and each tweet has to be searched for occurrences of any of the 20,000 usernames. This leads to algorithm #1:
Suppose that project holds the 20,000 people whose network we want to analyze, and usernames stores the names we want to match each tweet against. The algorithm is simple: we read the tweets of every person and check each one.
If the criteria are met, we add an edge of the form [from, to, strength] to our network, which we store in values. Each mention has a strength of one. At the end we aggregate these ties, summing the values of ties that share the same pair of users. The result is a network of @-interactions. Great. But we have a problem: the time it takes to compute this. Why? I've created a toy sample to show you. It contains 57 people and ~120,000 tweets, with up to 100 retweets per tweet. Generating the network between them takes almost 32 seconds.
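A minimal, self-contained sketch of algorithm #1 (the names project, usernames and values follow the description above; the data is a hypothetical stand-in for the database records):

```ruby
# Naive @-mention extraction (algorithm #1): scan every tweet of every
# person for every username we care about.

# Hypothetical stand-in for the database: author => tweets.
project = {
  "alice" => ["@bob nice post!", "RT @carol check this", "@bob @carol hi"],
  "bob"   => ["just woke up", "@alice thanks!"],
  "carol" => ["@carol talking to myself"]
}
usernames = project.keys

values = []
project.each do |author, tweets|
  tweets.each do |tweet|
    next if tweet.include?("RT")             # skip retweets
    usernames.each do |name|
      next if name == author                 # ignore self-mentions
      values << [author, name, 1] if tweet.include?("@#{name}")
    end
  end
end

# Aggregate ties that share the same pair of users.
network = values.group_by { |from, to, _| [from, to] }
                .map { |(from, to), ties| [from, to, ties.map(&:last).sum] }

network.each { |from, to, strength| puts "#{from} -> #{to}: #{strength}" }
```

The inner loop over all usernames for every single tweet is exactly what makes this approach so slow at scale.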
This looks acceptable, but if we match each tweet against 20,000 people instead of 57, performance drops drastically: from around 0.5 seconds per person to 60-80 seconds per person. Extrapolating, (60 seconds/person * 20,000 persons)/(3,600 * 24) gives roughly 10-15 days! It would take around two weeks to generate this rather small network of 20k people, and we can never be sure the process won't crash after using up all the memory of the machine. What to do?
I have already mentioned delayed_job (https://github.com/collectiveidea/delayed_job), a great gem for creating tons of small jobs that can then be processed in parallel by a multitude of workers. We will create a job for each person, write the results of each job to a CSV file, and at the end aggregate all the job results. This gives us algorithm #2:
I've created three methods. The first creates the jobs, one per person. The second aggregates the job results and is called once all jobs have been processed. The last is the job itself, which is very similar to algorithm #1 except that it saves its output to a CSV file instead of an in-memory array. This approach resembles map/reduce: we compute the network for each person in parallel and then aggregate the results. Additionally, I use a method that periodically queries the database to see whether the delayed jobs have finished their work:
What about the results? For the toy network the jobs finish in around 21 seconds. We have improved quite a lot, but what about the 20,000-person network? Sadly, the performance did not improve much, because the bottleneck is still the same: each job has to go through every person's tweets and find the ones containing the username. So despite now being able to use multiple cores, we are stuck with the database bottleneck. What to do?
To solve the problem of the slow lookup time, we will use a full-fledged search engine, Lucene (http://de.wikipedia.org/wiki/Lucene), accessed through the Solr Java servlet (http://en.wikipedia.org/wiki/Solr). Since we want to use it from Rails, we will additionally use the sunspot gem (http://sunspot.github.com/), which makes things even more elegant. What is this about? Basically, we add a server that indexes the tweets in the database and provides ultra-fast search over this corpus. To make our tweets searchable, we have to add a description to the model telling Solr what to index:
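The description could look roughly like this (a sketch of the sunspot `searchable` DSL; the field names `text` and `retweet_ids` are assumptions about the model):

```ruby
class Tweet < ActiveRecord::Base
  # Tell Solr (via sunspot) which fields to index:
  searchable do
    text :text                            # full-text index of the tweet body
    integer :retweet_ids, multiple: true  # all retweet ids of this tweet
  end
end
```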
In this case we want to index the tweet text and all of the corresponding retweet ids. After this, all that is left is to start the Solr server (after installing the gems etc.) with rake sunspot:solr:start and do a full reindexing of our tweets with rake sunspot:solr:reindex. This might take a while, even up to a day if your database is big. Once done, we can use the third algorithm:
It is similar to the previous ones, but different in that we no longer use two nested loops. Instead, for each person we fetch the tweets mentioning that person with a full-text search for "@person.username", which returns all tweets in which the person was mentioned with an @-sign. For each of these we then double-check that the author of the tweet is not the same person, that the tweet doesn't include "RT", and that it has no retweets. If a tweet matches these criteria, we create a tie as before, and we again aggregate the ties at the end. What about the performance of this algorithm? For the toy project it finishes in around 2 seconds. For the 20,000-person network I've displayed some of the per-person times below:
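To illustrate why this is so much faster, here is a stand-in sketch that replaces the Solr query with an in-memory inverted index (token -> tweets); the real code would instead ask sunspot for the tweets, roughly via a fulltext search for "@username". All data and filters below follow the description above but are made up:

```ruby
# Hypothetical stand-in data: tweets with author, text and retweet count.
TWEETS = [
  { author: "alice", text: "@bob nice post!", retweets: 0 },
  { author: "alice", text: "RT @carol check this", retweets: 0 },
  { author: "bob",   text: "@alice thanks!", retweets: 0 },
  { author: "carol", text: "@alice good point", retweets: 5 }
]

# Build the "search engine": an inverted index from token to tweets,
# using whitespace tokenization so "@bob" stays distinct from "bob".
INDEX = Hash.new { |h, k| h[k] = [] }
TWEETS.each { |t| t[:text].split.each { |token| INDEX[token] << t } }

usernames = %w[alice bob carol]
ties = Hash.new(0)
usernames.each do |name|
  # One index lookup replaces scanning all tweets: algorithm #3's core idea.
  INDEX["@#{name}"].each do |tweet|
    next if tweet[:author] == name        # ignore self-mentions
    next if tweet[:text].include?("RT")   # ignore retweet texts
    next if tweet[:retweets] > 0          # keep only tweets with no retweets
    ties[[tweet[:author], name]] += 1
  end
end
ties.each { |(from, to), s| puts "#{from} -> #{to}: #{s}" }
```

The work per person is now proportional to the number of mentions of that person, not to the total number of tweets.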
As you can see, even when analyzing 20,000 people simultaneously, the per-person times are often under one second, with peaks of up to 10 seconds when a person has been mentioned a lot and we need time to filter those results. One final thing: I've noticed that the standard tokenizer in Solr strips the @-sign from tokens, which is why the search engine treats "@facebook" and "facebook" as the same term (see my post on Stack Overflow: http://stackoverflow.com/questions/11153155/solr-sunspot-excact-search-for-words). But in this case I actually care about the difference: in the first case the person is addressing the @facebook account on Twitter, while in the latter the person might just be saying something about Facebook without addressing that particular account. If we change the tokenizer to the WhitespaceTokenizer, which doesn't remove these @-signs, we are able to search for both.
Well, that is it for today. The lesson: how you represent and store your data matters a lot. In some cases you might end up waiting weeks for a result, while slight changes can speed the process up by 10x or 100x. Although big data focuses on algorithms for big data sets and on distributed computation, it is often easier to shrink the big data into small data by aggregating it in some form and then processing that, or to reuse existing smart solutions, e.g. Lucene, for existing problems like text search. The most important outcome of this approach is the flexibility to experiment with your data: re-running experiments or algorithms and seeing what happens, instead of waiting for two weeks. Plus, you can trust existing, proven solutions instead of writing your own, which are often buggy.
P.S. I am planning to write a similar story about how I decided to store the friendship ties in a key-value store instead of a relational database, and finally about how I processed the resulting networks with NetworkX.
In last week's blog post (https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/) I described my way of collecting people on Twitter who are highly listed on lists for certain keywords such as swimming, running, perl, ruby and so on. I then sorted the persons in each category according to how often they were listed in it. This led to lists like the one below, showing people found on lists containing the word "actor".
We might call this a satisfactory result, because the list seems to contain people who are actually relevant to this keyword. But what about the persons collected for the keyword "hollywood"? Let's have a look:
If you look at the first persons, you notice that many of them are the same. Although in my previous attempts (https://twitterresearcher.wordpress.com/2012/04/16/5/ and https://twitterresearcher.wordpress.com/2012/03/16/a-net-of-words-a-high-level-ontology-for-twitter-tags/) I tried hard to detect keywords that are semantically related, such as "car" and "automotive", the list of user interests still ended up containing pairs like "actor" and "hollywood". What are we going to do about this problem? My solution is to merge these two lists into one, since they seem to cover the same interest. But how do I do this without having to subjectively decide on each pair of lists?
One idea is to calculate how often members of one list appear on other lists. Lists with a high overlap are then merged into one, the counts those people received are added up, and the new position on the merged list is determined by the new count. We need two parameters: the maximum number of persons to look at in each list (I simply called it MAX) and a threshold percentage of shared people that decides when to merge two lists. If we merge the lists "actor" and "hollywood" into "actor_hollywood", we also want to run this merged list against all remaining keywords such as "tvshows" and merge it with them too if the criteria are met, resulting in "actor_hollywood_tvshows". The result is a nice clustering of the members we found for our interests. Although these interests have different keywords, if they contain the same members they seem to capture the same semantic concept or user interest. The code to perform this is shown below:
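A condensed sketch of the merging logic, with MAX and the threshold as described above (the sample lists and counts are made up):

```ruby
MAX = 3          # how many top persons per list to compare
THRESHOLD = 0.2  # share of common members needed to merge two lists

# Hypothetical keyword => ranked [person, count] pairs.
lists = {
  "actor"     => [["aplusk", 900], ["tomhanks", 800], ["JimCarrey", 700]],
  "hollywood" => [["aplusk", 400], ["JimCarrey", 300], ["megan", 200]],
  "swimming"  => [["phelps", 500], ["lochte", 400], ["coughlin", 300]]
}

# Fraction of the top MAX members that two lists share.
def overlap(a, b)
  names_a = a.first(MAX).map(&:first)
  names_b = b.first(MAX).map(&:first)
  (names_a & names_b).size.to_f / MAX
end

merged = true
while merged
  merged = false
  lists.keys.combination(2).each do |k1, k2|
    next unless overlap(lists[k1], lists[k2]) >= THRESHOLD
    # Merge: add up the counts of shared members, re-sort by the new count.
    counts = Hash.new(0)
    (lists[k1] + lists[k2]).each { |name, c| counts[name] += c }
    lists["#{k1}_#{k2}"] = counts.sort_by { |_, c| -c }.first(MAX)
    lists.delete(k1)
    lists.delete(k2)
    merged = true
    break   # restart the pass, since the key set changed
  end
end

puts lists.keys.inspect
```

Restarting after every merge is what lets "actor_hollywood" be compared against the remaining keywords again, producing chains like "actor_hollywood_tvshows".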
For further processing, the code also records which concepts were merged into which keys, and makes sure that if we merge 200 people from one list with 200 from another, we only keep the first 200 of the resulting list.
What does the result look like? I've displayed the resulting merged categories using a threshold of 0.1 and checking the first 1000 places for overlap.
Below you see the final output, where I used a threshold of 0.2 and looked at only the first 200 users in each list. Regarding the final number of communities there is a trade-off: when the threshold is set too low, we end up with "big" user interest areas in which lots of nodes are clumped together; when it is too high, groups that obviously should be united (e.g. "theater" and "theatre") won't be merged. I have had good experiences with a threshold of 0.2, which means that groups sharing 20% of their members are merged into one.
The results of the above attempts are not bad, but they can be improved. Why? Imagine your name was in the actors category, which got merged with drama, hollywood and tv_shows, and you ended up in 154th place there. Not bad, but people may actually think of you as more of a "theatre" guy, which is why you rank 20th in the theatre category. Although a person can belong to multiple interest groups, if I had to choose the one that best represents you, I would place you in the theatre category, because you rank 20th there while only ranking 154th in the actor category.
So this means I am comparing the rankings you achieved in each category. I could instead compare the total number of votes you received on each list. If I did that, you would end up in the actor category, because the total number of lists for that category is much higher than for theatre, and the 200 votes received by somebody in 154th place in the actor category exceed the 50 votes received by the same person in 20th place in the theatre category. I have chosen the ranking method because it is more stable with regard to this problem: popular interests do not "outweigh" the more specific ones, and if a person can be placed in a specific category, it should be the specific one and not the popular one. The code below does exactly this. Additionally, it notes for each person how often that person was also part of other categories, but assigns the person to the category where they achieved the higher place.
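The assignment rule can be sketched like this (the rankings are made up for illustration; each person goes to the category where they rank highest, and we note how many other categories competed for them):

```ruby
# Hypothetical category => ranked usernames (index 0 = rank 1).
categories = {
  "actor"    => ["aplusk", "tomhanks", "someguy"],
  "theatre"  => ["someguy", "stagefan"],
  "yoga"     => ["yogi1", "DalaiLama"],
  "buddhism" => ["DalaiLama"]
}

# Collect every rank a person achieved across categories.
ranks = Hash.new { |h, k| h[k] = {} }
categories.each do |cat, members|
  members.each_with_index { |name, i| ranks[name][cat] = i + 1 }
end

# Assign each person to the category with the best (lowest) rank,
# noting the number of competing categories.
assignment = {}
ranks.each do |name, by_cat|
  best_cat = by_cat.min_by { |_, rank| rank }.first
  assignment[name] = { category: best_cat, competing: by_cat.size - 1 }
end

assignment.each { |n, a| puts "#{n}: #{a[:category]} (#{a[:competing]} competing)" }
```

With this data, "someguy" moves from 3rd place in actor to his 1st place in theatre, and "DalaiLama" moves from yoga to buddhism, mirroring the switches described below.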
There is also a small array called final_candidates that is used to put exactly 100 persons in each category at the end. What does the output look like? In most cases it leaves persons in the same category, but sometimes people actually switch categories. These are the interesting cases. I have filtered the output in Excel and sorted it by the number of competing categories to showcase some of them. You notice that e.g. "DalaiLama" started in the "yoga" category but, according to our algorithm (or actually the people's votes), fitted better into "buddhism", or that "NASA" started in "tech" but was moved to "astronomy", which seems even more fitting.
To give an idea of how often this switching took place, I created a simple pivot table listing the average number of competing categories per category (see below). For the majority of categories, their people don't compete for other categories (right side of the chart), but for a handful of categories they do (the peaks on the left). You also notice that the lower the threshold, the smaller the final groups, but these groups have a lower average competing count (e.g. compare the violet line, size 1000, threshold 0.1, with the green line, size 1000, threshold 0.2). You also see that considering only the first 200 places instead of the first 1000 actually gives better results (compare the violet line with the red line). This is a bit counter-intuitive, since I assumed that the more people we take into consideration, the better the results. It rather turns out that beyond a certain point this voting mechanism gets "blurrier": people voted into, say, 345th place somewhere don't really matter that much, but eventually they cause categories to be merged that shouldn't have been.
No matter which threshold and size we use, a couple of groups always seem "problematic" (the high peaks on the left of the chart): it seems hard to decide where their people belong. Below I have provided an excerpt for group size 200 and threshold 0.2. The people in these categories are really hard to "pin" down to a single interest.
For the rest of the groups we get very stable results. These interest groups seem well defined, and their members are not thought to belong to other categories:
For these remaining interest groups we will now take a look at their internal structure, examining, for example, whether opinion leaders (people who are very central in the group) manage to get a lot of retweets (or not). Additionally, we will look at people positioned between different groups (e.g. the programming languages ruby and perl) who act as brokers or "boundary spanners", and whether these people are able to get retweets from both communities, only one, or none at all. For questions like these, the interest groups provide an interesting data source.
In this blog post I want to talk about how to find people on Twitter who are interested in the "same things". I have posted a number of entries about this before.
Today I want to go through the process of using the approximately 200 different keywords representing user interests (e.g. swimming, running, ruby, php, jazz, career, hunting, islam and so on) to find all the users who contribute strongly to these topics, i.e. those forming the interest-based community.
To capture the collective knowledge of Twitter I will make use of Twitter's list feature, shown below:
As you can see, I am listed on a number of lists such as SNA, social media, dataviz and so on. These lists have been created by people to organize Twitter users into categories, similar to book lists on Amazon.
Having scraped the first 100 people for each of the 200 keywords in the last blog post (https://twitterresearcher.wordpress.com/2012/02/17/how-to-make-sense-out-of-twitter-tags/) and stored them in the database, I will use these people to find more lists featuring similar people for each keyword. Why? Mainly because Twitter doesn't let you search for lists with a certain name, and because alternatives such as wefollow.com, twellow.com or listorious.com do not give you all the lists for a given search term. That is why I have to snowball through Twitter lists and keep those relevant to a given topic. This process consists of three parts:
This process is shown in the figure below:
How do we collect lists? We start by checking whether we have enough Twitter API calls left; if so, we collect the list memberships for a given user and keep paging until there are no more lists the user appears on. Twitter can be a bit sensitive to the page size: it can be 1,000 items at most, but in practice it is around 200-400 items before we get timeouts, which is why the function adapts to this dynamically. Also, collecting more than, say, 10,000 lists for a given user does not make much sense, since we would probably be wasting our API calls on a celebrity like aplusk. Once all the lists the user appears on have been collected, I store them in a CSV file and, most importantly, persist in my database only those lists whose names contain the keyword the person was originally collected for. For example, if the seed user was in the category "swimming", I keep only those lists that include the keyword swimming. Additionally, I make sure that a list I have already encountered is not added to my database twice.
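The paging loop could be sketched like this, with the real Twitter client replaced by an injected fetch_page lambda (everything here is a stand-in; the actual code checks the rate limit and calls the list-memberships endpoint):

```ruby
require "timeout"

# Collect all list memberships for a user by cursor-based paging,
# shrinking the page size when a request times out.
def collect_memberships(fetch_page, max_lists: 10_000)
  lists = []
  cursor = -1        # Twitter-style: -1 starts paging, 0 means done
  page_size = 1000
  until cursor.zero? || lists.size >= max_lists
    begin
      page, cursor = fetch_page.call(cursor, page_size)
      lists.concat(page)
    rescue Timeout::Error
      page_size = [page_size / 2, 100].max  # back off to smaller pages
    end
  end
  lists
end

# Fake endpoint: serves 2 lists per call, then a terminating cursor of 0.
all_lists = ["swimming stars", "ruby devs", "swimming pros", "runners",
             "swimming 101", "misc"]
fetch_page = lambda do |cursor, _size|
  start = cursor == -1 ? 0 : cursor
  next_cursor = start + 2 >= all_lists.size ? 0 : start + 2
  [all_lists[start, 2], next_cursor]
end

lists = collect_memberships(fetch_page)
# Keep only lists whose name contains the seed keyword, without duplicates.
kept = lists.select { |l| l.include?("swimming") }.uniq
puts kept.inspect
```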
Once I have collected tons of lists matching the category keywords, I collect all the members listed on these lists. The code below is run for every list in the database. As you can see, I make sure there are enough API calls left and then start collecting all the members of the list. For this I am using delayed_job, a nice Ruby library (https://github.com/tobi/delayed_job) that allows me to wrap time-consuming tasks in neat jobs that can be run later or by multiple workers on multiple machines. I have had good experiences running around 10-15 workers on a single machine to process these jobs in the background. At the end of this step we end up with projects, each containing a high number of people who seem relevant to a user interest because they have been listed on lists for exactly this interest.
After steps 1 and 2 we have a number of potential candidates relevant to a given topic, but we are only interested in those who represent the user interest the most. That is why we need a procedure that ranks these people according to how often they have been listed for a certain topic. How do we do that? For each topic we have a number of lists, each of which names the people its creator considers relevant to that topic. If we go through all the lists for a given topic and count how often each person is listed, we should end up with the most relevant users for that topic. (As we will see later in part two, this process gives some nice results, but its accuracy can be greatly improved.) So what does the code below do?
This function is run on the projects containing the people collected for a certain topic. First it loads all the persons into memory for faster computation. It then goes through all the lists collected for the topic and checks again that each list matches the topic keyword. If it does, it checks that we have not encountered this list before (which should not happen, since we made sure not to add lists twice during insertion, but double-checking won't hurt). If the list has members, it collects their usernames into an array and checks whether among the seen_membersets (simply the collections of usernames seen so far) there is a set containing exactly the same members. Why? Because there is list spam on Twitter: people or bots copy lists and save them under a different name. So if no existing list has the same members (99% of cases) we actually analyze the list; otherwise we drop it, because it duplicates a list we have already counted. For each list member we check whether it is among the persons we are computing the list count for; if so, we increment that person's counter. If there is a person on the list we have somehow not captured before, we add them to our pool of persons and raise their counter as well. At the end of this procedure we have a list count for every person who appeared on these lists and can directly see how relevant each person is to a certain topic. We output the sorted list counts into a simple CSV, to be used later in part 2 to improve our accuracy.
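The counting step boils down to something like this (the lists are made up; note the duplicate member set being dropped as list spam):

```ruby
# Hypothetical lists collected for the topic "actor".
topic_lists = [
  { name: "best actor list",  members: %w[aplusk tomhanks JimCarrey] },
  { name: "actor copy spam",  members: %w[aplusk tomhanks JimCarrey] }, # spam copy
  { name: "actor favourites", members: %w[tomhanks newcomer] }
]

counts = Hash.new(0)
seen_membersets = []

topic_lists.each do |list|
  next unless list[:name].include?("actor")    # keyword check
  memberset = list[:members].sort
  next if seen_membersets.include?(memberset)  # drop list-spam duplicates
  seen_membersets << memberset
  # Previously unseen people simply get their first count here.
  list[:members].each { |name| counts[name] += 1 }
end

ranking = counts.sort_by { |_, c| -c }
ranking.each { |name, c| puts "#{name}: #{c}" }
```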
To show the result of this process, I have pasted a small part of the output of step 3 below. As you can see, it seems that aplusk, JimCarrey, tomhanks and so on are the most relevant Twitter users for the community of actors. This list contains ~18,000 entries; people towards the end of the list represent the actor community much less than the people at the beginning.
If we now take, say, the top 100-200 people from each of these lists (assuming that people cannot manage relationships to more than this number of others, following the Dunbar or Wellman number: http://en.wikipedia.org/wiki/Dunbar%27s_number), we end up with interest-based communities of people on Twitter who share an interest.
Studying those communities is what I am trying to do in my work, but more on that later. These communities are also interesting for advertisers: imagine someone who wants to sell swimwear. This retailer would be highly interested if you could show him all the people on Twitter who are interested in swimming; those people could be his first customer group. If his swimwear is approved by these people, it is very likely that they will talk about it and thus inform other swimming enthusiasts about the product.
In part two I will show you how we can improve these communities by allowing people to move from one community to another if they "fit" better there.