There is an abundant discussion about big data, also on the definition of it (e.g. http://whatsthebigdata.com/2012/06/06/a-very-short-history-of-big-data/). I would say for me big data is when I the data becomes so big that you need to shard your databases and create distributed solutions to computational heavy routines on multiple machines e.g. using mahout, pig or some other map/reduce approach http://de.wikipedia.org/wiki/MapReduce.
In comparison to big data, my data is rather small (20.000 Twitter Users, 50 Mio. Tweets, and ~ 50 Mio x 100 Retweets). It fits on one machine and yet creates a lot of problems when dealing with it. I thought I’d write-up some of the solutions I have found when approaching these Social Network specific data problems.
One of they key routines of my work is extracting networks among people. The easiest network are the friend and follower connections storing and retrieving those is a problem of its own (which I will cover in another blog post). I will show you why storing ~ 100.000 ties per person in a Mysql database is a bad idea.
The next relevant ties are the @-connections. Which correspond to one person mentioning another person in a tweet. These ties are more interesting since they indicate a stronger relationship between people. But extracting them is also a bit harder. Why? Well, if we have 20.000 Persons that we want to create a network of @-mentions in between, this also means that we have max 20.000 x 3200 (3200 being the maximum number of tweets we can extract for a person using the Twitter API) Tweets in our database. This means around ~ 50 Mio of tweets, where each tweet has to be searched for the occurrence of one of the 20.000 usernames. This leads to algorithm #1:
Suppose that in project we are having our 20.000 people, that we want to analyze the network between. In usernames we are storing the names that we want to match each tweet against. The algorithm is simple we read the tweets of every person and check:
If the criteria are met we add an edge in the form [From, To, strength] to our network which we store in values. Each mention has a strength of one. At the end we aggregate those ties adding up the ties having the same pairs of users and adding the values. The result is a network containing the @interactions. Great. But we have a problem, which is the time that it takes to compute this. Why? Well I’ve created a toy sample to show you. It contains 57 people and ~ 120.000 tweets with up to 100 retweets for each tweet. The time it takes to generate the network between them is almost 32 seconds.
This looks good, but if we start to match each tweet against 20.000 people instead of 57 people our performance goes down drastically from around 0.5 seconds per person to almost 60-80 seconds per person. If we now extrapolate from this (60seconds/person * 20.000 Persons)/(3600*24) ~ 10-15 days!! It will take around two weeks to generate this rather small network of 20k people, plus we can never be sure if this process won’t crash because we have used up all the memory of the machine. What to do?
I have mentioned delayed job https://github.com/collectiveidea/delayed_job which is a great gem to be able to create tons of small jobs which can then be processed in parallel by a multitude of workers. We will create a job for each person, write down the results of the job in a csv file and then at the end aggregate all jobs results. This results in algorithm #2:
I’ve created three methods, the first one creates the jobs, one for each person. The second one aggregates the jobs results and is called when all jobs have been processed. The last one is the actual job itself, which is very similar to algorithm #1 except that it saves the output to a csv file instead of an array in the memory. This approach is kind of similar to map reduce since we are in parallel computing the networks for each person and then map or aggregate the results. Additionally I use a method that queries the db periodically to see if the delayed jobs finished their work:
What about the results? For the toy network we get around 21 seconds to finish the jobs. We have improved quite a lot, but how about the 20.000k network. Well sadly the performance did not improve much because the bottleneck is still the same each job has to go through each persons’ tweets and find the ones that contain the username. Thus despite now being able to use multiple cores we are stuck with the db bottleneck. What to do?
To solve the problem of the slow lookup time, we will use a full-fledged search engine called lucene http://de.wikipedia.org/wiki/Lucene which is being accessed by a java solr servlet http://en.wikipedia.org/wiki/Solr. Since we want to use it in rails we will additionally use the http://sunspot.github.com/ gem that makes things even more elegant. Ok what is this about? Well basically we add a server that indexes the tweets in the database and provides an ultra fast search on this corpus. To make our tweets searchable we have to add this description to the model to tell solr what to index:
In this case we want to index the tweet text and all of the corresponding retweet ids. After this all is left is to start the solr server (after you installed the gems etc.) by rake sunspot:solr:start and do a full reindexing of our tweets by rake sunspot:solr:reindex. This might take a while, even up to a day if your database is big. If we are done we can now use the third algorithm:
It is similar to the ones we have seen before yet different in the way that we are not using two iterating loops anymore. Instead for each person we fetch the tweets that mention this person by using full text “@person.username”, which returns all the tweets in which this person was mentioned with an at sign. Then for these we double-check if the author of this tweet is not the same person (loop) and if the tweets don’t include “RT” and have no retweets. If the match these criteria we similarly create a tie. And similarly we aggregate these ties at the end. What about the performance of this algorithm? For the toy project it finishes around 2 seconds. And for the 20.000 k network I’ve displayed some of the times per person results below:
As you can see, even when we are analyzing 20.000 people at the same time per person we get results that are often under one second and up to 10 seconds in peaks, when the person has been mentioned a lot, and we need time to filter those results. One final thing, I’ve noticed that the standard Tokenizer in Solr strips the @sign from the tokens, that’s why for the search engine the term “@facebook” and “facebook” means the same (see my post on stackoverflow http://stackoverflow.com/questions/11153155/solr-sunspot-excact-search-for-words). But in this case I actually care for this difference, while in the first the person is addressing the @facebook account on twitter, in the later the person might be only saying something about facebook and not addressing this particular account. So if we change the tokenizer to whitespaceTokenizer, which doesn’t remove these @ signs we are actually able to search for both.
Well that is it for today. The lesson: It differs a lot how you represent and store your data, in some cases you might end up with terrible results and wait for weeks for the result, and by doing slight changes you might speed up this process up to 10x or 100x times. Although big data focuses on algorithms that run on big data sets and on distributed computations, in the end it might often be easier to shrink the big data into small data by aggregating it in some form and then process it. Or to use what we already have, namely smart solutions e.g. lucene for existing problems like text search. The most important outcome of this approach is that you gain flexibility to experiment with your data, by re-running experiments or algorithms and be able to see what happens, instead of waiting for two weeks. Plus you can somehow trust already existing solutions instead of creating your own ones, which might be often buggy.
P.P.S I am planing to write about a similar story on how I decided to store the friendship ties in a key-value store instead of a relational database and then finally write about how I processed the resulting networks with networkX.
In the last blog post last week (https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/) I have described my way of collecting people on Twitter that are highly listed on lists for certain keywords such as swimming, running, perl, ruby and so on. I have then sorted each of those persons in each category according to how often they were listed in each category. This lead to lists like these below, where you see a listing people found on list that contained the word “actor”.
We might say this is a satisfactory result, because the list seems to contain people that actually seem relevant in regard to this keyword. But what about the persons that we collected for the keyword “hollywood”. Lets have a look:
If you look at the first persons you notice that a lot of these people are the same. Although in my last attempts (https://twitterresearcher.wordpress.com/2012/04/16/5/ and https://twitterresearcher.wordpress.com/2012/03/16/a-net-of-words-a-high-level-ontology-for-twitter-tags/) I tried hard to find keywords that are semantically related such as “car” and “automotive”, the list of user interests ended up having some examples like “actor” and “hollywood”. What are we going to do about this prolem? My solution is to merge those two lists into one since it seems to cover the same interest. But how do I do this without having to subjectively decide on each list?
An idea is to calculate how often members from one list appear on other lists. The lists that have a high overlap will be then merged into one list and the counts that those people received will be added up. The new position on the list will be then determined by the new count. We will need two parameters: the maximum number of persons that we want to look at in each list (i simply called it MAX) and a threshold percentage of % of similar people which decides when to merge two lists. If we merge two lists “actor” and “hollywood” into “actor_hollywood” we also want to run this list against all remaining keywords such as “tvshows” and also merge it with them if the criteria s are met, resulting in “actor_hollywood_tvshows”. The result is a nice clustering of the members we found for our interests. Although these interests have different keywords, if they contain the same members they seem to capture the same semantical concept or user interest. The code to perform this is shown below:
For further processing the code also saves which concepts it merged into which keys and also makes sure that if we merge 200 people from one list with 200 from another list we only take the first 200 from the resulting list.
What does the result look like? I’ve displayed the resulting merged categories using a threshold of 0.1 and the checking the first 1000 places for overlap.
Below you see the final output where I have used a threshold of 0.2 and looked at only the first 200 users in each list. Regarding the final number of communities there is a trade off: When setting the threshold too low we end up with “big” user interest areas where lots of nodes are clumped together. When having a too high threshold, it seems like the groups that obviously should be united (e.g. “theater” and “theatre” ) won’t be merged. I have had good experiences with setting the threshold to 0.2 which means that groups that share 20% of their members are merged into one.
The results of the above attempts are not bad they can be improved. Why ? Well imagine your name was in the actors category which got merged with drama, hollywood, tv_shows and you ended up having the 154th place in this category. This is not bad, but it might be that people actually think that you are more of a “theatre” guy and that is why in the category of theatre you rank 20th. Although knowing that a person can belong to multiple interest groups, if I were to chose the one that best represents you I would say that you are in the theatre category because you ranked 20th there, while only ranking 154th in the actor category.
So this means that I am comparing the rankings that you achieved in each cateogory. But I could also compare the total number of votes that you received on each list. If I did that you would end up being in the actor category because the total number of lists for this category is much higher than for theatre, and the 200 votes received by somebody on the 154th place in the actor category are higher than the 50 votes received by the same person on the 20th place in the theatre category. I have chosen to go with the ranking method, because it is more stable in regard to this problem. Popular interests do not “outweigh” the more specific ones, and if a person can be placed in a specific category then it should be the specific one and not the popular one. The code below does exactly this. Additionally it also notes for each person how often this person was also part of other categories, but the person gets assigned to the category where it got on the higher place.
There is also a small array called final_candidates that is used to put exactly 100 persons in each category at the end. What does the output look like? In most of the cases it leaves the persons in the same category, but in some cases people actually switch categories. These are the interesting cases. I have filtered the output in Excel and sorted it by the number of competing categories, to showcase some of the cases that took place. You notice that e.g. the “DalaiLama” started in the “yoga” category but according to our algorithm (or actually the people’s votes) he fitted more into “buddhism”, or “NASA” started in “tech” but was moved to “astronomy”, which seems even more fitting.
To provide an idea how often this switcheroo took place I have created a simple pivot table listing the average value of competing categories per category (see below). We see that for the majority of categories their people don’t compete for other categories (right side of the chart), but maybe for a handful of categories their people compete for other categories (left peaks of the chart). What you also notice on this graph, is that the lower the threshold, the smaller the final groups, but these groups have a smaller cometing average count (e.g compare violet line size:1000, threshold 0.1 vs. geen line size 1000 threshold 0.2). What you also see is that if we consider only the first 200 places vs. the first 1000 places we get actually better results (compare violet line with red line). This is a bit counter intuitive. Since I was thinking the that the more people we take into consideration the better the results. It rather turns out that after a certain point this voting mechanism seems to get “blurrier”. People getting voted on the 345th place somewhere don’t really matter that much, but eventually they lead to merging these categories together, which shouldn’t have had been merged.
No matter which threshold and size we use there are always a couple of groups that always seem “problematic” (aka the high peaks in the chart on the left) where it seems hard for people to decide where these people belong to. Below I have provided an an excerpt for group size 200 and threshold 0.2. For people in these categories it seems really hard to “pin” them down to a certain interest.
For the rest of the groups we get very stable results. These interest groups seem to be well defined and people don’t think that those people belong to other categories:
For these remaining interest groups we will now take a look at their internal group structure, looking how e.g. opinon leaders (people being very central in the group) are able to get a lot of retweets (or not). Additionally we will take a look on how there are people between different groups (e.g. programming languages ruby and perl) that work as brokers or “boundry spanners”, and if these people are able to get retweets from both communities or only one or none at all. For questions like these these interest groups provide an interesting data source.
In this blog post I want to talk about how to find people on Twitter that are interested in the “same things”. I have posted a number on entries about
Today I want to go through the process of how to use the approximately 200 different keywords representing user interests (e.g. swimming, running, ruby, php, jazz, career, hunting, islam and so on…) and how to get all of the relevant users that are highly contributing to these topics aka. forming the interest based community.
To capture the collective knowledge of Twitter I will make use of Twitters “list-feature”, shown below:
As you can see I am listed in a number of lists such as SNA, social media, dataviz and so on. These lists have been created by people in order to organize Twitter followers into some categories, similar to book lists on amazon and so on.
Having had scraped off the first 100 people for each of the 200 keywords in the last blog post (https://twitterresearcher.wordpress.com/2012/02/17/how-to-make-sense-out-of-twitter-tags/) and storing them in the database I will use these people to find more lists that feature similar people for the keyword. Why? Mainly because Twitter doesnt let you search for lists with a certain name, and because the alternatives such as wefollow.com, twellow.com or listorious.com do not give you all the lists for a given search term. That is why I will have to snow-ball through Twitter lists and keep those that are relevant for a given topic. This process consists of three parts:
This process is shown in the figure below:
How do we collect lists? Well we start by checking if we have enough API calls left on Twitter, if this is the case we start by collecting the memberships for a given user and keep on paging until there are no more lists that the user is listed on. As you can see Twitter can be a bit sensitive to the page size, it can be 1000 items max, but in practice it is around 200-400 items, before we get timeouts. That’s why the function is adopting dynamically to those. Also collecting more than lets say 10000 lists for a given user does not make much sense since, we are probably wasting our API calls for a celebrity like aplusk. Once all of the lists that the user is listed on have been collected I store them into a csv file and the most important part of this procedure: persist these lists in my database that contain the keyword that the person was originally collected for. This means e.g. if the seed user was in the category “swimming”, I will only keep those lists that include the keyword swimming in it. Additionally I make sure that if I already have encountered this list I don’t add it twice to my database.
Once I have collected tons of lists that match the category keywords, I will collect all of the list members that are listed on these lists. The code below is run for every list in the database. As you can see I make sure that there are enough API calls left, and then start to collect all of the members on the list. For this I am using delayed job, which is a nice ruby library https://github.com/tobi/delayed_job that allows me to wrap time consuming tasks in a neat job that can then be run later or by multiple machines on multiple computers. I have made good experiences using around 10-15 workers on a single machine which then process these jobs in the background. Anyway at the end of this step we end up having projects each containing a a high number of people that seem to be relevant for this user interest because they have been listed on lists for explicitly this interest.
After step 1 and 2 we have a number of potential candidates that are relevant for a given topic but we are only interested in those that represent the user interest the most. That is why we need a procedure that filters those people according to how often they have been listed for a certain topic. How do we do that? Well for each topic we have a number of lists that each list which people are relevant according to this list for this topic. Now if we go though all the lists for a given topic and count how often certain persons were listed for this topic we might end up with finding the most relevant users for a given topic (As we will see later in part two, this process gives some nice resulsts, but it can be greatly improved in regard to it’s accuracy). So what does the code below do?
This function is run on the projects that contain the bunch of people that have been collected for a certain Topic. First for all the persons in there it stores the persons in memory for faster computation. It then goes through all the lists that we collected for a given topic and checks again if the lists matches the topic keyword, if it does it checks if we have not encountered this list before (which should not happen, since we made sure we don’t add lists twice in the insertion process, but double checking won’t hurt). If this list has members, then it collects all of these member usernames into an array and checks if among the seen_membersets (which are simply the collection of usernames) there is a set that contains exactly the same members already. Why are we doing this? Because there is list spam out there in Twitter and people or bots end up copying lists only to save it under a different name. So in our case if there happens to be no lists that already has the same members (99%), then we actually analyze this list, otherwise we drop it, because it is too similar to the lists that we already encountered. For each of the list members we check if the the persons we are computing the list count for can be found on the list, if this is the case we add plus 1 to the persons counter. Otherwise if there is a person on the list that we somehow have not captured before we add it to our pool of persons and also raise the persons counter. At the end of this procedure we end up having a list count for each person that was on these lists and can directly see how relevant this person is in regard to a certain topic. We output the sorted list count into a simple csv only to use these later in part 2, to improve our accuracy.
To show the result of this process I have cut and pasted a small part of the result of step 3 below. As you can see we managed to find out that it seems like aplusk, JimCarrey, tomhanks and so seem to be the most relevant Twitter users for the community of actors. This list contains ~18.000 entries, where people towards the end do not seem to be that much representing the actor community as the people at the beginning of the list.
If we now take e.g. 100 – 200 people from each of these lists (assuming that people cannot manage more than this amount of people according to the dunbar or wellman number http://en.wikipedia.org/wiki/Dunbar%27s_number) we end up having those interest based communities of people on Twitter that share interest.
Studying those communities is what I am trying to do in my work, but more on that later. These communities are also interesting for advertisers: Imagine someone who wants to sell swimming underwear, this retailer would be highly interested if you could show him all the people on Twitter that are interested in swimming. Those people could be his first customer group. If his swimming underwear gets approved by those people then it is very likely that they will talk about it and so inform other swimming interested people about this product.
And in part two I will show you how we can improve these communities by allowing people to move from one community to the other, if they “fit” better into this community.