There is an abundant discussion about big data, also on the definition of it (e.g. http://whatsthebigdata.com/2012/06/06/a-very-short-history-of-big-data/). I would say for me big data is when I the data becomes so big that you need to shard your databases and create distributed solutions to computational heavy routines on multiple machines e.g. using mahout, pig or some other map/reduce approach http://de.wikipedia.org/wiki/MapReduce.
In comparison to big data, my data is rather small (20.000 Twitter Users, 50 Mio. Tweets, and ~ 50 Mio x 100 Retweets). It fits on one machine and yet creates a lot of problems when dealing with it. I thought I’d write-up some of the solutions I have found when approaching these Social Network specific data problems.
Generating, Storing and analyzing networks between people
One of they key routines of my work is extracting networks among people. The easiest network are the friend and follower connections storing and retrieving those is a problem of its own (which I will cover in another blog post). I will show you why storing ~ 100.000 ties per person in a Mysql database is a bad idea.
Solution one: Generating @-Networks from Tweets
The next relevant ties are the @-connections. Which correspond to one person mentioning another person in a tweet. These ties are more interesting since they indicate a stronger relationship between people. But extracting them is also a bit harder. Why? Well, if we have 20.000 Persons that we want to create a network of @-mentions in between, this also means that we have max 20.000 x 3200 (3200 being the maximum number of tweets we can extract for a person using the Twitter API) Tweets in our database. This means around ~ 50 Mio of tweets, where each tweet has to be searched for the occurrence of one of the 20.000 usernames. This leads to algorithm #1:
Suppose that in project we are having our 20.000 people, that we want to analyze the network between. In usernames we are storing the names that we want to match each tweet against. The algorithm is simple we read the tweets of every person and check:
- Is tweet mentioning one of the other 20.000 persons?
- Is this tweet not containing the “RT” (e.g. “RT @user have you seen xyz”)
- Has this tweet been retweeted by others? Here we assume that @conversations are such tweets that are not retweeted but mention another user
If the criteria are met we add an edge in the form [From, To, strength] to our network which we store in values. Each mention has a strength of one. At the end we aggregate those ties adding up the ties having the same pairs of users and adding the values. The result is a network containing the @interactions. Great. But we have a problem, which is the time that it takes to compute this. Why? Well I’ve created a toy sample to show you. It contains 57 people and ~ 120.000 tweets with up to 100 retweets for each tweet. The time it takes to generate the network between them is almost 32 seconds.
This looks good, but if we start to match each tweet against 20.000 people instead of 57 people our performance goes down drastically from around 0.5 seconds per person to almost 60-80 seconds per person. If we now extrapolate from this (60seconds/person * 20.000 Persons)/(3600*24) ~ 10-15 days!! It will take around two weeks to generate this rather small network of 20k people, plus we can never be sure if this process won’t crash because we have used up all the memory of the machine. What to do?
Solution two: Use multiple workers to get the job done
I have mentioned delayed job https://github.com/collectiveidea/delayed_job which is a great gem to be able to create tons of small jobs which can then be processed in parallel by a multitude of workers. We will create a job for each person, write down the results of the job in a csv file and then at the end aggregate all jobs results. This results in algorithm #2:
I’ve created three methods, the first one creates the jobs, one for each person. The second one aggregates the jobs results and is called when all jobs have been processed. The last one is the actual job itself, which is very similar to algorithm #1 except that it saves the output to a csv file instead of an array in the memory. This approach is kind of similar to map reduce since we are in parallel computing the networks for each person and then map or aggregate the results. Additionally I use a method that queries the db periodically to see if the delayed jobs finished their work:
What about the results? For the toy network we get around 21 seconds to finish the jobs. We have improved quite a lot, but how about the 20.000k network. Well sadly the performance did not improve much because the bottleneck is still the same each job has to go through each persons’ tweets and find the ones that contain the username. Thus despite now being able to use multiple cores we are stuck with the db bottleneck. What to do?
Solution three: Use lucene / solr a enterprise solution for indexed full-text search
To solve the problem of the slow lookup time, we will use a full-fledged search engine called lucene http://de.wikipedia.org/wiki/Lucene which is being accessed by a java solr servlet http://en.wikipedia.org/wiki/Solr. Since we want to use it in rails we will additionally use the http://sunspot.github.com/ gem that makes things even more elegant. Ok what is this about? Well basically we add a server that indexes the tweets in the database and provides an ultra fast search on this corpus. To make our tweets searchable we have to add this description to the model to tell solr what to index:
In this case we want to index the tweet text and all of the corresponding retweet ids. After this all is left is to start the solr server (after you installed the gems etc.) by rake sunspot:solr:start and do a full reindexing of our tweets by rake sunspot:solr:reindex. This might take a while, even up to a day if your database is big. If we are done we can now use the third algorithm:
It is similar to the ones we have seen before yet different in the way that we are not using two iterating loops anymore. Instead for each person we fetch the tweets that mention this person by using full text “@person.username”, which returns all the tweets in which this person was mentioned with an at sign. Then for these we double-check if the author of this tweet is not the same person (loop) and if the tweets don’t include “RT” and have no retweets. If the match these criteria we similarly create a tie. And similarly we aggregate these ties at the end. What about the performance of this algorithm? For the toy project it finishes around 2 seconds. And for the 20.000 k network I’ve displayed some of the times per person results below:
As you can see, even when we are analyzing 20.000 people at the same time per person we get results that are often under one second and up to 10 seconds in peaks, when the person has been mentioned a lot, and we need time to filter those results. One final thing, I’ve noticed that the standard Tokenizer in Solr strips the @sign from the tokens, that’s why for the search engine the term “@facebook” and “facebook” means the same (see my post on stackoverflow http://stackoverflow.com/questions/11153155/solr-sunspot-excact-search-for-words). But in this case I actually care for this difference, while in the first the person is addressing the @facebook account on twitter, in the later the person might be only saying something about facebook and not addressing this particular account. So if we change the tokenizer to whitespaceTokenizer, which doesn’t remove these @ signs we are actually able to search for both.
Well that is it for today. The lesson: It differs a lot how you represent and store your data, in some cases you might end up with terrible results and wait for weeks for the result, and by doing slight changes you might speed up this process up to 10x or 100x times. Although big data focuses on algorithms that run on big data sets and on distributed computations, in the end it might often be easier to shrink the big data into small data by aggregating it in some form and then process it. Or to use what we already have, namely smart solutions e.g. lucene for existing problems like text search. The most important outcome of this approach is that you gain flexibility to experiment with your data, by re-running experiments or algorithms and be able to see what happens, instead of waiting for two weeks. Plus you can somehow trust already existing solutions instead of creating your own ones, which might be often buggy.
P.P.S I am planing to write about a similar story on how I decided to store the friendship ties in a key-value store instead of a relational database and then finally write about how I processed the resulting networks with networkX.