
Datamining

This category contains 7 posts

Problems when working on (kind of) big data to create networks between people

There is an abundant discussion about big data, including about its definition (e.g. http://whatsthebigdata.com/2012/06/06/a-very-short-history-of-big-data/). For me, data becomes big data when it grows so big that you need to shard your databases and create distributed solutions that run computationally heavy routines on multiple machines, e.g. using Mahout, Pig or some other map/reduce approach (http://de.wikipedia.org/wiki/MapReduce).

In comparison to big data, my data is rather small (20.000 Twitter users, 50 Mio. tweets, and ~50 Mio. x 100 retweets). It fits on one machine and yet creates a lot of problems when dealing with it. I thought I'd write up some of the solutions I have found when approaching these social-network-specific data problems.

Generating, Storing and analyzing networks between people

One of the key routines of my work is extracting networks among people. The easiest networks are the friend and follower connections; storing and retrieving those is a problem of its own, which I will cover in another blog post, where I will show you why storing ~100.000 ties per person in a MySQL database is a bad idea.

Solution one: Generating @-Networks from Tweets

The next relevant ties are the @-connections, which correspond to one person mentioning another person in a tweet. These ties are more interesting since they indicate a stronger relationship between people. But extracting them is also a bit harder. Why? Well, if we have 20.000 persons between whom we want to create a network of @-mentions, we also have at most 20.000 x 3200 tweets in our database (3200 being the maximum number of tweets we can extract per person using the Twitter API). That is around ~50 Mio. tweets, and each tweet has to be searched for the occurrence of one of the 20.000 usernames. This leads to algorithm #1:

Suppose that project holds the 20.000 people that we want to analyze the network between, and usernames stores the names that we want to match each tweet against. The algorithm is simple: we read the tweets of every person and check:

  • Does the tweet mention one of the other 20.000 persons?
  • Does the tweet not contain "RT" (e.g. "RT @user have you seen xyz")?
  • Has the tweet not been retweeted by others? Here we assume that @-conversations are tweets that mention another user but are not retweeted.

If the criteria are met we add an edge of the form [from, to, strength] to our network, which we store in values. Each mention has a strength of one. At the end we aggregate those ties, adding up the strengths of ties with the same pair of users. The result is a network containing the @-interactions. Great. But we have a problem, which is the time that it takes to compute this. Why? Well, I've created a toy sample to show you. It contains 57 people and ~120.000 tweets with up to 100 retweets for each tweet. The time it takes to generate the network between them is almost 32 seconds.
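To make the description concrete, here is a minimal sketch of what algorithm #1 could look like. The model and association names (project, people, tweets, retweets) are assumptions based on the description above, not the original code.

# Hypothetical sketch of algorithm #1: brute-force matching of every tweet
# against every username. Model and association names are assumed.
values = []
project.people.each do |person|
  person.tweets.each do |tweet|
    next if tweet.text.include?("RT")   # skip tweets that quote others via "RT"
    next if tweet.retweets.any?         # skip tweets that have been retweeted
    usernames.each do |name|
      next if name == person.username
      values << [person.username, name, 1] if tweet.text.include?("@#{name}")
    end
  end
end

# aggregate parallel edges: ties with the same [from, to] pair are summed up
network = values.group_by { |from, to, _| [from, to] }.
                 map { |(from, to), ties| [from, to, ties.size] }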

Those 32 seconds look good, but if we start to match each tweet against 20.000 people instead of 57, performance drops drastically, from around 0.5 seconds per person to almost 60-80 seconds per person. If we extrapolate from this: (60 seconds/person x 20.000 persons) / (3600 x 24 seconds/day) ≈ 14 days!! It will take around two weeks to generate this rather small network of 20k people, and we can never be sure the process won't crash because we have used up all the memory of the machine. What to do?

Solution two: Use multiple workers to get the job done

I have already mentioned delayed_job (https://github.com/collectiveidea/delayed_job), a great gem for creating tons of small jobs which can then be processed in parallel by a multitude of workers. We will create a job for each person, write the result of each job to a csv file and at the end aggregate all job results. This results in algorithm #2:

I've created three methods. The first one creates the jobs, one for each person. The second one aggregates the job results and is called when all jobs have been processed. The last one is the actual job itself, which is very similar to algorithm #1 except that it saves the output to a csv file instead of an array in memory. This approach is similar to map/reduce in that we compute the networks for each person in parallel (map) and then aggregate the results (reduce). Additionally I use a method that queries the db periodically to see if the delayed jobs have finished their work. A sketch of this setup follows below.
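What follows is only a rough sketch of how such a delayed_job setup could look; the job class name, the file paths and the model names are assumptions, not the original code.

# Hypothetical sketch of algorithm #2: one delayed_job per person,
# each writing its partial edge list to a csv file.
require 'csv'

class MentionNetworkJob < Struct.new(:person_id, :usernames)
  def perform
    person = Person.find(person_id)
    CSV.open("results/#{person_id}.csv", "w") do |csv|
      person.tweets.each do |tweet|
        next if tweet.text.include?("RT") || tweet.retweets.any?
        usernames.each do |name|
          next if name == person.username
          csv << [person.username, name, 1] if tweet.text.include?("@#{name}")
        end
      end
    end
  end
end

# method one: create one job per person
def create_jobs(project, usernames)
  project.people.each { |p| Delayed::Job.enqueue MentionNetworkJob.new(p.id, usernames) }
end

# method two: poll the db until all jobs are done, then aggregate the csv files
def aggregate_results
  sleep 60 while Delayed::Job.count > 0
  edges = Dir["results/*.csv"].flat_map { |f| CSV.read(f) }
  edges.group_by { |from, to, _| [from, to] }.map { |(from, to), ties| [from, to, ties.size] }
end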

What about the results? For the toy network the jobs finish in around 21 seconds. That is quite an improvement, but what about the 20.000-person network? Sadly the performance did not improve much, because the bottleneck is still the same: each job has to go through each person's tweets and find the ones that contain the username. So despite now being able to use multiple cores, we are stuck with the db bottleneck. What to do?

Solution three: Use lucene / solr, an enterprise solution for indexed full-text search

To solve the problem of the slow lookup time, we will use a full-fledged search engine called lucene (http://de.wikipedia.org/wiki/Lucene), accessed through the java solr servlet (http://en.wikipedia.org/wiki/Solr). Since we want to use it in rails we will additionally use the sunspot gem (http://sunspot.github.com/), which makes things even more elegant. So what is this about? Basically we add a server that indexes the tweets in the database and provides ultra-fast search on this corpus. To make our tweets searchable we have to add a description to the model that tells solr what to index.
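A sketch of such a searchable declaration is shown below; the exact fields (a text column and a retweets association) are my assumptions based on the description, not the original model.

# Hypothetical Tweet model with a sunspot searchable block
class Tweet < ActiveRecord::Base
  has_many :retweets

  searchable do
    text :text                                   # full-text index on the tweet body
    integer :retweet_ids, :multiple => true do   # index the ids of the corresponding retweets
      retweets.map(&:id)
    end
  end
end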

In this case we want to index the tweet text and all of the corresponding retweet ids. After this, all that is left is to start the solr server (after you have installed the gems etc.) with rake sunspot:solr:start and do a full reindexing of our tweets with rake sunspot:solr:reindex. This might take a while, even up to a day if your database is big. Once this is done we can use the third algorithm:
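Again, the following is only a sketch of what this third algorithm could look like with sunspot; the paginate limit and the model names are assumptions.

# Hypothetical sketch of algorithm #3: let solr find the mentioning tweets,
# then filter and build the edge list as before.
values = []
project.people.each do |person|
  search = Tweet.search do
    fulltext "@#{person.username}"               # only tweets that mention this person
    paginate :page => 1, :per_page => 10_000     # assumed upper bound per person
  end
  search.results.each do |tweet|
    next if tweet.username == person.username    # ignore self-mentions
    next if tweet.text.include?("RT") || tweet.retweets.any?
    values << [tweet.username, person.username, 1]
  end
end

# aggregate parallel edges into weighted ties as before
network = values.group_by { |from, to, _| [from, to] }.
                 map { |(from, to), ties| [from, to, ties.size] }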

It is similar to the ones we have seen before, but different in that we no longer use two nested loops. Instead, for each person we fetch the tweets that mention them with a full-text search for "@username", which returns all the tweets in which this person was mentioned with an @ sign. For these we then check that the author of the tweet is not the person themselves, that the tweet does not include "RT" and that it has no retweets. If a tweet matches these criteria we create a tie, and as before we aggregate the ties at the end. What about the performance of this algorithm? For the toy project it finishes in around 2 seconds, and for the 20.000-person network I've displayed some of the per-person times below:

As you can see, even when we analyze 20.000 people the per-person results are often under one second, with peaks of up to 10 seconds when a person has been mentioned a lot and we need time to filter the results. One final thing: I've noticed that the standard tokenizer in Solr strips the @ sign from the tokens, which is why the search engine treats "@facebook" and "facebook" as the same term (see my post on stackoverflow: http://stackoverflow.com/questions/11153155/solr-sunspot-excact-search-for-words). But in this case I actually care about this difference: in the first case the person is addressing the @facebook account on Twitter, while in the latter the person might only be saying something about facebook without addressing that particular account. If we change the tokenizer to the WhitespaceTokenizer, which does not remove the @ signs, we are able to search for both.

Conclusion

Well, that is it for today. The lesson: how you represent and store your data matters a lot. In some cases you end up with terrible performance and wait weeks for a result, while slight changes can speed up the process by 10x or 100x. Although big data focuses on algorithms that run on big data sets and on distributed computation, in the end it is often easier to shrink the big data into small data by aggregating it in some form and then process that. Or to reuse what already exists, namely smart solutions like lucene for existing problems like text search. The most important outcome of this approach is that you gain the flexibility to experiment with your data, re-running experiments or algorithms and seeing what happens, instead of waiting for two weeks. Plus you can, to some degree, trust existing solutions instead of building your own, which will often be buggy.

P.S. I am planning to write about a similar story on how I decided to store the friendship ties in a key-value store instead of a relational database, and then finally about how I processed the resulting networks with networkX.

Cheers
Thomas

Audience analysis of major Twitter news outlets

Motivation

A very interesting blog post from the people at socialflow was the inspiration for this little study. The socialflow study analyzed the Twitter outlets of the main news providers like CNN or NYT to find out whether they have a common audience, how they compare when it comes to being retweeted, and so on. So I thought it would be a good idea to come up with something similar for German newspapers. Another issue is the simple fact that the media analysis of TV, radio or newspapers focuses strongly on the demographics of their readers (see screenshot below) but totally neglects the following issues:

  • Social media as a medium (incl. Twitter, Facebook etc.) is not analyzed at all (how do accounts compare on their followers, friending, tweet content/frequency ...)
  • The readers' relationships with each other (is there a connected audience?)
  • How the readership is extended by the sharing functions (retweets) (how do stories get passed along, which ones are the most popular ...)

A screenshot from ma-reichweite.de

Research Questions

I’ve decided to focus on a couple of very general research questions:

  • How many outlets does each publisher have and how are they connected with each other?
  • How do accounts compare regarding their Followers, Friends and Messages?
  • How does the  user engagement in terms of retweets differ between the outlets?
  • How do retweets help to reach a wider audience?
  • Is there a shared audience between those accounts and publishers?

Data

Since the German newspaper ecosystem is quite fragmented, there are quite a few different publishers and thousands of different (daily, weekly) newspapers and magazines. I've decided to focus on the following ones:

Results

What we see from the general overview is that news outlets differ quite a lot in the number of active accounts. SPIEGEL has 24 Twitter accounts with a total of almost 500.000 followers. The leading tabloid BILD, despite having a huge offline reach of 12 Mio. readers, only accumulates 170.000 followers on Twitter.

Structure of the Twitter Outlets among each other

When we have a look at how these 118 Twitter accounts are linked with each other, a pattern appears (see figure below). The general norm seems to be one main Twitter outlet (e.g. BILD_News or zeitonline) which is connected with the remaining topic-specific accounts, which in turn are all connected to the other Twitter outlets of the same publishing house. Twitter outlets of different publishers are not connected with each other.

Comparison of Followers, Messages and Friends

Looking at the distribution of followers, I found that more than 70% of the analyzed accounts have less than 10.000 followers. Among the top 20 outlets by followers it is striking that more than 8 belong to SPIEGEL. This publisher seems to dominate the field.

Overview of Tweets

If we look at the number of tweets produced, we see that again around 70% of all accounts have generated less than 10.000 tweets during their existence. In the top 20 we find extreme examples like focussport or focusonline, which produce up to 90 tweets per day. At such a frequency I wonder how the followers of these accounts cope with the flood of tweets.

Followees distribution

Looking at the followee figures we find the most surprising result: it seems that only the TAZ account (tazgezwitscher) follows its readers back and thereby at least offers the potential to read what readers have to say. This brings us to the question: if we are in social media and interaction with the readership is a given, how do these outlets actually interact with their readers?

 Interaction with readers

To measure how much these outlets interact with their readers, I collected all tweets of each account and counted how often they refer to somebody using the @ sign, distinguishing between references to their own accounts and references to actual readers. The results are rather surprising: out of 270.000 tweets, only 13.000 interact with somebody at all. Of these, almost 10.000 refer to the publisher's own accounts (e.g. when BILD_NEWS refers to BILD_Sport). So only 3.000 tweets actually interact with readers, which is a meager 1%. Interaction with readers is taking place at a shockingly low level.
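The following is just a rough sketch of how such a count could be done; the own_accounts list, the tweets dataset and the simple @-mention regex are illustrative assumptions, not the original code.

# Hypothetical sketch: split an outlet's @-mentions into references to the
# publisher's own accounts vs. references to readers.
own_accounts = %w(BILD_News BILD_Sport BILD_Digital)   # example list, per publisher
own_refs = 0
reader_refs = 0

tweets.each do |t|
  mentions = t[:text].scan(/@(\w+)/).flatten
  next if mentions.empty?                              # this tweet interacts with nobody
  if mentions.any? { |m| own_accounts.include?(m) }
    own_refs += 1
  else
    reader_refs += 1
  end
end

puts "own accounts: #{own_refs}, readers: #{reader_refs}"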

What does the readership of an account look like?

There are theories about a connected readership online, speculating that the social media readers of an account are connected to each other and exchange and discuss content online. In order to find out whether such a structure emerges, I analyzed the account of fr_online as an example and collected all of its ~9000 readers. Below you see a spring layout of the 9000 nodes in gephi. You find the typical core-periphery structure: 10% of readers do not have any connections to other readers, 50% of readers have less than 15 links, and there is a core of highly connected readers. Among these highly connected readers we actually find commercial or celebrity accounts such as: ntvde, derfreitag, Calmund, tagesspiegel_de, Piratenpartei, handelsblatt, hronline, spdde, ...

Network layout of 9000 readers of fr_online

Engagement of Readers

In order to measure how the accounts differ in reader engagement, I have collected all retweets for all tweets of all accounts and created two ratios:

  • Retweets / Message
  • Retweets / Follower

Retweets / Message

Looking at all accounts I found that almost 90% got less than one retweet per tweet on average. This is still a respectable result if we keep in mind the findings of Romero et al., who found that users retweet only one in 318 links. If we look at the top 20 accounts with the highest retweets/message ratio, the news-breaking account of SPIEGEL stands out with around 10 retweets per tweet on average. Similar results are only achieved by the main accounts of ZEIT and TAZ. On the other end of the spectrum we find accounts like focuspanorama (11.000 messages / 14 retweets) or focussport (95.000 messages / 3 retweets).

Retweets / Follower

Regarding retweets per follower, 79 accounts had a ratio of less than 0.1, which means less than one retweet per 10 followers. Among the top 20, the highest ratio of one retweet for every 3 followers was achieved by tazgezwitscher. This account seems to have the most engaged readership, which helps it spread its news well beyond its direct readership. Among the accounts with the lowest audience engagement we find BILD_Bundesliga (with 40.000 followers and 1000 retweets) or SPIEGEL_Rezens (with 30.000 followers and 300 retweets). We can speculate that sports-related content in particular is not retweeted that often because soccer results are simply consumed and not shared.

Exemplary analysis of the engaged readership of one account

In order to see the structure of the readers that have retweeted at least one tweet from an account, I collected such users for the account of fr_online, laid them out with gephi, and applied the modularity community-detection algorithm. The results below show that readers actually cluster in different communities, which differ in their political orientation or interests.

Structural overview of readers of fr_online that retweeted at least one of its messages

 Extended Readership

Knowing that retweets yield an extended readership (see below), one goal was to take a glimpse of what such an extended readership might mean for the reach of one account.

Extended Readership through retweets

To get an idea how the extended readership boosts an account's reach, I collected all tweets and the respective retweets of these accounts. For each retweet I looked up how many followers the retweeting reader had. By simply adding up the followers of every reader who retweeted the account, you get the potential maximal extended audience that might have been reached through these retweets. I say potential maximal because I am not taking into account that persons who retweeted messages might have a shared audience (e.g. imagine reader5 and reader6 being the same person in the figure above).
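As a minimal sketch (the retweeters_of helper and the followers_count attribute are hypothetical), this upper bound is just a sum:

# Hypothetical sketch: upper bound on the extended audience of one account.
# Assumes retweeters_of(account) returns the distinct users who retweeted it,
# each with a followers_count attribute.
def potential_max_extended_audience(account)
  retweeters_of(account).map(&:followers_count).reduce(0, :+)
end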

Extended audience through retweets

We notice that the total of 27.000 retweets of zeitonline have generated an extended audience of 4.2 Mio. readers, and in the case of tazgezwitscher 15.000 retweets resulted in more than 2 Mio. additional readers. What we can take away from this calculation is that retweets really change the distribution game: while zeitonline has approximately 80.000 followers, some of their news was potentially seen by a total of 4.2 Mio. people, a multiplication of ~50x. I think this shows the true power of social media.

Potential multipliers

When drilling down into the data we found readers that are especially valuable for an account because they have a high number of followers themselves, serving as huge multipliers for the audience. Three cases emerge quite often:

  • Publishers use their own main accounts to boost the readership of smaller thematic accounts (e.g. when bild_sport (10.000 followers) is retweeted by bild_news (80.000 followers), or zeitonline_wir (3.000 followers) is retweeted by zeitonline (80.000 followers)).
  • Influential users retweet the content (e.g. tweets from BILD_Digital (4.600 followers), SPIEGEL_Reise (14.000 followers) or SPIEGEL_Netz (22.000 followers) are retweeted by rather unknown readers that have a high number of followers themselves: einerHaupka (170.000 followers), AxelKoster (120.000 followers), haukepetersen (70.000 followers)).
  • The subject of the content retweets it himself (e.g. a tweet about the band "jetward" from bild_aktuell (35.000 followers) is retweeted by the fan account planetjetward (300.000 followers), or jeffjarvis (80.000 followers) retweets the focuslive account (10.000 followers), who made an interview with him).

“Two-Step-Flow” of information

Regarding these diffusion patterns I asked myself whether we can compute something similar to a two-step flow of information, i.e. the share of retweeted material that was retweeted not because it was seen on the original account itself, but because it reached a reader through an intermediary. We defined the two-step-flow ratio as:

the number of people that have retweeted an account and follow it directly, divided by the total number of people that have retweeted the account.

Readers following an account and retweeting it (green) , Readers NOT following an account and retweeting it (orange). Potential Two-Step-Flow dashed line.

The ratio can be as high as 1, if everybody that retweeted the account follows it directly, and as low as 0, if nobody that retweeted the account follows it directly. We ordered the accounts by the lowest ratio first, and we see that some accounts like zeitonline_wir achieve a ratio of less than 0.5, which means that more than half of their retweets came from people who were not directly following the account. There are two possible explanations for such a low ratio: a) people received the tweet from a broker or middleman and then retweeted it (which is in favor of the two-step-flow hypothesis), or b) people simply saw the article on the website and decided to retweet it. Since we didn't analyze this in detail we can only guess at the split, but it would definitely be worth its own analysis.

(in red) Ratio of people that retweeted an article and were directly following the account / all people that retweeted an article

Shared Readers

The final step of this analysis was to find out how many readers the outlets have in common (see the orange people in the graphic below). The shared-readers measure reaches its maximal value of 0.5 when, e.g., each account has 100 followers and all of them follow both accounts (100/200), and its minimum of 0 when no users are shared.

Shared Readers

We computed this ratio for each combination of accounts and displayed it in a symmetric matrix (see image below). We additionally grouped the accounts in the matrix by publisher (see blue boxes). The higher the ratio, the greener the cell; red means a lower ratio.

Shared audience by publisher

Symmetric matrix of shared audience

What we see in this visualization is that a common readership emerges especially among accounts of the same publisher (e.g. Spiegel_eil, Spiegel_news, Spiegel_reise ...). People who like SPIEGEL very often follow its other accounts too. This pattern emerges even more clearly when we group the shared audience by publisher (below). What really stands out is that the tabloid BILD has an audience which is very different from the other audiences. On the other hand, "intellectual" and social-media-established newspapers such as ZEIT or SPIEGEL seem to share a relatively big audience (~8%).

View of shared audience grouped by publisher

Shared audience by account

If we highlight the shared audiences that are three standard deviations higher than the average value (0.03), we also note that there are certain accounts that do not belong to the same publisher but have a very big shared audience (green cells in the matrix below).

Shared audience between accounts (green = three SD higher than average)

Since the matrix above is not very good at showing the structure that emerges in the data, we simply visualized the data in network form, connecting the accounts that share an audience; the line strength was chosen according to the percentage of shared audience (see below).

Shared audience network visualization

In this visualization a number of interesting observations emerge:

  • Accounts focusing on spreading top news (red, e.g. Spiegel_EIL, BILD_NEWS, BILD_AKTUELL, Spiegel_TOP, tazgezwitscher) have a shared audience.
  • We see the same pattern of readers following accounts of the same publisher (e.g. zeitonline_wir, zeitonline_kul, zeitonline_wis and zeitonline_pol, or Spiegel_wirtsch, Spiegel_politik, Spiegel_pano, Spiegel_seite2, Spiegelzwischen, Spiegel_SPAM).
  • Accounts that share a thematic focus seem to generate a shared audience. See travel: Stern_reise, Welt_reise, Faz_reise, Focusreise. Or cars: focusauto, FAZauto, SZ_Auto.

Conclusion

We have arrived at the end of our little explorative analysis. A couple of takeaways:

  • Some publishers use Twitter quite successfully as a channel to enhance their reach and their interaction with readers (as in the examples of SPIEGEL, ZEIT or TAZ).
  • Despite the enthusiasm, the image of an interconnected audience does not emerge that strongly: readers do not interact much with the outlets, and a high number of readers is only weakly connected to each other.
  • Reader engagement can be measured quite nicely in retweets/message and retweets/follower, which capture different aspects.
  • A simple modularity analysis of an account's retweet network can bring interesting insights into how its audience is clustered (as in the case of fr_online).
  • Retweets in general and the resulting two-step flow of information can boost the reach of an account by a potential factor of ~10-50x.
  • Some very influential readers emerge whose audience is often bigger than the audience of the outlet itself.
  • A shared audience emerges between accounts of the same publisher, but also between accounts of different publishers when they share a common topic (e.g. travel).

That is it for today; I am excited to hear your comments.

Cheers

Thomas

P.S.

I am presenting this small analysis tomorrow at the SGKM conference (on journalism, social media and communication) and am excited to hear what the audience has to say.

Datamining Twitter Part 5 – Collecting multiple keywords

We are doing quite fine now: we are able to store the stream and make sure that the collection is running smoothly, and it restarts in case something happens. So everything is in place, apart from the fact that I want to monitor multiple things at once and filter my tweets before storing them.

So let's get started. We will only make a few modifications to the collect_tweets.rb file.

Step 1

I like the yaml format for storing stuff that should be human readable. I know there is json and xml but they are just not fun to read.

Let's suppose we want to see which movies are doing well and which aren't, so we set up our collection to monitor six movies at the same time and store the tweets.

So we will create a small config file config.yaml that will hold the information we need:

movie1:
  db: ateam.sqlite
  keywords: a-team

movie2:
  db: macgruber.sqlite
  keywords: macgruber

movie3:
  db: marmaduke.sqlite
  keywords: marmaduke

movie4:
  db: princepersia.sqlite
  keywords:
    - prince
    - persia

movie5:
  db: robinhood.sqlite
  keywords:
    - robin
    - hood

movie6:
  db: shrek.sqlite
  keywords: shrek

The file holds the db parameter for each movie (although we could also use tables instead) and the keywords we want to monitor. For some movies like Robin Hood we want to look for two keywords, robin AND hood. For others like Shrek one is enough.

Step 2

Now that we have the file, let's read it in:

require 'yaml'   # YAML.load_file comes from the standard library

path = File.dirname(File.expand_path(__FILE__))
config = YAML.load_file(path + "/" + "config.yaml")

Wasn't that easy? I mean, how much more convenient can it get :). Our config parameters are now stored in the config hash. Let's use this hash to configure our application.

To get the keywords we can do:

keywords = config.values.collect {|k| k["keywords"]}.flatten

To have more convenient access to the tweet databases we could do:

tweet_tables = {}
config.values.each do |k|
  tweet_tables[k["db"]] = Sequel.sqlite(path + "/" + k["db"])[:tweets]
end

So we have all the connectors to the databases and can get going.

Step 3

The only thing we need to change now is the collection process. When we encounter one of our keywords I would like to store the tweet in the appropriate database.

So our client gets a new starting line, which makes sure it tracks all the keywords:

@client.track(keywords.join(",")) do |status|

The problem is that all of those keywords are combined with an OR. Actually that's a good thing, otherwise we wouldn't be able to track multiple things at once. So in the inner loop we have to make sure that we dispatch the tweets and store them in the right database.

  selected = ""                          # will hold the db this tweet belongs to
  config.values.each do |k|
    if k["keywords"].all? { |str| status.text.downcase.include? str }
      selected = k["db"]
    end
  end
  if selected == ""
    puts red("[Not all keywords found]") + status.text
  else
    tweet_tables[selected].insert(tweet) # tweet is the attribute hash built from status (left out here)
    puts "[" + green(selected) + "]" + "[#{status.user.screen_name}] #{status.text}"
  end
end

I've left out the uninteresting parts, but that's all you need to store the tweets in the databases. So what is happening here?

  • First I check whether all of the keywords are contained in the tweet. Notice how nicely the all? enumerator helps out here: whether a movie has one keyword or ten doesn't matter.
  • Second, depending on the keywords I select the database.
  • Last, if the tweet did not match all of the keywords of any movie, I print a little line saying so; otherwise I store the tweet in the appropriate database.

You might ask what those funny green and red methods do. It's a little trick I learned on Dmytro's blog: two nice helper methods that color the output in your console. I think it makes supervising the process much more fun.

So in case you want to use them too, here they are

def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

def red(text); colorize(text, 31); end
def green(text); colorize(text, 32); end

So we are pretty much done. We have a nice config file that contains all the information, and we have our collection process that collects the tweets and puts them into the right databases. Make sure to create those databases before you start the collection process, otherwise it might complain; a small helper for that is sketched below.
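Something like the following could do that job; it is only a sketch and assumes the same tweets schema as in the earlier parts of this series.

# Hypothetical helper: create each movie database and its tweets table
# up front so the collector never runs into a missing db.
require 'rubygems'
require 'sequel'
require 'yaml'

path   = File.dirname(File.expand_path(__FILE__))
config = YAML.load_file(path + "/config.yaml")

config.values.each do |k|
  db = Sequel.sqlite(path + "/" + k["db"])
  db.create_table? :tweets do          # create_table? skips tables that already exist
    primary_key :id
    String  :text
    String  :username
    Time    :created_at
    String  :lang
    String  :time_zone
    Integer :guid
  end
end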

Have fun with your ideas and drop me a line if you have a question.

Cheers
Thomas

Datamining Twitter Part 4 – Daemons and Cron

Although we covered in part 3 that we can use screen to run our collection in the background and detach from it safely, that approach has some minor drawbacks.

  • To start the process I have to go through a manual setup routine: start screen, execute the collection, then detach from it.
  • If my process somehow dies inside screen, either through a buffer overflow or because I wasn't prepared for every eventuality and the process got disconnected from the source, my data collection will be corrupted.

Step 1: Daemons

To compensate for those shortcomings I will show you a setup that makes it easier to ensure our process is running and collecting tweets: the daemons gem and a bit of cronjob magic. To install the daemons gem just write:

gem install daemons

We will need to create an additional file that will serve as our control program and start and stop the collection. I will call it collect_tweets_control.rb:

require 'rubygems'
require 'daemons'

Daemons.run("collect_tweets.rb")

We can use it like this:

ruby collect_tweets_control.rb start
    (collect_tweets.rb is now running in the background)
ruby collect_tweets_control.rb restart
    (...)
ruby collect_tweets_control.rb stop

I think it is quite cool :).

For the first time we will test it by running collect_tweets.rb without forking into the background:

ruby collect_tweets_control.rb run

If you are using files in collect_tweets.rb, make sure you reference them with their full path.

path = File.dirname(File.expand_path(__FILE__))
#log = Logger.new('collect_tweets.log')
log = Logger.new(path + "/" + 'collect_tweets.log')
# This also applies for your sqlite database
tweets = Sequel.sqlite(path + "/" + "tweets.sqlite")[:tweets]

Otherwise the daemon will complain about not finding your files. Make sure to check that it runs fine with:

 
ruby collect_tweets_control.rb run

So now it's time to start our process:

ruby collect_tweets_control.rb start

You will notice that it created a little .pid file, which indicates that our daemon is up and running. You can also check with:

ps aux | grep collect_tweets.rb

It should show you your process.

Step 2: Script

So our collection process is up and running, and we can check the logfile to see whether things are going well. But something might still happen and the process could die.

That's why I would like to have a cronjob that checks every 10 minutes whether my process is still doing fine.

If you are on Debian, cron should already be installed; otherwise just install it with apt-get.

In Debian the cron package is installed as part of the base system, and will be running by default.

You will find a nice tutorial on cronjob on debian-administration: here

We will first create a little .sh script that checks whether our collection is still in progress. I call it check_collection.sh:

#!/bin/sh
up=`ps aux | grep collect_tweets.rb | grep -v "grep" | wc -l`
if [ $up -eq 0 ]
then
    /usr/local/bin/ruby /home/plotti/twitter/filme/collect_tweets_control.rb start
else
    echo "Collection is running fine at `date`"
fi

Watch out for the backticks around date, they are not normal quotation marks. The script uses the ps command in combination with grep to look for our collection process and counts the matching lines with wc -l, so up is 1 if the process is running and 0 otherwise.

If it is not running we start our daemon again; otherwise we just output that the collection process is doing fine.

You might want to make it executable with chmod +x and try it out by typing:

./check_collection.sh

Step 3: Cronjob

Now that everything is in place, we just need a cronjob entry that starts our little script, which will take care of the respawn. To check whether cron is running:

ps aux | grep cron

If it's not running, on Debian you can start it like this:

/etc/init.d/cron start

Type the following command to edit your crontab:

 crontab -e

Each cronjob has the following syntax:

# +---------------- minute (0 - 59)
# |  +------------- hour (0 - 23)
# |  |  +---------- day of month (1 - 31)
# |  |  |  +------- month (1 - 12)
# |  |  |  |  +---- day of week (0 - 6) (Sunday=0 or 7)
# |  |  |  |  |
  *  *  *  *  *  command to be executed

So our command will look like this:

*/10 * * * * /home/plotti/twitter/check_collection.sh >> /var/log/cron

This is a nice shortcut (instead of writing 0,10,20,30,40,50 * * * *) to get what we want. There is a cool cron generator here.

The last part redirects the output of our script to the /var/log/cron file so we can see that it actually ran. You might want to check that /var/log/cron file from time to time to see if anything went wrong.

Cheers
Thomas

Datamining Twitter Part 3 – Logging

I am a nervous person, so if my collection of tweets is running on the server I would like to log what is going on, so that in case things go down I at least know when it happened.

We will be using Ruby's built-in logging facilities. The logger library is part of the standard library that ships with Ruby, so there is nothing to install. There is a nice comparison of loggers for Ruby here (in German).

Logging stuff in ruby is easy. You simply need this:

require 'rubygems'
require 'logger'

#since we want to write out to a file:
log = Logger.new("collect_tweets.log")

# You can use the different log levels to make your log file more readable and see what is going on.
log.debug("just a debug message") 
log.info("important information") 
log.warn("you better be prepared") 
log.error("now you are in trouble") 
log.fatal("this is the end...")

We will add those two callback methods to our client to log if errors are happening:


@client.on_delete do |status_id, user_id|
 log.error "Tweet deleted"
end

@client.on_limit do |skip_count|
 log.error "Limit exceeded"
end

And we will replace our output to console through the logger:


...
   rescue
    log.fatal "Could not insert tweet. Possibly db lock error"
    #puts "Couldnt insert tweet. Possibly db lock error"
   end
...

Now comes the trickiest part. I would like the program to report to the log file every 10 minutes that it is up and running and doing fine.

# loop collecting tweets
...
    time = Time.now.min
    if time % 10  == 0 && do_once
        log.info "Collection up and running"
        do_once = false
    elsif time  % 10 != 0
        do_once = true
    end
...

What does this do? Every time I insert a tweet I check the time. Every 10 minutes I want to write my status exactly once. Notice that checking time.min % 10 alone would write the logging message during the whole minute in which it is true, which is why we add the little do_once flag; it gets reset in between those 10-minute marks. This should do just fine.

If we look in our log now we see:

I, [2010-05-25T09:10:02.436575 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:20:03.007758 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:30:03.002217 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:40:03.040313 #2040]  INFO -- : Collection up and running

Perfect. Now we can always look into this file and see how things have been. If the process somehow crashed we at least know when it happened.

In the next part I will show you how to use the daemons gem in combination with a cronjob to make sure our process gets restarted if it somehow crashes.

Cheers Thomas

Datamining Twitter: Part 2 Accessing The Gardenhose

So in the first part of the tutorial we set up a sqlite database with sequel. The only thing left to do is to access the twitter stream and save our tweets to the database.

Step 1:

What twitter offers are two sorts of streams:

  • the firehose (a stream that supplies you with all the tweets created on Twitter, which can be up to 50 Mio. a day). This stream is only available to big clients of Twitter like Yahoo, Microsoft or Google. Since this stream gives away all of Twitter's data, I guess it costs quite a bit to get access to it.
  • the gardenhose (a stream that only gives you a small fraction of those tweets, which in most cases is totally enough for us)

We will access the gardenhose, since the firehose is for the big players like Google etc.

Step 2:

Luckily there is a good gem for this that makes our work easy. Michael Bleigh from Intridea has created a gem called Tweetstream, that makes the twitter stream API even easier to use.

gem sources -a http://gems.github.com
gem install intridea-tweetstream

Step 3:

After installing the gem we are ready to rock. Let's create a file alice_stream.rb and start collecting.

require "rubygems"
require "sequel"
require "tweetstream"

#connect to db
DB = Sequel.sqlite("tweets.sqlite")
tweets = DB[:tweets]

@client = TweetStream::Client.new('yourname','yourpassword')

@client.track('alice', 'wonderland') do |status|
  begin
   tweets.insert(
    :text => status.text,
    :username => status.user.screen_name,
    :created_at => status.created_at,
    :lang => status.user.lang,
    :time_zone => status.user.time_zone,
    :guid => status[:id]
    )
   puts "[#{status.user.screen_name}] #{status.text}"
  rescue
   puts "Couldnt insert tweet. Possibly db lock error"
  end
end

After loading rubygems, sequel and tweetstream we connect to the database we created in part one. Notice how simply the database connection takes place: it's only two lines of code and we are done.

After that we initialize the twitter client that will provide us with the stream. I will have to check up on OAuth, since from June 2010 Twitter won't support basic authentication anymore.
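For reference, here is a sketch of what the OAuth setup could look like with a newer version of the tweetstream gem; the configure block is from tweetstream 2.x, not from the intridea-tweetstream version used above, and the credential values are placeholders.

# Hypothetical OAuth setup for tweetstream >= 2.0 (not the gem version used in this post)
TweetStream.configure do |config|
  config.consumer_key       = 'YOUR_CONSUMER_KEY'
  config.consumer_secret    = 'YOUR_CONSUMER_SECRET'
  config.oauth_token        = 'YOUR_ACCESS_TOKEN'
  config.oauth_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
  config.auth_method        = :oauth
end

@client = TweetStream::Client.new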

Once the client is initialized we use the track command to filter the twitter stream for certain keywords. The important thing to know here is that the keywords can only be combined in an OR fashion: we will collect everything that contains alice OR everything that contains wonderland. We will have to filter those tweets later to keep only those that contain both words; see the sketch below.
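A minimal sketch of such a later filter on the stored tweets (the column name matches the schema from part one; everything else is just an illustration):

# Hypothetical post-filter: keep only stored tweets that contain both keywords
alice_tweets = tweets.all.select do |t|
  text = t[:text].downcase
  text.include?("alice") && text.include?("wonderland")
end
puts "#{alice_tweets.size} tweets mention both alice and wonderland"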

I wrapped the database insert in a begin/rescue block since sqlite doesn't give us concurrency: if we are reading from the database later and lock it, our client won't be able to insert tweets and would fail. If you use a mysql database with an engine that supports row locking like innodb, you won't have to deal with this problem. Maybe we will come back to this later.

The insert saves the text of the status message, the username, the created_at date, the language, the timezone and the guid of the tweet, which identifies it and makes it possible to look it up later on twitter.

To see how fast the tweets are coming in, I am simply printing them to the console so I have something to read while waiting.

Step 4.

Done. 🙂 Start collecting with ruby alice_stream.rb and watch the tweets come in. Once you have enough or get bored, quit with CTRL+C.

In the next part of the tutorial I will show you how to analyze those tweets. We will start by plotting them with gnuplot, which is quite fun.

Enjoy
Thomas

Datamining Twitter: Part 1

In this short tutorial you will learn how to collect tweets using ruby and only two gems.

It is part of a series where I will show you what fantastic things you can do with twitter these days, if you love mining data 🙂

The first gem I would like to introduce is sequel. It is a lightweight ORM layer that lets you interface with a couple of databases in ruby without pain. It works great with mysql or sqlite; we will use sqlite today. I have been using mysql in combination with rails and the nice activerecord ORM, but for most tasks it is a bit too bulky. A drawback of sqlite, though, is that it does not handle concurrent access well. But we will bump into that later ...

To get you started, have a visit at http://sequel.rubyforge.org/ and have a look at the examples. They are pretty straightforward. I can also recommend the cheatsheet at: http://sequel.rubyforge.org/rdoc/files/doc/cheat_sheet_rdoc.html

Step 1.

Install the sequel gem and you are ready to go:

sudo gem install sequel

Step 2

Let us set up a little database to hold the tweets. If you are familiar with activerecord, you have probably used migrations before. Sequel works the same way: you write migration files and then simply run them. So here is mine to get you started with a very simple table. It is important to save it as a 01_migration_name.rb file; the number matters, otherwise sequel won't recognize which migration to run first. I saved it as 01_create_table.rb:

class CreateTweetTable < Sequel::Migration

  def up
    create_table :tweets do
      primary_key :id
      String :text
      String :username
      Time :created_at
    end
  end

  def down
    drop_table(:tweets)
  end

end

Step 3

Run the first migration. You will find a great tutorial on migrations on http://steamcode.blogspot.com/2009/03/sequel-migrations.html

sequel -m . -M 1 sqlite://tweets.db

If you are getting a "URI::InvalidURIError: the scheme sqlite does not accept registry part: ..." error, then your database name probably contains some characters it shouldn't. Just try to use only letters and numbers.

So now you should have a sqlite database for the very basic needs of your tweets. But maybe you need a little more information about what you are capturing, so let's write our second migration. In addition to just storing the text and the username, I want to store the guid of the tweet, the timezone and the language used.

class AddLangAndGuid < Sequel::Migration

  def up
    alter_table :tweets do
      add_column :guid, Integer
      add_column :lang, String
      add_column :time_zone, String
    end
  end

  def down
    alter_table :tweets do
      drop_column :guid
      drop_column :lang
      drop_column :time_zone
    end
  end
end

After running

sequel -m . -M 2 sqlite://tweets.db

you have created a nice database that will hold your tweets.

Step 4:

Let's see how it worked. To use sequel in your scripts you have to require rubygems and the sequel gem. What we want to do is connect to the database. Just fire up irb and get started:

require 'rubygems'
require 'sequel'

DB = Sequel.sqlite("tweets.db") # the database we created with the migrations above
tweets = DB[:tweets]

In those few lines you loaded up your database and now have a tweets dataset that holds your data. I think that is really convenient; a few quick things you can try with it are sketched below. In part 2 I will show you how to collect the tweets. Enjoy.
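This is just a quick illustration, with made-up example values and the schema from the migrations above.

# A few things you can do with the tweets dataset right away
tweets.insert(:text => "hello world", :username => "plotti", :created_at => Time.now)
puts tweets.count
tweets.where(:username => "plotti").each { |t| puts t[:text] }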

Cheers
Thomas