Datamining Twitter Part 5 – Collecting multiple keywords

We are doing quite fine now: we can store the stream and make sure that the collection is running smoothly, and it restarts in case something happens. So everything is in place, except that I want to monitor multiple things at once and filter my tweets before storing them.

So let's get started. We will only make a few modifications to the collect_tweets.rb file.

Step 1

I like the YAML format for storing anything that should be human readable. I know there are JSON and XML, but they are just not as much fun to read.

Let's suppose we want to see which movies are doing well and which aren't, so we set up our collection to monitor six movies at the same time and store the tweets.

So we will create a small config file config.yaml that will hold the information we need:

ateam:
  db: ateam.sqlite
  keywords:
    - a-team

macgruber:
  db: macgruber.sqlite
  keywords:
    - macgruber

marmaduke:
  db: marmaduke.sqlite
  keywords:
    - marmaduke

princepersia:
  db: princepersia.sqlite
  keywords:
    - prince
    - persia

robinhood:
  db: robinhood.sqlite
  keywords:
    - robin
    - hood

shrek:
  db: shrek.sqlite
  keywords:
    - shrek

The file holds the db parameter for each movie (although we could also use tables instead) and the keywords we want to monitor. For some movies like Robin Hood we want to look for two keywords: robin AND hood. For others like Shrek one is fine.

Step 2

Now that we have the file, let's read it in.

require 'yaml'

path = File.dirname(File.expand_path(__FILE__))
config = YAML.load_file(path + "/" + "config.yaml")

Wasn't that easy? I mean, how much more convenient can it get :). Our config parameters are now stored in the config hash. Let's use this hash to configure our application.

To get the keywords we can do:

keywords = config.values.collect {|k| k["keywords"]}.flatten

To have more convenient access to the tweet databases we could do:

tweet_tables = {}
config.values.each do |k|
  tweet_tables[k["db"]] = Sequel.sqlite(path + "/" + k["db"])[:tweets]
end

So we have all the connectors to the databases and can get going.
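If you want to convince yourself what these two snippets produce, here is a minimal, self-contained check. It uses an inline two-movie config instead of the config.yaml file (the movie names are just examples) and needs nothing beyond the yaml standard library:

```ruby
# A tiny self-check of the two snippets above, using an inline config
# instead of the config.yaml file.
require 'yaml'

config = YAML.load(<<YAML)
robinhood:
  db: robinhood.sqlite
  keywords:
    - robin
    - hood
shrek:
  db: shrek.sqlite
  keywords:
    - shrek
YAML

# Same expression as in the collection script: gather all keywords.
keywords = config.values.collect { |k| k["keywords"] }.flatten
puts keywords.sort.inspect
```

Printing the sorted list gives ["hood", "robin", "shrek"], exactly the flat keyword list we will hand to the stream client.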

Step 3

The only thing we need to change now is the collection process. When we encounter our keywords, I would like to store the tweets in the appropriate databases.

So our client gets a new starting line, which makes sure it collects all the keywords.

@client.track(keywords.join(",")) do |status|

The problem is that all of those keywords are connected with an OR. Actually that's a good thing, otherwise we wouldn't be able to track multiple things at once. But it means that in the inner loop we have to dispatch those tweets and store them appropriately.

selected = ""
config.values.each do |k|
  if k["keywords"].all? {|str| status.text.downcase.include? str}
    selected = k["db"]
  end
end
if selected == ""
  puts red("[Not all keywords found] ") + status.text
else
  # ... insert the tweet into tweet_tables[selected] here ...
  puts "[" + green(selected) + "]" + "[#{status.user.screen_name}] #{status.text}"
end

I've left out the uninteresting stuff, but that's all you need to store the tweets in the databases. So what is happening here?

  • First I check if all the keywords are contained in the tweet. Notice how nicely the all? enumerator helps me out here: it doesn't matter whether a movie has one keyword or ten.
  • Secondly, depending on the keywords, I select the database.
  • And last, in case the tweet did not match all of the keywords of any movie, I print a little line saying that I didn't find all of the keywords; otherwise I store the tweet in the appropriate database.
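To see the matching rule in isolation, here is a sketch of the dispatch step using a plain hash instead of the YAML file and a string instead of a real status object (the movie entries are just examples):

```ruby
# Sketch of the dispatch step: pick the database whose keywords ALL appear
# in the tweet text; an empty string means no movie matched completely.
config = {
  "robinhood" => { "db" => "robinhood.sqlite", "keywords" => ["robin", "hood"] },
  "shrek"     => { "db" => "shrek.sqlite",     "keywords" => ["shrek"] }
}

def dispatch(config, text)
  selected = ""
  config.values.each do |k|
    selected = k["db"] if k["keywords"].all? { |str| text.downcase.include? str }
  end
  selected
end

puts dispatch(config, "Robin Hood was great tonight")   # robinhood.sqlite
puts dispatch(config, "just watched shrek")             # shrek.sqlite
puts dispatch(config, "some random tweet")              # prints an empty line
```

Note how a tweet that only contains "robin" but not "hood" would fall through to the empty string, which is exactly the case we report with the red line.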

You might ask what those funny green and red methods do. It's a little trick I learned on Dmytro's blog: two nice helper methods that color the output in your console. I think it makes supervising the process much more fun.

So in case you want to use them too, here they are:

def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

def red(text); colorize(text, 31); end
def green(text); colorize(text, 32); end
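If you want to try the trick right away, here is a self-contained version you can paste into irb, with the colorize body filled in (assuming the standard ANSI escape codes 31 for red and 32 for green):

```ruby
# Wrap text in ANSI escape sequences; \e[0m resets the color afterwards.
def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

def red(text); colorize(text, 31); end
def green(text); colorize(text, 32); end

puts red("[error]") + " " + green("[ok]")
```

In a terminal that honors ANSI codes, the first word prints in red and the second in green.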

So we are pretty much done. We have a nice config file that contains all the information, and we have our collection process that collects the tweets and puts them into the right databases. Make sure to create those databases before you start the collection process, otherwise it might complain.

Have fun with your ideas and drop me a line if you have a question.


Datamining Twitter Part 4 – Daemons and Cron

Although we covered in part 3 that we can use screen to run our collection in the background and detach from it safely, that approach has some minor drawbacks.

  • To start the process I have to go through a manual setup routine: starting screen, executing the collection, and then detaching from it.
  • If my process somehow dies in screen, either through a buffer overflow or because I wasn't prepared for all eventualities and the process disconnected from the source, my data collection will be corrupted.

Step 1: Daemons

To compensate for those things I will show you a setup that uses the daemons gem and a bit of cron magic to make sure our process is running and collecting tweets. To install the daemons gem just write:

gem install daemons

We will need to create an additional file that will serve as our control program to start and stop the collection. I will call it collect_tweets_control.rb:

require 'rubygems'
require 'daemons'

path = File.dirname(File.expand_path(__FILE__))
Daemons.run(path + "/" + "collect_tweets.rb")

We can use it like this:

  ruby collect_tweets_control.rb start
      (collect_tweets.rb is now running in the background)
  ruby collect_tweets_control.rb restart
  ruby collect_tweets_control.rb stop

I think it is quite cool :).

For the first time we will test it by running collect_tweets.rb in the foreground:

  ruby collect_tweets_control.rb run

If you are using files in collect_tweets.rb, make sure you reference them with their full path.

path = File.dirname(File.expand_path(__FILE__))
#log = Logger.new('collect_tweets.log')
log = Logger.new(path + "/" + 'collect_tweets.log')
# This also applies for your sqlite database
tweets = Sequel.sqlite(path + "/" + "tweets.sqlite")[:tweets]

Otherwise the daemon will complain about not finding your files. Make sure to check if it is running fine by running:

ruby collect_tweets_control.rb run

So now it's time to start our process:

ruby collect_tweets_control.rb start

You will notice that it created a little .pid file that indicates our daemon is up and running. You can also check by:

ps aux | grep collect_tweets.rb

It should show you your process.

Step 2: Script

So our collection process is up and running, and we can check the logfile to see if things are going well. But something might still happen and our process might die.

That's why I would like to have a cron job that checks every 10 minutes whether my process is still doing fine.

If you are on Debian it should come with cron automatically, or you can just install it with apt-get.

In Debian the cron package is installed as part of the base system, and will be running by default.

You will find a nice tutorial on cron on debian-administration: here

We will first create a little .sh script that will check if our collection is still in progress. I call it check_collection.sh:

#!/bin/sh
up=`ps aux | grep collect_tweets.rb | grep -v "grep" | wc -l`
if [ $up -eq 0 ]
then
    /usr/local/bin/ruby /home/plotti/twitter/filme/collect_tweets_control.rb start
else
    echo "Collection is running fine at `date`"
fi

Watch out for those backticks around date. The script uses the ps command in combination with grep to look for our collection process; wc -l counts the matching lines, so we get a 1 if the process is running and a 0 otherwise.

If it is not running we will start our daemon again, otherwise we just output that the collection process is doing fine.

You might want to make it runnable with chmod and try it out by typing:

chmod +x check_collection.sh
./check_collection.sh

Step 3: Cronjob

Now that everything is in place, we just need a crontab entry that starts our little script, which will take care of a respawn. To check if cron is running:

ps aux | grep cron

If it's not running on Debian you can start it like this:

/etc/init.d/cron start

Type the following command to edit your crontab:

 crontab -e

Each cron entry has the following syntax:

# +---------------- minute (0 - 59)
# |  +------------- hour (0 - 23)
# |  |  +---------- day of month (1 - 31)
# |  |  |  +------- month (1 - 12)
# |  |  |  |  +---- day of week (0 - 6) (Sunday=0 or 7)
# |  |  |  |  |
  *  *  *  *  *  command to be executed

So our command will look like this:

*/10 * * * * /home/plotti/twitter/check_collection.sh >> /var/log/cron

Which is a nice shortcut (instead of writing 0,10,20,30,40,50 * * * *) for getting what we want. There is a cool cron generator here.

The last part redirects the output of our script to the /var/log/cron file so we can see that it actually ran. You might want to check that file to see if anything went wrong.


Datamining Twitter Part 3 – Logging

I am a nervous person, so if my collection of tweets is running on the server I would like to log what is going on, so that in case things go down I at least know when it happened.

We will be using the logger library. The logger gem is part of the standard library that Ruby comes with, so there is nothing to install. There is a nice comparison of loggers for Ruby here (in German).

Logging stuff in ruby is easy. You simply need this:

require 'rubygems'
require 'logger'

#since we want to write out to a file:
log = Logger.new("collect_tweets.log")

#You can use all of those different level errors to make your file more readable and see what is going on. 
log.debug("just a debug message") 
log.info("important information") 
log.warn("you better be prepared") 
log.error("now you are in trouble") 
log.fatal("this is the end...")

We will add those two callback methods to our client to log if errors are happening:

@client.on_delete do |status_id, user_id|
  log.error "Tweet deleted"
end

@client.on_limit do |skip_count|
  log.error "Limit exceeded"
end

And we will replace our output to console through the logger:

    log.fatal "Could not insert tweet. Possibly db lock error"
    #puts "Couldnt insert tweet. Possibly db lock error"

Now comes the trickiest part. I would like the program to report to a log file every 10 minutes that it is up and running and doing fine.

# inside the loop collecting tweets
time = Time.now.min
if time % 10 == 0 && do_once
  log.info "Collection up and running"
  do_once = false
elsif time % 10 != 0
  do_once = true
end

What does this do? Every time I insert a tweet I check the time, and every 10 minutes I want to write my status exactly once. Notice that time.min % 10 alone would write the logging message during the entire minute in which it is zero. That's why we made a little do_once flag: it gets reset in between those 10-minute marks. This should do just fine.
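You can convince yourself that the flag works by simulating the minutes of an hour. This little sketch counts how many log entries would be written between minute 0 and minute 21, treating each loop pass as one "tweet inserted" tick:

```ruby
# Simulate the do_once flag over minutes 0..21: we expect exactly one
# log entry at each of the minutes 0, 10 and 20, not one per insert.
log_count = 0
do_once = true
(0..21).each do |minute|
  if minute % 10 == 0 && do_once
    log_count += 1       # stand-in for log.info "Collection up and running"
    do_once = false
  elsif minute % 10 != 0
    do_once = true       # re-arm the flag between the 10-minute marks
  end
end
puts log_count   # 3
```

Three entries for three 10-minute boundaries, even though the loop ran 22 times.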

If we look in our log now we see:

I, [2010-05-25T09:10:02.436575 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:20:03.007758 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:30:03.002217 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:40:03.040313 #2040]  INFO -- : Collection up and running

Perfect. Now we can always look into this file and see how things have been. If the process somehow crashed we at least know when it happened.

In the next part I will show you how to use the daemons gem in combination with cron to make sure our process gets restarted if it somehow crashes.

Cheers Thomas

How to use linux screen

If you are using Linux you have probably stumbled across screen. It is a great tool to run processes that take a long time and detach them from the console.

In our example we are collecting tweets. Since this process can go on for a while, or actually forever unless we stop it, we will use screen to start our collecting program and then let it continue for some days. Normally closing your console also closes your program unless you daemonize it. Screen is an easy alternative to that.

Step 1.

If you are using debian just type

apt-get install screen 

and you are ready to go

Step 2:

To start using screen simply type screen and you will be greeted with a welcome screen. Now you are inside screen and everything looks the same, apart from the fact that you can do a few cool tricks:


The following are some of the most used shortcuts that let you navigate through your screen environment. Note that unless modified by your .screenrc, by default every screen shortcut is preceded by Ctrl+a. Also note that these shortcuts are case-sensitive.

  • 0 through 9 – Switches between windows
  • Ctrl+n – Switches to the next available window
  • Backspace – Switches to the previous available window
  • Ctrl+a – Switches back to the last window you were on
  • A – Changes window session name
  • K – Kills a window session
  • c – Creates a new window
  • [ – Then use arrows to scroll up and down terminal

Find out about more shortcuts in Screen’s man pages. In your terminal, run: man screen.

You will find great tutorials on the web on screen like this or this. They will explain screen in much more detail to you.

Step 3:

What we will do is start the collecting process and then detach the window. This is pretty easy:


ruby collect_tweets.rb 

to start collecting the tweets, you will see them popping up in your console.

Now to detach the window simply press

ctrl+ a + d

and you are back in your console. Notice that screen is still running your tweet collection.

To go back to screen and see what it is doing simply type:

screen -ls

It will provide you with the instances of screen that you are running now. You should find one entry. To connect to that session simply type:

screen -r plus the pid number, for example:

screen -r 27699

And voilà, you are back in, watching your ruby program collect those tweets.

Sweet isn’t it?

Step 4:

Extra goodies:

You can have multiple consoles in screen. To create a new one simply press

ctrl + a + c

and you have a new console inside screen. To cycle between those consoles press

ctrl + a + n 

to flip to the next window.

Now you are able to start those collecting processes and safely disconnect from your box while it is still performing the collection process.

In the next tutorial I will show you how to monitor what is going on from the browser.


Datamining Twitter: Part 2 Accessing The Gardenhose

So in the first part of the tutorial we set up a sqlite database with sequel. The only thing left to do is to access the twitter stream and save our tweets to the database.

Step 1:

What twitter offers are two sorts of streams:

  • the firehose (a stream that supplies you with all the tweets created on Twitter, which can be up to 50 million a day). This stream is only available to big clients of Twitter like Yahoo, Microsoft or Google. Since storing that stream gives all of Twitter's data away, I guess it costs quite a bit to get access to it.
  • the gardenhose (a stream that only gives you a tiny fraction of those tweets, yet in most cases is totally enough for us)

We will access the gardenhose, since the firehose is for the big players like Google etc.

Step 2:

Luckily there is a good gem that makes our work easy. Michael Bleigh from Intridea has created a gem called Tweetstream that makes the Twitter streaming API even easier to use.

gem sources -a http://gems.github.com
gem install intridea-tweetstream

Step 3:

After installing the gem we are ready to rock. Let's create a file alice_stream.rb and start collecting.

require "rubygems"
require "sequel"
require "tweetstream"

#connect to db
DB = Sequel.sqlite("tweets.sqlite")
tweets = DB[:tweets]

@client = TweetStream::Client.new('yourname','yourpassword')

@client.track('alice', 'wonderland') do |status|
  begin
    tweets.insert(
      :text => status.text,
      :username => status.user.screen_name,
      :created_at => status.created_at,
      :lang => status.user.lang,
      :time_zone => status.user.time_zone,
      :guid => status[:id]
    )
    puts "[#{status.user.screen_name}] #{status.text}"
  rescue
    puts "Couldnt insert tweet. Possibly db lock error"
  end
end

So after loading rubygems, sequel and tweetstream, we connect to the database we created in part one. Notice how simple the database connection is: only two lines of code and we are done.

After that we initialize the twitter client that will provide us with the stream. I will have to check up on OAuth, since from June 2010 Twitter won't support basic authentication anymore.

Once the client is initialized we use the track command to filter the twitter stream for certain keywords. The important thing to know here is that the keywords can only be combined in an OR fashion. So we will collect everything that contains alice OR everything that contains wonderland, and we will have to filter those tweets later to only keep those that contained alice in wonderland.
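That later filtering step can be as simple as keeping only the tweets that contain all keywords. A sketch with hard-coded example strings standing in for collected tweets:

```ruby
# Post-filter an OR-collected batch down to tweets containing BOTH keywords.
collected = [
  "alice in wonderland was great",
  "alice cooper live tonight",
  "walking in a winter wonderland"
]

both = collected.select do |text|
  ["alice", "wonderland"].all? { |kw| text.include?(kw) }
end

puts both.length   # 1
```

Only the first string survives; the other two matched the OR-stream on a single keyword and get discarded.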

I wrapped the database insert in a begin/rescue block since sqlite doesn't allow concurrency: if we are later reading from the database and locking it, our client won't be able to insert those tweets and would fail. If you use a mysql database with an engine that supports row locking, like innodb, you won't have this problem. Maybe we will come back to this later.
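The begin/rescue pattern itself is easy to see in isolation. In this sketch a deliberately raised error stands in for a failing insert on a locked database:

```ruby
# If the insert raises (e.g. "database is locked"), we report the error
# instead of letting the whole stream loop crash.
def try_insert(should_fail)
  begin
    raise "database is locked" if should_fail
    "inserted"
  rescue
    "Couldnt insert tweet. Possibly db lock error"
  end
end

puts try_insert(false)   # inserted
puts try_insert(true)    # Couldnt insert tweet. Possibly db lock error
```

The rescue swallows the exception for that one tweet and the collection keeps running.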

The insert saves the text of the status message, the username, the created_at date, the timezone, and the guid of the tweet, which identifies it and enables us to look it up later on twitter.

To see how fast the tweets are coming in, I am also printing them to the console to have something to read while waiting.

Step 4.

Done. 🙂 Start collecting with ruby alice_stream.rb and watch those tweets come in. Once you have enough and are bored, quit with CTRL+C.

In the next part of the tutorial I will show you how to analyze those tweets. We will start by plotting them with gnuplot, which is quite fun.


Datamining Twitter: Part 1

In this short tutorial you will learn how to collect tweets using ruby and only two gems.

It is part of a series where I will show you what fantastic things you can do with twitter these days, if you love mining data 🙂

The first gem I would like to introduce is sequel. It is a lightweight ORM layer that allows you to interface with a couple of databases in ruby without pain. It works great with mysql or sqlite; we will use sqlite today. I have been using mysql in combination with rails and the nice activerecord ORM, but for most tasks it is a bit too bulky. The problem with sqlite, though, can be that it does not provide multitasking capabilities. But we will bump into that later…

To get you started, visit http://sequel.rubyforge.org/ and have a look at the examples. They are pretty straightforward. I can also recommend the cheatsheet at: http://sequel.rubyforge.org/rdoc/files/doc/cheat_sheet_rdoc.html

Step 1.

Install the sequel gem and you are ready to go.

sudo gem install sequel

Step 2

Let us set up a little database to hold the tweets. If you are familiar with activerecord, you have probably used migrations before; sequel works the same way. You write migration files and then simply run them. So here is mine to get you started with a very easy table. It's important to save it as a 01_migration_name.rb file: the number matters, otherwise sequel won't recognize which migration to run first. I saved it as 01_create_table.rb

class CreateTweetTable < Sequel::Migration

  def up
    create_table :tweets do
      primary_key :id
      String :text
      String :username
      Time :created_at
    end
  end

  def down
    drop_table :tweets
  end
end

Step 3

Run the first migration. You will find a great tutorial on migrations on http://steamcode.blogspot.com/2009/03/sequel-migrations.html

sequel -m . -M 1 sqlite://tweets.db

If you are getting a “URI::InvalidURIError: the scheme sqlite does not accept registry part: …” then your database name probably contains some characters it shouldn't. Just try to use only letters and numbers.

So now you should have a sqlite database for the very basic needs of your tweets. But maybe you need a little bit more information about what you are capturing, so let's write our second migration. In addition to just storing the text and the username, I want to store the guid of the tweet, the timezone and the language used.

class AddLangAndGuid < Sequel::Migration

  def up
    alter_table :tweets do
      add_column :guid, Integer
      add_column :lang, String
      add_column :time_zone, String
    end
  end

  def down
    alter_table :tweets do
      drop_column :guid
      drop_column :lang
      drop_column :time_zone
    end
  end
end
After running

sequel -m . -M 2 sqlite://tweets.db

you have created a nice database that will hold your tweets.

Step 4:

Let's see how it worked. To use sequel in your scripts you have to require rubygems and the sequel gem. What we want to do is connect to the database. Just fire up irb and let's get started:

require 'rubygems'
require 'sequel'

DB = Sequel.sqlite("tweets.db")
tweets = DB[:tweets]

In those few lines you loaded up your database and now have a tweets collection that holds your data. I think that is really convenient. In part 2 I will show you how to collect them. Enjoy.