Datamining Twitter Part 5 – Collecting multiple keywords

We are doing quite fine now: we can store the stream and make sure that the collection is running smoothly, and it restarts in case something happens. So everything is in place, except that I want to monitor multiple things at once and filter my tweets before storing them.

So let's get started. We will only make a few modifications to the collect_tweets.rb file.

Step 1

I like the YAML format for storing anything that should be human readable. I know there are JSON and XML, but they are just not as much fun to read.

Let's suppose we want to see which movies are doing well and which aren't, so we set up our collection to monitor six movies at the same time and store the tweets.

So we will create a small config file config.yaml that will hold the information we need:

ateam:
  db: ateam.sqlite
  keywords:
    - a-team

macgruber:
  db: macgruber.sqlite
  keywords:
    - macgruber

marmaduke:
  db: marmaduke.sqlite
  keywords:
    - marmaduke

princepersia:
  db: princepersia.sqlite
  keywords:
    - prince
    - persia

robinhood:
  db: robinhood.sqlite
  keywords:
    - robin
    - hood

shrek:
  db: shrek.sqlite
  keywords:
    - shrek

The file holds the db parameter for each movie (although we could also use tables instead) and the keywords we want to monitor. For some movies like Robin Hood we want to look for two keywords: robin AND hood. For others like Shrek one is fine.

Step 2

Now that we have the file, let's read it in.

require 'yaml'

path = File.dirname(File.expand_path(__FILE__))
config = YAML.load_file(path + "/" + "config.yaml")

Wasn't that easy? I mean, how much more convenient can it get :). Our config parameters are now stored in the config hash. Let's use this hash to configure our application.

To get the keywords we can do:

keywords = config.values.collect {|k| k["keywords"]}.flatten

To have more convenient access to the tweet databases we could do:

tweet_tables = {}
config.values.each do |k|
  tweet_tables[k["db"]] = Sequel.sqlite(path + "/" + k["db"])[:tweets]
end

So we have all the connectors to the databases and can get going.
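If you want to convince yourself what these two snippets produce, here is a minimal, self-contained check. It uses an inline two-movie config instead of the config.yaml file (the movie names are just examples) and needs nothing beyond the yaml standard library:

```ruby
# A tiny self-check of the two snippets above, using an inline config
# instead of the config.yaml file.
require 'yaml'

config = YAML.load(<<YAML)
robinhood:
  db: robinhood.sqlite
  keywords:
    - robin
    - hood
shrek:
  db: shrek.sqlite
  keywords:
    - shrek
YAML

# Same expression as in the collection script: gather all keywords.
keywords = config.values.collect { |k| k["keywords"] }.flatten
puts keywords.sort.inspect
```

Printing the sorted list gives ["hood", "robin", "shrek"], exactly the flat keyword list we will hand to the stream client.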

Step 3

The only thing we need to change now is the collection process. When we encounter our keywords, I would like to store the tweets in the appropriate databases.

So our client gets a new starting line, which makes sure it collects all the keywords.

@client.track(keywords.join(",")) do |status|

The problem is that all of those keywords are connected with an OR. Actually that's a good thing, otherwise we wouldn't be able to track multiple things at once. But it means that in the inner loop we have to dispatch those tweets and store them appropriately.

selected = ""
config.values.each do |k|
  if k["keywords"].all? {|str| status.text.downcase.include? str}
    selected = k["db"]
  end
end
if selected == ""
  puts red("[Not all keywords found] ") + status.text
else
  # ... insert the tweet into tweet_tables[selected] here ...
  puts "[" + green(selected) + "]" + "[#{status.user.screen_name}] #{status.text}"
end

I've left out the uninteresting stuff, but that's all you need to store the tweets in the databases. So what is happening here?

  • First I check if all the keywords are contained in the tweet. Notice how nicely the all? enumerator helps me out here: it doesn't matter whether a movie has one keyword or ten.
  • Secondly, depending on the keywords, I select the database.
  • And last, in case the tweet did not match all of the keywords of any movie, I print a little line saying that I didn't find all of the keywords; otherwise I store the tweet in the appropriate database.
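To see the matching rule in isolation, here is a sketch of the dispatch step using a plain hash instead of the YAML file and a string instead of a real status object (the movie entries are just examples):

```ruby
# Sketch of the dispatch step: pick the database whose keywords ALL appear
# in the tweet text; an empty string means no movie matched completely.
config = {
  "robinhood" => { "db" => "robinhood.sqlite", "keywords" => ["robin", "hood"] },
  "shrek"     => { "db" => "shrek.sqlite",     "keywords" => ["shrek"] }
}

def dispatch(config, text)
  selected = ""
  config.values.each do |k|
    selected = k["db"] if k["keywords"].all? { |str| text.downcase.include? str }
  end
  selected
end

puts dispatch(config, "Robin Hood was great tonight")   # robinhood.sqlite
puts dispatch(config, "just watched shrek")             # shrek.sqlite
puts dispatch(config, "some random tweet")              # prints an empty line
```

Note how a tweet that only contains "robin" but not "hood" would fall through to the empty string, which is exactly the case we report with the red line.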

You might ask what those funny green and red methods do. It's a little trick I learned on Dmytro's blog: two nice helper methods that color the output in your console. I think it makes supervising the process much more fun.

So in case you want to use them too, here they are:

def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

def red(text); colorize(text, 31); end
def green(text); colorize(text, 32); end
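If you want to try the trick right away, here is a self-contained version you can paste into irb, with the colorize body filled in (assuming the standard ANSI escape codes 31 for red and 32 for green):

```ruby
# Wrap text in ANSI escape sequences; \e[0m resets the color afterwards.
def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

def red(text); colorize(text, 31); end
def green(text); colorize(text, 32); end

puts red("[error]") + " " + green("[ok]")
```

In a terminal that honors ANSI codes, the first word prints in red and the second in green.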

So we are pretty much done. We have a nice config file that contains all the information, and we have our collection process that collects the tweets and puts them into the right databases. Make sure to create those databases before you start the collection process, otherwise it might complain.

Have fun with your ideas and drop me a line if you have a question.


Datamining Twitter Part 4 – Daemons and Cron

Although we covered in part 3 that we can use screen to run our collection in the background and detach from it safely, that approach has some minor drawbacks.

  • To start the process I have to go through a manual setup routine: starting screen, executing the collection, and then detaching from it.
  • If my process somehow dies in screen, either through a buffer overflow or because I wasn't prepared for all eventualities and the process disconnected from the source, my data collection will be corrupted.

Step 1: Daemons

To compensate for those things I will show you a setup that uses the daemons gem and a bit of cron magic to make sure our process is running and collecting tweets. To install the daemons gem just write:

gem install daemons

We will need to create an additional file that will serve as our control program to start and stop the collection. I will call it collect_tweets_control.rb:

require 'rubygems'
require 'daemons'

path = File.dirname(File.expand_path(__FILE__))
Daemons.run(path + "/" + "collect_tweets.rb")

We can use it like this:

  ruby collect_tweets_control.rb start
      (collect_tweets.rb is now running in the background)
  ruby collect_tweets_control.rb restart
  ruby collect_tweets_control.rb stop

I think it is quite cool :).

For the first time we will test it by running collect_tweets.rb in the foreground:

  ruby collect_tweets_control.rb run

If you are using files in collect_tweets.rb, make sure you reference them with their full path.

path = File.dirname(File.expand_path(__FILE__))
#log = Logger.new('collect_tweets.log')
log = Logger.new(path + "/" + 'collect_tweets.log')
# This also applies for your sqlite database
tweets = Sequel.sqlite(path + "/" + "tweets.sqlite")[:tweets]

Otherwise the daemon will complain about not finding your files. Make sure to check if it is running fine by running:

ruby collect_tweets_control.rb run

So now it's time to start our process:

ruby collect_tweets_control.rb start

You will notice that it created a little .pid file that indicates our daemon is up and running. You can also check by:

ps aux | grep collect_tweets.rb

It should show you your process.

Step 2: Script

So our collection process is up and running, and we can check the logfile to see if things are going well. But something might still happen and our process might die.

That's why I would like to have a cron job that checks every 10 minutes whether my process is still doing fine.

If you are on Debian it should come with cron automatically, or you can just install it with apt-get.

In Debian the cron package is installed as part of the base system, and will be running by default.

You will find a nice tutorial on cron on debian-administration: here

We will first create a little .sh script that will check if our collection is still in progress. I call it check_collection.sh:

#!/bin/sh
up=`ps aux | grep collect_tweets.rb | grep -v "grep" | wc -l`
if [ $up -eq 0 ]
then
    /usr/local/bin/ruby /home/plotti/twitter/filme/collect_tweets_control.rb start
else
    echo "Collection is running fine at `date`"
fi

Watch out for those backticks around date. The script uses the ps command in combination with grep to look for our collection process; wc -l counts the matching lines, so we get a 1 if the process is running and a 0 otherwise.

If it is not running we will start our daemon again, otherwise we just output that the collection process is doing fine.

You might want to make it runnable with chmod and try it out by typing:

chmod +x check_collection.sh
./check_collection.sh

Step 3: Cronjob

Now that everything is in place, we just need a crontab entry that starts our little script, which will take care of a respawn. To check if cron is running:

ps aux | grep cron

If it's not running on Debian you can start it like this:

/etc/init.d/cron start

Type the following command to edit your crontab:

 crontab -e

Each cron entry has the following syntax:

# +---------------- minute (0 - 59)
# |  +------------- hour (0 - 23)
# |  |  +---------- day of month (1 - 31)
# |  |  |  +------- month (1 - 12)
# |  |  |  |  +---- day of week (0 - 6) (Sunday=0 or 7)
# |  |  |  |  |
  *  *  *  *  *  command to be executed

So our command will look like this:

*/10 * * * * /home/plotti/twitter/check_collection.sh >> /var/log/cron

Which is a nice shortcut (instead of writing 0,10,20,30,40,50 * * * *) for getting what we want. There is a cool cron generator here.

The last part redirects the output of our script to the /var/log/cron file so we can see that it actually ran. You might want to check that file to see if anything went wrong.


Datamining Twitter Part 3 – Logging

I am a nervous person, so if my collection of tweets is running on the server I would like to log what is going on, so that in case things go down I at least know when it happened.

We will be using the logger library. The logger gem is part of the standard library that Ruby comes with, so there is nothing to install. There is a nice comparison of loggers for Ruby here (in German).

Logging stuff in ruby is easy. You simply need this:

require 'rubygems'
require 'logger'

#since we want to write out to a file:
log = Logger.new("collect_tweets.log")

#You can use all of those different level errors to make your file more readable and see what is going on. 
log.debug("just a debug message") 
log.info("important information") 
log.warn("you better be prepared") 
log.error("now you are in trouble") 
log.fatal("this is the end...")

We will add those two callback methods to our client to log if errors are happening:

@client.on_delete do |status_id, user_id|
  log.error "Tweet deleted"
end

@client.on_limit do |skip_count|
  log.error "Limit exceeded"
end

And we will replace our output to console through the logger:

    log.fatal "Could not insert tweet. Possibly db lock error"
    #puts "Couldnt insert tweet. Possibly db lock error"

Now comes the trickiest part. I would like the program to report to a log file every 10 minutes that it is up and running and doing fine.

# inside the loop collecting tweets
time = Time.now.min
if time % 10 == 0 && do_once
  log.info "Collection up and running"
  do_once = false
elsif time % 10 != 0
  do_once = true
end

What does this do? Every time I insert a tweet I check the time, and every 10 minutes I want to write my status exactly once. Notice that time.min % 10 alone would write the logging message during the entire minute in which it is zero. That's why we made a little do_once flag: it gets reset in between those 10-minute marks. This should do just fine.
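You can convince yourself that the flag works by simulating the minutes of an hour. This little sketch counts how many log entries would be written between minute 0 and minute 21, treating each loop pass as one "tweet inserted" tick:

```ruby
# Simulate the do_once flag over minutes 0..21: we expect exactly one
# log entry at each of the minutes 0, 10 and 20, not one per insert.
log_count = 0
do_once = true
(0..21).each do |minute|
  if minute % 10 == 0 && do_once
    log_count += 1       # stand-in for log.info "Collection up and running"
    do_once = false
  elsif minute % 10 != 0
    do_once = true       # re-arm the flag between the 10-minute marks
  end
end
puts log_count   # 3
```

Three entries for three 10-minute boundaries, even though the loop ran 22 times.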

If we look in our log now we see:

I, [2010-05-25T09:10:02.436575 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:20:03.007758 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:30:03.002217 #2040]  INFO -- : Collection up and running
I, [2010-05-25T09:40:03.040313 #2040]  INFO -- : Collection up and running

Perfect. Now we can always look into this file and see how things have been. If the process somehow crashed we at least know when it happened.

In the next part I will show you how to use the daemons gem in combination with cron to make sure our process gets restarted if it somehow crashes.

Cheers Thomas

How to use linux screen

If you are using Linux you have probably stumbled across screen. It is a great tool to run processes that take a long time and detach them from the console.

In our example we are collecting tweets. Since this process can go on for a while, or actually forever unless we stop it, we will use screen to start our collecting program and then let it continue for some days. Normally closing your console also closes your program unless you daemonize it. Screen is an easy alternative to that.

Step 1.

If you are using debian just type

apt-get install screen 

and you are ready to go

Step 2:

To start using screen simply type screen and you will be greeted with a welcome screen. Now you are inside screen and everything looks the same, apart from the fact that you can do a few cool tricks:


The following are some of the most used shortcuts that let you navigate through your screen environment. Note that unless modified by your .screenrc, by default every screen shortcut is preceded by Ctrl+a. Also note that these shortcuts are case-sensitive.

  • 0 through 9 – Switches between windows
  • Ctrl+n – Switches to the next available window
  • Backspace – Switches to the previous available window
  • Ctrl+a – Switches back to the last window you were on
  • A – Changes window session name
  • K – Kills a window session
  • c – Creates a new window
  • [ – Then use arrows to scroll up and down terminal

Find out about more shortcuts in Screen’s man pages. In your terminal, run: man screen.

You will find great tutorials on the web on screen like this or this. They will explain screen in much more detail to you.

Step 3:

What we will do is start the collecting process and then detach the window. This is pretty easy:


ruby collect_tweets.rb 

to start collecting the tweets, you will see them popping up in your console.

Now to detach the window simply press

ctrl+ a + d

and you are back in your console. Notice that screen is still running your tweet collection.

To go back to screen and see what it is doing simply type:

screen -ls

It will provide you with the instances of screen that you are running now. You should find one entry. To connect to that session simply type:

screen -r plus the pid number, for example:

screen -r 27699

And voilà, you are back in, watching your ruby program collect those tweets.

Sweet isn’t it?

Step 4:

Extra goodies:

You can have multiple consoles in screen. To create a new one simply press

ctrl + a + c

and you have a new console inside screen. To cycle between those consoles press

ctrl + a + n 

to flip to the next window.

Now you are able to start those collecting processes and safely disconnect from your box while it is still performing the collection process.

In the next tutorial I will show you how to monitor what is going on from the browser.


Datamining Twitter: Part 2 Accessing The Gardenhose

So in the first part of the tutorial we set up a sqlite database with sequel. The only thing left to do is to access the twitter stream and save our tweets to the database.

Step 1:

What twitter offers are two sorts of streams:

  • the firehose (a stream that supplies you with all the tweets created on Twitter, which can be up to 50 million a day). This stream is only available to big clients of Twitter like Yahoo, Microsoft or Google. Since storing that stream gives all of Twitter's data away, I guess it costs quite a bit to get access to it.
  • the gardenhose (a stream that only gives you a tiny fraction of those tweets, yet in most cases is totally enough for us)

We will access the gardenhose, since the firehose is for the big players like Google etc.

Step 2:

Luckily there is a good gem that makes our work easy. Michael Bleigh from Intridea has created a gem called Tweetstream that makes the Twitter streaming API even easier to use.

gem sources -a http://gems.github.com
gem install intridea-tweetstream

Step 3:

After installing the gem we are ready to rock. Let's create a file alice_stream.rb and start collecting.

require "rubygems"
require "sequel"
require "tweetstream"

#connect to db
DB = Sequel.sqlite("tweets.sqlite")
tweets = DB[:tweets]

@client = TweetStream::Client.new('yourname','yourpassword')

@client.track('alice', 'wonderland') do |status|
  begin
    tweets.insert(
      :text => status.text,
      :username => status.user.screen_name,
      :created_at => status.created_at,
      :lang => status.user.lang,
      :time_zone => status.user.time_zone,
      :guid => status[:id]
    )
    puts "[#{status.user.screen_name}] #{status.text}"
  rescue
    puts "Couldnt insert tweet. Possibly db lock error"
  end
end

So after loading rubygems, sequel and tweetstream, we connect to the database we created in part one. Notice how simple the database connection is: only two lines of code and we are done.

After that we initialize the twitter client that will provide us with the stream. I will have to check up on OAuth, since from June 2010 Twitter won't support basic authentication anymore.

Once the client is initialized we use the track command to filter the twitter stream for certain keywords. The important thing to know here is that the keywords can only be combined in an OR fashion. So we will collect everything that contains alice OR everything that contains wonderland, and we will have to filter those tweets later to only keep those that contained alice in wonderland.
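That later filtering step can be as simple as keeping only the tweets that contain all keywords. A sketch with hard-coded example strings standing in for collected tweets:

```ruby
# Post-filter an OR-collected batch down to tweets containing BOTH keywords.
collected = [
  "alice in wonderland was great",
  "alice cooper live tonight",
  "walking in a winter wonderland"
]

both = collected.select do |text|
  ["alice", "wonderland"].all? { |kw| text.include?(kw) }
end

puts both.length   # 1
```

Only the first string survives; the other two matched the OR-stream on a single keyword and get discarded.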

I wrapped the database insert in a begin/rescue block since sqlite doesn't allow concurrency: if we are later reading from the database and locking it, our client won't be able to insert those tweets and would fail. If you use a mysql database with an engine that supports row locking, like innodb, you won't have this problem. Maybe we will come back to this later.
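The begin/rescue pattern itself is easy to see in isolation. In this sketch a deliberately raised error stands in for a failing insert on a locked database:

```ruby
# If the insert raises (e.g. "database is locked"), we report the error
# instead of letting the whole stream loop crash.
def try_insert(should_fail)
  begin
    raise "database is locked" if should_fail
    "inserted"
  rescue
    "Couldnt insert tweet. Possibly db lock error"
  end
end

puts try_insert(false)   # inserted
puts try_insert(true)    # Couldnt insert tweet. Possibly db lock error
```

The rescue swallows the exception for that one tweet and the collection keeps running.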

The insert saves the text of the status message, the username, the created_at date, the timezone, and the guid of the tweet, which identifies it and enables us to look it up later on twitter.

To see how fast the tweets are coming in, I am also printing them to the console to have something to read while waiting.

Step 4.

Done. 🙂 Start collecting with ruby alice_stream.rb and watch those tweets come in. Once you have enough and are bored, quit with CTRL+C.

In the next part of the tutorial I will show you how to analyze those tweets. We will start by plotting them with gnuplot, which is quite fun.


Datamining Twitter: Part 1

In this short tutorial you will learn how to collect tweets using ruby and only two gems.

It is part of a series where I will show you what fantastic things you can do with twitter these days, if you love mining data 🙂

The first gem I would like to introduce is sequel. It is a lightweight ORM layer that allows you to interface with a couple of databases in ruby without pain. It works great with mysql or sqlite; we will use sqlite today. I have been using mysql in combination with rails and the nice activerecord ORM, but for most tasks it is a bit too bulky. The problem with sqlite, though, can be that it does not provide multitasking capabilities. But we will bump into that later…

To get you started, visit http://sequel.rubyforge.org/ and have a look at the examples. They are pretty straightforward. I can also recommend the cheatsheet at: http://sequel.rubyforge.org/rdoc/files/doc/cheat_sheet_rdoc.html

Step 1.

Install the sequel gem and you are ready to go.

sudo gem install sequel

Step 2

Let us set up a little database to hold the tweets. If you are familiar with activerecord, you have probably used migrations before; sequel works the same way. You write migration files and then simply run them. So here is mine to get you started with a very easy table. It's important to save it as a 01_migration_name.rb file: the number matters, otherwise sequel won't recognize which migration to run first. I saved it as 01_create_table.rb

class CreateTweetTable < Sequel::Migration

  def up
    create_table :tweets do
      primary_key :id
      String :text
      String :username
      Time :created_at
    end
  end

  def down
    drop_table :tweets
  end
end

Step 3

Run the first migration. You will find a great tutorial on migrations on http://steamcode.blogspot.com/2009/03/sequel-migrations.html

sequel -m . -M 1 sqlite://tweets.db

If you are getting a “URI::InvalidURIError: the scheme sqlite does not accept registry part: …” then your database name probably contains some characters it shouldn't. Just try to use only letters and numbers.

So now you should have a sqlite database for the very basic needs of your tweets. But maybe you need a little bit more information about what you are capturing, so let's write our second migration. In addition to just storing the text and the username, I want to store the guid of the tweet, the timezone and the language used.

class AddLangAndGuid < Sequel::Migration

  def up
    alter_table :tweets do
      add_column :guid, Integer
      add_column :lang, String
      add_column :time_zone, String
    end
  end

  def down
    alter_table :tweets do
      drop_column :guid
      drop_column :lang
      drop_column :time_zone
    end
  end
end
After running

sequel -m . -M 2 sqlite://tweets.db

you have created a nice database that will hold your tweets.

Step 4:

Let's see how it worked. To use sequel in your scripts you have to require rubygems and the sequel gem. What we want to do is connect to the database. Just fire up irb and let's get started:

require 'rubygems'
require 'sequel'

DB = Sequel.sqlite("tweets.db")
tweets = DB[:tweets]

In those few lines you loaded up your database and now have a tweets collection that holds your data. I think that is really convenient. In part 2 I will show you how to collect them. Enjoy.