
Datamining Twitter Part 4 – Daemons and Cron


In part 3 we saw that we can use screen to run our collection in the background and detach from it safely, but this approach has some minor drawbacks.

  • To start the process I have to go through a manual setup routine: start screen, execute the collection, and then detach from it.
  • If my process dies inside screen, whether from a buffer overflow or because I was not prepared for every eventuality and the process somehow disconnected from the source, my data collection will be corrupted.

Step 1: Daemons

To compensate for these drawbacks I will show you a setup that makes it easier to ensure our process is running and collecting tweets: the daemons gem and a bit of cron magic. To install the daemons gem, just write:

gem install daemons

We need to create an additional file that serves as our control program and starts and stops the collection. I will call it collect_tweets_control.rb:

require 'rubygems'
require 'daemons'

Daemons.run("collect_tweets.rb")

We can use it like this:

ruby collect_tweets_control.rb start
    (collect_tweets.rb is now running in the background)
ruby collect_tweets_control.rb restart
    (...)
ruby collect_tweets_control.rb stop

I think it is quite cool :).

The first time, we will test it by running collect_tweets.rb without forking into the background:

  ruby collect_tweets_control.rb run

If collect_tweets.rb reads or writes files, make sure to reference them by their full path, since a daemonized process typically no longer runs in the directory you started it from.

path = File.dirname(File.expand_path(__FILE__))
#log = Logger.new('collect_tweets.log')
log = Logger.new(path + "/" + 'collect_tweets.log')
# This also applies for your sqlite database
tweets = Sequel.sqlite(path + "/" + "tweets.sqlite")[:tweets]

Otherwise the daemon will complain about not finding your files. Check that everything runs fine with:

 
ruby collect_tweets_control.rb run
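
To see why the full path matters, here is a minimal sketch in plain Ruby (no gems). It assumes the daemonized process ends up with a different working directory than the one you started it from; the helper name data_path is made up for this illustration and is not part of the original scripts.

```ruby
# Hypothetical helper: builds an absolute path next to a given script
# file, independent of the current working directory.
def data_path(script_file, name)
  File.join(File.dirname(File.expand_path(script_file)), name)
end

# Resolve the script location once, while the working directory is
# still the directory we were started from.
script = File.expand_path(__FILE__)
before = data_path(script, 'collect_tweets.log')

Dir.chdir('/') do
  # A relative name is now resolved against "/" ...
  relative = File.expand_path('collect_tweets.log')
  # ... while the anchored path stays the same.
  after = data_path(script, 'collect_tweets.log')
  puts "relative: #{relative}"
  puts "anchored: #{after} (unchanged: #{after == before})"
end
```

The same idea applies to the log file and the sqlite database: anchor both to the script's directory rather than to wherever the process happens to be running.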

So now it's time to start our process with:

ruby collect_tweets_control.rb start

You will notice that it created a little .pid file that indicates our daemon is up and running. You can also check with:

ps aux | grep collect_tweets.rb

It should show you your process.
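
The .pid file can also be checked from Ruby. The following is only a sketch: the file name collect_tweets.rb.pid is an assumption (use whatever .pid file appears next to your script), and the helper names are made up. It relies on Process.kill with signal 0, which tests whether a process exists without actually signalling it.

```ruby
# Returns true if a process with the given pid exists.
def process_alive?(pid)
  Process.kill(0, pid)   # signal 0: existence check only
  true
rescue Errno::ESRCH
  false                  # no such process
rescue Errno::EPERM
  true                   # process exists but belongs to another user
end

# Reads a pid from a pid file; returns nil if the file is missing
# or does not contain a positive number.
def read_pid(pid_file)
  return nil unless File.exist?(pid_file)
  pid = File.read(pid_file).strip.to_i
  pid > 0 ? pid : nil
end

# Assumed file name; adjust it to your own setup.
pid_file = File.join(Dir.pwd, 'collect_tweets.rb.pid')
pid = read_pid(pid_file)
if pid && process_alive?(pid)
  puts "daemon running with pid #{pid}"
else
  puts "daemon not running"
end
```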

Step 2: Script

So our collection process is up and running. We can check the logfile to see if things are going well. Still, something might go wrong at any time and kill our process.

That's why I would like to have a cronjob that checks every 10 minutes whether my process is still running.

If you are on Debian, cron should already be installed; if not, just install it with apt-get.

In Debian the cron package is installed as part of the base system, and will be running by default.

You will find a nice tutorial on cron on debian-administration: here

We will first create a little .sh script that checks whether our collection is still in progress. I call it check_collection.sh

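#!/bin/sh
# Count running collector processes, excluding the grep command itself.
up=`ps aux | grep collect_tweets.rb | grep -v "grep" | wc -l`
if [ $up -eq 0 ]
then
    # Not running: start the daemon again.
    /usr/local/bin/ruby /home/plotti/twitter/filme/collect_tweets_control.rb start
else
    echo "Collection is running fine at `date`"
fi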

Watch out for the backticks (not ordinary single quotes) around date; they insert the output of the date command. The script uses the ps command in combination with grep to look for our collection process, and wc -l counts the matching lines: 0 if the process is not running, 1 or more if it is.

If it is not running, we start our daemon again; otherwise we just output that the collection process is doing fine.

You might want to make it executable with chmod +x and try it out by typing:

./check_collection.sh
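
The counting step of the check script can be sketched in Ruby to show why the grep -v "grep" filter is needed: the grep command itself shows up in the process list and would otherwise be counted. The function name and the sample process list below are made up for illustration.

```ruby
# Counts the lines of `ps aux`-style output that mention the script,
# ignoring any line that belongs to a grep command.
def count_collectors(ps_output, script_name = 'collect_tweets.rb')
  ps_output.each_line.count do |line|
    line.include?(script_name) && !line.include?('grep')
  end
end

# Made-up sample output, for illustration only.
sample = <<~PS
  plotti  4242  0.3  1.2  ruby collect_tweets.rb
  plotti  4250  0.0  0.0  grep collect_tweets.rb
  plotti  4100  0.0  0.1  -bash
PS

puts count_collectors(sample)   # prints 1
```

Without the filter the count would never drop to 0 while the check itself is running, so the daemon would never be restarted.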

Step 3: Cronjob

Now that everything is in place, we just need a crontab entry that runs our little script, which takes care of the respawn. To check if cron is running:

ps aux | grep cron

If it's not running on Debian, you can start it like this:

/etc/init.d/cron start

Type the following command to edit your crontab:

 crontab -e

Each cronjob has the following syntax:

# +---------------- minute (0 - 59)
# |  +------------- hour (0 - 23)
# |  |  +---------- day of month (1 - 31)
# |  |  |  +------- month (1 - 12)
# |  |  |  |  +---- day of week (0 - 6) (Sunday=0 or 7)
# |  |  |  |  |
  *  *  *  *  *  command to be executed

So our command will look like this:

*/10 * * * * /home/plotti/twitter/check_collection.sh >> /var/log/cron

The */10 is a nice shortcut for writing 0,10,20,30,40,50 * * * * and gets us what we want. There is a cool cron generator here.
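
To make the step syntax concrete, here is a small Ruby sketch that expands a single cron field into the values it matches. It is deliberately simplified: real cron also accepts ranges such as 1-5 and names, which this ignores, and the function name is made up.

```ruby
# Expands a cron field of the form "*", "*/n", or a comma-separated
# list into the matching values between min and max.
def expand_cron_field(field, min, max)
  case field
  when '*'
    (min..max).to_a
  when %r{\A\*/(\d+)\z}
    (min..max).step(Regexp.last_match(1).to_i).to_a
  else
    field.split(',').map { |value| Integer(value) }
  end
end

p expand_cron_field('*/10', 0, 59)   # prints [0, 10, 20, 30, 40, 50]
```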

The last part appends the output of our script to the /var/log/cron file so we can see that it actually ran (your user needs write permission for that file). You might want to check /var/log/cron to see if anything went wrong.

Cheers
Thomas


About plotti2k1

Thomas Plotkowiak works at the MCM Institute in the Social Media and Mobile Communication group, which belongs to the University of St. Gallen. His PhD research in social media examines how the structure of social networks like Facebook and Twitter influences the diffusion of information. His main focus of work is Twitter, since it allows public access (and has a nice API). Make sure to also have a look at his recent publications. Thomas majored in Computer Science and Economics at the University of Mannheim in 2008 and was involved at the computer science institutes for software development and multimedia technology: SWT and PI4. During his studies he focused on Artificial Intelligence, Multimedia Technology, Logistics and Business Informatics. In his diploma/master thesis he developed an ad hoc p2p audio engine for 3D games. Thomas was also a researcher for a year at the University of Waterloo in Canada and at Macquarie University in Sydney. He was part of the CSIRO ICT researcher group. In his free time Thomas likes to swim in his houselake (drei weiher), run, and hike in the Appenzell region. Otherwise you will find him coding ideas he recently had or enjoying a beer with colleagues in the MeetingPoint or Schwarzer Engel.
