Datamining Twitter: Part 2 Accessing The Gardenhose

In the first part of the tutorial we set up a sqlite database with sequel. The only thing left to do is to access the Twitter stream and save our tweets to the database.

Step 1:

Twitter offers two sorts of streams:

  • the firehose (a stream that supplies you with all the tweets created on Twitter, which can be up to 50 million a day). This stream is only available to big Twitter clients like Yahoo, Microsoft or Google. Since storing that stream gives all of Twitter's data away, I guess it costs quite a bit to get access to it.
  • the gardenhose (a stream that only gives you a small sample of those tweets, yet in most cases is totally enough for us)

We will access the gardenhose, since the firehose is reserved for the big players.

Step 2:

Luckily there is a good gem that makes our work easy. Michael Bleigh from Intridea has created a gem called TweetStream that makes the Twitter stream API even easier to use.

gem sources -a http://gems.github.com
gem install intridea-tweetstream

Step 3:

After installing the gem we are ready to rock. Let's create a file alice_stream.rb and start collecting.

require "rubygems"
require "sequel"
require "tweetstream"

# connect to the db created in part one
DB = Sequel.sqlite("tweets.sqlite")
tweets = DB[:tweets]

@client = TweetStream::Client.new('yourname','yourpassword')

# track returns every tweet that contains 'alice' OR 'wonderland'
@client.track('alice', 'wonderland') do |status|
  begin
    # save the tweet; this raises if the database is locked
    tweets.insert(
      :text => status.text,
      :username => status.user.screen_name,
      :created_at => status.created_at,
      :lang => status.user.lang,
      :time_zone => status.user.time_zone,
      :guid => status[:id]
    )
    # print the tweet so we can watch the stream in the console
    puts "[#{status.user.screen_name}] #{status.text}"
  rescue
    puts "Couldn't insert tweet. Possibly db lock error"
  end
end

After loading rubygems, sequel and tweetstream we connect to the database we created in part one. Notice how simple the database connection is. It's only two lines of code and we are done.

After that we initialize the Twitter client that will provide us with the stream. I will have to check up on OAuth, since from June 2010 Twitter won't support basic authentication anymore.

Once the client is initialized we use the track command to filter the Twitter stream for certain keywords. The important thing to know here is that the keywords can only be combined in an OR fashion. So we will collect everything that contains alice OR wonderland. We will have to filter those tweets later to keep only the ones that contain "alice in wonderland".
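The later filtering step could look roughly like the sketch below. It is only an illustration: the sample texts and the helper name mentions_phrase? are made up, and a plain case-insensitive substring match is assumed to be good enough.

```ruby
# Hypothetical sample texts standing in for the :text column of the tweets table.
texts = [
  "Just saw Alice in Wonderland, loved it",
  "Alice is coming over for dinner",
  "Winter wonderland outside today",
  "ALICE IN WONDERLAND tickets are sold out"
]

# Hypothetical helper: true if the text contains the full phrase, ignoring case.
def mentions_phrase?(text, phrase = "alice in wonderland")
  text.downcase.include?(phrase)
end

# Keep only the tweets that contain both keywords as one phrase.
matches = texts.select { |text| mentions_phrase?(text) }
puts matches
```

The track command would have delivered all four sample texts, since each contains at least one of the keywords. The filter drops the ones that only mention one of them.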

I wrapped the database insert in a begin rescue block because sqlite doesn't handle concurrent access well. If we later read from the database while the client is running, the read locks the database, and the client's insert fails. If you use a mysql database with an engine that supports row locking, like innodb, you won't have to deal with this problem. Maybe we will come back to this later.
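Instead of dropping a tweet on the first lock error, one option is to retry the insert a few times. The sketch below is not part of the tutorial's script: with_retries is a hypothetical helper, and the failing block only simulates a locked database. With sequel you would presumably pass Sequel::DatabaseError as the error class, but that is an assumption you should check against your setup.

```ruby
# Hypothetical helper: run the block and retry it when one of the given
# error classes is raised, giving up after `attempts` tries.
def with_retries(attempts: 3, wait: 0.1, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield
  rescue *on
    raise if tries >= attempts  # out of tries, let the error through
    sleep(wait)                 # give the other process time to release the lock
    retry
  end
end

# Stand-in for a database insert that fails twice before it succeeds.
calls = 0
result = with_retries(attempts: 5, wait: 0.01, on: [RuntimeError]) do
  calls += 1
  raise "database is locked" if calls < 3
  :inserted
end

puts "result: #{result}, attempts needed: #{calls}"
```

In alice_stream.rb the block would contain the tweets.insert call, and the outer rescue would still catch the error if all attempts fail.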

The insert saves the text of the status message, the username, the created_at date, the user's language and time zone, and the guid of the tweet. The guid identifies the tweet and lets us look it up later on Twitter.
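As a small illustration of that lookup, a stored row can be turned back into a link to the tweet. This is a sketch under assumptions: tweet_url is a hypothetical helper, the row values are made up, and the username/status/id URL pattern is the one Twitter commonly uses for permalinks, which may change.

```ruby
# Hypothetical helper: build a permalink from the stored username and guid.
# The URL pattern is an assumption about how Twitter addresses single tweets.
def tweet_url(username, guid)
  "https://twitter.com/#{username}/status/#{guid}"
end

# Made-up row, shaped like the hash we insert into the tweets table.
row = { :username => "alice_fan", :guid => 12345 }

url = tweet_url(row[:username], row[:guid])
puts url
```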

To see how fast the tweets are coming in, I just print them to the console so there is something to read while waiting.

Step 4:

Done. 🙂 Start collecting by running ruby alice_stream.rb and watch those tweets come in. Once you have enough or get bored, quit with CTRL+C.

In the next part of the tutorial I will show you how to analyze those tweets. We will start by plotting them with gnuplot, which is quite fun.



About plotti2k1

Thomas Plotkowiak works at the MCM Institute in the Social Media and Mobile Communication group, which belongs to the University of St. Gallen. His PhD research in Social Media examines how the structure of social networks like Facebook and Twitter influences the diffusion of information. His main focus of work is Twitter, since it allows public access (and has a nice API). Make sure to also have a look at his recent publications. Thomas majored in Computer Science and Economics at the University of Mannheim in 2008 and was involved at the computer science institutes for software development and multimedia technology: SWT and PI4. During his studies he focused on Artificial Intelligence, Multimedia Technology, Logistics and Business Informatics. In his diploma/master thesis he developed an adhoc p2p audio engine for 3D games. Thomas was also a researcher for a year at the University of Waterloo in Canada and at Macquarie University in Sydney. He was part of the CSIRO ICT researcher group. In his free time Thomas likes to swim in his local lake (drei weiher), run, and hike in the Appenzell region. Otherwise you will find him coding ideas he recently had or enjoying a beer with colleagues in the MeetingPoint or Schwarzer Engel.

