
Datamining Twitter Part 5 – Collecting multiple keywords


We are doing quite well now: we can store the stream and make sure that the collection runs smoothly, and it restarts if something goes wrong. So everything is in place, apart from the fact that I want to monitor multiple things at once and filter my tweets before storing them.

So let’s get started. We only need a few modifications to the collect_tweets.rb file.

Step 1

I like the YAML format for storing things that should be human-readable. I know there are JSON and XML, but they are just not fun to read.

Let’s suppose we want to see which movies are doing well and which aren’t, so we set up our collection to monitor six movies at the same time and store the tweets.

So we will create a small config file config.yaml that will hold the information we need:

movie1:
  db: ateam.sqlite
  keywords: a-team

movie2:
  db: macgruber.sqlite
  keywords: macgruber

movie3:
  db: marmaduke.sqlite
  keywords: marmaduke

movie4:
  db: princepersia.sqlite
  keywords:
    - prince
    - persia

movie5:
  db: robinhood.sqlite
  keywords:
    - robin
    - hood

movie6:
  db: shrek.sqlite
  keywords: shrek

The file holds the database file for each movie (although we could also use tables instead) and the keywords we want to monitor. For some movies, like Robin Hood, we want to look for two keywords: robin AND hood. For others, like Shrek, one is fine.
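One detail worth seeing in action: a single keyword comes out of YAML as a plain string, while a list comes out as an array. The following is only a small sketch to illustrate that; the two-movie excerpt and the keyword_list helper are my own additions for illustration and not part of collect_tweets.rb.

```ruby
require "yaml"

# Hypothetical two-movie excerpt of config.yaml, inlined so the sketch runs on its own.
SAMPLE_CONFIG = <<~YAML
  movie1:
    db: shrek.sqlite
    keywords: shrek
  movie2:
    db: robinhood.sqlite
    keywords:
      - robin
      - hood
YAML

# Hypothetical helper: returns the keywords of one entry as an array,
# whether the YAML held a single value or a list.
def keyword_list(entry)
  Array(entry["keywords"])
end

config = YAML.load(SAMPLE_CONFIG)
config.each do |name, entry|
  puts "#{name}: #{entry["keywords"].class} -> #{keyword_list(entry).inspect}"
end
```

Wrapping the value with Array() means later code can treat every movie the same way, no matter how its keywords were written in the file.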

Step 2

Now that we have the file, let’s read it in.

require "yaml"
path = File.dirname(File.expand_path(__FILE__))
config = YAML.load_file(path + "/" + "config.yaml")

Wasn’t that easy? I mean, how much more convenient can it get :). Our config parameters are now stored in the config hash. Let’s use this hash to configure our application.

To get the keywords we can do:

keywords = config.values.collect {|k| k["keywords"]}.flatten
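To see what that line produces, here is a small sketch on a hand-written hash. The MOVIES_SAMPLE hash is a hypothetical stand-in for what YAML.load_file returns, and all_keywords simply wraps the line above so it can be called on it.

```ruby
# Hypothetical stand-in for the hash that YAML.load_file returns.
MOVIES_SAMPLE = {
  "movie1" => { "db" => "shrek.sqlite", "keywords" => "shrek" },
  "movie2" => { "db" => "robinhood.sqlite", "keywords" => ["robin", "hood"] }
}

# Same expression as above, wrapped in a method for the demo.
def all_keywords(config)
  config.values.collect { |k| k["keywords"] }.flatten
end

keywords = all_keywords(MOVIES_SAMPLE)
p keywords            # ["shrek", "robin", "hood"]
p keywords.join(",")  # "shrek,robin,hood"
```

flatten is what turns the mix of single strings and nested lists into one flat list of keywords.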

To have more convenient access to the tweet databases we could do:

tweet_tables = {}
config.values.each do |k|
  tweet_tables[k["db"]] = Sequel.sqlite(path + "/" + k["db"])[:tweets]
end

So we have all the connectors to the databases and can get going.

Step 3

The only thing we need to change now is the collection process. Whenever a tweet contains our keywords, I would like to store it in the appropriate database.

So our client gets a new starting line, which makes sure it collects all the keywords.

@client.track(keywords.join(",")) do |status|

The problem is that all of those keywords are connected with an OR. Actually it’s a good thing, otherwise we wouldn’t be able to track multiple things at once. So inside the block we have to make sure that we dispatch those tweets and store them appropriately.

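  selected = ""
  config.values.each do |k|
    # Array() wraps a single keyword so that all? works for one keyword or many
    if Array(k["keywords"]).all? { |str| status.text.downcase.include?(str) }
      selected = k["db"]
    end
  end
  if selected == ""
    puts red("[Not all keywords found]") + status.text
  else
    # tweet is the hash built from status in the part left out here
    tweet_tables[selected].insert(tweet)
    puts "[" + green(selected) + "]" + "[#{status.user.screen_name}] #{status.text}"
  end
end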

I’ve left out the uninteresting stuff, but that’s all you need to store the tweets in the databases. So what is happening here?

  • First I check whether all the keywords of a movie are contained in the tweet. Notice how nicely the all? method helps me out here: it doesn’t matter whether a movie has one keyword or ten.
  • Secondly, depending on the keywords, I select the database.
  • And last, if the tweet did not match all of the keywords of any movie, I print a little line saying so; otherwise I store the tweet in the appropriate database.

You might ask what those funny green and red methods do. It’s a little trick I learned on Dmytro’s blog. You can have two nice helper methods that color the output in your console. I think it makes supervising the process much more fun.

So in case you want to use them too, here they are:

def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

def red(text); colorize(text, 31); end
def green(text); colorize(text, 32); end

So we are pretty much done. We have a nice config file that contains all the information, and we have our collection process that collects the tweets and puts them into the right databases. Make sure to create those databases before you start the collection process, otherwise it might complain.
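A quick way to catch a forgotten database before starting is to check that every file named in the config exists. This is only a rough sketch: missing_databases and DB_CONFIG are hypothetical names, and the check only looks for the files, not for the tweets table inside them.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for the parsed config.yaml.
DB_CONFIG = {
  "movie1" => { "db" => "shrek.sqlite", "keywords" => "shrek" },
  "movie2" => { "db" => "robinhood.sqlite", "keywords" => ["robin", "hood"] }
}

# Returns the db file names from the config that do not exist under path.
def missing_databases(config, path)
  config.values.map { |k| k["db"] }.reject { |db| File.exist?(File.join(path, db)) }
end

# Demo in a temporary directory where only one of the two files exists.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, "shrek.sqlite"))
  p missing_databases(DB_CONFIG, dir)  # ["robinhood.sqlite"]
end
```

In the real script you would pass the path variable from Step 2 and stop with a message if the returned list is not empty.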

Have fun with your ideas and drop me a line if you have a question.

Cheers
Thomas


About plotti2k1

Thomas Plotkowiak works at the MCM Institute in the Social Media and Mobile Communication group, which belongs to the University of St. Gallen. His PhD research in social media examines how the structure of social networks like Facebook and Twitter influences the diffusion of information. His main focus of work is Twitter, since it allows public access (and has a nice API). Make sure to also have a look at his recent publications. Thomas majored in Computer Science and Economics at the University of Mannheim in 2008 and was involved at the computer science institutes for software development and multimedia technology: SWT and PI4. During his studies he focused on Artificial Intelligence, Multimedia Technology, Logistics and Business Informatics. In his diploma/master thesis he developed an ad hoc p2p audio engine for 3D games. Thomas was also a researcher for a year at the University of Waterloo in Canada and at Macquarie University in Sydney. He was part of the CSIRO ICT research group. In his free time Thomas likes to swim in his local lake (Drei Weiher), run, and enjoy hiking in the Appenzell region. Otherwise you will find him coding ideas he recently had or enjoying a beer with colleagues in the MeetingPoint or Schwarzer Engel.
