We all know the problem: although there seems to be an “interest based” community out there, we honestly don’t know how to start tapping into these communities.
A first way to see what is out there is to look at Twitter directories such as wefollow.com or twellow.com.
What you might notice is that when sorting these tags by the number of people in them, we get a nice folksonomy of Twitter listing all the interests people have, from most popular to least popular. This page, by the way, has 25 more pages with fewer and fewer people. You might also notice that, from a broad point of view, we have a couple of double entries, for example “tech” and “technology”, “blogger” and “blog”, or “tv” and “television”, which basically mean the same thing. Since I want to study these communities, I wanted to create a list of all similar items in this list. So my first step was to scrape this website and save the results to a nice CSV containing the name and the number of people listed in each category.
require 'rubygems'
require 'scrapi'
require 'csv'
require 'cgi'

BASE_URL = "http://wefollow.com/twitter/"
PAGES = 25
search_words = ARGV

# Each result row on wefollow contains a link with the account name
scraper = Scraper.define do
  array :items
  process "#results>div", :items => Scraper.define {
    process "div.result_row>div.result_details>p>strong>a", :name => :text
  }
  result :items
end

outfile = File.open("../data/#{ARGV[0]}.csv", 'w')
CSV::Writer.generate(outfile) do |csv|
  search_words.each do |word|
    (1..PAGES).each do |page|
      uri = if page == 1
        URI.parse(BASE_URL + word + "/followers")
      else
        URI.parse(BASE_URL + word + "/page#{page}" + "/followers")
      end
      puts uri
      begin
        scraper.scrape(uri).each do |entry|
          name = entry.name
          puts name
          csv << [name]
        end
      rescue
        puts "Couldn't find any page for #{uri}"
      end
    end
  end
end
outfile.close
Without going into much detail, I used the Ruby library scrapi. Since their pages have a nice format, e.g. http://wefollow.com/twitter/socialmedia/page2/followers, it is easy to go through all the pages, get the tag, and then extract the tag name and the number of members. I have posted above the code to extract members for each of the communities, but the idea is the same :). You will end up with a list of tags and members like this:
Although this list is great, we want to get rid of the problem of many different names for the same community.
Examples are “tech” and “technology”, or “blogger” and “blog”.
We are only interested in the biggest categories for now and want to make a list that contains the name of the tag with the highest number of members, followed by the other similar tags and their respective member counts.
So we will create a number of rules according to which we want to build the final set:
A word counts as similar when:
- both words start with the same two letters, and
- the Levenshtein distance between them is greater than 0 and smaller than 4 (only checked for words longer than 6 characters), or
- one word is contained in the other.
Writing a program that does this based on the CSV list we generated is not so hard in Ruby. We read in the list of words twice, go through the outer list, and compare each word with the inner list. If a word meets our criteria for a double entry, we add it to the collection of already processed words and also save it as a hash in the list of similar words. At the end we sort this list of similar words by member size and then put the similar words, sorted by membership, on each line:
require '../config/environment'
require 'faster_csv'
require 'text'

in_list = FasterCSV.read("results/all_groups_without_mine.csv")
master = in_list.select { |a| a[0].length > 2 }.sort { |a, b| a[0].length <=> b[0].length }
slave  = in_list.select { |a| a[0].length > 2 }.sort { |a, b| a[0].length <=> b[0].length }

outfile = File.open("results/testrun.csv", 'wb')
CSV::Writer.generate(outfile) do |csv|
  processed_words = []
  collection = []

  master.each do |master_word|
    # Skip words that have already been part of the duplicate finding process
    next if processed_words.any? { |word| master_word[0].include? word }

    similar_words = []
    similar_words << { :word => master_word[0], :members => master_word[1].to_i, :distance => 0 }

    slave.each do |slave_word|
      similar = false
      included = false
      # They have to start with the same two letters
      if master_word[0].chars.to_a[0].downcase == slave_word[0].chars.to_a[0].downcase &&
         master_word[0].chars.to_a[1].downcase == slave_word[0].chars.to_a[1].downcase
        # A Levenshtein distance lower than 4, for long words (> 6 characters) only
        distance = Text::Levenshtein.distance(master_word[0], slave_word[0])
        if (distance > 0 && distance < 4) && master_word[0].length > 6 && slave_word[0].length > 6
          similar = true
        end
        # One word is included in the other
        if master_word[0].length != slave_word[0].length
          if (master_word[0].include? slave_word[0]) or (slave_word[0].include? master_word[0])
            included = true
          end
        end
        if similar or included
          similar_words << { :word => slave_word[0], :members => slave_word[1].to_i, :distance => distance }
          processed_words << slave_word[0].downcase
        end
      end
    end

    collection << similar_words
  end

  # CSV output: sort groups by their biggest community, and each group by member count
  collection.sort { |a, b| b.collect { |item| item[:members] }.max <=> a.collect { |item| item[:members] }.max }.each do |row|
    output = []
    row.sort { |a, b| b[:members] <=> a[:members] }.each do |word|
      output << word[:word]
      output << word[:members]
      output << word[:distance]
    end
    csv << output
  end
end
outfile.close
As you might have noticed, there is one neat little thing in this code: the computation of the Levenshtein distance. This distance is small for words that are similar to each other, like “journalist” and “journalism”, and lets us detect such pairs. But a small distance alone is not enough: Text::Levenshtein.distance("artist", "autism") is only 2, even though the words have nothing to do with each other. That is why the code additionally requires the first letters to match and only applies the distance check to longer words.
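For readers curious what the distance actually measures, here is a minimal pure-Ruby version of the classic dynamic-programming recurrence (the text gem’s Text::Levenshtein computes the same thing, so this is just for illustration):

```ruby
# Minimal dynamic-programming Levenshtein distance: the number of
# single-character insertions, deletions, and substitutions needed
# to turn one word into the other.
def levenshtein(a, b)
  rows = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| rows[i][0] = i }
  (0..b.length).each { |j| rows[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [rows[i - 1][j] + 1,        # deletion
                    rows[i][j - 1] + 1,        # insertion
                    rows[i - 1][j - 1] + cost  # substitution
                   ].min
    end
  end
  rows[a.length][b.length]
end

puts levenshtein("journalist", "journalism") # => 1
puts levenshtein("artist", "autism")         # => 2
```

As you can see, one substitution at the end of a long word gives a distance of 1, which is exactly the kind of near-duplicate we want to catch.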
Another way around this might be to use Porter stemming to reduce inflected words to their common stem. An example of stemming could be the following:
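The idea can be sketched with a crude suffix stripper. Note that this toy version is an assumption-laden simplification: the real Porter algorithm applies several ordered rule steps with conditions on the remaining stem, and the text gem we already use for the Levenshtein distance ships a full implementation as Text::PorterStemming.

```ruby
# Toy illustration of suffix stripping in the spirit of Porter stemming.
# (NOT the real Porter algorithm, which has several ordered rule steps;
# see Text::PorterStemming in the `text` gem for a full implementation.)
SUFFIXES = %w[ational ization ation ing ers er ed s].freeze

def crude_stem(word)
  SUFFIXES.each do |suffix|
    # Only strip when enough of the word remains to be a plausible stem
    if word.length > suffix.length + 2 && word.end_with?(suffix)
      return word[0...-suffix.length]
    end
  end
  word
end

puts crude_stem("marketing") # => "market"
puts crude_stem("marketers") # => "market"
puts crude_stem("blog")      # => "blog"
```

With stemming, “marketing” and “marketers” collapse onto the same stem “market”, so they would land in the same community without any distance computation at all.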
P.S. I’ve updated the code and the output a bit. The most significant changes I made are that the words now have to have their first two letters in common, and the distance is only computed for long words (more than 6 characters) with a slightly higher tolerance (distance below 4). Like this I am able to capture those nice misspellings.
Even if we make quite a few errors, which we will have to correct by hand, the output is quite usable. This table gives an overview of the output, showing the word, the number of members, and the Levenshtein distance to the word it started with:
As you can see we got some nice results like:
What we basically see is that when there are combinations of words, like wine and winelover, the community often becomes smaller but more specific, as in the examples of film and filmfestival or green and greenbusiness. Well, now it’s time to collect the 100 most listed members from each community and see how the first 100 “big” general communities are linked with each other. I will try to cover this in the next blog post.
Cheers Thomas