How to make sense out of Twitter tags

We all know the problem: although there seem to be “interest-based” communities out there, we honestly don’t know how to start tapping into them.
A first step is to see what is out there by looking at Twitter directories such as wefollow.com or twellow.com.

What you might notice is that when we sort these tags by the number of people listed under them, we get a nice folksonomy of Twitter, listing all the interests that people have from most popular to least popular. This page, by the way, has 25 more pages with fewer and fewer people. You might also notice that, broadly speaking, there are a couple of double entries, for example “tech” and “technology”, “blogger” and “blog”, or “tv” and “television”, which mean basically the same thing. Since I want to study these communities, I wanted to create a list of all similar items in this list. So my first step was to scrape the website and save the results to a CSV file containing the name of each tag and the number of people listed in that category.

Scraping Wefollow

class CollectTwitterAccounts < Struct.new(:text)
  require 'rubygems'
  require 'scrapi'
  require 'csv'
  require 'cgi'
  require 'uri'

  BASE_URL = "http://wefollow.com/twitter/"
  # The tag to scrape (e.g. "socialmedia") is passed as the first command-line argument
  search_words = [ARGV[0]]

  scraper = Scraper.define do
    array :items
    #div+div>div.person-box
    process "#results>div", :items => Scraper.define {
      process "div.result_row>div.result_details>p>strong>a", :name => :text
    }
    result :items
  end

  PAGES = 25

  outfile = File.open("../data/#{ARGV[0]}.csv", 'w')

  CSV::Writer.generate(outfile) do |csv|
     #csv << ["Twitter User", "Language"]
  end

  CSV::Writer.generate(outfile) do |csv|
    search_words.each do |word|
      for page in 1..PAGES
        if page == 1
          uri = URI.parse(BASE_URL + word + "/followers")
        else
          uri = URI.parse(BASE_URL + word + "/page#{page}" + "/followers")
        end
        puts uri
        begin
          scraper.scrape(uri).each do |entry|
            # The scraper exposes the matched link text as "name"
            name = entry.name
            puts name
            #name = result_string.gsub(/'(.*?)'/).first.gsub(/'/, "")
            #csv << [result_string, word.to_s]
            csv << [name]
          end
        rescue
          puts "Couldnt find any page for #{uri}"
        end
      end
    end
  end

  outfile.close
end

Without going into much detail: I used the Ruby library scrapi. Since the pages follow a regular format, e.g. http://wefollow.com/twitter/socialmedia/page2/followers, it is easy to go through all the pages, find each tag, and extract the tag name and the number of members. The code I have posted above extracts the members of each community instead, but the idea is the same :). You will end up with a list of tags and members like this:

View the List in Google Docs
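
The paging scheme described above can be sketched on its own, without any scraping. This is only a minimal illustration: `page_uris` is a hypothetical helper name that does not appear in the original script, and the URL layout is assumed to be exactly the one shown in the example URL.

```ruby
require 'uri'

BASE_URL = "http://wefollow.com/twitter/"

# Build the URIs for the first `pages` result pages of one tag.
# Page 1 has no "pageN" segment; later pages do.
def page_uris(tag, pages)
  (1..pages).map do |page|
    path = page == 1 ? "#{tag}/followers" : "#{tag}/page#{page}/followers"
    URI.parse(BASE_URL + path)
  end
end

page_uris("socialmedia", 3).each { |uri| puts uri }
```

Keeping the URL construction in one small method makes it easy to check the generated addresses before pointing a scraper at them.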

Sorting

Although this list is great, we want to get rid of the problem that the same community shows up under many different tags.
Examples are:

  • music with 69382 members
  • musician with 11799 members
  • musiclover with 6304 members

or

  • mommy with 10000 members
  • mom with 6205 members
  • mompreneur with 545 members

We are only interested in the biggest categories for now, and want to make a list that shows the tag with the highest number of members first, followed by the other similar tags and their respective numbers of members.
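
To make the target format concrete, here is a small sketch that turns already grouped tags into such rows. The member counts are the ones from the examples above; `group_to_row` is a hypothetical helper name, and the code assumes a current Ruby with the standard `csv` library rather than the older CSV API used elsewhere in this post.

```ruby
require 'csv'

groups = [
  [{ :word => "mom",        :members => 6205 },
   { :word => "mommy",      :members => 10000 },
   { :word => "mompreneur", :members => 545 }],
  [{ :word => "musician",   :members => 11799 },
   { :word => "music",      :members => 69382 },
   { :word => "musiclover", :members => 6304 }]
]

# Sort one group by member count (largest first) and flatten it
# into a single row: word, members, word, members, ...
def group_to_row(group)
  group.sort_by { |tag| -tag[:members] }
       .flat_map { |tag| [tag[:word], tag[:members]] }
end

# Biggest community first: row[1] is the largest member count of a row.
rows = groups.map { |group| group_to_row(group) }.sort_by { |row| -row[1] }
rows.each { |row| puts row.to_csv }
```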

So we will create a number of rules according to which we want to build the final set:

  • All words shorter than 3 characters, like UK or IT, will be excluded. (Those words are often too short and ambiguous.)

Two words count as similar when:

  • Both words start with the same letter, ignoring case (T, t)
  • One word is included in the other, like tech and technology
  • OR the words are very similar to each other, like journalist and journalism (here one is not included in the other)
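
The length, first-letter and inclusion rules can be written down as a small predicate. This is a sketch under assumptions: `contained_duplicate?` and `MIN_LENGTH` are hypothetical names, and the "very similar" rule is left out because it needs an edit-distance measure.

```ruby
MIN_LENGTH = 3 # tags shorter than this (e.g. "UK", "IT") are ignored

# True when both tags are long enough, start with the same letter
# (ignoring case) and one tag contains the other.
def contained_duplicate?(a, b)
  a, b = a.downcase, b.downcase
  return false if a.length < MIN_LENGTH || b.length < MIN_LENGTH
  return false if a == b
  return false unless a[0] == b[0]
  a.include?(b) || b.include?(a)
end

p contained_duplicate?("tech", "technology")       # true
p contained_duplicate?("journalist", "journalism") # false: neither contains the other
p contained_duplicate?("IT", "Italy")              # false: "IT" is too short
```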

Writing a program that does this based on the CSV list we generated is not so hard in Ruby. We read in the list of words twice, go through the outer list, and compare each word with every word in the inner list. If a word meets our criteria for a double entry, we add it to the collection of already processed words and also save it as a hash in the list of similar words. At the end we sort the groups of similar words by member count, and write each group on one line with its words sorted by membership:


require '../config/environment'
require 'csv'
require 'faster_csv'
require 'text'

in_list  = FasterCSV.read("results/all_groups_without_mine.csv")
# Drop tags shorter than 3 characters and sort by length, shortest first
master = in_list.select{|a| a[0].length > 2}.sort{|a,b| a[0].length <=> b[0].length}
slave = in_list.select{|a| a[0].length > 2}.sort{|a,b| a[0].length <=> b[0].length}

outfile = File.open("results/testrun.csv", 'wb')

i = 0
CSV::Writer.generate(outfile) do |csv|

  processed_words = []
  collection = []

  master.each do |master_word|

    #Skip words that have been part of the duplicate finding process
    found = false
    processed_words.each do |word|
      if master_word[0].include? word #word.include? master_word[0] #or master_word[0].include? word
        found = true
      end
    end
    if found
      next
    end

    similar_words = []
    similar_words << {:word => master_word[0], :members => master_word[1].to_i, :distance => 0}

    #puts "Working on Row #{i} length of word #{master_word[0].length}"
    slave.each do |slave_word|
      similar = false
      included = false

      #They start with the same two letters
      if master_word[0].chars.first.downcase == slave_word[0].chars.first.downcase && master_word[0].chars.to_a[1].downcase == slave_word[0].chars.to_a[1].downcase

        #A Levenshtein distance lower than x
        distance = Text::Levenshtein.distance(master_word[0], slave_word[0])
        if (distance > 0 && distance < 4) && master_word[0].length > 6 && slave_word[0].length > 6 #6 for long words only...
          similar = true
        end

        #One is included in the other
        if master_word[0].length != slave_word[0].length
          if (master_word[0].include? slave_word[0]) or (slave_word[0].include? master_word[0])
            included = true
          end
        end

        if similar or included
          similar_words << {:word => slave_word[0], :members => slave_word[1].to_i, :distance => distance}
          if master_word[0].include? "entre"
            puts "For word #{master_word[0]} found similar word #{slave_word[0]}"
          end
          processed_words << slave_word[0].downcase
        end
      end
    end

    #CSV Output
    collection << similar_words
    i += 1
  end

  # Groups with the largest community first; within a group, largest tag first
  collection.sort{|a,b| b.collect{|item| item[:members]}.max <=> a.collect{|item| item[:members]}.max}.each do |row|
    output = []
    row.sort{|a,b| b[:members] <=> a[:members]}.each do |word|
      output << word[:word]
      output << word[:members]
      output << word[:distance]
    end
    csv << output
  end
end

As you might have noticed, there is one neat little thing in this code: the computation of the Levenshtein distance. This distance is small for words that are similar to each other, like journalist and journalism, and lets us detect such pairs. We have to be careful with the threshold, though: even unrelated words can be close to each other, e.g. Text::Levenshtein.distance("artist", "autism") is only 2, so the allowed distance has to stay small to avoid silly errors.
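
For readers who want to see what the distance actually computes, here is a plain-Ruby sketch of the classic dynamic-programming version. It is not the implementation from the text gem; `levenshtein` and `near_duplicate?` are hypothetical names, and the default thresholds simply mirror the ones used in the script above (distance below 4, words longer than 6 characters).

```ruby
# Number of single-character insertions, deletions and substitutions
# needed to turn a into b, keeping only one table row in memory.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      deletion     = previous[j] + 1
      insertion    = current[j - 1] + 1
      substitution = previous[j - 1] + cost
      current << [deletion, insertion, substitution].min
    end
    previous = current
  end
  previous.last
end

# Only long words are compared, with a small tolerance.
def near_duplicate?(a, b, max_distance = 3, min_length = 7)
  return false if a.length < min_length || b.length < min_length
  (1..max_distance).cover?(levenshtein(a, b))
end

p levenshtein("journalist", "journalism")     # 1
p levenshtein("artist", "autism")             # 2
p near_duplicate?("journalist", "journalism") # true
p near_duplicate?("artist", "autism")         # false: both words are too short
```

Restricting the check to long words is what keeps a pair like artist and autism apart, even though their distance is small.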

Another way around this might be to use Porter stemming to reduce inflected words to their stem. An example of stemming could be the following:

  • result = Text::PorterStemming.stem("designing")
  • The result is “design”.
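
To illustrate how stems could be used for grouping, here is a deliberately crude sketch. It is not the Porter algorithm: `crude_stem` is a hypothetical stand-in that only strips "ing" and a plural "s", used so the example runs without the text gem. With the gem installed, the stemmer call shown above would take its place.

```ruby
SUFFIXES = %w[ing s]

# Strip one known suffix, as long as at least 3 characters remain.
# Words ending in "ss" keep their final "s".
def crude_stem(word)
  SUFFIXES.each do |suffix|
    next unless word.end_with?(suffix)
    next if suffix == "s" && word.end_with?("ss")
    stem = word[0, word.length - suffix.length]
    return stem if stem.length >= 3
  end
  word
end

tags = %w[design designing designs film films]

# Tags that share a stem end up in the same group.
groups = tags.group_by { |tag| crude_stem(tag) }
groups.each { |stem, members| puts "#{stem}: #{members.join(', ')}" }
```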

P.S. I’ve updated the code and the output a bit. The most significant changes are that the words have to have their first two letters in common, and that the distance is only computed for long words (more than 6 characters) with a somewhat higher tolerance (distance < 4). This way I am able to capture nice misspellings like:

  • entrepreneur,30716,1,entrepeneur,651,0,entrepreneurs,143,2,entreprenuer,447,0 .

The Result

Even if we make quite a few errors, which we will have to correct by hand, the output is quite usable. This table gives an overview of the output, showing the word, the number of members, and the Levenshtein distance to the word it started with:

The Result in Google docs

As you can see we got some nice results like:

  • art,10098,11,arts,1855,10,artsandculture,235,0
  • film,5687,0,filmmaker,2842,5,filmmaking,843,6,films,797,1,farm,312,2,filmfestival,223,8,fire,162,2,fly,151,2,filmes,141,2
  • singer,4276,0,single,902,2,singersongwriter,893,10,swingers,161,2,singer_songwriter,147,11

What we basically see is that when words are combined, like wine and winelover, the community becomes smaller but often more specific, as in the examples of film and filmfestival or green and greenbusiness. Well, now it’s time to collect the 100 most listed members from each community and see how the first 100 “big” general communities are linked with each other. I will try to cover this in the next blog post.

Cheers, Thomas