Ontology

This category contains 4 posts

How to generate interest based communities part 2

In last week’s blog post (https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/) I described my way of collecting people on Twitter who are frequently listed on lists for certain keywords such as swimming, running, perl, ruby and so on. I then sorted the persons in each category according to how often they were listed in that category. This led to lists like the one below, which shows people found on lists that contained the word “actor”.

We might say this is a satisfactory result, because the list seems to contain people who are actually relevant to this keyword. But what about the persons that we collected for the keyword “hollywood”? Let’s have a look:

If you look at the first persons you notice that a lot of these people are the same. Although in my last attempts (https://twitterresearcher.wordpress.com/2012/04/16/5/ and https://twitterresearcher.wordpress.com/2012/03/16/a-net-of-words-a-high-level-ontology-for-twitter-tags/) I tried hard to find keywords that are semantically related such as “car” and “automotive”, the list of user interests still ended up containing overlapping pairs like “actor” and “hollywood”. What are we going to do about this problem? My solution is to merge those two lists into one, since they seem to cover the same interest. But how do I do this without having to decide subjectively for each list?

First step: Calculating the number of overlapping members between lists

One idea is to calculate how often members from one list appear on other lists. Lists that have a high overlap are then merged into one list, and the counts that those people received are added up. The new position on the merged list is determined by the new count. We need two parameters: the maximum number of persons that we want to look at in each list (I simply called it MAX) and a threshold, the share of common people above which two lists are merged. If we merge the two lists “actor” and “hollywood” into “actor_hollywood”, we also want to run this new list against all remaining keywords such as “tvshows” and merge it with them if the criteria are met, resulting in “actor_hollywood_tvshows”. The result is a nice clustering of the members we found for our interests. Although these interests have different keywords, if they contain the same members they seem to capture the same semantic concept or user interest. The code to perform this is shown below:

For further processing the code also saves which concepts it merged into which keys and also makes sure that if we merge 200 people from one list with 200 from another list we only take the first 200 from the resulting list.
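As a rough illustration of this merge step, here is a minimal sketch and not the original listing. It assumes each list is an array of [user, count] pairs sorted by count, and that the overlap is measured relative to the shorter of the two lists; the names overlap, merge_lists and cluster are made up for this example.

```ruby
MAX = 200        # how many top members of each list are compared
THRESHOLD = 0.2  # share of common members required for a merge

# Share of identical users among the top `max` members of two lists.
# Assumption: the share is taken relative to the shorter list.
def overlap(a, b, max = MAX)
  top_a = a.first(max).map(&:first)
  top_b = b.first(max).map(&:first)
  smaller = [top_a.size, top_b.size].min
  return 0.0 if smaller.zero?
  (top_a & top_b).size.to_f / smaller
end

# Add up the counts of both lists, re-rank, and keep only the first `max` entries.
def merge_lists(a, b, max = MAX)
  totals = Hash.new(0)
  (a + b).each { |user, count| totals[user] += count }
  totals.sort_by { |user, count| [-count, user] }.first(max)
end

# Keep merging until no pair of lists reaches the threshold. Returns the
# merged lists and a record of which keywords ended up under which key.
def cluster(lists, threshold = THRESHOLD, max = MAX)
  lists = lists.dup
  merged_from = lists.keys.to_h { |key| [key, [key]] }
  loop do
    pair = lists.keys.combination(2).find do |x, y|
      overlap(lists[x], lists[y], max) >= threshold
    end
    break unless pair
    x, y = pair
    key = "#{x}_#{y}"
    lists[key] = merge_lists(lists.delete(x), lists.delete(y), max)
    merged_from[key] = merged_from.delete(x) + merged_from.delete(y)
  end
  [lists, merged_from]
end

# Toy data, invented for the example.
lists = {
  "actor"     => [["alice", 50], ["bob", 40], ["carol", 10]],
  "hollywood" => [["bob", 30], ["alice", 20], ["dave", 5]],
  "perl"      => [["erin", 9], ["frank", 3]]
}
merged, origin = cluster(lists, 0.5, 3)
merged.each { |key, members| puts "#{key}: #{members.inspect}" }
```

Because a merged list is put back into the pool, it is automatically compared against all remaining keywords in the next round, which is what produces keys like “actor_hollywood_tvshows”.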

What does the result look like? I’ve displayed the resulting merged categories using a threshold of 0.1 and checking the first 1000 places for overlap.

Below you see the final output, where I have used a threshold of 0.2 and looked at only the first 200 users in each list. Regarding the final number of communities there is a trade-off: when the threshold is set too low, we end up with “big” user interest areas where lots of nodes are clumped together. When the threshold is too high, groups that obviously should be united (e.g. “theater” and “theatre”) won’t be merged. I have had good experiences with a threshold of 0.2, which means that groups that share 20% of their members are merged into one.

Second step: Allowing members to switch groups

The results of the above attempts are not bad, but they can be improved. Why? Well, imagine your name was in the actors category, which got merged with drama, hollywood and tv_shows, and you ended up in 154th place in this category. This is not bad, but it might be that people actually think that you are more of a “theatre” guy, and that is why you rank 20th in the theatre category. Although I know that a person can belong to multiple interest groups, if I were to choose the one that best represents you I would say that you are in the theatre category, because you ranked 20th there while only ranking 154th in the actor category.

So this means that I am comparing the rankings that you achieved in each category. But I could also compare the total number of votes that you received on each list. If I did that, you would end up in the actor category, because the total number of lists for this category is much higher than for theatre, and the 200 votes received by somebody in 154th place in the actor category outnumber the 50 votes received by the same person in 20th place in the theatre category. I have chosen to go with the ranking method, because it is more robust against this problem. Popular interests do not “outweigh” the more specific ones, and if a person can be placed in a specific category then it should be the specific one and not the popular one. The code below does exactly this. Additionally it notes for each person how often this person was also part of other categories, but the person gets assigned to the category where he or she achieved the best rank.
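The ranking method can be sketched roughly as follows. This is only an illustration under assumptions: each category is given as an array of user names ordered by rank, and the “competing” value counts every category in which a user shows up, including the one the user is finally assigned to. The method name assign_by_rank is made up for this example.

```ruby
# categories: { "category_name" => ["user ranked 1st", "user ranked 2nd", ...] }
# Returns { user => { category: ..., rank: ..., competing: ... } }.
def assign_by_rank(categories)
  best = {}
  categories.each do |name, members|
    members.each_with_index do |user, index|
      rank = index + 1
      entry = (best[user] ||= { category: name, rank: rank, competing: 0 })
      # Count every category in which this user appears.
      entry[:competing] += 1
      # Move the user if this category gives a better (lower) rank.
      if rank < entry[:rank]
        entry[:category] = name
        entry[:rank] = rank
      end
    end
  end
  best
end

# Toy data, invented for the example.
categories = {
  "actor_hollywood" => ["anna", "ben", "chris"],
  "theatre"         => ["chris", "dora"]
}
assign_by_rank(categories).each do |user, info|
  puts "#{user}: #{info[:category]} (rank #{info[:rank]}, competing #{info[:competing]})"
end
```

Comparing ranks instead of raw vote counts is what keeps a large category from swallowing everybody: only the position within each list matters, not how many lists exist for that keyword.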

There is also a small array called final_candidates that is used to put exactly 100 persons in each category at the end. What does the output look like? In most of the cases it leaves the persons in the same category, but in some cases people actually switch categories. These are the interesting cases. I have filtered the output in Excel and sorted it by the number of competing categories, to showcase some of the cases that took place. You notice that e.g. the “DalaiLama” started in the “yoga” category but according to our algorithm (or actually the people’s votes) he fitted more into “buddhism”, or “NASA” started in “tech” but was moved to “astronomy”, which seems even more fitting.

To give an idea of how often this switcheroo took place I have created a simple pivot table listing the average number of competing categories per category (see below). We see that for the majority of categories their people don’t compete for other categories (right side of the chart), but for a handful of categories they do (left peaks of the chart). What you also notice in this graph is that the lower the threshold, the fewer final groups we get, but these groups have a smaller average competing count (e.g. compare the violet line, size 1000 and threshold 0.1, with the green line, size 1000 and threshold 0.2). What you also see is that if we consider only the first 200 places instead of the first 1000 places we actually get better results (compare the violet line with the red line). This is a bit counterintuitive, since I was thinking that the more people we take into consideration, the better the results. It rather turns out that after a certain point this voting mechanism seems to get “blurrier”. People voted into the 345th place somewhere don’t really matter that much, but eventually they lead to the merging of categories that shouldn’t have been merged.

No matter which threshold and size we use, there are always a couple of groups that seem “problematic” (aka the high peaks on the left of the chart), where it seems hard for people to decide where these users belong. Below I have provided an excerpt for group size 200 and threshold 0.2. For people in these categories it seems really hard to “pin” them down to a certain interest.

  • Category Name, Average competing categories for group
  • tech 1.871287129
  • comedy_funny 1.693069307
  • developer 1.603960396
  • recipes_cooking 1.554455446
  • magazine 1.544554455
  • food_chef 1.544554455
  • tvshows_drama_actor_hollywood 1.534653465
  • politics_news 1.524752475
  • finance_economics 1.524752475
  • mac_iphone 1.514851485
  • teaching 1.465346535
  • director 1.465346535
  • liberal 1.465346535
  • ipad 1.455445545
  • healthcare_medicine 1.435643564

For the rest of the groups we get very stable results. These interest groups seem to be well defined and people don’t think that those people belong to other categories:

  • hockey 1
  • army_military_veteran 1
  • composer 1
  • rugby 1
  • piano 1
  • astrology 1
  • wedding 1
  • dental 1
  • wrestling 1
  • linux_opensource 1
  • skiing 1
  • perl 1
  • golf 1
  • accounting 1

Conclusion

For these remaining interest groups we will now take a look at their internal group structure, looking at how e.g. opinion leaders (people who are very central in the group) are able to get a lot of retweets (or not). Additionally we will take a look at the people between different groups (e.g. the programming languages ruby and perl) who work as brokers or “boundary spanners”, and at whether these people are able to get retweets from both communities, from only one, or from none at all. For questions like these, the interest groups provide an interesting data source.

Cheers Thomas

User interests ontology

I’ve been blogging about the idea that people form networks based on their interests for a while now. As you may remember, we used the tags on wefollow to find out what people are interested in on Twitter.

And in the  last post I have shown how to create a post-hoc ontology  from tags that we collected on wefollow, which represent people’s interests. Yet the results of this attempt were kind of mediocre:

  • We have found that people like to tag a lot of people describing “somebody” like rapper, artist, celebrity and so on.
  • And we have found out that people like to tag a lot of twitter users based on a certain activity, like swimming, running, hacking, dancing, cooking and so on. But apart from this I was still lacking any insight into this “bag of words” that I got from wefollow.
  • We also have found out that rugby is similar to soccer, and those two are similar to cricket because these are all field games and so on…

In another attempt I  have also tried  to find out which of the keywords on wefollow are somewhat similar simply by looking for words that sound the same or are spelled the same. The results were interesting.

  • We have found out that people like to use different keywords with different popularity to describe “kind of” similar things. For example (keyword, members, Levenshtein distance to the first keyword): film, 5687, 0; filmmaker, 2842, 5; filmmaking, 843, 6; films, 797, 1; farm, 312, 2; filmfestival, 223, 8; fire, 162, 2; fly, 151, 2; filmes, 141, 2
  • Most of these keywords share the word film, but simply tagging users with the word film yields the highest member count.
  • Similarly for other words: singer, 4276, 0; single, 902, 2; singersongwriter, 893, 10; swingers, 161, 2; singer_songwriter, 147, 11. Here singer is the most popular keyword to tag people, followed by single and so on.

User interest providers

Yet despite those two attempts we are still lacking insight into the interests of Twitter users. What I am looking for is some kind of hierarchy between those words, not so much as in the wordnet approach (see above) and not so much as in the word similarity approach, but more of an ontology based approach where we split up the user interests into, let’s say, 6-12 high level categories and put our keywords into those. We have used two different approaches, so now it’s time to get an overview of other options.

The table below is a comparison of providers of user interests that have chosen to categorize them accordingly. As you can see, the approaches differ in how many keywords are used and in whether the ordering is hierarchical (as in dmoz or yahoo) or simply some sort of folksonomy, as in delicious. The first two providers are commercial and do not offer any subcategories or networks, but give us some clue about the number of top level categories. In general we can say that most of these “interest directories” seem to contain the same top level categories (which is great because it seems we can agree on something). Apart from that I think that dmoz or yahoo give us the best chance to order our keywords in a reasonable manner.

So after settling on the yahoo directory (since it is the most comprehensive, contains the largest number of subcategories and is curated by paid professionals) to find out more about how to order user interests, it’s time to take a look at their directory.


A screenshot of the Yahoo Directory

Since there is no API, I have chosen to scrape the first 2-3 levels of their directory and save them to a file (I have used nokogiri). You will find the listing below. It goes through each of the top levels defined beforehand (we are skipping new additions, subscribe via rss and regional) and looks at each of the links there. It notes how many subcategories are listed in the brackets, and it notes whether a link points towards another category, in which case it contains an @-sign. After going through all of those links it writes them down in a simple manner:

Topcategory, Subcategory, Count

Like this we get a network of categories.

Scraping the categories

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Top level categories of the directory that are crawled.
topdomains = ["business_and_economy", "recreation", "computers_and_internet", "reference", "education", "regional", "entertainment", "science", "government", "social_science", "health", "society_and_culture"]

# Words seen in earlier runs, one per line (chomp also copes with a last line without newline).
@seen_words = File.readlines("seenwords.csv").map(&:chomp)
@seen_words_file = File.open("seenwords.csv", "a+")

# Writes one "father son" edge. A subcategory name that was seen before and is
# not an @-link gets prefixed with its father, so generic names stay distinct.
def write_net(father, son)
  if @seen_words.include?(son) && !son.include?("@")
    @file.puts "#{father} #{father}_#{son}"
  else
    @file.puts "#{father} #{son}"
  end
  @seen_words_file.puts son
end

i = 0
topdomains.each do |domain|
  @file = File.open("#{domain}.csv", "w+")
  puts "working on domain #{domain}"
  # URI.open comes from open-uri; a plain open no longer fetches URLs in current Ruby versions.
  site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{domain}"))
  site.css("div.cat li a").each do |link|
    first_level_link = link.content.gsub(" ", "_").downcase
    write_net(domain, first_level_link)
    puts "working on #{first_level_link}"
    # Links with an @ point to a category that lives elsewhere in the directory.
    if first_level_link.include? "@"
      first_level_link.gsub!("@", "")
      sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{first_level_link}"))
    else
      sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{domain}/#{first_level_link}"))
    end
    sub_site.css("div.cat li a").each do |sub_link|
      i += 1
      puts i.to_s
      second_level_link = sub_link.content.gsub(" ", "_").downcase
      write_net(first_level_link, second_level_link)
      if second_level_link.include? "@"
        second_level_link.gsub!("@", "")
        sub_sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{second_level_link}"))
      else
        sub_sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{domain}/#{first_level_link}/#{second_level_link}"))
      end
      sub_sub_site.css("div.cat li a").each do |sub_sub_link|
        third_level_link = sub_sub_link.content.gsub(" ", "_").downcase
        write_net(second_level_link, third_level_link)
      end
    end
  end
  # Close each domain file once its domain is finished.
  @file.close
  puts "done domain #{domain}"
end
@seen_words_file.close

Visualizing the network

Having downloaded the network we end up with something that we can visualize in gephi (see below). The visualization is nice, since it allows us to see which fields the links with the @-sign connect. We can see clusters emerge between different concepts and see that most of the second level categories are not connected to the rest. As you will note in the listing, I have also made sure not to include subcategories like “organisation” or “people” as shared nodes, since every category contains such a subcategory and it would end up being the most central node in our network. Instead each such subcategory gets an explicit name, e.g. “sports_organisations”, and only the categories with an @ are allowed to link to other groups.

The downside of this approach is that the result is pretty big and creates even more confusion than our keywords from wefollow. Now we have a network with approximately 7,000 nodes and 20,000 edges. We could now search for each of the wefollow keywords, see where we can find it, and then drop the rest. This idea is not bad, but we would be neglecting the majority of the great insights that the yahoo directory gives us. If for example wefollow does not contain keywords regarding health, does it mean that Twitter users are not interested in health issues, or did we not look properly? Therefore I decided to take a hybrid approach. First I will cut the yahoo directory down to the categories that contain a lot of entries, and at the same time see how the 200 most frequent wefollow keywords fit into this ontology.
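The first, mechanical part of this hybrid approach could be sketched like this. It is only a sketch under assumptions: the directory is given as a hash from category name to its number of entries, the wefollow keywords are ordered by frequency, and the cut-off of 100 entries as well as the method name hybrid_candidates are invented for the example.

```ruby
require 'set'

# directory: { "category" => number of entries listed under it }
# keywords:  wefollow tags, most frequent first
def hybrid_candidates(directory, keywords, min_entries: 100, top: 200)
  # Keep only the directory categories with a lot of entries.
  large = directory.select { |_name, entries| entries >= min_entries }.keys.to_set
  top_keywords = keywords.first(top)
  {
    shared:        top_keywords.select { |k| large.include?(k) },  # in both sources
    wefollow_only: top_keywords.reject { |k| large.include?(k) },  # e.g. newer words
    yahoo_only:    (large - top_keywords.to_set).to_a              # not tagged on wefollow
  }
end

# Toy numbers, invented for the example.
directory = { "health" => 500, "music" => 900, "zoology" => 10 }
keywords  = ["music", "youtube"]
p hybrid_candidates(directory, keywords)
```

The three buckets only prepare the manual step: where the remaining words end up in the hierarchy is still decided by hand.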

A mind map of user interests

The result is a mind map of user interests, rather than a network, since I’ve chosen to write it down by hand in order to be able to change small things. For example I would like to exclude the keywords that link different topdomains together, so that I end up with a tree. Additionally I’ve decided to mark the words I have included from wefollow with a “wefollow: ” prefix in order to make the process more transparent for everybody. The result shows that quite a lot of the keywords that we found on wefollow were also part of the existing yahoo directory, and that although the directory was quite big it did not contain a number of newer words such as “youtube”, “podcast” and so on. Additionally, concepts like “animals and pets” were added by me by hand, since they sit at a very deep level in the yahoo ontology (Science / Zoology / Animals / …) but are actually quite popular among twitter users. So below you see the result of my work. This mind map represents a hybrid of the 200 most frequent wefollow keywords and the most popular yahoo categories. I am quite happy with the result since it seems to be useful in describing the bag of words I had before.

I am right now collecting the communities of those users on Twitter in order to analyze them and will keep you updated about the progress.

That’s it for today.
Cheers
Thomas

A net of words (A high level ontology for Twitter tags)

Knowing that people form networks on Twitter based on their interests, I have investigated the tags that are listed on wefollow (see below).

Motivation:

Since those tags are rather chaotic and in no particular order, except that they are listed by the number of followers, I was wondering how others organise those interests. The most prominent websites that offer such a service are peerindex.com and appinions.com (founded by CMU members). Both websites allow users to find influential users based on a certain interest.

Peerindex divides all topics into 8 different areas. On the left you can see my topical fingerprint in these areas.

  • AME  – arts, media, entertainment
  • TEC – technology, internet
  • SCI – science, environment
  • MED – health, medical
  • LIF – leisure, lifestyle
  • SPO – sports
  • POL – news, politics, society
  • BIZ – finance, business, economics

Appinions offers 10 different categories, which roughly map to the categories used by Peerindex (see below):

  • POLITICS ~ POL
  • TECHNOLOGY ~ TEC
  • RECREATION ~ LIF
  • MEDIA ~ AME
  • ENTERTAINMENT ~ AME
  • EDUCATION ~ SCI
  • FASHION –> no equivalent
  • BUSINESS ~ BIZ
  • TRAVEL ~ LIF
  • HEALTH ~ MED

Question: How do we either a) map the tags from wefollow to the concepts above or b) create our own ontology of things?

What we are looking for is a kind of similarity between the semantic concepts that these tags stand for. For example, soccer is similar to football; those words can be considered synonyms. But what about other “relations”, such as the one between cricket or hockey and football? We know that these words are not synonyms, but they are somewhat close to each other. And if I were interested in football I could probably also be interested in cars. To find these kinds of relations we need a database that contains semantic relations beyond synonyms. One great tool to use is wordnet. What is wordnet? I’ve cut and pasted the definition from their website:

Wordnet

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download.

WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow any explicit pattern other than meaning similarity.

Armed with the knowledge contained in wordnet we can start to see if we can come up with a relation between engine and car. If you use their built-in browser you might find the following entries (see below). Note that I have unfolded the so-called hypernyms for those two words. Hypernyms exist between synsets (which are groups of words that have a similar meaning). The definition of hypernyms is from the website:

The most frequently encoded relation among synsets is the super-subordinate relation (also called hyperonymy, hyponymy or ISA relation). It links more general synsets like {furniture, piece_of_furniture} to increasingly specific ones like {bed} and {bunkbed}. Thus, WordNet states that the category furniture includes bed, which in turn includes bunkbed; conversely, concepts like bed and bunkbed make up the category furniture. All noun hierarchies ultimately go up the root node {entity}.

So if I enter soccer I get the so-called hypernym tree that goes up to the root node entity (not shown). We learn that soccer is a football game, and that this in turn is a field game.

And if I enter hockey I get:

If I enter rugby I get:

OK, you get the idea: this leads to a tree that connects these concepts as shown below.

If you go through these concepts by hand and note down whenever two words connect on some higher level of the hierarchy, you will end up knowing that these two words are somewhat similar. Of course doing this by hand is tiresome. That’s why we will use ruby and the gem rwordnet to create such a tree computationally.

require 'rubygems'
require 'wordnet'
require 'nokogiri'
require 'cgi'
require 'open-uri'

# Read the keyword tags, one per line (chomp also copes with a last line without newline).
words = File.readlines("groups.txt").map(&:chomp)

index = WordNet::NounIndex.instance
file = File.open("output.csv", "w+")
words.each do |word|
  puts "Working on word: #{word}"
  wordnet = index.find(word)
  if wordnet != nil
    puts "#{wordnet.synsets.count} Synsets found for #{word}"
    max = 0
    # The synsets are ordered by frequency, so the first one is the most frequent meaning.
    best_synset = wordnet.synsets.first
    last_word = word
    next_word = best_synset.hypernym
    # Walk up the hypernym chain and write one SOURCE;TARGET pair per step.
    while next_word != nil && next_word.words.first != last_word
      file.puts "#{last_word};#{next_word.words.first}"
      #puts "#{last_word} H: #{next_word.words.join(" ")}"
      last_word = next_word.words.first
      next_word = next_word.hypernym
    end
  else
    puts "Nothing found for #{word}"
  end
end
file.close

So what does this script do?

  • It reads in the 1500 group keyword tags that we collected from wefollow
  • It then takes each word and checks if we can find a meaning for it on wordnet. (For some things like twitter, youtube, etc. there are no entries…)
  • Since the result is ordered by the meaning’s frequency we take the first meaning of the word (we will come back to this later…)
  • For this word we compute a tree of hypernyms by going up until there are no more hypernyms
    • In each of these steps we note a pair: SOURCE – TARGET
  • We dump this network to disk and visualize it with gephi.

So after doing this and visualizing it with gephi you get a tree that looks like the one above. But there was a problem with finding the most frequent meaning of a word. For example, for the word “poker” people these days would think of the card game and not of a fire hook.

Google for Frequencies

Since I think wordnet has computed the frequencies for these words based on some book corpus that might be outdated, the frequencies are probably outdated too. So I needed a way of finding the meaning that is more popular, and I thought: why not use google search? The more results you find for the combination of the word with its so-called “gloss” (an informal definition of the concept), the more likely it is that this is the main thing users on Twitter had in mind when entering this keyword. So I changed the listing above a bit, replacing the part about choosing the best synset.

require 'nokogiri'
...
    best_synset = ""
    wordnet.synsets.each do |synset|
      # Search for the word together with its gloss; CGI.escape turns spaces into "+".
      searchterm = CGI.escape("#{word} #{synset.gloss}")
      site = Nokogiri::HTML(URI.open("http://www.google.ch/search?q=#{searchterm}"))
      # Read the result counter from the page and keep only its digits.
      r = site.css("#subform_ctrl div").children.last.content.to_s
      r.gsub!("'", "")
      results = r.gsub(/[^0-9]/, "").to_i
      puts "Found #{results} for gloss #{synset.gloss}"
      if results > max
        max = results
        best_synset = synset
      end
    end

So now the program scrapes from google search how many results it found, and the concept with the highest number of results wins.

Output

So finally, after going through all this, what does the output look like, and is it of any help in organising our word tags?

The global view shows that the network looks more like a tree with long thin arms fading out. We can recognize some main concepts: a lot of tags have been unified under “somebody”, so twitter is about persons, and a lot of tags have been subsumed under activity, so it is about what people are doing. If you want to dig through the network yourself, I’ve attached it to this post. Feel free to download it.

Outlook

So what’s next? If we want to find out how similar two words are, all I have to do now is count how many steps I have to take to find a connection between them. So in the case of hockey, soccer and rugby the distance would be quite short, but in the case of gardening and rapper it would be quite long. Remember that this ontology was created by wordnet, and therefore the distance between concepts depends on this ontology. But what if we look these communities up on Twitter and see how close they really are? That’s something we will do in the next blog post.

And finally I still owe you the ordering of these words into the broader categories from the other providers. My idea on how to do this would be to include these concepts in the network, see how far our keywords are away from them, and choose the one that is closest. This has some implications: for example, a concept that has a high centrality (such as media) would win the majority of words. And I have to ask myself whether I want a constraint that each category should contain a similar number of words. If yes, I have to think about how to solve it. Maybe you have some ideas?
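Counting the steps between two words can be sketched as a breadth-first search over the SOURCE;TARGET pairs. The sketch below assumes the edge list is available as lines of text in that format and treats the edges as undirected; the method names build_graph and distance and the toy edges are made up for illustration and are not actual wordnet output.

```ruby
# Build an undirected adjacency list from "source;target" lines.
def build_graph(lines)
  graph = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    source, target = line.chomp.split(";")
    next if source.nil? || target.nil?
    graph[source] << target
    graph[target] << source
  end
  graph
end

# Breadth-first search: number of edges on the shortest path between two
# words, or nil if they are not connected at all.
def distance(graph, from, to)
  return 0 if from == to
  seen = { from => 0 }
  queue = [from]
  until queue.empty?
    node = queue.shift
    graph.fetch(node, []).each do |neighbor|
      next if seen.key?(neighbor)
      seen[neighbor] = seen[node] + 1
      return seen[neighbor] if neighbor == to
      queue << neighbor
    end
  end
  nil
end

# Toy edges, invented for the example.
edges = ["soccer;football", "football;field_game", "hockey;field_game", "rapper;musician"]
graph = build_graph(edges)
puts distance(graph, "soccer", "hockey").inspect
puts distance(graph, "soccer", "rapper").inspect
```

With the real file the lines could come from File.readlines("output.csv"). The same distance function would also serve for picking the closest provider category for a keyword.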

Cheers
Thomas

How to make sense out of Twitter tags

We all know the problem: although there seem to be “interest based” communities out there, we honestly don’t know how to start to “tap” into them.
A first way is to see what is out there by looking at Twitter directories such as wefollow.com or twellow.com.

What you might notice is that when we sort these tags by the number of people listed under them we get a nice folksonomy of Twitter, listing all the interests that people have from most popular to least popular. This page, by the way, has 25 more pages with fewer and fewer people. You might also notice that from a broad point of view we have a couple of double entries, for example “tech” and “technology”, “blogger” and “blog”, or “tv” and “television”, which basically mean the same. Since I want to study these communities, I wanted to create a list of all similar items in this list. So my first step was to scrape this website and save the results to a nice csv containing the name and the number of people that are listed in each category.

Scraping Wefollow

require 'rubygems'
require 'scrapi'
require 'csv'
require 'uri'

BASE_URL = "http://wefollow.com/twitter/"
PAGES = 25

# The tags to collect are passed on the command line.
search_words = ARGV

# Every result row yields one item whose name is the text of the profile link.
scraper = Scraper.define do
  array :items
  #div+div>div.person-box
  process "#results>div", :items => Scraper.define {
    process "div.result_row>div.result_details>p>strong>a", :name => :text
  }
  result :items
end

# The output file is named after the first tag.
CSV.open("../data/#{ARGV[0]}.csv", "w") do |csv|
  search_words.each do |word|
    (1..PAGES).each do |page|
      # The first page has no page number in its URL.
      if page == 1
        uri = URI.parse(BASE_URL + word + "/followers")
      else
        uri = URI.parse(BASE_URL + word + "/page#{page}" + "/followers")
      end
      puts uri
      begin
        scraper.scrape(uri).each do |entry|
          name = entry.name
          puts name
          csv << [name]
        end
      rescue StandardError
        puts "Couldn't find any page for #{uri}"
      end
    end
  end
end

Without going into much detail: I used the ruby library scrapi. Since their pages have a nice format, e.g. http://wefollow.com/twitter/socialmedia/page2/followers, it is easy to go through all the pages and then extract the tag name and the number of members. Above I have posted the code to extract the members of each of the communities, but the idea is the same :). You will end up with a list of tags and members like this:

View the List in Google Docs

Sorting

Although this list is great we want to get rid of the problems of many different meanings for the same community.
Examples are:

  • music with 69382 members
  • musician with 11799 members
  • musiclover 6304 members

or

  • mommy with 10000 members
  • mom with 6205 members
  • mompreneur with 545 members

We are only interested in the biggest categories for now and want to make a list that shows the name of the tag with the highest number of members, followed by the other tags that are similar, each with its number of members.

So we will create a number of rules according to which we want to build the final set:

  • All words that are shorter than 3 characters, like UK or IT, will be excluded. (Those words are often too short and ambiguous.)

Two words count as similar when:

  • Both words start with the same letter (T,t)
  • One word is included in the other like tech and technology
  • OR these words are very similar to each other like journalist and journalism (here one is not included in the other)

Writing a program that does this based on the csv list that we generated is not so hard in ruby. We will read in the list of words twice, go through the outer list, and compare each word with the inner list. If a word meets our criteria of a double entry we add it to the collection of already processed words and also save it as a hash in the list of similar words. At the end we will sort this list of similar words according to member size and then put the similar words, sorted by membership, in each line:


require '../config/environment'
require 'faster_csv'
require 'text'

in_list  = FasterCSV.read("results/all_groups_without_mine.csv")
master = in_list.select{|a| a[0].length > 2}.sort{|a,b| a[0].length <=> b[0].length}
slave = in_list.select{|a| a[0].length > 2}.sort{|a,b| a[0].length <=> b[0].length}

outfile = File.open("results/testrun.csv", 'wb')

i = 0
CSV::Writer.generate(outfile) do |csv|

  processed_words = []
  collection = []

  master.each do |master_word|

    #Skip words that have been part of the duplicate finding process
    found = false
    processed_words.each do |word|
      if master_word[0].include? word #word.include? master_word[0] #or master_word[0].include? word
        found = true
      end
    end
    if found
      next
    end

    similar_words = []
    similar_words << {:word => master_word[0], :members => master_word[1].to_i, :distance => 0}

    #puts "Working on Row #{i} lenght of word #{master_word[0].length}"
    slave.each do |slave_word|
      similar = false
      included = false

      #They start with the same two letters
      if master_word[0].chars.first.downcase == slave_word[0].chars.first.downcase && master_word[0].chars.to_a[1].downcase == slave_word[0].chars.to_a[1].downcase

        #A Levenshtein distance lower than x
        distance = Text::Levenshtein.distance(master_word[0], slave_word[0])
        if (distance > 0 && distance < 4) && master_word[0].length > 6 && slave_word[0].length > 6 #6 for long words only...
          similar = true
        end

        #One is included in the other
        if master_word[0].length != slave_word[0].length
          if (master_word[0].include? slave_word[0]) or (slave_word[0].include? master_word[0])
            included = true
          end
        end

        if similar or included
          similar_words << {:word => slave_word[0], :members => slave_word[1].to_i, :distance => distance}
          if master_word[0].include? "entre"
            puts "For word #{master_word[0]} found similar word #{slave_word[0]}"
          end
          processed_words << slave_word[0].downcase
        end
      end
    end

    #CSV Output
    collection << similar_words
    i += 1
  end

  collection.sort{|a,b| b.collect{|item| item[:members]}.max <=> a.collect{|item| item[:members]}.max}.each do |row|
    output = []
    row.sort{|a,b| b[:members] <=> a[:members]}.each do |word|
      output << word[:word]
      output << word[:members]
      output << word[:distance]
    end
    csv << output
  end
end

As you might have noticed, there is one neat little thing in this code: the computation of the Levenshtein distance. This distance is small for words that are similar to each other, like journalist and journalism, and lets us detect such words. We have to keep the allowed distance small, though, otherwise we make silly errors: Text::Levenshtein.distance("artist", "autism"), for example, is only 2.

Another way around this might be to use Porter stemming to reduce inflected words to their stem. An example of stemming could be the following:

  • result = Text::PorterStemming.stem("designing")
  • The result is "design".
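To illustrate how stemming could be used for the grouping, here is a small sketch. The names crude_stem and group_by_stem are made up, and crude_stem is only a crude suffix stripper that keeps the example free of dependencies; it is not the Porter algorithm. The member counts for the design tags are invented for the example.

```ruby
# Crude stand-in for a real stemmer: strips a few common English suffixes.
# This is NOT the Porter algorithm, just enough for the illustration.
CRUDE_SUFFIXES = %w[ing ers er s].freeze

def crude_stem(word)
  w = word.downcase
  suffix = CRUDE_SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0...-suffix.length] : w
end

# Groups [tag, members] pairs by their stem; inside each group the tag
# with the most members comes first. A different stemmer can be passed
# in as a block.
def group_by_stem(tags, &stemmer)
  stemmer ||= method(:crude_stem)
  tags.group_by { |tag, _| stemmer.call(tag) }
      .transform_values { |group| group.sort_by { |_, members| -members } }
end

tags = [["designing", 40], ["design", 100], ["designers", 25],
        ["films", 797], ["film", 5687]]

group_by_stem(tags).each do |stem, group|
  puts "#{stem}: #{group.map(&:first).join(', ')}"
end
```

With the text gem installed, the real stemmer could presumably be plugged in as group_by_stem(tags) { |word| Text::PorterStemming.stem(word) }.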

P.S. I’ve updated the code and the output a bit. The most significant changes are that the words now have to have their first two letters in common, and that the distance is only computed for long words (more than 6 characters) with a slightly higher tolerance (distance < 4). This way I am able to capture those nice misspellings like

  • entrepreneur,30716,1,entrepeneur,651,0,entrepreneurs,143,2,entreprenuer,447,0 .

The Result

Even if we make quite a few errors, which we will have to correct by hand, the output is quite usable. This table gives an overview of the output, showing the word, the number of members and the Levenshtein distance to the word it started with:

The Result in Google docs

As you can see we got some nice results like:

  • art,10098,11,arts,1855,10,artsandculture,235,0
  • film,5687,0,filmmaker,2842,5,filmmaking,843,6,films,797,1,farm,312,2,filmfestival,223,8,fire,162,2,fly,151,2,filmes,141,2
  • singer,4276,0,single,902,2,singersongwriter,893,10,swingers,161,2,singer_songwriter,147,11
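Each output line is a flat sequence of word, members, distance triples. A minimal sketch for reading such a line back into something more convenient could look like this; the Entry struct and parse_result_row are made-up names, and the simple split assumes that no tag contains a comma or quotes.

```ruby
# One (word, members, distance) triple from an output line.
Entry = Struct.new(:word, :members, :distance)

# Splits a result line into triples of three fields each.
def parse_result_row(line)
  line.strip.split(",").each_slice(3).map do |word, members, distance|
    Entry.new(word, members.to_i, distance.to_i)
  end
end

row = "singer,4276,0,single,902,2,singersongwriter,893,10," \
      "swingers,161,2,singer_songwriter,147,11"

entries = parse_result_row(row)
entries.each do |entry|
  puts "#{entry.word}: #{entry.members} members, distance #{entry.distance}"
end
```

Working with triples like this also makes it easy to spot the false positives by hand, such as single or swingers in the singer group.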

What we basically see is that when words are combined, as in wine and winelover, the community often becomes smaller but more specific, as in the examples of film and filmfestival or green and greenbusiness. Now it's time to collect the 100 most listed members from each community and see how the first 100 “big” general communities are linked with each other. I will try to cover this in the next blog post.

Cheers Thomas