A net of words (a high-level ontology for Twitter tags)

Knowing that people on Twitter form networks based on their interests, I have investigated the tags that are listed on wefollow (see below).


Since those tags are rather chaotic and in no particular order, except that they are ranked by number of followers, I wondered how others organise such interests. The most prominent websites offering such a service are peerindex.com and appinions.com (founded by CMU members). Both websites allow users to find influential users for a given interest.

Peerindex divides all topics into 8 different areas. On the left you can see my topical fingerprint in these areas.

  • AME – arts, media, entertainment
  • TEC – technology, internet
  • SCI – science, environment
  • MED – health, medical
  • LIF – leisure, lifestyle
  • SPO – sports
  • POL – news, politics, society
  • BIZ – finance, business, economics

Appinions offers 10 different categories, which roughly map to the categories used by PeerIndex (see below):

  • FASHION –> no equivalent

Question: How do we either a) map the tags from wefollow to the concepts above or b) create our own ontology of things?

What we are looking for is a kind of similarity between the semantic concepts that these tags stand for. For example, soccer is similar to football; those words can be considered synonyms. But what about other “relations”, such as cricket or hockey and football? We know that these words are not synonyms, but they are somewhat close to each other. If I were interested in football, I could probably also be interested in cars. To find these kinds of relations we need a database that contains semantic relations beyond synonyms. One great tool to use is WordNet. What is WordNet? I’ve cut and pasted the definition from their website:


WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download.

WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.

Armed with the knowledge contained in WordNet, we can start to see whether we can come up with a relation between engine and car. If you use their built-in browser you might find the following entries (see below). Note that I have unfolded the so-called hypernyms for those two words. Hypernym relations exist between synsets (groups of words that have a similar meaning). The definition of hypernyms, from the website:

The most frequently encoded relation among synsets is the super-subordinate relation (also called hyperonymy, hyponymy or ISA relation). It links more general synsets like {furniture, piece_of_furniture} to increasingly specific ones like {bed} and {bunkbed}. Thus, WordNet states that the category furniture includes bed, which in turn includes bunkbed; conversely, concepts like bed and bunkbed make up the category furniture. All noun hierarchies ultimately go up to the root node {entity}.

So if I enter soccer I get the so-called hypernymy tree that goes up to the root node entity (not shown). But we know soccer is a football game, and this is a field game.

And if I enter hockey I get:

If I enter rugby I get:

Ok, you get the idea; this leads to a tree that connects these concepts, as shown below.

If you went through these concepts by hand and noted down whenever two words connect at some level higher up in the hierarchy, you would end up knowing that those two words are somewhat similar. Of course, doing this by hand is tiresome. That’s why we will use Ruby and the gem rwordnet to computationally create such a tree for us.

require 'rubygems'
require 'wordnet'   # the rwordnet gem
require 'nokogiri'  # these three are used later for the Google lookup
require 'cgi'
require 'open-uri'

# Read the wefollow tags, one per line, stripping the newline.
words = []
File.readlines("groups.txt").each do |line|
	words << line.sub(/\n/, "")
end

index = WordNet::NounIndex.instance
file = File.open("output.csv", "w+")

words.each do |word|
	puts "Working on word: #{word}"
	wordnet = index.find(word)
	if wordnet != nil
		puts "#{wordnet.synsets.count} synsets found for #{word}"
		# Take the first (most frequent) meaning of the word.
		best_synset = wordnet.synsets.first
		last_word = word
		next_word = best_synset.hypernym
		# Walk up the hypernym chain, writing one SOURCE;TARGET pair per step.
		while next_word != nil && next_word.words.first != last_word
			file.puts "#{last_word};#{next_word.words.first}"
			last_word = next_word.words.first
			next_word = next_word.hypernym
		end
	else
		puts "Nothing found for #{word}"
	end
end

file.close

So what does this file do?

  • It reads in the 1500 group keyword tags that we collected from wefollow.
  • It then takes each word and checks whether WordNet has a meaning for it. (For some things like twitter, youtube, etc.… there are no entries…)
  • Since the result is ordered by the meaning’s frequency, we take the first meaning of the word (we will come back to this later…).
  • For this word we compute a tree of hypernyms by walking up the hierarchy as long as there are hypernyms left.
    • In each of these steps we note a pair: SOURCE – TARGET.
  • We dump this network to disk and visualize it with Gephi.
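Gephi’s spreadsheet importer expects a comma-separated edge list with Source and Target columns, so the semicolon-separated pairs from the script above can be converted with a few lines of Ruby. A minimal sketch (the file names are the ones used above; `convert_for_gephi` is a helper name I made up):

```ruby
# Convert the "SOURCE;TARGET" pairs written by the script above into a
# comma-separated edge list with the Source,Target header Gephi expects.
def convert_for_gephi(in_path, out_path)
  File.open(out_path, "w") do |out|
    out.puts "Source,Target"
    File.readlines(in_path).each do |line|
      pair = line.chomp.split(";")
      out.puts pair.join(",") if pair.size == 2
    end
  end
end

convert_for_gephi("output.csv", "edges.csv") if File.exist?("output.csv")
```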

So after doing this and visualizing it with Gephi you get a tree that looks like the one above. But there was a problem with finding the most frequent meaning of a word. For example, for the word “poker” people these days would think of the card game and not a fire hook.


Google for Frequencies

Since I think WordNet computed the frequencies for these words from some book corpus that might be outdated, the frequencies might be outdated too. So I needed to find a way of determining which meaning is more popular today. And I thought: why not use Google search? The more results you find for the combination of the word with its so-called “gloss” (an informal definition of the concept), the more reasonable it is to assume that this is the meaning that users on Twitter had in mind when entering this keyword. So I changed the listing above a bit, replacing the part that chooses the best synset.

		# Replaces the "best_synset = wordnet.synsets.first" line above
		# (nokogiri, cgi and open-uri are already required at the top).
		max = 0
		best_synset = wordnet.synsets.first
		wordnet.synsets.each do |synset|
			searchterm = "#{word} #{synset.gloss}".split.map { |w| CGI.escape(w) }.join("+")
			site = Nokogiri::HTML(open("http://www.google.ch/search?q=#{searchterm}"))
			r = site.css("#subform_ctrl div").children.last.content.to_s
			results = r.gsub(/[^0-9]/, "").to_i
			puts "Found #{results} results for gloss #{synset.gloss}"
			if results > max
				max = results
				best_synset = synset
			end
		end

So now the program scrapes from Google search how many results it found, and the concept with the highest number of results wins.


So finally, after going through all this, what does the output look like, and is it of any help in organising our word tags?

The global view shows that the network looks more like a tree with long, thin arms fading out. We can recognize some main concepts: a lot of tags have been unified under “somebody”, so Twitter is about persons, and a lot of tags have been subsumed under “activity”, so it is about what people are doing. If you want to dig through the network yourself, I’ve attached it to this post. Feel free to download it.


So what’s next? If we want to find out how similar two words are, all I have to do now is see how many steps it takes to find a connection between them. In the case of hockey, soccer and rugby that distance would be quite small, but in the case of gardening and rapper it would be quite large. Remember that this ontology was created from WordNet, and therefore the distance between concepts depends on that ontology. But what if we look these communities up on Twitter and see how close they really are? That’s something we will do in the next blog post.
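The step-counting idea above is just a breadth-first search over the SOURCE;TARGET pairs we dumped to disk. A minimal sketch, treating the hypernym pairs as an undirected graph (the function names `load_graph` and `distance` are mine, and the edge-list file is assumed to be in the format the script writes):

```ruby
require 'set'

# Build an undirected adjacency list from the "SOURCE;TARGET" pairs.
def load_graph(path)
  graph = Hash.new { |h, k| h[k] = Set.new }
  File.readlines(path).each do |line|
    a, b = line.chomp.split(";")
    next unless a && b
    graph[a] << b
    graph[b] << a
  end
  graph
end

# Breadth-first search: number of hops between two words, or nil if
# they live in disconnected parts of the tree.
def distance(graph, from, to)
  return 0 if from == to
  seen = { from => 0 }
  queue = [from]
  until queue.empty?
    node = queue.shift
    graph[node].each do |neighbour|
      next if seen.key?(neighbour)
      seen[neighbour] = seen[node] + 1
      return seen[neighbour] if neighbour == to
      queue << neighbour
    end
  end
  nil
end
```

With the network from above loaded, `distance(graph, "hockey", "soccer")` should come out small, while `distance(graph, "gardening", "rapper")` should be large or nil.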

And finally, I still owe you the ordering of these words into the broader categories from the other providers. My idea for how to do this would be to include these category concepts in the network, see how far each of our keywords is from them, and choose the category that is closest. This has some implications: for example, a concept with high centrality (such as media) would win the majority of words. And I have to ask myself whether I want a constraint that each category should contain a similar number of words. If yes, I have to think about how to solve that. Maybe you have some ideas?
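The nearest-category idea can be sketched in a few lines as well. This is only an illustration: the anchor words I use for the categories below are made up, not PeerIndex’s actual seeds, and `hops_from` / `nearest_category` are hypothetical helper names:

```ruby
require 'set'

# BFS distances from a single start node over the undirected
# "SOURCE;TARGET" edge list written by the script above.
def hops_from(path, start)
  graph = Hash.new { |h, k| h[k] = Set.new }
  File.readlines(path).each do |line|
    a, b = line.chomp.split(";")
    next unless a && b
    graph[a] << b
    graph[b] << a
  end
  dist = { start => 0 }
  queue = [start]
  until queue.empty?
    node = queue.shift
    graph[node].each do |neighbour|
      next if dist.key?(neighbour)
      dist[neighbour] = dist[node] + 1
      queue << neighbour
    end
  end
  dist
end

# Assign a keyword to whichever category anchor is fewest hops away.
# anchors maps a category code to an illustrative anchor word,
# e.g. { "SPO" => "sport", "AME" => "media" }.
def nearest_category(path, keyword, anchors)
  best = nil
  best_d = Float::INFINITY
  anchors.each do |category, anchor|
    d = hops_from(path, anchor)[keyword]
    next if d.nil? || d >= best_d
    best_d = d
    best = category
  end
  best
end
```

Note that this sketch makes the centrality problem visible: an anchor sitting near the hub of the tree reaches most keywords in few hops and will soak up the majority of them.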