User interests ontology


I’ve been blogging for a while now about the idea that people form networks based on their interests. As you may remember, we used the tags on wefollow to find out what people on Twitter are interested in.

In the last post I showed how to create a post-hoc ontology from the tags that we collected on wefollow, which represent people’s interests. Yet the results of this attempt were rather mediocre:

  • We found that people like to tag a lot of users with words describing what “somebody” is, like rapper, artist, celebrity and so on.
  • We found that people like to tag a lot of Twitter users based on a certain activity, like swimming, running, hacking, dancing, cooking and so on. But apart from this, I was still lacking any insight into the “bag of words” that I got from wefollow.
  • We also found that rugby is similar to soccer, and those two are similar to cricket, because these are all field games, and so on.

In another attempt I also tried to find out which of the keywords on wefollow are somewhat similar, simply by looking for words that sound the same or are spelled similarly. The results were interesting:

  • We found that people like to use different keywords with different popularity to describe “kind of” similar things. For example: film,5687,0; filmmaker,2842,5; filmmaking,843,6; films,797,1; farm,312,2; filmfestival,223,8; fire,162,2; fly,151,2; filmes,141,2
  • Most of these keywords share the word film, but apparently simply tagging users with the word film yields the highest count.
  • The same holds for other words: singer,4276,0; single,902,2; singersongwriter,893,10; swingers,161,2; singer_songwriter,147,11. Here singer is the most popular keyword to tag people, followed by single and so on.
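The post does not say which similarity measure produced these lists. The third value in each entry happens to match the Levenshtein edit distance to the first keyword (film to films is 1, film to farm is 2), so here is a minimal sketch under that assumption. The helper name similar_keywords and the cut-off of 2 are made up for illustration.

```ruby
# Classic dynamic-programming edit distance (Levenshtein).
# Assumption: the post does not name the measure it actually used.
def levenshtein(a, b)
  # prev[j] = distance between the processed prefix of a and the first j chars of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: keep the candidates within max_distance of the seed,
# closest first.
def similar_keywords(seed, candidates, max_distance)
  candidates
    .map { |word| [word, levenshtein(seed, word)] }
    .select { |_, distance| distance <= max_distance }
    .sort_by { |word, distance| [distance, word] }
end

similar_keywords("film", %w[films farm filmmaker], 2).each do |word, distance|
  puts "#{word},#{distance}"
end
```

A small cut-off keeps near-duplicates such as films, but it also lets in unrelated words such as farm, which is the kind of noise visible in the lists above.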

User interest providers

Yet despite those two attempts we are still lacking insight into the interests of Twitter users. What I am looking for is some kind of hierarchy between those words, not so much as in the WordNet approach (see above) or the word similarity approach, but rather an ontology-based approach where we split up the users’ interests into, let’s say, 6-12 high-level categories and put our keywords into those. We have tried two different approaches; now it’s time to get an overview of the other options.

The table below is a comparison of providers that have chosen to categorize user interests. As you can see, the approaches differ in how many keywords are used and in whether the ordering is hierarchical (as in dmoz or yahoo) or simply some sort of folksonomy (as in delicious). The first two providers are commercial and do not offer any subcategories or networks, but give us some clue about the number of top-level categories. In general we can say that most of these “interest directories” seem to contain the same top-level categories (which is great because it seems we can agree on something). Apart from that, I think that dmoz or yahoo give us the best chance to order our keywords in a reasonable manner.

So, having settled on the yahoo directory (since it is the most comprehensive, contains the largest number of subcategories and is curated by paid professionals) to find out more about how to order users’ interests, it’s time to take a look at the directory itself.


A screenshot of the Yahoo Directory

Since there is no API, I have chosen to scrape the first 2-3 levels of their directory and save them to a file (I used nokogiri). You will find the listing below. It goes through each of the top levels defined beforehand (we are skipping new additions, subscribe via rss and regional) and looks at each of the links there. It notes how many subcategories a category contains (the number in brackets) and whether a link points towards another category, in which case it contains an @-sign. After going through all of those links it writes them down in a simple manner:

Topcategory, Subcategory, Count

This way we get a network of categories.
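To make the row format concrete, here is a small sketch of how one directory entry could be turned into such a row. The label format ("Card Games (42)", with an optional @ marking a link to another category) is an assumption about how the entries look, and parse_entry, to_row and the Entry struct are hypothetical names that are not part of the scraper.

```ruby
# Hypothetical representation of one directory entry.
Entry = Struct.new(:name, :count, :cross_link)

# Assumed label format: "Card Games (42)" or "Chess@ (17)".
# The count is optional and defaults to 0 when it is missing.
def parse_entry(label)
  match = label.strip.match(/\A(?<name>.*?)(?<at>@)?\s*(?:\((?<count>\d+)\))?\z/)
  name = match[:name].strip.downcase.gsub(" ", "_")
  count = match[:count] ? match[:count].to_i : 0
  Entry.new(name, count, !match[:at].nil?)
end

# Format one row as "Topcategory, Subcategory, Count".
def to_row(top_category, entry)
  "#{top_category}, #{entry.name}, #{entry.count}"
end

puts to_row("recreation", parse_entry("Card Games (42)"))
puts to_row("recreation", parse_entry("Chess@ (17)"))
```

Keeping the @ flag as a separate field, rather than leaving it in the name, makes it easier to decide later which entries are allowed to link different groups.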

Scraping the categories

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Top-level categories of the directory to crawl.
topdomains = ["business_and_economy", "recreation", "computers_and_internet", "reference", "education", "regional", "entertainment", "science", "government", "social_science", "health", "society_and_culture"]

# Load the category names recorded in earlier runs (if the file exists).
@seen_words = []
if File.exist?("seenwords.csv")
  File.readlines("seenwords.csv").each do |line|
    @seen_words << line.chomp
  end
end
@seen_words_file = File.open("seenwords.csv", "a+")

# Write one edge "father son" to the file of the current domain.
# A name that is already known and is not an @-link is prefixed with its
# father, so generic names stay distinct per category.
def write_net(father, son)
  if @seen_words.include?(son) && !son.include?("@")
    @file.puts "#{father} #{father}_#{son}"
  else
    @file.puts "#{father} #{son}"
  end
  @seen_words_file.puts son
end

# Download and parse one directory page.
def fetch_page(path)
  Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{path}"))
end

i = 0
topdomains.each do |domain|
  @file = File.open("#{domain}.csv", "w+")
  puts "starting domain #{domain}"
  site = fetch_page(domain)
  site.css("div.cat li a").each do |link|
    first_level_link = link.content.gsub(" ", "_").downcase
    write_net(domain, first_level_link)
    puts "working on #{first_level_link}"
    # Links with an @ point to a category that lives elsewhere in the directory.
    if first_level_link.include? "@"
      first_level_link.gsub!("@", "")
      sub_site = fetch_page(first_level_link)
    else
      sub_site = fetch_page("#{domain}/#{first_level_link}")
    end
    sub_site.css("div.cat li a").each do |sub_link|
      i += 1
      puts i.to_s
      second_level_link = sub_link.content.gsub(" ", "_").downcase
      write_net(first_level_link, second_level_link)
      if second_level_link.include? "@"
        second_level_link.gsub!("@", "")
        sub_sub_site = fetch_page(second_level_link)
      else
        sub_sub_site = fetch_page("#{domain}/#{first_level_link}/#{second_level_link}")
      end
      sub_sub_site.css("div.cat li a").each do |sub_sub_link|
        third_level_link = sub_sub_link.content.gsub(" ", "_").downcase
        write_net(second_level_link, third_level_link)
      end
    end
  end
  # Close each domain file as soon as its crawl is finished.
  @file.close
end
@seen_words_file.close

Visualizing the network

Having downloaded the network, we end up with something that we can visualize in gephi (see below). The visualization is nice, since it lets us see which fields the links with the @-sign connect. We can see clusters emerge between different concepts and see that most of the second-level categories are not connected to the rest. As you will note in the listing above, I have also made sure not to include subcategories like “organisation” or “people” under their plain names, since every category contains such a subcategory and it would end up being the most central node in our network. Instead each such subcategory gets an explicit name, e.g. “sports_organisations”, and only the categories with an @ are allowed to link other groups.
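To get the scraped edges into gephi, one option is to convert the "father son" lines into a CSV edge list. The sketch below assumes that gephi's spreadsheet import reads Source and Target columns, and that the @ should be stripped so that a cross-link points at the same node as the category itself. Both are my assumptions, and to_gephi_rows and csv_field are hypothetical helpers.

```ruby
# Quote a field only when it contains a comma or a double quote.
def csv_field(value)
  value.match?(/[",]/) ? "\"#{value.gsub('"', '""')}\"" : value
end

# Turn "father son" lines (the format the scraper writes) into CSV rows.
# Malformed lines are skipped and duplicate edges are written only once.
def to_gephi_rows(lines)
  edges = lines.map(&:split).select { |parts| parts.length == 2 }
  rows = edges.map do |father, son|
    "#{csv_field(father)},#{csv_field(son.delete('@'))}"
  end
  ["Source,Target"] + rows.uniq
end

# Made-up sample edges for illustration.
lines = ["recreation games", "games chess@", "science chess@"]
puts to_gephi_rows(lines)
```

With the @ removed, chess becomes a single node reachable from both games and science, which is what makes the cross-links show up as bridges between clusters.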

The downside of this approach is that the result is pretty big and creates even more confusion than our keywords from wefollow. Now we have a network with approximately 7,000 nodes and 20,000 edges. We could now search for each of the wefollow keywords, see where we can find it, and then drop the rest. This idea is not bad, but we would be neglecting the majority of the great insights that the yahoo directory gave us. If, for example, wefollow does not contain keywords regarding health, does it mean that Twitter users are not interested in health issues, or did we not look properly? Therefore I decided to take a hybrid approach: I will cut down the yahoo directory to only the categories that contain a lot of entries, and at the same time see how the 200 most frequent wefollow keywords fit into this ontology.
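The two halves of the hybrid approach can be sketched roughly like this. The data, the threshold of 1000 entries and the helper names prune_directory and match_keywords are all made up for illustration; the real cut was done by hand.

```ruby
# Keep only the categories with at least min_entries entries.
def prune_directory(category_counts, min_entries)
  category_counts.select { |_, count| count >= min_entries }
end

# Split the keywords into those that exist as a category and those that do not.
def match_keywords(keywords, categories)
  found, missing = keywords.partition { |keyword| categories.key?(keyword) }
  { found: found, missing: missing }
end

# Hypothetical inputs: category => number of entries, and a few keywords.
category_counts = { "music" => 5400, "health" => 3100, "chess" => 40 }
keywords = %w[music youtube chess]

pruned = prune_directory(category_counts, 1000)
result = match_keywords(keywords, pruned)

puts "kept categories: #{pruned.keys.join(', ')}"
puts "keywords found: #{result[:found].join(', ')}"
puts "keywords missing: #{result[:missing].join(', ')}"
```

Note that health survives the pruning even though no keyword matches it, while keywords without a category end up in the missing list and can then be placed by hand.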

A mind map of user interests

The result is a mind map of user interests, rather than a network, since I’ve chosen to write it down by hand in order to be able to change small things. For example, I would like to exclude the keywords that link different top domains together and have a tree instead. Additionally, I’ve decided to mark the words I have included from wefollow with a “wefollow: ” prefix in order to make the process more transparent for everybody. The result shows that quite a lot of the keywords that we found on wefollow were already part of the existing yahoo directory. Although the directory was quite big, it did not contain a number of newer words such as “youtube”, “podcast” and so on. Additionally, concepts like “animals and pets” were added by me by hand, since they sit at a very deep level in the yahoo ontology (Science / Zoology / Animals / …) but are actually quite popular among Twitter users. So below you see the result of my work. This mind map represents a hybrid of the 200 most frequent wefollow keywords and the most popular yahoo categories. I am quite happy with the result, since it seems to be useful in describing the bag of words I had before.

I am currently collecting the communities of those users on Twitter in order to analyze them and will keep you updated about the progress.

That’s it for today.
Cheers
Thomas

About plotti2k1

Thomas Plotkowiak works at the MCM Institute of the University of St. Gallen, in the Social Media and Mobile Communication group. His PhD research in Social Media examines how the structure of social networks like Facebook and Twitter influences the diffusion of information. His main focus of work is Twitter, since it allows public access (and has a nice API). Make sure to also have a look at his recent publications. Thomas graduated in 2008 in Computer Science and Economics at the University of Mannheim and was involved at the computer science institutes for software development and multimedia technology: SWT and PI4. During his studies he focused on Artificial Intelligence, Multimedia Technology, Logistics and Business Informatics. In his diploma/master thesis he developed an ad hoc p2p audio engine for 3D games. Thomas was also a researcher for a year at the University of Waterloo in Canada and at Macquarie University in Sydney. He was part of the CSIRO ICT research group. In his free time Thomas likes to swim in his local lake (Drei Weiher), run and enjoy hiking in the Appenzell region. Otherwise you will find him coding ideas he recently had or enjoying a beer with colleagues in the MeetingPoint or Schwarzer Engel.