In the last blog post last week (https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/) I have described my way of collecting people on Twitter that are highly listed on lists for certain keywords such as swimming, running, perl, ruby and so on. I have then sorted each of those persons in each category according to how often they were listed in each category. This lead to lists like these below, where you see a listing people found on list that contained the word “actor”.
We might say this is a satisfactory result, because the list seems to contain people that actually seem relevant in regard to this keyword. But what about the persons that we collected for the keyword “hollywood”. Lets have a look:
If you look at the first persons you notice that a lot of these people are the same. Although in my last attempts (https://twitterresearcher.wordpress.com/2012/04/16/5/ and https://twitterresearcher.wordpress.com/2012/03/16/a-net-of-words-a-high-level-ontology-for-twitter-tags/) I tried hard to find keywords that are semantically related such as “car” and “automotive”, the list of user interests ended up having some examples like “actor” and “hollywood”. What are we going to do about this prolem? My solution is to merge those two lists into one since it seems to cover the same interest. But how do I do this without having to subjectively decide on each list?
First step: Calculating number overlapping members between lists
An idea is to calculate how often members from one list appear on other lists. The lists that have a high overlap will be then merged into one list and the counts that those people received will be added up. The new position on the list will be then determined by the new count. We will need two parameters: the maximum number of persons that we want to look at in each list (i simply called it MAX) and a threshold percentage of % of similar people which decides when to merge two lists. If we merge two lists “actor” and “hollywood” into “actor_hollywood” we also want to run this list against all remaining keywords such as “tvshows” and also merge it with them if the criteria s are met, resulting in “actor_hollywood_tvshows”. The result is a nice clustering of the members we found for our interests. Although these interests have different keywords, if they contain the same members they seem to capture the same semantical concept or user interest. The code to perform this is shown below:
For further processing the code also saves which concepts it merged into which keys and also makes sure that if we merge 200 people from one list with 200 from another list we only take the first 200 from the resulting list.
What does the result look like? I’ve displayed the resulting merged categories using a threshold of 0.1 and the checking the first 1000 places for overlap.
Below you see the final output where I have used a threshold of 0.2 and looked at only the first 200 users in each list. Regarding the final number of communities there is a trade off: When setting the threshold too low we end up with “big” user interest areas where lots of nodes are clumped together. When having a too high threshold, it seems like the groups that obviously should be united (e.g. “theater” and “theatre” ) won’t be merged. I have had good experiences with setting the threshold to 0.2 which means that groups that share 20% of their members are merged into one.
Second step: Allowing members to switch groups
The results of the above attempts are not bad they can be improved. Why ? Well imagine your name was in the actors category which got merged with drama, hollywood, tv_shows and you ended up having the 154th place in this category. This is not bad, but it might be that people actually think that you are more of a “theatre” guy and that is why in the category of theatre you rank 20th. Although knowing that a person can belong to multiple interest groups, if I were to chose the one that best represents you I would say that you are in the theatre category because you ranked 20th there, while only ranking 154th in the actor category.
So this means that I am comparing the rankings that you achieved in each cateogory. But I could also compare the total number of votes that you received on each list. If I did that you would end up being in the actor category because the total number of lists for this category is much higher than for theatre, and the 200 votes received by somebody on the 154th place in the actor category are higher than the 50 votes received by the same person on the 20th place in the theatre category. I have chosen to go with the ranking method, because it is more stable in regard to this problem. Popular interests do not “outweigh” the more specific ones, and if a person can be placed in a specific category then it should be the specific one and not the popular one. The code below does exactly this. Additionally it also notes for each person how often this person was also part of other categories, but the person gets assigned to the category where it got on the higher place.
There is also a small array called final_candidates that is used to put exactly 100 persons in each category at the end. What does the output look like? In most of the cases it leaves the persons in the same category, but in some cases people actually switch categories. These are the interesting cases. I have filtered the output in Excel and sorted it by the number of competing categories, to showcase some of the cases that took place. You notice that e.g. the “DalaiLama” started in the “yoga” category but according to our algorithm (or actually the people’s votes) he fitted more into “buddhism”, or “NASA” started in “tech” but was moved to “astronomy”, which seems even more fitting.
To provide an idea how often this switcheroo took place I have created a simple pivot table listing the average value of competing categories per category (see below). We see that for the majority of categories their people don’t compete for other categories (right side of the chart), but maybe for a handful of categories their people compete for other categories (left peaks of the chart). What you also notice on this graph, is that the lower the threshold, the smaller the final groups, but these groups have a smaller cometing average count (e.g compare violet line size:1000, threshold 0.1 vs. geen line size 1000 threshold 0.2). What you also see is that if we consider only the first 200 places vs. the first 1000 places we get actually better results (compare violet line with red line). This is a bit counter intuitive. Since I was thinking the that the more people we take into consideration the better the results. It rather turns out that after a certain point this voting mechanism seems to get “blurrier”. People getting voted on the 345th place somewhere don’t really matter that much, but eventually they lead to merging these categories together, which shouldn’t have had been merged.
No matter which threshold and size we use there are always a couple of groups that always seem “problematic” (aka the high peaks in the chart on the left) where it seems hard for people to decide where these people belong to. Below I have provided an an excerpt for group size 200 and threshold 0.2. For people in these categories it seems really hard to “pin” them down to a certain interest.
- Category Name, Average competing categories for group
- tech 1.871287129
- comedy_funny 1.693069307
- developer 1.603960396
- recipes_cooking 1.554455446
- magazine 1.544554455
- food_chef 1.544554455
- tvshows_drama_actor_hollywood 1.534653465
- politics_news 1.524752475
- finance_economics 1.524752475
- mac_iphone 1.514851485
- teaching 1.465346535
- director 1.465346535
- liberal 1.465346535
- ipad 1.455445545
- healthcare_medicine 1.435643564
For the rest of the groups we get very stable results. These interest groups seem to be well defined and people don’t think that those people belong to other categories:
- hockey 1
- army_military_veteran 1
- composer 1
- rugby 1
- piano 1
- astrology 1
- wedding 1
- dental 1
- wrestling 1
- linux_opensource 1
- skiing 1
- perl 1
- golf 1
- accounting 1
For these remaining interest groups we will now take a look at their internal group structure, looking how e.g. opinon leaders (people being very central in the group) are able to get a lot of retweets (or not). Additionally we will take a look on how there are people between different groups (e.g. programming languages ruby and perl) that work as brokers or “boundry spanners”, and if these people are able to get retweets from both communities or only one or none at all. For questions like these these interest groups provide an interesting data source.