In this blog post I want to talk about how to find people on Twitter who are interested in the “same things”. I have posted a number of entries about:
- How to create an ontology of users’ interests (https://twitterresearcher.wordpress.com/2012/04/16/5/ and https://twitterresearcher.wordpress.com/2012/03/16/a-net-of-words-a-high-level-ontology-for-twitter-tags/)
- How to scrape the seed users representing those interests from wefollow (https://twitterresearcher.wordpress.com/2012/02/17/how-to-make-sense-out-of-twitter-tags/), a popular Twitter directory
Today I want to go through the process of using the approximately 200 different keywords representing user interests (e.g. swimming, running, ruby, php, jazz, career, hunting, islam and so on) to find all of the relevant users who contribute heavily to these topics, i.e. who form the interest-based community.
Capturing the collective knowledge of Twitter
To capture the collective knowledge of Twitter I will make use of Twitter’s “list feature”, shown below:
As you can see, I am listed in a number of lists such as SNA, social media, dataviz and so on. People have created these lists to organize the accounts they follow into categories, similar to book lists on Amazon.
Having scraped the first 100 people for each of the 200 keywords in the last blog post (https://twitterresearcher.wordpress.com/2012/02/17/how-to-make-sense-out-of-twitter-tags/) and stored them in the database, I will use these people to find more lists that feature similar people for each keyword. Why? Mainly because Twitter doesn’t let you search for lists by name, and because the alternatives such as wefollow.com, twellow.com or listorious.com do not give you all the lists for a given search term. That is why I have to snowball through Twitter lists and keep those that are relevant for a given topic. This process consists of three parts:
- For each seed user, collect all the lists they are listed on. Keep only those lists that match the keyword and are thus relevant for the topic, and do some filtering
- From the remaining lists, collect every person on the list
- For each of those persons, count how often they are listed on the lists relevant for a certain field
This process is shown in the figure below:
1. Collecting lists
How do we collect lists? We start by checking whether we have enough Twitter API calls left. If so, we collect the list memberships for a given user and keep on paging until there are no more lists the user is listed on. Twitter can be a bit sensitive to the page size: it can be 1000 items at most, but in practice we get timeouts somewhere above 200-400 items, which is why the function adapts the page size dynamically. Collecting more than, let’s say, 10000 lists for a given user also does not make much sense, since we are probably wasting our API calls on a celebrity like aplusk. Once all of the lists the user is listed on have been collected, I store them in a CSV file and, most importantly, persist in my database those lists that contain the keyword the person was originally collected for. For example, if the seed user was in the category “swimming”, I only keep lists that include the keyword swimming in their name. Additionally, I make sure that a list I have already encountered is not added to the database twice.
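The filtering half of this step can be sketched in a few lines of Ruby. Note that `keep_relevant_lists`, the `:id`/`:full_name` hash keys and the `seen_ids` set are illustrative names of my own, not the actual Twitter API payload or my production code:

```ruby
require 'set'

# Keep only lists whose name mentions the seed user's keyword,
# skipping list ids we have already persisted (no duplicates).
def keep_relevant_lists(lists, keyword, seen_ids)
  lists.select do |list|
    relevant = list[:full_name].downcase.include?(keyword.downcase)
    fresh    = !seen_ids.include?(list[:id])
    seen_ids << list[:id] if relevant && fresh
    relevant && fresh
  end
end

seen  = Set.new
lists = [
  { id: 1, full_name: "best-swimming-tweeps" },
  { id: 2, full_name: "ruby-rockstars" },
  { id: 3, full_name: "Swimming Coaches" },
  { id: 1, full_name: "best-swimming-tweeps" } # spam copy of list 1
]
keep_relevant_lists(lists, "swimming", seen).map { |l| l[:id] } # => [1, 3]
```

The paging and rate-limit checks described above would wrap around this filter; they are omitted here because they depend on the concrete API client.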
2. Collect List members
Once I have collected tons of lists that match the category keywords, I collect all of the members listed on these lists. The code below is run for every list in the database. As you can see, I make sure that there are enough API calls left and then start collecting all of the members of the list. For this I am using delayed_job, a nice Ruby library (https://github.com/tobi/delayed_job) that allows me to wrap time-consuming tasks in a neat job that can then be run later, or distributed across multiple machines. I have had good experiences running around 10-15 workers on a single machine to process these jobs in the background. At the end of this step we end up with projects, each containing a high number of people who seem relevant for this user interest because they have been listed on lists for exactly this interest.
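The delayed_job pattern boils down to enqueuing any object that responds to `#perform`. The sketch below shows the shape of such a job; `CollectMembersJob`, the injected `client` and its `list_members(list_id, cursor)` call are hypothetical stand-ins for the real API client, not delayed_job itself:

```ruby
# A job object that pages through all members of one Twitter list.
# delayed_job only requires that the enqueued object respond to #perform.
CollectMembersJob = Struct.new(:list_id, :client) do
  def perform
    members = []
    cursor  = -1 # Twitter-style cursoring starts at -1
    loop do
      page = client.list_members(list_id, cursor) # hypothetical client call
      members.concat(page[:users])
      cursor = page[:next_cursor]
      break if cursor.zero? # a cursor of 0 means no more pages
    end
    members
  end
end

# With the real library you would enqueue it instead of calling it inline:
#   Delayed::Job.enqueue(CollectMembersJob.new(list_id, client))
```

Running 10-15 background workers then simply means each worker pops jobs like this off the queue independently.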
3. Creating projects with most listed persons
After steps 1 and 2 we have a number of potential candidates relevant for a given topic, but we are only interested in those that best represent the user interest. That is why we need a procedure that filters these people according to how often they have been listed for a certain topic. How do we do that? For each topic we have a number of lists, each naming the people who are relevant for that topic according to its creator. If we go through all the lists for a given topic and count how often each person is listed, we end up with the most relevant users for that topic. (As we will see later in part two, this process gives some nice results, but its accuracy can be greatly improved.) So what does the code below do?
This function is run on the projects that contain the people collected for a certain topic. First it loads all the persons into memory for faster computation. It then goes through all the lists we collected for the topic and checks again whether the list matches the topic keyword. If it does, it checks that we have not encountered this list before (which should not happen, since we made sure not to add lists twice during insertion, but double checking won’t hurt). If the list has members, it collects their usernames into an array and checks whether the seen_membersets (which are simply sets of usernames) already contain a set with exactly the same members. Why are we doing this? Because there is list spam on Twitter: people or bots copy lists and save them under a different name. If no previously seen list has the same members (99% of cases), we actually analyze the list; otherwise we drop it, because it is a copy of a list we have already encountered. For each list member we then check whether the person is among those we are computing list counts for; if so, we increment that person’s counter. If a person on the list has somehow not been captured before, we add them to our pool of persons and also increment their counter. At the end of this procedure we have a list count for each person who appeared on these lists and can directly see how relevant each person is to a certain topic. We output the sorted list counts into a simple CSV, which we will use later in part 2 to improve our accuracy.
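The core of this counting step, stripped of the database plumbing, can be sketched as follows. Here `list_counts` is an illustrative name and each list is reduced to a plain array of usernames:

```ruby
require 'set'

# lists: Array of member-username Arrays, one per Twitter list.
# Lists whose member set exactly matches an earlier one are treated
# as spam copies and skipped; every member of a kept list gets +1.
def list_counts(lists)
  counts    = Hash.new(0)
  seen_sets = Set.new
  lists.each do |members|
    member_set = members.to_set
    next if seen_sets.include?(member_set) # drop copied lists
    seen_sets << member_set
    members.each { |name| counts[name] += 1 }
  end
  counts.sort_by { |_name, count| -count } # most-listed persons first
end

# The third list is a copy of the first (same members) and is dropped:
list_counts([%w[aplusk JimCarrey], %w[aplusk tomhanks], %w[JimCarrey aplusk]])
# aplusk => 2, JimCarrey => 1, tomhanks => 1
```

In the real pipeline the sorted pairs would then be written to the CSV mentioned above.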
To show the result of this process, I have pasted a small part of the output of step 3 below. As you can see, we managed to find out that aplusk, JimCarrey, tomhanks and so on seem to be the most relevant Twitter users for the community of actors. The full list contains ~18,000 entries, where people towards the end represent the actor community far less than the people at the top of the list.
If we now take e.g. the top 100-200 people from each of these lists (assuming that people cannot manage relationships with more than this number of others, according to the Dunbar or Wellman number, http://en.wikipedia.org/wiki/Dunbar%27s_number), we end up with interest-based communities of Twitter users who share an interest.
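Cutting each sorted list down to a community is then a one-liner; the `community` helper and the cut-off of 150 are illustrative choices within the 100-200 range:

```ruby
COMMUNITY_SIZE = 150 # illustrative cut-off in the Dunbar range of 100-200

# sorted_counts: Array of [username, list_count] pairs, most-listed first,
# as produced by the counting step. Returns just the top usernames.
def community(sorted_counts, size = COMMUNITY_SIZE)
  sorted_counts.first(size).map { |name, _count| name }
end

community([["aplusk", 412], ["JimCarrey", 388], ["tomhanks", 301]], 2)
# => ["aplusk", "JimCarrey"]
```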
Studying those communities is what I am trying to do in my work, but more on that later. These communities are also interesting for advertisers: imagine a retailer who wants to sell swimwear; he would be highly interested if you could show him all the people on Twitter who are interested in swimming. Those people could be his first customer group, and if they approve of his swimwear, it is very likely that they will talk about it and thus inform other swimming enthusiasts about the product.
In part two I will show you how we can improve these communities by allowing people to move from one community to another if they “fit” better there.