
DGPUK Social TV: Neue Kommunikationsformen, neue Öffentlichkeit?

A short presentation I gave at the DGPUK 2014.

Abstract (translated from German): The public sphere created by classical mass media such as television, radio and newspapers has changed considerably in recent years. The digitization of distribution channels has not only multiplied the available offerings but also enabled new forms of use (e.g. time-shifted and mobile television). While these developments suggest an increasing segmentation of the audience, the digitization of follow-up communication has also created new ways of connecting these partial publics. For television, follow-up communication, which used to be limited to the family, the workplace or school, or the regulars’ table at the pub, now takes place for example via Facebook, Twitter and dedicated social TV applications. The considerable number of tweets during TV broadcasts such as the American Super Bowl and the Oscars, or the German chancellor debate, has recently attracted wider attention. Alongside the established currency of reach, both experts and the general public seem increasingly interested in new forms of participation and are looking for metrics that quantify the activity of recipients. So far, however, it is unclear whether this is a parallel currency determined solely by reach, or whether it expresses new aspects of TV use. The present research project addresses this open question and examines the relationship between audience figures and recipient activity on social media channels.


Long Time no Update

It’s been a while since the last update. Here is the gist: I finished my PhD and am now working at the University of Zürich and at the IaKom Institute. I will try to share interesting insights from my new work.



Dimensions of Social Capital

Thinking about social capital can be a bit tricky because it has been viewed from so many perspectives and dimensions. In this blog post I will introduce the individual and group perspectives on social capital and the bonding and bridging dimensions. I will then show how Stephen Borgatti unites these views in a matrix (here is his paper on it) and then try to extend his ideas.

Individual social capital vs. Group social capital

The first problem I had with social capital is that it can be found and analyzed at different hierarchical levels: at an individual and a collective level, and at a micro, meso and macro level. At the micro level, relations between individuals, households or neighborhoods are analyzed. The meso level covers municipalities, institutions and organizations. The macro level deals with whole regions and nations.

This multi-level view leads to a bit of confusion: theorists debate whether social capital is a private good, where individuals invest in forming relationships so they can access the resources of others, or a public good, such that everybody belonging to a social group with social capital may enjoy its benefits. While, for example, Putnam’s work (Bowling Alone) describes social capital as a quality of groups or societies, the work of Burt (Brokerage and Closure) or Lin describes social capital for individuals. So it remains unclear whether social capital is a good of an individual or of a group.

Bonding (internal) vs. bridging (external) social capital

The second problem is that some people highlight its bonding nature while others focus on its bridging attributes. Much of the discussion of these two perspectives, bonding and bridging social capital, has already been captured in the views of Coleman and Burt (Brokerage and Closure), who have highlighted different sources of social capital.

While in Coleman’s view social capital mostly results from internal group closure, in Burt’s view it results from exactly the opposite group mechanic, namely structural holes. While closure corresponds to creating internal ties with group members, the structural holes theory corresponds to creating ties to members outside the group. So what is social capital? Well, we can either decide that we don’t like this theory anymore or try to somehow combine these dimensions into one framework. This is what Stephen Borgatti tried.

Borgatti’s Social Capital Matrix

Borgatti says that the discussion on social capital mostly suffers from differing perspectives on the group concept. In most cases the group “has been implicitly conceived as a universe, nothing outside the group is considered.”[p.3] Adler and Kwon come to similar insights in their review of theoretical views on social capital, stating that “external ties at a given level of analysis become internal ties at the higher levels of analysis and conversely, internal ties become external at the lower levels”[p.35].

Consequently we have to accept the duality of bonding and bridging social capital and see groups as actors embedded in their own social environments. The logical conclusion is to combine the individual vs. group dimension with the inside (bonding) vs. outside (bridging) dimension. This can be demonstrated nicely with the example of a work department: to look at the group level internally, one would analyze the working relationships among the members of the department; looking at the individual level, one would analyze the individual ties of the members of the group. But looking beyond this department, one would have to analyze the relationships that the department has with other departments. Thinking in network terms, an internal view looks at the relationships within the group, while an external view looks at the structure of the group’s relationships to outsiders. This leads to a four-fold classification matrix, which has been suggested by Borgatti.

Social Capital Matrix of Borgatti

Using Borgatti’s original interpretation, quadrant A should be left empty, since there is no way to analyze the individual’s internal ties. This would correspond to analyzing the inner workings of an individual, such as the networks formed in his brain. I argue that the internal focus might also be used to describe the individual’s focus inside the group. This means defining the internal focus as a focus inside the group and the external focus as a focus outside of the group. Using this logic, the social capital that can be harvested in quadrant A depends on the individual’s ties within the group. This type of thinking corresponds with popular concepts postulated in, e.g., the works of Coleman, which say that the bonding social capital in school classes benefits weak pupils.

Quadrant B corresponds to the individual social capital a person acquires by maintaining external ties outside of the group. This view has been predominantly described in the works of Ronald Burt, which say that managers benefit from brokerage between departments. Quadrant C describes the collective-good idea of social capital as described by Putnam, i.e. social capital as a public good (e.g. the more the people of a country are connected, the better for everybody). Finally, quadrant D describes the potential social capital that a whole group can acquire by maintaining ties to other groups. Mainly because of the survey-based data collection methods available in the past, this concept has rarely been operationalized and has only marginally been explored in the social capital literature.

Extended version of the social capital matrix

If we now take this matrix a bit further and think of the different network “zoom levels”, we can create a sort of recursive definition of Borgatti’s matrix, where quadrant D at the inner level becomes quadrant A in the next version of the matrix if we “zoom out”. I have depicted this concept in the figure below and described it using a company as an example. (Note that I have rotated the original matrix by 90°.) So at the highest zoom level (which is the top of the figure) the individual is the person, and the department is the group.

Extended social capital matrix, with three different zoom levels on the example of a company.

The top left cell describes individual bonding social capital and deals with the position of the person in the department. The top right cell describes the group’s bonding social capital as a whole. The lower left cell describes how an individual creates bridging social capital by connecting two departments. Finally the lower right cell describes how the department as a whole creates bridging social capital, by being centrally connected to other departments in the company.

This brings us to the next “zoom level”, where the department becomes the individual unit of analysis and the company becomes the group concept, which is the boundary of the system. So at this zoom level we focus on how different departments are connected with each other inside the company.

If we zoom out again, the whole company becomes the unit of analysis and, for example, the country becomes the group concept. This brings us to studies of how different companies connect with each other and how this benefits them from a social capital perspective. Finally, if we were to zoom out once more, the whole country becomes the unit of analysis, bringing us to studies on a global level which analyze, for example, how different countries trade with each other.

As a conclusion I think that the social capital matrix is very handy when we try to conceptualize a network perspective on things. It helps to unite the bridging and bonding and the individual and group concepts of social capital and reminds us that the “group” concept repeats over and over again only on higher levels, yet the questions remain the same.

If you liked that post please forward it on Twitter, reblog it or leave a comment. I would love to hear what you think about it.

Great blog entry for anyone working at the intersection of social science, networks and software development.


Graph theory and network science are two related academic fields that have found application in numerous commercial industries. The terms ‘graph’ and ‘network’ are synonymous and one or the other is favored depending on the domain of application. A Rosetta Stone of terminology is provided below to help ground the academic terms to familiar, real-world structures.

graph     network  brain    knowledge  society  circuit   web
vertices  nodes    neurons  concepts   people   elements  pages
edges     links    axons    relations  ties     wires     hrefs

Graph theory is a branch of discrete mathematics concerned with proving theorems and developing algorithms for arbitrary graphs (e.g. random graphs, lattices, hierarchies). For example, can a graph with four vertices, seven edges, and structured according to the landmasses and bridges of Königsberg have its edges traversed once and only once? From such problems, the field of graph theory has developed numerous algorithms that can be applied to any graphical…


On the weakness of weak ties

A few months ago I wrote a blog post (https://twitterresearcher.wordpress.com/2012/01/17/the-strength-of-ties-revisited/) investigating tie strengths on Twitter and their influence on retweets. Well, it turns out that my analysis was lacking a lot of detail, so I redid it, considering more aspects than before. So let’s get started.


The data that I am using for this analysis is the following: each group consists of 100 people who have been highly listed for a given topic on Twitter, e.g. snowboarding or comedy or any other topical interest that people have on Twitter. There are 170 such groups, each consisting of exactly 100 members (you can read how I created these groups in my recent blog posts here https://twitterresearcher.wordpress.com/2012/06/08/how-to-generate-interest-based-communities-part-1/ and here https://twitterresearcher.wordpress.com/2012/06/12/how-to-generate-interest-based-communities-part-2/). In an abstract way you can imagine the structure of the network to look something like this:

The graphic above indicates that we only have the friend-follower ties on Twitter between those people. But indeed there are quite a few more ties between people, resulting in a multiplex network between them. This network consists of three layers:

  1. The friend-follower ties
  2. The @interaction ties (whenever a user mentions another user this corresponds to a tie)
  3. And finally the retweet ties (whenever a user retweets another user this corresponds to a tie)

Schematically this looks something like this:


Now when we think about ties between those people, especially with regard to tie strength, we can come up with a couple of different definitions of ties (I mentioned a couple of those in my blog post here https://twitterresearcher.wordpress.com/2012/05/24/tie-strength-in-twitter/):


  • No Tie: There are no ties between those people, neither in the friend and follower network nor in the @interaction network.
  • Non-reciprocated-friend-follower-tie: Person A follows a person B in the friend and follower network. Person B does not follow person A.
  • Reciprocated-friend-follower-tie: Person A follows person B. Person B follows person A.
  • Non-reciprocated-@-interaction-tie: Person A mentions person B EXACTLY one time. Person B does not mention person A.
  • Reciprocated-@-interaction-tie: Person A mentions person B EXACTLY one time. Person B mentions person A at least one time.

Valued ties:

  • Interaction tie with strength x: Person A mentions person B EXACTLY X times. (e.g. tie of strength 10 would mean person A has mentioned person B 10 times)

Bridging vs. bonding ties:

  • Bridging ties: We call bridging ties all of those ties that are BETWEEN groups (see schematic network graphic above the ties in red)
  • Bonding ties: We call bonding ties all of those ties that are INSIDE groups (see schematic network graphic above the ties in black)
  • Notice that our definition of bridging and bonding ties might differ a bit from the pure network perspective, where maybe by definition bonding ties would have to have a certain strength, reciprocity and so on. Here we rather take the underlying groups, that we created artificially, but which represent nicely users that strongly share a certain interest.

Research Question:

Having all those definitions of ties, we can now come up with a number of observations regarding the information diffusion between those people. The information diffusion is captured in the retweet network (see the third layer in the schematic graphic) and the corresponding ties. In general we want to look at how the different tie types affect the information diffused (retweets) between those people.

Analysis per Group:

To get an overview of the data, I will first look at how many retweets in total have been exchanged between the analyzed groups. I count how many retweets took place inside the group (blue) and between the groups (red). Each of the 170 groups is shown below:

Approximately 214,000 retweets took place between groups (red) and 414,000 retweets took place inside the groups (blue). In the graphic above we can clearly see the differences between the interest groups. I’ve ordered the groups in ascending order of retweets inside the community, which lets us see that some groups focus mostly on retweets inside the group (e.g. tennis or astronomy_physics), while other groups get most of their retweets from outside their own group and do not retweet each other so much inside the group (e.g. politics_news or liberal). Although we cannot clearly say that the group has an influence on whether it gets retweeted from outside, we can say that the members of the group at least have the choice to retweet other members of the group. If these members do not retweet each other, there might be a reason for it, about which you are free to speculate (or which I will try to answer in the next blog post).
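The inside/between counting can be sketched in a few lines of Ruby. This is only a minimal illustration and not the code behind the figure: the `group_of` lookup, the `[retweeter, author]` pair format and the sample users are all assumptions made up for the example.

```ruby
# Count, for the group of each retweeted author, how many retweets came
# from members of the same group (inside) and from other groups (between).
# retweets: array of [retweeter, author] pairs (assumed format)
# group_of: hash mapping a username to its interest group (assumed format)
def retweet_counts(retweets, group_of)
  counts = Hash.new { |hash, group| hash[group] = { inside: 0, between: 0 } }
  retweets.each do |retweeter, author|
    source = group_of[retweeter]
    target = group_of[author]
    next if source.nil? || target.nil? # skip users outside the sample
    kind = source == target ? :inside : :between
    counts[target][kind] += 1
  end
  counts
end

# Made-up sample data, two groups with two members each.
group_of = {
  "user_a" => "tennis", "user_b" => "tennis",
  "user_c" => "politics_news", "user_d" => "politics_news"
}
retweets = [
  %w[user_a user_b], %w[user_b user_a], %w[user_c user_a],
  %w[user_d user_b], %w[user_d user_c]
]

retweet_counts(retweets, group_of).each do |group, c|
  puts "#{group}: inside=#{c[:inside]} between=#{c[:between]}"
end
```

Sorting the resulting hash by the `:inside` value would give the ascending order used in the graphic.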

On the influence of types of ties on retweets

Given the different types of ties described above we can now ask the most important question:

How do the different non-valued bridging ties differ from the bonding ties in regard to their influence on the information diffused through those ties?

What do I mean by that? Having all retweets between the persons in the sample, I want to find out through which ties these retweets have flown. So, for example, given that A has retweeted B three times, I ask which ties (that A and B already have in the friend and follower network or the interaction network) were “responsible” for this flow of information between those actors.

EXAMPLE: If two people have mentioned each other at least once, I will assume (according to the definition above) that they hold a reciprocated interaction tie. I will then assume that this tie was “responsible” for the retweet between them. NOTICE: This is a simplifying assumption, because I assume that if there is a stronger tie it was always responsible for the retweet, and not the possibly underlying weaker tie (e.g. in the form of a friend and follower tie).

The assumption that I make here is therefore:

  • >  means this connection is supposed to be stronger
  • AT_reciprocated_tie > AT_directed_tie_with_strength_1
  • AT_directed_tie_with_strength_1 > FF_reciprocated_tie
  • FF_reciprocated_tie > FF_non_reciprocated_tie
  • FF_non_reciprocated_tie > No Tie
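The precedence above can be written down as a small classification function. This is a sketch under simplifying assumptions: the parameter names are hypothetical, a reciprocated interaction tie is taken to mean "each mentioned the other at least once" as in the example, and valued ties with more than one mention are not distinguished from a directed tie of strength 1.

```ruby
# Return the strongest tie type from A to B, checking the stronger
# tie types first so that they win over weaker underlying ties.
# mentions_ab / mentions_ba: how often A mentioned B and B mentioned A
# follows_ab / follows_ba:   whether A follows B and B follows A
def strongest_tie(mentions_ab:, mentions_ba:, follows_ab:, follows_ba:)
  return :at_reciprocated_tie     if mentions_ab >= 1 && mentions_ba >= 1
  return :at_directed_tie         if mentions_ab >= 1
  return :ff_reciprocated_tie     if follows_ab && follows_ba
  return :ff_non_reciprocated_tie if follows_ab
  :no_tie
end

# A mentions B once and both follow each other: the interaction tie wins.
puts strongest_tie(mentions_ab: 1, mentions_ba: 0, follows_ab: true, follows_ba: true)
```

Each retweet between A and B would then be attributed to the tie type this function returns for the pair.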

In order to compute which kinds of ties were most successful at transmitting retweets, I take the number of ties of a given TYPE (e.g. ff_reciprocated_ties) through which retweets have flown and divide it by the number of ties of the same type that had no retweets (e.g. ff_reciprocated_ties between people where no retweet was exchanged). So if I have a total of 10,000 reciprocated ties, and over 2,000 of them a retweet took place while over the remaining 8,000 no retweets were transmitted, the ratio for this type of tie is 2,000 / 8,000 = 0.25.
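As a quick check of the arithmetic, here is the ratio as a small Ruby function, together with a helper that computes it per tie type. The hash layout of a tie (`:type`, `:retweets`) is an assumption for illustration only.

```ruby
# Ratio of ties that carried at least one retweet to ties of the same
# type that carried none. Returns nil when no tie is without retweets.
def retweet_ratio(ties_with_retweets, ties_without_retweets)
  return nil if ties_without_retweets.zero?
  ties_with_retweets.to_f / ties_without_retweets
end

# ties: array of hashes like { type: :ff_reciprocated_tie, retweets: 3 }
# (hypothetical structure). Returns a hash of tie type => ratio.
def ratios_by_type(ties)
  ties.group_by { |tie| tie[:type] }.transform_values do |group|
    with = group.count { |tie| tie[:retweets] > 0 }
    retweet_ratio(with, group.size - with)
  end
end

# The example from the text: 2,000 ties with and 8,000 ties without retweets.
puts retweet_ratio(2_000, 8_000) # prints 0.25
```

Note that this is a ratio of "with" to "without", not a share of all ties: 2,000 out of 10,000 ties would be a share of 0.2.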


I have summarized the results in the table below. The std. deviation reports the deviation in the different retweet ties that belong to a certain edge type. (In the case of no_tie we have no data for no retweets because here we would have to count all the ties that are not present, which seems a bit unrealistic, given the structure of social networks)

As you can see in the table I have first of all differentiated if a tie belongs to a bridging tie or a bonding tie. Remember that bonding ties are between people who hold the same interest while bridging ties are between people who belong to different groups and thus share different interests.

No ties

As you can see, first of all there are a couple of retweets that have taken place between people despite those people not holding any ties. In the case of bridging ties we see a few more retweets than in the case of bonding ties. Yet given the total of almost 660,000 retweets, the approximately 73,000 retweets that took place without a tie are more or less only 10% of the total information diffusion. (So my apologies: the blog post on the importance of no ties overstated their importance, given this new interpretation.)

Friend and follower ties

What is more interesting are the friend and follower ties. We can see that in both cases holding a reciprocated tie with a person results in a higher chance of getting retweeted by this person. However, when we look at the bonding ties this chance is almost 4 times as high, while for the bridging ties our chances improve by less than 10%. When we compare the bonding with the bridging ties, we clearly see that the reciprocated bonding ties have a roughly 10 times higher chance of leading to a retweet than the bridging ties. This is very interesting. So despite the fact that bridging ties are of course important, because they lead to a diffusion of information outside of the interest group, they are much more difficult to activate than ties between people who share the same interest. From my point of view this shows exactly the weakness of weak ties. By weak ties I mean the bridging ties that link different topic interest communities together. We see not only that the weaker the tie, the lower the chance of it carrying a retweet, but also that if the tie is a bridging tie the chances drop significantly.

Additionally we can also see that the reciprocated friend and follower ties correspond to the majority of the bandwidth of information exchanged. This is also an interesting fact since the stronger the ties get the higher the chance of obtaining a retweet through this tie, but at the same time the total amount of retweets flowing through these ties drops dramatically (we will also see this when we take a look at the valued at-interaction ties). Just by adding up the numbers we see that almost 3/4ths of all retweets inside the group have flown through the reciprocated friend and follower ties. So although those ties have only a ratio of 0.8 of retweets / no retweets they are the ties that are mostly responsible for the whole information diffusion inside the group.

Interaction ties

When we analyze the interaction ties we find a similar pattern. We see that the bonding ties have a much higher chance of resulting in a retweet than their bridging counterparts, although the difference is not as dramatic. In general we also notice that the reciprocated at_ties have the higher chance of leading to retweets. The ratio is actually higher than one for the reciprocated bonding ties, which means that more of these ties carried a retweet than did not. From a tie “maintenance perspective” it would seem smart to maintain such ties with your followers, because on average they lead to the highest “earnings” in retweets. We shouldn’t jump the gun here, though, because up till now we have analyzed the rather “weak” ties. Why weak? Well, having had a reciprocated conversation with a person is great, but having received 10 or 50 @ replies from that person is definitely a stronger tie and might lead to a higher chance of getting retweeted by this person.

Valued ties

If we look at the valued ties we could replicate the table above and go through each tie strength separately, but it’s more fun to do this in a graphical way. I have therefore plotted the tie strength between two persons on the X-axis and the ratio (ties that had retweets flow through this type of tie / same type of ties that had no retweet) on the Y-axis (make sure to click on the graphic to see it in full resolution).

So what do we see? Well, first of all the red line marks the ratio of 1: above it, more ties of this type carried retweets than did not. Anything above one is awesome ;). You will also notice that there is quite a lot of variance in the retweets, which is indicated by the error bars (std. deviation). As the ties get stronger, I would say that the standard deviation also gets higher (due to higher and fewer values in the retweets).

Bridging ties vs. bonding ties

What we notice is that both the bridging and the bonding ties tend to have a higher chance of retweets flowing through them the stronger they get. I would say this holds up to a certain point, maybe a strength of 40? After this the curve starts to fluctuate so much that we can’t really tell whether this behavior looks like this simply by chance (notice the high error bars). What we also see is that the bridging ties clearly have a lower chance of resulting in retweets than their bonding counterparts (compare the green curve with the blue one). This is an observation that we have also made before. So again here it is, the weakness of weak ties. Weaker ties have a lower chance of resulting in retweets, and the typical weak bridging ties are also much harder to activate than their bonding counterparts. What is not shown in this graph is the total number of retweets that have flown through those strong ties. Those are ~29,000 retweets for bridging ties and ~37,000 for bonding ties. Compared to the other tie types this is only a fraction of the total of exchanged retweets. Yet these strong ties have, in comparison, a very high chance of leading to retweets, sometimes with ratios higher than 3 (i.e. there are three times more ties of this type with retweets flowing through them than ties without).

Well, that was it for today. I will update this blog post tomorrow with the reverse direction of ties, where I will have a look at the influence of outgoing ties on the incoming retweets. But don’t expect any surprises ;). Plus I will post the code that I used to generate this type of analysis.



Finding out what people are interested in by using only structural information

A lot of recommendation algorithms these days suffer from the so-called cold start problem. Usually this problem is tackled by having the user fill out some initial forms or provide some initial ratings, e.g. for movies, in order to give the algorithm something to work on. Another idea is to use what is already out there, namely the information encoded in the friends and follower graph on Twitter.

I thought it would be fun to use my recent corpus of 16,000 Twitter users (who have been categorized by how people list them with the list feature) to determine what an arbitrary user is interested in. If this user follows one of these people, he might also be interested in the area that they represent. See the schema figure below. The approach is really quite simple: collect all the friend edges of a user, go through them and see if we can find each person in our pre-tagged set of users. The more users we find from one category, the more this user seems to be interested in this topic.
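The idea can be sketched roughly as follows. This is a minimal sketch, not the original script: the function name, the `tagged_users` hash (one category per pre-tagged user) and the sample usernames are assumptions for illustration.

```ruby
# Aggregate a user's interests from the people he follows.
# friends:      usernames the user follows
# tagged_users: hash mapping a pre-tagged username to its category
#               (assumed layout, one category per user)
# Returns [category, count] pairs, most frequent category first.
def interests_for(friends, tagged_users)
  counts = Hash.new(0)
  friends.each do |friend|
    category = tagged_users[friend]
    counts[category] += 1 if category # ignore friends outside the corpus
  end
  counts.sort_by { |category, count| [-count, category] }
end

# Made-up sample data.
tagged_users = {
  "user_a" => "ruby", "user_b" => "ruby",
  "user_c" => "python", "user_d" => "sociology"
}
friends = %w[user_a user_b user_c someone_untagged]

interests_for(friends, tagged_users).each do |category, count|
  puts "#{category} #{count}"
end
```

Only the friend list is needed as input, so no tweets have to be collected or analyzed.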


Below is all that is needed to perform this user interest aggregation:

The final partitions file in the code is simply the output of a task that I performed in my last blog post. I think the results of this very simple idea are quite satisfactory. But see for yourself. I have pre-computed the results for some people that I follow and am thinking of putting this online somewhere so you can also check for yourself. Below is the sample yaml output for the user zephoria (danah boyd). The second number next to the person in each category shows how high this person has been ranked in this category.

Here are some shortened results (omitting the individual persons) for people I follow on Twitter. If you like, you can tell me in the comments how well this approach actually captured your interests.

  • Name: plotti Interests: java 12 python 7 ruby 6 sociology 4 investor 2 database 2 tech 2 anthropology 1 anime 1 developer 1 innovation 1 mac_iphone 1 publicrelations 1 university 1 teaching 1
  • Name: barrywellman Interests: sociology 7 anthropology 3 linguistics 1 tech 1 multimedia 1 innovation 1 developer 1 highered 1
  • Name: marc_smith Interests: sociology 19 tech 14 innovation 7 ceo 7 investor 6 politics_news 5 geography 5 marketing 5 charity_philanthropy 4 developer 3 finance_economics 3 climatechange 2 comedy_funny 2 healthcare_medicine 2 highered 2 religion 1 director 1 publicrelations 1 mobile_smartphone 1 blogs 1 humanrights_activism_justice 1 pharma 1 multimedia 1 anthropology 1 engineering 1 hacking 1 branding 1 banking 1 mac_iphone 1 basketball 1 university 1 management 1 biology 1 democrat 1 radio 1 newspaper 1 ruby 1 agriculture 1 author 1 psychology_mentalhealth 1
  • Name: jorgefabrega Interests: sociology 8 tech 3 innovation 2 anthropology 1 mathematics 1 marketing 1 geography 1 developer 1 teaching 1 database 1 politics_news 1 university 1 finance_economics 1 philosophy 1 reporter 1
  • Name: PFCdgayo Interests: sociology 6 innovation 2 developer 2 mathematics 2 database 2 psychology_mentalhealth 1 php 1 comedy_funny 1 politics_news 1 engineering 1 anime 1 anthropology 1 tech 1 hacking 1 comics 1 biology 1 university 1
  • Name: chl Interests: python 11 flash 9 developer 8 tech 6 investor 6 ceo 5 database 3 ruby 3 html 3 biology 2 astronomy_physics 2 multimedia 1 gaming 1 geography 1 anthropology 1 mathematics 1 sociology 1 innovation 1 photography 1 neuroscience 1 buddhism 1 java 1 comedy_funny 1 chemistry 1 banking 1
  • Name: orgnet Interests: innovation 2 sociology 2 jewish 1 geography 1 publicrelations 1 management 1
  • Name: jure Interests: university 2 sociology 2 investor 1 liberal 1 sailing 1 finance_economics 1 database 1
  • Name: arnicas Interests: flash 8 python 7 developer 4 tech 4 html 3 database 2 sociology 2 innovation 2 jokes 2 engineering 2 astronomy_physics 2 tvshows_drama_actor_hollywood 2 mathematics 2 anime 1 dating 1 charity_philanthropy 1 multimedia 1 marketing 1 investor 1 anthropology 1 comedy_funny 1 politics_news 1 history 1 blogs 1 author 1 neuroscience 1 university 1 management 1 teaching 1 cinema 1 biology 1 climatechange 1 comics 1 reporter 1

Again I’d like to note that in order to find out about a user’s interests using this method, there is no need to study his tweets. His friend ties already reveal quite a lot. The first couple of interests are often not that surprising, but some of the later interests reveal things about persons that I was not aware of.



How to generate interest based communities part 1

In this blog post I want to talk about how to find people on Twitter that are interested in the “same things”. I have posted a number of entries about

Today I want to go through the process of how to use the approximately 200 different keywords representing user interests (e.g. swimming, running, ruby, php, jazz, career, hunting, islam and so on…) and how to get all of the relevant users who contribute heavily to these topics, i.e. who form the interest-based community.

Capturing the collective knowledge of Twitter

To capture the collective knowledge of Twitter I will make use of Twitter’s “list feature”, shown below:

As you can see I am listed in a number of lists such as SNA, social media, dataviz and so on. These lists have been created by people in order to organize Twitter followers into some categories, similar to book lists on amazon and so on.

Having scraped the first 100 people for each of the 200 keywords in the last blog post (https://twitterresearcher.wordpress.com/2012/02/17/how-to-make-sense-out-of-twitter-tags/) and stored them in the database, I will use these people to find more lists that feature similar people for the keyword. Why? Mainly because Twitter doesn’t let you search for lists with a certain name, and because the alternatives such as wefollow.com, twellow.com or listorious.com do not give you all the lists for a given search term. That is why I will have to snowball through Twitter lists and keep those that are relevant for a given topic. This process consists of three parts:

  1. For the seed users, collect all the lists that they are listed on. From these lists, keep only those that match the keyword and are thus relevant for the topic, and do some filtering
  2. From the remaining lists collect every person that is on that list
  3. For those persons count how often they are listed on the lists relevant for a certain field

This process is shown in the figure below:

1. Collecting lists

How do we collect lists? Well, we start by checking if we have enough API calls left on Twitter. If this is the case, we start collecting the memberships for a given user and keep on paging until there are no more lists that the user is listed on. As you can see, Twitter can be a bit sensitive to the page size: it can be 1000 items max, but in practice it is around 200-400 items before we get timeouts. That’s why the function adapts dynamically to those. Also, collecting more than, let’s say, 10000 lists for a given user does not make much sense, since we are probably wasting our API calls on a celebrity like aplusk. Once all of the lists that the user is listed on have been collected, I store them in a csv file and, as the most important part of this procedure, persist in my database those lists that contain the keyword that the person was originally collected for. This means, e.g., that if the seed user was in the category “swimming”, I will only keep those lists that include the keyword swimming in their name. Additionally, I make sure that if I have already encountered a list I don’t add it twice to my database.
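The paging logic with a shrinking page size can be sketched like this. It is a sketch under stated assumptions, not the original function: `fetch_page` is a hypothetical callable standing in for the real Twitter API request (it returns a page and the next cursor, and raises `Timeout::Error` on a timeout), the hash layout of a list is made up, and the rate-limit check is left out.

```ruby
require "set"
require "timeout"

# Page through all memberships of a user, halving the page size on timeouts.
# fetch_page: callable taking (cursor, page_size) and returning
#             [lists, next_cursor]; next_cursor is nil on the last page.
def collect_memberships(fetch_page, page_size: 1000, min_page_size: 50, max_lists: 10_000)
  lists = []
  cursor = -1 # -1 marks the first page in this sketch
  while cursor && lists.size < max_lists
    begin
      page, cursor = fetch_page.call(cursor, page_size)
      lists.concat(page)
    rescue Timeout::Error
      raise if page_size <= min_page_size # even small pages time out: give up
      page_size = [page_size / 2, min_page_size].max
    end
  end
  lists.first(max_lists)
end

# Keep only lists whose name contains the keyword and that were not seen
# before. known_ids is a Set and is updated with every list that is kept.
def relevant_lists(lists, keyword, known_ids)
  lists.select do |list|
    list[:name].downcase.include?(keyword.downcase) && known_ids.add?(list[:id])
  end
end

# Simulated API: 700 lists, requests for more than 300 items time out.
all_lists = (1..700).map { |i| { id: i, name: i.even? ? "Swimming pros #{i}" : "misc #{i}" } }
fake_fetch = lambda do |cursor, size|
  raise Timeout::Error, "simulated timeout" if size > 300
  offset = cursor == -1 ? 0 : cursor
  page = all_lists[offset, size] || []
  next_cursor = offset + size < all_lists.size ? offset + size : nil
  [page, next_cursor]
end

collected = collect_memberships(fake_fetch)
kept = relevant_lists(collected, "swimming", Set.new)
puts "collected #{collected.size} lists, kept #{kept.size}"
```

Passing the fetcher in as a callable keeps the paging logic testable without network access; a real client call would take the place of `fake_fetch`.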

2. Collect List members

Once I have collected tons of lists that match the category keywords, I collect all of the members of these lists. The code below is run for every list in the database. As you can see, I make sure that there are enough API calls left, and then start to collect all of the members on the list. For this I am using delayed_job, a nice Ruby library (https://github.com/tobi/delayed_job) that allows me to wrap time-consuming tasks in a neat job that can then be run later, or by multiple workers on multiple machines. I have had good experiences using around 10-15 workers on a single machine, which then process these jobs in the background. Anyway, at the end of this step we end up with projects that each contain a high number of people who seem to be relevant for this user interest, because they have been listed on lists for explicitly this interest.
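The idea of handing each list to a background worker can be sketched as follows. This is not delayed_job; it swaps in a thread pool from Python's standard library purely to illustrate the pattern, and `fetch_members` is a hypothetical function standing in for the API call that returns the usernames on one list.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_members(list_ids, fetch_members, workers=10):
    """Fetch the members of every list in parallel; returns {list_id: members}."""
    list_ids = list(list_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with list_ids
        results = pool.map(fetch_members, list_ids)
        return dict(zip(list_ids, results))
```

Unlike a persistent job queue, a thread pool loses its pending work if the process dies, so for long crawls a real queue is the more robust choice.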

3. Creating projects with most listed persons

After steps 1 and 2 we have a number of potential candidates that are relevant for a given topic, but we are only interested in those that represent the user interest the most. That is why we need a procedure that filters those people according to how often they have been listed for a certain topic. How do we do that? For each topic we have a number of lists, each of which names the people it considers relevant for this topic. If we go through all the lists for a given topic and count how often each person was listed, we should end up finding the most relevant users for that topic. (As we will see later in part two, this process gives some nice results, but its accuracy can be greatly improved.) So what does the code below do?

This function is run on the projects that contain the people collected for a certain topic. First, it loads all the persons in the project into memory for faster computation. It then goes through all the lists that we collected for the topic and checks again whether each list matches the topic keyword. If it does, it checks that we have not encountered this list before (which should not happen, since we made sure not to add lists twice during insertion, but double checking won't hurt).

If the list has members, it collects all of the member usernames into an array and checks whether the seen_membersets (which are simply collections of usernames) already contain a set with exactly the same members. Why are we doing this? Because there is list spam on Twitter: people or bots copy lists only to save them under a different name. If no list with the same members has been seen before (which is the case about 99% of the time), we actually analyze this list; otherwise we drop it, because it duplicates a list that we have already encountered.

For each list member, we check whether the person is among the persons we are computing the list count for; if so, we add 1 to that person's counter. If there is a person on the list that we somehow have not captured before, we add them to our pool of persons and also raise their counter. At the end of this procedure we have a list count for each person that was on these lists, and can directly see how relevant this person is for a certain topic. We output the sorted list counts into a simple csv, to be used later in part 2 to improve our accuracy.
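The counting procedure described above can be sketched like this in Python. It is a simplified illustration under stated assumptions, not the original function: each list is assumed to be a dict with the hypothetical keys `id`, `name` and `members`, and the name `seen_membersets` is borrowed from the description.

```python
from collections import Counter


def count_listings(lists, keyword, known_persons=()):
    """Count how often each person appears on distinct lists matching the keyword."""
    counts = Counter({person: 0 for person in known_persons})
    seen_list_ids = set()
    seen_membersets = set()
    for lst in lists:
        if keyword.lower() not in lst["name"].lower():
            continue  # list does not match the topic keyword
        if lst["id"] in seen_list_ids:
            continue  # same list encountered twice
        seen_list_ids.add(lst["id"])
        members = frozenset(lst["members"])
        if not members or members in seen_membersets:
            continue  # empty list, or a copy of a list we already counted
        seen_membersets.add(members)
        counts.update(members)  # +1 per member; unseen persons are added
    return counts.most_common()  # sorted from most to least listed
```

Using a `frozenset` makes the duplicate check independent of member order. It only catches exact copies, so lists that differ by a single member would still both be counted.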

To show the result of this process I have pasted a small part of the output of step 3 below. As you can see, aplusk, JimCarrey, tomhanks and so on seem to be the most relevant Twitter users for the community of actors. This list contains ~18,000 entries, and the people towards the end do not represent the actor community nearly as well as the people at the beginning of the list.

Last step

If we now take e.g. 100-200 people from the top of each of these lists (assuming that people cannot maintain relationships with more than this number of people, according to the Dunbar or Wellman number http://en.wikipedia.org/wiki/Dunbar%27s_number), we end up with interest-based communities of people on Twitter who share an interest.

Studying those communities is what I am trying to do in my work, but more on that later. These communities are also interesting for advertisers: imagine someone who wants to sell swimming underwear. This retailer would be highly interested if you could show him all the people on Twitter who are interested in swimming. Those people could be his first customer group. If his swimming underwear is approved by those people, it is very likely that they will talk about it and so inform other people interested in swimming about this product.

In part two I will show you how we can improve these communities by allowing people to move from one community to another if they “fit” better there.

Cheers Thomas

Perspective Changes Everything

An excellent talk by Rory Sutherland of the Ogilvy Group on how the value of things is purely subjective. He has also published a book on this, called The Wiki Man.

As a researcher involved in communication science, economics and computer science, I see it the same way: the most interesting ideas evolve in between domains, for example network science and marketing. In a way this made me think about how interest-based social networks can be used for social media marketing, or marketing in general. The simple fact that people have interests and connect with other people because of these interests is already enough to dramatically change how targeting is done in marketing. These days you no longer need focus groups or social milieus to find out what people like. They will simply tell you in social media, and beyond that they will tell you which person is the most important one they turn to when they want good info on this interest. I will think about this on this sunny Monday morning 🙂



What's The Big Data?

I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments.  I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).

[An updated version of this timeline is at Forbes.com]

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing, which defines data as “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.”…
