
What's The Big Data?

I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments.  I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).

[An updated version of this timeline is at Forbes.com]

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing, which defines data as “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.”…



Audience analysis of major Twitter news outlets


A very interesting blog post from the people at socialflow was the inspiration for this little study. The socialflow study analyzed the Twitter outlets of major news providers such as CNN and the NYT to find out whether they share a common audience, how they compare when it comes to being retweeted, and so on. So I thought it would be a good idea to try something similar for German newspapers. Another motivation is the simple fact that media analysis of TV, radio or newspapers focuses strongly on the demographics of the audience (see screenshot below) but totally neglects the following issues:

  • Social media as a medium (Twitter, Facebook, etc.) is not analyzed at all (how do accounts compare in terms of followers, friends, tweet content and frequency, …)
  • The readers’ relationships with each other (is there a connected audience?)
  • How the readership is extended by sharing functions such as retweets (how do stories get passed along, which ones are the most popular, …)

A screenshot from ma-reichweite.de

Research Questions

I’ve decided to focus on a couple of very general research questions:

  • How many outlets does each publisher have and how are they connected with each other?
  • How do accounts compare regarding their Followers, Friends and Messages?
  • How does the  user engagement in terms of retweets differ between the outlets?
  • How do retweets help to reach a wider audience?
  • Is there a shared audience between those accounts and publishers?


Since the German newspaper ecosystem is quite fragmented, there are quite a few publishers and thousands of (daily and weekly) newspapers and magazines. I’ve decided to focus on the following ones:


What we see from the general overview is that the publishers differ quite a lot in the number of active accounts. SPIEGEL has 24 Twitter accounts with a total of almost 500,000 followers. The leading tabloid BILD, despite its huge offline reach of 12 million people, accumulates only 170,000 followers on Twitter.

Structure of the Twitter Outlets among each other

When we look at how these 118 Twitter accounts are linked with each other, a pattern appears (see figure below). The general norm seems to be a main Twitter outlet (e.g. BILD_News or zeitonline) that is connected with the remaining topic-specific accounts, which in turn are all connected to the other Twitter outlets of the same publishing house. Outlets of different publishers are not connected with each other.

Comparison of Followers, Messages and Friends

Looking at the distribution of followers, I found that more than 70% of the analyzed accounts have fewer than 10,000 followers. Among the top 20 outlets by followers, it is striking that more than 8 belong to SPIEGEL. This publisher seems to dominate the field.

Overview of Tweets

If we look at the number of tweets produced, we see that again around 70% of all accounts have generated fewer than 10,000 tweets during their existence. In the top 20 we find extreme examples such as focussport or focusonline, which produce up to 90 tweets per day. At such a frequency I ask myself how the followers of these accounts cope with the flood of tweets.

Followees distribution

The figure of followees holds the most surprising finding. It seems that only the account of TAZ (tazgezwitscher) follows its readers back and thus at least has the potential to read what readers have to say. This brings us to the question: if we are in social media and interaction with the readership is a given, how do these outlets actually interact with their readers?

 Interaction with readers

To measure how much these outlets interact with their readers, I collected all tweets of each account and counted how often they refer to somebody using the @ sign. I distinguished between references to the publishers’ own accounts and references to actual readers. The results are rather surprising: out of 270,000 tweets only 13,000 interact with somebody at all. Of these, almost 10,000 refer to own accounts (e.g. when BILD_NEWS refers to BILD_Sport). So only about 3,000 tweets actually interact with readers, which is a meager 1%. Interaction with readers is taking place at a shockingly low level.
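The counting described above can be sketched in a few lines of Ruby. This is only a minimal illustration and not the script used for the study: the method name `mention_stats`, the account list and the sample tweets are made up, and a tweet is assumed to count as reader interaction as soon as it mentions at least one account that is not on the list of own accounts.

```ruby
require 'set'

# Hypothetical list of the publishers' own accounts (lowercase).
OWN_ACCOUNTS = Set.new(%w[bild_news bild_sport zeitonline])

# Classify tweets into: no mention at all, mentions of own accounts only,
# and mentions of at least one outside user (a reader).
def mention_stats(tweets, own_accounts)
  stats = { none: 0, own_only: 0, readers: 0 }
  tweets.each do |text|
    mentions = text.scan(/@(\w+)/).flatten.map(&:downcase)
    if mentions.empty?
      stats[:none] += 1
    elsif mentions.all? { |name| own_accounts.include?(name) }
      stats[:own_only] += 1
    else
      stats[:readers] += 1
    end
  end
  stats
end

# Made-up sample tweets, for illustration only.
sample = [
  "Breaking: new coalition talks",
  "More on this at @BILD_Sport",
  "@some_reader thanks for the hint!"
]
puts mention_stats(sample, OWN_ACCOUNTS).inspect
```

Dividing the `:readers` count by the total number of tweets gives the share of tweets that interact with readers.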

What does the readership of an account look like?

There are theories about a connected readership online, speculating that the social media readers of an account are connected to each other and exchange and discuss content online. To find out whether such a structure emerges, I analyzed the account of fr_online as an example and collected all of its ~9,000 readers. Below you see a spring layout of these 9,000 nodes in gephi. You find the typical core-periphery structure: 10% of readers do not have any connections to other readers, 50% of readers have fewer than 15 links, and finally there is a core of highly connected readers. Among these highly connected readers we actually find commercial or celebrity accounts such as ntvde, derfreitag, Calmund, tagesspiegel_de, Piratenpartei, handelsblatt, hronline, spdde, …

Network layout of 9000 readers of fr_online

Engagement of Readers

In order to measure how the accounts differ in reader engagement, I have collected all retweets for all tweets of all accounts and created two ratios:

  • Retweets / Message
  • Retweets / Follower
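As a minimal sketch, the two ratios can be computed from per-account totals like this. The `Account` struct, its field names and the numbers are hypothetical and only serve as an illustration:

```ruby
# Hypothetical per-account totals; names and numbers are made up.
Account = Struct.new(:name, :followers, :messages, :retweets)

# Average number of retweets each message of the account received.
def retweets_per_message(account)
  return 0.0 if account.messages.zero?
  account.retweets.to_f / account.messages
end

# Number of retweets relative to the size of the direct readership.
def retweets_per_follower(account)
  return 0.0 if account.followers.zero?
  account.retweets.to_f / account.followers
end

accounts = [
  Account.new("example_main", 80_000, 9_000, 27_000),
  Account.new("example_sport", 40_000, 95_000, 1_000)
]
accounts.each do |a|
  puts format("%-14s %.3f retweets/message, %.3f retweets/follower",
              a.name, retweets_per_message(a), retweets_per_follower(a))
end
```

The first ratio says how well the individual messages travel, the second how active the followers are, which is why the two can rank the same account quite differently.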

Retweets / Message

Looking at all accounts, I found that almost 90% of them got less than one retweet per tweet on average. This is still a respectable result if we think of the findings of Romero et al., who found that users retweet only one in 318 links. If we look at the top 20 accounts with the highest retweets/message ratio, we see that the breaking news account of SPIEGEL comes out on top with 10 retweets per tweet on average. Similar results are achieved only by the main accounts of ZEIT and TAZ. On the other end of the spectrum we find accounts like focuspanorama (11,000 messages / 14 retweets) or focussport (95,000 messages / 3 retweets).

Retweets / Follower

Regarding retweets per follower, 79 accounts had a ratio of less than 0.1, which means that they got less than one retweet for every 10 followers. Among the top 20, the highest ratio of 1 retweet for every 3 followers was achieved by tazgezwitscher. It seems that this account has the most engaged readership, which helps it to spread its news well beyond its direct readership. Among the accounts with the lowest audience engagement we find BILD_Bundesliga (with 40,000 followers and 1,000 retweets) or SPIEGEL_Rezens (with 30,000 followers and 300 retweets). We can speculate that sports-related content in particular is not retweeted that often because soccer results are simply consumed and not shared.

Exemplary analysis of the engaged readership of one account

To see the structure of the readers that have retweeted at least one tweet from an account, I collected such users for the account of fr_online, laid them out with gephi, and applied the modularity community detection algorithm. The results below show that readers actually cluster in different communities, which differ in their political orientation or interests.

Structural overview of readers of fr_online that retweeted at least one of its messages

 Extended Readership

Knowing that retweets yield an extended readership (see below), one goal was to take a glimpse of what such an extended readership might mean for the reach of one account.

Extended Readership through retweets

To get an idea of how the extended readership helps to boost an account’s reach, I collected all tweets and the respective retweets of these accounts. For each retweet I looked up how many followers the retweeting reader had. By simply adding up the followers of every reader who retweeted this account, you get a number that is the potential maximum extended audience that might have been reached through these retweets. I say potential maximum because I do not take into account that the people who retweeted might have a shared audience (e.g. imagine reader5 and reader6 being the same person in the figure above).
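The summation can be sketched as follows. The field names `:user` and `:followers` and the sample data are made up, and the sketch assumes that a reader who retweeted several times is counted only once; overlapping follower lists are still ignored, so the result stays an upper bound.

```ruby
# retweets: one hash per retweet with the retweeting reader and that
# reader's follower count (hypothetical field names).
def extended_audience_upper_bound(retweets)
  # Count every retweeting reader once, then add up their followers.
  retweets.uniq { |r| r[:user] }.sum { |r| r[:followers] }
end

# Made-up sample: reader1 retweeted twice but is counted once.
sample = [
  { user: "reader1", followers: 120 },
  { user: "reader2", followers: 4_500 },
  { user: "reader1", followers: 120 }
]
puts extended_audience_upper_bound(sample)
```

Dividing this number by the account's own follower count gives the multiplication factor of the reach.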

Extended audience through retweets

We notice that the total of 27,000 retweets of zeitonline generated an extended audience of 4.2 million readers, and in the case of tazgezwitscher 15,000 retweets resulted in more than 2 million additional readers. What we can take away from this calculation is that retweets really change the distribution game: while zeitonline has approximately 80,000 followers, it has managed to get some of its news seen by up to 4.2 million people, a multiplication of ~50x. I think this shows the true power of social media.

Potential multipliers

When drilling down into the data we found readers that are especially valuable for an account because they have a high number of followers themselves, serving as huge multipliers for the audience. Three cases emerge quite often:

  • Publishers use their own main accounts to boost the readership of smaller thematic accounts (e.g. when bild_sport (10,000 followers) is retweeted by bild_news (80,000 followers), or zeitonline_wir (3,000 followers) is retweeted by zeitonline (80,000 followers))
  • Influential users retweet the content (e.g. tweets from BILD_Digital (4,600 followers), SPIEGEL_Reise (14,000 followers) and SPIEGEL_Netz (22,000 followers) are retweeted by rather unknown readers that have a high number of followers: einerHaupka (170,000 followers), AxelKoster (120,000 followers), haukepetersen (70,000 followers))
  • The subject of the content retweets it (e.g. a tweet about the band “jetward” from bild_aktuell (35,000 followers) is retweeted by the fan account planetjetward (300,000 followers), or jeffjarvis (80,000 followers) retweets the focuslive account (10,000 followers), which interviewed him)

“Two-Step-Flow” of information

Regarding these diffusion patterns, I asked myself whether we can compute something like a two-step flow of information: the share of retweets that happened because the tweet was seen not on the original account itself but reached the reader through an intermediary. We defined the two-step-flow ratio as:

The number of people who have retweeted an account and follow it directly / the total number of people who have retweeted the account.

Readers following an account and retweeting it (green) , Readers NOT following an account and retweeting it (orange). Potential Two-Step-Flow dashed line.

The ratio can be as high as 1 if everybody who retweeted an account follows it directly, and as low as 0 if nobody who retweeted it follows it directly. We ordered the accounts by ratio, lowest first, and we see that some accounts like zeitonline_wir achieve a ratio of less than 0.5, which means that more than half of the people who retweeted them were not directly following the account. There can be two explanations for such a low ratio: a) people received the retweet from a broker or middleman and then retweeted it (which is in favor of the two-step-flow hypothesis), or b) people simply saw the article on the website and decided to tweet it from there. Since we didn’t analyse this in detail we can only guess about the proportions, but it would definitely be worth an analysis of its own.
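The two-step-flow ratio defined above can be sketched like this. The method name and the sample user names are hypothetical; each retweeting person is counted once, and `nil` is returned for an account without any retweeters, since the ratio is undefined in that case.

```ruby
require 'set'

# retweeters: users who retweeted the account at least once
# followers:  users who follow the account directly
def two_step_flow_ratio(retweeters, followers)
  retweeters = retweeters.to_set
  followers = followers.to_set
  return nil if retweeters.empty?
  direct = retweeters.count { |user| followers.include?(user) }
  direct.to_f / retweeters.size
end

# Made-up example: two of the four retweeters follow the account directly.
puts two_step_flow_ratio(%w[anna ben carl dora], %w[anna ben erik])
```

A value near 1 means that almost all retweets come from direct followers, while a value near 0 hints at intermediaries or at people sharing from outside Twitter.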

(in red) Ratio of people that retweeted an article and were directly following an account / all people that retweeted an article

Shared Readers

The final step of this analysis was to find out how many readers the outlets have in common (see orange people in the graphic below). The common readers measure reaches its maximum of 0.5 when, for example, two accounts have 100 followers each and all of them follow both accounts (100/200), and its minimum of 0 when the accounts have no followers in common.
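Read literally, the measure divides the number of common followers by the sum of both follower counts. A minimal sketch under that reading, with hypothetical account names and follower IDs:

```ruby
require 'set'

# Common followers divided by the sum of both follower counts.
# Identical audiences give 0.5, disjoint audiences give 0.0.
def shared_readers(followers_a, followers_b)
  total = followers_a.size + followers_b.size
  return 0.0 if total.zero?
  (followers_a & followers_b).size.to_f / total
end

# Compute the measure for every pair of accounts.
def shared_matrix(followers_by_account)
  followers_by_account.keys.combination(2).map do |a, b|
    [[a, b], shared_readers(followers_by_account[a], followers_by_account[b])]
  end.to_h
end

# Made-up follower sets, for illustration only.
followers = {
  "paper_a" => Set[1, 2, 3, 4],
  "paper_b" => Set[3, 4, 5, 6],
  "paper_c" => Set[7, 8]
}
shared_matrix(followers).each { |pair, value| puts "#{pair.join(' / ')}: #{value}" }
```

Because the measure is symmetric, computing each unordered pair once is enough to fill the whole matrix.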

Shared Readers

We computed this ratio for each combination of accounts and displayed the results in a symmetric matrix (see image below). We additionally grouped the accounts in the matrix by publisher (see blue boxes). The higher the ratio, the greener the cell; red means a lower ratio.

Shared audience by publisher

Symmetric matrix of shared audience

What we see in this visualization is that a common readership emerges especially among accounts of the same publisher (e.g. Spiegel_eil, Spiegel_news, Spiegel_reise…). Thus people who like SPIEGEL very often follow its other accounts as well. This pattern emerges even more clearly when we group the shared audience by publisher (below). What really stands out is that the tabloid BILD has an audience that is very different from the other audiences. On the other hand, “intellectual” newspapers that are well established in social media, such as ZEIT or SPIEGEL, seem to share a relatively big audience (~8%).

View of shared audience grouped by publisher

Shared audience by account

If we highlight the shared audience values that are three standard deviations above the average value (0.03), we also note that certain accounts that do not belong to the same publisher have a very big shared audience (green cells in the matrix below).
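The highlighting rule can be sketched as a simple threshold at the mean plus three standard deviations. The pair names and values below are made up, and the population standard deviation is used as an assumption, since the post does not say which variant was applied.

```ruby
def mean(values)
  values.sum.to_f / values.size
end

# Population standard deviation (an assumption, see above).
def standard_deviation(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / values.size)
end

# shared: hash mapping an account pair to its shared audience value.
# Returns the pairs whose value lies more than k standard deviations
# above the mean.
def outlier_pairs(shared, k = 3)
  values = shared.values
  threshold = mean(values) + k * standard_deviation(values)
  shared.select { |_pair, value| value > threshold }
end

# Made-up data: 19 ordinary pairs and one pair with a large overlap.
shared = {}
19.times { |i| shared[["account_#{i}", "account_#{i + 1}"]] = 0.02 }
shared[["travel_a", "travel_b"]] = 0.30
outlier_pairs(shared).each_key { |pair| puts pair.join(" / ") }
```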

Shared audience between accounts (green = three SD higher than average)

Since the matrix above is not really good at showing the structure that emerges in the data, we have simply visualized the data as a network, connecting the accounts that share an audience. The line strength was chosen according to the percentage of shared audience (see below).

Shared audience network visualization

In this visualization a number of interesting observations emerge:

  • Accounts focusing on the spread of top-news (red e.g. Spiegel_EIL, BILD_NEWS, BILD_AKTUELL, Spiegel_TOP, tazgezwitscher) have a shared audience.
  • We see the same pattern of readers following accounts of the same publisher (e.g. zeitonline_wir, zeitonline_kul, zeitonline_wis and zeitonline_pol, or Spiegel_wirtsch, Spiegel_politik, Spiegel_pano, Spiegel_seite2, Spiegelzwischen, Spiegel_SPAM)
  • Accounts that have a thematic focus seem to generate a shared audience. See travel: Stern_reise, Welt_reise, Faz_reise, Focusreise. Or cars: focusauto, FAZauto, SZ_Auto


We have arrived at the end of our little exploratory analysis. A couple of takeaways are:

  • Some publishers use Twitter quite successfully as a channel to enhance their reach and the interaction with their readers (as in the examples of spiegel, zeit or taz)
  • Despite the enthusiasm, the image of an interconnected audience does not emerge that strongly, as readers do not interact with the outlets very much and a high number of readers are only weakly connected to each other
  • Engagement of readers can quite nicely be measured in retweets/message and retweets/follower capturing different aspects.
  • Using a simple modularity analysis  of the retweets network of an account can bring interesting insights on how the audience of an account is clustered (as in the case of fr_online)
  • Retweets in general and the resulting Two-Step-Flow of information can boost the reach of an account by a potential magnitude of ~10-50x
  • Some very influential readers emerge as their audience often is bigger than the audience of the outlet itself
  • A shared audience emerges between accounts of the same publisher, but it also emerges between accounts of different publishers when they share a common topic (e.g. travel)

That is it for today. I am excited to hear your comments.




I am presenting this small analysis tomorrow at the SGKM conference (on journalism, social media and communication) and am excited to hear what the audience has to say.

User interests ontology

I’ve been blogging about the idea that people form networks based on their interests for a while now. As you may remember, we used the tags on wefollow to find out what people on Twitter are interested in.

And in the last post I showed how to create a post-hoc ontology from the tags that we collected on wefollow, which represent people’s interests. Yet the results of this attempt were kind of mediocre:

  • We found that people like to tag a lot of people with labels describing “somebody”, like rapper, artist, celebrity and so on.
  • We found that people like to tag a lot of Twitter users based on a certain activity, like swimming, running, hacking, dancing, cooking and so on.
  • We also found that rugby is similar to soccer, and those two are similar to cricket, because these are all field games, and so on.

But apart from these findings I was still lacking any deeper insight into this “bag of words” that I got from wefollow.

In another attempt I also tried to find out which of the keywords on wefollow are somewhat similar, simply by looking for words that sound the same or are spelled similarly. The results were interesting.

  • We found that people like to use different keywords with different popularity to describe “kind of” similar things. For example: film, 5687, 0; filmmaker, 2842, 5; filmmaking, 843, 6; films, 797, 1; farm, 312, 2; filmfestival, 223, 8; fire, 162, 2; fly, 151, 2; filmes, 141, 2
  • Many of these keywords contain the word film, but simply tagging users with the word film is apparently by far the most popular choice.
  • Similarly for other words: singer, 4276, 0; single, 902, 2; singersongwriter, 893, 10; swingers, 161, 2; singer_songwriter, 147, 11. Here singer is the most popular keyword to tag people, followed by single and so on.
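The third number in each triple above is consistent with the Levenshtein edit distance between the keyword and the first keyword of its group (for example, film and filmmaker differ by five inserted letters). The post does not name the similarity measure, so treating it as Levenshtein distance is an assumption. A minimal sketch of that distance:

```ruby
# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1,          # deletion
                  current[j - 1] + 1,       # insertion
                  previous[j - 1] + cost].min # substitution or match
    end
    previous = current
  end
  previous.last
end

%w[filmmaker films farm fly].each do |word|
  puts "film -> #{word}: #{levenshtein('film', word)}"
end
```

A small distance alone does not make two keywords related in meaning, as farm and fly show, which is one reason why this approach only gets us so far.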

User interest providers

Yet despite those two attempts we are still lacking insight into the interests of Twitter users. What I am looking for is some kind of hierarchy between those words, not so much as in the wordnet approach (see above) or the word similarity approach, but rather an ontology-based approach where we split up the users’ interests into, let’s say, 6-12 high-level categories and put our keywords into those. We have used two different approaches; now it’s time to get an overview of the other options.

The table below is a comparison of providers of user interests that have chosen to categorize them. As you can see, the approaches differ in how many keywords are used and in whether the ordering is hierarchical (as in dmoz or yahoo) or simply some sort of folksonomy (as in delicious). The first two providers are commercial and do not offer any subcategories or networks, but give us some clue about the number of top-level categories. In general we can say that most of these “interest directories” seem to contain the same top-level categories (which is great because it seems we can agree on something). Apart from that, I think that dmoz or yahoo give us the best chance to order our keywords in a reasonable manner.

So after settling on the yahoo directory (since it is the most comprehensive, contains the largest number of subcategories and is curated by paid professionals) to find out more about how to order users’ interests, it’s time to take a look at its directory.

A screenshot of the Yahoo Directory

Since there is no API, I chose to scrape the first 2-3 levels of the directory and save them to a file (I used nokogiri). You will find the listing below. It goes through each of the top levels defined beforehand (we are skipping new additions, subscribe via rss and regional) and looks at each of the links there. It notes how many subcategories are in there (the number in brackets) and whether a link points towards another category, in which case it contains an @ sign. After going through all of those links it writes them down in a simple manner:

Topcategory, Subcategory, Count

Like this we get a network of categories.

Scraping the categories

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Top-level categories of the directory to crawl.
topdomains = ["business_and_economy", "recreation", "computers_and_internet", "reference", "education", "regional", "entertainment", "science", "government", "social_science", "health", "society_and_culture"]

# Category names that have been seen before, one per line.
@seen_words = []
if File.exist?("seenwords.csv")
  File.readlines("seenwords.csv").each do |line|
    @seen_words << line.chomp
  end
end
@seen_words_file = File.open("seenwords.csv", "a+")

# Write one edge "father son" to the file of the current domain. A name that
# has been seen before (and is not an @ cross-link) is prefixed with its
# parent, so that generic names stay distinct for every category.
def write_net(father, son)
  if @seen_words.include?(son) && !son.include?("@")
    @file.puts "#{father} #{father}_#{son}"
  else
    @file.puts "#{father} #{son}"
  end
  @seen_words_file.puts son
end

i = 0
topdomains.each do |domain|
  @file = File.open("#{domain}.csv", "w+")
  puts "starting domain #{domain}"
  site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{domain}"))
  site.css("div.cat li a").each do |link|
    first_level_link = link.content.gsub(" ", "_").downcase
    puts "working on #{first_level_link}"
    # Links with an @ point to another category, so they are looked up
    # from the directory root instead of below the current domain.
    if first_level_link.include? "@"
      sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{first_level_link}"))
    else
      sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{domain}/#{first_level_link}"))
    end
    sub_site.css("div.cat li a").each do |sub_link|
      i += 1
      puts i.to_s
      second_level_link = sub_link.content.gsub(" ", "_").downcase
      write_net(first_level_link, second_level_link)
      if second_level_link.include? "@"
        sub_sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{second_level_link}"))
      else
        sub_sub_site = Nokogiri::HTML(URI.open("http://dir.yahoo.com/#{domain}/#{first_level_link}/#{second_level_link}"))
      end
      sub_sub_site.css("div.cat li a").each do |sub_sub_link|
        third_level_link = sub_sub_link.content.gsub(" ", "_").downcase
        write_net(second_level_link, third_level_link)
      end
    end
  end
  @file.close
end
@seen_words_file.close

Visualizing the network

Having downloaded the network, we end up with something that we can visualize in gephi (see below). The visualization is nice, since it allows us to see which fields the links with the @ sign connect. We can see clusters emerge between different concepts and see that most of the second-level categories are not connected to the rest. As you can see in the listing, I have also made sure not to include subcategories like “organisation” or “people” as such, since every category contains such a subcategory and it would end up being the most central node in our network. Instead each such subcategory gets an explicit name, e.g. “sports_organisations”, and only the categories with an @ are allowed to link to other groups.

The downside of this approach is that the result is pretty big and creates even more confusion than our keywords from wefollow. Now we have a network with approximately 7,000 nodes and 20,000 edges. We could now search for each of the wefollow keywords, see where we can find it, and then drop the rest. This idea is not bad, but we would be neglecting the majority of the great insights that the yahoo directory gives us. If, for example, wefollow does not contain keywords regarding health, does it mean that Twitter users are not interested in health issues, or did we not look properly? Therefore I decided to take a hybrid approach: first I cut down the yahoo directory to the categories that contain a lot of entries, and at the same time I see how the 200 most frequent wefollow keywords fit into this ontology.
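Checking which wefollow keywords already occur in the scraped directory can be sketched as follows. The sketch assumes edge lines in the "father son" format written by the scraper and strips the @ marker from cross-links before comparing; the method names, sample edges and keywords are made up.

```ruby
require 'set'

# Collect all category names from edge lines of the form "father son".
def nodes_from_edges(lines)
  lines.each_with_object(Set.new) do |line, nodes|
    father, son = line.split
    next if son.nil? # skip empty or malformed lines
    nodes << father.delete("@") << son.delete("@")
  end
end

# Split the keywords into those found in the directory and those missing.
def match_keywords(keywords, nodes)
  keywords.partition { |keyword| nodes.include?(keyword) }
end

# Made-up sample edges and keywords, for illustration only.
edges = ["recreation sports", "recreation travel", "sports soccer@", ""]
found, missing = match_keywords(%w[soccer travel youtube], nodes_from_edges(edges))
puts "found: #{found.join(', ')}"
puts "missing: #{missing.join(', ')}"
```

The missing keywords are the candidates that have to be placed by hand.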

A mind map of user interests

The result is a mind map of user interests rather than a network, since I’ve chosen to write it down by hand in order to be able to change small things. For example, I would like to exclude the keywords that link different top domains together and have a tree instead. Additionally, I’ve decided to mark the words I have included from wefollow with a “wefollow: ” prefix in order to make the process more transparent for everybody. The result shows that quite a lot of the keywords that we found on wefollow were also part of the existing yahoo directory, although the directory, big as it is, did not contain a number of newer words such as “youtube”, “podcast” and so on. Additionally, concepts like “animals and pets” were added by me by hand, since they sit at a very deep level in the yahoo ontology (Science / Zoology / Animals / …) but are actually quite popular among Twitter users. So below you see the result of my work. This mind map represents a hybrid of the 200 most frequent wefollow keywords and the most popular yahoo categories. I am quite happy with the result, since it seems to be useful in describing the bag of words I had before.

I am right now collecting the communities of those users on Twitter in order to analyze them and will keep you updated about the progress.

That’s it for today.