
December 18 2013

Tweets loud and quiet

Writers who cover Twitter find the grandiose irresistible: nearly every article about the service’s IPO this fall mentioned the heroes of the Arab Spring who toppled dictators with 140-character stabs, or the size of Lady Gaga’s readership, which is larger than the population of Argentina.

But the bulk of the service is decidedly smaller-scale–a low murmur with an occasional celebrity shouting on top of it. In comparative terms, almost nobody on Twitter is somebody: the median Twitter account has a single follower. Among the much smaller subset of accounts that have posted in the last 30 days, the median account has just 61 followers. If you’ve got a thousand followers, you’re at the 96th percentile of active Twitter users. (I write “active users” to refer to publicly-viewable accounts that have posted at least once in the last 30 days; Twitter uses a more generous definition of that term, including anyone who has logged into the service.)

[Figure: You're a bigger deal on Twitter than you think]

This is a histogram of Twitter accounts by number of followers. Only accounts that have posted in the last 30 days are included.

For a few weeks this fall I had my computer probe the Twitterverse, gathering details on a random sampling of about 400,000 Twitter accounts. The profile that emerges suggests that Twitter is more a consumption medium than a conversational one–an only-somewhat-democratized successor to broadcast television, in which a handful of people wield enormous influence and everyone else chatters with a few friends on living-room couches. There are undoubtedly some influential Twitter users who would not be influential without Twitter, but I suspect that most people who have, say, 3,000 followers (the top one percent) were prominent commentators, industry experts, or gregarious accumulators of friends to begin with.

Active Twitter accounts follow a median 117 users, and the vast majority of them–76%–follow more people than follow them. That brings to mind both discussions about the mathematics of pairing and studies that suggest reciprocated friendship is both rare and valuable. Here’s the histogram from above with the distribution of the number of accounts that users follow superimposed.

[Figure: followers_following_comparison_histogram]

Not that number of followers is an indicator of quality. Twitter’s users are prone to swarms and fads; they flock to famous people as soon as they appear on Twitter, irrespective of both activity and brow height. Former New York Times editor Bill Keller amassed thousands of followers in his first months on Twitter, despite posting just eight times in 2009 (and then baffling his readers with this tweet upon reappearing on Christmas Eve in 2010). On the other end, just under one in every thousand Twitter accounts has a name that refers to Justin Bieber in some way; an additional one in every thousand refers to Bieber in its account description.

Far more inscrutable than the famous zombies are the anonymous ones, like a Wayne Rooney fan account, a skin-care promotion feed, and a fake Taylor Lautner account that each managed to amass thousands of followers with just a single tweet. (The commercial accounts of this sort are probably the result of promotions–“follow us on Twitter for a discount!”–that got no follow-up, or are the beneficiaries of bot armies hired to make a business look popular.)

Twitter is giant, and it has an outsize influence on popular and not-so-popular culture, but that influence seems due to the fact that it’s popular among influential people and provides energetic reverberation for their thoughts–and lots and lots of people who sit back and listen.

How you stack up

Percentile of active Twitter accounts   Number of followers
10                                      3
20                                      9
30                                      19
40                                      36
50                                      61
60                                      98
70                                      154
80                                      246
90                                      458
95                                      819
96                                      978
97                                      1,211
98                                      1,675
99                                      2,991
99.9                                    24,964
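To place a particular follower count against the table above, a small lookup is enough. The following is only an illustrative sketch: the function name `percentile_floor` is made up for this example, and the only data it uses are the thresholds from the table.

```python
from bisect import bisect_right

# (percentile, follower count) pairs copied from the table above.
PERCENTILES = [
    (10, 3), (20, 9), (30, 19), (40, 36), (50, 61), (60, 98),
    (70, 154), (80, 246), (90, 458), (95, 819), (96, 978),
    (97, 1211), (98, 1675), (99, 2991), (99.9, 24964),
]


def percentile_floor(followers):
    """Return the highest tabulated percentile whose follower threshold
    is at or below `followers`, or None if it is below the 10th percentile.

    Hypothetical helper: it only reads the table, so the answer is a
    lower bound, not an exact percentile.
    """
    thresholds = [count for _, count in PERCENTILES]
    idx = bisect_right(thresholds, followers)
    if idx == 0:
        return None
    return PERCENTILES[idx - 1][0]


if __name__ == "__main__":
    for n in (2, 61, 1000, 30000):
        print(n, percentile_floor(n))
```

An account with 1,000 followers lands at the 96th-percentile row, which matches the figure quoted earlier in the post.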

The technical mumbo-jumbo

Twitter assigns each account a numerical ID on creation. These IDs aren’t consecutive, but they do, with just a few exceptions, monotonically increase over time–that is, a newer account will almost always have a higher ID number than an older account. In mid-September, new accounts were being assigned IDs just under 1.9 billion.

Every few minutes, a Python script that I wrote generated a fresh list of 300 random numbers between zero and 1.9 billion and asked Twitter’s API to return basic information for the corresponding accounts. I logged the results–including empty results when an ID number didn’t correspond to any account–in a MySQL table and let the script run as a cron job for 32 days. I’ve only included accounts created before September 2013 in my analysis in order to avoid under-sampling accounts that were created during the period of data collection.
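One pass of that sampling loop might look roughly like the sketch below. It is not the original script. The `lookup` argument is a hypothetical stand-in for the API request, the batch size of 100 is an assumption, and the choice to draw distinct IDs is also an assumption; the MySQL logging step is left out.

```python
import random

MAX_ID = 1_900_000_000  # approximate highest assigned ID (figure from the post)
SAMPLE_SIZE = 300       # IDs drawn per run (figure from the post)
BATCH_SIZE = 100        # assumption: number of IDs sent per lookup call


def draw_ids(rng, n=SAMPLE_SIZE, max_id=MAX_ID):
    """Draw n distinct random candidate IDs between zero and max_id."""
    return rng.sample(range(max_id + 1), n)


def sample_once(lookup, rng):
    """Run one sampling pass and return {id: profile or None}.

    `lookup` is a hypothetical callable standing in for the API request:
    it takes a list of IDs and returns {id: profile_dict} for the IDs
    that belong to accounts. IDs with no account are recorded as None,
    so that empty results are kept as well.
    """
    ids = draw_ids(rng)
    rows = {}
    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        found = lookup(batch)
        for user_id in batch:
            rows[user_id] = found.get(user_id)
    return rows


if __name__ == "__main__":
    # Fake lookup for demonstration: pretend every even ID is an account.
    def fake_lookup(batch):
        return {i: {"id": i, "followers": 0} for i in batch if i % 2 == 0}

    rows = sample_once(fake_lookup, random.Random(0))
    hits = sum(1 for profile in rows.values() if profile is not None)
    print(len(rows), "IDs probed,", hits, "accounts found")
```

Recording the misses alongside the hits is what makes the density estimate in the next paragraph possible.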

Twitter IDs are assigned at an overall density of about 63%–that is, given an integer between zero and the highest number so far assigned, there’s a 63% chance that a Twitter account has been opened with that number at some point. That density isn’t constant over the whole range of ID numbers, though; Twitter appears to have changed its ID-assignment scheme around July 2012. Before then, Twitter assigned IDs at a density of about 86% and afterward at 49%.
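The density figures come down to simple proportions, which the following sketch illustrates. The function names are invented for this example, the probe data in the demo is made up, and `split_id` is a placeholder: the post does not say which ID marks the July 2012 change.

```python
def estimate_density(probes):
    """Fraction of probed IDs that turned out to belong to an account.

    `probes` is a list of (user_id, found) pairs, where `found` is True
    when the ID corresponded to an account.
    """
    if not probes:
        raise ValueError("no probes to estimate from")
    return sum(1 for _, found in probes if found) / len(probes)


def project_total(density, max_id):
    """Project how many IDs in [0, max_id] have ever been assigned."""
    return density * max_id


def density_by_segment(probes, split_id):
    """Estimate density separately below and at-or-above `split_id`.

    `split_id` is a hypothetical boundary chosen by the analyst.
    """
    below = [(i, f) for i, f in probes if i < split_id]
    above = [(i, f) for i, f in probes if i >= split_id]
    return estimate_density(below), estimate_density(above)


if __name__ == "__main__":
    # Made-up probes: 3 of 4 found below ID 100, 1 of 4 found above.
    demo = [(1, True), (2, True), (3, True), (4, False),
            (101, True), (102, False), (103, False), (104, False)]
    print(estimate_density(demo))
    print(density_by_segment(demo, split_id=100))
    print(project_total(0.63, 1_900_000_000))
```

With the post's own numbers, 63% of 1.9 billion works out to roughly 1.2 billion IDs assigned at some point, a count that by the definition above includes any accounts that no longer exist.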

With a large survey sample of Twitter accounts, I was able to project the size and characteristics of the Twitter ecosystem as a whole, using R and ggplot2 for my analysis.

This post was modified after publication in order to add the table of follower percentiles above.

October 22 2013

Mining the social web, again

When we first published Mining the Social Web, I thought it was one of the most important books I worked on that year. Now that we’re publishing a second edition (which I didn’t work on), I find that I agree with myself. With this new edition, Mining the Social Web is more important than ever.

While we’re seeing more and more cynicism about the value of data, and particularly “big data,” that cynicism isn’t shared by most people who actually work with data. Data has undoubtedly been overhyped and oversold, but the best way to arm yourself against the hype machine is to start working with data yourself, to find out what you can and can’t learn. And there’s no shortage of data around. Everything we do leaves a cloud of data behind it: Twitter, Facebook, Google+ — to say nothing of the thousands of other social sites out there, such as Pinterest, Yelp, Foursquare, you name it. Google is doing a great job of mining your data for value. Why shouldn’t you?

There are few better ways to learn about mining social data than by starting with Twitter; Twitter is really a ready-made laboratory for the new data scientist. And this book is without a doubt the best and most thorough approach to mining Twitter data out there. But that’s only a starting point. We hear a lot in the press about sentiment analysis and mining unstructured text data; this book shows you how to do it. If you need to mine the data in web pages or email archives, this book shows you how. And if you want to understand how people collaborate on projects, Mining the Social Web is the only place I’ve seen that analyzes GitHub data.

All of the examples in the book are available on GitHub. In addition to the example code, which is bundled into IPython notebooks, the author, Matthew Russell, has provided a VirtualBox VM that installs Python, all the libraries you need to run the examples, the examples themselves, and an IPython server. Checking out the examples is as simple as installing VirtualBox, installing Vagrant, cloning the 2nd edition’s GitHub archive, and typing “vagrant up.” (This quick start guide summarizes all of that.) You can execute the examples for yourself in the virtual machine; modify them; and use the virtual machine for your own projects, since it’s a fully functional Linux system with Python, Java, MongoDB, and other necessities pre-installed. You can view this as a book with accompanying examples in a particularly nice package, or you can view the book as “premium support” for an open source project that consists of the examples and the VM.

If you want to engage with the data that’s surrounding you, Mining the Social Web is the best place to start. Use it to learn, to experiment, and to build your own data projects.
