
February 10 2014

Four short links: 10 February 2014

  1. Bruce Sterling at transmediale 2014 (YouTube) — “if it works, it’s already obsolete.” Sterling does a great job of capturing the current time: spies in your Internet, lost trust with the BigCos, the impermanence of status quo, the need to create. (via BoingBoing)
  2. No-one Should Fork Android (Ars Technica) — this article is bang on. Google Mobile Services (the Play functionality) is closed-source; it is what makes Android more than a bare-metal OS, and it is where Google is focusing its development. Google’s Android team treats openness like a bug and routes around it.
  3. Data Pipelines (Hakkalabs) — interesting overview of the data pipelines of Stripe, Tapad, Etsy, and Square.
  4. Visualising Salesforce Data in Minecraft — would almost make me look forward to using Salesforce. Almost.

January 02 2014

The Snapchat Leak

The number of Snapchat users by area code.

The number of Snapchat users by geographic location. Users are predominately located in New York, San Francisco and the surrounding greater New York and Bay Areas.

While the site crumbled quickly under the weight of so many people trying to get to the leaked data—and has now been suspended—there isn’t really such a thing as putting the genie back in the bottle on the Internet.

Just before Christmas, the Australian-based Gibson Security published a report highlighting two exploits in the Snapchat API, claiming that hackers could easily gain access to users’ personal data. Snapchat dismissed the report, responding that,

Theoretically, if someone were able to upload a huge set of phone numbers, like every number in an area code, or every possible number in the U.S., they could create a database of the results and match usernames to phone numbers that way.

Adding that they had various “safeguards” in place to make it difficult to do that. However, it seems likely that—despite rate limiting being explicitly mentioned in the initial report four months previously—none of these safeguards included rate limiting requests to their server, because someone seems to have taken them up on their offer.

Data Release

Earlier today the creators of the now defunct SnapchatDB site released 4.6 million records—both as an SQL dump and as a CSV file. With an estimated 8 million users of the app (as of May 2013), this represents around half the Snapchat user base.

Each record consists of a Snapchat user name, a geographical location for the user, and a partially anonymised phone number, with the last two digits obscured.

While Gibson Security’s find_friends exploit has been patched by Snapchat, minor variations on the exploit are reported to still function. If this data did come from the exploit uncovered by Gibson—or a minor variation on it—then the dataset published by SnapchatDB is only part of the data the hackers now hold.

In addition to the data already released they would have the full phone number of each user, and as well as the user name they should also have the—perhaps more revealing—screen name.

Data Analysis

Taking an initial look at the data, there are no international numbers in the leaked database. All entries are US numbers, with the bulk of the users from—as you might expect—the greater New York, San Francisco and Bay areas.

However, I’d assume that the absence of international numbers is an indication of laziness rather than due to any technical limitation. For US-based hackers it would be easy to iterate rapidly through the fairly predictable US number space, while foreign number formats might present more of a challenge when writing a script to exploit the hole in Snapchat’s security.

Only 76 of the 322 area codes in the United States appear in the leaked database, alongside another two Canadian area codes, mapping to 67 discrete geographic locations—although not all the area codes and locations match suggesting that perhaps the locations aren’t derived directly from the area code data.
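Those counts are straightforward to reproduce. The snippet below is a sketch assuming an invented three-column record shape (user name, location, masked number); the sample rows are illustrative and not taken from the actual dump:

```python
from collections import Counter

# Invented rows in the shape of the leaked CSV: user name, location, masked number.
records = [
    ("jdoe123", "New York NY",      "21255501XX"),
    ("sf_user", "San Francisco CA", "41555512XX"),
    ("bk_user", "Brooklyn NY",      "71855534XX"),
    ("nyc2",    "New York NY",      "21255588XX"),
]

# North American area codes are simply the first three digits of a ten-digit number.
area_codes = Counter(phone[:3] for _, _, phone in records)
locations = {loc for _, loc, _ in records}
print(len(area_codes), len(locations))  # 3 3
```

Comparing the two tallies against each other is exactly the check that suggests the locations aren’t derived directly from the area codes.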

Despite some initial scepticism about the provenance of the data I’ve confirmed that this is a real data set. A quick trawl through the data has got multiple hits amongst my own friend group, including some I didn’t know were on Snapchat—sorry guys.

Since the last two digits of each phone number are obscured in the leaked dataset, the partial phone number string might—and frequently does—generate multiple matches amongst the 4.6 million records when compared against a given number.

I compared the several hundred US phone numbers amongst my own contacts against the database—you might want to do that yourself—and generated several spurious hits where the returned user names didn’t really seem to map in any way to my contact. That said, as I already mentioned, I found several of my own friends amongst the leaked records, although I only knew it was them for sure because I knew both their phone number and typical choices of user names.
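A minimal sketch of that comparison, using invented records and contact numbers (the helper names are mine, not from any released tooling): each contact is reduced to its ten significant digits, and the last two digits are replaced to match the masked form.

```python
import re

def normalize(number):
    """Reduce a US phone number to its ten significant digits."""
    digits = re.sub(r"\D", "", number)
    return digits[-10:]

def match_contacts(contacts, leaked):
    """For each contact number, return the leaked user names whose masked
    number (last two digits replaced by 'XX') is consistent with it."""
    index = {}
    for username, masked in leaked:
        index.setdefault(masked, []).append(username)
    hits = {}
    for number in contacts:
        masked_form = normalize(number)[:-2] + "XX"
        if masked_form in index:
            hits[number] = index[masked_form]
    return hits

# Invented records in the shape of the leak: (user name, masked phone number).
leaked = [("jdoe123", "21255501XX"), ("someoneelse", "21255501XX"),
          ("sf_user", "41555512XX")]
print(match_contacts(["(212) 555-0123", "415-555-9999"], leaked))
# {'(212) 555-0123': ['jdoe123', 'someoneelse']}
```

The two-record hit for a single contact shows why the masked digits produce spurious matches: one hundred real numbers collapse onto each masked form.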


As it stands, therefore, this data release is not—yet—critical, although it is certainly concerning, and for some individuals it might well be unfortunate. However, if the SnapchatDB creators choose to release their full dataset, things might well get a lot more interesting.

If the full data set was released to the public, or obtained by a malicious third party, then the username, geographic location, phone number, and screen name—which might, for a lot of people, be their actual full name—would be available.

This eventuality would be bad enough. However taking this data and cross-correlating it with another large corpus of data, say from Twitter or Gravatar, by trying to find matching user or real names on those services—people tend to reuse usernames on multiple services after all—you might end up with a much larger aggregated data set including email addresses, photographs, and personal information.

While there would be enough false positives—if matching solely against user names—that you’d have an interesting data cleaning task afterwards, it wouldn’t be impossible. Possibly not even that difficult.

I’m not interested in doing that correlation myself. But others will.


December 18 2013

Tweets loud and quiet

Writers who cover Twitter find the grandiose irresistible: nearly every article about the service’s IPO this fall mentioned the heroes of the Arab Spring who toppled dictators with 140-character stabs, or the size of Lady Gaga’s readership, which is larger than the population of Argentina.

But the bulk of the service is decidedly smaller-scale–a low murmur with an occasional celebrity shouting on top of it. In comparative terms, almost nobody on Twitter is somebody: the median Twitter account has a single follower. Among the much smaller subset of accounts that have posted in the last 30 days, the median account has just 61 followers. If you’ve got a thousand followers, you’re at the 96th percentile of active Twitter users. (I write “active users” to refer to publicly-viewable accounts that have posted at least once in the last 30 days; Twitter uses a more generous definition of that term, including anyone who has logged into the service.)

You're a bigger deal on Twitter than you think

This is a histogram of Twitter accounts by number of followers. Only accounts that have posted in the last 30 days are included.

For a few weeks this fall I had my computer probe the Twitterverse, gathering details on a random sampling of about 400,000 Twitter accounts. The profile that emerges suggests that Twitter is more a consumption medium than a conversational one–an only-somewhat-democratized successor to broadcast television, in which a handful of people wield enormous influence and everyone else chatters with a few friends on living-room couches. There are undoubtedly some influential Twitter users who would not be influential without Twitter, but I suspect that most people who have, say, 3,000 followers (the top one percent) were prominent commentators, industry experts, or gregarious accumulators of friends to begin with.

Active Twitter accounts follow a median 117 users, and the vast majority of them–76%–follow more people than follow them. Which brings to mind both discussions about the mathematics of pairing and studies that suggest reciprocated friendship is both rare and valuable. Here’s the histogram from above with the distribution of number of accounts that users follow superimposed.


Not that number of followers is an indicator of quality. Twitter’s users are prone to swarms and fads; they flock to famous people as soon as they appear on Twitter, irrespective of both activity and brow height. Former New York Times editor Bill Keller amassed thousands of followers in his first months on Twitter, despite posting just eight times in 2009 (and then baffling his readers with this tweet upon reappearing on Christmas Eve in 2010). On the other end, just under one in every thousand Twitter accounts has a name that refers to Justin Bieber in some way; an additional one in every thousand refers to Bieber in its account description.

Far more inscrutable than the famous zombies are the anonymous ones, like a Wayne Rooney fan account, a skin-care promotion feed, and a fake Taylor Lautner account that each managed to amass thousands of followers with just a single tweet. (The commercial accounts of this sort are probably the result of promotions–“follow us on Twitter for a discount!”–that got no follow-up, or are the beneficiaries of bot armies hired to make a business look popular.)

Twitter is giant, and it has an outsize influence on popular and not-so-popular culture, but that influence seems due to the fact that it’s popular among influential people and provides energetic reverberation for their thoughts–and lots and lots of people who sit back and listen.

How you stack up

Percentile of active Twitter accounts    Number of followers
10                                       3
20                                       9
30                                       19
40                                       36
50                                       61
60                                       98
70                                       154
80                                       246
90                                       458
95                                       819
96                                       978
97                                       1,211
98                                       1,675
99                                       2,991
99.9                                     24,964
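The figures above can be turned into a rough lookup. `percentile_of` is a hypothetical helper that returns the highest tabulated percentile at or below a given follower count:

```python
from bisect import bisect_right

# (percentile, followers) pairs for active accounts, from the table above.
PERCENTILES = [(10, 3), (20, 9), (30, 19), (40, 36), (50, 61), (60, 98),
               (70, 154), (80, 246), (90, 458), (95, 819), (96, 978),
               (97, 1211), (98, 1675), (99, 2991), (99.9, 24964)]

def percentile_of(followers):
    """Highest tabulated percentile whose follower count does not exceed
    the given count; a coarse lookup rather than an interpolation."""
    counts = [c for _, c in PERCENTILES]
    i = bisect_right(counts, followers)
    return PERCENTILES[i - 1][0] if i else 0

print(percentile_of(61))    # 50
print(percentile_of(1000))  # 96
```

A thousand followers lands between the tabulated 96th and 97th percentiles, matching the "96th percentile" claim in the text.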

The technical mumbo-jumbo

Twitter assigns each account a numerical ID on creation. These IDs aren’t consecutive, but they do, with just a few exceptions, monotonically increase over time–that is, a newer account will always have a higher ID number than an older account. In mid-September, new accounts were being assigned IDs just under 1.9 billion.

Every few minutes, a Python script that I wrote generated a fresh list of 300 random numbers between zero and 1.9 billion and asked Twitter’s API to return basic information for the corresponding accounts. I logged the results–including empty results when an ID number didn’t correspond to any account–in a MySQL table and let the script run on a cronjob for 32 days. I’ve only included accounts created before September 2013 in my analysis in order to avoid under-sampling accounts that were created during the period of data collection.
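The collection loop can be sketched as follows; `sample_accounts` and the stub `lookup` are stand-ins of my own (the real script called Twitter's API and logged to MySQL), but the shape of the sampling is the same:

```python
import random

ID_CEILING = 1_900_000_000  # roughly the highest ID assigned in mid-September 2013
BATCH_SIZE = 300

def random_id_batch():
    """One batch of candidate account IDs, drawn uniformly at random."""
    return [random.randrange(ID_CEILING) for _ in range(BATCH_SIZE)]

def sample_accounts(lookup, batches):
    """Run `batches` rounds of sampling; `lookup` maps a list of IDs to the
    account records that exist (returning nothing for unassigned IDs)."""
    found, tried = [], 0
    for _ in range(batches):
        ids = random_id_batch()
        tried += len(ids)
        found.extend(lookup(ids))
    return found, tried

# A stub lookup that "finds" 63% of IDs reproduces the overall assignment
# density reported below.
stub = lambda ids: [i for i in ids if i % 100 < 63]
random.seed(42)
found, tried = sample_accounts(stub, 100)
print(len(found) / tried)  # hit rate close to 0.63
```

Because the IDs are drawn uniformly, the fraction of batches that come back non-empty directly estimates the assignment density, which is what makes the projection to the whole ecosystem valid.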

Twitter IDs are assigned at an overall density of about 63%–that is, given an integer between zero and the highest number so far assigned, there’s a 63% chance that a Twitter account has been opened with that number at some point. That density isn’t constant over the whole range of ID numbers, though; Twitter appears to have changed its ID-assignment scheme around July 2012. Before then, Twitter assigned IDs at a density of about 86% and afterward at 49%.
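As a back-of-envelope consistency check, assuming the two densities were constant on either side of a single changeover ID b, the overall 63% figure pins down roughly where the scheme changed:

```python
# If IDs below a changeover point b were assigned at 86% density and those
# above it at 49%, then (0.86 * b + 0.49 * (CEILING - b)) / CEILING = 0.63.
CEILING = 1.9e9
b = (0.63 - 0.49) / (0.86 - 0.49) * CEILING
print(round(b / 1e9, 2))  # changeover near ID 0.72 billion
```

That places the change a bit past a third of the way up the ID range, consistent with the July 2012 date given that assignment was denser (and so slower-moving through the range) before the switch.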

With a large survey sample of Twitter accounts, I was able to project the size and characteristics of the Twitter ecosystem as a whole, using R and ggplot2 for my analysis.

This post was modified after publication in order to add the table of follower percentiles above.

December 16 2013

Four short links: 16 December 2013

  1. Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
  2. NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
  3. Proofist — crowdsourced proofreading game.
  4. 3D-Printed Shoes (YouTube) — LeWeb talk from the founder of the company, Continuum Fashion. (via Brady Forrest)

December 12 2013

Four short links: 12 December 2013

  1. iBeacons — Bluetooth LE enabling tighter coupling of physical world with digital. I’m enamoured with the interaction possibilities: The latest Apple TV software brought a fantastically clever workaround. You just tap your iPhone to the Apple TV itself, and it passes your Wi-Fi and iTunes credentials over and sets everything up instantaneously.
  2. Better and Better Keyboards (Jesse Vincent) — It suffered from the same problem as every other 3D-printed keyboard I’d made to date – When I showed it to someone, they got really excited about the fact that I had a 3D printer. In contrast, whenever I showed someone one of the layered acrylic prototype keyboards I’d built, they got excited about the keyboard.
  3. — open source modular web service for dataset storage and retrieval.
  4. state.js — Open source JavaScript state machine supporting most UML 2 features.

December 04 2013

Four short links: 4 December 2013

  1. Skyjack — drone that takes over other drones. Welcome to the Malware of Things.
  2. Bootstrap World — a curricular module for students ages 12-16, which teaches algebraic and geometric concepts through computer programming. (via Esther Wojcicki)
  3. Harvest — open source BSD-licensed toolkit for building web applications for integrating, discovering, and reporting data. Designed for biomedical data first. (via Mozilla Science Lab)
  4. Project ILIAD — crowdsourced antibiotic discovery.

November 28 2013

November 06 2013

Four short links: 6 November 2013

  1. Apple Transparency Report (PDF) — contains a warrant canary: the statement “Apple has never received an order under Section 215 of the USA Patriot Act. We would expect to challenge an order if served on us,” which will of course be removed if one of the secret orders is received. Bravo, Apple, for implementing a clever hack to route around excessive secrecy. (via Boing Boing)
  2. You’re Probably Polluting Your Statistics More Than You Think — it is insanely easy to find phantom correlations in random data without obviously being foolish. Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this. (via Stijn Debrouwere)
  3. CyPhy Funded (Quartz) — the second act of iRobot co-founder Helen Greiner, maker of the famed Roomba robot vacuum cleaner. She terrified ETech long ago—the audience were expecting Roomba cuteness and got a keynote about military deathbots. It would appear she’s still in the deathbot niche, not so much with the cute. Remember this when you build your OpenCV-powered recoil-resistant load-bearing-hoverbot and think it’ll only ever be used for the intended purpose of launching fertiliser pellets into third world hemp farms.
  4. User-Agent String History — a light-hearted illustration of why the formal semantic value of free-text fields is driven to zero in the face of actual use.

November 05 2013

Four short links: 5 November 2013

  1. Influx DB — open-source, distributed time series, events, and metrics database with no external dependencies.
  2. Omega (PDF) — flexible, scalable schedulers for large compute clusters. From Google Research.
  3. GraspJS — Search and replace your JavaScript code based on its structure rather than its text.
  4. Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.

October 28 2013

Four short links: 28 October 2013

  1. A Cyber Attack Against Israel Shut Down a Road — The hackers targeted the Tunnels’ camera system which put the roadway into an immediate lockdown mode, shutting it down for twenty minutes. The next day the attackers managed to break in for even longer during the heavy morning rush hour, shutting the entire system for eight hours. Because all that is digital melts into code, and code is an unsolved problem.
  2. Random Decision Forests (PDF) — “Due to the nature of the algorithm, most Random Decision Forest implementations provide an extraordinary amount of information about the final state of the classifier and how it derived from the training data.” (via Greg Borenstein)
  3. BITalino — 149 Euro microcontroller board full of physiological sensors: muscles, skin conductivity, light, acceleration, and heartbeat. A platform for healthcare hardware hacking?
  4. How to Be a Programmer — a braindump from a guru.

October 25 2013

Four short links: 25 October 2013

  1. Seagate Kinetic Storage — In the words of Geoff Arnold: The physical interconnect to the disk drive is now Ethernet. The interface is a simple key-value object oriented access scheme, implemented using Google Protocol Buffers. It supports key-based CRUD (create, read, update and delete); it also implements third-party transfers (“transfer the objects with keys X, Y and Z to the drive with IP address”). Configuration is based on DHCP, and everything can be authenticated and encrypted. The system supports a variety of key schemas to make it easy for various storage services to shard the data across multiple drives.
  2. Masters of Their Universe (Guardian) — well-written and fascinating story of the creation of the Elite game (one founder of which went on to make the Raspberry Pi). The classic action game of the early 1980s – Defender, Pac Man – was set in a perpetual present tense, a sort of arcade Eden in which there were always enemies to zap or gobble, but nothing ever changed apart from the score. By letting the player tool up with better guns, Bell and Braben were introducing a whole new dimension, the dimension of time.
  3. Micropolar (github) — A tiny polar charts library made with D3.js.
  4. Introduction to R (YouTube) — 21 short videos from Google.

October 24 2013

Four short links: 24 October 2013

  1. Visually Programming Arduino — good for little minds.
  2. Rapid Hardware Iteration at Scale (Forbes) — It’s part of the unique way that Xiaomi operates, closely analyzing the user feedback it gets on its smartphones and following the suggestions it likes for the next batch of 100,000 phones. It releases them every Tuesday at noon Beijing time.
  3. Machine Learning of Hierarchical Clustering to Segment 2D and 3D Images (PLoS One) — We propose an active learning approach for performing hierarchical agglomerative segmentation from superpixels. Our method combines multiple features at all scales of the agglomerative process, works for data with an arbitrary number of dimensions, and scales to very large datasets.
  4. Kratu — an Open Source client-side analysis framework to create simple yet powerful renditions of data. It allows you to dynamically adjust your view of the data to highlight issues, opportunities and correlations in the data.

October 22 2013

Four short links: 22 October 2013

  1. Sir Trevor — nice rich-text editing. Interesting how Markdown has become the way to store formatted text without storing HTML (and thus exposing the CSRF-inducing HTML-escaping stuckfastrophe).
  2. Slate for Excel — visualising spreadsheet structure. I’d be surprised if it took MSFT or Goog 30 days to acquire them.
  3. Project Shield — Google project to protect against DDoSes.
  4. Digital Attack Map — DDoS attacks going on around the world. (via Jim Stogdill)

October 04 2013

GDELT: What can we learn from the last 200 million things that happened in the world? | War of Ideas

The excitement over Global Data on Events, Location, and Tone—to give GDELT its full name—is understandable. The singularly ambitious project could have a transformative effect on how we use data to understand and anticipate political events.

Essentially, GDELT is a massive list of important political events that have happened — more than 200 million and counting — identified by who did what to whom, when and where, drawn from news accounts and assembled entirely by software. Everything from a riot over food prices in Khartoum, to a suicide bombing in Sri Lanka, to a speech by the president of Paraguay goes into the system.

Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS, but for a publicly accessible program (you can download it here, though you’ll need some programming skills to use it) GDELT is unprecedented in its geographic and historic scale. The database updates with new events every night following the day’s news, and while it currently goes back to 1979, its developers are working on adding events going back as far as 1800, according to lead author Kalev Leetaru, a fellow at the University of Illinois Graduate School of Library and Information Science.

#histoire #politique #conflits #data via @francoisbriatte


GDELT package for #R

Guardian datablog

Quantifying memory

Mapping with GDELT

Mapping Syria’s conflict

September 29 2013

N.S.A. Gathers Data on Social Connections of U.S. Citizens

Since 2010, the National Security Agency has been exploiting its huge collections of data to create sophisticated graphs of some Americans’ social connections that can identify their associates, their locations at certain times, their traveling companions and other personal information, according to newly disclosed documents and interviews with officials.

#snowden #nsa #surveillance #data

September 27 2013

Qatar’s migrants: how have they changed the country?

Qatar has become almost unrecognisable from the tiny nation it once was. We look at the data to find out how migration changed everything and what happens when a nation swells so quickly.

The real answer lies in Qatar’s migrant population, otherwise bluntly referred to in government statistics as ’non-Qataris’. In terms of rights, migrants might not be powerful - but in numbers they are.

’Qataris’ in work: 71,076
’Non-Qataris’ in work: 1,199,107

That means immigrants make up an astounding 94% of Qatar’s workforce, and 70% of its total population.
#data #Qatar #migration #travail

September 25 2013


Seenthis and Big Data

With a few hundred active contributors and tens of thousands of posts—flagged articles, blogs, image sites, radio programmes, improbable blogs and more plausible ones; news caught on the fly, conference summaries, translations of articles, syntheses...

Isn’t Seenthis about to become—if it isn’t already—an enormous (and very rich) body of big data?

What intelligent things could we do with this collection, with this knowledge? How can we “process” it so that nothing is lost and everything remains easily searchable (I admit there are things I can’t find again, though it may just be that I’m hopeless and don’t really know how to use the search engine)?

Wouldn’t there be some fine syntheses to write on the themes that have attracted genuine collective contributions (if you see what I mean...)?

#seenthis #données #data #big_data #données_quantitatives #données_qualitatives #savoir #connaissance

September 22 2013

September 05 2013

BBC News - Census consultation has option to replace 200-year-old survey

BBC News - Census consultation has option to replace 200-year-old survey

The census faces its biggest shake-up in its 200-year history under Office for National Statistics proposals.

An online survey could replace the study - carried out every 10 years - or information could instead be collated using data already held by government.

The plans will be fleshed out and put out to consultation this month before Parliament makes a decision in 2014.

#royaume-uni #data #recensement #démographie

August 27 2013

On the need for research on Citizen's data, big and small | geosocialite


NB This post is not about Citizen Science, but about the data trail that each and everyone generates, willingly or not, volunteered or not. It’s also a bit longer than usual. And yes, of course I focus on geographic data.
Isn’t there already a “Big Citizen Data” research bandwagon?

Yes, indeed, that’s true. There is a large and still rapidly growing body of research on the collection, analysis and utility of information from Citizens. The labels are just as diverse as the research, and include volunteered geographic information, neogeography, user-generated geographic content, or crowdsourced mapping – and that’s the geospatial domain only! The objectives range from improving humanitarian assistance for those in imminent danger and need, to improving your dinner experience by removing spam from peer rating platforms.

What I am missing, though, is research that explicitly aims to help Citizens in protecting their political rights and their ability to determine what information on them is available to whom. Call it critical geographic information science or counter mapping 2.0. (btw, I would be delighted by comments that prove me wrong on this one!).

#data #big_data #données #citoyens #statistiques
