Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

February 26 2014

Four short links: 26 February 2014

  1. Librarybox 2.0fork of PirateBox for the TP-Link MR 3020, customized for educational, library, and other needs. Wifi hotspot with free and anonymous file sharing. v2 adds mesh networking and more. (via BoingBoing)
  2. Chicago PD’s Using Big Data to Justify Racial Profiling (Cory Doctorow) — The CPD refuses to share the names of the people on its secret watchlist, nor will it disclose the algorithm that put it there. [...] Asserting that you’re doing science but you can’t explain how you’re doing it is a nonsense on its face. Spot on.
  3. Cloudwash (BERG) — very good mockup of how and why your washing machine might be connected to the net and bound to your mobile phone. No face on it, though. They’re losing their touch.
  4. What’s Left of Nokia to Bet on Internet of Things (MIT Technology Review) — With the devices division gone, the Advanced Technologies business will cut licensing deals and perform advanced R&D with partners, with around 600 people around the globe, mainly in Silicon Valley and Finland. Hopefully will not devolve into being a patent troll. [...] “We are now talking about the idea of a programmable world. [...] If you believe in such a vision, as I do, then a lot of our technological assets will help in the future evolution of this world: global connectivity, our expertise in radio connectivity, materials, imaging and sensing technologies.”

February 19 2014

Four short links: 19 February 2014

  1. 1746 Slippy Map of London — very nice use of Google Maps to recontextualise historic maps. (via USvTh3m)
  2. TPP Comic — the comic explaining TPP that you’ve been waiting for. (via BoingBoing)
  3. Synthetic Biology Investor’s Lament — some hypotheses about why synbio is so slow to fire.
  4. vizcities — open source 3D (OpenGL) city and data visualisation platform, using open data.

February 17 2014

Four short links: 17 February 2014

  1. imsg — use iMessage from the commandline.
  2. Facebook Data Science Team Posts About Love — I tell people, “this is what you look like to SkyNet.”
  3. A System for Detecting Software Plagiarism — the research behind the undergraduate bete noir.
  4. 3D GIFs — this is awesome because brain.

February 10 2014

Four short links: 10 February 2014

  1. Bruce Sterling at transmediale 2014 (YouTube) — “if it works, it’s already obsolete.” Sterling does a great job of capturing the current time: spies in your Internet, lost trust with the BigCos, the impermanence of status quo, the need to create. (via BoingBoing)
  2. No-one Should Fork Android (Ars Technica) — this article is bang on. Google Mobile Services (the Play functionality) is closed-source, what makes Android more than a bare-metal OS, and is where G is focusing its development. Google’s Android team treats openness like a bug and routes around it.
  3. Data Pipelines (Hakkalabs) — interesting overview of the data pipelines of Stripe, Tapad, Etsy, and Square.
  4. Visualising Salesforce Data in Minecraft — would almost make me look forward to using Salesforce. Almost.

February 05 2014

Four short links: 5 February 2014

  1. sigma.js — Javascript graph-drawing library (node-edge graphs, not charts).
  2. DARPA Open Catalog — all the open source published by DARPA. Sweet!
  3. Quantified Vehicle Meetup — Boston meetup around intelligent automotive tech including on-board diagnostics, protocols, APIs, analytics, telematics, apps, software and devices.
  4. AT&T See Future In Industrial Internet — partnering with GE, M2M-related customers increased by more than 38% last year. (via Jim Stogdill)

January 28 2014

Four short links: 28 January 2014

  1. Intel On-Device Voice Recognition (Quartz) — interesting because the tension between client-side and server-side functionality is still alive and well. Features migrate from core to edge and back again as cycles, data, algorithms, and responsiveness expectations change.
  2. Meet Microsoft’s Personal Assistant (Bloomberg) — total information awareness assistant. By Seeing, Hearing, and Knowing All, in the future even elevators will be trying to read our minds. (via The Next Web)
  3. Microsoft Contributes Cloud Server Designs to Open Compute ProjectAs part of this effort, Microsoft Open Technologies Inc. is open sourcing the software code we created for the management of hardware operations, such as server diagnostics, power supply and fan control. We would like to help build an open source software community within OCP as well. (via Data Center Knowledge)
  4. Open Tissue Wiki — open source (ZLib license) generic algorithms and data structures for rapid development of interactive modeling and simulation.

January 22 2014

Four short links: 22 January 2014

  1. How a Math Genius Hacked OkCupid to Find True Love (Wired) — if he doesn’t end up working for OK Cupid, productising this as a new service, something is wrong with the world.
  2. Humin: The App That Uses Context to Enable Better Human Connections (WaPo) — Humin is part of a growing trend of apps and services attempting to use context and anticipation to better serve users. The precogs are coming. I knew it.
  3. Spoiled Onions — analysis identifying bad actors in the Tor network, Since September 2013, we discovered several malicious or misconfigured exit relays[...]. These exit relays engaged in various attacks such as SSH and HTTPS MitM, HTML injection, and SSL stripping. We also found exit relays which were unintentionally interfering with network traffic because they were subject to DNS censorship.
  4. My Mind (Github) — a web application for creating and managing Mind maps. It is free to use and you can fork its source code. It is distributed under the terms of the MIT license.

January 21 2014

Four short links: 21 January 2014

  1. On Being a Senior Engineer (Etsy) — Mature engineers know that no matter how complete, elegant, or superior their designs are, it won’t matter if no one wants to work alongside them because they are assholes.
  2. Control Theory (Coursera) — Learn about how to make mobile robots move in effective, safe, predictable, and collaborative ways using modern control theory. (via DIY Drones)
  3. US Moves Towards Open Access (WaPo) — Congress passed a budget that will make about half of taxpayer-funded research available to the public.
  4. NHS Patient Data Available for Companies to Buy (The Guardian) — Once live, organisations such as university research departments – but also insurers and drug companies – will be able to apply to the new Health and Social Care Information Centre (HSCIC) to gain access to the database, called If an application is approved then firms will have to pay to extract this information, which will be scrubbed of some personal identifiers but not enough to make the information completely anonymous – a process known as “pseudonymisation”. Recipe for disaster as it has been repeatedly shown that it’s easy to identify individuals, given enough scrubbed data. Can’t see why the NHS just doesn’t make it an app in Facebook. “Nat’s Prostate status: it’s complicated.”

January 15 2014

Four short links: 15 January 2014

  1. Hackers Gain ‘Full Control’ of Critical SCADA Systems (IT News) — The vulnerabilities were discovered by Russian researchers who over the last year probed popular and high-end ICS and supervisory control and data acquisition (SCADA) systems used to control everything from home solar panel installations to critical national infrastructure. More on the Botnet of Things.
  2. mclMarkov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs.
  3. Facebook to Launch Flipboard-like Reader (Recode) — what I’d actually like to see is Facebook join the open web by producing and consuming RSS/Atom/anything feeds, but that’s a long shot. I fear it’ll either limit you to whatever circle-jerk-of-prosperity paywall-penetrating content-for-advertising-eyeballs trades the Facebook execs have made, or else it’ll be a leech on the scrotum of the open web by consuming RSS without producing it. I’m all out of respect for empire-builders who think you’re a fool if you value the open web. AOL might have died, but its vision of content kings running the network is alive and well in the hands of Facebook and Google. I’ll gladly post about the actual product launch if it is neither partnership eyeball-abuse nor parasitism.
  4. Map Projections Illustrated with a Face (Flowing Data) — really neat, wish I’d had these when I was getting my head around map projections.

January 08 2014

Four short links: 8 January 2014

  1. Launching the Wolfram Connected Devices Project — Wolfram Alpha is cognition-as-a-service, which they hope to embed in devices. This data-powered Brain-in-the-Cloud play will pit them against Google, but G wants to own the devices and the apps and the eyeballs that watch them … interesting times ahead!
  2. How the USA Almost Killed the Internet (Wired) — “At first we were in an arms race with sophisticated criminals,” says Eric Grosse, Google’s head of security. “Then we found ourselves in an arms race with certain nation-state actors [with a reputation for cyberattacks]. And now we’re in an arms race with the best nation-state actors.”
  3. Intel Edison — SD-card sized, with low-power 22nm 400MHz Intel Quark processor with two cores, integrated Wi-Fi and Bluetooth.
  4. N00b 2 L33t, Now With Graphs (Tom Stafford) — open science research validating many of the findings on learning, tested experimentally via games. In the present study, we analyzed data from a very large sample (N = 854,064) of players of an online game involving rapid perception, decision making, and motor responding. Use of game data allowed us to connect, for the first time, rich details of training history with measures of performance from participants engaged for a sustained amount of time in effortful practice. We showed that lawful relations exist between practice amount and subsequent performance, and between practice spacing and subsequent performance. Our methodology allowed an in situ confirmation of results long established in the experimental literature on skill acquisition. Additionally, we showed that greater initial variation in performance is linked to higher subsequent performance, a result we link to the exploration/exploitation trade-off from the computational framework of reinforcement learning.

January 02 2014

The Snapchat Leak

The number of Snapchat users by area code.The number of Snapchat users by area code.

The number of Snapchat users by geographic location. Users are predominately located in New York, San Francisco and the surrounding greater New York and Bay Areas.

While the site crumbled quickly under the weight of so many people trying to get to the leaked data—and has now been suspended—there isn’t really such a thing as putting the genie back in the bottle on the Internet.

Just before Christmas the Australian based Gibson Security published a report highlighting two exploits in the Snapchat API claiming that hackers could easily gain access to users’ personal data. Snapchat dismissed the report, responding that,

Theoretically, if someone were able to upload a huge set of phone numbers, like every number in an area code, or every possible number in the U.S., they could create a database of the results and match usernames to phone numbers that way.

Adding that they had various “safeguards” in place to make it difficult to do that. However it seems likely that—despite being explicitly mentioned in the initial report four months previously—none of these safeguards included rate limiting requests to their server, because someone seems to have taken them up on their offer.

Data Release

Earlier today the creators of the now defunct SnapchatDB site released 4.6 million records—both as an SQL dump and as a CSV file. With an estimated 8 million users (May, 2013) of the app this represents around half the Snapchat user base.

Each record consists of a Snapchat user name, a geographical location for the user, and partially anonymised phone number—the last two digits of the phone number having been obscured.

While Gibson Security’s find_friends exploit has been patched by Snapchat, minor variations on the exploit are reported to still function, and if this data did come from the exploit—or a minor variation on it—uncovered by Gibson, then the dataset published by SnapchatDB is only part of the data the hackers now hold.

In addition to the data already released they would have the full phone number of each user, and as well as the user name they should also have the—perhaps more revealing—screen name.

Data Analysis

Taking an initial look at the data, there are no international numbers in the leaked database. All entries are US numbers, with the bulk of the users from—as you might expect—the greater New York, San Francisco and Bay areas.

However I’d assume that the absence of international numbers is  an indication of laziness rather than due to any technical limitation. For US based hackers it would be easy to iterate rapidly through the fairly predictable US number space, while foreign” numbers formats might present more of a challenge when writing a script to exploit the hole in Snapchat’s security.

Only 76 of the 322 area codes in the United States appear in the leaked database, alongside another two Canadian area codes, mapping to 67 discrete geographic locations—although not all the area codes and locations match suggesting that perhaps the locations aren’t derived directly from the area code data.

Despite some initial scepticism about the provenance of the data I’ve confirmed that this is a real data set. A quick trawl through the data has got multiple hits amongst my own friend group, including some I didn’t know were on Snapchat—sorry guys.

Since the last two digits were obscured in the leaked dataset are obscured the partial phone number string might—and frequently does—generate multiple matches amongst the 4.6 million records against a comparison number.

I compared the several hundred US phone numbers amongst my own contacts against the database—you might want to do that yourself—and generated several spurious hits where the returned user names didn’t really seem to map in any way to my contact. That said, as I already mentioned, I found several of my own friends amongst the leaked records, although I only knew it was them for sure because I knew both their phone number and typical choices of user names.


As it stands therefore this data release is not—yet—critical, although it is certainly concerning, and for some individuals it might well be unfortunate. However if the SnapchatDB creators choose to release their full dataset things might well get a lot more interesting.

If the full data set was released to the public, or obtained by a malicious third party, then the username, geographic location, phone number, and screen name—which might, for a lot of people, be their actual full name—would be available.

This eventuality would be bad enough. However taking this data and cross-correlating it with another large corpus of data, say from Twitter or Gravatar, by trying to find matching user or real names on those services—people tend to reuse usernames on multiple services after all—you might end up with a much larger aggregated data set including email addresses, photographs, and personal information.

While there would be enough false positives—if matching solely against user names—that you’d have a interesting data cleaning task afterwards, it wouldn’t be impossible. Possibly not even that difficult.

I’m not interested in doing that correlation myself. But others will.


December 26 2013

Four short links: 26 December 2013

  1. Nest Protect Teardown (Sparkfun) — initial teardown of another piece of domestic industrial Internet.
  2. LogsThe distributed log can be seen as the data structure which models the problem of consensus. Not kidding when he calls it “real-time data’s unifying abstraction”.
  3. Mining the Web to Predict Future Events (PDF) — Mining 22 years of news stories to predict future events. (via Ben Lorica)
  4. Nanocubesa fast datastructure for in-memory data cubes developed at the Information Visualization department at AT&T Labs – Research. Nanocubes can be used to explore datasets with billions of elements at interactive rates in a web browser, and in some cases it uses sufficiently little memory that you can run a nanocube in a modern-day laptop. (via Ben Lorica)

December 16 2013

Four short links: 16 December 2013

  1. Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
  2. NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
  3. Proofist — crowdsourced proofreading game.
  4. 3D-Printed Shoes (YouTube) — LeWeb talk from founder of the company, Continuum Fashion). (via Brady Forrest)

December 10 2013

Four short links: 10 December 2013

  1. ArangoDBopen-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript extensions.
  2. Google’s Seven Robotics Companies (IEEE) — The seven companies are capable of creating technologies needed to build a mobile, dexterous robot. Mr. Rubin said he was pursuing additional acquisitions. Rundown of those seven companies.
  3. Hebel (Github) — GPU-Accelerated Deep Learning Library in Python.
  4. What We Learned Open Sourcing — my eye was caught by the way they offered APIs to closed source code, found and solved performance problems, then open sourced the fixed code.

December 09 2013

Four short links: 9 December 2013

  1. Reform Government Surveillance — hard not to view this as a demarcation dispute. “Ruthlessly collecting every detail of online behaviour is something we do clandestinely for advertising purposes, it shouldn’t be corrupted because of your obsession over national security!”
  2. Brian Abelson — Data Scientist at the New York Times, blogging what he finds. He tackles questions like what makes a news app “successful” and how might we measure it. Found via this engaging interview at the quease-makingly named Content Strategist.
  3. StageXL — Flash-like 2D package for Dart.
  4. BayesDBlets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. Open source.

December 06 2013

Four short links: 6 December 2013

  1. Society of Mind — Marvin Minsky’s book now Creative-Commons licensed.
  2. Collaboration, Stars, and the Changing Organization of Science: Evidence from Evolutionary BiologyThe concentration of research output is declining at the department level but increasing at the individual level. [...] We speculate that this may be due to changing patterns of collaboration, perhaps caused by the rising burden of knowledge and the falling cost of communication, both of which increase the returns to collaboration. Indeed, we report evidence that the propensity to collaborate is rising over time. (via Sciblogs)
  3. As Engineers, We Must Consider the Ethical Implications of our Work (The Guardian) — applies to coders and designers as well.
  4. Eyewire — a game to crowdsource the mapping of 3D structure of neurons.

December 05 2013

Four short links: 5 December 2013

  1. DeducerAn R Graphical User Interface (GUI) for Everyone.
  2. Integration of Civil Unmanned Aircraft Systems (UAS) in the National Airspace System (NAS) Roadmap (PDF, FAA) — first pass at regulatory framework for drones. (via Anil Dash)
  3. Bitcoin Stats — $21MM traded, $15MM of electricity spent mining. Goodness. (via Steve Klabnik)
  4. iOS vs Android Numbers (Luke Wroblewski) — roundup comparing Android to iOS in recent commerce writeups. More Android handsets, but less revenue per download/impression/etc.

December 03 2013

Four short links: 3 December 2013

  1. SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
  2. madliban open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
  3. Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
  4. Disguise Detection — using Raspberry Pi, Arduino, and Python.

November 11 2013

Four short links: 11 November 2013

  1. Living Light — 3D printed cephalopods filled with bioluminescent bacteria. PAGING CORY DOCTOROW, YOUR ORGASMATRON HAS ARRIVED. (via Sci Blogs)
  2. Repacking Lego Batteries with a CNC Mill — check out the video. Patrick programmed a CNC machine to drill out the rivets holding the Mindstorms battery pack together. Coding away a repetitive task like this is gorgeous to see at every scale. We don’t have to teach our kids a particular programming language, but they should know how to automate cruft.
  3. My Thoughts on Google+ (YouTube) — when your fans make hatey videos like this one protesting Google putting the pig of Google Plus onto the lipstick that was YouTube, you are Doin’ It Wrong.
  4. Presto: Interacting with Petabytes of Data at Facebooka distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. For details, see the Facebook post about its launch.

November 05 2013

Four short links: 5 November 2013

  1. Influx DBopen-source, distributed, time series, events, and metrics database with no external dependencies.
  2. Omega (PDF) — ���exible, scalable schedulers for large compute clusters. From Google Research.
  3. GraspJSSearch and replace your JavaScript code based on its structure rather than its text.
  4. Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.
Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!