February 24 2014

Four short links: 24 February 2014

  1. Understanding Understanding Source Code with Functional Magnetic Resonance Imaging (PDF) — we observed 17 participants inside an fMRI scanner while they were comprehending short source-code snippets, which we contrasted with locating syntax errors. We found a clear, distinct activation pattern of five brain regions, which are related to working memory, attention, and language processing. I’m wary of fMRI studies but welcome more studies that try to identify what we do when we code. (Or, in this case, identify syntax errors—if they wanted to observe real programming, they’d watch subjects creating syntax errors) (via Slashdot)
  2. Oobleck Security (O’Reilly Radar) — if you missed or skimmed this, go back and reread it. The future will be defined by the objects that turn on us. 50s scifi was so close but instead of human-shaped positronic robots, it’ll be our cars, HVAC systems, light bulbs, and TVs. Reminds me of the excellent Old Paint by Megan Lindholm.
  3. Google Readying Android Watch — just as Samsung moves away from Android for smart watches and I buy me and my wife a Pebble watch each for our anniversary. Watches are in the same space as Goggles and other wearables: solutions hunting for a problem, a use case, a killer tap. “OK Google, show me offers from brands I love near me” isn’t it (and is a low-lying operating system function anyway, not a userland command).
  4. Most Winning A/B Test Results are Illusory (PDF) — Statisticians have known for almost a hundred years how to ensure that experimenters don’t get misled by their experiments [...] I’ll show how these methods ensure equally robust results when applied to A/B testing.

December 30 2013

Four short links: 30 December 2013

  1. tooldiag — a collection of methods for statistical pattern recognition. Implemented in C.
  2. Hacking MicroSD Cards (Bunnie Huang) — In my explorations of the electronics markets in China, I’ve seen shop keepers burning firmware on cards that “expand” the capacity of the card — in other words, they load a firmware that reports the capacity of a card is much larger than the actual available storage. The fact that this is possible at the point of sale means that most likely, the update mechanism is not secured. MicroSD cards come with embedded microcontrollers whose firmware can be exploited.
  3. 30c3 — recordings from the 30th Chaos Communication Congress.
  4. IOT Companies, Products, Devices, and Software by Sector (Mike Nicholls) — astonishing amount of work in the space, especially given this list is inevitably incomplete.

December 09 2013

Four short links: 9 December 2013

  1. Reform Government Surveillance — hard not to view this as a demarcation dispute. “Ruthlessly collecting every detail of online behaviour is something we do clandestinely for advertising purposes, it shouldn’t be corrupted because of your obsession over national security!”
  2. Brian Abelson — Data Scientist at the New York Times, blogging what he finds. He tackles questions like what makes a news app “successful” and how might we measure it. Found via this engaging interview at the quease-makingly named Content Strategist.
  3. StageXL — Flash-like 2D package for Dart.
  4. BayesDB — lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. Open source.

December 05 2013

Four short links: 5 December 2013

  1. Deducer — An R Graphical User Interface (GUI) for Everyone.
  2. Integration of Civil Unmanned Aircraft Systems (UAS) in the National Airspace System (NAS) Roadmap (PDF, FAA) — first pass at regulatory framework for drones. (via Anil Dash)
  3. Bitcoin Stats — $21MM traded, $15MM of electricity spent mining. Goodness. (via Steve Klabnik)
  4. iOS vs Android Numbers (Luke Wroblewski) — roundup comparing Android to iOS in recent commerce writeups. More Android handsets, but less revenue per download/impression/etc.

November 06 2013

Four short links: 6 November 2013

  1. Apple Transparency Report (PDF) — contains a warrant canary, the statement Apple has never received an order under Section 215 of the USA Patriot Act. We would expect to challenge an order if served on us which will of course be removed if one of the secret orders is received. Bravo, Apple, for implementing a clever hack to route around excessive secrecy. (via Boing Boing)
  2. You’re Probably Polluting Your Statistics More Than You Think — it is insanely easy to find phantom correlations in random data without obviously being foolish. Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this. (via Stijn Debrouwere)
  3. CyPhy Funded (Quartz) — the second act of iRobot co-founder Helen Greiner, maker of the famed Roomba robot vacuum cleaner. She terrified ETech long ago—the audience were expecting Roomba cuteness and got a keynote about military deathbots. It would appear she’s still in the deathbot niche, not so much with the cute. Remember this when you build your OpenCV-powered recoil-resistant load-bearing-hoverbot and think it’ll only ever be used for the intended purpose of launching fertiliser pellets into third world hemp farms.
  4. User-Agent String History — a light-hearted illustration of why the formal semantic value of free-text fields is driven to zero in the face of actual use.

October 30 2013

Four short links: 30 October 2013

  1. Offline.js — Javascript library so web app developers can gracefully deal with users going offline.
  2. Android Guides — lots of info on coding for Android.
  3. Statistics Done Wrong — learn from these failure modes. Not medians or means. Modes.
  4. Streaming, Sketching, and Sufficient Statistics (YouTube) — how to process huge data sets as they stream past your CPU (e.g., those produced by sensors). (via Ben Lorica)

October 18 2013

Four short links: 18 October 2013

  1. Science Not as Self-Correcting As It Thinks (Economist) — REALLY good discussion of the shortcomings in statistical practice by scientists, peer-review failures, and the complexities of experimental procedure and fuzziness of what reproducibility might actually mean.
  2. Reproducibility Initiative Receives Grant to Validate Landmark Cancer Studies — The key experimental findings from each cancer study will be replicated by experts from the Science Exchange network according to best practices for replication established by the Center for Open Science through the Center’s Open Science Framework, and the impact of the replications will be tracked on Mendeley’s research analytics platform. All of the ultimate publications and data will be freely available online, providing the first publicly available complete dataset of replicated biomedical research and representing a major advancement in the study of reproducibility of research.
  3. $20 SDR Police Scanner — using software-defined radio to listen to the police band.
  4. Reimagine the Chemistry Set — $50k prize in contest to design a “chemistry set” type kit that will engage kids as young as 8 and inspire people who are 88. We’re looking for ideas that encourage kids to explore, create, build and question. We’re looking for ideas that honor kids’ curiosity about how things work. Backed by the Moore Foundation and Society for Science and the Public.

August 08 2013

Four short links: 8 August 2013

  1. Reducing the Roots of Some Evil (Etsy) — Based on our first two months of data we have removed a number of unused CA certificates from some pilot systems to test the effects, and will run CAWatch for a full six months to build up a more comprehensive view of what CAs are in active use. Sign of how broken the CA system for SSL is. (via Alex Dong)
  2. Mind the Brain — PLOS podcast interviews Sci Foo alum and delicious neuroscience brain of awesome, Vaughan Bell. (via Fabiana Kubke)
  3. How Often are Ineffective Interventions Still Used in Practice? (PLOSone) — tl;dr: 8% of the time. Imagine the number if you asked how often ineffective software development practices are still used.
  4. Announcing Evan’s Awesome A/B Tools — I am calling these tools awesome because they are intuitive, visual, and easy-to-use. Unlike other online statistical calculators you’ve probably seen, they’ll help you understand what’s going on “under the hood” of common statistical tests, and by providing ample visual context, they make it easy for you to explain p-values and confidence intervals to your boss. (And they’re free!)

May 13 2013

Four short links: 13 May 2013

  1. Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4” he clarified “XE4, not XE3”, which I verified. He had thought this feature had been removed from the production units.
  2. Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
  3. Connbox — love the section “The importance of legible products” where the physical UI interacts seamlessly with the digital device … it’s glorious. Three amazing videos.
  4. The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLoSONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. [...] An implementation of ISMA in Java is freely available.

February 04 2013

Four short links: 4 February 2013

  1. Hands on Learning (HuffPo) — Unfortunately, engaged and enlightened tinkering is disappearing from contemporary American childhood. (via BoingBoing)
  2. FlashProxy (Stanford) — a miniature proxy that runs in a web browser. It checks for clients that need access, then conveys data between them and a Tor relay. [...] If your browser runs JavaScript and has support for WebSockets then while you are viewing this page your browser is a potential proxy available to help censored Internet users.
  3. Dark Patterns (Slideshare) — User interfaces to trick people. (via Beta Knowledge)
  4. Bill Gates is Naive: Data Are Not Objective (Math Babe) — examples at the end of biased models/data should be on the wall of everyone analyzing data. (via Karl Fisch)

January 04 2013

November 27 2012

Four short links: 27 November 2012

  1. Statistical Misdirection Master Class — examples from Fox News. The further through the list you go, the more horrifying^Wedifying they are. Some are clearly classics from the literature, but some are (as far as I can tell) newly developed graphical “persuasion” techniques.
  2. Wall of Awesome — give your coworkers some love.
  3. Dave Winer on Medium — Dave hits some interesting points: Users can create new buckets or collections and call them anything they want. A bucket is analogous to a blog post. Then other people can post to it. That’s like a comment. But it doesn’t look like a comment. It’s got a place for a big image at the top. It looks much prettier than a comment, and much bigger. Looks are important here.
  4. SIGGraph Asia Trailer (YouTube) — resuiting Sims and rotating city blocks, at the end, were my favourite. (via Andy Baio)

October 10 2012

Four short links: 10 October 2012

  1. An Intuitive Guide to Linear Algebra — Here’s the linear algebra introduction I wish I had. I wish I’d had it, too. (via Hacker News)
  2. Think Bayes — an introduction to Bayesian statistics using computational methods.
  3. The State of Javascript 2012 (Brendan Eich) — Javascript continues its march up and down the stack, simultaneously becoming an application language while becoming the bytecode for the world.
  4. Divshot — a startup turning mockups into web apps, built on top of the Bootstrap front-end framework. I feel momentum and a tipping point approaching, where building things on the web is about to get easier again (the way it did with Ruby on Rails). cf Jetstrap.

October 09 2012

Four short links: 9 October 2012

  1. Finland Crowdsourcing New Laws (GigaOm) — online referenda. The Finnish government enabled something called a “citizens’ initiative”, through which registered voters can come up with new laws – if they can get 50,000 of their fellow citizens to back them up within six months, then the Eduskunta (the Finnish parliament) is forced to vote on the proposal. Now this crowdsourced law-making system is about to go online through a platform called the Open Ministry. Petitions and online voting are notoriously prone to fraud, so it will be interesting to see how well the online identity system behind this holds up.
  2. WebPlatform — wiki of information about developing for the open web. Joint production of many of the $BIGCOs of the web and the W3C, so will be interesting to see, as it develops, whether it has the best aspects of each or the worst.
  3. Why Your Phone, Cable, Internet Bills Cost So Much (Yahoo) — “The companies essentially have a business model that is antithetical to economic growth,” he says. “Profits go up if they can provide slow Internet at super high prices.” Excellent piece!
  4. Probability and Statistics Cookbook (Matthias Vallentin) — The cookbook contains a succinct representation of various topics in probability theory and statistics. It provides a comprehensive reference reduced to the mathematical essence, rather than aiming for elaborate explanations. CC-BY-NC-SA licensed, LaTeX source on github.

September 06 2012

Digging into the UDID data

Over the weekend the hacker group Antisec released one million UDID records that they claim to have obtained from an FBI laptop using a Java vulnerability. In reply the FBI stated:

The FBI is aware of published reports alleging that an FBI laptop was compromised and private data regarding Apple UDIDs was exposed. At this time there is no evidence indicating that an FBI laptop was compromised or that the FBI either sought or obtained this data.

Of course that statement leaves a lot of leeway. It could be the agent’s personal laptop, and the data may well have been “property” of another agency. The wording doesn’t even explicitly rule out the possibility that this was an agency laptop; they just say that right now they don’t have any evidence to suggest that it was.

This limited data release doesn’t have much impact, but the possible release of the full dataset, which is claimed to include names, addresses, phone numbers and other identifying information, is far more worrying.

While there are some almost dismissing the issue out of hand, the real issues here are: Where did the data originate? Which devices did it come from and what kind of users does this data represent? Is this data from a cross-section of the population, or a specifically targeted demographic? Does it originate within the law enforcement community, or from an external developer? What was the purpose of the data, and why was it collected?

With conflicting stories from all sides, the only thing we can believe is the data itself. The 40-character strings in the release at least look like UDID numbers, and anecdotally at least we have a third-party confirmation that this really is valid UDID data. We therefore have to proceed at this point as if this is real data. While there is a possibility that some, most, or all of the data is falsified, that’s looking unlikely from where we’re standing at the moment.

With that as the backdrop, the first action I took was to check the released data for my own devices and those of family members. Of the nine iPhones, iPads and iPod Touch devices kicking around my house, none of the UDIDs are in the leaked database. Of course there isn’t anything to say that they aren’t amongst the other 11 million UDIDs that haven’t been released.
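A check like this is a set-membership test. Here is a minimal sketch, assuming the release has already been parsed into a list of UDID strings; the function name and the sample identifiers are made up for illustration and are not taken from the leaked file.

```python
def find_leaked(my_udids, leaked_udids):
    """Return the entries of my_udids that also appear in leaked_udids."""
    # UDIDs are hex strings, so compare them case-insensitively.
    leaked = {u.strip().lower() for u in leaked_udids}
    return sorted(u for u in my_udids if u.strip().lower() in leaked)


# Made-up 40-character strings standing in for real identifiers.
leaked_sample = ["a" * 40, "b" * 40, "C" * 40]
mine = ["c" * 40, "d" * 40]

print(find_leaked(mine, leaked_sample))
```

An empty result only says the devices are absent from the released subset, not from the records that were held back.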

With that done, I broke down the distribution of leaked UDID numbers by device type. Interestingly, considering the number of iPhones in circulation compared to the number of iPads, the bulk of the UDIDs were self-identified as originating on an iPad.

Distribution of UDID by device type
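The breakdown by device type amounts to counting one field. A minimal sketch follows, assuming each record has been parsed into a dict with a `device_type` key; that key name and the sample records are assumptions for illustration, not the layout of the released file.

```python
from collections import Counter


def device_type_share(records):
    """Map each device type to its fraction of all records, largest first."""
    counts = Counter(r["device_type"] for r in records)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.most_common()}


# Made-up records, only to show the shape of the input.
sample = [
    {"device_name": "Aaron's iPhone", "device_type": "iPhone"},
    {"device_name": "iPad", "device_type": "iPad"},
    {"device_name": "Julie's iPad", "device_type": "iPad"},
    {"device_name": "iPod touch", "device_type": "iPod touch"},
]

for kind, share in device_type_share(sample).items():
    print(f"{kind}: {share:.0%}")
```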

What does that mean? Here’s one theory: If the leak originated from a developer rather than directly from Apple, and assuming that this subset of data is a good cross-section of the total population, and assuming that the leaked data originated with a single application … then the app that harvested the data is likely a Universal application (one that runs on both the iPhone and the iPad) that is mostly used on the iPad rather than on the iPhone.

The very low numbers of iPod Touch users might be a demographic signal: either the application is not widely used by the younger users who are the target demographic for the iPod Touch, or perhaps the application is most useful when a cellular data connection is present.

The next thing to look at, as the only field with unconstrained text, was the Device Name data. That particular field contains a lot of first names, e.g. “Aaron’s iPhone,” so roughly speaking the distribution of first letters in this field should give a decent clue as to the geographical region of origin of the leaked list of UDIDs. This distribution is of course going to be different depending on the predominant language in the region.

Distribution of UDID by the first letter of the “Device Name” field
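A first-letter distribution of this kind can be sketched in a few lines. The version below is an assumption about how one might do it, not the script used for the chart: it folds case, ignores names that do not start with an ASCII letter, and uses made-up device names.

```python
from collections import Counter
import string


def first_letter_distribution(names):
    """Return the share of names starting with each letter a-z."""
    counts = Counter()
    for name in names:
        stripped = name.strip()
        # Skip empty names and names starting with digits, symbols or non-ASCII letters.
        if stripped and stripped[0].lower() in string.ascii_lowercase:
            counts[stripped[0].lower()] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


# Made-up device names for illustration.
names = ["Aaron's iPhone", "iPad", "iPhone", "Julie's iPad", " anna"]
dist = first_letter_distribution(names)
print({letter: share for letter, share in dist.items() if share > 0})
```

Folding case is a design choice: it treats “iPhone” and “Ian” alike, which is why the default device names swamp the letter “i.”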

The immediate stand out from this distribution is the predominance of device name strings starting with the letter “i.” This can be ascribed to people who don’t have their own name prepended to the Device Name string, and have named their device “iPhone,” “iPad” or “iPod Touch.”

The obvious next step was to compare this distribution with the relative frequency of first letters in words in the English language.

Comparing the distribution of UDID by first letter of the “Device Name” field against the relative frequencies of the first letters of a word in the English language

The spike for the letter “i” dominated the data, so the next step was to do some rough and ready data cleaning.

I dropped all the Device Name strings that started with the string “iP.” That cleaned out all those devices named “iPhone,” “iPad” and “iPod Touch.” Doing that brought the number of device names starting with an “i” down from 159,925 to just 13,337. That’s a bit more reasonable.
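That cleaning step is a simple prefix filter. The sketch below is illustrative: the function name and sample names are made up, and it shows one side effect of the filter, which also discards names that merely begin with the device model.

```python
def drop_default_names(names, prefix="iP"):
    """Drop every name that starts with the given prefix (case-sensitive)."""
    return [name for name in names if not name.startswith(prefix)]


# Made-up device names for illustration.
names = ["iPhone", "iPad", "Ian's iPad", "iPhone de Mark", "Irene's iPhone"]
kept = drop_default_names(names)
print(kept)
```

Note that “iPhone de Mark” is discarded along with the default names, so the filter removes some personalised names too. By the figures above, the filter took out 146,588 of the 159,925 names starting with “i.”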

Comparing the distribution of UDID by first letter of the “Device Name” field, ignoring all names that start with the string “iP,” against the relative frequencies of the first letters of a word in the English language

I had a slight over-abundance of “j,” although that might not be statistically significant. However, the stand out was that there was a serious under-abundance of strings starting with the letter “t,” which is interesting. Additionally, with my earlier data cleaning I also had a slight under-abundance of “i,” which suggested I may have been too enthusiastic about cleaning the data.
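One rough way to judge whether an over- or under-abundance is more than noise is a z-score for a single proportion, using the normal approximation to the binomial. This is a sketch of a standard textbook test, not the analysis behind the charts, and the counts and expected share below are invented for illustration.

```python
import math


def proportion_z_score(observed_count, total, expected_share):
    """z-score of an observed share against an expected share (normal approximation)."""
    observed_share = observed_count / total
    standard_error = math.sqrt(expected_share * (1 - expected_share) / total)
    return (observed_share - expected_share) / standard_error


# Invented numbers: 550 names out of 10,000 start with a letter expected 5% of the time.
z = proportion_z_score(observed_count=550, total=10_000, expected_share=0.05)
print(round(z, 2))
```

With hundreds of thousands of names, even small differences produce large z-scores, and checking all 26 letters at once raises the chance of a false alarm, so a score like this is a prompt for a closer look rather than a verdict.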

Looking at the relative frequency of letters in languages other than English it’s notable that amongst them Spanish has a much lower frequency of the use of “t.”

As the de facto second language of the United States, Spanish is the obvious next choice to investigate. If the devices are predominantly Spanish in origin then this could solve the problem introduced by our data cleaning. As Marcos Villacampa noted in a tweet, in Spanish you would say “iPhone de Mark” rather than “Mark’s iPhone.”

Comparing the distribution of UDID by first letter of the “Device Name” field, ignoring all names that start with the string “iP,” against the relative frequencies of the first letters of a word in the Spanish language

However, that distribution didn’t really fit either. While “t” was much better, I now had an under-abundance of words with an “e.” Although it should be noted that, unlike our English language relative frequencies, the data I was using for Spanish is for letters in the entire word, rather than letters that begin the word. That’s certainly going to introduce biases, perhaps fatal ones.

Not that I can really make the assumption that there is only one language present in the data, or even that one language predominates, unless that language is English.

At this stage it’s obvious that the data is, at least more or less, of the right order of magnitude. The data probably shows devices coming from a Western country. However, we’re a long way from the point where I’d come out and say something like “… the device names were predominantly in English.” That’s not a conclusion I can make.

I’d be interested in tracking down the relative frequency of letters used in Arabic when the language is transcribed into the Roman alphabet. While I haven’t been able to find that data, I’m sure it exists somewhere. (Please drop a note in the comments if you have a lead.)

The next step for the analysis is to look at the names themselves. While I’m still in the process of mashing up something that will access U.S. census data and try and reverse geo-locate a name to a “most likely” geographical origin, such services do already exist. And I haven’t really pushed the boundaries here, or even started a serious statistical analysis of the subset of data released by Antisec.

This brings us to Pete Warden’s point that you can’t really anonymize your data. The anonymization process for large datasets such as this is simply an illusion. As Pete wrote:

Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records.

While this release in itself is fairly harmless, a number of “harmless” releases taken together — or cleverly cross-referenced with other public sources such as Twitter, Google+, Facebook and other social media — might well be more damaging. And that’s ignoring the possibility that Antisec really might have names, addresses and telephone numbers to go side-by-side with these UDID records.

The question has to be asked then, where did this data originate? While 12 million records might seem a lot, compared to the number of devices sold it’s not actually that big a number. There are any number of iPhone applications with a 12-million-user installation base, and this sort of backend database could easily have been built up by an independent developer with a successful application who downloaded the device owner’s contact details before Apple started putting limitations on that.

Ignoring conspiracy theories, this dataset might be the result of a single developer. Although how it got into the FBI’s possession and the why of that, if it was ever there in the first place, is another matter entirely.

I’m going to go on hacking away at this data to see if there are any more interesting correlations, and I do wonder whether Antisec would consider a controlled release of the data to some trusted third party?

Much like the reaction to #locationgate, where some people were happy to volunteer their data, if enough users are willing to self-identify, then perhaps we can get to the bottom of where this data originated and why it was collected in the first place.

Thanks to Hilary Mason, Julie Steele, Irene Ros, Gemma Hobson and Marcos Villacampa for ideas, pointers to comparative data sources, and advice on visualisation of the data.



In response to a post about this article on Google+, Josh Hendrix made the suggestion that I should look at word as well as letter frequency. It was a good idea, so I went ahead and wrote a quick script to do just that…

The top two words in the list are “iPad,” which occurs 445,111 times, and “iPhone,” which occurs 252,106 times. The next most frequent word is “iPod,” but that occurs only 36,367 times. This result backs up my earlier result looking at distribution by device type.
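A word count along these lines might look like the sketch below. It is a guess at the shape of such a script, not the one actually used: it splits on whitespace, strips surrounding punctuation and a trailing possessive, and keeps case so that mis-capitalisations stay separate. The sample names are made up.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]“”‘’"


def tokenize(name):
    """Split a device name into words, dropping punctuation and possessives."""
    words = []
    for raw in name.split():
        word = raw.strip(PUNCTUATION)
        for suffix in ("'s", "’s"):
            if word.endswith(suffix):
                word = word[: -len(suffix)]
        if word:
            words.append(word)
    return words


def word_counts(names):
    """Count how often each word occurs across all device names."""
    counts = Counter()
    for name in names:
        counts.update(tokenize(name))
    return counts


# Made-up device names for illustration.
names = ["David's iPad", "iPad", "John’s iPhone", "david's iPad"]
counts = word_counts(names)
print(counts.most_common(3))
```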

Then there are various misspellings and mis-capitalisations of “iPhone,” “iPad,” and “iPod.”

The first real word that isn’t an Apple trademark is “Administrator,” which occurs 10,910 times. Next are “David” (5,822), “John” (5,447), and “Michael” (5,034). This is followed by “Chris” (3,744), “Mike” (3,744), “Mark” (3,66) and “Paul” (3,096).

Looking down the list of real names, as opposed to partial strings and tokens, the first female name doesn’t occur until we’re 30 places down the list — it’s “Lisa” (1,732) with the next most popular female name being “Sarah” (1,499), in 38th place.

The top 100 names occurring in the UDID list.

The word “Dad” occurs 1,074 times, with “Daddy” occurring 383 times. For comparison the word “Mum” occurs just 58 times, and “Mummy” just 33. “Mom” came in with 150 occurrences, and “mommy” with 30. The number of occurrences for “mum,” “mummy,” “mom,” and “mommy” combined is 271, which is still very small compared to the combined total of 1,457 for “dad” and “daddy.”
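The totals can be checked directly from the counts quoted above. In this short sketch only the variable names are new.

```python
# Occurrence counts as quoted in the text.
counts = {"Dad": 1074, "Daddy": 383, "Mum": 58, "Mummy": 33, "Mom": 150, "mommy": 30}

paternal = counts["Dad"] + counts["Daddy"]
maternal = counts["Mum"] + counts["Mummy"] + counts["Mom"] + counts["mommy"]
ratio = paternal / maternal

print(paternal, maternal, round(ratio, 1))
```

On these figures the paternal terms outnumber the maternal ones by a little more than five to one.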

[Updated: Greg Yardly wisely pointed out on Twitter that I was being a bit English-centric in only looking for the words "mum" and "mummy," which is why I expanded the scope to include "mom" and "mommy."]

There is a definite gender bias here, and I can think of at least a few explanations. The most likely is fairly simplistic: The application where the UDID numbers originated either appeals more to men, or is used more by them.

Alternatively, women may be less likely to include their name in the name of their device, perhaps because amongst other things this name is used to advertise the device on wireless networks?

Either way I think this definitively pins it down as a list of devices originating in an Anglo-centric geographic region.

Sometimes the simplest things work better. Instead of being fancy perhaps I should have done this in the first place. However, this, combined with my previous results, suggests that we’re looking at an English speaking, mostly male, demographic.

Correlating the top 20 or so names with the list of most popular baby names (by year) all the way from the mid-’60s up until the mid-’90s (so looking at the most popular names for people between the ages of say 16 and 50) might give a further clue as to the exact demographic involved.

Both Gemma Hobson and Julie Steele directed me toward the U.S. Social Security Administration’s Popular Baby Names By Decade list. A quick and dirty analysis suggests that the UDID data is dominated by names that were most popular in the ’70s and ’80s. This maps well to my previous suggestion that the lack of iPod Touch usage might suggest that the demographic was older.
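A quick and dirty comparison of this sort can be as simple as counting, for each decade, how many of the top device-list names also appear in that decade’s popular-names list. The sketch below assumes that approach; the function name is made up, and the per-decade lists are placeholders, not Social Security Administration data.

```python
def decade_overlap(device_names, popular_by_decade):
    """For each decade, count how many device-list names are in its popular list."""
    wanted = {name.lower() for name in device_names}
    return {
        decade: len(wanted & {name.lower() for name in names})
        for decade, names in popular_by_decade.items()
    }


# PLACEHOLDER lists for illustration only; substitute the real per-decade lists.
placeholder_lists = {
    "1970s": ["Michael", "David", "John"],
    "1990s": ["Michael", "Jacob"],
}
top_device_names = ["David", "John", "Michael", "Chris"]

overlap = decade_overlap(top_device_names, placeholder_lists)
print(max(overlap, key=overlap.get), overlap)
```

A plain overlap count ignores how popular each name was within a decade, so a year-by-year breakdown weighted by frequency would be the more careful follow-up.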

I’m going to do a year-by-year breakdown and some proper statistics later on, but we’re looking at an application that’s probably used by: English speaking males with an Anglo-American background in their 30s or 40s. It’s most used on the iPad, and although it also works on the iPhone, it’s used far less on that platform.

Thanks to Josh Hendrix, and again to Gemma Hobson and Julie Steele, for ideas and pointers to sources for this part of the analysis.


August 08 2012

Four short links: 8 August 2012

  1. Reconstructing Visual Experiences (PDF) — early visual areas represent the information in movies. To demonstrate the power of our approach, we also constructed a Bayesian decoder by combining estimated encoding models with a sampled natural movie prior. The decoder provides remarkable reconstructions of the viewed movies. These results demonstrate that dynamic brain activity measured under naturalistic conditions can be decoded using current fMRI technology.
  2. Earth Engine — satellite imagery and API for coding against it, to do things like detecting deforestation, classifying land cover, estimating forest biomass and carbon, and mapping the world’s roadless areas.
  3. Microlives — 30m of your life expectancy. Here are some things that would, on average, cost a 30-year-old man 1 microlife: Smoking 2 cigarettes; Drinking 7 units of alcohol (eg 2 pints of strong beer); Each day of being 5 Kg overweight. A chest X-ray will set a middle-aged person back around 2 microlives, while a whole body CT-scan would weigh in at around 180 microlives.
  4. Autistics Need Opportunities More Than Treatment — Laurent gave a powerful talk at Sci Foo: if the autistic brain is better at pattern matching, find jobs where that’s useful. Like, say, science. The autistic woman who was delivering mail became a research assistant in his lab, now has papers galore to her name for original research.

August 02 2012

Four short links: 2 August 2012

  1. Patton Oswalt’s Letters to Both Sides — You guys need to stop thinking like gatekeepers. You need to do it for the sake of your own survival. Because all of us comedians after watching Louis CK revolutionize sitcoms and comedy recordings and live tours. And listening to “WTF With Marc Maron” and “Comedy Bang! Bang!” and watching the growth of the UCB Theatre on two coasts and seeing careers being made on Twitter and Youtube. Our careers don’t hinge on somebody in a plush office deciding to aim a little luck in our direction. (via Jim Stogdill)
  2. Headliner — interesting Guardian experiment with headlines and presentation. As always, reading the BERG designers’ notes is just as interesting as the product itself. E.g., how they used computer vision to find faces and zoom in on them to make articles more attractive to browsing readers.
  3. Google Earth Glitches — where 3d maps and aerial imagery don’t match up. (via Beta Knowledge)
  4. Campbell’s Law — The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor. (via New York Times)

May 11 2012

Four short links: 11 May 2012

  1. Stanford Med School Contemplates Flipped Classroom -- the real challenge isn't sending kids home with videos to watch, it's using tools like OceanBrowser to keep on top of what they're doing. Few profs at universities have cared whether students learned or not.
  2. Inclusive Tech Companies Win The Talent War (Gina Trapani) -- she speaks the truth, and gently. The original CNN story flushed out an incredible number of vitriolic commenters apparently lacking the gene for irony.
  3. Buyers and Sellers Guide to Web Design and Development Firms (Lance Wiggs) -- great idea, particularly "how to be a good client". There are plenty of dodgy web shops, but more projects fail because of the clients than many would like to admit.
  4. What Does It Mean to Say That Something Causes 16% of Cancers? (Discover Magazine) -- hey, all you infographic jockeys with your aspirations to add Data Scientist to your business card: read this and realize how hard it is to make sense of a lot of numbers and then communicate that sense. Data Science isn't about Hadoop any more than Accounting is about columns. Both try to tell a story (the original meaning of your company's "accounts") and what counts is the informed, disciplined, honest effort of knowing that your story is honest.

May 04 2012

Four short links: 4 May 2012

  1. Common Statistical Fallacies (Flowing Data) -- once you know to look for them, you see them everywhere. Or is that confirmation bias?
  2. Project Hijack -- Hijacking power and bandwidth from the mobile phone's audio interface. Creating a cubic-inch peripheral sensor ecosystem for the mobile phone.
  3. Peak Plastic -- Deb Chachra points out that if we’re running out of oil, that also means that we’re running out of plastic. Compared to fuel and agriculture, plastic is small potatoes. Even though plastics are made on a massive industrial scale, they still account for less than 10% of the world’s oil consumption. So recycling plastic saves plastic and reduces its impact on the environment, but it certainly isn’t going to save us from the end of oil. Peak oil means peak plastic. And that means that much of the physical world around us will have to change. I hadn't pondered plastics in medicine before. (via BoingBoing)
  4. web.go (GitHub) -- web framework for the Go programming language.

January 05 2012

Understanding randomness is a double-edged sword

Leonard Mlodinow's "The Drunkard's Walk: How Randomness Rules our Lives" is a great book on an important subject. As data scientists know, random phenomena are everywhere, and humans don't understand them well. We're not wired to understand them well. This book is a huge help, and will be a relief to anyone who's heard people say "I don't believe in global warming because last winter we got a lot of snow," or some load of crap like that. The book is well written, there's a lot of storytelling, and the storytelling is fun and interesting. Along the way Mlodinow gives coherent explanations of Bayes' theorem, the Monty Hall problem (offering the simplest correct explanation I've ever seen), the origins of statistics and more. If you want an excellent non-mathematical introduction to probabilistic thinking, this is the book to get. (If you want the mathematics, this book studiously avoids equations. Get William Feller's "An Introduction to Probability Theory and Its Applications" for the deeper material.)

But there's always a but. But, but, but ...

I have two problems with "The Drunkard's Walk." They've been nagging me ever since I finished.

First, Mlodinow spends a lot of time debunking the notion of "hot streaks." He's right, and that's important: most hot streaks in sports and elsewhere can be adequately explained by randomness. Randomness is inherently streaky and clumpy; it's not just a smooth gray. In fact, if you get something that looks smooth and "random," it's almost certainly not random. So far, so good. But — when he moves from Roger Maris' record-breaking season to portfolio managers picking hot stocks, there's a fundamental asymmetry.
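As an aside on the claim that randomness is streaky and clumpy, here is a minimal sketch that flips a fair coin repeatedly and measures the longest run of identical outcomes. The function names, the 100-flip sequence length, the trial count, and the seed are all my own illustrative choices; nothing here comes from the book.

```python
import random


def longest_run(flips):
    """Return the length of the longest run of identical consecutive outcomes."""
    best = 0
    current = 0
    previous = None
    for flip in flips:
        # Extend the current run if the outcome repeats, otherwise start a new run.
        current = current + 1 if flip == previous else 1
        previous = flip
        best = max(best, current)
    return best


def simulate_longest_runs(n_flips=100, n_trials=2000, p_heads=0.5, seed=0):
    """Simulate n_trials coin-flip sequences and return each sequence's longest run.

    All defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        flips = [rng.random() < p_heads for _ in range(n_flips)]
        results.append(longest_run(flips))
    return results


if __name__ == "__main__":
    runs = simulate_longest_runs()
    share = sum(r >= 5 for r in runs) / len(runs)
    print(f"Share of 100-flip sequences with a streak of 5 or more: {share:.2f}")
    print(f"Average longest streak: {sum(runs) / len(runs):.1f}")
```

Passing a `p_heads` other than 0.5 turns this into the weighted coin; the streaks get longer on the favored side, but they are still produced by nothing more than chance.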

With Maris, the author starts with the long-term batting average. We're not just "flipping coins"; we're flipping a weighted coin, a coin that happens to land with the "home run" side facing up a lot more frequently than it would if I were in the batter's box. That's all well and good. If I faced a season's worth of professional baseball pitching, I daresay I wouldn't get a single hit, let alone any home runs. But — and this is important — he doesn't do the same for the stock pickers, book acquisition editors, or Hollywood movie execs that he talks about. For them, it's just flipping coins. And it's one thing to say that, if you just flipped coins for 10 years, you'd have a 75% chance of duplicating a great financial manager's performance over some five-year period. It's another thing to imply that the manager's performance is just a matter of luck, not skill. Yes, there is a lot of luck involved, but where's the notion of baseline performance, of long term success or failure, that was the starting point for analyzing Maris' hot year? Maris' hot year may have been a random phenomenon, but it was a random phenomenon in the context of five years hitting more than 20 home runs per season, during which his cumulative batting average was somewhere around .271. What's the stock picker's cumulative batting average? Who are the other financial analysts working at the same level? We never find out. And that's a big part of the story to omit.

Second, Mlodinow frequently forgets one of the most important aspects of the mathematical study of random processes. When we're talking probability and statistics, we're talking about interchangeable events. It's easy to forget this, but as Mlodinow himself points out, there are many, many ways to make important mistakes when you're talking about probability. The important thing about urns with black and white balls is that the balls are the same. (If you don't know about urns, take a probability course or read the book; they're baked into the history of probability theory.) If some of the balls were ovals and some were star-shaped, these probability experiments wouldn't work.

So, back again to the stock pickers, the acquisitions editors, and the Hollywood execs. We agree at some level that all at-bats in baseball are equivalent. This is, of course, an idealization, but it's one we're fairly comfortable with. But all stocks are not the same, all books are not the same, and all movies are not the same. They may be the same within a certain class (energy stocks, cheap romance novels, spy movies). A stock analyst who's good with financials may have nothing to say about manufacturing. But at the high end of the spectrum (literary novels, fine wines, art movies), everything is unique, precisely in a way that Harlequin romances aren't. Probability and statistics are still powerful tools, but you have to be very careful about how you apply them.

Since I'm in the publishing business, I'm particularly annoyed by the story of an editor who, in an experiment, was given a typewritten chapter of a V. S. Naipaul novel that had won a major award. She rejected it. I'm not a fan of Naipaul, so I'm sympathetic. But is that evidence of her editorial skill (or lack thereof), or of random processes? Since we're now in a world where every event is unique, we have to ask more questions: What publisher was she working for? Grove Press, which publishes top-drawer literary fiction with a tendency toward the avant garde (for whom Naipaul might have been too stodgy)? Or Bantam, which specializes in lightweight beach-side reading? In both cases, a rejection would have been perfectly appropriate. Probability aside, it's a cheap shot to say: "Because this book won a major award, we'd expect editors at a publishing company to accept it. If they don't, that's evidence that publishing is a random process."

Publishing (and movies, and wines, and maybe even stocks) is a different world, and the disagreements are precisely what is important. Modeling disagreement as random fluctuation isn't doing anyone a service. I may dislike Naipaul's fiction, but I hardly see that as a random result. We could ask about the conditional probability that an English major will dislike Naipaul, given that the English major plays piano, has a strong background in electrical engineering and mathematics, and likes Salman Rushdie, and use that to come up with some sort of number. But I'd have no idea what that number means. We're not picking black and white balls out of urns here; or if we are, the balls are of different shapes and sizes.

Am I just going back to the human tendency to build stories where there is nothing but randomness? Am I just refusing to deal with the stark realities of random phenomena that surround us everywhere? Perhaps. Then again, that's what makes us human. And in the many situations where probability and statistics aren't appropriate tools, such as picking books or movies, all we have to fall back on is our ability to make stories, our ability to make sense. Where "make" is precisely the most important word in that last sentence.

There's an important, but subtle, distinction to be made between events that can be modelled by random processes and events that are actually random processes. Mlodinow makes this same distinction in his discussion of Maris. At one point, he grants that a hot or cold streak could be the result of changes in exercise, eating habits, personal stress, or any number of non-random factors; but since we can't account for these, he rolls them up into randomness. So, essentially he's saying that a hot streak can be modelled as a random process, though it may have an underlying cause that isn't random at all. Say Maris signed a contract to appear on the front of Wheaties boxes, and decided that he might as well eat the stuff. And say that eating Wheaties actually did increase his slugging percentage significantly. If so, betting heavily on Maris during a hot streak might not be such a bad idea, since he's not just hitting well because he happens to be lucky. And if so, I would still bet heavily that Maris' record-breaking year could be modelled as a random process. After all, probability and statistics are very blunt instruments.

Rolling up potentially non-random factors that can't be measured into "randomness" is a common trick, and reasonably acceptable. You can't analyze what you don't know. But it's a trick that worries me. Let's take a situation that I think is similar, but with much more profound consequences. A decade or so ago, it was well-known that Tamoxifen was a useful drug against breast cancer, effective in roughly 80% of all cases. That's equivalent to saying that Tamoxifen has an .800 batting average. You could model Tamoxifen's success by flipping a coin that came up heads 80% of the time.

But more recent research has revealed that Tamoxifen's story isn't random; at least, not random in that way. It's successful almost 100% of the time on patients with certain genetics and almost 0% of the time on other patients. In other words, the randomness is in the stream of incoming patients, not the effects of the drug. That discovery has a huge practical effect on breast cancer treatment. You can do tests to figure out whether treating a patient with Tamoxifen is likely to be successful, or a waste of time. You can also look in a more focused way for treatments that will be effective on the remaining 20%. Even more important: It's my belief that the next generation of medicine will be "personalized." Rather than using drugs that have been successful in broad clinical trials involving thousands of patients, we'll be focusing on drugs that are tuned to an individual's genetic makeup. Is it possible that the drug that would be effective on the 20% of women who don't respond to Tamoxifen has already been discovered and discarded, because its success rate wasn't statistically significant? Is it possible that there's a drug that's 100% effective on only 5%? Or 1%? What methods will we use to evaluate the performance of these drugs?
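To make the Tamoxifen distinction concrete, here is a rough sketch comparing the two stories: a coin that comes up heads 80% of the time for every patient, versus a patient stream in which 80% respond and 20% don't. Treating "almost 100%" and "almost 0%" as exactly 1.0 and 0.0 is my simplification, and the function names, patient count, and seed are hypothetical.

```python
import random


def coin_model(n_patients, p_success=0.8, rng=None):
    """Every patient has the same chance of success (the weighted-coin story)."""
    if rng is None:
        rng = random.Random()
    return [rng.random() < p_success for _ in range(n_patients)]


def subgroup_model(n_patients, responder_share=0.8,
                   p_responder=1.0, p_nonresponder=0.0, rng=None):
    """Randomness lives in who walks in the door, not in the drug.

    Returns a list of (is_responder, outcome) pairs. The 1.0 and 0.0
    success rates are an idealization of "almost 100%" and "almost 0%".
    """
    if rng is None:
        rng = random.Random()
    patients = []
    for _ in range(n_patients):
        is_responder = rng.random() < responder_share
        p = p_responder if is_responder else p_nonresponder
        patients.append((is_responder, rng.random() < p))
    return patients


def success_rate(outcomes):
    """Fraction of successful outcomes in a non-empty list of booleans."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    rng = random.Random(0)
    coin = coin_model(10_000, rng=rng)
    mixed = subgroup_model(10_000, rng=rng)
    print(f"Coin model overall:     {success_rate(coin):.3f}")
    print(f"Subgroup model overall: {success_rate([o for _, o in mixed]):.3f}")
    print(f"  responders only:      {success_rate([o for r, o in mixed if r]):.3f}")
    print(f"  non-responders only:  {success_rate([o for r, o in mixed if not r]):.3f}")
```

In aggregate the two models are hard to tell apart; only when you split the outcomes by subgroup does the second one stop looking like a coin.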

Understanding randomness is a double-edged sword. Humans are built to create patterns, even when there's nothing going on but random phenomena. Granted, that's an extremely important story, and Mlodinow does an excellent job of telling it. At the same time, we are wired to create stories, and can't afford to let randomness stop us from doing so, particularly when a story that gives a richer understanding of the data is just beyond our grasp. Understanding what is random and what is not (or, more precisely stated, understanding what parts of any processes are really random) is the key. While humans are all too willing to grasp at the straws of a story when there's no story there (just go to any casino), we can also throw out the stories we haven't yet finished because we're convinced there's nothing there. And that's a tragedy.
