
December 09 2013

Four short links: 9 December 2013

  1. Reform Government Surveillance — hard not to view this as a demarcation dispute. “Ruthlessly collecting every detail of online behaviour is something we do clandestinely for advertising purposes, it shouldn’t be corrupted because of your obsession over national security!”
  2. Brian Abelson — Data Scientist at the New York Times, blogging what he finds. He tackles questions like what makes a news app “successful” and how might we measure it. Found via this engaging interview at the quease-makingly named Content Strategist.
  3. StageXL — Flash-like 2D package for Dart.
  4. BayesDB lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. Open source.

May 13 2013

Four short links: 13 May 2013

  1. Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4” he clarified “XE4, not XE3”, which I verified. He had thought this feature had been removed from the production units.
  2. Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
  3. Connbox — love the section “The importance of legible products” where the physical UI interacts seamlessly with the digital device … it’s glorious. Three amazing videos.
  4. The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLoSONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. [...] An implementation of ISMA in Java is freely available.

October 10 2012

Four short links: 10 October 2012

  1. An Intuitive Guide to Linear Algebra: Here’s the linear algebra introduction I wish I had. I wish I’d had it, too. (via Hacker News)
  2. Think Bayes: an introduction to Bayesian statistics using computational methods.
  3. The State of Javascript 2012 (Brendan Eich) — Javascript continues its march up and down the stack, simultaneously becoming an application language while becoming the bytecode for the world.
  4. Divshot — a startup turning mockups into web apps, built on top of the Bootstrap front-end framework. I feel momentum and a tipping point approaching, where building things on the web is about to get easier again (the way it did with Ruby on Rails). cf Jetstrap.

January 05 2012

Understanding randomness is a double-edged sword

Leonard Mlodinow's "The Drunkard's Walk: How Randomness Rules our Lives" is a great book on an important subject. As data scientists know, random phenomena are everywhere, and humans don't understand them well. We're not wired to understand them well. This book is a huge help, and will be a relief to anyone who's heard people say "I don't believe in global warming because last winter we got a lot of snow," or some load of crap like that. The book is well written, there's a lot of storytelling, and the storytelling is fun and interesting. Along the way Mlodinow gives coherent explanations of Bayes' theorem, the Monty Hall problem (offering the simplest correct explanation I've ever seen), the origins of statistics and more. If you want an excellent non-mathematical introduction to probabilistic thinking, this is the book to get. (If you want the mathematics, this book studiously avoids equations. Get William Feller's "An Introduction to Probability Theory and Its Applications" for the deeper material.)

But there's always a but. But, but, but ...

I have two problems with "The Drunkard's Walk." They've been nagging me ever since I finished.

First, Mlodinow spends a lot of time debunking the notion of "hot streaks." He's right, and that's important: most hot streaks in sports and elsewhere can be adequately explained by randomness. Randomness is inherently streaky and clumpy; it's not just a smooth gray. In fact, if you get something that looks smooth and "random," it's almost certainly not random. So far, so good. But — when he moves from Roger Maris' record-breaking season to portfolio managers picking hot stocks, there's a fundamental asymmetry.
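The claim that randomness is "streaky and clumpy" is easy to check with a small simulation. The sketch below is not from the book; the function names, the choice of 100 flips, and the cutoff of five in a row are all illustrative assumptions. It flips a fair coin 100 times, records the longest run of identical outcomes, and repeats that many times.

```python
import random

def longest_run(flips):
    """Return the length of the longest run of identical consecutive values."""
    if not flips:
        return 0
    best = current = 1
    for prev, cur in zip(flips, flips[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best

def simulate_longest_runs(n_flips=100, n_trials=2000, p_heads=0.5, seed=0):
    """Simulate many coin-flip sequences; return the longest streak in each."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    runs = []
    for _ in range(n_trials):
        flips = [rng.random() < p_heads for _ in range(n_flips)]
        runs.append(longest_run(flips))
    return runs

runs = simulate_longest_runs()
average_longest = sum(runs) / len(runs)
share_with_long_streak = sum(r >= 5 for r in runs) / len(runs)
print(f"average longest streak in 100 fair flips: {average_longest:.1f}")
print(f"share of sequences with a streak of 5 or more: {share_with_long_streak:.2f}")
```

Nearly every simulated sequence contains a streak of five or more, which is why a sequence with no long streaks at all should make you suspicious. Changing `p_heads` turns the fair coin into the weighted coin discussed next.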

With Maris, the author starts with the long-term batting average. We're not just "flipping coins"; we're flipping a weighted coin, a coin that happens to land with the "home run" side facing up a lot more frequently than it would if I were in the batter's box. That's all well and good. If I faced a season's worth of professional baseball pitching, I daresay I wouldn't get a single hit, let alone any home runs. But — and this is important — he doesn't do the same for the stock pickers, book acquisition editors, or Hollywood movie execs that he talks about. For them, it's just flipping coins. And it's one thing to say that, if you just flipped coins for 10 years, you'd have a 75% chance of duplicating a great financial manager's performance over some five-year period. It's another thing to imply that the manager's performance is just a matter of luck, not skill. Yes, there is a lot of luck involved, but where's the notion of baseline performance, of long term success or failure, that was the starting point for analyzing Maris' hot year? Maris' hot year may have been a random phenomenon, but it was a random phenomenon in the context of five years hitting more than 20 home runs per season, during which his cumulative batting average was somewhere around .271. What's the stock picker's cumulative batting average? Who are the other financial analysts working at the same level? We never find out. And that's a big part of the story to omit.

Second, Mlodinow frequently forgets one of the most important aspects of the mathematical study of random processes. When we're talking probability and statistics, we're talking about interchangeable events. It's easy to forget this, but as Mlodinow himself points out, there are many, many ways to make important mistakes when you're talking about probability. The important thing about urns with black and white balls is that the balls are the same. (If you don't know about urns, take a probability course or read the book; they're baked into the history of probability theory.) If some of the balls were ovals and some were star-shaped, these probability experiments wouldn't work.

So, back again to the stock pickers, the acquisitions editors, and the Hollywood execs. We agree at some level that all at-bats in baseball are equivalent. This is, of course, an idealization, but it's one we're fairly comfortable with. But all stocks are not the same, all books are not the same, and all movies are not the same. They may be the same within a certain class (energy stocks, cheap romance novels, spy movies). A stock analyst who's good with financials may have nothing to say about manufacturing. But at the high end of the spectrum (literary novels, fine wines, art movies), everything is unique, precisely in a way that Harlequin romances aren't. Probability and statistics are still powerful tools, but you have to be very careful about how you apply them.

Since I'm in the publishing business, I'm particularly annoyed by the story of an editor who, in an experiment, was given a typewritten chapter of a V. S. Naipaul novel that had won a major award. She rejected it. I'm not a fan of Naipaul, so I'm sympathetic. But is that evidence of her editorial skill (or lack thereof), or of random processes? Since we're now in a world where every event is unique, we have to ask more questions: What publisher was she working for? Grove Press, which publishes top-drawer literary fiction with a tendency toward the avant-garde (for whom Naipaul might have been too stodgy)? Or Bantam, which specializes in lightweight beach-side reading? In either case, a rejection would have been perfectly appropriate. Probability aside, it's a cheap shot to say: "Because this book won a major award, we'd expect editors at a publishing company to accept it. If they don't, that's evidence that publishing is a random process."

Publishing (and movies, and wines, and maybe even stocks) are a different world, and the disagreements are precisely what is important. Modeling disagreement as random fluctuation isn't doing anyone a service. I may dislike Naipaul's fiction, but I hardly see that as a random result. We could ask about the conditional probability that an English major will dislike Naipaul, given that the English major plays piano, has a strong background in electrical engineering and mathematics, and likes Salman Rushdie, and use that to come up with some sort of number. But I'd have no idea what that number means. We're not picking black and white balls out of urns here — or if we are, the balls are of different shapes and sizes.

Am I just going back to the human tendency to build stories where there is nothing but randomness? Am I just refusing to deal with the stark realities of random phenomena that surround us everywhere? Perhaps. Then again, that's what makes us human. And in the many situations where probability and statistics aren't appropriate tools, such as picking books or movies, all we have to fall back on is our ability to make stories, our ability to make sense. Where "make" is precisely the most important word in that last sentence.


Postscript

There's an important, but subtle, distinction to be made between events that can be modelled by random processes and events that are actually random processes. Mlodinow makes this same distinction in his discussion of Maris. At one point, he grants that a hot or cold streak could be the result of changes in exercise, eating habits, personal stress, or any number of non-random factors; but since we can't account for these, he rolls them up into randomness. So, essentially he's saying that a hot streak can be modelled as a random process, though it may have an underlying cause that isn't random at all. Say Maris signed a contract to appear on the front of Wheaties boxes, and decided that he might as well eat the stuff. And say that eating Wheaties actually did increase his slugging percentage significantly. If so, betting heavily on Maris during a hot streak might not be such a bad idea, since he's not just hitting well because he happens to be lucky. And if so, I would still bet heavily that Maris' record-breaking year could be modelled as a random process. After all, probability and statistics are very blunt instruments.

Rolling up potentially non-random factors that can't be measured into "randomness" is a common trick, and reasonably acceptable. You can't analyze what you don't know. But it's a trick that worries me. Let's take a situation that I think is similar, but with much more profound consequences. A decade or so ago, it was well-known that Tamoxifen was a useful drug against breast cancer, effective in roughly 80% of all cases. That's equivalent to saying that Tamoxifen has an .800 batting average. You could model Tamoxifen's success by flipping a coin that came up heads 80% of the time.

But more recent research has revealed that Tamoxifen's story isn't random; at least, not random in that way. It's successful almost 100% of the time on patients with certain genetics and almost 0% of the time on other patients. In other words, the randomness is in the stream of incoming patients, not the effects of the drug. That discovery has a huge practical effect on breast cancer treatment. You can do tests to figure out whether treating a patient with Tamoxifen is likely to be successful, or a waste of time. You can also look in a more focused way for treatments that will be effective on the remaining 20%. Even more important: It's my belief that the next generation of medicine will be "personalized." Rather than using drugs that have been successful in broad clinical trials involving thousands of patients, we'll be focusing on drugs that are tuned to an individual's genetic makeup. Is it possible that the drug that would be effective on the 20% of women who don't respond to Tamoxifen has already been discovered and discarded, because its success rate wasn't statistically significant? Is it possible that there's a drug that's 100% effective on only 5%? Or 1%? What methods will we use to evaluate the performance of these drugs?
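The Tamoxifen contrast can be made concrete with a toy simulation. This is a sketch under stated assumptions, not a model of the actual clinical data: it idealizes "almost 100%" and "almost 0%" to exactly 100% and 0%, assumes 80% of incoming patients carry the favorable genetics, and uses hypothetical function names. Both models produce the same overall success rate of about 80%; they only come apart once outcomes are split by genetics.

```python
import random

RESPONDER_SHARE = 0.8  # assumed share of patients with the favorable genetics

def coin_model(n_patients, rng):
    """Every patient responds with probability 0.8, regardless of genetics."""
    patients = []
    for _ in range(n_patients):
        favorable = rng.random() < RESPONDER_SHARE
        success = rng.random() < 0.8
        patients.append((favorable, success))
    return patients

def genetic_model(n_patients, rng):
    """Outcome is fixed by genetics; the only randomness is who walks in."""
    patients = []
    for _ in range(n_patients):
        favorable = rng.random() < RESPONDER_SHARE
        success = favorable  # idealized: 100% for one group, 0% for the other
        patients.append((favorable, success))
    return patients

def success_rate(patients, favorable=None):
    """Success rate overall, or within one genetic group if `favorable` is set."""
    outcomes = [s for f, s in patients if favorable is None or f == favorable]
    return sum(outcomes) / len(outcomes)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
coin = coin_model(20000, rng)
genetic = genetic_model(20000, rng)

for name, patients in (("coin model", coin), ("genetic model", genetic)):
    print(name)
    print(f"  overall:              {success_rate(patients):.2f}")
    print(f"  favorable genetics:   {success_rate(patients, True):.2f}")
    print(f"  unfavorable genetics: {success_rate(patients, False):.2f}")
```

A trial that reports only the overall rate cannot tell these two models apart, which is the worry raised above about drugs that work perfectly on a small subgroup.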

Understanding randomness is a double-edged sword. Humans are built to create patterns, even when there's nothing going on but random phenomena. Granted, that's an extremely important story, and Mlodinow does an excellent job of telling it. At the same time, we are wired to create stories, and can't afford to let randomness stop us from doing so, particularly when a story that gives a richer understanding of the data is just beyond our grasp. Understanding what is random and what is not (or, more precisely stated, understanding what parts of any processes are really random) is the key. While humans are all too willing to grasp at the straws of a story when there's no story there (just go to any casino), we can also throw out the stories we haven't yet finished because we're convinced there's nothing there. And that's a tragedy.
