Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

January 21 2014

Decision making under uncertainty

The 2014 Edge Annual Question (EAQ) is out. This year, the question posed to the contributors is: What scientific idea is ready for retirement?

As usual with the EAQ, it provokes thought and promotes discussion. I have only read through a fraction of the responses so far, but I think it is important to highlight a few Edge contributors who answered with a common, and in my opinion a very important and timely, theme. The responses that initially caught my attention came from Laurence Smith (UCLA), Gavin Schmidt (NASA), Guilio Boccaletti (The Nature Conservancy) and Danny Hillis (Applied Minds). If I were to have been asked this question, my contribution for idea retirement would likely align most closely with these four responses: Smith and Boccaletti  want to see same idea disappear — stationarity; Schmidt’s response focused on the abolition of simple answers; and Hillis wants to do away with cause-and-effect.

In the age of big-data, from a decision-making standpoint, all of these responses address the complex nature of interconnected scientific topics and the search for one-size-fits-all answers. The conclusions should all point toward what science is supposed to do in the first place: to generate knowledge. Of course with every experiment there is the objective of finding an answer to a specific question, but each experiment, if performed properly, should fundamentally serve to generate more questions, not answers. When a newly-minted PhD successfully defends his or her dissertation, for a brief moment in time, they are the world’s expert on their particular subject. However, the next day, that may not hold true. This is progress and should be embraced. If we can apply findings of a study to address a specific problem, great. But science should be a humbling endeavor — as each day we should realize how much we actually don’t know.

As more and more universities are cranking out graduates under the data-science rubric, I hope that part of the curriculum stresses that while machine learning and advanced algorithms can be used to uncover new, useful and novel patterns contained within large datasets, it should also be stressed that these same tools and techniques can be trained to determine when false positives might lead down dark data alleys. This is all part of a proper lens through which to view scientific risk management. Taking a complex adaptive systems approach to data analysis will better prepare decision makers to identify tipping points and non-stationarity, while providing a foundation to continuously challenge assumptions, and at the same time, to embrace the notion of complexity, shifting baselines, and ambiguity.

December 09 2013

Four short links: 9 December 2013

  1. Reform Government Surveillance — hard not to view this as a demarcation dispute. “Ruthlessly collecting every detail of online behaviour is something we do clandestinely for advertising purposes, it shouldn’t be corrupted because of your obsession over national security!”
  2. Brian Abelson — Data Scientist at the New York Times, blogging what he finds. He tackles questions like what makes a news app “successful” and how might we measure it. Found via this engaging interview at the quease-makingly named Content Strategist.
  3. StageXL — Flash-like 2D package for Dart.
  4. BayesDBlets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. Open source.

May 23 2013

Burning the silos

If I’ve seen any theme come up repeatedly over the past year, it’s getting product cycle times down. It’s not the sexiest or most interesting theme, but it’s everywhere: if it’s not on the front burner, it’s always simmering in the background.

Cutting product cycles to the bare minimum is one of the main themes of the Velocity Conference and the DevOps movement, where integration between developers and operations, along with practices like continuous deployment, allows web-native companies like Yahoo! to release upgrades to their web products many times a day. It’s no secret that many traditional enterprises are looking at this model, trying to determine what they can use or implement. Indeed, this is central to their long-term survival; companies as different from Facebook as GE and Ford are learning that they will need to become as agile and nimble as their web-native counterparts.

Integrating development and operations isn’t the only way to shorten product cycles. In his talk at Google IO, Braden Kowitz talked about shortening the design cycle: rather than build big, complete products that take a lot of time and money, start with something very simple and test it, then iterate quickly. This approach lets you generate and test lots of ideas, but be quick to throw away the ones that aren’t working. Rather than designing an Edsel, just to fail when the product is released, the shorter cycles that come from integrating product design with product development let you build iteratively, getting immediate feedback on what works and what doesn’t. To work like this, you need to break down the silos that separate engineers and designers; you need to integrate designers into the product team as early as possible, rather than at the last minute.

Data scientist DJ Patil makes a similar point in Building Data Science Teams. Data isn’t new, nor is data analysis. Our ability to collect data by the petabyte certainly is new, but it isn’t revolutionary in itself. What really characterizes data science is the way data is incorporated into the product development cycle. Data analysts are frustrated and ineffective when they live in silos, and all they do is design complex queries and generate reports. They need to be integrated into decision making and product design. A data group that’s isolated from the organization’s business isn’t going to have the understanding they need to create insights from data. Their role will be limited to confirming what some manager or other thinks, and that’s neither fulfilling or effective.

If you’ve ever optimized code, you’ve experience this phenomenon in a different way: neatly designed, modular code may be well-architected and easy to work on, but when you need to tune for performance, the biggest gains usually come by breaking down these carefully designed modules. The well-designed boundaries that characterize many modern software projects are the silos that need to be broken down for the sake of efficiency.

Organizational structure and culture are no different: boundaries make large organizations manageable. That’s ultimately why we start by developing an idea into a prototype, getting it working, then handing it off to a design team to make it look good. And when it looks good, we throw it over the wall to the operations group. The org chart looks nicer when all the designers report to a design manager, all the software developers report to a software manager, and the operations team is part of IT. But that kind of product development is inherently slow, inefficient, and prone to turf battles and misunderstandings between departments.

And whether you’re building products that weigh tons or products that weigh a few ounces, whether you call it Industrial Internet, Smarter Planet, Internet of Things, Sensor Networks, or some other name, the current wave that re-connects computing with the physical world is no different. You can’t separate that electrical engineering team that makes circuits from the mechanical engineering team that builds the moving parts, or from the software team that does the programming. All of these groups have to work together from the start.

DevOps, data science, and even software development are teaching us that nice organizations with well-defined boundaries may be easy to manage, but they don’t get the job done. Siloed development locks you into long and ineffective product cycles; they lead you to build products that are obsolete when they’re finished, or aren’t what the customer wants. The results are inflexible: they are hard to modify, extend, or fix, and it may take months, not days, to get from version 1.0 to 1.0.1.

Innovation is intensely collaborative. Whether we’re building products to help fix health care, or building jet engines, or just making better online web sites, the boundaries created by traditional management are just getting in the way. We can no longer afford that. It’s time to burn the silos.

May 22 2013

Four short links: 22 May 2013

  1. XBox One Kinect Controller (Guardian) — the new Kinect controller can detect gaze, heartbeat, and the buttons on your shirt.
  2. Surveillance and the Internet of Things (Bruce Schneier) — Lots has been written about the “Internet of Things” and how it will change society for the better. It’s true that it will make a lot of wonderful things possible, but the “Internet of Things” will also allow for an even greater amount of surveillance than there is today. The Internet of Things gives the governments and corporations that follow our every move something they don’t yet have: eyes and ears.
  3. Daniel Dennett’s Intuition Pumps (extract)How to compose a successful critical commentary: 1. Attempt to re-express your target’s position so clearly, vividly and fairly that your target says: “Thanks, I wish I’d thought of putting it that way.” 2. List any points of agreement (especially if they are not matters of general or widespread agreement). 3. Mention anything you have learned from your target.4. Only then are you permitted to say so much as a word of rebuttal or criticism.
  4. New Data Science Toolkit Out (Pete Warden) — with population data to let you compensate for population in your heatmaps. No more “gosh, EVERYTHING is more prevalent where there are lots of people!” meaningless charts.

May 10 2013

Four short links: 10 May 2013

  1. The Remixing Dilemma — summary of research on remixed projects, finding that (1) Projects with moderate amounts of code are remixed more often than either very simple or very complex projects. (2) Projects by more prominent creators are more generative. (3) Remixes are more likely to attract remixers than de novo projects.
  2. Scratch 2.0 — my favourite first programming language for kids and adults, now in the browser! Downloadable version for offline use coming soon. See the overview for what’s new.
  3. State Dept Takedown on 3D-Printed Gun (Forbes) — The government says it wants to review the files for compliance with arms export control laws known as the International Traffic in Arms Regulations, or ITAR. By uploading the weapons files to the Internet and allowing them to be downloaded abroad, the letter implies Wilson’s high-tech gun group may have violated those export controls.
  4. Data Science of the Facebook World (Stephen Wolfram) — More than a million people have now used our Wolfram|Alpha Personal Analytics for Facebook. And as part of our latest update, in addition to collecting some anonymized statistics, we launched a Data Donor program that allows people to contribute detailed data to us for research purposes. A few weeks ago we decided to start analyzing all this data… (via Phil Earnhardt)

May 07 2013

Steering the ship that is data science

Mike Loukides recently recapped a conversation we’d had about leading indicators for data science efforts in an organization. We also pondered where the role of data scientist is headed and realized we could treat software development as a prototype case.

It’s easy (if not eerie) to draw parallels between the Internet boom of the mid 1990s and the Big Data boom of the present day: in addition to the exuberance in the press and the new business models, a particular breed of technical skill became a competitive advantage and a household name. Back then, this was the software developer. Today, it’s the data scientist.

The time in the sun improved software development in some ways, but it also brought its share of problems. Some companies were short on the skill and discipline required to manage custom software projects, and they were equally ill-equipped to discern the true technical talent from the pretenders. That combination led to low-quality software projects that simply failed to deliver business value. (A number of these survive today as “repair-ware” that requires constant, expensive upkeep.)

How will the data science field avoid software development’s pitfalls? (As an aside, we shudder to think what would be the data science equivalent of “repair-ware.”) We started to explore some ideas but realized they were all rooted in education, business value, and openness:

Company leaders must educate themselves in order to understand how data analysis can improve their firm. That knowledge will guide them in building out a data science team and establishing its mission. Leaders otherwise risk trivializing the data scientist role or overindulging in analytics for the sake of analytics.

In turn, data scientists must understand how their work is meant to improve the business. That will serve as their compass when they explore new ideas so they can aim to deliver solid value. Without that guidance, it’s too easy to get stuck in rabbit-holes and yak-shaving.

Both parties must be vigilant of any needless barriers forming around the data science team, especially after the initial novelty fades. Open communication between the data science group and the rest of the business will ensure the former doesn’t land in a separate silo, marginalized out of the company mission.

These ideas may be a start, but are they enough? Probably not. What would you recommend to steer data science clear of pitfalls?

May 06 2013

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.

Let’s do some thought experiments–unfortunately, totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. And here we see what’s really happening: to remove one head of the double-headed arrow, we use “common sense” to choose between two stories: one that’s merely silly, and another that’s so ludicrous we never even think about it. Seems to work here (for a very limited value of “work”); but if I’ve learned one thing, it’s that good old common sense is frequently neither common nor sensible. For more realistic correlations, it certainly seems ironic that we’re doing all this data analysis just to end up relying on common sense.

Now let’s look at something equally hypothetical that isn’t silly. A drug is correlated with reduced risk of death due to heart failure. Good thing, right? Yes–but why? What if the drug has nothing to do with heart failure, but is really an anti-depressant that makes you feel better about yourself so you exercise more? If you’re in the “correlation is as good as causation” club, doesn’t make a difference: you win either way. Except that, if the key is really exercise, there might be much better ways to achieve the same result. Certainly much cheaper, since the drug industry will no doubt price the pills at $100 each. (Tangent: I once saw a truck drive up to an orthopedist’s office and deliver Vioxx samples with a street value probably in the millions…) It’s possible, given some really interesting work being done on the placebo effect, that a properly administered sugar pill will make the patient feel better and exercise, yielding the same result. (Though it’s possible that sugar pills only work as placebos if they’re expensive.) I think we’d like to know, rather than just saying that correlation is just as good as causation, if you have a lot of data.

Perhaps I haven’t gone far enough: with enough data, and enough dimensions to the data, it would be possible to detect the correlations between the drug, psychological state, exercise, and heart disease. But that’s not the point. First, if correlation really is as good as causation, why bother? Second, to analyze data, you have to collect it. And before you collect it, you have to decide what to collect. Data is socially constructed (I promise, this will be the subject of another post), and the data you don’t decide to collect doesn’t exist. Decisions about what data to collect are almost always driven by the stories we want to tell. You can have petabytes of data, but if it isn’t the right data, if it’s data that’s been biased by preconceived notions of what’s important, you’re going to be misled. Indeed, any researcher knows that huge data sets tend to create spurious correlations.

Causation has its own problems, not the least of which is that it’s impossible to prove. Unfortunately, that’s the way the world works. But thinking about cause and how events relate to each other helps us to be more critical about the correlations we discover. As humans we’re storytellers, and an important part of data work is building a story around the data. Mere correlations arising from a gigantic pool of data aren’t enough to satisfy us. But there are good stories and bad ones, and just as it’s possible to be careful in designing your experiments, it’s possible to be careful and ethical in the stories you tell with your data. Those stories may be the closest we get ever get to an understanding of cause; but we have to realize that they’re just stories, that they’re provisional, and that better evidence (which may just be correlations) may force us to retell our stories at any moment. Correlation is as good as causation is just an excuse for intellectual sloppiness; it’s an excuse to replace thought with an odd kind of “common sense,” and to shut down the discussion that leads to good stories and understanding.

April 30 2013

Leading Indicators

In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.

Our starting point was remote: Q told me about Tyler Brulé’s travel writing for Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on their club sandwiches, they’re probably cutting corners in other places.

This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. It’s the first thing to cut when the budget gets tight; and if a school has a good music program with excellent teachers, they’re probably not shortchanging the kids elsewhere.

How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:

  • Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
  • Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, with the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.
  • When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
  • What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
  • Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
  • What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
  • Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

Coming up with these questions was an interesting thought experiment; we don’t know whether it holds water, but we suspect it does. Any ideas and opinions?

April 22 2013

A different take on data skepticism

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …

There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?

So, which methods are better in this regard? In general, it’s those that explore model space in addition to model parameters. In the case of k-means, for example, this would mean learning the number k in addition to the cluster assignment for each data point. For k-nearest neighbors, we could learn the number of exemplars to use and also the distance metric that provides the best explanation for the data. This multi-level approach might sound advanced, and it is true that these implementations are more complex. But complexity of implementation needn’t correlate with “danger” (thanks in part to software engineering), and it’s certainly not a sufficient reason to dismiss more robust methods.

I find the database analogy useful here: developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems — i.e., the “professionals.” How? Well, decades of experience — and lots of trial and error — have yielded good abstractions in this area. As a result, we can meaningfully talk about the database “layer” in our overall “stack.” Of course, these abstractions are leaky, like all others, and there are plenty of sharp edges remaining (and, some might argue, more being created every day with the explosion of NoSQL solutions). Nevertheless, my weekend-project webapp can store and query insane amounts of data — and I have no idea how to implement a B-tree.

For ML to have a similarly broad impact, I think the tools need to follow a similar path. We need to push ourselves away from the viewpoint that sees ML methods as a bag of tricks, with the right method chosen on a per-problem basis, success requiring a good deal of art, and evaluation mainly by artificial measures of accuracy at the expense of other considerations. Trustworthiness, robustness, and conservatism are just as important, and will have far more influence on the long-run impact of ML.

Will well-intentioned people still be able to lie to themselves? Sure, of course! Let alone the greedy or malicious actors that Cathy and Mike are also concerned about. But our tools should make the common cases easy and safe, and that’s not the reality today.

Related:

April 11 2013

Data skepticism

A couple of months ago, I wrote that “big data” is heading toward the trough of a hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.

Skepticism about data is normal, and it’s a good thing. If I had to give a one line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.

January 31 2013

Hacking robotic arms, predicting flight arrival times, manufacturing in America, tracking Disney customers (industrial Internet links)

Flight Quest (GE, powered by Kaggle) — Last November GE, Alaska Airlines, and Kaggle announced the Flight Quest competition, which invites data scientists to build models that can accurately predict when a commercial airline flight touches down and reaches its gate. Since the leaderboard for the competition was activated on December 18, 2012, entrants have already beaten the benchmark prediction accuracy by more than 40%, and there are still two weeks before final submissions are due.

Robot Army (NYC Resistor) — A pair of robotic arms, stripped from their previous application with wire cutters, makes its way across the Manhattan Bridge on a bicycle and into the capable hands of NYC Resistor, a hardware-hacker collective in Brooklyn. There, Trammell Hudson installed new microcontrollers and brought them back into working condition.

The Next Wave of Manufacturing (MIT Technology Review) — This month’s TR special feature is on manufacturing, with special mention of the industrial Internet and its application in factories, as well as a worthwhile interview with the head of the Reshoring Initiative.

At Disney Parks, a Bracelet Meant to Build Loyalty (and Sales) (The New York Times) — A little outside the immediate industrial Internet area, but relevant nevertheless to the practice of measuring every component of an enormous system to look for things that can be improved. In this case, those components are Disney theme park visitors, who will soon use RFID wristbands to pay for concessions, open hotel doors, and get into short lines for amusement rides. Disney will use the resulting data to model consumer behavior in its parks.


This is a post in our industrial Internet series, an ongoing exploration of big machines and big data. The series is produced as part of a collaboration between O’Reilly and GE.

January 18 2013

Need speed for big data? Think in-memory data management

By Ben Lorica and Roger Magoulas

In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.

An example we frequently hear about is the demand for tools that support interactive query performance. Faster query response times translate to more engaged and productive analysts, and real-time reports. Over the past two years several in-memory solutions emerged to deliver 5X-100X faster response times. A recent paper from Microsoft Research noted that even in this era of big data and Hadoop, many MapReduce jobs fit in the memory of a single server. To scale to extremely large datasets several new systems use a combination of distributed computing (in-memory grids), compression, and (columnar) storage technologies.

Another interesting aspect of in-memory technologies is that they seem to be everywhere these days. We’re looking at tools aimed at analysts (Tableau, Qlikview, Tibco Spotfire, Platfora), databases that target specific workloads or data types (VoltDB, SAP HANA, Hekaton, Redis, Druid, and Yarcdata), frameworks for analytics (Spark/Shark, GraphLab, GridGain, Asterix/Hyracks), and the data center (RAMCloud, memory Iocality).

We’ll be talking to companies and hackers to get a sense of how in-memory solutions fit into their planning. Along these lines, we would love to hear what you think about the rise of these technologies, as well as applications, companies and projects we should look at. Feel free to reach out to us on Twitter (Ben is @BigData and Roger is @rogerm) or leave a comment on this post.

January 01 2013

Four short links: 1 January 2013

  1. Robots Will Take Our Jobs (Wired) — I agree with Kevin Kelly that (in my words) software and hardware are eating wetware, but disagree that This is not a race against the machines. If we race against them, we lose. This is a race with the machines. You’ll be paid in the future based on how well you work with robots. Ninety percent of your coworkers will be unseen machines. Most of what you do will not be possible without them. And there will be a blurry line between what you do and what they do. You might no longer think of it as a job, at least at first, because anything that seems like drudgery will be done by robots. Civilizations which depend on specialization reward work and penalize idleness. We already have more people than work for them, and if we’re not to be creating a vast disconnected former workforce then we (society) need to get a hell of a lot better at creating jobs and not destroying them.
  2. Why Workers are Losing the War Against Machines (The Atlantic) — There is no economic law that says that everyone, or even most people, automatically benefit from technological progress.
  3. Early Quora Design Notes — I love reading post-mortems and learning from what other people did. Picking a starting point is important because it will be the axis the rest of the design revolves around — but it’s tricky and not always the first page in the flow. Ideally, you should start with the page that serves the most significant goals of the product.
  4. Free Data Science BooksI don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.” (via Stein Debrouwere)

November 29 2012

New data competition tackles airline delays

Jeff Immelt and a GE jet engineJeff Immelt and a GE jet engine

Jeff Immelt speaking next to a GEnx jet engine at Minds + Machines: Unleashing the Industrial Internet.

The scenario is familiar: a flight leaves the gate in New York on time, sits in a runway queue for 45 minutes, gets a fortuitous reroute over Illinois, and makes it to San Francisco ahead of schedule — only to wait on the terminal apron, engines running, for 15 minutes while a gate and crew materialize. The uncertainty irritates passengers and is costly for the airline, which burns extra fuel, pays extra wages, and has to rebook passengers and crew at the last minute.

A new competition run by Kaggle and sponsored by GE and Alaska Airlines offers $500,000 to data scientists — professional or enthusiast — who can accurately predict when a flight will land and arrive at the gate given a slew of data on weather, flight plans, air-traffic control and past flight performance.

Called GE Flight Quest, it’s tied to the industrial Internet — the idea that networked machines and high-level software above them will drive the next generation of efficiency improvements in complicated systems like airlines, power grids and freight carriers.

Predicting when a plane will arrive is trickier than it sounds because it’s subject to lots of independent, real-time influences. Knowing about the runway queues, reroutings and arrival restrictions in advance would make it possible to figure out exactly when a flight will arrive before it takes off, but the factors that delay most flights — weather, congestion and maintenance — shift constantly and interact in complex ways.

The industrial Internet turns complicated machinery into a platform on which intelligent software can be built. Airlines and air-traffic controllers have gathered vast structured datasets that can be thrown open to any member of the public with a little data intuition. The challenge of flight prediction — handled within the airline industry by highly-specialized systems — becomes approachable as a generic prediction problem.

A companion competition, called Hospital Quest, invites people to propose apps to improve patient experience in hospitals.


This is a post in our industrial Internet series, an ongoing exploration of big machines and big data. The series is produced as part of a collaboration between O’Reilly and GE.

August 14 2012

Solving the Wanamaker problem for health care

By Tim O’Reilly, Julie Steele, Mike Loukides and Colin Hill

“The best minds of my generation are thinking about how to make people click ads.” — Jeff Hammerbacher, early Facebook employee

“Work on stuff that matters.” — Tim O’Reilly

Doctors in operating room with data

In the early days of the 20th century, department store magnate John Wanamaker famously said, “I know that half of my advertising doesn’t work. The problem is that I don’t know which half.”

The consumer Internet revolution was fueled by a search for the answer to Wanamaker’s question. Google AdWords and the pay-per-click model transformed a business in which advertisers paid for ad impressions into one in which they pay for results. “Cost per thousand impressions” (CPM) was replaced by “cost per click” (CPC), and a new industry was born. It’s important to understand why CPC replaced CPM, though. Superficially, it’s because Google was able to track when a user clicked on a link, and was therefore able to bill based on success. But billing based on success doesn’t fundamentally change anything unless you can also change the success rate, and that’s what Google was able to do. By using data to understand each user’s behavior, Google was able to place advertisements that an individual was likely to click. They knew “which half” of their advertising was more likely to be effective, and didn’t bother with the rest.

Since then, data and predictive analytics have driven ever deeper insight into user behavior such that companies like Google, Facebook, Twitter, Zynga, and LinkedIn are fundamentally data companies. And data isn’t just transforming the consumer Internet. It is transforming finance, design, and manufacturing — and perhaps most importantly, health care.

How is data science transforming health care? There are many ways in which health care is changing, and needs to change. We’re focusing on one particular issue: the problem Wanamaker described when talking about his advertising. How do you make sure you’re spending money effectively? Is it possible to know what will work in advance?

Too often, when doctors order a treatment, whether it’s surgery or an over-the-counter medication, they are applying a “standard of care” treatment or some variation that is based on their own intuition, effectively hoping for the best. The sad truth of medicine is that we don’t really understand the relationship between treatments and outcomes. We have studies to show that various treatments will work more often than placebos; but, like Wanamaker, we know that much of our medicine doesn’t work for half or our patients, we just don’t know which half. At least, not in advance. One of data science’s many promises is that, if we can collect data about medical treatments and use that data effectively, we’ll be able to predict more accurately which treatments will be effective for which patient, and which treatments won’t.

A better understanding of the relationship between treatments, outcomes, and patients will have a huge impact on the practice of medicine in the United States. Health care is expensive. The U.S. spends over $2.6 trillion on health care every year, an amount that constitutes a serious fiscal burden for government, businesses, and our society as a whole. These costs include over $600 billion of unexplained variations in treatments: treatments that cause no differences in outcomes, or even make the patient’s condition worse. We have reached a point at which our need to understand treatment effectiveness has become vital — to the health care system and to the health and sustainability of the economy overall.

Why do we believe that data science has the potential to revolutionize health care? After all, the medical industry has had data for generations: clinical studies, insurance data, hospital records. But the health care industry is now awash in data in a way that it has never been before: from biological data such as gene expression, next-generation DNA sequence data, proteomics, and metabolomics, to clinical data and health outcomes data contained in ever more prevalent electronic health records (EHRs) and longitudinal drug and medical claims. We have entered a new era in which we can work on massive datasets effectively, combining data from clinical trials and direct observation by practicing physicians (the records generated by our $2.6 trillion of medical expense). When we combine data with the resources needed to work on the data, we can start asking the important questions, the Wanamaker questions, about what treatments work and for whom.

The opportunities are huge: for entrepreneurs and data scientists looking to put their skills to work disrupting a large market, for researchers trying to make sense out of the flood of data they are now generating, and for existing companies (including health insurance companies, biotech, pharmaceutical, and medical device companies, hospitals and other care providers) that are looking to remake their businesses for the coming world of outcome-based payment models.

Making health care more effective

Downloadable Editions

This report will soon be available in PDF, EPUB and Mobi formats. Submit your email to be alerted when the downloadable editions are ready.

What, specifically, does data allow us to do that we couldn’t do before? For the past 60 or so years of medical history, we’ve treated patients as some sort of an average. A doctor would diagnose a condition and recommend a treatment based on what worked for most people, as reflected in large clinical studies. Over the years, we’ve become more sophisticated about what that average patient means, but that same statistical approach didn’t allow for differences between patients. A treatment was deemed effective or ineffective, safe or unsafe, based on double-blind studies that rarely took into account the differences between patients. With the data that’s now available, we can go much further. The exceptions to this are relatively recent and have been dominated by cancer treatments, the first being Herceptin for breast cancer in women who over-express the Her2 receptor. With the data that’s now available, we can go much further for a broad range of diseases and interventions that are not just drugs but include surgery, disease management programs, medical devices, patient adherence, and care delivery.

For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more: we know that it’s 100% effective in 70 to 80% of the patients, and ineffective in the rest. That’s not word games, because we can now use genetic markers to tell whether it’s likely to be effective or ineffective for any given patient, and we can tell in advance whether to treat with Tamoxifen or to try something else.

Two factors lie behind this new approach to medicine: a different way of using data, and the availability of new kinds of data. It’s not just stating that the drug is effective on most patients, based on trials (indeed, 80% is an enviable success rate); it’s using artificial intelligence techniques to divide the patients into groups and then determine the difference between those groups. We’re not asking whether the drug is effective; we’re asking a fundamentally different question: “for which patients is this drug effective?” We’re asking about the patients, not just the treatments. A drug that’s only effective on 1% of patients might be very valuable if we can tell who that 1% is, though it would certainly be rejected by any traditional clinical trial.

More than that, asking questions about patients is only possible because we’re using data that wasn’t available until recently: DNA sequencing was only invented in the mid-1970s, and is only now coming into its own as a medical tool. What we’ve seen with Tamoxifen is as clear a solution to the Wanamaker problem as you could ask for: we now know when that treatment will be effective. If you can do the same thing with millions of cancer patients, you will both improve outcomes and save money.

Dr. Lukas Wartman, a cancer researcher who was himself diagnosed with terminal leukemia, was successfully treated with sunitinib, a drug that was only approved for kidney cancer. Sequencing the genes of both the patient’s healthy cells and cancerous cells led to the discovery of a protein that was out of control and encouraging the spread of the cancer. The gene responsible for manufacturing this protein could potentially be inhibited by the kidney drug, although it had never been tested for this application. This unorthodox treatment was surprisingly effective: Wartman is now in remission.

While this treatment was exotic and expensive, what’s important isn’t the expense but the potential for new kinds of diagnosis. The price of gene sequencing has been plummeting; it will be a common doctor’s office procedure in a few years. And through Amazon and Google, you can now “rent” a cloud-based supercomputing cluster that can solve huge analytic problems for a few hundred dollars per hour. What is now exotic inevitably becomes routine.

But even more important: we’re looking at a completely different approach to treatment. Rather than a treatment that works 80% of the time, or even 100% of the time for 80% of the patients, a treatment might be effective for a small group. It might be entirely specific to the individual; the next cancer patient may have a different protein that’s out of control, an entirely different genetic cause for the disease. Treatments that are specific to one patient don’t exist in medicine as it’s currently practiced; how could you ever do an FDA trial for a medication that’s only going to be used once to treat a certain kind of cancer?

Foundation Medicine is at the forefront of this new era in cancer treatment. They use next-generation DNA sequencing to discover DNA sequence mutations and deletions that are currently used in standard of care treatments, as well as many other actionable mutations that are tied to drugs for other types of cancer. They are creating a patient-outcomes repository that will be the fuel for discovering the relation between mutations and drugs. Foundation has identified DNA mutations in 50% of cancer cases for which drugs exist (information via a private communication), but are not currently used in the standard of care for the patient’s particular cancer.

The ability to do large-scale computing on genetic data gives us the ability to understand the origins of disease. If we can understand why an anti-cancer drug is effective (what specific proteins it affects), and if we can understand what genetic factors are causing the cancer to spread, then we’re able to use the tools at our disposal much more effectively. Rather than using imprecise treatments organized around symptoms, we’ll be able to target the actual causes of disease, and design treatments tuned to the biology of the specific patient. Eventually, we’ll be able to treat 100% of the patients 100% of the time, precisely because we realize that each patient presents a unique problem.

Personalized treatment is just one area in which we can solve the Wanamaker problem with data. Hospital admissions are extremely expensive. Data can make hospital systems more efficient, and to avoid preventable complications such as blood clots and hospital re-admissions. It can also help address the challenge of hot-spotting (a term coined by Atul Gawande): finding people who use an inordinate amount of health care resources. By looking at data from hospital visits, Dr. Jeffrey Brenner of Camden, NJ, was able to determine that “just one per cent of the hundred thousand people who made use of Camden’s medical facilities accounted for thirty per cent of its costs.” Furthermore, many of these people came from two apartment buildings. Designing more effective medical care for these patients was difficult; it doesn’t fit our health insurance system, the patients are often dealing with many serious medical issues (addiction and obesity are frequent complications), and have trouble trusting doctors and social workers. It’s counter-intuitive, but spending more on some patients now results in spending less on them when they become really sick. While it’s a work in progress, it looks like building appropriate systems to target these high-risk patients and treat them before they’re hospitalized will bring significant savings.

Many poor health outcomes are attributable to patients who don’t take their medications. Eliza, a Boston-based company started by Alexandra Drane, has pioneered approaches to improve compliance through interactive communication with patients. Eliza improves patient drug compliance by tracking which types of reminders work on which types of people; it’s similar to the way companies like Google target advertisements to individual consumers. By using data to analyze each patient’s behavior, Eliza can generate reminders that are more likely to be effective. The results aren’t surprising: if patients take their medicine as prescribed, they are more likely to get better. And if they get better, they are less likely to require further, more expensive treatment. Again, we’re using data to solve Wanamaker’s problem in medicine: we’re spending our resources on what’s effective, on appropriate reminders that are mostly to get patients to take their medications.

More data, more sources

The examples we’ve looked at so far have been limited to traditional sources of medical data: hospitals, research centers, doctor’s offices, insurers. The Internet has enabled the formation of patient networks aimed at sharing data. Health social networks now are some of the largest patient communities. As of November 2011, PatientsLikeMe has over 120,000 patients in 500 different condition groups; ACOR has over 100,000 patients in 127 cancer support groups; 23andMe has over 100,000 members in their genomic database; and diabetes health social network SugarStats has over 10,000 members. These are just the larger communities, thousands of small communities are created around rare diseases, or even uncommon experiences with common diseases. All of these communities are generating data that they voluntarily share with each other and the world.

Increasingly, what they share is not just anecdotal, but includes an array of clinical data. For this reason, these groups are being recruited for large-scale crowdsourced clinical outcomes research.

Thanks to ubiquitous data networking through the mobile network, we can take several steps further. In the past two or three years, there’s been a flood of personal fitness devices (such as the Fitbit) for monitoring your personal activity. There are mobile apps for taking your pulse, and an iPhone attachment for measuring your glucose. There has been talk of mobile applications that would constantly listen to a patient’s speech and detect changes that might be the precursor for a stroke, or would use the accelerometer to report falls. Tanzeem Choudhury has developed an app called Be Well that is intended primarily for victims of depression, though it can be used by anyone. Be Well monitors the user’s sleep cycles, the amount of time they spend talking, and the amount of time they spend walking. The data is scored, and the app makes appropriate recommendations, based both on the individual patient and data collected across all the app’s users.

Continuous monitoring of critical patients in hospitals has been normal for years; but we now have the tools to monitor patients constantly, in their home, at work, wherever they happen to be. And if this sounds like big brother, at this point most of the patients are willing. We don’t want to transform our lives into hospital experiences; far from it! But we can collect and use the data we constantly emit, our “data smog,” to maintain our health, to become conscious of our behavior, and to detect oncoming conditions before they become serious. The most effective medical care is the medical care you avoid because you don’t need it.

Paying for results

Once we’re on the road toward more effective health care, we can look at other ways in which Wanamaker’s problem shows up in the medical industry. It’s clear that we don’t want to pay for treatments that are ineffective. Wanamaker wanted to know which part of his advertising was effective, not just to make better ads, but also so that he wouldn’t have to buy the advertisements that wouldn’t work. He wanted to pay for results, not for ad placements. Now that we’re starting to understand how to make treatment effective, now that we understand that it’s more than rolling the dice and hoping that a treatment that works for a typical patient will be effective for you, we can take the next step: Can we change the underlying incentives in the medical system? Can we make the system better by paying for results, rather than paying for procedures?

It’s shocking just how badly the incentives in our current medical system are aligned with outcomes. If you see an orthopedist, you’re likely to get an MRI, most likely at a facility owned by the orthopedist’s practice. On one hand, it’s good medicine to know what you’re doing before you operate. But how often does that MRI result in a different treatment? How often is the MRI required just because it’s part of the protocol, when it’s perfectly obvious what the doctor needs to do? Many men have had PSA tests for prostate cancer; but in most cases, aggressive treatment of prostate cancer is a bigger risk than the disease itself. Yet the test itself is a significant profit center. Think again about Tamoxifen, and about the pharmaceutical company that makes it. In our current system, what does “100% effective in 80% of the patients” mean, except for a 20% loss in sales? That’s because the drug company is paid for the treatment, not for the result; it has no financial interest in whether any individual patient gets better. (Whether a statistically significant number of patients has side-effects is a different issue.) And at the same time, bringing a new drug to market is very expensive, and might not be worthwhile if it will only be used on the remaining 20% of the patients. And that’s assuming that one drug, not two, or 20, or 200 will be required to treat the unlucky 20% effectively.

It doesn’t have to be this way.

In the U.K., Johnson & Johnson, faced with the possibility of losing reimbursements for their multiple myeloma drug Velcade, agreed to refund the money for patients who did not respond to the drug. Several other pay-for-performance drug deals have followed since, paving the way for the ultimate transition in pharmaceutical company business models in which their product is health outcomes instead of pills. Such a transition would rely more heavily on real-world outcome data (are patients actually getting better?), rather than controlled clinical trials, and would use molecular diagnostics to create personalized “treatment algorithms.” Pharmaceutical companies would also focus more on drug compliance to ensure health outcomes were being achieved. This would ultimately align the interests of drug makers with patients, their providers, and payors.

Similarly, rather than paying for treatments and procedures, can we pay hospitals and doctors for results? That’s what Accountable Care Organizations (ACOs) are about. ACOs are a leap forward in business model design, where the provider shoulders any financial risk. ACOs represent a new framing of the much maligned HMO approaches from the ’90s, which did not work. HMOs tried to use statistics to predict and prevent unneeded care. The ACO model, rather than controlling doctors with what the data says they “should” do, uses data to measure how each doctor performs. Doctors are paid for successes, not for the procedures they administer. The main advantage that the ACO model has over the HMO model is how good the data is, and how that data is leveraged. The ACO model aligns incentives with outcomes: a practice that owns an MRI facility isn’t incentivized to order MRIs when they’re not necessary. It is incentivized to use all the data at its disposal to determine the most effective treatment for the patient, and to follow through on that treatment with a minimum of unnecessary testing.

When we know which procedures are likely to be successful, we’ll be in a position where we can pay only for the health care that works. When we can do that, we’ve solved Wanamaker’s problem for health care.

Enabling data

Data science is not optional in health care reform; it is the linchpin of the whole process. All of the examples we’ve seen, ranging from cancer treatment to detecting hot spots where additional intervention will make hospital admission unnecessary, depend on using data effectively: taking advantage of new data sources and new analytics techniques, in addition to the data the medical profession has had all along.

But it’s too simple just to say “we need data.” We’ve had data all along: handwritten records in manila folders on acres and acres of shelving. Insurance company records. But it’s all been locked up in silos: insurance silos, hospital silos, and many, many doctor’s office silos. Data doesn’t help if it can’t be moved, if data sources can’t be combined.

There are two big issues here. First, a surprising amount of medical records are still either hand-written, or in digital formats that are scarcely better than hand-written (for example, scanned images of hand-written records). Getting medical records into a format that’s computable is a prerequisite for almost any kind of progress. Second, we need to break down those silos.

Anyone who has worked with data knows that, in any problem, 90% of the work is getting the data in a form in which it can be used; the analysis itself is often simple. We need electronic health records: patient data in a more-or-less standard form that can be shared efficiently, data that can be moved from one location to another at the speed of the Internet. Not all data formats are created equal, and some are certainly better than others: but at this point, any machine-readable format, even simple text files, is better than nothing. While there are currently hundreds of different formats for electronic health records, the fact that they’re electronic means that they can be converted from one form into another. Standardizing on a single format would make things much easier, but just getting the data into some electronic form, any, is the first step.

Once we have electronic health records, we can link doctor’s offices, labs, hospitals, and insurers into a data network, so that all patient data is immediately stored in a data center: every prescription, every procedure, and whether that treatment was effective or not. This isn’t some futuristic dream; it’s technology we have now. Building this network would be substantially simpler and cheaper than building the networks and data centers now operated by Google, Facebook, Amazon, Apple, and many other large technology companies. It’s not even close to pushing the limits.

Electronic health records enable us to go far beyond the current mechanism of clinical trials. In the past, once a drug has been approved in trials, that’s effectively the end of the story: running more tests to determine whether it’s effective in practice would be a huge expense. A physician might get a sense for whether any treatment worked, but that evidence is essentially anecdotal: it’s easy to believe that something is effective because that’s what you want to see. And if it’s shared with other doctors, it’s shared while chatting at a medical convention. But with electronic health records, it’s possible (and not even terribly expensive) to collect documentation from thousands of physicians treating millions of patients. We can find out when and where a drug was prescribed, why, and whether there was a good outcome. We can ask questions that are never part of clinical trials: is the medication used in combination with anything else? What other conditions is the patient being treated for? We can use machine learning techniques to discover unexpected combinations of drugs that work well together, or to predict adverse reactions. We’re no longer limited by clinical trials; every patient can be part of an ongoing evaluation of whether his treatment is effective, and under what conditions. Technically, this isn’t hard. The only difficult part is getting the data to move, getting data in a form where it’s easily transferred from the doctor’s office to analytics centers.

To solve problems of hot-spotting (individual patients or groups of patients consuming inordinate medical resources) requires a different combination of information. You can’t locate hot spots if you don’t have physical addresses. Physical addresses can be geocoded (converted from addresses to longitude and latitude, which is more useful for mapping problems) easily enough, once you have them, but you need access to patient records from all the hospitals operating in the area under study. And you need access to insurance records to determine how much health care patients are requiring, and to evaluate whether special interventions for these patients are effective. Not only does this require electronic records, it requires cooperation across different organizations (breaking down silos), and assurance that the data won’t be misused (patient privacy). Again, the enabling factor is our ability to combine data from different sources; once you have the data, the solutions come easily.

Breaking down silos has a lot to do with aligning incentives. Currently, hospitals are trying to optimize their income from medical treatments, while insurance companies are trying to optimize their income by minimizing payments, and doctors are just trying to keep their heads above water. There’s little incentive to cooperate. But as financial pressures rise, it will become critically important for everyone in the health care system, from the patient to the insurance executive, to assume that they are getting the most for their money. While there’s intense cultural resistance to be overcome (through our experience in data science, we’ve learned that it’s often difficult to break down silos within an organization, let alone between organizations), the pressure of delivering more effective health care for less money will eventually break the silos down. The old zero-sum game of winners and losers must end if we’re going to have a medical system that’s effective over the coming decades.

Data becomes infinitely more powerful when you can mix data from different sources: many doctor’s offices, hospital admission records, address databases, and even the rapidly increasing stream of data coming from personal fitness devices. The challenge isn’t employing our statistics more carefully, precisely, or guardedly. It’s about letting go of an old paradigm that starts by assuming only certain variables are key and ends by correlating only these variables. This paradigm worked well when data was scarce, but if you think about, these assumptions arise precisely because data is scarce. We didn’t study the relationship between leukemia and kidney cancers because that would require asking a huge set of questions that would require collecting a lot of data; and a connection between leukemia and kidney cancer is no more likely than a connection between leukemia and flu. But the existence of data is no longer a problem: we’re collecting the data all the time. Electronic health records let us move the data around so that we can assemble a collection of cases that goes far beyond a particular practice, a particular hospital, a particular study. So now, we can use machine learning techniques to identify and test all possible hypotheses, rather than just the small set that intuition might suggest. And finally, with enough data, we can get beyond correlation to causation: rather than saying “A and B are correlated,” we’ll be able to say “A causes B,” and know what to do about it.

Building the health care system we want

The U.S. ranks 37th out of developed economies in life expectancy and other measures of health, while by far outspending other countries on per-capita health care costs. We spend 18% of GDP on health care, while other countries on average spend on the order of 10% of GDP. We spend a lot of money on treatments that don’t work, because we have a poor understanding at best of what will and won’t work.

Part of the problem is cultural. In a country where even pets can have hip replacement surgery, it’s hard to imagine not spending every penny you have to prolong Grandma’s life — or your own. The U.S. is a wealthy nation, and health care is something we choose to spend our money on. But wealthy or not, nobody wants ineffective treatments. Nobody wants to roll the dice and hope that their biology is similar enough to a hypothetical “average” patient. No one wants a “winner take all” payment system in which the patient is always the loser, paying for procedures whether or not they are helpful or necessary. Like Wanamaker with his advertisements, we want to know what works, and we want to pay for what works. We want a smarter system where treatments are designed to be effective on our individual biologies; where treatments are administered effectively; where our hospitals our used effectively; and where we pay for outcomes, not for procedures.

We’re on the verge of that new system now. We don’t have it yet, but we can see it around the corner. Ultra-cheap DNA sequencing in the doctor’s office, massive inexpensive computing power, the availability of EHRs to study whether treatments are effective even after the FDA trials are over, and improved techniques for analyzing data are the tools that will bring this new system about. The tools are here now; it’s up to us to put them into use.

Recommended reading:

We recommend the following books regarding technology, data, and health care reform:

August 13 2012

A grisly job for data scientists

Missing Person: Ai Weiwei by Daquella manera, on FlickrJavier Reveron went missing from Ohio in 2004. His wallet turned up in New York City, but he was nowhere to be found. By the time his parents arrived to search for him and hand out fliers, his remains had already been buried in an unmarked indigent grave. In New York, where coroner’s resources are precious, remains wait a few months to be claimed before they’re buried by convicts in a potter’s field on uninhabited Hart Island, just off the Bronx in Long Island Sound.

The story, reported by the New York Times last week, has as happy an ending as it could given that beginning. In 2010 Reveron’s parents added him to a national database of missing persons. A month later police in New York matched him to an unidentified body and his remains were disinterred, cremated and given burial ceremonies in Ohio.

Reveron’s ordeal suggests an intriguing, and impactful, machine-learning problem. The Department of Justice maintains separate national, public databases for missing people, unidentified people and unclaimed people. Many records are full of rich data that is almost never a perfect match to data in other databases — hair color entered by a police department might differ from how it’s remembered by a missing person’s family; weights fluctuate; scars appear. Photos are provided for many missing people and some unidentified people, and matching them is difficult. Free-text fields in many entries describe the circumstances under which missing people lived and died; a predilection for hitchhiking could be linked to a death by the side of a road.

I’ve called the Department of Justice (DOJ) to ask about the extent to which they’ve worked with computer scientists to match missing and unidentified people, and will update when I hear back. One thing that’s not immediately apparent is the public availability of the necessary training set — cases that have been successfully matched and removed from the lists. The DOJ apparently doesn’t comment on resolved cases, which could make getting this data difficult. But perhaps there’s room for a coalition to request the anonymized data and manage it to the DOJ’s satisfaction while distributing it to capable data scientists.

Photo: Missing Person: Ai Weiwei by Daquella manera, on Flickr

Related:

August 03 2012

StrataRx: Data science and health(care)

By Mike Loukides and Jim Stogdill

StrataRxWe are launching a conference at the intersection of health, health care, and data. Why?

Our health care system is in crisis. We are experiencing epidemic levels of obesity, diabetes, and other preventable conditions while at the same time our health care system costs are spiraling higher. Most of us have experienced increasing health care costs in our businesses or have seen our personal share of insurance premiums rise rapidly. Worse, we may be living with a chronic or life-threatening disease while struggling to obtain effective therapies and interventions — finding ourselves lumped in with “average patients” instead of receiving effective care designed to work for our specific situation.

In short, particularly in the United States, we are paying too much for too much care of the wrong kind and getting poor results. All the while our diet and lifestyle failures are demanding even more from the system. In the past few decades we’ve dropped from the world’s best health care system to the 37th, and we seem likely to drop further if things don’t change.

The very public fight over the Affordable Care Act (ACA) has brought this to the fore of our attention, but this is a situation that has been brewing for a long time. With the ACA’s arrival, increasing costs and poor outcomes, at least in part, are going to be the responsibility of the federal government. The fiscal outlook for that responsibility doesn’t look good and solving this crisis is no longer optional; it’s urgent.

There are many reasons for the crisis, and there’s no silver bullet. Health and health care live at the confluence of diet and exercise norms, destructive business incentives, antiquated care models, and a system that has severe learning disabilities. We aren’t preventing the preventable, and once we’re sick we’re paying for procedures and tests instead of results; and those interventions were designed for some non-existent average patient so much of it is wasted. Later we mostly ignore the data that could help the system learn and adapt.

It’s all too easy to be gloomy about the outlook for health and health care, but this is also a moment of great opportunity. We face this crisis armed with vast new data sources, the emerging tools and techniques to analyze them, an ACA policy framework that emphasizes outcomes over procedures, and a growing recognition that these are problems worth solving.

Data has a long history of being “unreasonably effective.” And at least from the technologist point of view it looks like we are on the cusp of something big. We have the opportunity to move from “Health IT” to an era of data-illuminated technology-enabled health.

For example, it is well known that poverty places a disproportionate burden on the health care system. Poor people don’t have medical insurance and can’t afford to see doctors; so when they’re sick they go to the emergency room at great cost and often after they are much sicker than they need to be. But what happens when you look deeper? One project showed that two apartment buildings in Camden, NJ accounted for a hugely disproportionate number of hospital admissions.

Targeting those buildings, and specific people within them, with integrated preventive care and medical intervention has led to significant savings.

That project was made possible by the analysis of hospital admissions, costs, and intervention outcomes — essentially, insurance claims data — across all the hospitals in Camden. Acting upon that analysis and analyzing the results of the action led to savings.

But claims data isn’t the only game in town anymore. Even more is possible as electronic medical records (EMR), genomic, mobile sensor, and other emerging data streams become available.

With mobile-enabled remote sensors like glucometers, blood pressure monitors, and futuristic tools like digital pills that broadcast their arrival in the stomach, we have the opportunity to completely revolutionize disease management. By moving from discrete and costly data events to a continuous stream of inexpensive remotely monitored data, care will improve for a broad range of chronic and life-threatening diseases. By involving fewer office visits, physician productivity will rise and costs will come down.

We are also beginning to see tantalizing hints of the future of personalized medicine in action. Cheap gene sequencing, better understanding of how drug molecules interact with our biology (and each other), and the tools and horsepower to analyze these complex interactions for a specific patient with specific biology in near real time will change how we do medicine. In the same way that Google’s AdSense took cost out of advertising by using data to target ads with precision, we’ll soon be able to make medical interventions that are much more patient-specific and cost effective.

StrataRx is based on the idea that data will improve health and health care, but we aren’t naive enough to believe that data alone solves all the problems we are facing. Health and health care are incredibly complex and multi-layered and big data analytics is only one piece of the puzzle. Solving our national crisis will also depend on policy and system changes, some of them to systems outside of health care. However, we know that data and its analysis have an important role to play in illuminating the current reality and creating those solutions.

StrataRx is a call for data scientists, technologists, health professionals, and the sector’s business leadership to convene, take part in the discussion, and make a difference!

Strata Rx — Strata Rx, being held Oct. 16-17 in San Francisco, is the first conference to bring data science to the urgent issues confronting healthcare.

Save 20% on registration with the code RADAR20

Related:

July 20 2012

Data Jujitsu: The art of turning data into product

Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day.

How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution, and instead, concentrates building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small.

We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”

How do we apply this idea to data? What is a data problem’s “weight,” and how do we use that weight against itself? These are the questions that we’ll work through in the subsequent sections.

To start, for me, a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem. That’s what we’ve been trained to do; it’s why we got into this game in the first place. But in my experience, meeting the problem head-on is a recipe for disaster. Building a great data product is extremely challenging, and the problem will always become more complex, perhaps intractable, as you try to solve it.

Before investing in a big effort, you need to answer one simple question: Does anyone want or need your product? If no one wants the product, all the analytical work you throw at it will be wasted. So, start with something simple that lets you determine whether there are any customers. To do that, you’ll have to take some clever shortcuts to get your product off the ground. Sometimes, these shortcuts will survive into the finished version because they represent some fundamentally good ideas that you might not have seen otherwise; sometimes, they’ll be replaced by more complex analytic techniques. In any case, the fundamental idea is that you shouldn’t solve the whole problem at once. Solve a simple piece that shows you whether there’s an interest. It doesn’t have to be a great solution; it just has to be good enough to let you know whether it’s worth going further (e.g., a minimum viable product).

Here’s a trivial example. What if you want to collect a user’s address? You might consider a free-form text box, but writing a parser that can identify a name, street number, apartment number, city, zip code, etc., is a challenging problem due to the complexity of the edge cases. Users don’t necessarily put in separators like commas, nor do they necessarily spell states and cities correctly. The problem becomes much simpler if you do what most web applications do: provide separate text areas for each field, and make states drop-down boxes. The problem becomes even simpler if you can populate the city and state from a zip code (or equivalent).

Now for a less trivial example. A LinkedIn profile includes a tremendous amount of information. Can we use a profile like this to build a recommendation system for conferences? The answer is “yes.” But before answering “how,” it’s important to step back and ask some fundamental questions:

A) Does the customer care? Is there a market fit? If there isn’t, there’s no sense in building an application.

B) How long do we have to learn the answer to Question A?

We could start by creating and testing a full-fledged recommendation engine. This would require an information extraction system, an information retrieval system, a model training layer, a front end with a well-designed user interface, and so on. It might take well over 1,000 hours of work before we find out whether the user even cares.

Instead, we could build a much simpler system. Among other things, the LinkedIn profile lists books.

Book recommendations from LinkedIn profile

Books have ISBN numbers, and ISBN numbers are tagged with keywords. Similarly, there are catalogs of events that are also cataloged with keywords (Lanyrd is one). We can do some quick and dirty matching between keywords, build a simple user interface, and deploy it in an ad slot to a limited group of highly engaged users. The result isn’t the best recommendation system imaginable, but it’s good enough to get a sense of whether the users care. Most importantly, it can be built quickly (e.g., in a few days, if not a few hours). At this point, the product is far from finished. But now you have something you can test to find out whether customers are interested. If so, you can then gear up for the bigger effort. You can build a more interactive user interface, add features, integrate new data in real time, and improve the quality of the recommendation engine. You can use other parts of the profile (skills, groups and associations, even recent tweets) as part of a complex AI or machine learning engine to generate recommendations.

The key is to start simple and stay simple for as long as possible. Ideas for data products tend to start simple and become complex; if they start complex, they become impossible. But starting simple isn’t always easy. How do you solve individual parts of a much larger problem? Over time, you’ll develop a repertoire of tools that work for you. Here are some ideas to get you started.

Use product design

One of the biggest challenges of working with data is getting the data in a useful form. It’s easy to overlook the task of cleaning the data and jump to trying to build the product, but you’ll fail if getting the data into a usable form isn’t the first priority. For example, let’s say you have a simple text field into which the user types a previous employer. How many ways are there to type “IBM”? A few dozen? In fact, thousands: everything from “IBM” and “I.B.M.” to “T.J. Watson Labs” and “Netezza.” Let’s assume that to build our data product it’s necessary to have all these names tied to a common ID. One common approach to disambiguate the results would be to build a relatively complex artificial intelligence engine, but this would take significant time. Another approach would be to have a drop-down list of all the companies, but this would be a horrible user experience due to the length of the list and limited flexibility in choices.

What about Data Jujitsu? Is there a much simpler and more reliable solution? Yes, but not in artificial intelligence. It’s not hard to build a user interface that helps the user arrive at a clean answer. For example, you can:

  • Support type-ahead, encouraging the user to select the most popular term.
  • Prompt the user with “did you mean … ?”
  • If at this point you still don’t have anything usable, ask the user for more help: Ask for a stock ticker symbol or the URL of the company’s home page.

The point is to have a conversation rather than just a form. Engage the user to help you, rather than relying on analysis. You’re not just getting the user more involved (which is good in itself), you’re getting clean data that will simplify the work for your back-end systems. As a matter of practice, I’ve found that trying to solve a problem on the back end is 100-1,000 times more expensive than on the front end.

MapR delivers on the promise of Hadoop, making big data management and analysis a reality for more business users. The award-winning MapR Distribution brings unprecedented dependability, speed, and ease-of-use to Hadoop.

When in doubt, use humans

As technologists, we are predisposed to look for scalable technical solutions. We often jump to technical solutions before we know what solutions will work. Instead, see if you can break down the task into bite-size portions that humans can do, then figure out a technical solution that allows the process to scale. Amazon’s Mechanical Turk is a system for posting small problems online and paying people a small amount (typically a couple of cents) for solutions. It’s come to the rescue of many an entrepreneur who needed to get a product off the ground quickly but didn’t have months to spend on developing an analytical solution.

Here’s an example. A camera company wanted to test a product that would tell restaurant owners how many tables were occupied or empty during the day. If you treat this problem as an exercise in computer vision, it’s very complex. It can be solved, but it will take some PhDs, lots of time, and large amounts of computing power. But there’s a simpler solution. Humans can easily look at a picture and tell whether or not a table has anyone seated at it. So the company took images at regular intervals and used humans to count occupied tables. This gave them the opportunity to test their idea and determine whether the product was viable before investing in a solution to a very difficult problem. It also gave them the ability to find out what their customers really wanted to know: just the number of occupied tables? The average number of people at each table? How long customers stayed at the table? That way, when they start to build the real product, using computer vision techniques rather than humans, they know what problem to solve.

Humans are also useful for separating valid input from invalid. Imagine building a system to collect recipes for an online cookbook. You know you’ll get a fair amount of spam; how do you separate out the legitimate recipes? Again, this is a difficult problem for artificial intelligence without substantial investment, but a fairly simple problem for humans. When getting started, we can send each page to three people via Mechanical Turk. If all agree that the recipe is legitimate, we can use it. If all agree that the recipe is spam, we can reject it. And if the vote is split, we can escalate by trying another set of reviewers or adding additional data to those additional reviewers that allows them to make a better assessment. The key thing is to watch for the signals the humans use to make their decisions. When we’ve identified those signals, we can start building more complex automated systems. By using humans to solve the problem initially, we can learn a great deal about the problem at a very low cost.

Aardvark (a promising startup that was acquired by Google) took a similar path. Their goal was to build a question and answer service that routed users’ questions to real people with “inside knowledge.” For example, if a user wanted to know a good restaurant for a first date in Palo Alto, Calif., Aardvark would route the question to people living in the broader Palo Alto area, then compile the answers. They started by building tools that would allow employees to route the questions by hand. They knew this wouldn’t scale, but it let them learn enough about the routing problem to start building a more automated solution. The human solution not only made it clear what they needed to build, it proved that the technical solution was worth the effort and bought them the time they needed to build it.

In both cases, if you were to graph the work expended versus time, it would look something like this:

Work vs Time graph

Ignore the fact that I’ve violated a fundamental law of data science and presented a graph without scales on the axes. The point is that technical solutions will always win in the long run; they’ll always be more efficient, and even a poor technical solution is likely to scale better than using humans to answer questions. But when you’re getting started, you don’t care about the long run. You just want to survive long enough to have a long run, to prove that your product has value. And in the short term, human solutions require much less work. Worry about scaling when you need to.

Be opportunistic for wins

I’ve stressed building the simplest possible thing, even if you need to take shortcuts that appear to be extreme. Once you’ve got something working and you’ve proven that users want it, the next step is to improve the product. Amazon provides a good example. Back when they started, Amazon pages contained product details, reviews, the price, and a button to buy the item. But what if the customer isn’t sure he’s found what he wants and wants to do some comparison shopping? That’s simple enough in the real world, but in the early days of Amazon, the only alternative was to go back to the search engine. This is a “dead end flow”: Once the user has gone back to the search box, or to Google, there’s a good chance that he’s lost. He might find the book he wants at a competitor, even if Amazon sells the same product at a better price.

Amazon needed to build pages that channeled users into other related products; they needed to direct users to similar pages so that they wouldn’t lose the customer who didn’t buy the first thing he saw. They could have built a complex recommendation system, but opted for a far simpler system. They did this by building collaborative filters to add “People who viewed this product also viewed” to their pages. This addition had a profound effect: Users can do product research without leaving the site. If you don’t see what you want at first, Amazon channels you into another page. It was so successful that Amazon has developed many variants, including “People who bought this also bought” (so you can load up on accessories), and so on.

The collaborative filter is a great example of starting with a simple product that becomes a more complex system later, once you know that it works. As you begin to scale the collaborative filter, you have to track the data for all purchases correctly, build the data stores to hold that data, build a processing layer, develop the processes to update the data, and deal with relevancy issues. Relevance can be tricky. When there’s little data, it’s easy for a collaborative filter to give strange results; with a few errant clicks in the database, it’s easy to get from fashion accessories to power tools. At the same time, there are still ways to make the problem simpler. It’s possible to do the data analysis in a batch mode, reducing the time pressure; rather than compute “People who viewed this also viewed” on the fly, you can compute it nightly (or even weekly or monthly). You can make do with the occasional irrelevant answer (“People who bought leather handbags also bought power screwdrivers”), or perhaps even use Mechanical Turk to filter your pre-computed recommendations. Or even better, ask the users for help.

Being opportunistic can be done with analysis of general products, too. The Wall Street Journal chronicles a case in which Zynga was able to rapidly build on a success in their game FishVille. You can earn credits to buy fish, but you can also purchase credits. The Zynga Analytics team noticed that a particular set of fish was being purchased at six times the rate of all the other fish. Zynga took the opportunity to design several similar virtual fish, for which they charged $3 to $4 each. The data showed that they clearly had stumbled on to something. The common trait was that the translucent feature of the fish was what the customer wanted. Using this combination of quick observations and deploying lightweight tests, they were able to significantly add to their profits.

Ground your product in the real world

We can learn more from Amazon’s collaborative filters. What happens when you go into a physical store to buy something, say, headphones? You might look for sale prices, you might look for reviews, but you almost certainly don’t just look at one product. You look at a few, most likely something located near whatever first caught your eye. By adding “People who viewed this product also viewed,” Amazon built a similar experience into the web page. In essence, they “grounded” their virtual experience to a similar one in the real world via data.

LinkedIn’s People You May Know embodies both Data Jujitsu and grounding the product in the real world. Think about what happens when you arrive at a conference reception. You walk around the outer edge until you find someone you recognize, then you latch on to that person until you see some more people you know. At that point, your interaction style changes: Once you know there are friendly faces around, you’re free to engage with people you don’t know. (It’s a great exercise to watch this happen the next time you attend a conference.)

The same kind of experience takes place when you join a new social network. The first data scientists at LinkedIn recognized this and realized that their online world had two big challenges. First, because it is a website, you can’t passively walk around the outer edges of the group. It’s like looking for friends in a darkened room. Second, LinkedIn is fighting for every second you stay on its site; it’s not like a conference where you’re likely to have a drink or two while looking for friends. There’s a short window, really only a few seconds, for you to become engaged. If you don’t see any point to the site, you click somewhere else and you’re gone.

Earlier attempts to solve this problem, such as address book importers or search facilities, imposed too much friction. They required too much work for the poor user, who still didn’t understand why the site was valuable. But our LinkedIn team realized that a few simple heuristics could be used to determine a set of “people you may know.” We didn’t have the resources to build a complete solution. But to get something started, we could run a series of simple queries on the database: “what do you do,” “where do you live,” “where did you go to school,” and other questions that you might ask someone you met for the first time. We also used triangle closing (if Jane is connected to Mark, and Mark is connected to Sally, Sally and Jane have a high likelihood of knowing each other). To test the idea, we built a customized ad that showed each user the three people they were most likely to know. Clicking on one of those people took you to the “add connection” page. (Of course, if you saw the ad again, the results would have been the same, but the point was to quickly test with minimal impact to the user.) The results were overwhelming; it was clear that this needed to become a full-blown product, and it was quickly replicated by Facebook and all other social networks. Only after realizing that we had a hit on our hands did we do the work required to build the sophisticated machinery necessary to scale the results.

After People You May Know, our LinkedIn team realized that we could use a similar approach to build Groups You May Like. We built it almost as an exercise, when we were familiarizing ourselves with some new database technologies. It took under a week to build the first version and get it on to the home page, again using an ad slot. In the process, we learned a lot about the limitations and power of a recommendation system. On one hand, the numbers showed that people really loved the product. But additional filter rules were needed: Users didn’t like it when the system recommended political or religious groups. In hindsight, this seems obvious, almost funny, but it would have been very hard to anticipate all the rules we needed in advance. This lightweight testing gave us the flexibility to add rules as we discovered we needed them. Since we needed to test our new databases anyway, we essentially got this product “for free.” It’s another great example of a group that did something successful, then immediately took advantage of the opportunities for further wins.

Give data back to the user to create additional value

By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both. They may accept that, but I don’t know anyone who’s happy about it. But giving data back to the user is a way of showing that you’re on their side, increasing their engagement with your product.

How do you give data back to the user? LinkedIn has a product called “Who’s Viewed Your Profile.” This product lists the people who have viewed your profile (respecting their privacy settings, of course), and provides statistics about the viewers. There’s a time series view, a list of search terms that have been used to find you, and the geographical areas in which the viewers are located. It’s timely and actionable data, and it’s addictive. It’s visible on everyone’s home page, and it shows the number of profile views, so it’s not static. Every time you look at your LinkedIn page, you’re tempted to click.

Who Viewed Profile box from LinkedIn

And people do click. Engagement is so high that LinkedIn has two versions: one free, and the other part of the subscription package. This product differentiation benefits the casual user, who can see some summary statistics without being overloaded with more sophisticated features, while providing an easy upgrade path for more serious users.

LinkedIn isn’t the only product that provides data back to the user. Xobni analyzes your email to provide better contact management and help you control your inbox. Mint (acquired by Intuit) studies your credit cards to help you understand your expenses and compare them to others in your demographic. Pacific Gas and Electric has a SmartMeter that allows you to analyze your energy usage. We’re even seeing health apps that take data from your phone and other sensors and turn it into a personal dashboard.

In short, everyone reading this has probably spent the last year or more of their professional life immersed in data. But it’s not just us. Everyone, including users, has awakened to the value of data. Don’t hoard it; give it back, and you’ll create an experience that is more engaging and more profitable for both you and your company.

No data vomit

As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is “data vomit.” It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want.

When we were building the prototype for “Who’s Viewed My Profile,” we created an early version that showed all sorts of amazing data, with a fantastic ability to drill down into the detail. How many clicks did we get when we tested it? Zero. Why? An “inverse interaction law” applies to most users: The more data you present, the less interaction.

Cool interactions graph

The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take? If you want them to be impressed with the number of things that you can do with the data, then you’re likely producing data vomit. If you’re able to lead them to a clear set of actions, then you’ve built a product with a clear focus.

Expect unforeseen side effects

Of course, it’s impossible to avoid unforeseen side effects completely, right? That’s what “unforeseen” means. However, unforeseen side effects aren’t a joke. One of the best examples of an unforeseen side effect is “My TiVo Thinks I’m Gay.” Most digital video recorders have a recommendation system for other shows you might want to watch; they’ve learned from Amazon. But there are cases wherein a user has watched a particular show (say “Will & Grace”), and then it recommends other shows with similar themes (“The Ellen DeGeneres Show,” “Queer as Folk,” etc.). Along similar lines, An Anglo friend of mine who lives in a neighborhood with many people from Southeast Asia recently told me that his Netflix recommendations are overwhelmed with Bollywood films.

This sounds funny, and it’s even been used as the basis of a sitcom plot. But it’s a real pain point for users. Outsmarting the recommendation engine once it has “decided” what you want is difficult and frustrating, and you stand a good chance of losing the customer. What’s going wrong? In the case of the Bollywood recommendations, the algorithm is probably overemphasizing the movies that have been watched by the surrounding population. With the TiVo, there’s no easy way to tell the system that it’s wrong. Instead, you’re forced to try to outfox it, and users who have tried have discovered that it’s hard to out think an intelligent agent that has gotten the wrong idea.

Improving precision and recall

What tools do we have to think about bad results — things like unfortunate recommendations and collaborative filtering gone wrong? Two concepts, precision and recall, let us describe the problem more precisely. Here’s what they mean:

Precision — The ability to provide a result that exactly matches what’s desired. If you’re building a recommendation engine, can you give a good recommendation every time? If you’re displaying advertisements, will every ad result in a click? That’s high precision.

Recall — The set of possible good recommendations. Recall is fundamentally about inventory: Good recall means that you have a lot of good recommendations, or a lot of advertisements that you can potentially show the user.

It’s obvious that you’d like to have both high precision and high recall. For example, if you’re showing a user advertisements, you’d be in heaven if you have a lot of ads to show, and every ad has a high probability of resulting in a click. Unfortunately, precision and recall often work against each other: As precision increases, recall drops, and vice versa. The number of ads that have a 95% chance of resulting in a click is likely to be small indeed, and the number of ads with a 1% chance is obviously much larger.

So, an important issue in product design is the tradeoff between precision versus recall. If you’re working on a search engine, precision is the key, and having a large inventory of plausible search results is irrelevant. Results that will satisfy the user need to get to the top of the page. Low-precision search results yield a poor experience.

On the other hand, low-precision ads are almost harmless (perhaps because they’re low precision, but that’s another matter). It’s hard to know what advertisement will elicit a click, and generally it’s better to show a user something than nothing at all. We’ve seen enough irrelevant ads that we’ve learned to tune them out effectively.

The difference between these two cases is how the data is presented to the user. Search data is presented directly: If you search Google for “data science,” you’ll get 1.16 billion results in 0.47 seconds (as of this writing). The results on the first few pages will all have the term “data science” in them. You’re getting results directly related to your search; this makes intuitive sense. But the rationale behind advertising content is obfuscated. You see ads, but you don’t know why you were shown those ads. Nothing says, “We showed you this ad because you searched for data science and we know you live in Virginia, so here’s the nearest warehouse for all your data needs.” Since the relationship between the ad and your interests is obfuscated, it’s hard to judge an ad harshly for being irrelevant, but it’s also not something you’re going to pay attention to.

Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision.

Subjectivity

Another issue to contend with is subjectivity: How does the user perceive the results? One product at LinkedIn delivers a set of up to 10 job recommendations. The problem is that users focus on the bad recommendations, rather than the good ones. If nine results are spot on and one is off, the user will leave thinking that the entire product is terrible. One bad experience can spoil a consistently good experience. If, over five web sessions, we show you 49 perfect results in a row, but the 50th one doesn’t make sense, the damage is still done. It’s not quite as bad as if the bad result appeared in the first session, but it’s still done, and it’s hard to recover. The most common guideline is to strive for a distribution in which there are many good results, a few great ones, and no bad ones.

That’s only part of the story. You don’t really know what the user will consider a poor recommendation. Here are two sets of job recommendations:

Jobs You May Be Interested In example 1

Jobs You May Be Interested In example 2

What’s important: The job itself? Or the location? Or the title? Will the user consider a recommendation “bad” if it’s a perfect fit, but requires him to move to Minneapolis? What if the job itself is a great fit, but the user really wants “senior” in the title? You really don’t know. It’s very difficult for a recommendation engine to anticipate issues like these.

Enlisting other users

One jujitsu approach to solving this problem is to flip it around and use the social system to our advantage. Instead of sending these recommendations directly to the user, we can send the recommendations to their connections and ask them to pass along the relevant ones. Let’s suppose Mike sends me a job recommendation that, at first glance, I don’t like. One of these two things is likely to happen:

  • I’ll take a look at the job recommendation and realize it is a terrible recommendation and it’s Mike’s fault.
  • I’ll take a look at the job recommendation and try to figure out why Mike sent it. Mike may have seen something in it that I’m missing. Maybe he knows that the company is really great.

At no time is the system being penalized for making a bad recommendation. Furthermore, the product is producing data that now allows us to better train the models and increase overall precision. Thus, a little twist in the product can make a hard relevance problem disappear. This kind of cleverness lets you take a problem that’s extraordinarily challenging and gives you an edge to make the product work.

Referral Center example from LinkedIn

Ask and you shall receive

We often focus on getting a limited set of data from a user. But done correctly, you can engage the user to give you more useful, high-quality data. For example, if you’re building a restaurant recommendation service, you might ask the user for his or her zip code. But if you also ask for the zip code where the user works, you have much more information. Not only can you make recommendations for both locations, but you can predict the user’s typical commute patterns and make recommendations along the way. You increase your value to the user by giving the user a greater diversity of recommendations.

In keeping with Data Jujitsu, predicting commute patterns probably shouldn’t be part of your first release; you want the simplest thing that could possibly work. But asking for the data gives you the potential for a significantly more powerful and valuable product.

Take heed not just to demand data. You need to explain to the user why you’re asking for data; you need to disarm the user’s resistance to providing more information by telling him that you’re going to provide value (in this case, more valuable recommendations), rather than abusing the data. It’s essential to remember that you’re having a conversation with the user, rather than giving him a long form to fill out.

Anticipate failure

As we’ve seen, data products can fail because of relevance problems arising from the tradeoff between precision and recall. Design your product with the assumption that it will fail. And in the process, design it so that you can preserve the user experience even if it fails.

Two data products that demonstrate extremes in user experience are Sony’s AIBO (a robotic pet), and interactive voice response systems (IVR), such as the ones that answer the phone when you call an airline to change a flight.

Let’s consider the AIBO first. It’s a sophisticated data product. It takes in data from different sensors and uses this data to train models so that it can respond to you. What do you do if it falls over or does something similarly silly, like getting stuck walking into a wall? Do you kick it? Curse at it? No. Instead, you’re likely to pick it up and help it along. You are effectively compensating for when it fails. Let’s suppose instead of being a robotic dog, it was a robot that brought hot coffee to you. If it spilled the coffee on you, what would your reaction be? You might both kick it and curse at it. Why the difference? The difference is in the product’s form and execution. By making the robot a dog, Sony limited your expectations; you’re predisposed to cut the robot slack if it doesn’t perform correctly.

Now, let’s consider the IVR system. This is also a sophisticated data product. It tries to understand your speech and route you to the right person, which is no simple task. When you call one these systems, what’s your first response? If it is voice activated, you might say, “operator.” If that doesn’t work, maybe you’ll say “agent” or “representative.” (I suspect you’ll be wanting to scream “human” into the receiver.) Maybe you’ll start pressing the button “0.” Have you ever gone through this process and felt good? More often than not, the result is frustration.

What’s the difference? The IVR product inserts friction into the process (at least from the customer’s perspective), and limits his ability to solve a problem. Furthermore, there isn’t an easy way to override the system. Users think they’re up against a machine that thinks it is smarter than they are, and that is keeping them from doing what they want. Some could argue that this is a design feature, that adding friction is a way of controlling the amount of interaction with customer service agents. But, the net result is frustration for the customer.

You can give your data product a better chance of success by carefully setting the users’ expectations. The AIBO sets expectations relatively low: A user doesn’t expect a robotic dog to be much other than cute. Let’s think back to the job recommendations. By using Data Jujitsu and sending the results to the recipient’s network, rather than directly to him, we create a product that doesn’t act like an overly intelligent machine that the user is going to hate. By enlisting a human to do the filtering, we put a human face behind the recommendation.

One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected? A product like the AIBO, or like job recommendations sent via a friend, is structured so that the user is predisposed toward feeling good after he’s finished.

In many applications, a design treatment that gives the user control over the outcome can go far to create interactions that leave the user feeling good. For example, if you’re building a collaborative filter, you will inevitably generate incorrect recommendations. But you can allow the user to tell you about poor recommendations with a button that allows the user to “X” out recommendations he doesn’t like.

Facebook uses this design technique when they show you an ad. They also give you control to hide the ad, as well as an an opportunity to tell them why you don’t think the ad is relevant. The choices they give you range from not being relevant to being offensive. This provides an opportunity to engage users as well as give them control. It turns annoyance into empowerment; rather than being a victim of the bad ad targeting, users get to feel that they can make their own recommendations about which ads they will see in the future.

Facebook ad targeting and customization

Putting Data Jujitsu into practice

You’ve probably recognized some similarities between Data Jujitsu and some of the thought behind agile startups: Data Jujitsu embraces the notion of the minimum viable product and the simplest thing that could possibly work. While these ideas make intuitive sense, as engineers, many of us have to struggle against the drive to produce a beautiful, fully-featured, massively complex solution. There’s a reason that Rube Goldberg cartoons are so attractive. Data Jujitsu is all about saying “no” to our inner Rube Goldberg.

I talked at the start about getting clean data. It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. If you can come up with strategies for data entry that are inherently clean (such as populating city and state fields from a zip code), you’re much better off. Work done up front in getting clean data will be amply repaid over the course of the project.

A surprising amount of Data Jujitsu is about product design and user experience. If you can design your product so that users are predisposed to cut it some slack when it’s wrong (like the AIBO or, for that matter, the LinkedIn job recommendation engine), you’re way ahead. If you can enlist your users to help, you’re ahead on several levels: You’ve made the product more engaging, and you’ve frequently taken a shortcut around a huge data problem.

The key aspect of making a data product is putting the “product” first and “data” second. Saying it another way, data is one mechanism by which you make the product user-focused. With all products, you should ask yourself the following three questions:

  1. What do you want the user to take away from this product?
  2. What action do you want the user to take because of the product?
  3. How should the user feel during and after using your product?

If your product is successful, you will have plenty of time to play with complex machine learning algorithms, large computing clusters running in the cloud, and whatever you’d like. Data Jujitsu isn’t the end of the road; it’s really just the beginning. But it’s the beginning that allows you to get to the next step.

Strata Conference + Hadoop World — The O’Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World. Save 20% on registration with the code RADAR20

Related:

 

August 18 2010

The laws of information chemistry

In the course of my work on the elmcity project I've talked to a lot of people about forming networks of calendars. One of the major hurdles has been the very idea that we can form such networks, in an ad-hoc way, using informal contracts. Later in this series I'll explore why that's a tough concept, and mull over how we might soften it up. Here I'll focus on an even more basic conceptual stumbling block: information structure.

Everybody learns that things in the physical world are structured in ways that govern how they can or cannot interact. Whether it's proteins folded into biochemical locks and keys, or metallic parts formed into real locks and keys, we know the drill. The right shape will open the door, the wrong one won't. You can't get through grade school without being exposed to that idea.

Unless you're on an IT track, though, you'll likely graduate from college without ever learning this corollary: The right information structures open doors, the wrong ones won't.

My project has shown me that many otherwise well-educated professionals have no intuitive sense of the differences between these various representations of a calendar:

  1. As a PDF file

  2. As an HTML page

  3. As an RSS feed
  4. As an iCalendar feed

These are all just different flavors of computer files, most people think. Pick a format that can be read on a PC or a Mac and you're good to go. So my local high school, for example, uses PDF:

It irks me the school publishes this data without acknowledging that it is data, or providing it in a way that's appropriate for the kind of data it is. In 2010, one of the "tools to succeed in a diverse and interdependent world" has to be a basic working knowledge of information chemistry. The quotation about learning that appears at the top of that image speaks to the underlying principle:

"Treat it as an active process of constructing ideas, rather than a passive process of absorbing information." - Daniel J. Boorstin

I looked for that quotation's context, by the way, and didn't find it in any of Boorstin's works. Instead it shows up here:

Anyway, I agree with the authors of "From Risk To Renewal: Charting a Course for Reform." We don't just passively dwell in social information networks, we actively co-create them. To do that effectively we need to know what will or won't catalyze a chemical reaction in data space.

The reaction I hope the elmcity project will help catalyze is one that unlocks calendar data and enables it to flow freely through networks without loss of fidelity. In theory, any of the four document flavors listed above could work, if supported by tools that encode and exchange the core structure of an event: a title, a date and time, a link to the authoritative source. In fact there's only one common flavor that preserves that structure: iCalendar. And there's only one widely-deployed kind of application, examples of which include Google Calendar, Microsoft Outlook, Apple iCal, and Lotus Notes.

Among technical folk there's been an on-again, off-again effort to migrate the iCalendar standard from its existing plain text format to one based on XML, which didn't exist when iCalendar was born. For me it's a wash. Either flavor can encode the basic facts in a way that enables calendar networks to form. Will translation between the flavors be a problem? It shouldn't be, but if so I'd regard it as a good problem to have as compared to the one we've actually got, which is that nearly all the calendar information available online isn't in any calendar format. It's randomly dumped into PDF files, or into HTML pages that don't (as they might) encode event structure using the hCalendar microformat.

The calendar-like HTML page is so common that a service called FuseCal tried (with pretty good success) to scrape those web pages and turn them into standard iCalendar feeds. The service is gone now, and one piece of the elmcity project (which I describe in the companion how-to article "How to write an elmcity event parser plug-in") aims to recreate it in a modest way. I'm ambivalent about doing this, though, because web-page scraping sweeps the real problem under the rug. Of course we can't expect people to read and write raw data-exchange formats. But we can and should expect people to have a clue about what data-exchange formats are, and to know something about when and why to use them.

There has been progress. Starting with the early blogosophere, and continuing into the present era of Facebook and Twitter, the technocracy has introduced the masses to the concept of information feeds. Many people now know, in a general way, that some molecular strands of information combine more readily than others. But the concept isn't yet fully digested. So, for example, events pages on websites are far more likely to link to RSS or Atom feeds than to iCalendar feeds. With apologies to Guy Kawasaki and Terence Trent D'Arby, that's the Right Thing done the Wrong Way.

Here's why: Publishing a data feed is absolutely the right idea, but using RSS or Atom feeds to do it is a category error. Because these feeds don't encode dates, times, and locations in any standard way, they're part of the blogosophere but can't flow through calendar networks.

Calendars, of course, are just one of many types of data that can drive online chemical reactions. We're reaching a consensus that open publication of data is a necessary condition. But it's not sufficient. We've always expected educated citizens to know at least basic physics and chemistry. Now we need to discover, write down, and teach the analogous laws that govern social information networks.



Related:


Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!

Schweinderl