
October 03 2012

Biohacking: The next great wave of innovation

I’ve been following synthetic biology for the past year or so, and we’re about to see some big changes. Synthetic bio seems to be now where the computer industry was in the late 1970s: still nascent, but about to explode. The hacker culture that drove the development of the personal computer, and that continues to drive technical progress, is forming anew among biohackers.

Computers certainly existed in the ’60s and ’70s, but they were rare, and operated by “professionals” rather than enthusiasts. But an important change took place in the mid-’70s: computing became the domain of amateurs and hobbyists. I read recently that the personal computer revolution started when Steve Wozniak built his own computer in 1975. That’s not quite true, though. Woz was certainly a key player, but he was also part of a club. More important, Silicon Valley’s Homebrew Computer Club wasn’t the only one. At roughly the same time, a friend of mine was building his own computer in a dorm room. And hundreds of people, scattered throughout the U.S. and the rest of the world, were doing the same thing. The revolution wasn’t the result of one person: it was the result of many, all moving in the same direction.

Biohacking has the same kind of momentum. It is breaking out of the confines of academia and research laboratories. There are two significant biohacking hackerspaces in the U.S., Genspace in New York and BioCurious in California, and more are getting started. Making glowing bacteria (the biological equivalent of “Hello, World!”) is on the curriculum in high school AP bio classes. iGEM is an annual competition to build “biological robots.” A grassroots biohacking community is developing, much as it did in computing. That community is transforming biology from a purely professional activity, requiring lab coats, expensive equipment, and other accoutrements, to something that hobbyists and artists can do.

As part of this transformation, the community is navigating the transition from extremely low-level tools to higher-level constructs that are easier to work with. When I first learned to program on a PDP-8, you had to start the computer by loading a sequence of 13 binary numbers through switches on the front panel. Early microcomputers weren’t much better, but by the time of the first Apples, things had changed. DNA is similar to machine language (except it’s in base four, rather than binary), and in principle hacking DNA isn’t much different from hacking machine code. But synthetic biologists are currently working on the notion of “standard biological parts,” or genetic sequences that enable a cell to perform certain standardized tasks. Standardized parts will give practitioners the ability to work in a “higher level language.” In short, synthetic biology is going through the same transition in usability that computing saw in the ’70s and ’80s.

Alongside this increase in usability, we’re seeing a drop in price, just as in the computer market. Computers cost serious money in the early ’70s, but the price plummeted, in part because of hobbyists: seminal machines like the Apple II, the TRS-80, and the early Macintosh would never have existed if not to serve the needs of hobbyists. Right now, setting up a biology lab is expensive; but we’re seeing the price drop quickly, as biohackers figure out clever ways to make inexpensive tools, such as the DremelFuge, and learn how to scrounge for used equipment.

And we’re also seeing an explosion in entrepreneurial activity. Just as the Homebrew Computer Club and other garage hackers led to Apple and Microsoft, the biohacker culture is full of similarly ambitious startups, working out of hackerspaces. It’s entirely possible that the next great wave of entrepreneurs will be biologists, not programmers.

What are the goals of synthetic biology? There are plenty of problems, from the industrial to the medical, that need to be solved. Drew Endy told me how one of the first results from synthetic biology, the creation of soap that would be effective in cold water, reduced the energy requirements of the U.S. by 10%. The holy grail in biofuels is bacteria that can digest cellulose (essentially, the leaves and stems of any plant) and produce biodiesel. That seems achievable. Can we create bacteria that would live in a diabetic’s intestines and produce insulin? Certainly.

But industrial applications aren’t the most interesting problems waiting to be solved. Endy is concerned that, if synthetic bio is dominated by a corporate agenda, it will cease to be “weird,” and won’t ask the more interesting questions. One Synthetic Aesthetics project made cheeses from microbes that were cultured from the bodies of people in the synthetic biology community. Christian Bok has inserted poetry into a microbe’s DNA. These are the projects we’ll miss if the agenda of synthetic biology is defined by business interests. And these are, in many ways, the most important projects, the ones that will teach us more about how biology works, and the ones that will teach us more about our own creativity.

The last 40 years of computing have proven what a hacker culture can accomplish. We’re about to see that again, this time in biology. And, while we have no idea what the results will be, it’s safe to predict that the coming revolution in biology will radically change the way we live — at least as radically as the computer revolution. It’s going to be an interesting and exciting ride.


September 06 2012

Digging into the UDID data

Over the weekend the hacker group Antisec released one million UDID records that they claim to have obtained from an FBI laptop using a Java vulnerability. In reply the FBI stated:

The FBI is aware of published reports alleging that an FBI laptop was compromised and private data regarding Apple UDIDs was exposed. At this time there is no evidence indicating that an FBI laptop was compromised or that the FBI either sought or obtained this data.

Of course that statement leaves a lot of leeway. It could be the agent’s personal laptop, and the data may well have been “property” of another agency. The wording doesn’t even explicitly rule out the possibility that this was an agency laptop; they just say that right now they don’t have any evidence to suggest that it was.

This limited data release doesn’t have much impact, but the possible release of the full dataset, which is claimed to include names, addresses, phone numbers and other identifying information, is far more worrying.

While there are some almost dismissing the issue out of hand, the real issues here are: Where did the data originate? Which devices did it come from and what kind of users does this data represent? Is this data from a cross-section of the population, or a specifically targeted demographic? Does it originate within the law enforcement community, or from an external developer? What was the purpose of the data, and why was it collected?

With conflicting stories from all sides, the only thing we can believe is the data itself. The 40-character strings in the release at least look like UDID numbers, and anecdotally at least we have a third-party confirmation that this really is valid UDID data. We therefore have to proceed at this point as if this is real data. While there is a possibility that some, most, or all of the data is falsified, that’s looking unlikely from where we’re standing at the moment.
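The "looks like a UDID" check can be sketched in a few lines. This is a minimal sketch, assuming that a UDID of this era is exactly 40 hexadecimal characters; the function name `looks_like_udid` is invented for illustration. Passing the check only shows that a string has the right shape, not that it belongs to a real device.

```python
import re

# Assumption: a UDID is exactly 40 hexadecimal characters.
UDID_PATTERN = re.compile(r"[0-9a-fA-F]{40}")


def looks_like_udid(value):
    """Return True if value is exactly 40 hex characters, ignoring outer whitespace."""
    return UDID_PATTERN.fullmatch(value.strip()) is not None
```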

With that as the backdrop, the first action I took was to check the released data for my own devices and those of family members. Of the nine iPhones, iPads and iPod Touch devices kicking around my house, none of the UDIDs are in the leaked database. Of course there isn’t anything to say that they aren’t amongst the other 11 million UDIDs that haven’t been released.

With that done, I broke down the distribution of leaked UDID numbers by device type. Interestingly, considering the number of iPhones in circulation compared to the number of iPads, the bulk of the UDIDs were self-identified as originating on an iPad.

Distribution of UDID by device type
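A breakdown like this one amounts to a simple tally. The sketch below assumes the records have already been parsed into dictionaries with a `device_type` key; that key name, the function name and the sample rows are all hypothetical and are not taken from the leaked file.

```python
from collections import Counter


def device_type_share(records):
    """Return {device_type: fraction of all records} for an iterable of dict rows."""
    counts = Counter(row["device_type"] for row in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


# Made-up rows for illustration only.
sample = [
    {"device_type": "iPad"},
    {"device_type": "iPad"},
    {"device_type": "iPhone"},
    {"device_type": "iPod touch"},
]
shares = device_type_share(sample)
```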

What does that mean? Here’s one theory: If the leak originated from a developer rather than directly from Apple, and assuming that this subset of data is a good cross-section on the total population, and assuming that the leaked data originated with a single application … then the app that harvested the data is likely a Universal application (one that runs on both the iPhone and the iPad) that is mostly used on the iPad rather than on the iPhone.

The very low numbers of iPod Touch users might suggest either demographic information, or that the application is not widely used by younger users who are the target demographic for the iPod Touch, or alternatively perhaps that the application is most useful when a cellular data connection is present.

The next thing to look at, as the only field with unconstrained text, was the Device Name data. That particular field contains a lot of first names, e.g. “Aaron’s iPhone,” so roughly speaking the distribution of first letters in this field should give a decent clue as to the geographical region of origin of the leaked list of UDIDs. This distribution is of course going to be different depending on the predominant language in the region.

Distribution of UDID by the first letter of the “Device Name” field
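For readers who want to reproduce this kind of distribution, here is a minimal sketch. It assumes the Device Name values are available as a list of strings, folds upper and lower case together, and skips names that do not start with a letter. Those are choices made for this example and are not necessarily the ones used to produce the chart.

```python
from collections import Counter


def first_letter_distribution(names):
    """Return {letter: share} for the first letter of each device name."""
    letters = Counter()
    for name in names:
        stripped = name.strip()
        # Skip empty names and names starting with a digit or symbol.
        if stripped and stripped[0].isalpha():
            letters[stripped[0].lower()] += 1
    total = sum(letters.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in letters.items()}
```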

The immediate stand out from this distribution is the predominance of device name strings starting with the letter “i.” This can be ascribed to people who don’t have their own name prepended to the Device Name string, and have named their device “iPhone,” “iPad” or “iPod Touch.”

The obvious next step was to compare this distribution with the relative frequency of first letters in words in the English language.

Comparing the distribution of UDID by first letter of the “Device Name” field against the relative frequencies of the first letters of a word in the English language

The spike for the letter “i” dominated the data, so the next step was to do some rough and ready data cleaning.

I dropped all the Device Name strings that started with the string “iP.” That cleaned out all those devices named “iPhone,” “iPad” and “iPod Touch.” Doing that brought the number of device names starting with an “i” down from 159,925 to just 13,337. That’s a bit more reasonable.
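That cleaning step is a plain prefix filter. The sketch below is one way to write it, with a hypothetical function name. Note that the match is case-sensitive, so a name such as “Ian’s iPad” is kept, while any name that leads with the product name is dropped along with the unnamed devices.

```python
def drop_default_names(names, prefix="iP"):
    """Remove device names that start with the given prefix (case-sensitive)."""
    return [name for name in names if not name.startswith(prefix)]
```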

Comparing the distribution of UDID by first letter of the “Device Name” field, ignoring all names that start with the string “iP,” against the relative frequencies of the first letters of a word in the English language

I had a slight over-abundance of “j,” although that might not be statistically significant. However, the stand out was that there was a serious under-abundance of strings starting with the letter “t,” which is interesting. Additionally, with my earlier data cleaning I also had a slight under-abundance of “i,” which suggested I may have been too enthusiastic about cleaning the data.
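One way to put numbers on “over-abundance” and “under-abundance” is the ratio of the observed share to the expected share for each letter. This is a sketch under stated assumptions: both inputs are dictionaries mapping a letter to a share, and the figures in the example are made up to show the mechanics. They are not real language frequencies and not values from the leaked data. A ratio on its own says nothing about statistical significance.

```python
def abundance_ratio(observed, expected):
    """Return {letter: observed share / expected share}.

    A value above 1 means over-abundant, below 1 under-abundant.
    Letters with an expected share of zero are skipped.
    """
    return {
        letter: observed.get(letter, 0.0) / share
        for letter, share in expected.items()
        if share > 0
    }


# Made-up shares, for illustration only.
observed = {"t": 0.05, "j": 0.04}
expected = {"t": 0.10, "j": 0.02}
ratios = abundance_ratio(observed, expected)
```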

Looking at the relative frequency of letters in languages other than English it’s notable that amongst them Spanish has a much lower frequency of the use of “t.”

As the de facto second language of the United States, Spanish is the obvious next choice to investigate. If the devices are predominantly Spanish in origin then this could solve the problem introduced by our data cleaning. As Marcos Villacampa noted in a tweet, in Spanish you would say “iPhone de Mark” rather than “Mark’s iPhone.”

Comparing the distribution of UDID by first letter of the “Device Name” field, ignoring all names that start with the string “iP,” against the relative frequencies of the first letters of a word in the Spanish language

However, that distribution didn’t really fit either. While “t” was much better, I now had an under-abundance of words with an “e.” It should be noted, though, that unlike our English language relative frequencies, the data I was using for Spanish is for letters in the entire word, rather than letters that begin the word. That’s certainly going to introduce biases, perhaps fatal ones.

Not that I can really make the assumption that there is only one language present in the data, or even that one language predominates, unless that language is English.

At this stage it’s obvious that the data is, at least more or less, of the right order of magnitude. The data probably shows devices coming from a Western country. However, we’re a long way from the point where I’d come out and say something like “… the device names were predominantly in English.” That’s not a conclusion I can make.

I’d be interested in tracking down the relative frequency of letters used in Arabic when the language is transcribed into the Roman alphabet. While I haven’t been able to find that data, I’m sure it exists somewhere. (Please drop a note in the comments if you have a lead.)

The next step for the analysis is to look at the names themselves. While I’m still in the process of mashing up something that will access U.S. census data and try and reverse geo-locate a name to a “most likely” geographical origin, such services do already exist. And I haven’t really pushed the boundaries here, or even started a serious statistical analysis of the subset of data released by Antisec.

This brings us to Pete Warden’s point that you can’t really anonymize your data. The anonymization process for large datasets such as this is simply an illusion. As Pete wrote:

Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records.

While this release in itself is fairly harmless, a number of “harmless” releases taken together — or cleverly cross-referenced with other public sources such as Twitter, Google+, Facebook and other social media — might well be more damaging. And that’s ignoring the possibility that Antisec really might have names, addresses and telephone numbers to go side-by-side with these UDID records.

The question has to be asked then, where did this data originate? While 12 million records might seem a lot, compared to the number of devices sold it’s not actually that big a number. There are any number of iPhone applications with a 12-million-user installation base, and this sort of backend database could easily have been built up by an independent developer with a successful application who downloaded the device owner’s contact details before Apple started putting limitations on that.

Ignoring conspiracy theories, this dataset might be the result of a single developer. Although how it got into the FBI’s possession and the why of that, if it was ever there in the first place, is another matter entirely.

I’m going to go on hacking away at this data to see if there are any more interesting correlations, and I do wonder whether Antisec would consider a controlled release of the data to some trusted third party?

Much like the reaction to #locationgate, where some people were happy to volunteer their data, if enough users are willing to self-identify, then perhaps we can get to the bottom of where this data originated and why it was collected in the first place.

Thanks to Hilary Mason, Julie Steele, Irene Ros, Gemma Hobson and Marcos Villacampa for ideas, pointers to comparative data sources, and advice on visualisation of the data.



In response to a post about this article on Google+, Josh Hendrix made the suggestion that I should look at word as well as letter frequency. It was a good idea, so I went ahead and wrote a quick script to do just that…
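The original script isn’t reproduced here, but a word count of this kind can be sketched as follows. The assumptions are mine: words are runs of ASCII letters, single-letter leftovers such as the “s” from “Aaron’s” are discarded, and counting is case-sensitive so that different capitalisations show up as separate entries.

```python
import re
from collections import Counter


def word_counts(names):
    """Count words across device names; case-sensitive, ASCII letters only."""
    counts = Counter()
    for name in names:
        tokens = re.findall(r"[A-Za-z]+", name)
        # Drop one-letter fragments such as a possessive "s".
        counts.update(token for token in tokens if len(token) > 1)
    return counts
```

Calling `word_counts(names).most_common(10)` then lists the ten most frequent words with their counts.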

The top two words in the list are “iPad,” which occurs 445,111 times, and “iPhone,” which occurs 252,106 times. The next most frequent word is “iPod,” but that occurs only 36,367 times. This result backs up my earlier result looking at distribution by device type.

Then there are various misspellings and mis-capitalisations of “iPhone,” “iPad,” and “iPod.”

The first real word that isn’t an Apple trademark is “Administrator,” which occurs 10,910 times. Next are “David” (5,822), “John” (5,447), and “Michael” (5,034). This is followed by “Chris” (3,744), “Mike” (3,744), “Mark” (3,66) and “Paul” (3,096).

Looking down the list of real names, as opposed to partial strings and tokens, the first female name doesn’t occur until we’re 30 places down the list — it’s “Lisa” (1,732) with the next most popular female name being “Sarah” (1,499), in 38th place.

The top 100 names occurring in the UDID list.

The word “Dad” occurs 1,074 times, with “Daddy” occurring 383 times. For comparison the word “Mum” occurs just 58 times, and “Mummy” just 33. “Mom” came in with 150 occurrences, and “mommy” with 30. The number of occurrences for “mum,” “mummy,” “mom,” and “mommy” combined is 271, which is still very small compared to the combined total of 1,457 for “dad” and “daddy.”

[Updated: Greg Yardly wisely pointed out on Twitter that I was being a bit English-centric in only looking for the words "mum" and "mummy," which is why I expanded the scope to include "mom" and "mommy."]

There is a definite gender bias here, and I can think of at least a few explanations. The most likely is fairly simplistic: The application where the UDID numbers originated either appeals more to men or is used more by them.

Alternatively, women may be less likely to include their name in the name of their device, perhaps because, amongst other things, this name is used to advertise the device on wireless networks.

Either way I think this definitively pins it down as a list of devices originating in an Anglo-centric geographic region.

Sometimes the simplest things work better. Instead of being fancy perhaps I should have done this in the first place. However this, combined with my previous results, suggests that we’re looking at an English speaking, mostly male, demographic.

Correlating the top 20 or so names with the list of most popular baby names (by year) all the way from the mid-’60s up until the mid-’90s (so looking at the most popular names for people between the ages of say 16 and 50) might give a further clue as to the exact demographic involved.

Both Gemma Hobson and Julie Steele directed me toward the U.S. Social Security Administration’s Popular Baby Names By Decade list. A quick and dirty analysis suggests that the UDID data is dominated by names that were most popular in the ’70s and ’80s. This maps well to my previous suggestion that the lack of iPod Touch usage might suggest that the demographic was older.
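A quick and dirty comparison of this sort could look something like the sketch below. For each decade it adds up how often that decade’s popular names occur in the device names. Everything in the example is a placeholder: the names, the counts and the decade lists are invented to show the mechanics, and real use would substitute the word counts from the leak and the decade lists from the Social Security Administration.

```python
from collections import Counter


def decade_weights(name_counts, popular_by_decade):
    """Sum device-name occurrences of each decade's popular names."""
    weights = Counter()
    for decade, names in popular_by_decade.items():
        for name in names:
            weights[decade] += name_counts.get(name, 0)
    return weights


# Placeholder data, not real counts or real rankings.
name_counts = {"NameA": 50, "NameB": 30, "NameC": 5}
popular_by_decade = {
    "1970s": {"NameA", "NameB"},
    "1990s": {"NameC", "NameD"},
}
weights = decade_weights(name_counts, popular_by_decade)
```

A name that was popular in several decades counts toward each of them, so the totals show which decades dominate rather than giving a clean partition.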

I’m going to do a year-by-year breakdown and some proper statistics later on, but we’re looking at an application that’s probably used by: English speaking males with an Anglo-American background in their 30s or 40s. It’s most used on the iPad, and although it also works on the iPhone, it’s used far less on that platform.

Thanks to Josh Hendrix, and again to Gemma Hobson and Julie Steele, for ideas and pointers to sources for this part of the analysis.


August 30 2012

A marriage of data and caregivers gives Dr. Atul Gawande hope for health care

Dr. Atul Gawande (@Atul_Gawande) has been a bard in the health care world, straddling medicine, academia and the humanities as a practicing surgeon, medical school professor, best-selling author and staff writer at the New Yorker magazine. His long-form narratives and books have helped illuminate complex systems and wicked problems to a broad audience.

One recent feature that continues to resonate for those who wish to apply data to the public good is Gawande’s New Yorker piece “The Hot Spotters,” where Gawande considered whether health data could help lower medical costs by giving the neediest patients better care. That story brings home the challenges of providing health care in a city, from cultural change to gathering data to applying it.

This summer, after meeting Gawande at the 2012 Health DataPalooza, I interviewed him about hot spotting, predictive analytics, networked transparency, health data, feedback loops and the problems that technology won’t solve. Our interview, lightly edited for content and clarity, follows.

Given what you’ve learned in Camden, N.J. — the backdrop for your piece on hot spotting — do you feel hot spotting is an effective way for cities and people involved in public health to proceed?

Gawande: The short answer, I think, is “yes.”

Here we have this major problem of both cost and quality — and we have signs that some of the best places that seem to do the best jobs can be among the least expensive. How you become one of those places is a kind of mystery.

It really parallels what happened in the police world. Here is something that we thought was an impossible problem: crime. Who could possibly lower crime? One of the ways we got a handle on it was by directing policing to the places where there was the most crime. It sounds kind of obvious, but it was not apparent that crime is concentrated and that medical costs are concentrated.

The second thing I knew but hadn’t put two and two together about is that the sickest people get the worst care in the system. People with complex illness just don’t fit into 20-minute office visits.

The work in Camden was emblematic of work happening in pockets all around the country where you prioritize. As soon as you look at the system, you see hundreds, thousands of things that don’t work properly in medicine. But when you prioritize by saying, “For the sickest people — the 5% who account for half of the spending — let’s look at what their $100,000 moments are,” you then understand it’s strengthening primary care and it’s the ability to manage chronic illness.

It’s looking at a few acute high-cost, high-failure areas of care, such as how heart attacks and congestive heart failure are managed in the system; looking at how renal disease patients are cared for; or looking at a few things in the commercial population, like back pain, being a huge source of expense. And then also end-of-life care.

With a few projects, it became more apparent to me that you genuinely could transform the system. You could begin to move people from depending on the most expensive places where they get the least care to places where you actually are helping people achieve goals of care in the most humane and least wasteful ways possible.

The data analytics office in New York City is doing fascinating predictive analytics. That approach could have transformative applications in health care, but it’s notable how careful city officials have been about publishing certain aspects of the data. How do you think about the relative risks and rewards here, including balancing social good with the need to protect people’s personal health data?

Gawande: Privacy concerns can sometimes be a barrier, but I haven’t seen it be the major barrier here. There are privacy concerns in the data about households as well as in the police data.

The reason it works well for the police is not just because you have a bunch of data geeks who are poking at the data and finding interesting things. It’s because they’re paired with people who are responsible for responding to crime, and above all, reducing crime. The commanders who have the responsibility have a relationship with the people who have the data. They’re looking at their population saying, “What are we doing to make the system better?”

That’s what’s been missing in health care. We have not married the people who have the data with people who feel responsible for achieving better results at lower costs. When you put those people together, they’re usually within a system, and within a system, there is no privacy barrier to being able to look and say, “Here’s what we can be doing in this health system,” because it’s often that particular.

The beautiful aspect of the work in New York is that it’s not at a terribly abstract level. Yes, they’re abstracting the data, but they’re also helping the police understand: “It’s this block that’s the problem. It’s shifted in the last month into this new sector. The pattern of the crime is that it looks more like we have a problem with domestic violence. Here are a few more patterns that might give you a clue about what you can go in and do.” There’s this give and take about what can be produced and achieved.

That, to me, is the gold in the health care world — the ability to peer in and say: “Here are your most expensive patients and your sickest patients. You didn’t know it, but here, there’s an alcohol and drug addiction issue. These folks are having car accidents and major trauma and turning up in the emergency rooms and then being admitted with $12,000 injuries.”

That’s a system that could be improved and, lo and behold, there’s an intervention here that’s worked before to slot these folks into treatment programs, which by and large, we don’t do at all.

That sense of using the data to help you solve problems requires two things. It requires data geeks and it requires the people in a system who feel responsible, the way that Bill Bratton made commanders feel responsible in the New York police system for the rate of crime. We haven’t had physicians who felt that they were responsible for 10,000 ICU patients and how well they do on everything from the cost to how long they spend in the ICU.

Health data is creating opportunities for more transparency into outcomes, treatments and performance. As a practicing physician, do you welcome the additional scrutiny that such collective intelligence provides, or does it concern you?

Gawande: I think that transparency of our data is crucial. I’m not sure that I’m with the majority of my colleagues on this. The concerns are that the data can be inaccurate, that you can overestimate or underestimate the sickness of the people coming in to see you, and that my patients aren’t like your patients.

That said, I have no idea who gets better results at the kinds of operations I do and who doesn’t. I do know who has high reputations and who has low reputations, but it doesn’t necessarily correspond to the kinds of results they get. As long as we are not willing to open up data to let people see what the results are, we will never actually learn.

The experience of what happens in fields where the data is open is that it’s the practitioners themselves that use it. I’ll give a couple of examples. Mortality for childbirth in hospitals has been available for a century. It’s been public information, and the practitioners in that field have used that data to drive the death rates for infants and mothers down from the biggest killer in people’s lives for women of childbearing age and for newborns into a rarity.

Another field that has been able to do this is cystic fibrosis. They had data for 40 years on the performance of the centers around the country that take care of kids with cystic fibrosis. They shared the data privately. They did not tell centers how the other centers were doing. They just told you where you stood relative to everybody else and they didn’t make that information public. About four or five years ago, they began making that information public. It’s now available on the Internet. You can see the rating of every center in the country for cystic fibrosis.

Several of the centers had said, “We’re going to pull out because this isn’t fair.” Nobody ended up pulling out. They did not lose patients in hordes and go bankrupt unfairly. They were able to see from one another who was doing well and then go visit and learn from one another.

I can’t tell you how fundamental this is. There needs to be transparency about our costs and transparency about the kinds of results. It’s murky data. It’s full of lots of caveats. And yes, there will be the occasional journalist who will use it incorrectly. People will misinterpret the data. But the broad result, the net result of having it out there, is so much better for everybody involved that it far outweighs the value of closing it up.

U.S. officials are trying to apply health data to improve outcomes, reduce costs and stimulate economic activity. As you look at the successes and failures of these sorts of health data initiatives, what do you think is working and why?

Gawande: I get to watch from the sidelines, and I was lucky to participate in Datapalooza this year. I mostly see that it seems to be following a mode that’s worked in many other fields, which is that there’s a fundamental role for government to be able to make data available.

When you work in complex systems that involve multiple people who have to, in health care, deal with patients at different points in time, no one sees the net result. So, no one has any idea of what the actual experience is for patients. The open data initiative, I think, has innovative people grabbing the data and showing what you can do with it.

Connecting the data to the physical world is where the cool stuff starts to happen. What are the kinds of costs to run the system? How do I get people to the right place at the right time? I think we’re still in primitive days, but we’re only two or three years into starting to make something more than just data on bills available in the system. Even that wasn’t widely available — and it usually was old data and not very relevant to this moment in time.

My concern all along is that data needs to be meaningful to both the patient and the clinician. It needs to be able to connect the abstract world of data to the physical world of what really happens, which means it has to be timely data. A six-month turnaround on data is not great. Part of what has made Wal-Mart powerful, for example, is they took retail operations from checking their inventory once a month to checking it once a week and then once a day and then in real-time, knowing exactly what’s on the shelves and what’s not.

That equivalent is what we’ll have to arrive at if we’re to make our systems work. Timeliness, I think, is one of the under-recognized but fundamentally powerful aspects because we sometimes over prioritize the comprehensiveness of data and then it’s a year old, which doesn’t make it all that useful. Having data that tells you something that happened this week, that’s transformative.

Are you using an iPad at work?

Gawande: I do use the iPad here and there, but it’s not readily part of the way I can manage the clinic. I would have to put in a lot of effort for me to make it actually useful in my clinic.

For example, I need to be able to switch between radiology scans and past records. I predominantly see cancer patients, so they’ll have 40 pages of records that I need to have in front of me, from scans to lab tests to previous notes by other folks.

I haven’t found a better way than paper, honestly. I can flip between screens on my iPad, but it’s too slow and distracting, and it doesn’t let me talk to the patient. It’s fun if I can pull up a screen image of this or that and show it to the patient, but it just isn’t that integrated into practice.

What problems are immune to technological innovation? What will need to be changed by behavior?

Gawande: At some level, we’re trying to define what great care is. Great care means being able to provide optimally knowledgeable care in the right time and the right way for people and not wasting resources.

Some of it’s crucially aided by information technology that connects information to where it needs to be so that good decision-making happens, both by patients and by the clinicians who work with them.

If you’re going to be able to make health care work better, you’ve got to be able to make that system work better for people, more efficiently and less wastefully, less harmfully and with much better teamwork. I think that information technology is a tool in that, but fundamentally you’re talking about making teams that can go from being disconnected cowboys in care to pit crews that actually work together toward solving a problem.

In a football team or a pit crew, technology is really helpful, but it’s only a tiny part of what makes that team great. What makes the team great is that they know what they’re aiming to do, they’re very clear about their goals, and they are able to make sure they execute every basic thing that’s crucial for that success.

What do you worry about in this surge of interest in more data-driven approaches to medicine?

Gawande: I worry the most about a disconnect between the people who have to use the information and technology and tools, and the people who make them. We see this in the consumer world. Fundamentally, there is not a single [health] application that is remotely like my iPod, which is instantly usable. There are a gazillion number of ways in which information would make a huge amount of difference.

That sense of being able to understand the world of the user, the task that’s accomplished and the complexity of what they have to do, and connecting that to the people making the technology — there just aren’t that many lines of marriage. In many of the companies that have some of the dominant systems out there, I don’t see signs that that’s necessarily going to get any better.

If people gain access to better information about the consequences of various choices, will that lead to improved outcomes and quality of life?

Gawande: That’s where the art comes in. There are problems because you lack information, but when you have information like “you shouldn’t drink three cans of Coke a day — you’re going to put on weight,” then having that information is not sufficient for most people.

Understanding what is sufficient to be able to either change the care or change the behaviors that we’re concerned about is the crux of what we’re trying to figure out and discover.

When the information is presented in a really interesting way, people have gradually discovered — for example, having a little ball on your dashboard that tells you when you’re accelerating too fast and burning off extra fuel — how that begins to change the actual behavior of the person in the car.

No amount of presenting the information that you ought to be driving in a more environmentally friendly way ends up changing anything. It turns out that change requires the psychological nuance of presenting the information in a way that provokes the desire to actually do it.

We’re at the very beginning of understanding these things. There’s also the same sorts of issues with clinician behavior — not just information, but how you are able to foster clinicians to actually talk to one another and coordinate when five different people are involved in the care of a patient and they need to get on the same page.

That’s why I’m fascinated by the police work, because you have the data people, but they’re married to commanders who have responsibility and feel responsibility for looking out on their populations and saying, “What do we do to reduce the crime here? Here’s the kind of information that would really help me.” And the data people come back to them and say, “Why don’t you try this? I’ll bet this will help you.”

It’s that give and take that ends up being very powerful.

Strata Rx — Strata Rx, being held Oct. 16-17 in San Francisco, is the first conference to bring data science to the urgent issues confronting health care.

Save 20% on registration with the code RADAR20


August 29 2012

Why big data is big: the digital nervous system

Where does all the data in “big data” come from? And why isn’t big data just a concern for companies such as Facebook and Google? The answer is that the web companies are the forerunners. Driven by social, mobile, and cloud technology, there is an important transition taking place, leading us all to the data-enabled world that those companies inhabit today.

From exoskeleton to nervous system

Until a few years ago, the main function of computer systems in society, and business in particular, was as a digital support system. Applications digitized existing real-world processes, such as word-processing, payroll and inventory. These systems had interfaces back out to the real world through stores, people, telephone, shipping and so on. The now-quaint phrase “paperless office” alludes to this transfer of pre-existing paper processes into the computer. These computer systems formed a digital exoskeleton, supporting a business in the real world.

The arrival of the Internet and web has added a new dimension, bringing in an era of entirely digital business. Customer interaction, payments and often product delivery can exist entirely within computer systems. Data doesn’t just stay inside the exoskeleton any more, but is a key element in the operation. We’re in an era where business and society are acquiring a digital nervous system.

As my sketch below shows, an organization with a digital nervous system is characterized by a large number of inflows and outflows of data, a high level of networking, both internally and externally, increased data flow, and consequent complexity.

This transition is why big data is important. Techniques developed to deal with interlinked, heterogeneous data acquired by massive web companies will be our main tools as the rest of us transition to digital-native operation. We see early examples of this, from catching fraud in financial transactions to debugging and improving the hiring process in HR. And almost everybody already pays attention to the massive flow of social network information concerning them.

From digital exoskeleton to nervous system.

Charting the transition

As technology has progressed within business, each step taken has resulted in a leap in data volume. People looking at big data now can reasonably ask why, when their business isn’t Google or Facebook, big data applies to them.

The answer lies in the ability of web businesses to conduct 100% of their activities online. Their digital nervous system easily stretches from the beginning to the end of their operations. If you have factories, shops and other parts of the real world within your business, you’ve further to go in incorporating them into the digital nervous system.

But “further to go” doesn’t mean it won’t happen. The drive of the web, social media, mobile, and the cloud is bringing more of each business into a data-driven world. In the UK, the Government Digital Service is unifying the delivery of services to citizens. The results are a radical improvement of citizen experience, and for the first time many departments are able to get a real picture of how they’re doing. For any retailer, companies such as Square, American Express and Foursquare are bringing payments into a social, responsive data ecosystem, liberating that information from the silos of corporate accounting.

What does it mean to have a digital nervous system? The key trait is to make an organization’s feedback loop entirely digital. That is, a direct connection from sensing and monitoring inputs through to product outputs. That’s straightforward on the web. It’s getting easier in retail. Perhaps the biggest shifts in our world will come as sensors and robotics bring the advantages web companies have now to domains such as industry, transport, and the military.

The reach of the digital nervous system has grown steadily over the past 30 years, and each step brings gains in agility and flexibility, along with an order of magnitude more data. First, from specific application programs to general business use with the PC. Then, direct interaction over the web. Mobile adds awareness of time and place, along with instant notification. The next step, to cloud, breaks down data silos and adds elastic storage and compute. Now, we’re integrating smart agents, able to act on our behalf, and connections to the real world through sensors and automation.

Coming, ready or not

If you’re not contemplating the advantages of taking more of your operation digital, you can bet your competitors are. As Marc Andreessen wrote last year, “software is eating the world.” Everything is becoming programmable.

It’s this growth of the digital nervous system that makes the techniques and tools of big data relevant to us today. The challenges of massive data flows, and the erosion of hierarchy and boundaries, will lead us to the statistical approaches, systems thinking and machine learning we need to cope with the future we’re inventing.

Strata Conference + Hadoop World — The O’Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World.

Save 20% on registration with the code RADAR20

Follow up on big data and civil rights

A few weeks ago, I wrote a post about big data and civil rights, which seems to have hit a nerve. It was posted on Solve for Interesting and here on Radar, and then folks like Boing Boing picked it up.

I haven’t had this kind of response to a post before (well, I’ve had responses, such as the comments to this piece for GigaOm five years ago, but they haven’t been nearly as thoughtful).

Some of the best posts have really added to the conversation. Here’s a list of those I suggest for further reading and discussion:

Nobody notices offers they don’t get

On Oxford’s Practical Ethics blog, Anders Sandberg argues that transparency and reciprocal knowledge about how data is being used will be essential. Anders captured the core of my concerns in a single paragraph, saying what I wanted to far better than I could:

… nobody notices offers they do not get. And if these absent opportunities start following certain social patterns (for example not offering them to certain races, genders or sexual preferences) they can have a deep civil rights effect

To me, this is a key issue, and it responds eloquently to some of the comments on the original post. Harry Chamberlain commented:

However, what would you say to the criticism that you are seeing lions in the darkness? In other words, the risk of abuse certainly exists, but until we see a clear case of big data enabling and fueling discrimination, how do we know there is a real threat worth fighting?

I think that this is precisely the point: you can’t see the lions in the darkness, because you’re not aware of the ways in which you’re being disadvantaged. If whites get an offer of 20% off, but minorities don’t, that’s basically a 20% price hike on minorities — but it’s just marketing, so apparently it’s okay.
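To make the arithmetic behind that claim concrete, here is a minimal sketch. The list price and the helper name `effective_markup` are hypothetical, chosen only for illustration. It shows that a 20% discount offered to one group means everyone else pays roughly 25% more than the discounted price for the same item.

```python
def effective_markup(discount: float) -> float:
    """Extra fraction paid at list price, relative to the discounted price."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount) - 1


list_price = 100.0  # hypothetical list price
discount = 0.20     # the 20% offer from the example above

discounted_price = list_price * (1 - discount)
markup = effective_markup(discount)

print(f"Price with the offer:    {discounted_price:.2f}")
print(f"Price without the offer: {list_price:.2f}")
print(f"Markup for those never shown the offer: {markup:.0%}")
```

Measured against the list price the gap is 20%, and measured against what the favored group actually pays it is about 25%. Either way, nobody on the losing side ever sees the comparison.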

Context is everything

Mary Ludloff of Patternbuilders asks, “When does someone else’s problem become ours?” Mary is a presenter at Strata, and an expert on digital privacy. She has a very pragmatic take on things. One point Mary makes is that all this analysis is about prediction — we’re taking a ton of data and making a prediction about you:

The issue with data, particularly personal data, is this: context is everything. And if you are not able to personally question me, you are guessing the context.

If we (mistakenly) predict something, and act on it, we may have wronged someone. Mary makes clear that this is thoughtcrime — arresting someone because their behavior looked like that of a terrorist, or pedophile, or thief. Firing someone because their email patterns suggested they weren’t going to make their sales quota. That’s the injustice.

This is actually about negative rights, which Wikipedia describes as:

Rights considered negative rights may include civil and political rights such as freedom of speech, private property, freedom from violent crime, freedom of worship, habeas corpus, a fair trial, freedom from slavery.

Most philosophers agree that negative rights outweigh positive ones (e.g., I have a right to fresh air more than you have a right to smoke around me). So our negative right (to be left unaffected by your predictions) outweighs your positive one. As analytics comes closer and closer to predicting actual behavior, we need to remember the lesson of negative rights.

Big data is the new printing press

Lori Witzel compares the advent of big data to the creation of the printing press, pointing out — somewhat optimistically — that once books were plentiful, it was hard to control the spread of information. She has a good point — we’re looking at things from this side of the big data singularity:

And as the cost of Big Data and Big Data Analytics drops, I predict we’ll see a similar dispersion of technology, and similar destabilizations to societies where these technologies are deployed.

There’s a chance that we’ll democratize access to information so much that it’ll be the corporations, not the consumers, that are forced to change.

While you slept last night

TIBCO’s Chris Taylor, standing in for Kashmir Hill at Forbes, paints a dystopian picture of video-as-data, and just how much tracking we’ll face in the future:

This makes laughable the idea of an implanted chip as the way to monitor a population. We’ve implanted that chip in our phones, and in video, and in nearly every way we interact with the world. Even paranoids are right sometimes.

I had a wide-ranging chat with Chris last week. We’re sure to spend more time on this in the future.

The veil of ignorance

The idea for the original post came from a conversation I had with some civil rights activists in Atlanta a few months ago, who hadn’t thought about the subject. They (or their parents) walked with Martin Luther King, Jr. But to them big data was “just tech.” That bothered me, because unless we think of these issues in the context of society and philosophy, bad things will happen to good people.

Perhaps the best tool for thinking about these ethical issues is the Veil of Ignorance. It’s a philosophical exercise for deciding social issues that goes like this:

  1. Imagine you don’t know where you will be in the society you’re creating. You could be a criminal, a monarch, a merchant, a pauper, an invalid.
  2. Now design the best society you can.

Simple, right? When we’re looking at legislation for big data, this is a good place to start. We should set privacy, transparency, and use policies without knowing whether we’re ruling or oppressed, straight or gay, rich or poor.

This post originally appeared on Solve for Interesting. This version has been lightly edited. Photo: crystal ball ii by mararie, on Flickr



August 21 2012

Three kinds of big data

Photo of the columns of Castor and Pollux by OliverN5 on Flickr

In the past couple of years, marketers and pundits have spent a lot of time labeling everything “big data.” The reasoning goes something like this:

  • Everything is on the Internet.
  • The Internet has a lot of data.
  • Therefore, everything is big data.

When you have a hammer, everything looks like a nail. When you have a Hadoop deployment, everything looks like big data. And if you’re trying to cloak your company in the mantle of a burgeoning industry, big data will do just fine. But seeing big data everywhere is a sure way to hasten the inevitable fall from the peak of high expectations to the trough of disillusionment.

We saw this with cloud computing. From early idealists saying everything would live in a magical, limitless, free data center to today’s pragmatism about virtualization and infrastructure, we soon took off our rose-colored glasses and put on welding goggles so we could actually build stuff.

So where will big data go to grow up?

Once we get over ourselves and start rolling up our sleeves, I think big data will fall into three major buckets: Enterprise BI, Civil Engineering, and Customer Relationship Optimization. This is where we’ll see most IT spending, most government oversight, and most early adoption in the next few years.

Enterprise BI 2.0

For decades, analysts have relied on business intelligence (BI) products like Hyperion, MicroStrategy and Cognos to crunch large amounts of information and generate reports. Data warehouses and BI tools are great at answering the same question — such as “what were Mary’s sales this quarter?” — over and over again. But they’ve been less good at the exploratory, what-if, unpredictable questions that matter for planning and decision making, because that kind of fast exploration of unstructured data is traditionally hard to do and therefore expensive.

Most “legacy” BI tools are constrained in two ways:

  • First, they’ve been schema-then-capture tools in which the analyst decides what to collect, then later captures that data for analysis.
  • Second, they’ve typically focused on reporting what Avinash Kaushik (channeling Donald Rumsfeld) refers to as “known unknowns” — things we know we don’t know, and generate reports for.

These tools are used for reporting and operational purposes, usually focused on controlling costs, executing against an existing plan, and reporting on how things are going.

As my Strata co-chair Edd Dumbill pointed out when I asked for thoughts on this piece:

“The predominant functional application of big data technologies today is in ETL (Extract, Transform, and Load). I’ve heard the figure that it’s about 80% of Hadoop applications. Just the real grunt work of log file or sensor processing before loading into an analytic database like Vertica.”
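As a rough illustration of the kind of grunt work Edd describes, here is a minimal ETL sketch. It is not Hadoop and not Vertica: the log format is hypothetical, and Python’s built-in `sqlite3` module stands in for the analytic database purely so the example runs anywhere.

```python
import re
import sqlite3

# Hypothetical log format, for illustration only: "<ISO timestamp> <status> <path>"
LOG_LINE = re.compile(r"^(\S+) (\d{3}) (\S+)$")


def extract(lines):
    """Yield (timestamp, status, path) tuples, skipping malformed lines."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            yield match.groups()


def transform(rows):
    """Keep only the date part of the timestamp and convert status to an int."""
    for timestamp, status, path in rows:
        yield timestamp[:10], int(status), path


def load(rows, conn):
    """Insert the cleaned rows into a table (sqlite3 stands in for the warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (day TEXT, status INTEGER, path TEXT)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    conn.commit()


raw_log = [
    "2012-08-21T10:00:00 200 /index.html",
    "2012-08-21T10:00:01 404 /missing",
    "garbage line",
    "2012-08-22T09:30:00 200 /index.html",
]

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_log)), conn)

total_hits = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
error_hits = conn.execute(
    "SELECT COUNT(*) FROM hits WHERE status >= 400"
).fetchone()[0]
print(total_hits, error_hits)
```

In a real deployment each stage would run in parallel across many machines, but the shape is the same: parse, clean, load, and only then query.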

The availability of cheap, fast computers and storage, as well as open source tools, has made it okay to capture first and ask questions later. That changes how we use data because it makes it okay to speculate beyond the initial question that triggered the collection of data.
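Here is a minimal sketch of what “capture first, ask questions later” can look like in practice. The event fields and the `ask` helper are hypothetical, made up for illustration: events are stored as raw JSON with no schema agreed in advance, and structure is imposed only when a question comes along.

```python
import json
from collections import Counter

# Capture step: keep every event as raw JSON text, with no schema decided up front.
# Note that the records do not share the same fields.
raw_events = [
    '{"user": "mary", "action": "sale", "amount": 120, "region": "east"}',
    '{"user": "raj", "action": "view", "page": "/pricing"}',
    '{"user": "mary", "action": "sale", "amount": 80, "region": "west", "coupon": "X1"}',
]


def ask(events, predicate, key):
    """Count the values of `key` among parsed events that satisfy `predicate`."""
    counts = Counter()
    for line in events:
        event = json.loads(line)
        if predicate(event) and key in event:
            counts[event[key]] += 1
    return counts


# A question nobody planned for when the data was captured: where do sales happen?
sales_by_region = ask(raw_events, lambda e: e.get("action") == "sale", "region")
print(dict(sales_by_region))
```

The trade-off is that every question pays the parsing cost again, which is exactly why cheap computing and storage matter here.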

What’s more, the speed with which we can get results — sometimes as fast as a human can ask questions — makes data easier to explore interactively. This combination of interactivity and speculation takes BI into the realm of “unknown unknowns,” the insights that can produce a competitive advantage or an out-of-the-box differentiator.

We saw this shift in cloud computing: first, big public clouds wooed green-field startups. Then, in a few years, incumbent IT vendors introduced their private cloud offerings. Private clouds included only a fraction of the benefits of public clouds, but were nevertheless a sufficient blend of smoke, mirrors, and features to delay the inevitable move to public resources by a few years and appease the business. For better or worse, that’s where most IT budgets are being spent today, according to IDC, Gartner, and others.

In the next few years, then, look for acquisitions and product introductions — and not a little vaporware — as BI vendors that enterprises trust bring them “big data lite”: enough to satisfy their CEO’s golf buddies, but not so much that their jobs are threatened. This, after all, is how change comes to big organizations.

Ultimately, we’ll see traditional “known unknowns” BI reporting living alongside big-data-powered data import and cleanup, and fast, exploratory “unknown unknowns” interactivity.

Civil Engineering

The second use of big data is in society and government. Already, data mining can be used to predict disease outbreaks, understand traffic patterns, and improve education.

Cities are facing budget crunches, infrastructure problems, and crowding as rural citizens move in. Solving these problems is urgent, and cities are perfect labs for big data initiatives. Take a metropolis like New York: hackathons; open feeds of public data; and a population that generates a flood of information as it shops, commutes, gets sick, eats, and just goes about its daily life.

Datagotham is just one example of a city's efforts to hack itself

I think municipal data is one of the big three for several reasons: it’s a good tie breaker for partisanship, we have new interfaces everyone can understand, and we finally have a mostly-connected citizenry.

In an era of partisan bickering, hard numbers can settle the debate. So, they’re not just good government; they’re good politics. Expect to see big data applied to social issues, helping us to make funding more effective and scarce government resources more efficient (perhaps to the chagrin of some public servants and lobbyists). As this works in the world’s biggest cities, it’ll spread to smaller ones, to states, and to municipalities.

Making data accessible to citizens is possible, too: Siri and Google Now show the potential for personalized agents; Narrative Science takes complex data and turns it into words the masses can consume easily; Watson and Wolfram Alpha can give smart answers, either through curated reasoning or making smart guesses.

For the first time, we have a connected citizenry armed (for the most part) with smartphones. Nielsen estimated that smartphones would overtake feature phones in 2011, and that concentration is high in urban cores. The App Store is full of apps for bus schedules, commuters, local events, and other tools that can quickly become how governments connect with their citizens and manage their bureaucracies.

The consequence of all this, of course, is more data. Once governments go digital, their interactions with citizens can be easily instrumented and analyzed for waste or efficiency. That’s sure to provoke resistance from those who don’t like the scrutiny or accountability, but it’s a side effect of digitization: every industry that goes digital gets analyzed and optimized, whether it likes it or not.

Customer Relationship Optimization

The final home of applied big data is marketing. More specifically, it’s improving the relationship with consumers so companies can, as Sergio Zyman once said, sell them more stuff, more often, for more money, more efficiently.

The biggest data systems today are focused on web analytics, ad optimization, and the like. Many of today’s most popular architectures were weaned on ads and marketing, and have their ancestry in direct marketing plans. They’re just more focused than the comparatively blunt instruments with which direct marketers used to work.

Tamis a Lait by El Bibliomata on Flickr

The number of contact points in a company has multiplied significantly. Where once there was a phone number and a mailing address, today there are web pages, social media accounts, and more. Tracking users across all these channels, and turning every click, like, share, friend, or retweet into the start of a long funnel that leads, inexorably, to revenue, is a big challenge. It’s also one that companies like Salesforce understand, as its investments in chat, social media monitoring, co-browsing, and more show.

This is what’s lately been referred to as the “360-degree customer view” (though it’s not clear that companies will actually act on customer data if they have it, or whether doing so will become a compliance minefield). Big data is already intricately linked to online marketing, but it will branch out in two ways.

First, it’ll go from online to offline. Near-field-equipped smartphones with ambient check-in are a marketer’s wet dream, and they’re coming to pockets everywhere. It’ll be possible to track queue lengths, store traffic, and more, giving retailers fresh insights into their brick-and-mortar sales. Ultimately, companies will bring the optimization that online retail has enjoyed to an offline world as consumers become trackable.

Second, it’ll go from Wall Street (or maybe that’s Madison Avenue and Middlefield Road) to Main Street. Tools will get easier to use, and while small businesses might not have a BI platform, they’ll have a tablet or a smartphone that they can bring to their places of business. Mobile payment players like Square are already making them reconsider the checkout process. Adding portable customer intelligence to the tool suite of local companies will broaden how we use marketing tools.

Headlong into the trough

That’s my bet for the next three years, given the molasses of market confusion, vendor promises, and unrealistic expectations we’re about to contend with. Will big data change the world? Absolutely. Will it be able to defy the usual cycle of earnest adoption, crushing disappointment, and eventual rebirth all technologies must travel? Certainly not.


Hawaii and health care: A small state takes a giant step forward

In an era characterized by political polarization and legislative stalemate, the tiny state of Hawaii has just demonstrated extraordinary leadership. The rest of the country should now recognize, applaud, and most of all, learn from Hawaii’s accomplishment.

Hawaii enacted a new law that harmonizes its state medical privacy laws with HIPAA, the federal medical privacy law. Hawaii’s legislators and governor, along with an impressive array of patient groups, health care providers, insurance companies, and health information technologists, agreed that having dozens of unique Hawaii medical privacy laws in addition to HIPAA was confusing, expensive, and bad for patients. HB 1957 thus eliminates the need for entities covered by HIPAA to also comply with Hawaii’s complex array of medical privacy laws.

How did this thicket of state medical privacy laws arise?

Hawaii’s knotty web of state medical privacy laws is not unique. There are vast numbers of state health privacy laws across the country — certainly many hundreds, likely thousands. Hawaii alone has more than 50. Most were enacted before HIPAA, which helps explain why there are so many; when no federal guarantee of health privacy existed, states took action to protect their constituents from improper invasions of their medical privacy. These laws grew helter-skelter over decades. For example, particularly restrictive laws were enacted after inappropriate and traumatizing disclosures of HIV status during the 1980s.

These laws were often rooted in a naïve faith that patient consent, rather than underlying structural protection, is the be-all and end-all of patient protection. Consent requirements thus became more detailed and demanding. Countless laws, sometimes buried in obscure areas of state law, created unique consent requirements over mental health, genetic information, reproductive health, infectious disease, adolescent, and disability records.

When the federal government created HIPAA, a comprehensive and complex medical privacy law, the powers in Washington realized that preempting this thicket of state laws would be a political impossibility. As every HIPAA 101 class teaches, HIPAA thus became “a floor, not a ceiling.” All state laws stricter than HIPAA continue to exist in full force.

So what’s so bad about having lots of state health privacy laws?

The harmful consequences of the state medical privacy law thicket coexisting with HIPAA include:

  • Adverse patient impact — First and foremost, the privacy law thicket is terrible for individual patients. The days when we saw only doctors in one state are long gone. We travel, we move, we get sick in different states, we choose caregivers in different states. We need our health information to be rapidly available to us and our providers wherever we are, but these state consent laws make it tough for providers to share records. Even providing us with our own medical records — which is mandated by HIPAA — is impeded by perceptions that state-specific, or even institution-specific, consent forms must be used instead of national HIPAA-compliant forms.
  • Harmful to those intended to be protected — Paradoxically, laws intended to protect particular groups of patients, like those with HIV or mental health conditions, now undermine their clinical care. Providers sending records containing sensitive content are wary of letting complete records move, yet may be unable to mask the regulated data. When records are incomplete, delayed, or simply unavailable, providers can make wrong decisions and patients can get hurt.
  • Antiquated and legalistic consent forms and systems — Most providers feel obliged to honor a patient’s request to move medical records only in the form of a “wet signature” on a piece of paper. Most then insist that the piece of paper be moved only in person or by 1980s-era fax machines, despite the inconvenience to patients who don’t have a fax machine at hand. HIPAA allows the disclosure of health information for treatment, payment, and health care operations (all precisely defined terms), but because so many state laws require consent for particular situations, it is easier (and way more CYA) for institutions to err on the side of strict consent forms for all disclosures, even when permitted by HIPAA.
  • Obstacles to technological innovation and telemedicine — Digital systems to move information need simplicity — either, yes, the data can move, or no, it cannot. Trying to build systems when a myriad of complex, and essentially unknowable, laws govern whether data can move, who must consent, on what form, for what duration, or what data subsets must be expurgated, becomes a nightmare. No doubt, many health innovators today are operating in blissful ignorance of the state health privacy law thicket, but ignorance of these laws does not protect against enforcement or class action lawsuits.
  • Economic waste — The state legal thicket hurts all of us as taxpayers. Redundant tests and procedures are often ordered when medical records cannot be produced in time. Measuring the comparative effectiveness of alternative treatments and the performance of hospitals, providers, and insurers is crucial to improving quality and reducing costs, but state laws can restrict such uses. The 2009 stimulus law provided billions of dollars for health information technology and information exchange, but some of our return on that national investment is lost when onerous state-specific consent requirements must be baked into electronic health record (EHR) and health information exchange (HIE) design.

What can we learn from Hawaii?

Other states should follow Hawaii’s lead by having the boldness and foresight to wipe their own medical privacy laws off the books in favor of a simpler and more efficient national solution that protects privacy and facilitates clinical care. Our national legal framework is HIPAA, plus HITECH, a 2009 law that made HIPAA stricter, plus other new federal initiatives intended to create a secure, private, and reliable infrastructure for moving health information. While that federal framework isn’t perfect, that’s where we should be putting our efforts to protect, exchange, and make appropriate use of health information. Hawaii’s approach of reducing the additional burden of the complex state law layer just makes sense.

Some modest progress has occurred already. A few states are harmonizing their laws affecting health information exchanges (e.g., Kansas and Utah). Some states exempt HIPAA-regulated entities subject to new HITECH breach requirements from also having to comply with the state breach laws (e.g., Michigan and Indiana). These breach measures are helpful in a crisis, to be sure, by saving money on wasteful legal research, but irrelevant from the standpoint of providing care for patients or designing technology solutions or system improvements. California currently has a medical law harmonization initiative underway, which I hope is broadly supported in order to reduce waste and improve care.

To be blunt, we need much more dramatic progress in this area. In the case of health information exchange, states are not useful “laboratories of democracy”; they are towers of Babel that disserve patients. The challenges of providing clinical care, let alone making dramatic improvements while lowering costs, in the context of this convoluted mess of state laws, are severe. Patients, disease advocacy groups, doctors, nurses, hospitals, and technology innovators should let their state legislators know that harmonizing medical privacy laws would be a huge win for all involved.

Photo: Knots by Uncle Catherine, on Flickr

August 17 2012

Wall Street’s robots are not out to get you

Technology is critical to today’s financial markets. It’s also surprisingly controversial. In most industries, increasing technological involvement is progress, not a problem. And yet, people who believe that computers should drive cars suddenly become Luddites when they talk about computers in trading.

There’s widespread public sentiment that technology in finance just screws the “little guy.” Some of that sentiment is due to concern about a few extremely high-profile errors. A lot of it is rooted in generalized mistrust of the entire financial industry. Part of the problem is that media coverage on the issue is depressingly simplistic. Hyperbolic articles about the “rogue robots of Wall Street” insinuate that high-frequency trading (HFT) is evil without saying much else. Very few of those articles explain that HFT is a catchall term that describes a host of different strategies, some of which are extremely beneficial to the public market.

I spent about six years as a trader, using automated systems to make markets and execute arbitrage strategies. From 2004-2011, as our algorithms and technology became more sophisticated, it was increasingly rare for a trader to have to enter a manual order. Even in 2004, “manual” meant instructing an assistant to type the order into a terminal; it was still routed to the exchange by a computer. Automating orders reduced the frequency of human “fat finger” errors. It meant that we could adjust our bids and offers in a stock immediately if the broader market moved, which enabled us to post tighter markets. It allowed us to manage risk more efficiently. More subtly, algorithms also reduced the impact of human biases — especially useful when liquidating a position that had turned out badly. Technology made trading firms like us more profitable, but it also benefited the people on the other sides of those trades. They got tighter spreads and deeper liquidity.

Many HFT strategies have been around for decades. A common one is exchange arbitrage, which Time magazine recently described in an article entitled “High Frequency Trading: Wall Street’s Doomsday Machine?”:

A high-frequency trader might try to take advantage of minuscule differences in prices between securities offered on different exchanges: ABC stock could be offered for one price in New York and for a slightly higher price in London. With a high-powered computer and an ‘algorithm,’ a trader could buy the cheap stock and sell the expensive one almost simultaneously, making an almost risk-free profit for himself.

It’s a little bit more difficult than that paragraph makes it sound, but the premise is true — computers are great for trades like that. As technology improved, exchange arb went from being largely manual to being run almost entirely via computer, and the market in the same stock across exchanges became substantially more efficient. (And as a result of competition, the strategy is now substantially less profitable for the firms that run it.)
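
To make the arithmetic behind that trade concrete, here is a minimal Python sketch. The prices, share count, and per-share fee are made-up illustrations rather than market data, the function name is hypothetical, and both venues are assumed to quote in the same currency. Real exchange arbitrage also has to deal with latency, partial fills, and currency conversion, none of which is modeled here.

```python
from decimal import Decimal


def arbitrage_profit(buy_price, sell_price, shares, fee_per_share=Decimal("0")):
    """Profit from buying on the cheaper venue and selling on the dearer one."""
    gross = (sell_price - buy_price) * shares
    fees = fee_per_share * shares * 2  # one fee for each leg of the trade
    return gross - fees


# Hypothetical quotes for the same stock on two venues.
profit = arbitrage_profit(
    buy_price=Decimal("10.00"),
    sell_price=Decimal("10.02"),
    shares=1000,
    fee_per_share=Decimal("0.005"),
)
print(profit)
```

With these invented numbers, a two-cent price gap leaves ten dollars after fees; a per-share fee of two cents would turn the same trade into a loss.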

Market making — posting both a bid and an offer in a security and profiting from the bid-ask spread — is presumably what Knight Capital was doing when it experienced “technical difficulties.” The strategy dates from the time when exchanges were organized around physical trading pits. Those were the bad old days, when there was little transparency and automation, and specialists and brokers could make money ripping off clients who didn’t have access to technology. Market makers act as liquidity providers, and they are an important part of a well-functioning market. Automated trading enables them to manage their orders efficiently and quickly, and helps to reduce risk.

So how do those high-profile screw-ups happen? They begin with human error (or, at least, poor judgment). Computerized trading systems can amplify these errors; it would be difficult for a person sending manual orders to simultaneously botch their markets in 148 different companies, as Knight did. But it’s nonsense to make the leap from one brokerage experiencing severe technical difficulties to claiming that automated market-making creates some sort of systemic risk. The way the market handled the Knight fiasco is how markets are supposed to function — stupidly priced orders came in, the market absorbed them, the U.S. Securities and Exchange Commission (SEC) and the exchanges adhered to their rules regarding which trades could be busted (ultimately letting most of the trades stand and resulting in a $440 million loss for Knight).

There are some aspects of HFT that are cause for concern. Certain strategies have exacerbated unfortunate feedback loops. The Flash Crash illustrated that an increase in volume doesn’t necessarily mean an increase in real liquidity. Nanex recently put together a graph (or a “horrifying GIF”) showing the sharply increasing number of quotes transmitted via automated systems across various exchanges. What it shows isn’t actual trades, but it does call attention to a problem called “quote spam.” Algorithms that employ this strategy generate a large number of buy and sell orders that are placed in the market and then are canceled almost instantly. They aren’t real liquidity: the machine placing them has no intention of getting a fill; it’s flooding the market with orders that competitor systems have to process. This activity leads to an increase in short-term volatility and higher trading costs.
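
One rough way to put a number on this behavior is to measure what share of orders are canceled, unfilled, within a few milliseconds of being placed. The sketch below is an illustration only: the `Order` record, the 50-millisecond threshold, and the sample orders are all assumptions, not how Nanex or any exchange measures quote traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    placed_ms: int                      # time the order entered the market
    cancelled_ms: Optional[int] = None  # None if it was never cancelled
    filled: bool = False


def fleeting_fraction(orders, max_lifetime_ms=50):
    """Share of orders cancelled unfilled within max_lifetime_ms of placement."""
    if not orders:
        return 0.0
    fleeting = sum(
        1
        for o in orders
        if not o.filled
        and o.cancelled_ms is not None
        and o.cancelled_ms - o.placed_ms <= max_lifetime_ms
    )
    return fleeting / len(orders)


# Hypothetical order log: two near-instant cancels, one slow cancel, one fill.
orders = [Order(0, 5), Order(10, 12), Order(20, 2000), Order(30, filled=True)]
ratio = fleeting_fraction(orders)
print(ratio)
```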

The New York Times just ran an interesting article on HFT that included data on the average cost of trading one share of stock. From 2000 to 2010, it dropped from $.076 to $.035. Then it appears to have leveled off, and even increased slightly, to $.038 in 2012. If (as that data suggests) we’ve arrived at the point where the “market efficiency” benefit of HFT is outweighed by the risk of increased volatility or occasional instability, then regulators need to step in. The challenge is determining how to disincentivize destabilizing behavior without negatively impacting genuine liquidity providers. One possibility is to impose a financial transaction tax, possibly based on how long the order remains in the market or on the number of orders sent per second.
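
The size of those changes is easier to judge as percentages. This small sketch uses only the three per-share figures quoted above; the helper name is arbitrary.

```python
def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


cost_2000, cost_2010, cost_2012 = 0.076, 0.035, 0.038

drop = percent_change(cost_2000, cost_2010)  # change from 2000 to 2010
rise = percent_change(cost_2010, cost_2012)  # change from 2010 to 2012

print(f"2000 to 2010: {drop:.1f}%")
print(f"2010 to 2012: {rise:.1f}%")
```

In other words, the cost per share fell by a little more than half over the decade and has since crept back up by under a tenth.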

Rethinking regulation and market safeguards in light of new technology is absolutely appropriate. But the state of discourse in the mainstream press, which consists mostly of scare articles about “Wall Street’s terrifying robot invasion,” is unfortunate. Maligning computerized strategies because they are computerized is the wrong way to think about the future of our financial markets.

Photo: ABOVE by Lyfetime, on Flickr


August 15 2012

Mining the astronomical literature

There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?

NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-’90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It’s something they don’t have to think about all that much.

The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it’s all searchable and downloadable.

Number of papers published in the three main astronomy journals each year. CREDIT: Robert Simpson

The existence of the ADS, along with the arXiv pre-print server, has meant that most astronomers haven’t seen the inside of a brick-built library since the late 1990s.

It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if they followed astronomy’s lead. The fact that the discipline’s literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and sociological research.

For example, the .Astronomy series of conferences is a small workshop that brings together the best and the brightest of the technical community: researchers, developers, educators and communicators. Billed as “20% time for astronomers,” it gives these people space to think about how new technologies affect both how research is done and how it is communicated to their peers and to the public.

[Disclosure: I'm a member of the advisory board to the .Astronomy conference, and I previously served as a member of the programme organising committee for the conference series.]

It should perhaps come as little surprise that one of the more interesting projects to come out of a hack day held as part of this year’s .Astronomy meeting in Heidelberg was work by Robert Simpson, Karen Masters and Sarah Kendrew that focused on data mining the astronomical literature.

The team grabbed and processed the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS) since each of those journals started publication — and that’s 1827 in the case of MNRAS.

By the end of the day, they’d found some interesting results showing how various terms have trended over time. The results were similar to what’s found in Google Books’ Ngram Viewer.
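
A trend like this comes down to counting, for each year, what fraction of papers mention a term. The following is a minimal sketch of that idea and not the team's actual code. The sample records are invented, and matching is a plain case-insensitive substring test, so a term like "planck" would also catch papers about the Planck function rather than the telescope.

```python
from collections import defaultdict


def term_trend(papers, term):
    """Fraction of papers per year whose title or abstract mentions term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, title, abstract in papers:
        totals[year] += 1
        if needle in f"{title} {abstract}".lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented records in the form (year, title, abstract).
papers = [
    (1990, "Hubble observations of M87", "Imaging of the jet."),
    (1990, "Stellar winds", "We model winds from hot stars."),
    (2000, "Dark matter halos", "Hubble imaging constrains halo shapes."),
    (2000, "Chandra survey", "A catalogue of X-ray sources."),
    (2000, "Hubble deep field", "Galaxy counts at faint magnitudes."),
]
trend = term_trend(papers, "hubble")
print(trend)
```

Normalising by the number of papers published each year matters here, because the raw volume of literature has grown so much over time.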

The relative popularity of the names of telescopes in the literature. Hubble, Chandra and Spitzer seem to have taken turns in hogging the limelight, much as COBE, WMAP and Planck have each contributed to our knowledge of the cosmic microwave background in successive decades. References to Planck are still on the rise. CREDIT: Robert Simpson.

Since the meeting, Robert has taken his initial results further and explored his new corpus of data on the astronomical literature. He’s tried various visualisations of the data, including word matrices for related terms and for astrochemistry terms.

Correlation between terms related to Active Galactic Nuclei (AGN). The opacity of each square represents the strength of the correlation between the terms. CREDIT: Robert Simpson.

He’s also taken a look at authorship in astronomy and is starting to find some interesting trends.

Fraction of astronomical papers published with one, two, three, four or more authors. CREDIT: Robert Simpson

You can see that single-author papers dominated for most of the 20th century. Around 1960, we see the decline begin, as two- and three-author papers become a significant chunk of the whole. In 1978, multi-author papers become more prevalent than single-author papers.
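
The chart behind these numbers can be rebuilt from nothing more than a year and an author count for each paper. This is a sketch under that assumption, with invented sample data; the cap of "four or more" mirrors the chart's grouping.

```python
from collections import Counter, defaultdict


def author_count_fractions(papers, cap=4):
    """Per year, the fraction of papers with 1, 2, ... or cap-or-more authors."""
    by_year = defaultdict(Counter)
    for year, n_authors in papers:
        by_year[year][min(n_authors, cap)] += 1
    return {
        year: {k: counts[k] / sum(counts.values()) for k in range(1, cap + 1)}
        for year, counts in sorted(by_year.items())
    }


# Invented records in the form (year, number of authors).
papers = [
    (1950, 1), (1950, 1), (1950, 2), (1950, 1),
    (2012, 1), (2012, 3), (2012, 7), (2012, 12),
]
fractions = author_count_fractions(papers)
print(fractions)
```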

Compare the number of “active” research astronomers to the number of papers published each year (across all the major journals). CREDIT: Robert Simpson.

Here we see that people begin to outpace papers in the 1960s. This may reflect the fact that as we get more technical as a field, and more specialised, it takes more people to write the same number of papers, which is a sort of interesting result all by itself.

Interview with Robert Simpson: Behind the project and what lies ahead

I recently talked with Rob about the work he, Karen Masters, and Sarah Kendrew did at the meeting, and the work he’s been doing since with the newly gathered data.

What made you think about data mining the ADS?

Robert Simpson: At the .Astronomy 4 Hack Day in July, Sarah Kendrew had the idea to try to do an astronomy version of BrainSCANr, a project that generates new hypotheses in the neuroscience literature. I’ve had a go at mining ADS and arXiv before, so it seemed like a great excuse to dive back in.

Do you think there might be actual science that could be done here?

Robert Simpson: Yes, in the form of finding questions that were unexpected. With such large volumes of peer-reviewed papers being produced daily in astronomy, there is a lot being said. Most researchers can only try to keep up with it all — my daily RSS feed from arXiv is next to useless, it’s so bloated. In amongst all that text, there must be connections and relationships that are being missed by the community at large, hidden in the chatter. Maybe we can develop simple techniques to highlight potential missed links, i.e. generate new hypotheses from the mass of words and data.

Are the results coming out of the work useful for auditing academics?

Robert Simpson: Well, perhaps, but that would be tricky territory in my opinion. I’ve only just begun to explore the data around authorship in astronomy. One thing that is clear is that we can see a big trend toward collaborative work. In 2012, only 6% of papers were single-author efforts, compared with 70+% in the 1950s.

The above plot shows the average number of authors per paper since 1827. CREDIT: Robert Simpson.

We can measure how large groups are becoming, and who is part of which groups. In that sense, we can audit research groups, and maybe individual people. The big issue is keeping track of people through variations in their names and affiliations. Identifying authors is probably a solved problem if we look at ORCID.

What about citations? Can you draw any comparisons with h-index data?

Robert Simpson: I haven’t looked at h-index stuff specifically, at least not yet, but citations are fun. I looked at the trends surrounding the term “dark matter” and saw something interesting. Mentions of dark matter rise steadily after it first appears in the late ’70s.

Compare the term “dark matter” with a few other related terms: “cosmology,” “big bang,” “dark energy,” and “wmap.” You can see cosmology has been getting more popular since the 1990s, and dark energy is a recent addition. CREDIT: Robert Simpson.

In the data, astronomy becomes more and more obsessed with dark matter — the term appears in 1% of all papers by the end of the ’80s and 6% today.

Looking at citations changes the picture. The community is writing papers about dark matter more and more each year, but they are getting fewer citations than they used to (the peak for this was in the late ’90s). These trends are normalised, so the only recency effect I can think of is that dark matter papers take more than 10 years to become citable. Either that or dark matter studies are currently in a trough for impact.

Can you see where work is dropped by parts of the community and picked up again?

Robert Simpson: Not yet, but I see what you mean. I need to build a better picture of the community and its components.

Can you build a social graph of astronomers out of this data? What about (academic) family trees?

Robert Simpson: Identifying unique authors is my next step, followed by creating fingerprints of individuals at a given point in time. When do people create their first-author papers, when do they have the most impact in their careers, stuff like that.

What tools did you use? In hindsight, would you do it differently?

Robert Simpson: I’m using Ruby and Perl to grab the data, MySQL to store and query it, JavaScript to display it (Google Charts and D3.js). I may still move the database part to MongoDB because it was designed to store documents. Similarly, I may switch from ADS to arXiv as the data source. Using arXiv would allow me to grab the full text in many cases, even if it does introduce a peer-review issue.

What’s next?

Robert Simpson: My aim is still to attempt real hypothesis generation. I’ve begun the process by investigating correlations between terms in the literature, but I think the power will be in being able to compare all terms with all terms and looking for the unexpected. Terms may correlate indirectly (via a third term, for example), so the entire corpus needs to be processed and optimised to make it work comprehensively.
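
As a rough illustration of the term-to-term comparison Robert describes, the sketch below marks whether each term appears in each abstract and computes the Pearson correlation between those 0/1 indicators (sometimes called the phi coefficient) for every pair of terms. This is an assumed approach, not necessarily the measure Robert used; the abstracts are invented, and matching is a simple lowercase substring test.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a term present everywhere or nowhere carries no signal
    return cov / sqrt(var_a * var_b)


def term_correlations(abstracts, terms):
    """Correlation of presence/absence for every pair of (lowercase) terms."""
    texts = [a.lower() for a in abstracts]
    present = {t: [1 if t in text else 0 for text in texts] for t in terms}
    return {(s, t): pearson(present[s], present[t]) for s, t in combinations(terms, 2)}


# Invented abstracts for illustration.
abstracts = [
    "AGN with a quasar jet",
    "Quasar and AGN feedback",
    "Dust in star-forming regions",
    "Dust lanes in edge-on galaxies",
]
corr = term_correlations(abstracts, ["agn", "quasar", "dust"])
print(corr)
```

Pairwise correlation only captures direct links. Finding terms that relate through a third term, as Robert suggests, would need a further step over the full matrix.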

Science between the cracks

I’m really looking forward to seeing more results coming out of Robert’s work. This sort of analysis hasn’t really been possible before. It’s showing a lot of promise both from a sociological angle, with the ability to do research into how science is done and how that has changed, and ultimately as a hypothesis engine: something that can generate new science in and of itself. This is just a hack day experiment. Imagine what could be done if the literature were more open and this sort of analysis could be done across fields.

Right now, a lot of the most interesting science is being done in the cracks between disciplines, but the hardest part of that sort of work is often trying to understand the literature of the discipline that isn’t your own. Robert’s project offers a lot of hope that this may soon become easier.

August 14 2012

Solving the Wanamaker problem for health care

By Tim O’Reilly, Julie Steele, Mike Loukides and Colin Hill

“The best minds of my generation are thinking about how to make people click ads.” — Jeff Hammerbacher, early Facebook employee

“Work on stuff that matters.” — Tim O’Reilly

In the early days of the 20th century, department store magnate John Wanamaker famously said, “I know that half of my advertising doesn’t work. The problem is that I don’t know which half.”

The consumer Internet revolution was fueled by a search for the answer to Wanamaker’s question. Google AdWords and the pay-per-click model transformed a business in which advertisers paid for ad impressions into one in which they pay for results. “Cost per thousand impressions” (CPM) was replaced by “cost per click” (CPC), and a new industry was born. It’s important to understand why CPC replaced CPM, though. Superficially, it’s because Google was able to track when a user clicked on a link, and was therefore able to bill based on success. But billing based on success doesn’t fundamentally change anything unless you can also change the success rate, and that’s what Google was able to do. By using data to understand each user’s behavior, Google was able to place advertisements that an individual was likely to click. They knew “which half” of their advertising was more likely to be effective, and didn’t bother with the rest.
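
The link between the two pricing models is simple arithmetic: the effective cost of a click under CPM pricing is the price of one impression divided by the click-through rate. The figures in this sketch are invented for illustration and are not Google's actual rates.

```python
def cost_per_click_under_cpm(cpm_dollars, click_through_rate):
    """Effective cost of one click when paying per thousand impressions."""
    cost_per_impression = cpm_dollars / 1000
    return cost_per_impression / click_through_rate


# Hypothetical campaign: $5 CPM, before and after better ad targeting.
untargeted = cost_per_click_under_cpm(5.0, 0.005)  # 0.5% of viewers click
targeted = cost_per_click_under_cpm(5.0, 0.010)    # 1.0% of viewers click

print(f"untargeted: ${untargeted:.2f} per click")
print(f"targeted:   ${targeted:.2f} per click")
```

Doubling the click-through rate halves what each click effectively costs, which is why improving the success rate mattered more than merely changing how the bill was calculated.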

Since then, data and predictive analytics have driven ever deeper insight into user behavior such that companies like Google, Facebook, Twitter, Zynga, and LinkedIn are fundamentally data companies. And data isn’t just transforming the consumer Internet. It is transforming finance, design, and manufacturing — and perhaps most importantly, health care.

How is data science transforming health care? There are many ways in which health care is changing, and needs to change. We’re focusing on one particular issue: the problem Wanamaker described when talking about his advertising. How do you make sure you’re spending money effectively? Is it possible to know what will work in advance?

Too often, when doctors order a treatment, whether it’s surgery or an over-the-counter medication, they are applying a “standard of care” treatment or some variation that is based on their own intuition, effectively hoping for the best. The sad truth of medicine is that we don’t really understand the relationship between treatments and outcomes. We have studies to show that various treatments will work more often than placebos; but, like Wanamaker, we know that much of our medicine doesn’t work for half of our patients; we just don’t know which half. At least, not in advance. One of data science’s many promises is that, if we can collect data about medical treatments and use that data effectively, we’ll be able to predict more accurately which treatments will be effective for which patient, and which treatments won’t.

A better understanding of the relationship between treatments, outcomes, and patients will have a huge impact on the practice of medicine in the United States. Health care is expensive. The U.S. spends over $2.6 trillion on health care every year, an amount that constitutes a serious fiscal burden for government, businesses, and our society as a whole. These costs include over $600 billion of unexplained variations in treatments: treatments that cause no differences in outcomes, or even make the patient’s condition worse. We have reached a point at which our need to understand treatment effectiveness has become vital — to the health care system and to the health and sustainability of the economy overall.

Why do we believe that data science has the potential to revolutionize health care? After all, the medical industry has had data for generations: clinical studies, insurance data, hospital records. But the health care industry is now awash in data in a way that it has never been before: from biological data such as gene expression, next-generation DNA sequence data, proteomics, and metabolomics, to clinical data and health outcomes data contained in ever more prevalent electronic health records (EHRs) and longitudinal drug and medical claims. We have entered a new era in which we can work on massive datasets effectively, combining data from clinical trials and direct observation by practicing physicians (the records generated by our $2.6 trillion of medical expense). When we combine data with the resources needed to work on the data, we can start asking the important questions, the Wanamaker questions, about what treatments work and for whom.

The opportunities are huge: for entrepreneurs and data scientists looking to put their skills to work disrupting a large market, for researchers trying to make sense out of the flood of data they are now generating, and for existing companies (including health insurance companies, biotech, pharmaceutical, and medical device companies, hospitals and other care providers) that are looking to remake their businesses for the coming world of outcome-based payment models.

Making health care more effective

What, specifically, does data allow us to do that we couldn’t do before? For the past 60 or so years of medical history, we’ve treated patients as some sort of an average. A doctor would diagnose a condition and recommend a treatment based on what worked for most people, as reflected in large clinical studies. Over the years, we’ve become more sophisticated about what that average patient means, but that same statistical approach didn’t allow for differences between patients. A treatment was deemed effective or ineffective, safe or unsafe, based on double-blind studies that rarely took into account the differences between patients. The exceptions to this are relatively recent and have been dominated by cancer treatments, the first being Herceptin for breast cancer in women who over-express the Her2 receptor. With the data that’s now available, we can go much further for a broad range of diseases and interventions that are not just drugs but include surgery, disease management programs, medical devices, patient adherence, and care delivery.

For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more: we know that it’s 100% effective in 70 to 80% of the patients, and ineffective in the rest. That’s not word games, because we can now use genetic markers to tell whether it’s likely to be effective or ineffective for any given patient, and we can tell in advance whether to treat with Tamoxifen or to try something else.

Two factors lie behind this new approach to medicine: a different way of using data, and the availability of new kinds of data. It’s not just stating that the drug is effective on most patients, based on trials (indeed, 80% is an enviable success rate); it’s using artificial intelligence techniques to divide the patients into groups and then determine the difference between those groups. We’re not asking whether the drug is effective; we’re asking a fundamentally different question: “for which patients is this drug effective?” We’re asking about the patients, not just the treatments. A drug that’s only effective on 1% of patients might be very valuable if we can tell who that 1% is, though it would certainly be rejected by any traditional clinical trial.
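
The difference between the two questions shows up as soon as the outcomes are split by group. The sketch below uses invented, idealized data (80 of 100 patients carry a hypothetical marker and all of them respond; the other 20 do not respond) purely to mirror the "100% effective in 80% of patients" framing. It is not clinical data, and real stratification would involve far more than one yes/no marker.

```python
from collections import defaultdict


def response_rates(patients):
    """Return the overall response rate and the rate within each marker group."""
    overall = sum(responded for _, responded in patients) / len(patients)
    groups = defaultdict(list)
    for marker, responded in patients:
        groups[marker].append(responded)
    by_marker = {m: sum(v) / len(v) for m, v in groups.items()}
    return overall, by_marker


# Invented cohort in the form (marker status, responded to treatment).
patients = [("positive", True)] * 80 + [("negative", False)] * 20

overall, by_marker = response_rates(patients)
print(overall)    # what a trial that averages over everyone reports
print(by_marker)  # what appears once patients are divided into groups
```

The averaged figure hides the fact that the two groups behave completely differently, which is exactly the information needed to decide in advance whom to treat.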

More than that, asking questions about patients is only possible because we’re using data that wasn’t available until recently: DNA sequencing was only invented in the mid-1970s, and is only now coming into its own as a medical tool. What we’ve seen with Tamoxifen is as clear a solution to the Wanamaker problem as you could ask for: we now know when that treatment will be effective. If you can do the same thing with millions of cancer patients, you will both improve outcomes and save money.

Dr. Lukas Wartman, a cancer researcher who was himself diagnosed with terminal leukemia, was successfully treated with sunitinib, a drug that was only approved for kidney cancer. Sequencing the genes of both the patient’s healthy cells and cancerous cells led to the discovery of a protein that was out of control and encouraging the spread of the cancer. The gene responsible for manufacturing this protein could potentially be inhibited by the kidney drug, although it had never been tested for this application. This unorthodox treatment was surprisingly effective: Wartman is now in remission.

While this treatment was exotic and expensive, what’s important isn’t the expense but the potential for new kinds of diagnosis. The price of gene sequencing has been plummeting; it will be a common doctor’s office procedure in a few years. And through Amazon and Google, you can now “rent” a cloud-based supercomputing cluster that can solve huge analytic problems for a few hundred dollars per hour. What is now exotic inevitably becomes routine.

But even more important: we’re looking at a completely different approach to treatment. Rather than a treatment that works 80% of the time, or even 100% of the time for 80% of the patients, a treatment might be effective for a small group. It might be entirely specific to the individual; the next cancer patient may have a different protein that’s out of control, an entirely different genetic cause for the disease. Treatments that are specific to one patient don’t exist in medicine as it’s currently practiced; how could you ever do an FDA trial for a medication that’s only going to be used once to treat a certain kind of cancer?

Foundation Medicine is at the forefront of this new era in cancer treatment. They use next-generation DNA sequencing to discover DNA sequence mutations and deletions that are currently used in standard of care treatments, as well as many other actionable mutations that are tied to drugs for other types of cancer. They are creating a patient-outcomes repository that will be the fuel for discovering the relation between mutations and drugs. Foundation has identified, in 50% of cancer cases, DNA mutations for which drugs exist but are not currently used in the standard of care for the patient’s particular cancer (information via a private communication).

The ability to do large-scale computing on genetic data gives us the ability to understand the origins of disease. If we can understand why an anti-cancer drug is effective (what specific proteins it affects), and if we can understand what genetic factors are causing the cancer to spread, then we’re able to use the tools at our disposal much more effectively. Rather than using imprecise treatments organized around symptoms, we’ll be able to target the actual causes of disease, and design treatments tuned to the biology of the specific patient. Eventually, we’ll be able to treat 100% of the patients 100% of the time, precisely because we realize that each patient presents a unique problem.

Personalized treatment is just one area in which we can solve the Wanamaker problem with data. Hospital admissions are extremely expensive. Data can make hospital systems more efficient, and help avoid preventable complications such as blood clots and hospital re-admissions. It can also help address the challenge of hot-spotting (a term coined by Atul Gawande): finding people who use an inordinate amount of health care resources. By looking at data from hospital visits, Dr. Jeffrey Brenner of Camden, NJ, was able to determine that “just one per cent of the hundred thousand people who made use of Camden’s medical facilities accounted for thirty per cent of its costs.” Furthermore, many of these people came from two apartment buildings. Designing more effective medical care for these patients was difficult; it doesn’t fit our health insurance system, the patients are often dealing with many serious medical issues (addiction and obesity are frequent complications), and have trouble trusting doctors and social workers. It’s counter-intuitive, but spending more on some patients now results in spending less on them when they become really sick. While it’s a work in progress, it looks like building appropriate systems to target these high-risk patients and treat them before they’re hospitalized will bring significant savings.
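
The hot-spotting calculation itself is straightforward once per-patient costs are in hand: rank patients by cost and ask what share of the total the top slice accounts for. The numbers below are invented so that one patient in a hundred accounts for 30% of spending, echoing Brenner's finding; they are not Camden's data.

```python
def top_share(costs, top_fraction=0.01):
    """Share of total cost attributable to the most expensive top_fraction of patients."""
    ranked = sorted(costs, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # always include at least one patient
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Invented annual costs: one very expensive patient and 99 typical ones.
costs = [2970] + [70] * 99
share = top_share(costs)
print(f"top 1% of patients account for {share:.0%} of costs")
```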

Many poor health outcomes are attributable to patients who don’t take their medications. Eliza, a Boston-based company started by Alexandra Drane, has pioneered approaches to improve compliance through interactive communication with patients. Eliza improves patient drug compliance by tracking which types of reminders work on which types of people; it’s similar to the way companies like Google target advertisements to individual consumers. By using data to analyze each patient’s behavior, Eliza can generate reminders that are more likely to be effective. The results aren’t surprising: if patients take their medicine as prescribed, they are more likely to get better. And if they get better, they are less likely to require further, more expensive treatment. Again, we’re using data to solve Wanamaker’s problem in medicine: we’re spending our resources on what’s effective, on appropriate reminders that are most likely to get patients to take their medications.

More data, more sources

The examples we’ve looked at so far have been limited to traditional sources of medical data: hospitals, research centers, doctor’s offices, insurers. The Internet has enabled the formation of patient networks aimed at sharing data. Health social networks now are some of the largest patient communities. As of November 2011, PatientsLikeMe has over 120,000 patients in 500 different condition groups; ACOR has over 100,000 patients in 127 cancer support groups; 23andMe has over 100,000 members in their genomic database; and diabetes health social network SugarStats has over 10,000 members. These are just the larger communities; thousands of small communities are created around rare diseases, or even uncommon experiences with common diseases. All of these communities are generating data that they voluntarily share with each other and the world.

Increasingly, what they share is not just anecdotal, but includes an array of clinical data. For this reason, these groups are being recruited for large-scale crowdsourced clinical outcomes research.

Thanks to ubiquitous data networking through the mobile network, we can take several steps further. In the past two or three years, there’s been a flood of personal fitness devices (such as the Fitbit) for monitoring your personal activity. There are mobile apps for taking your pulse, and an iPhone attachment for measuring your glucose. There has been talk of mobile applications that would constantly listen to a patient’s speech and detect changes that might be the precursor for a stroke, or would use the accelerometer to report falls. Tanzeem Choudhury has developed an app called Be Well that is intended primarily for victims of depression, though it can be used by anyone. Be Well monitors the user’s sleep cycles, the amount of time they spend talking, and the amount of time they spend walking. The data is scored, and the app makes appropriate recommendations, based both on the individual patient and data collected across all the app’s users.

Continuous monitoring of critical patients in hospitals has been normal for years; but we now have the tools to monitor patients constantly, in their home, at work, wherever they happen to be. And if this sounds like big brother, at this point most of the patients are willing. We don’t want to transform our lives into hospital experiences; far from it! But we can collect and use the data we constantly emit, our “data smog,” to maintain our health, to become conscious of our behavior, and to detect oncoming conditions before they become serious. The most effective medical care is the medical care you avoid because you don’t need it.

Paying for results

Once we’re on the road toward more effective health care, we can look at other ways in which Wanamaker’s problem shows up in the medical industry. It’s clear that we don’t want to pay for treatments that are ineffective. Wanamaker wanted to know which part of his advertising was effective, not just to make better ads, but also so that he wouldn’t have to buy the advertisements that wouldn’t work. He wanted to pay for results, not for ad placements. Now that we’re starting to understand how to make treatment effective, now that we understand that it’s more than rolling the dice and hoping that a treatment that works for a typical patient will be effective for you, we can take the next step: Can we change the underlying incentives in the medical system? Can we make the system better by paying for results, rather than paying for procedures?

It’s shocking just how badly the incentives in our current medical system are aligned with outcomes. If you see an orthopedist, you’re likely to get an MRI, most likely at a facility owned by the orthopedist’s practice. On one hand, it’s good medicine to know what you’re doing before you operate. But how often does that MRI result in a different treatment? How often is the MRI required just because it’s part of the protocol, when it’s perfectly obvious what the doctor needs to do? Many men have had PSA tests for prostate cancer; but in most cases, aggressive treatment of prostate cancer is a bigger risk than the disease itself. Yet the test itself is a significant profit center. Think again about Tamoxifen, and about the pharmaceutical company that makes it. In our current system, what does “100% effective in 80% of the patients” mean, except for a 20% loss in sales? That’s because the drug company is paid for the treatment, not for the result; it has no financial interest in whether any individual patient gets better. (Whether a statistically significant number of patients has side-effects is a different issue.) And at the same time, bringing a new drug to market is very expensive, and might not be worthwhile if it will only be used on the remaining 20% of the patients. And that’s assuming that one drug, not two, or 20, or 200 will be required to treat the unlucky 20% effectively.

It doesn’t have to be this way.

In the U.K., Johnson & Johnson, faced with the possibility of losing reimbursements for their multiple myeloma drug Velcade, agreed to refund the money for patients who did not respond to the drug. Several other pay-for-performance drug deals have followed since, paving the way for the ultimate transition in pharmaceutical company business models in which their product is health outcomes instead of pills. Such a transition would rely more heavily on real-world outcome data (are patients actually getting better?), rather than controlled clinical trials, and would use molecular diagnostics to create personalized “treatment algorithms.” Pharmaceutical companies would also focus more on drug compliance to ensure health outcomes were being achieved. This would ultimately align the interests of drug makers with patients, their providers, and payors.

Similarly, rather than paying for treatments and procedures, can we pay hospitals and doctors for results? That’s what Accountable Care Organizations (ACOs) are about. ACOs are a leap forward in business model design, where the provider shoulders any financial risk. ACOs represent a new framing of the much maligned HMO approaches from the ’90s, which did not work. HMOs tried to use statistics to predict and prevent unneeded care. The ACO model, rather than controlling doctors with what the data says they “should” do, uses data to measure how each doctor performs. Doctors are paid for successes, not for the procedures they administer. The main advantage that the ACO model has over the HMO model is how good the data is, and how that data is leveraged. The ACO model aligns incentives with outcomes: a practice that owns an MRI facility isn’t incentivized to order MRIs when they’re not necessary. It is incentivized to use all the data at its disposal to determine the most effective treatment for the patient, and to follow through on that treatment with a minimum of unnecessary testing.

When we know which procedures are likely to be successful, we’ll be in a position where we can pay only for the health care that works. When we can do that, we’ve solved Wanamaker’s problem for health care.

Enabling data

Data science is not optional in health care reform; it is the linchpin of the whole process. All of the examples we’ve seen, ranging from cancer treatment to detecting hot spots where additional intervention will make hospital admission unnecessary, depend on using data effectively: taking advantage of new data sources and new analytics techniques, in addition to the data the medical profession has had all along.

But it’s too simple just to say “we need data.” We’ve had data all along: handwritten records in manila folders on acres and acres of shelving. Insurance company records. But it’s all been locked up in silos: insurance silos, hospital silos, and many, many doctor’s office silos. Data doesn’t help if it can’t be moved, if data sources can’t be combined.

There are two big issues here. First, a surprising number of medical records are still either hand-written, or in digital formats that are scarcely better than hand-written (for example, scanned images of hand-written records). Getting medical records into a format that’s computable is a prerequisite for almost any kind of progress. Second, we need to break down those silos.

Anyone who has worked with data knows that, in any problem, 90% of the work is getting the data in a form in which it can be used; the analysis itself is often simple. We need electronic health records: patient data in a more-or-less standard form that can be shared efficiently, data that can be moved from one location to another at the speed of the Internet. Not all data formats are created equal, and some are certainly better than others: but at this point, any machine-readable format, even simple text files, is better than nothing. While there are currently hundreds of different formats for electronic health records, the fact that they’re electronic means that they can be converted from one form into another. Standardizing on a single format would make things much easier, but just getting the data into some electronic form, any form, is the first step.
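
To make the point about conversion concrete, here is a minimal sketch in Python. It assumes a hypothetical pipe-delimited text export; the field names (patient_id, visit_date, drug, outcome) and the sample rows are invented for illustration and do not come from any real health record standard.

```python
import csv
import io
import json

# Hypothetical pipe-delimited export; field names and rows are invented.
RAW_EXPORT = """patient_id|visit_date|drug|outcome
p001|2012-03-14|drug_a|improved
p002|2012-03-15|drug_b|no_change
"""


def text_to_records(raw):
    """Parse a delimited text export into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="|")
    return [dict(row) for row in reader]


def records_to_json(records):
    """Serialize the records as JSON, a second machine-readable form."""
    return json.dumps(records, indent=2, sort_keys=True)


records = text_to_records(RAW_EXPORT)
print(records_to_json(records))
```

Once the records are in any structured form, moving them to another one is mostly mechanical. Real conversions are harder mainly because different systems disagree about what the fields mean, not because of the file format.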

Once we have electronic health records, we can link doctor’s offices, labs, hospitals, and insurers into a data network, so that all patient data is immediately stored in a data center: every prescription, every procedure, and whether that treatment was effective or not. This isn’t some futuristic dream; it’s technology we have now. Building this network would be substantially simpler and cheaper than building the networks and data centers now operated by Google, Facebook, Amazon, Apple, and many other large technology companies. It’s not even close to pushing the limits.

Electronic health records enable us to go far beyond the current mechanism of clinical trials. In the past, once a drug has been approved in trials, that’s effectively the end of the story: running more tests to determine whether it’s effective in practice would be a huge expense. A physician might get a sense for whether any treatment worked, but that evidence is essentially anecdotal: it’s easy to believe that something is effective because that’s what you want to see. And if it’s shared with other doctors, it’s shared while chatting at a medical convention. But with electronic health records, it’s possible (and not even terribly expensive) to collect documentation from thousands of physicians treating millions of patients. We can find out when and where a drug was prescribed, why, and whether there was a good outcome. We can ask questions that are never part of clinical trials: is the medication used in combination with anything else? What other conditions is the patient being treated for? We can use machine learning techniques to discover unexpected combinations of drugs that work well together, or to predict adverse reactions. We’re no longer limited by clinical trials; every patient can be part of an ongoing evaluation of whether his treatment is effective, and under what conditions. Technically, this isn’t hard. The only difficult part is getting the data to move, getting data in a form where it’s easily transferred from the doctor’s office to analytics centers.

To solve problems of hot-spotting (individual patients or groups of patients consuming inordinate medical resources) requires a different combination of information. You can’t locate hot spots if you don’t have physical addresses. Physical addresses can be geocoded (converted from addresses to longitude and latitude, which is more useful for mapping problems) easily enough, once you have them, but you need access to patient records from all the hospitals operating in the area under study. And you need access to insurance records to determine how much health care patients are requiring, and to evaluate whether special interventions for these patients are effective. Not only does this require electronic records, it requires cooperation across different organizations (breaking down silos), and assurance that the data won’t be misused (patient privacy). Again, the enabling factor is our ability to combine data from different sources; once you have the data, the solutions come easily.
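
As a rough sketch of that combination step, the Python below joins already-geocoded patient addresses with insurance claims and totals cost per map grid cell. Everything here is an assumption made for illustration: the patient IDs, coordinates, dollar amounts, the 0.01-degree cell size, and the function names are all hypothetical, and real hot-spotting work would involve far more careful privacy controls and statistics.

```python
import math
from collections import defaultdict

# Hypothetical geocoded patient records: patient_id -> (latitude, longitude).
patient_locations = {
    "p1": (39.9451, -75.1196),
    "p2": (39.9453, -75.1199),
    "p3": (39.9265, -75.0905),
}

# Hypothetical insurance claims as (patient_id, cost in dollars).
claims = [
    ("p1", 12000.0),
    ("p2", 8000.0),
    ("p1", 5000.0),
    ("p3", 300.0),
]


def grid_cell(lat, lon, cell_size=0.01):
    """Map a coordinate to a coarse grid cell (on the order of a kilometer)."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def hot_spots(locations, claim_rows, cell_size=0.01):
    """Total claim costs per grid cell, most expensive cell first."""
    totals = defaultdict(float)
    for patient_id, cost in claim_rows:
        if patient_id not in locations:
            continue  # no address on file, so the claim cannot be mapped
        lat, lon = locations[patient_id]
        totals[grid_cell(lat, lon, cell_size)] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = hot_spots(patient_locations, claims)
for cell, total in ranked:
    print(cell, total)
```

The join on patient ID is the part that depends on breaking down silos: the locations come from hospital records and the costs from insurers, and neither source alone can produce the ranking.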

Breaking down silos has a lot to do with aligning incentives. Currently, hospitals are trying to optimize their income from medical treatments, while insurance companies are trying to optimize their income by minimizing payments, and doctors are just trying to keep their heads above water. There’s little incentive to cooperate. But as financial pressures rise, it will become critically important for everyone in the health care system, from the patient to the insurance executive, to ensure that they are getting the most for their money. While there’s intense cultural resistance to be overcome (through our experience in data science, we’ve learned that it’s often difficult to break down silos within an organization, let alone between organizations), the pressure of delivering more effective health care for less money will eventually break the silos down. The old zero-sum game of winners and losers must end if we’re going to have a medical system that’s effective over the coming decades.

Data becomes infinitely more powerful when you can mix data from different sources: many doctor’s offices, hospital admission records, address databases, and even the rapidly increasing stream of data coming from personal fitness devices. The challenge isn’t employing our statistics more carefully, precisely, or guardedly. It’s about letting go of an old paradigm that starts by assuming only certain variables are key and ends by correlating only these variables. This paradigm worked well when data was scarce, but if you think about it, these assumptions arise precisely because data is scarce. We didn’t study the relationship between leukemia and kidney cancers because that would require asking a huge set of questions that would require collecting a lot of data; and a connection between leukemia and kidney cancer is no more likely than a connection between leukemia and flu. But the existence of data is no longer a problem: we’re collecting the data all the time. Electronic health records let us move the data around so that we can assemble a collection of cases that goes far beyond a particular practice, a particular hospital, a particular study. So now, we can use machine learning techniques to identify and test all possible hypotheses, rather than just the small set that intuition might suggest. And finally, with enough data, we can get beyond correlation to causation: rather than saying “A and B are correlated,” we’ll be able to say “A causes B,” and know what to do about it.

Building the health care system we want

The U.S. ranks 37th out of developed economies in life expectancy and other measures of health, while by far outspending other countries on per-capita health care costs. We spend 18% of GDP on health care, while other countries on average spend on the order of 10% of GDP. We spend a lot of money on treatments that don’t work, because we have a poor understanding at best of what will and won’t work.

Part of the problem is cultural. In a country where even pets can have hip replacement surgery, it’s hard to imagine not spending every penny you have to prolong Grandma’s life, or your own. The U.S. is a wealthy nation, and health care is something we choose to spend our money on. But wealthy or not, nobody wants ineffective treatments. Nobody wants to roll the dice and hope that their biology is similar enough to a hypothetical “average” patient. No one wants a “winner take all” payment system in which the patient is always the loser, paying for procedures whether or not they are helpful or necessary. Like Wanamaker with his advertisements, we want to know what works, and we want to pay for what works. We want a smarter system where treatments are designed to be effective on our individual biologies; where treatments are administered effectively; where our hospitals are used effectively; and where we pay for outcomes, not for procedures.

We’re on the verge of that new system now. We don’t have it yet, but we can see it around the corner. Ultra-cheap DNA sequencing in the doctor’s office, massive inexpensive computing power, the availability of EHRs to study whether treatments are effective even after the FDA trials are over, and improved techniques for analyzing data are the tools that will bring this new system about. The tools are here now; it’s up to us to put them into use.

August 09 2012

The risks and rewards of a health data commons

As I wrote earlier this year in an ebook on data for the public good, while the idea of data as a currency is still in its infancy, it’s important to think about where the future is taking us and our personal data.

If the Obama administration’s smart disclosure initiatives gather steam, more citizens will be able to do more than think about personal data: they’ll be able to access their financial, health, education, or energy data. In the U.S. federal government, the Blue Button initiative, which initially enabled veterans to download personal health data, is now spreading to all federal employees, and it also earned adoption at private institutions like Aetna and Kaiser Permanente. Putting health data to work stands to benefit hundreds of millions of people. The Locker Project, which provides people with the ability to move and store personal data, is another approach to watch.

The promise of more access to personal data, however, is balanced by accompanying risks. Smartphones, tablets, and flash drives, after all, are lost or stolen every day. Given the potential of mhealth, big data, and health care information technology, researchers and policy makers alike are moving forward with their applications. As they do so, conversations and rulemaking about health care privacy will need to take into account not just data collection or retention but context and use.

Put simply, businesses must confront the ethical issues tied to massive aggregation and data analysis. Given that context, Fred Trotter’s post on who owns health data is a crucial read. As Fred highlights, the real issue is not ownership, per se, but “What rights do patients have regarding health care data that refers to them?”

Would, for instance, those rights include the ability to donate personal data to a data commons, much in the same way organs are donated now for research? That question isn’t exactly hypothetical, as the following interview with John Wilbanks highlights.

Wilbanks, a senior fellow at the Kauffman Foundation and director of the Consent to Research Project, has been an advocate for open data and open access for years, including a stint at Creative Commons; a fellowship at the World Wide Web Consortium; and experience in the academic, business, and legislative worlds. Wilbanks will be speaking at the Strata Rx Conference in October.

Our interview, lightly edited for content and clarity, follows.

Where did you start your career? Where has it taken you?

John Wilbanks: I got into all of this, in many ways, because I studied philosophy 20 years ago. What I studied inside of philosophy was semantics. In the ’90s, that was actually sort of pointless because there wasn’t much semantic stuff happening computationally.

In the late ’90s, I started playing around with biotech data, mainly because I was dating a biologist. I was sort of shocked at how the data was being represented. It wasn’t being represented in a way that was very semantic, in my opinion. I started a software company and we ran that for a while, [and then] sold it during the crash.

I went to the World Wide Web Consortium, where I spent a year helping start their Semantic Web for Life Sciences project. While I was there, Creative Commons (CC) asked me to come and start their science project because I had known a lot of those guys. When I started my company, I was at the Berkman Center at Harvard Law School, and that’s where Creative Commons emerged from, so I knew the people. I knew the policy and I had gone off and had this bioinformatics software adventure.

I spent most of the last eight years at CC working on trying to build different commons in science. We looked at open access to scientific literature, which is probably where we had the most success because that’s copyright-centric. We looked at patents. We looked at physical laboratory materials, like stem cells in mice. We looked at different legal regimes to share those things. And we looked at data. We looked at both the technology aspects and legal aspects of sharing data and making it useful.

A couple of times over those years, we almost pivoted from science to health because science is so institutional that it’s really hard for any of the individual players to create sharing systems. It’s not like software, where anyone with a PC and an Internet connection can contribute to free software, or Flickr, where anybody with a digital camera can license something under CC. Most scientists are actually restricted by their institutions. They can’t share, even if they want to.

Health kept being interesting because it was the individual patients who had a motivation to actually create something different than the system did. At the same time, we were watching and seeing the capacity of individuals to capture data about themselves exploding. So, at the same time that the capacity of the system to capture data about you exploded, your own capacity to capture data exploded.

That, to me, started taking on some of the interesting contours that make Creative Commons successful, which was that you didn’t need a large number of people. You didn’t need a very large percentage of Wikipedia users to create Wikipedia. You didn’t need a large percentage of free software users to create free software. If this capacity to generate data about your health was exploding, you didn’t need a very large percentage of those people to create an awesome data resource: you needed to create the legal and technical systems for the people who did choose to share to make that sharing useful.

Since Creative Commons is really a copyright-centric organization, I left because the power on which you’re going to build a commons of health data is going to be privacy power, not copyright power. What I do now is work on informed consent, which is the legal system you need to work with instead of copyright licenses, as well as the technologies that then store, clean, and forward user-generated data to computational health and computational disease research.

What are the major barriers to people being able to donate their data in the same way they might donate their organs?

John Wilbanks: Right now, it looks an awful lot like getting onto the Internet before there was the web. The big ISPs kind of dominated the early adopters of computer technologies. You had AOL. You had CompuServe. You had Prodigy. And they didn’t communicate with each other. You couldn’t send email from AOL to CompuServe.

What you have now depends on the kind of data. If the data that interests you is your genotype, you’re probably a 23andMe customer and you’ve got a bunch of your data at 23andMe. If you are the kind of person who has a chronic illness and likes to share information about that illness, you’re probably a customer at PatientsLikeMe. But those two systems don’t interoperate. You can’t send data from one to the other very effectively or really at all.

On top of that, the system has data about you. Your insurance company has your billing records. Your physician has your medical records. Your pharmacy has your pharmacy records. And if you do quantified self, you’ve got your own set of data streams. You’ve got your Fitbit, the data coming off of your smartphone, and your meal data.

Almost all of these are basically populating different silos. In some cases, you have the right to download certain pieces of the data. For the most part, you don’t. It’s really hard for you, as an individual, to build your own, multidimensional picture of your data, whereas it’s actually fairly easy for all of those companies to sell your data to one another. There’s not a lot of technology that lets you share.

What are some of the early signals we’re seeing about data usage moving into actual regulatory language?

John Wilbanks: The regulatory language actually makes it fairly hard to do contextual privacy waiving, in a Creative Commons sense. It’s hard to do granular permissions around privacy in the way you can do granular conditional copyright grants because you don’t have intellectual property. The only legal tool you have is a contract, and the contracts don’t have a lot of teeth.

It’s pretty hard to do anything beyond a gift. It’s more like organ donation, where you don’t get to decide where the organs go. What I’m working on is basically a donation, not a conditional gift. The regulatory environment makes it quite hard to do anything besides that.

There was a public comment period that just finished. It’s an announcement of proposed rulemaking on what’s called the Common Rule, which is the Department of Health and Human Services privacy language. It was looking to re-examine the rules around letting de-identified data or anonymized data out for widespread use. They got a bunch of comments.

There’s controversy as to how de-identified data can actually be and still be useful. There is going to be, probably, a three-to-five year process where they rewrite the Common Rule and it’ll be more modern. No one knows how modern, but it will be at least more modern when that finishes.

Then there’s another piece in the US — HIPAA — which creates a totally separate regime. In some ways, it is the same as the Common Rule, but not always. I don’t think that’s going to get opened up. The way HIPAA works is that they have 17 direct identifiers that are labeled as identifying information. If you strip those out, it’s considered de-identified.

There’s an 18th bucket, which is anything else that can reasonably identify people. It’s really hard to hit. Right now, your genome is not considered to fall under that. I would be willing to bet within a year or two, it will be.

From a regulatory perspective, you’ve got these overlapping regimes that don’t quite fit and both of them are moving targets. That creates a lot of uncertainty from an investment perspective or from an analytics perspective.
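
The strip-the-listed-fields approach Wilbanks describes can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the field names below are a hypothetical subset chosen for the example, not the regulatory list, and the record is invented.

```python
# Hypothetical subset of direct-identifier field names, for illustration only;
# this is not the regulatory list.
DIRECT_IDENTIFIER_FIELDS = {"name", "street_address", "phone", "email", "ssn"}


def strip_direct_identifiers(record, identifier_fields=DIRECT_IDENTIFIER_FIELDS):
    """Return a copy of the record without the listed identifier fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in identifier_fields
    }


# Invented example record.
record = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "diagnosis": "type 2 diabetes",
    "year_of_birth": 1960,
}

deidentified = strip_direct_identifiers(record)
print(deidentified)
```

A field-by-field filter like this only removes what it has been told to look for. That is why the catch-all category matters: anything left in the record that could still single a person out, such as a genome, passes straight through.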

How are you thinking about a “health data commons,” in terms of weighing potential risks against potential social good?

John Wilbanks: I think that that’s a personal judgment as to the risk-benefit decision. Part of the difficulty is that the regulations are very syntactic — “This is what re-identification is” — whereas the concept of harm, benefit, or risk is actually something that’s deeply personal. If you are sick, if you have cancer or a rare disease, you have a very different idea of what risk is compared to somebody who thinks of him or herself as healthy.

What we see, and this is borne out in the Framingham Heart Study and all sorts of other longitudinal surveys, is that people’s attitudes toward risk and benefit change depending on their circumstances. Their own context really affects what they think is risky and what they think isn’t risky.

I believe that the early data donors are likely to be people for whom there isn’t a lot of risk perceived because the health system already knows that they’re sick. The health system is already denying them coverage, denying their requests for PET scans, denying their requests for access to care. That’s based on actuarial tables, not on their personal data. It’s based on their medical history.

If you’re in that group of people, then the perceived risk is actually pretty low compared to the idea that your data might actually get used or to the idea that you’re no longer passive. Even if it’s just a donation, you’re doing something outside of the system that’s accelerating the odds of getting something discovered. I think that’s the natural group.

If you think back to the numbers of users who are required to create free software or Wikipedia, to create a cultural commons, a very low percentage is needed to create a useful resource.

Depending on who you talk to, somewhere between 5% and 10% of all Americans either have a rare disease, have it in their first-order family, or have a friend with a rare disease. Each individual disease might not have very many people suffering from it, but if you net them all up, it’s a lot of people. Getting several hundred thousand to a few million people enrolled is not an outrageous idea.

When you look at the existing examples of where such commons have come together, what have been the most important concrete positive outcomes for society?

John Wilbanks: I don’t think we have really even started to see them because most people don’t have computable data about themselves. Most people, if they have any data about themselves, have scans of their medical records.

What we really know is that there’s an opportunity cost to not trying, which is that the existing system is really inefficient, very bad at discovering drugs, and very bad at getting those drugs to market on a timely basis.

That’s one of the reasons we’re doing this as an experiment. We would like to see exactly how effective big computational approaches are on health data. The problem is that there are two ways to get there.

One is through a set of monopoly companies coming together and working together. That’s how semiconductors work. The other is through an open network approach. There’s not a lot of evidence that things besides these two approaches work. Government intervention is probably not going to work.

Obviously, I come down on the open network side. But there’s an implicit belief, I think, both in the people who are pushing the cooperating monopolies approach and the people who are pushing the open networks approach, that there’s enormous power in the big-data-driven approach. We’re just leaving that on the table right now by not having enough data aggregated.

The benefits to health that will come out will be the ability to increasingly, by looking at a multidimensional picture of a person, predict with some confidence whether or not a drug will work, or whether they’re going to get sick, or how sick they’re going to get, or what lifestyle changes they can make to mitigate an illness. Right now, basically, we really don’t know very much.

Pretty Simple Data Privacy

John Wilbanks discussed “Pretty Simple Data Privacy” during a Strata Online Conference in January 2012. His presentation begins at the 7:18 mark in the following video:

Strata Rx, being held Oct. 16-17 in San Francisco, is the first conference to bring data science to the urgent issues confronting health care.

Photo: Science Commons

August 03 2012

They promised us flying cars

We may be living in the future, but it hasn’t entirely worked out how we were promised. I remember the predictions clearly: the 21st century was supposed to be full of self-driving cars, personal communicators, replicators and private space ships.

Except, of course, all that has come true. Google just got the first license to drive their cars entirely autonomously on public highways. Apple came along with the iPhone and changed everything. Three-dimensional printers have come out of the laboratories and into the home. And in a few short years, and from a standing start, Elon Musk and SpaceX have achieved what might otherwise have been thought impossible: in late 2010, SpaceX launched a spacecraft and returned it to Earth safely. Then they launched another, successfully docked it with the International Space Station, and then again returned it to Earth.

The SpaceX Dragon capsule is grappled and berthed to the Earth-facing port of the International Space Station’s Harmony module at 12:02 p.m. EDT, May 25, 2012. Credit: NASA/SpaceX

Right now there is a generation of high-tech tinkerers breaking the seals on proprietary technology and prototyping new ideas, which is leading to a rapid growth in innovation. The members of this generation, who are building open hardware instead of writing open software, seem to have come out of nowhere. Except, of course, they haven’t. Promised a future they couldn’t have, they’ve started to build it. The only difference between them and Elon Musk, Jeff Bezos, Sergey Brin, Larry Page and Steve Jobs is that those guys got to build bigger toys than the rest of us.

The dotcom billionaires are regular geeks just like us. They might be the best of us, or sometimes just the luckiest, but they grew up with the same dreams, and they’ve finally given up waiting for governments to build the future they were promised when they were kids. They’re going to build it for themselves.

The thing that’s driving the Maker movement is the same thing that’s driving bigger shifts, like the next space race. Unlike the old space race, pushed by national pride and the hope that we could run fast enough in place so that we didn’t have to start a nuclear war, this new space race is being driven by personal pride, ambition and childhood dreams.

But there are some who don’t see what’s happening, and they’re about to miss out. Case in point: a lot of big businesses are confused by the open hardware movement. They don’t understand it, don’t think it’s worth their while to make exceptions and cater to it. Even the so-called “smart money” doesn’t seem to get it. I’ve heard moderately successful venture capitalists from the Valley say that they “… don’t do hardware.” Those guys are about to lose their shirts.

Makers are geeks like you and me who have decided to go ahead and build the future themselves because the big corporations and the major governments have so singularly failed to do it for us. Is it any surprise that dotcom billionaires are doing the same? Is it any surprise that the future we build is going to look a lot like the future we were promised and not so much like the future we were heading toward?


StrataRx: Data science and health(care)

By Mike Loukides and Jim Stogdill

We are launching a conference at the intersection of health, health care, and data. Why?

Our health care system is in crisis. We are experiencing epidemic levels of obesity, diabetes, and other preventable conditions while at the same time our health care system costs are spiraling higher. Most of us have experienced increasing health care costs in our businesses or have seen our personal share of insurance premiums rise rapidly. Worse, we may be living with a chronic or life-threatening disease while struggling to obtain effective therapies and interventions — finding ourselves lumped in with “average patients” instead of receiving effective care designed to work for our specific situation.

In short, particularly in the United States, we are paying too much for too much care of the wrong kind and getting poor results. All the while our diet and lifestyle failures are demanding even more from the system. In the past few decades we’ve dropped from the world’s best health care system to the 37th, and we seem likely to drop further if things don’t change.

The very public fight over the Affordable Care Act (ACA) has brought this to the fore of our attention, but this is a situation that has been brewing for a long time. With the ACA’s arrival, increasing costs and poor outcomes, at least in part, are going to be the responsibility of the federal government. The fiscal outlook for that responsibility doesn’t look good and solving this crisis is no longer optional; it’s urgent.

There are many reasons for the crisis, and there’s no silver bullet. Health and health care live at the confluence of diet and exercise norms, destructive business incentives, antiquated care models, and a system that has severe learning disabilities. We aren’t preventing the preventable, and once we’re sick we’re paying for procedures and tests instead of results; and those interventions were designed for some non-existent average patient so much of it is wasted. Later we mostly ignore the data that could help the system learn and adapt.

It’s all too easy to be gloomy about the outlook for health and health care, but this is also a moment of great opportunity. We face this crisis armed with vast new data sources, the emerging tools and techniques to analyze them, an ACA policy framework that emphasizes outcomes over procedures, and a growing recognition that these are problems worth solving.

Data has a long history of being “unreasonably effective.” And at least from the technologist’s point of view it looks like we are on the cusp of something big. We have the opportunity to move from “Health IT” to an era of data-illuminated technology-enabled health.

For example, it is well known that poverty places a disproportionate burden on the health care system. Poor people don’t have medical insurance and can’t afford to see doctors; so when they’re sick they go to the emergency room at great cost and often after they are much sicker than they need to be. But what happens when you look deeper? One project showed that two apartment buildings in Camden, NJ accounted for a hugely disproportionate number of hospital admissions.

Targeting those buildings, and specific people within them, with integrated preventive care and medical intervention has led to significant savings.

That project was made possible by the analysis of hospital admissions, costs, and intervention outcomes — essentially, insurance claims data — across all the hospitals in Camden. Acting upon that analysis and analyzing the results of the action led to savings.
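The core of that kind of analysis is a simple aggregation, and it can be sketched in a few lines. This is only an illustration, not the Camden project’s actual method: the records, building names, and costs are invented, and `cost_share_by_building` and `top_buildings` are hypothetical helpers.

```python
from collections import defaultdict

# Invented sample records: (building, cost of one hospital admission in dollars).
# Real claims data has many more fields and needs careful de-identification.
ADMISSIONS = [
    ("Building A", 12000),
    ("Building A", 8000),
    ("Building A", 15000),
    ("Building B", 9000),
    ("Building B", 11000),
    ("Building C", 3000),
    ("Building D", 2000),
]

def cost_share_by_building(admissions):
    """Return {building: fraction of total admission cost}."""
    totals = defaultdict(float)
    for building, cost in admissions:
        totals[building] += cost
    grand_total = sum(totals.values())
    return {b: t / grand_total for b, t in totals.items()}

def top_buildings(admissions, n=2):
    """Return the n buildings with the largest share of cost, largest first."""
    shares = cost_share_by_building(admissions)
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)[:n]

if __name__ == "__main__":
    for building, share in top_buildings(ADMISSIONS):
        print(f"{building}: {share:.0%} of admission costs")
```

The hard part in practice is not the grouping but getting comparable records from every hospital in the first place, and then measuring whether the intervention changed anything.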

But claims data isn’t the only game in town anymore. Even more is possible as electronic medical records (EMR), genomic, mobile sensor, and other emerging data streams become available.

With mobile-enabled remote sensors like glucometers, blood pressure monitors, and futuristic tools like digital pills that broadcast their arrival in the stomach, we have the opportunity to completely revolutionize disease management. By moving from discrete and costly data events to a continuous stream of inexpensive remotely monitored data, care will improve for a broad range of chronic and life-threatening diseases. By involving fewer office visits, physician productivity will rise and costs will come down.

We are also beginning to see tantalizing hints of the future of personalized medicine in action. Cheap gene sequencing, better understanding of how drug molecules interact with our biology (and each other), and the tools and horsepower to analyze these complex interactions for a specific patient with specific biology in near real time will change how we do medicine. In the same way that Google’s AdSense took cost out of advertising by using data to target ads with precision, we’ll soon be able to make medical interventions that are much more patient-specific and cost effective.

StrataRx is based on the idea that data will improve health and health care, but we aren’t naive enough to believe that data alone solves all the problems we are facing. Health and health care are incredibly complex and multi-layered and big data analytics is only one piece of the puzzle. Solving our national crisis will also depend on policy and system changes, some of them to systems outside of health care. However, we know that data and its analysis have an important role to play in illuminating the current reality and creating those solutions.

StrataRx is a call for data scientists, technologists, health professionals, and the sector’s business leadership to convene, take part in the discussion, and make a difference!

Strata Rx, being held Oct. 16-17 in San Francisco, is the first conference to bring data science to the urgent issues confronting healthcare.

Save 20% on registration with the code RADAR20


August 01 2012

Big data is our generation’s civil rights issue, and we don’t know it

Data doesn’t invade people’s lives. Lack of control over how it’s used does.

What’s really driving so-called big data isn’t the volume of information. It turns out big data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data.

For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.

The big data trifecta.

I’d first heard this as the “three V’s of data”: Volume, Variety, and Velocity. Traditionally, getting two was easy but getting three was very, very, very expensive.

The advent of clouds, platforms like Hadoop, and the inexorable march of Moore’s Law mean that now, analyzing data is trivially inexpensive. And when things become so cheap that they’re practically free, big changes happen: just look at the advent of steam power, or the copying of digital music, or the rise of home printing. Abundance replaces scarcity, and we invent new business models.

In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.

That needs repeating:

You decide what data is about the moment you define its schema.

With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, big data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected — sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
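The difference between the two models is easy to see in a toy example. What follows is a minimal sketch under invented assumptions: the widget event, its field names, and the three helper functions are hypothetical, and a JSON string stands in for whatever a real collect-first system would store.

```python
import json

# Schema-first: the three columns are fixed before any data arrives,
# so anything else about the sale is dropped at collection time.
WIDGET_SCHEMA = ("color", "region", "size")

def store_with_schema(event):
    """Keep only the fields the schema anticipated."""
    return tuple(event.get(field) for field in WIDGET_SCHEMA)

def store_raw(event):
    """Collect-first: keep the whole event as an opaque JSON string."""
    return json.dumps(event, sort_keys=True)

def query_raw(raw_events, field):
    """Impose a 'schema' only at question time by choosing a field to read."""
    return [json.loads(raw).get(field) for raw in raw_events]

if __name__ == "__main__":
    event = {"color": "red", "region": "EU", "size": "L",
             "customer_zip": "00000", "time": "2012-08-01T09:30"}
    # With the schema, the zip code was never stored.
    print(store_with_schema(event))
    # With raw collection, a question nobody planned for is still answerable.
    print(query_raw([store_raw(event)], "customer_zip"))
```

In the first function, someone decided up front that a sale is about color, region, and size. In the second, nobody decided anything, which is exactly why the same record can later be put to uses its subject never imagined.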

And this is a dangerous thing.

When bank managers tried to restrict loans to residents of certain areas (known as redlining) Congress stepped in to stop it (with the Fair Housing Act of 1968). They were able to legislate against discrimination, making it illegal to change loan policy based on someone’s race.

Home Owners’ Loan Corporation map showing redlining of “hazardous” districts in 1936.

“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you — right? That’s just better service.

In one case, American Express used purchase history to adjust a customer’s credit limit based on where he shopped, despite his excellent credit history:

Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”

Some of the things white men liked in 2010, according to OKCupid.

We’re seeing the start of this slippery slope everywhere, from tailored credit-card limits like this one to car insurance based on driver profiles. In this regard, big data is a civil rights issue, but it’s one that society in general is ill-equipped to deal with.

We’re great at using taste to predict things about people. OKCupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned; some of it was a bit too controversial). They simply looked at the words one group used that others didn’t often use. The result was a list of “trigger” words for a particular race or gender.

Now run this backwards. If I know you like these things, or see you mention them in blog posts, on Facebook, or in tweets, then there’s a good chance I know your gender and your race, and maybe even your religion and your sexual orientation. And that I can personalize my marketing efforts towards you.

That makes it a civil rights issue.

If I collect information on the music you listen to, you might assume I will use that data in order to suggest new songs, or share it with your friends. But instead, I could use it to guess at your racial background. And then I could use that data to deny you a loan.

Want another example? Check out Private Data In Public Ways, something I wrote a few months ago after seeing a talk at Big Data London, which discusses how publicly available last name information can be used to generate racial boundary maps:

Screen from the Mapping London project.

This TED talk by Malte Spitz does a great job of explaining the challenges of tracking citizens today, and he speculates about whether the Berlin Wall would ever have come down if the Stasi had access to phone records in the way today’s governments do.

So how do we regulate the way data is used?

The only way to deal with this properly is to somehow link what the data is with how it can be used. I might, for example, say that my musical tastes should be used for song recommendation, but not for banking decisions.
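As a thought experiment, linking data to its permitted uses might look something like the sketch below. Everything here is hypothetical: `TaggedData`, `PurposeNotAllowed`, and the purpose names are invented, and the check only works if every consumer of the data agrees to go through it.

```python
from dataclasses import dataclass, field

class PurposeNotAllowed(Exception):
    """Raised when data is requested for a use its owner did not permit."""

@dataclass(frozen=True)
class TaggedData:
    """A value bundled with the purposes its owner has agreed to."""
    value: object
    allowed_purposes: frozenset = field(default_factory=frozenset)

    def use_for(self, purpose):
        """Release the value only for a permitted purpose."""
        if purpose not in self.allowed_purposes:
            raise PurposeNotAllowed(f"not permitted for {purpose!r}")
        return self.value

if __name__ == "__main__":
    tastes = TaggedData(
        value=["jazz", "bluegrass"],
        allowed_purposes=frozenset({"song_recommendation"}),
    )
    print(tastes.use_for("song_recommendation"))
    try:
        tastes.use_for("loan_decision")
    except PurposeNotAllowed as err:
        print("refused:", err)
```

The weakness is obvious: once the value has been released for one purpose, nothing in the code stops it from being copied and reused for another.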

Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation. Or it can be done through legislation, which has about as much chance of success as regulating spam: it feels great, but it’s damned hard to enforce.

There are brilliant examples of how a quantified society can improve the way we live, love, work, and play. Big data helps detect disease outbreaks, improve how students learn, reveal political partisanship, and save hundreds of millions of dollars for commuters — to pick just four examples. These are benefits we simply can’t ignore as we try to survive on a planet bursting with people and shaken by climate and energy crises.

But governments need to balance reliance on data with checks and balances about how this reliance erodes privacy and creates civil and moral issues we haven’t thought through. It’s something that most of the electorate isn’t thinking about, and yet it affects every purchase they make.

This should be fun.

This post originally appeared on Solve for Interesting. This version has been lightly edited.

Strata Conference + Hadoop World — The O’Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World.

Save 20% on registration with the code RADAR20


July 31 2012

Discovering science

The discovery of the Higgs boson gave us a window into the way science works. We’re over the hype and the high expectations kindled by last year’s pre-announcement. We’ve seen the moving personal interest story about Peter Higgs and how this discovery validates predictions he made almost 50 years ago, predictions that weren’t thought “relevant” at the time. Now we have an opportunity to do something more: to take a look at how science works and see what it is made of.


First and foremost: Science is about discovery. While the Higgs boson was the last piece in the puzzle for the Standard Model, the search for the Higgs wasn’t ultimately about verifying the Standard Model. The Standard Model has predicted a lot of things successfully; it’s pointless to say that it hasn’t served us well. A couple of years ago, I asked some physicists what would happen if they didn’t find the Higgs, and the answer was uniformly: “That would be the coolest thing ever! We’d have to develop a new understanding of how particle physics works.” At the time, I pointed out that not finding the Higgs might be exciting to physicists, but it would certainly be disastrous for the funding of high-energy physics projects. (“What? We spent all this money to build you a machine to find this particle, and now you say that particle doesn’t even exist?”) But science must move forward, and the desire to rebuild quantum mechanics trumps funding.

Now that we have the Higgs (or something like it), physicists are hoping for a “strange” Higgs: a particle that differs from the Higgs predicted by the Standard Model in some ways, a particle that requires a new theory. Indeed, to Nobel laureate Steven Weinberg, a Higgs that is exactly the Higgs predicted by the Standard Model would be a “nightmare.” Discovering something that’s more or less exactly what was predicted isn’t fun, and it isn’t interesting. And furthermore, there are other hints that there’s a lot of work to be done: dark matter and dark energy certainly hint at a physics that doesn’t fit into our current understanding. One irony of the Higgs is that, even if it’s “strange,” it focused too much attention on big, expensive science, to the detriment of valuable, though less dramatic (and less expensive) work.

Science is never so wrong as when it thinks that almost all the questions have been answered. In the late 19th century, scientists thought that physics was just about finished: all that was left were nagging questions about why an oven doesn’t incinerate us with an infinite blast of energy and some weird behavior when you shine ultraviolet light onto electrodes. Solving the pressing problems of black body radiation and the photoelectric effect required the idea of energy quanta, which led to all of 20th century physics. (Planck’s first steps toward quantum mechanics and Einstein’s work on the photoelectric effect earned them Nobel Prizes.) Science is not about agreement on settled fact; it’s about pushing into the unknown and about the intellectual ferment and discussion that takes place when you’re exploring new territory.

Approximation, not law

Second: Science is about increasingly accurate approximations to the way nature works. Newton’s laws of motion, force equals mass times acceleration and all that, served us well for hundreds of years, until Einstein developed special relativity. Now, here’s the trick: Newtonian physics is perfectly adequate for anything you or I are likely to do in our lifetimes, unless SpaceX develops some form of interstellar space travel. However, relativistic effects are observable, even on Earth: clocks run slightly slower in airliners and slightly faster on the tops of mountains. These effects aren’t measurable with your wristwatch, but they are measurable (and have been measured, with precisely the results predicted by Einstein) with atomic clocks. So, do we say Newtonian physics is “wrong”? It’s good enough, and any physicist would be shocked at a science curriculum that didn’t include Newtonian physics. But neither can we say that Newtonian physics is “right,” if “right” means anything more than “good enough.” Relativity implies a significantly different conception of how the universe works. I’d argue that it’s not just a better approximation, it’s a different (and more accurate) world view. This shift in world view as we go from Newton to Einstein is, to me, much more important than the slightly more accurate answers we get from relativity.
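To get a feel for how small these effects are, here is a back-of-the-envelope sketch using the standard low-speed and weak-field approximations. The speed (250 m/s) and height (4,000 m) are illustrative assumptions, and the function names are invented. A real airliner is both moving and at altitude, so both terms apply at once; the sketch keeps them separate.

```python
C = 299_792_458.0        # speed of light in m/s (exact by definition)
G = 9.81                 # approximate surface gravity in m/s^2
SECONDS_PER_DAY = 86_400

def velocity_shift(speed_m_s):
    """Fractional rate change of a moving clock, using the low-speed
    approximation -v^2 / (2 c^2). Negative means the clock runs slow."""
    return -(speed_m_s ** 2) / (2 * C ** 2)

def altitude_shift(height_m):
    """Fractional rate change of a clock raised by height_m, using the
    weak-field approximation g h / c^2. Positive means it runs fast."""
    return G * height_m / C ** 2

def nanoseconds_per_day(fractional_shift):
    """Convert a fractional rate change into nanoseconds gained per day."""
    return fractional_shift * SECONDS_PER_DAY * 1e9

if __name__ == "__main__":
    print(f"moving at 250 m/s: {nanoseconds_per_day(velocity_shift(250.0)):+.1f} ns/day")
    print(f"raised by 4000 m:  {nanoseconds_per_day(altitude_shift(4000.0)):+.1f} ns/day")
```

Both shifts come out at tens of nanoseconds per day: hopeless for a wristwatch, routine for an atomic clock.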

What do “right” and “wrong” mean in this context? Those terms are only marginally useful. And the notion of physical “law” is even less useful. “Laws” are really semantic constructs, and the appearance of the words “physical law” usually signifies that someone has missed the point. I cringe when I hear people talk about the “law of gravity,” because there’s no such thing; Newtonian gravity was replaced by Einsteinian general relativity (both a better approximation and a radically different view of how the universe works), and there are plenty of reasons to believe that general relativity isn’t the end of the story. The bottom line is that we don’t really know what gravity is or how it works, and all we really know about gravity is that our limited Einsteinian understanding probably doesn’t work for really small things and might not work for really big things. There are very good reasons to believe that gravitational waves exist (and we’re building the LIGO gravitational interferometer to detect them), but right now, they’re in the same category the Higgs boson was a decade ago. In theory, they should exist, and the universe will get a whole lot more interesting if we don’t find them. So, the only “law of gravity” we understand now is an approximation, and we have no idea what it approximates. And when we find a better approximation (one that explains dark energy, perhaps, or one that shows how gravity worked at the time of the Big Bang), that approximation will come with a significantly different world view.

Whatever we nominate as physical “law” is only law until we find a better law, a better approximation with its own story. Theories are replaced by better theories, which in turn are replaced by better theories. If we actually found a completely accurate “theory of everything” in any discipline, that might be the ultimate success, but it would also be tragic; it would be the end of science if it didn’t raise any further questions.

Simplicity and aesthetics

Aesthetics is a recurring principle both in the sciences (particularly physics) and in mathematics. It’s a particular kind of minimalist aesthetics: all things being equal, science prefers the explanation that makes the fewest assumptions. Gothic or rococo architecture doesn’t fit in. This principle has long been known as Occam’s Razor, and it’s worth being precise about what it means. We often hear about the “simplest explanation,” but merely being simple doesn’t make an explanation helpful. There are plenty of simple explanations. “The Earth sits on the back of four elephants, which stand on a turtle” is simple, but it makes lots of assumptions: the turtle must stand on something (“it’s turtles all the way down”), and the elephants and turtles need to eat. If we’re to accept the elephant theory, we have to assume that there are answers to these questions, otherwise they’re merely unexamined assumptions. Occam’s Razor is not about simplicity, but about assumptions. A theory that makes fewer assumptions may not be simpler, but almost always gives a better picture of reality.

One problem in physics is the number of variables that just have to have the right value to make the universe work. Each of these variables is an assumption, in a sense: they are what they are; we can’t say why any more than we can explain why we live in a three-dimensional universe. Physicists would like to reduce the number to 1 or, even better, 0: the universe would be as irreducible as π and derivable from pure mathematics. I admit that I find this drive a bit perplexing. I would expect the universe to have a large number of constants that just happen to have the right values and can’t be derived either from first principles or from other constants, especially since many modern cosmological theories suggest that universes are being created constantly and only a small number “work.”

However, the driving principle here is that we won’t get anywhere in understanding the universe by saying “what’s the matter with complexity?” In practice, the drive to provide simpler, more compelling descriptions has driven scientific progress. Copernicus’ heliocentric model for the solar system wasn’t more accurate than the geocentric Ptolemaic system. It took Kepler and elliptical orbits to make a heliocentric universe genuinely better. But the Ptolemaic model required lots of tinkering to make it work right, to make the cycles and epicycles fit their observational data about planetary motion.

There are many things about the universe that current theory can’t explain. The positive charge of a proton happens to equal the negative charge of an electron, but there’s no theoretical reason for them to be equal. If they weren’t equal, chemistry would be profoundly different, and life might not be possible. But the anthropic principle (physics is the way it is because we can observe it, and we can’t observe a universe in which we can’t exist) is ultimately unsatisfying; it’s only a clever way of leaving assumptions unchallenged.

Ultimately, the excitement of science has to do with challenging your assumptions about how things work. That challenge lies behind all successful scientific theories: can you call everything into question and see what lies behind the surface? What passes for “common sense” is usually nothing more than unexamined assumptions. To my mind, one of the most radical insights comes from relativity: since it doesn’t matter where you put the origin of your coordinate system, you can put the origin on the Earth if you want. In that sense, the Ptolemaic solar system isn’t “wrong.” The mathematics is more complex, but it all works. So, have we made progress? Counter-intuitive as relativity may seem, in relativity Einstein makes nowhere near as many assumptions as Ptolemy and his followers: very little is assumed besides the constancy of the speed of light and the force of gravity. The drive for such radical simplicity, as a way of forcing us to look behind our “common sense,” is at the heart of science.

Verification and five nines

In the search for the Higgs, we’ve often heard about “five nines,” or a chance of roughly 1 in 100,000 that the result is in error. Earlier results were inconclusive because the level of confidence was only “two nines,” or roughly one in 100. What’s the big difference? One in 100 seems like an acceptably small chance of error.

I asked O’Reilly author and astrophysicist Alasdair Allan (@aallan) about this, and he had an illuminating explanation. There is nothing magical about five nines, or two nines for that matter. The significance is that, if a physical phenomenon is real, if something is actually happening, then you ought to be able to collect enough data to get five nines confidence. There’s nothing “wrong” with an experiment that only gives you two nines, but if it’s actually telling you something real, you should be able to push it to five nines (or six, or seven, if you have enough time and data collecting ability). So, we know that the acceleration due to gravity on the surface of the Earth is 32.2 feet per second per second. In a high school physics lab, you can verify this to about two nines (maybe more if high schools have more sophisticated equipment than they did in my day). With more sophisticated equipment, pushing the confidence level to five nines is a trivial exercise. That’s exactly what happened with the Higgs: the initial results had a confidence level of about two nines, but in the past year, scientists were able to collect more data and get the confidence level up to five nines.
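The way confidence grows with data can be sketched with a textbook model: average n noisy measurements of a real effect, and the uncertainty of the average shrinks as the square root of n. This is an illustration only, not how the Higgs analysis was done. The effect size and noise level are invented, the function names are hypothetical, and the noise is assumed to be Gaussian. Physicists more often quote significance as a number of standard deviations; `two_sided_p_value` converts that into a chance of error.

```python
import math

def error_chance_from_nines(nines):
    """Two nines (99%) is a 1-in-100 chance of error, five nines is 1 in 100,000."""
    return 10.0 ** (-nines)

def z_score(effect, noise_sd, n):
    """How many standard errors the averaged effect sits away from zero."""
    return effect / (noise_sd / math.sqrt(n))

def two_sided_p_value(z):
    """Chance that pure Gaussian noise lands at least z standard errors from zero."""
    return math.erfc(z / math.sqrt(2))

def measurements_needed(effect, noise_sd, nines):
    """Smallest n at which the chance of error drops to the requested nines."""
    target = error_chance_from_nines(nines)
    n = 1
    while two_sided_p_value(z_score(effect, noise_sd, n)) > target:
        n += 1
    return n

if __name__ == "__main__":
    for nines in (2, 5):
        n = measurements_needed(effect=0.5, noise_sd=1.0, nines=nines)
        print(f"{nines} nines needs about {n} measurements")
```

If the effect is real, going from two nines to five costs only a few times more data. If the effect is zero, no amount of data gets you there, which is the point of the test.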

Does the Higgs become “real” at that point? Well, if it is real at all, it was real all along. But what this says is that there’s an experimental result that we can have confidence in and that we can use as the foundation for future results. Notice that this result doesn’t definitively say that CERN has found a Higgs boson, just that they’ve definitively found something that could be the Higgs (but that could prove to be something different).

Scientists are typically very careful about the claims they make for their results. Last year’s claims about “faster than light” neutrinos provide a great demonstration of how the scientific process works. The scientists who announced the result didn’t claim that they’d found neutrinos that traveled faster than light; they stated that they had a very strange result indicating that neutrinos traveled faster than light and wanted help from other scientists in understanding whether they had analyzed the data correctly. And even though many scientists were excited by the possibility that relativity would need to be re-thought, a serious effort was made to understand what the result could mean. Ultimately, of course, the researchers discovered that a cable had been attached incorrectly; when that problem was fixed, the anomalous results disappeared. So, we’re safe in a boring world: Neutrinos don’t travel faster than light, and theoretical physicists’ desire to rebuild relativity will have to wait.

While this looks like an embarrassment for science, it’s a great example of what happens when things go right. The scientific community went to work on several fronts: creating alternate theories (which have now all been discarded), exploring possible errors in the calculations (none were found), doing other experiments to measure the speed of neutrinos (no faster-than-light neutrinos were found), and looking for problems with the equipment itself (which they eventually found). Successful science is as much about mistakes and learning from them as it is about successes. And it’s not just neutrinos: Richard Muller, one of the most prominent skeptics on climate change, recently stated that examination of the evidence has convinced him that he was wrong, that “global warming was real … Human activity is almost entirely the cause.” It would be a mistake to view this merely as vindication for the scientists arguing for global warming. Good science needs skeptics; they force you to analyze the evidence carefully, and as in the neutrino case, prevent you from making serious errors. But good scientists also know how to change their minds when the evidence demands it.

If we’re going to understand how to take care of our world in the coming generations, we have to understand how science works. Science is being challenged at every turn: from evolution to climatology to health (Coke’s claim that there’s no connection between soft drinks and obesity, reminiscent of the tobacco industry’s attempts to discredit the link between lung cancer and smoking), we’re seeing a fairly fundamental attack on the basic tools of human understanding. You don’t have to look far to find claims that science is a big conspiracy, funded by whoever you choose to believe funds such conspiracies, or that something doesn’t need to be taken seriously because it’s just a “theory.”

Scientists are rarely in complete agreement, nor do they try to advance some secret agenda. They’re excited by the idea of tearing down their most cherished ideas, whether that’s relativity or the Standard Model. A Nobel Prize rarely awaits someone who confirms what everyone already suspects. But the absence of complete agreement doesn’t mean that there isn’t consensus, and that consensus needs to be taken seriously. Similarly, scientists are always questioning their data: both the data that supports their own conclusions and the data that doesn’t. I was disgusted by a Fox News clip implying that science was untrustworthy because scientists were questioning their theories. Of course they’re questioning their theories. That’s what scientists are supposed to do; that’s how science makes progress. But it doesn’t mean that those theories aren’t the most accurate models we have about how the world, and the universe itself, are put together. If we’re going to understand our world, and our impact on that world, we had better base our understanding on data and use the best models we have.

Higgs boson image via Wikimedia Commons.

July 30 2012

Open source won

I heard the comments a few times at the 14th OSCON: The conference has lost its edge. The comments resonated with my own experience: a shift in demeanor, a more purposeful, optimistic attitude, less itching for a fight. Yes, the conference has lost its edge; it doesn’t need one anymore.

Open source won. It’s not that an enemy has been vanquished or that proprietary software is dead; there’s just not much left to argue about when it comes to adopting open source. After more than a decade of the low-cost, lean startup culture successfully developing on open source tools, it’s clearly a legitimate, mainstream option for technology tools and innovation.

And open source is not just for hackers and startups. A new class of innovative, widely adopted technologies has emerged from the open source culture of collaboration and sharing — turning the old model of replicating proprietary software as open source projects on its head. Think GitHub, D3, Storm, Node.js, Rails, Mongo, Mesos or Spark.

We see more enterprise and government folks intermingling with the stalwart open source crowd who have been attending OSCON for years. And, these large organizations are actively adopting many of the open source technologies we track, e.g., web development frameworks, programming languages, content management, data management and analysis tools.

We hear fewer concerns about support or needing geek-level technical competency to get started with open source. In the Small and Medium Business (SMB) market we see mass adoption of open source for content management and ecommerce applications — even for self-identified technology newbies.

MySQL appears as popular as ever and remains open source after three years of Oracle control. Microsoft is pushing open source JavaScript as a key part of its web development environment and offering more explicit support for other open source languages. Oracle and Microsoft are not likely to radically change their business models, but their recent efforts show that open source can work in many business contexts.

Even more telling:

  • With so much of the consumer web undergirded with open source infrastructure, open source permeates most interactions on the web.
  • The massive, $100 million, GitHub investment validates the open collaboration model and culture — forking becomes normal.

What does winning look like? Open source is mainstream and a new norm for startups, small business, the enterprise and government. Innovative open source technologies are creating new business sectors and ecosystems (e.g., the distribution options, tools and services companies building around Hadoop). And what’s most exciting is the notion that the collaborative, sharing culture that permeates the open source community will spread to the enterprise and government with the same impact on innovation and productivity.

So, thanks to all of you who made the open source community a sustainable movement, the ones who were there when … and all the new folks embracing the culture. I can’t wait to see the new technologies, business sectors and opportunities you create.


July 21 2012

Overfocus on tech skills could exclude the best candidates for jobs

At the second RailsConf, David Heinemeier Hansson told the audience about a recruiter trying to hire someone with “5 years of experience with Ruby on Rails.” DHH told him “Sorry; I’ve only got 4 years.” We all laughed (I don’t think there’s anyone in the technical world who hasn’t dealt with a clueless recruiter), but little did we know this was the shape of things to come.

Last week, a startup in a relatively specialized area advertised a new engineering position for which they expected job candidates to have used their API. That raised a few eyebrows, not the least because it’s a sad commentary on the current jobs situation.

On one hand, we have high unemployment. But on the other hand, at least in the computing industry, there’s no shortage of jobs. I know many companies that are hiring, and all of them are saying they can’t find the people they want. I’m only familiar with the computer industry, which is often out of synch with the rest of the economy. Certainly, in Silicon Valley where you can’t throw a stone without hitting a newly-funded startup, we’d expect a chronic shortage of software developers. But a quick Google search will show you that the complaint is widespread: trucking, nursing, manufacturing, teaching, you’ll see the “lack of qualified applicants” complaint everywhere you look.

Is the problem that there are no qualified people? Or is the problem with the qualifications themselves?

There certainly have been structural changes in the economy, for better or for worse: many jobs have been shipped offshore, or eliminated through automation. And employers are trying to move some jobs back onshore for which the skills no longer exist in the US workforce. But I don’t believe that’s the whole story. A number of articles recently have suggested that the problem with jobs isn’t the workforce, it’s the employers: companies that are only willing to hire people who will drop in perfectly to the position that’s open. Hence, a startup requiring that applicants have developed code using their API.

It goes further: many employers are apparently using automated rejection services which (among other things) don’t give applicants the opportunity to make their case: there’s no human involved. There’s just a resume or an application form matched against a list of requirements that may be grossly out of touch with reality, generated by an HR department that probably doesn’t understand what they’re looking for, and that will never talk to the candidates they reject.

I suppose it’s a natural extension of data science to think that hiring can be automated. In the future, perhaps it will be. Even without automated application processing, it’s altogether too easy for an administrative assistant to match resumes against a checklist of “requirements” and turn everyone down: especially easy when the stack of resumes is deep. If there are lots of applications, and nobody fits the requirements, it must be the applicants’ fault, right? But at this point, rigidly matching candidates against inflexible job requirements isn’t a way to go forward.

Even for a senior position, if a startup is only willing to hire people who have already used its API, it is needlessly narrowing its applicant pool to a very small group. The candidates who survive may know the API already, but what else do they know? Are the best candidates in that group?

A senior position is likely to require a broad range of knowledge and experience, including software architecture, development methodologies, programming languages and frameworks. You don’t want to exclude most of the candidates by imposing extraneous requirements, even if those requirements make superficial sense. Does the requirement that candidates have worked with the API seem logical to an unseasoned executive or non-technical HR person? Yes, but it’s as wrong as you can get, even for a startup that expects new hires to hit the ground running.

The reports about dropping enrollments in computer science programs could give some justification to the claim that there’s a shortage of good software developers. But the ranks of software developers have never been filled by people with computer science degrees. In the early 80s, a friend of mine (a successful software developer) lamented that he was probably the last person to get a job in computing without a CS degree.

At the time, that seemed plausible, but in retrospect, it was completely wrong. I still see many people who build successful careers after dropping out of college, not completing high school, or majoring in something completely unrelated to computing. I don’t believe that they are the exceptions, nor should they be. The best way to become a top-notch software developer may well be to do a challenging programming-intensive degree program in some other discipline. But if the current trend towards overly specific job requirements and automated rejections continues, my friend will be proven correct, just about 30 years early.

A data science skills gap?

What about new areas like “data science”, where there’s a projected shortage of 1.5 million “managers and analysts”?

Well, there will most certainly be a shortage if you limit yourselves to people who have some kind of degree in data science, or a data science certification. (There are some degree programs, and no certifications that I’m aware of, though the related fields of Statistics and Business Intelligence are lousy with certifications). If you’re a pointy-haired boss who needs a degree or a certificate to tell you that a potential hire knows something in an area where you’re incompetent, you’re going to see a huge shortage of talent.

But as DJ Patil said in “Building Data Science Teams,” the best data scientists are not statisticians; they come from a wide range of scientific disciplines, including (but not limited to) physics, biology, medicine, and meteorology. Data science teams are full of physicists. The chief scientist of Kaggle, Jeremy Howard, has a degree in philosophy. The key job requirement in data science (as it is in many technical fields) isn’t demonstrated expertise in some narrow set of tools, but curiosity, flexibility, and willingness to learn. And the key obligation of the employer is to give its new hires the tools they need to succeed.

At this year’s Velocity conference, Jay Parikh talked about Facebook’s boot camp for bringing new engineers up to speed (this segment starts at about 3:30). New hires are expected to produce shippable code in the first week. There’s no question that they’re expected to come up to speed fast. But what struck me about boot camp is that it’s a 6-week program (plus a couple of additional weeks if you’re hired into operations) designed to surround new hires with the help they need to be successful. That includes mentors to help them work with the code base, review their code, integrate them into Facebook culture, and more. They aren’t expected to “hit the ground running.” They’re expected to get up to speed fast, and they’re given a lot of help to do so successfully.

Facebook has high standards for whom they hire, but boot camp demonstrates that they understand that successful hiring isn’t about finding the perfect applicant: it’s about what happens after the new employee shows up.

Last Saturday, I had coffee with Nathan Milford, US Operations manager for Outbrain. We discussed these issues, along with synthetic biology, hardware hacking, and many other subjects. He said “when I’m hiring someone, I look for an applicant that fits the culture, who is bright, and who is excited and wants to learn. That’s it. I’m not going to require that they come with prior experience in every component of our stack. Anyone who wants to learn can pick that up on the job.”

That’s the attitude we clearly need if we’re going to make progress.

July 20 2012

Data Jujitsu: The art of turning data into product

Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day.

How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution and instead concentrates on building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small.

We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”

How do we apply this idea to data? What is a data problem’s “weight,” and how do we use that weight against itself? These are the questions that we’ll work through in the subsequent sections.

To start, for me, a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem. That’s what we’ve been trained to do; it’s why we got into this game in the first place. But in my experience, meeting the problem head-on is a recipe for disaster. Building a great data product is extremely challenging, and the problem will always become more complex, perhaps intractable, as you try to solve it.

Before investing in a big effort, you need to answer one simple question: Does anyone want or need your product? If no one wants the product, all the analytical work you throw at it will be wasted. So, start with something simple that lets you determine whether there are any customers. To do that, you’ll have to take some clever shortcuts to get your product off the ground. Sometimes, these shortcuts will survive into the finished version because they represent some fundamentally good ideas that you might not have seen otherwise; sometimes, they’ll be replaced by more complex analytic techniques. In any case, the fundamental idea is that you shouldn’t solve the whole problem at once. Solve a simple piece that shows you whether there’s an interest. It doesn’t have to be a great solution; it just has to be good enough to let you know whether it’s worth going further (e.g., a minimum viable product).

Here’s a trivial example. What if you want to collect a user’s address? You might consider a free-form text box, but writing a parser that can identify a name, street number, apartment number, city, zip code, etc., is a challenging problem due to the complexity of the edge cases. Users don’t necessarily put in separators like commas, nor do they necessarily spell states and cities correctly. The problem becomes much simpler if you do what most web applications do: provide separate text areas for each field, and make states drop-down boxes. The problem becomes even simpler if you can populate the city and state from a zip code (or equivalent).
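To make the zip code shortcut concrete, here is a minimal sketch. The tiny `ZIP_TABLE` and the function name `prefill_city_state` are made up for illustration; a real product would load a full postal dataset rather than a hand-typed dictionary.

```python
from typing import Optional, Tuple

# Hypothetical, hand-made sample; a real table would come from a postal dataset.
ZIP_TABLE = {
    "94301": ("Palo Alto", "CA"),
    "10001": ("New York", "NY"),
}


def prefill_city_state(zip_code: str) -> Optional[Tuple[str, str]]:
    """Return (city, state) for a 5-digit zip code, or None if unknown."""
    cleaned = zip_code.strip()[:5]  # tolerate whitespace and ZIP+4 input
    return ZIP_TABLE.get(cleaned)
```

When the lookup returns `None`, the form can simply fall back to asking for the city and state directly, so the shortcut never blocks the user.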

Now for a less trivial example. A LinkedIn profile includes a tremendous amount of information. Can we use a profile like this to build a recommendation system for conferences? The answer is “yes.” But before answering “how,” it’s important to step back and ask some fundamental questions:

A) Does the customer care? Is there a market fit? If there isn’t, there’s no sense in building an application.

B) How long do we have to learn the answer to Question A?

We could start by creating and testing a full-fledged recommendation engine. This would require an information extraction system, an information retrieval system, a model training layer, a front end with a well-designed user interface, and so on. It might take well over 1,000 hours of work before we find out whether the user even cares.

Instead, we could build a much simpler system. Among other things, the LinkedIn profile lists books.

Book recommendations from LinkedIn profile

Books have ISBN numbers, and ISBN numbers are tagged with keywords. Similarly, there are catalogs of events that are also cataloged with keywords (Lanyrd is one). We can do some quick and dirty matching between keywords, build a simple user interface, and deploy it in an ad slot to a limited group of highly engaged users. The result isn’t the best recommendation system imaginable, but it’s good enough to get a sense of whether the users care. Most importantly, it can be built quickly (e.g., in a few days, if not a few hours). At this point, the product is far from finished. But now you have something you can test to find out whether customers are interested. If so, you can then gear up for the bigger effort. You can build a more interactive user interface, add features, integrate new data in real time, and improve the quality of the recommendation engine. You can use other parts of the profile (skills, groups and associations, even recent tweets) as part of a complex AI or machine learning engine to generate recommendations.
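The “quick and dirty matching” could be as simple as counting shared keywords. The sketch below assumes the book and event keywords have already been fetched; the function name `recommend_events` and the sample data are invented for illustration and are not taken from any real catalog.

```python
def recommend_events(book_keywords, events, top_n=3):
    """Rank events by how many keywords they share with the user's books."""
    user_terms = {k.lower() for k in book_keywords}
    scored = []
    for name, keywords in events.items():
        overlap = user_terms & {k.lower() for k in keywords}
        if overlap:
            scored.append((len(overlap), name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Made-up sample data for illustration only.
books = ["Machine Learning", "Statistics", "Python"]
events = {
    "Data Conference": ["statistics", "machine learning", "hadoop"],
    "Web Summit": ["javascript", "python"],
    "Gardening Expo": ["plants"],
}
```

Calling `recommend_events(books, events)` ranks the data conference ahead of the web event and drops the event with no overlap. It is crude, but it is enough to put in front of users and see whether anyone clicks.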

The key is to start simple and stay simple for as long as possible. Ideas for data products tend to start simple and become complex; if they start complex, they become impossible. But starting simple isn’t always easy. How do you solve individual parts of a much larger problem? Over time, you’ll develop a repertoire of tools that work for you. Here are some ideas to get you started.

Use product design

One of the biggest challenges of working with data is getting the data in a useful form. It’s easy to overlook the task of cleaning the data and jump to trying to build the product, but you’ll fail if getting the data into a usable form isn’t the first priority. For example, let’s say you have a simple text field into which the user types a previous employer. How many ways are there to type “IBM”? A few dozen? In fact, thousands: everything from “IBM” and “I.B.M.” to “T.J. Watson Labs” and “Netezza.” Let’s assume that to build our data product it’s necessary to have all these names tied to a common ID. One common approach to disambiguate the results would be to build a relatively complex artificial intelligence engine, but this would take significant time. Another approach would be to have a drop-down list of all the companies, but this would be a horrible user experience due to the length of the list and limited flexibility in choices.

What about Data Jujitsu? Is there a much simpler and more reliable solution? Yes, but not in artificial intelligence. It’s not hard to build a user interface that helps the user arrive at a clean answer. For example, you can:

  • Support type-ahead, encouraging the user to select the most popular term.
  • Prompt the user with “did you mean … ?”
  • If at this point you still don’t have anything usable, ask the user for more help: Ask for a stock ticker symbol or the URL of the company’s home page.

The point is to have a conversation rather than just a form. Engage the user to help you, rather than relying on analysis. You’re not just getting the user more involved (which is good in itself), you’re getting clean data that will simplify the work for your back-end systems. As a matter of practice, I’ve found that trying to solve a problem on the back end is 100-1,000 times more expensive than on the front end.
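A “did you mean … ?” prompt does not need an AI engine. Here is a minimal sketch using fuzzy string matching from Python’s standard `difflib` module; the company list, the `normalize` helper, and the 0.6 cutoff are assumptions chosen for illustration, not a description of any production system.

```python
import difflib

# Hypothetical canonical company list, keyed by normalized name.
KNOWN_COMPANIES = {"ibm": "IBM", "intel": "Intel", "google": "Google"}


def normalize(text):
    # Lowercase and drop punctuation/spaces so "I.B.M." and "ibm" compare equal.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def suggest_company(user_input, cutoff=0.6):
    """Return the canonical name to offer the user, or None if nothing is close."""
    key = normalize(user_input)
    if key in KNOWN_COMPANIES:
        return KNOWN_COMPANIES[key]
    close = difflib.get_close_matches(key, list(KNOWN_COMPANIES), n=1, cutoff=cutoff)
    return KNOWN_COMPANIES[close[0]] if close else None
```

A `None` result is the cue to continue the conversation and ask for a ticker symbol or home page URL. Something like “T.J. Watson Labs” will never fuzzy-match “IBM,” which is exactly why the follow-up question matters.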

When in doubt, use humans

As technologists, we are predisposed to look for scalable technical solutions. We often jump to technical solutions before we know what solutions will work. Instead, see if you can break down the task into bite-size portions that humans can do, then figure out a technical solution that allows the process to scale. Amazon’s Mechanical Turk is a system for posting small problems online and paying people a small amount (typically a couple of cents) for solutions. It’s come to the rescue of many an entrepreneur who needed to get a product off the ground quickly but didn’t have months to spend on developing an analytical solution.

Here’s an example. A camera company wanted to test a product that would tell restaurant owners how many tables were occupied or empty during the day. If you treat this problem as an exercise in computer vision, it’s very complex. It can be solved, but it will take some PhDs, lots of time, and large amounts of computing power. But there’s a simpler solution. Humans can easily look at a picture and tell whether or not a table has anyone seated at it. So the company took images at regular intervals and used humans to count occupied tables. This gave them the opportunity to test their idea and determine whether the product was viable before investing in a solution to a very difficult problem. It also gave them the ability to find out what their customers really wanted to know: just the number of occupied tables? The average number of people at each table? How long customers stayed at the table? That way, when they start to build the real product, using computer vision techniques rather than humans, they know what problem to solve.

Humans are also useful for separating valid input from invalid. Imagine building a system to collect recipes for an online cookbook. You know you’ll get a fair amount of spam; how do you separate out the legitimate recipes? Again, this is a difficult problem for artificial intelligence without substantial investment, but a fairly simple problem for humans. When getting started, we can send each page to three people via Mechanical Turk. If all agree that the recipe is legitimate, we can use it. If all agree that the recipe is spam, we can reject it. And if the vote is split, we can escalate by trying another set of reviewers or adding additional data to those additional reviewers that allows them to make a better assessment. The key thing is to watch for the signals the humans use to make their decisions. When we’ve identified those signals, we can start building more complex automated systems. By using humans to solve the problem initially, we can learn a great deal about the problem at a very low cost.
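The three-reviewer rule is simple enough to write down directly. This sketch assumes each reviewer’s answer has already been collected as a boolean; the function name `triage` and the label strings are my own.

```python
def triage(votes):
    """votes: list of booleans, True meaning 'looks like a legitimate recipe'."""
    if not votes:
        return "escalate"  # no answers yet, so nothing can be decided
    if all(votes):
        return "accept"
    if not any(votes):
        return "reject"
    return "escalate"  # split vote: send to another set of reviewers
```

Logging which submissions end up in the escalated pile is a cheap way to start collecting the signals you will eventually want an automated classifier to learn.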

Aardvark (a promising startup that was acquired by Google) took a similar path. Their goal was to build a question and answer service that routed users’ questions to real people with “inside knowledge.” For example, if a user wanted to know a good restaurant for a first date in Palo Alto, Calif., Aardvark would route the question to people living in the broader Palo Alto area, then compile the answers. They started by building tools that would allow employees to route the questions by hand. They knew this wouldn’t scale, but it let them learn enough about the routing problem to start building a more automated solution. The human solution not only made it clear what they needed to build, it proved that the technical solution was worth the effort and bought them the time they needed to build it.

In both cases, if you were to graph the work expended versus time, it would look something like this:

Work vs Time graph

Ignore the fact that I’ve violated a fundamental law of data science and presented a graph without scales on the axes. The point is that technical solutions will always win in the long run; they’ll always be more efficient, and even a poor technical solution is likely to scale better than using humans to answer questions. But when you’re getting started, you don’t care about the long run. You just want to survive long enough to have a long run, to prove that your product has value. And in the short term, human solutions require much less work. Worry about scaling when you need to.

Be opportunistic for wins

I’ve stressed building the simplest possible thing, even if you need to take shortcuts that appear to be extreme. Once you’ve got something working and you’ve proven that users want it, the next step is to improve the product. Amazon provides a good example. Back when they started, Amazon pages contained product details, reviews, the price, and a button to buy the item. But what if the customer isn’t sure he’s found what he wants and wants to do some comparison shopping? That’s simple enough in the real world, but in the early days of Amazon, the only alternative was to go back to the search engine. This is a “dead end flow”: Once the user has gone back to the search box, or to Google, there’s a good chance that he’s lost. He might find the book he wants at a competitor, even if Amazon sells the same product at a better price.

Amazon needed to build pages that channeled users into other related products; they needed to direct users to similar pages so that they wouldn’t lose the customer who didn’t buy the first thing he saw. They could have built a complex recommendation system, but opted for a far simpler system. They did this by building collaborative filters to add “People who viewed this product also viewed” to their pages. This addition had a profound effect: Users can do product research without leaving the site. If you don’t see what you want at first, Amazon channels you into another page. It was so successful that Amazon has developed many variants, including “People who bought this also bought” (so you can load up on accessories), and so on.

The collaborative filter is a great example of starting with a simple product that becomes a more complex system later, once you know that it works. As you begin to scale the collaborative filter, you have to track the data for all purchases correctly, build the data stores to hold that data, build a processing layer, develop the processes to update the data, and deal with relevancy issues. Relevance can be tricky. When there’s little data, it’s easy for a collaborative filter to give strange results; with a few errant clicks in the database, it’s easy to get from fashion accessories to power tools. At the same time, there are still ways to make the problem simpler. It’s possible to do the data analysis in a batch mode, reducing the time pressure; rather than compute “People who viewed this also viewed” on the fly, you can compute it nightly (or even weekly or monthly). You can make do with the occasional irrelevant answer (“People who bought leather handbags also bought power screwdrivers”), or perhaps even use Mechanical Turk to filter your pre-computed recommendations. Or even better, ask the users for help.
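To show how little is needed for a first batch version, here is a sketch that counts which products appear together in the same browsing session. This is plain co-occurrence counting, a simple stand-in for a collaborative filter and not a description of Amazon’s actual system; the function name and the sample sessions are made up.

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_viewed(sessions, top_n=3):
    """Map each product to the products most often viewed in the same session."""
    co_views = defaultdict(Counter)
    for viewed in sessions:
        # Count each ordered pair of distinct products once per session.
        for a, b in permutations(set(viewed), 2):
            co_views[a][b] += 1
    result = {}
    for product, counts in co_views.items():
        # Most frequent first; ties broken alphabetically for stable output.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[product] = [other for other, _ in ranked[:top_n]]
    return result


# Made-up browsing sessions for illustration.
sessions = [
    ["handbag", "wallet", "scarf"],
    ["handbag", "wallet"],
    ["handbag", "screwdriver"],
]
```

With this sample, a single errant session is enough to put the screwdriver on the handbag page, which is the relevance problem in miniature. Running the job nightly and requiring a minimum count before showing a pairing are two easy ways to soften it.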

Being opportunistic can be done with analysis of general products, too. The Wall Street Journal chronicles a case in which Zynga was able to rapidly build on a success in their game FishVille. You can earn credits to buy fish, but you can also purchase credits. The Zynga Analytics team noticed that a particular set of fish was being purchased at six times the rate of all the other fish. Zynga took the opportunity to design several similar virtual fish, for which they charged $3 to $4 each. The data showed that they clearly had stumbled on to something: the trait the popular fish shared, and the one customers wanted, was translucence. By combining quick observations with lightweight tests, they were able to significantly add to their profits.

Ground your product in the real world

We can learn more from Amazon’s collaborative filters. What happens when you go into a physical store to buy something, say, headphones? You might look for sale prices, you might look for reviews, but you almost certainly don’t just look at one product. You look at a few, most likely something located near whatever first caught your eye. By adding “People who viewed this product also viewed,” Amazon built a similar experience into the web page. In essence, they “grounded” their virtual experience to a similar one in the real world via data.

LinkedIn’s People You May Know embodies both Data Jujitsu and grounding the product in the real world. Think about what happens when you arrive at a conference reception. You walk around the outer edge until you find someone you recognize, then you latch on to that person until you see some more people you know. At that point, your interaction style changes: Once you know there are friendly faces around, you’re free to engage with people you don’t know. (It’s a great exercise to watch this happen the next time you attend a conference.)

The same kind of experience takes place when you join a new social network. The first data scientists at LinkedIn recognized this and realized that their online world had two big challenges. First, because it is a website, you can’t passively walk around the outer edges of the group. It’s like looking for friends in a darkened room. Second, LinkedIn is fighting for every second you stay on its site; it’s not like a conference where you’re likely to have a drink or two while looking for friends. There’s a short window, really only a few seconds, for you to become engaged. If you don’t see any point to the site, you click somewhere else and you’re gone.

Earlier attempts to solve this problem, such as address book importers or search facilities, imposed too much friction. They required too much work for the poor user, who still didn’t understand why the site was valuable. But our LinkedIn team realized that a few simple heuristics could be used to determine a set of “people you may know.” We didn’t have the resources to build a complete solution. But to get something started, we could run a series of simple queries on the database: “what do you do,” “where do you live,” “where did you go to school,” and other questions that you might ask someone you met for the first time. We also used triangle closing (if Jane is connected to Mark, and Mark is connected to Sally, Sally and Jane have a high likelihood of knowing each other). To test the idea, we built a customized ad that showed each user the three people they were most likely to know. Clicking on one of those people took you to the “add connection” page. (Of course, if you saw the ad again, the results would have been the same, but the point was to quickly test with minimal impact to the user.) The results were overwhelming; it was clear that this needed to become a full-blown product, and it was quickly replicated by Facebook and all other social networks. Only after realizing that we had a hit on our hands did we do the work required to build the sophisticated machinery necessary to scale the results.
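Triangle closing can be sketched in a few lines: count how many connections a candidate shares with the user, and rank by that count. This is an illustrative toy, not the code we ran; the function name, the adjacency-set representation, and the sample graph are all assumptions.

```python
from collections import Counter


def people_you_may_know(connections, user, top_n=3):
    """Rank non-connections by the number of mutual connections with `user`."""
    direct = connections.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in connections.get(friend, set()):
            # Skip the user and anyone they are already connected to.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    ranked = sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Made-up undirected graph, stored as adjacency sets.
graph = {
    "Jane": {"Mark", "Ann"},
    "Mark": {"Jane", "Sally"},
    "Ann": {"Jane", "Sally", "Bob"},
    "Sally": {"Mark", "Ann"},
    "Bob": {"Ann"},
}
```

For Jane, Sally comes out on top because two of Jane’s connections already know her. The default `top_n=3` mirrors the three suggestions shown in the test ad.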

After People You May Know, our LinkedIn team realized that we could use a similar approach to build Groups You May Like. We built it almost as an exercise, when we were familiarizing ourselves with some new database technologies. It took under a week to build the first version and get it on to the home page, again using an ad slot. In the process, we learned a lot about the limitations and power of a recommendation system. On one hand, the numbers showed that people really loved the product. But additional filter rules were needed: Users didn’t like it when the system recommended political or religious groups. In hindsight, this seems obvious, almost funny, but it would have been very hard to anticipate all the rules we needed in advance. This lightweight testing gave us the flexibility to add rules as we discovered we needed them. Since we needed to test our new databases anyway, we essentially got this product “for free.” It’s another great example of a group that did something successful, then immediately took advantage of the opportunities for further wins.

Give data back to the user to create additional value

By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both. They may accept that, but I don’t know anyone who’s happy about it. But giving data back to the user is a way of showing that you’re on their side, increasing their engagement with your product.

How do you give data back to the user? LinkedIn has a product called “Who’s Viewed Your Profile.” This product lists the people who have viewed your profile (respecting their privacy settings, of course), and provides statistics about the viewers. There’s a time series view, a list of search terms that have been used to find you, and the geographical areas in which the viewers are located. It’s timely and actionable data, and it’s addictive. It’s visible on everyone’s home page, and it shows the number of profile views, so it’s not static. Every time you look at your LinkedIn page, you’re tempted to click.

Who Viewed Profile box from LinkedIn

And people do click. Engagement is so high that LinkedIn has two versions: one free, and the other part of the subscription package. This product differentiation benefits the casual user, who can see some summary statistics without being overloaded with more sophisticated features, while providing an easy upgrade path for more serious users.

LinkedIn isn’t the only product that provides data back to the user. Xobni analyzes your email to provide better contact management and help you control your inbox. Mint (acquired by Intuit) studies your credit cards to help you understand your expenses and compare them to others in your demographic. Pacific Gas and Electric has a SmartMeter that allows you to analyze your energy usage. We’re even seeing health apps that take data from your phone and other sensors and turn it into a personal dashboard.

In short, everyone reading this has probably spent the last year or more of their professional life immersed in data. But it’s not just us. Everyone, including users, has awakened to the value of data. Don’t hoard it; give it back, and you’ll create an experience that is more engaging and more profitable for both you and your company.

No data vomit

As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is “data vomit.” It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want.

When we were building the prototype for “Who’s Viewed Your Profile,” we created an early version that showed all sorts of amazing data, with a fantastic ability to drill down into the detail. How many clicks did we get when we tested it? Zero. Why? An “inverse interaction law” applies to most users: The more data you present, the less interaction.

Cool interactions graph

The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take? If you want them to be impressed with the number of things that you can do with the data, then you’re likely producing data vomit. If you’re able to lead them to a clear set of actions, then you’ve built a product with a clear focus.

Expect unforeseen side effects

Of course, it’s impossible to avoid unforeseen side effects completely, right? That’s what “unforeseen” means. However, unforeseen side effects aren’t a joke. One of the best examples of an unforeseen side effect is “My TiVo Thinks I’m Gay.” Most digital video recorders have a recommendation system for other shows you might want to watch; they’ve learned from Amazon. But there are cases where a user has watched a particular show (say “Will & Grace”), and then the recorder recommends other shows with similar themes (“The Ellen DeGeneres Show,” “Queer as Folk,” etc.). Along similar lines, an Anglo friend of mine who lives in a neighborhood with many people from Southeast Asia recently told me that his Netflix recommendations are overwhelmed with Bollywood films.

This sounds funny, and it’s even been used as the basis of a sitcom plot. But it’s a real pain point for users. Outsmarting the recommendation engine once it has “decided” what you want is difficult and frustrating, and you stand a good chance of losing the customer. What’s going wrong? In the case of the Bollywood recommendations, the algorithm is probably overemphasizing the movies that have been watched by the surrounding population. With the TiVo, there’s no easy way to tell the system that it’s wrong. Instead, you’re forced to try to outfox it, and users who have tried have discovered that it’s hard to outthink an intelligent agent that has gotten the wrong idea.

Improving precision and recall

What tools do we have to think about bad results — things like unfortunate recommendations and collaborative filtering gone wrong? Two concepts, precision and recall, let us describe the problem more precisely. Here’s what they mean:

Precision — The ability to provide a result that exactly matches what’s desired. If you’re building a recommendation engine, can you give a good recommendation every time? If you’re displaying advertisements, will every ad result in a click? That’s high precision.

Recall — The set of possible good recommendations. Recall is fundamentally about inventory: Good recall means that you have a lot of good recommendations, or a lot of advertisements that you can potentially show the user.

It’s obvious that you’d like to have both high precision and high recall. For example, if you’re showing a user advertisements, you’d be in heaven if you have a lot of ads to show, and every ad has a high probability of resulting in a click. Unfortunately, precision and recall often work against each other: As precision increases, recall drops, and vice versa. The number of ads that have a 95% chance of resulting in a click is likely to be small indeed, and the number of ads with a 1% chance is obviously much larger.
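To make the tradeoff concrete, here is a minimal sketch that uses the standard information-retrieval formulas (precision = relevant items shown / items shown; recall = relevant items shown / all relevant items), which are more formal than the descriptions above. The click probabilities and click outcomes are made-up numbers for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted_probs, clicked, threshold):
    """Compute precision and recall for ads shown at or above `threshold`.

    predicted_probs: predicted click probability per ad (assumed model output)
    clicked: whether the user would actually click each ad (ground truth)
    """
    shown = [p >= threshold for p in predicted_probs]
    true_pos = sum(1 for s, c in zip(shown, clicked) if s and c)
    n_shown = sum(shown)
    n_relevant = sum(clicked)
    # By convention here, an empty denominator yields 1.0 (nothing shown, nothing wrong).
    precision = true_pos / n_shown if n_shown else 1.0
    recall = true_pos / n_relevant if n_relevant else 1.0
    return precision, recall


# Made-up data for illustration only.
probs = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
clicks = [True, True, False, True, False, False]

for t in (0.9, 0.5, 0.1):
    p, r = precision_recall(probs, clicks, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

With this toy data, lowering the threshold from 0.9 to 0.1 shows more ads: recall rises from 0.33 to 1.00 while precision falls from 1.00 to 0.60.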

So, an important issue in product design is the tradeoff between precision and recall. If you’re working on a search engine, precision is the key, and having a large inventory of plausible search results is irrelevant. Results that will satisfy the user need to get to the top of the page. Low-precision search results yield a poor experience.

On the other hand, low-precision ads are almost harmless (perhaps because they’re low precision, but that’s another matter). It’s hard to know what advertisement will elicit a click, and generally it’s better to show a user something than nothing at all. We’ve seen enough irrelevant ads that we’ve learned to tune them out effectively.

The difference between these two cases is how the data is presented to the user. Search data is presented directly: If you search Google for “data science,” you’ll get 1.16 billion results in 0.47 seconds (as of this writing). The results on the first few pages will all have the term “data science” in them. You’re getting results directly related to your search; this makes intuitive sense. But the rationale behind advertising content is obfuscated. You see ads, but you don’t know why you were shown those ads. Nothing says, “We showed you this ad because you searched for data science and we know you live in Virginia, so here’s the nearest warehouse for all your data needs.” Since the relationship between the ad and your interests is obfuscated, it’s hard to judge an ad harshly for being irrelevant, but it’s also not something you’re going to pay attention to.

Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision.


Another issue to contend with is subjectivity: How does the user perceive the results? One product at LinkedIn delivers a set of up to 10 job recommendations. The problem is that users focus on the bad recommendations, rather than the good ones. If nine results are spot on and one is off, the user will leave thinking that the entire product is terrible. One bad experience can spoil a consistently good experience. If, over five web sessions, we show you 49 perfect results in a row, but the 50th one doesn’t make sense, the damage is still done. It’s not quite as bad as if the bad result appeared in the first session, but it’s still done, and it’s hard to recover. The most common guideline is to strive for a distribution in which there are many good results, a few great ones, and no bad ones.

That’s only part of the story. You don’t really know what the user will consider a poor recommendation. Here are two sets of job recommendations:

Jobs You May Be Interested In example 1

Jobs You May Be Interested In example 2

What’s important: The job itself? Or the location? Or the title? Will the user consider a recommendation “bad” if it’s a perfect fit, but requires him to move to Minneapolis? What if the job itself is a great fit, but the user really wants “senior” in the title? You really don’t know. It’s very difficult for a recommendation engine to anticipate issues like these.

Enlisting other users

One jujitsu approach to solving this problem is to flip it around and use the social system to our advantage. Instead of sending these recommendations directly to the user, we can send the recommendations to their connections and ask them to pass along the relevant ones. Let’s suppose Mike sends me a job recommendation that, at first glance, I don’t like. One of these two things is likely to happen:

  • I’ll take a look at the job recommendation and realize it is a terrible recommendation and it’s Mike’s fault.
  • I’ll take a look at the job recommendation and try to figure out why Mike sent it. Mike may have seen something in it that I’m missing. Maybe he knows that the company is really great.

At no time is the system being penalized for making a bad recommendation. Furthermore, the product is producing data that now allows us to better train the models and increase overall precision. Thus, a little twist in the product can make a hard relevance problem disappear. This kind of cleverness takes a problem that’s extraordinarily challenging and gives you an edge to make the product work.

Referral Center example from LinkedIn

Ask and you shall receive

We often focus on getting a limited set of data from a user. But done correctly, you can engage the user to give you more useful, high-quality data. For example, if you’re building a restaurant recommendation service, you might ask the user for his or her zip code. But if you also ask for the zip code where the user works, you have much more information. Not only can you make recommendations for both locations, but you can predict the user’s typical commute patterns and make recommendations along the way. You increase your value to the user by giving the user a greater diversity of recommendations.

In keeping with Data Jujitsu, predicting commute patterns probably shouldn’t be part of your first release; you want the simplest thing that could possibly work. But asking for the data gives you the potential for a significantly more powerful and valuable product.

Take care not to simply demand data. You need to explain to the user why you’re asking for data; you need to disarm the user’s resistance to providing more information by telling him that you’re going to provide value (in this case, more valuable recommendations), rather than abusing the data. It’s essential to remember that you’re having a conversation with the user, rather than giving him a long form to fill out.

Anticipate failure

As we’ve seen, data products can fail because of relevance problems arising from the tradeoff between precision and recall. Design your product with the assumption that it will fail. And in the process, design it so that you can preserve the user experience even if it fails.

Two data products that demonstrate extremes in user experience are Sony’s AIBO (a robotic pet), and interactive voice response systems (IVR), such as the ones that answer the phone when you call an airline to change a flight.

Let’s consider the AIBO first. It’s a sophisticated data product. It takes in data from different sensors and uses this data to train models so that it can respond to you. What do you do if it falls over or does something similarly silly, like getting stuck walking into a wall? Do you kick it? Curse at it? No. Instead, you’re likely to pick it up and help it along. You are effectively compensating for when it fails. Let’s suppose instead of being a robotic dog, it was a robot that brought hot coffee to you. If it spilled the coffee on you, what would your reaction be? You might both kick it and curse at it. Why the difference? The difference is in the product’s form and execution. By making the robot a dog, Sony limited your expectations; you’re predisposed to cut the robot slack if it doesn’t perform correctly.

Now, let’s consider the IVR system. This is also a sophisticated data product. It tries to understand your speech and route you to the right person, which is no simple task. When you call one of these systems, what’s your first response? If it is voice activated, you might say, “operator.” If that doesn’t work, maybe you’ll say “agent” or “representative.” (I suspect you’ll be wanting to scream “human” into the receiver.) Maybe you’ll start pressing the button “0.” Have you ever gone through this process and felt good? More often than not, the result is frustration.

What’s the difference? The IVR product inserts friction into the process (at least from the customer’s perspective), and limits his ability to solve a problem. Furthermore, there isn’t an easy way to override the system. Users think they’re up against a machine that thinks it is smarter than they are, and that is keeping them from doing what they want. Some could argue that this is a design feature, that adding friction is a way of controlling the amount of interaction with customer service agents. But the net result is frustration for the customer.

You can give your data product a better chance of success by carefully setting the users’ expectations. The AIBO sets expectations relatively low: A user doesn’t expect a robotic dog to be much other than cute. Let’s think back to the job recommendations. By using Data Jujitsu and sending the results to the recipient’s network, rather than directly to him, we create a product that doesn’t act like an overly intelligent machine that the user is going to hate. By enlisting a human to do the filtering, we put a human face behind the recommendation.

One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected? A product like the AIBO, or like job recommendations sent via a friend, is structured so that the user is predisposed toward feeling good after he’s finished.

In many applications, a design treatment that gives the user control over the outcome can go far to create interactions that leave the user feeling good. For example, if you’re building a collaborative filter, you will inevitably generate incorrect recommendations. But you can allow the user to tell you about poor recommendations with a button that allows the user to “X” out recommendations he doesn’t like.
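A minimal sketch of the "X out" control, assuming recommendations are identified by simple ids and that dismissals are kept per user. The `RecommendationFeed` class and its method names are hypothetical; a real system would also persist the dismissals and feed them back into model training.

```python
class RecommendationFeed:
    """Tracks items a user has dismissed and hides them from later results."""

    def __init__(self):
        self._dismissed = {}  # user_id -> set of dismissed item ids

    def dismiss(self, user_id, item_id):
        """Record that the user clicked the "X" on an item."""
        self._dismissed.setdefault(user_id, set()).add(item_id)

    def visible(self, user_id, candidates):
        """Return candidates in their original order, minus anything this user dismissed."""
        hidden = self._dismissed.get(user_id, set())
        return [item for item in candidates if item not in hidden]
```

Even this trivial version changes the interaction: the user has a direct way to say "this is wrong," instead of having to outfox the system.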

Facebook uses this design technique when they show you an ad. They also give you control to hide the ad, as well as an opportunity to tell them why you don’t think the ad is relevant. The choices they give you range from not being relevant to being offensive. This provides an opportunity to engage users as well as give them control. It turns annoyance into empowerment; rather than being a victim of the bad ad targeting, users get to feel that they can make their own recommendations about which ads they will see in the future.

Facebook ad targeting and customization

Putting Data Jujitsu into practice

You’ve probably recognized some similarities between Data Jujitsu and some of the thought behind agile startups: Data Jujitsu embraces the notion of the minimum viable product and the simplest thing that could possibly work. While these ideas make intuitive sense, as engineers, many of us have to struggle against the drive to produce a beautiful, fully-featured, massively complex solution. There’s a reason that Rube Goldberg cartoons are so attractive. Data Jujitsu is all about saying “no” to our inner Rube Goldberg.

I talked at the start about getting clean data. It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. If you can come up with strategies for data entry that are inherently clean (such as populating city and state fields from a zip code), you’re much better off. Work done up front in getting clean data will be amply repaid over the course of the project.
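As a sketch of the zip code example, the form can ask only for the ZIP code and derive city and state from a lookup table, so users never type free-form city names. The two-entry `ZIP_TABLE` and the `fill_city_state` helper are hypothetical; a real product would load a complete ZIP code dataset.

```python
# Tiny illustrative lookup table; a real product would load a full ZIP code dataset.
ZIP_TABLE = {
    "94043": ("Mountain View", "CA"),
    "10001": ("New York", "NY"),
}


def fill_city_state(form, zip_table=ZIP_TABLE):
    """Return a copy of `form` with city and state derived from its ZIP code.

    Unknown or malformed ZIP codes leave city and state as None so the
    caller can ask the user instead of storing a guess.
    """
    zip_code = str(form.get("zip", "")).strip()[:5]  # keep the 5-digit part of ZIP+4
    city, state = zip_table.get(zip_code, (None, None))
    return {**form, "zip": zip_code, "city": city, "state": state}
```

For example, `fill_city_state({"name": "Ann", "zip": "94043-1351"})` yields a record whose city is "Mountain View" and whose state is "CA", with no chance of a misspelled city entering the data.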

A surprising amount of Data Jujitsu is about product design and user experience. If you can design your product so that users are predisposed to cut it some slack when it’s wrong (like the AIBO or, for that matter, the LinkedIn job recommendation engine), you’re way ahead. If you can enlist your users to help, you’re ahead on several levels: You’ve made the product more engaging, and you’ve frequently taken a shortcut around a huge data problem.

The key aspect of making a data product is putting the “product” first and “data” second. Saying it another way, data is one mechanism by which you make the product user-focused. With all products, you should ask yourself the following three questions:

  1. What do you want the user to take away from this product?
  2. What action do you want the user to take because of the product?
  3. How should the user feel during and after using your product?

If your product is successful, you will have plenty of time to play with complex machine learning algorithms, large computing clusters running in the cloud, and whatever you’d like. Data Jujitsu isn’t the end of the road; it’s really just the beginning. But it’s the beginning that allows you to get to the next step.




Economic impact of open source on small business

A few months back, Tim O’Reilly and Hari Ravichandran, founder and CEO of Endurance International Group (EIG), had a discussion about the web hosting business. They talked specifically about how much of Hari’s success had been enabled by open source software. Hari wasn’t just telling his success story to Tim; he was more interested in finding ways to give back to the communities that made his success possible. The two agreed that both companies would work together to produce a report making clear just how much of a role open source software plays in the hosting industry, and by extension, in enabling the web presence of millions of small businesses.

We hope you will read this free report while thinking about all the open source projects, teams and communities that have contributed to the economic success of small businesses or local governments. Their true economic impact is hard to measure. We combed through mountains of data, built economic models, surveyed customers and had discussions with small and medium businesses (SMB) to pull together a fairly broad-reaching dataset on which to base our study. The results are what you will find in this report.

Here are a few of the findings we derived from Bluehost data (an EIG company) and follow-on research:

  • 60% of web hosting usage is by SMBs, 71% if you include non-profits. Only 22% of hosted sites are for personal use.
  • WordPress is a far more important open source product than most people give it credit for. In the SMB hosting market, it is as widely used as MySQL and PHP, far ahead of Joomla and Drupal, the other leading content management systems.
  • Languages commonly used by high-tech startups, such as Ruby and Python, have little usage in the SMB hosting market, which is dominated by PHP for server-side scripting and JavaScript for client-side scripting.
  • Open source hosting alternatives have at least a 2:1 cost advantage relative to proprietary solutions.

Given that SMBs are widely thought to generate as much as 50% of GDP, the productivity gains to the economy as a whole that can be attributed to open source software are significant. The most important open source programs contributing to this expansion of opportunity for small businesses include Linux, Apache, MySQL, PHP, JavaScript, and WordPress. The developers of these open source projects and the communities that support them are truly unsung heroes of the economy!

Tim O’Reilly hosted a discussion at OSCON 2012 to examine the report’s findings. He was joined by Dan Handy, CEO of Bluehost; John Mone, EVP Technology at Endurance International Group; Roger Magoulas, Director of Market Research at O’Reilly; and Mike Hendrickson, VP of Content Strategy at O’Reilly. The following video contains the full discussion:


July 19 2012

“It’s impossible for me to die”

Julien Smith believes I won’t let him die.

The subject came up during our interview at Foo Camp 2012 — part of our ongoing foo interview series — in which Smith argued that our brains and innate responses don’t always map to the safety of our modern world:

“We’re in a place where it’s fundamentally almost impossible to die. I could literally — there’s a table in front of me made of glass — I could throw myself onto the table. I could attempt to even cut myself in the face or the throat, and before I did that, all these things would stop me. You would find a way to stop me. It’s impossible for me to die.”

[Discussed at the 5:16 mark in the associated video interview.]

Smith didn’t test his theory, but he makes a good point. The way we respond to the world often doesn’t correspond with the world’s true state. And he’s right about that not-letting-him-die thing; the other people in the room and I would have jumped in had he crashed through a pane of glass. He would have then gone to an emergency room where the doctors and nurses would usher him through a life-saving process. The whole thing is set up to keep him among the living.

Acknowledging the safety of an environment isn’t something most people do by default. Perhaps we don’t want to tempt fate. Or maybe we’re wired to identify threats even when they’re not present. This disconnect between our ancient physical responses and our modern environments is one of the things Smith explores in his book The Flinch.

“Your body, all that it wants from you is to reproduce as often as possible and die,” Smith said during our interview. “It doesn’t care about anything else. It doesn’t want you to write a book. It doesn’t want you to change the world. It doesn’t even want you to live that long. It doesn’t care … Our brains are running on what a friend of mine would call ‘jungle surplus hardware.’ We want to do things that are totally counter and against what our jungle surplus hardware wants.” [Discussed at 2:00]

In his book, Smith says a flinch is an appropriate and important response to fights and car crashes and those sorts of things. But flinches also bubble up when we’re starting a new business, getting into a relationship and considering other risky non-life-threatening events. According to Smith, these are the flinches that hold people back.

“Your world has a safety net,” Smith writes in the book. “You aren’t in free fall, and you never will be. You treat mistakes as final, but they almost never are. Pain and scars are a part of the path, but so is getting back up, and getting up is easier than ever.”

There are many people in the world who face daily danger and the prospect of catastrophic outcomes. For them, flinches are essential survival tools. But there are also people who are surrounded by safety and opportunity. As hard as it is for a worrier like me to admit it (I’m writing this on an airplane, so fingers crossed), I’m one of them. A fight-or-flight response would be an overreaction to 99% of the things I encounter on a daily basis.

Now, I’m not about to start a local chapter of anti-flinchers, but I do think Smith has a legitimate point that deserves real consideration. Namely, gut reactions can be wrong.

Real danger and compromised thinking

To be clear, Smith isn’t suggesting we blithely ignore those little voices in the backs of our heads when a real threat is brewing.

“You can’t assume that you’re wrong, and you can’t assume that you’re right,” he said, relaying advice he received from a security expert. “You can just assume that you’re unable to process this decision properly, so step away from it and then decide from another vantage point. If you can do that, you’re fundamentally, every day, going to make better decisions.” [Discussed at 4:10]

I was surprised by this answer. I figured a guy who wrote a book about the detriments of flinches would compare threatening circumstances with other unlikely events, like lightning strikes and lottery wins. But Smith is doing something more thoughtful than rejecting fear outright. He’s working within a framework that challenges assumptions about our physical and mental processes. You can’t trust your brain or your body if you’re incapable of processing the threat. The success of your survival method, whatever it may be, depends on your capabilities. So, what you have to do is know when you’re compromised, get out of there, and then give yourself the opportunity to assess under better circumstances.

Other things from the interview

At the end of the interview I asked Smith about the people and projects he follows. He pointed toward Peter Thiel because he admires people who see different versions of the future. Smith also tracks the audacious moves made by startups, and he looks for ways those same actions and perspectives can be applied in non-startup environments. The goal is to “see if we come up with a better society or a better individual as a result.”

You can see the full interview from Foo Camp in the following video:

Associated photo on home and category pages: Broken Glass on Concrete by shaire productions, on Flickr

