May 06 2013

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.

Let’s do some thought experiments–unfortunately, totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double-headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. And here we see what’s really happening: to remove one head of the double-headed arrow, we use “common sense” to choose between two stories: one that’s merely silly, and another that’s so ludicrous we never even think about it. Seems to work here (for a very limited value of “work”); but if I’ve learned one thing, it’s that good old common sense is frequently neither common nor sensible. For more realistic correlations, it certainly seems ironic that we’re doing all this data analysis just to end up relying on common sense.

Now let’s look at something equally hypothetical that isn’t silly. A drug is correlated with reduced risk of death due to heart failure. Good thing, right? Yes–but why? What if the drug has nothing to do with heart failure, but is really an anti-depressant that makes you feel better about yourself so you exercise more? If you’re in the “correlation is as good as causation” club, doesn’t make a difference: you win either way. Except that, if the key is really exercise, there might be much better ways to achieve the same result. Certainly much cheaper, since the drug industry will no doubt price the pills at $100 each. (Tangent: I once saw a truck drive up to an orthopedist’s office and deliver Vioxx samples with a street value probably in the millions…) It’s possible, given some really interesting work being done on the placebo effect, that a properly administered sugar pill will make the patient feel better and exercise, yielding the same result. (Though it’s possible that sugar pills only work as placebos if they’re expensive.) I think we’d like to know, rather than just saying that correlation is just as good as causation, if you have a lot of data.

Perhaps I haven’t gone far enough: with enough data, and enough dimensions to the data, it would be possible to detect the correlations between the drug, psychological state, exercise, and heart disease. But that’s not the point. First, if correlation really is as good as causation, why bother? Second, to analyze data, you have to collect it. And before you collect it, you have to decide what to collect. Data is socially constructed (I promise, this will be the subject of another post), and the data you don’t decide to collect doesn’t exist. Decisions about what data to collect are almost always driven by the stories we want to tell. You can have petabytes of data, but if it isn’t the right data, if it’s data that’s been biased by preconceived notions of what’s important, you’re going to be misled. Indeed, any researcher knows that huge data sets tend to create spurious correlations.
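
The point about spurious correlations is easy to check with a toy simulation. The sketch below is only an illustration under stated assumptions: every column is independent random noise, and the function names, the seed, and the sizes (5 versus 100 variables, 30 observations) are arbitrary choices of mine, not anything from a real data set. It reports the strongest pairwise correlation found when you go looking among a few variables and among many.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def strongest_chance_correlation(n_vars, n_obs, seed=0):
    """Largest |r| over all pairs of independent, purely random columns."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))

# Hypothetical sizes: nothing here is related to anything else by construction.
few = strongest_chance_correlation(n_vars=5, n_obs=30)
many = strongest_chance_correlation(n_vars=100, n_obs=30)
print(f"5 variables:   strongest |r| = {few:.2f}")
print(f"100 variables: strongest |r| = {many:.2f}")
```

With 100 columns there are 4,950 pairs to compare, so a pair that looks strongly related turns up by chance alone. More columns means more opportunities to be fooled, not fewer.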

Causation has its own problems, not the least of which is that it’s impossible to prove. Unfortunately, that’s the way the world works. But thinking about cause and how events relate to each other helps us to be more critical about the correlations we discover. As humans we’re storytellers, and an important part of data work is building a story around the data. Mere correlations arising from a gigantic pool of data aren’t enough to satisfy us. But there are good stories and bad ones, and just as it’s possible to be careful in designing your experiments, it’s possible to be careful and ethical in the stories you tell with your data. Those stories may be the closest we ever get to an understanding of cause; but we have to realize that they’re just stories, that they’re provisional, and that better evidence (which may just be correlations) may force us to retell our stories at any moment. “Correlation is as good as causation” is just an excuse for intellectual sloppiness; it’s an excuse to replace thought with an odd kind of “common sense,” and to shut down the discussion that leads to good stories and understanding.

April 30 2013

Leading Indicators

In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.

Our starting point was remote: Q told me about Tyler Brulé’s travel writing for Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on their club sandwiches, they’re probably cutting corners in other places.

This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. It’s the first thing to cut when the budget gets tight; and if a school has a good music program with excellent teachers, they’re probably not shortchanging the kids elsewhere.

How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:

  • Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
  • Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, and the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.
  • When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
  • What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
  • Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
  • What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
  • Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

Coming up with these questions was an interesting thought experiment; we don’t know whether it holds water, but we suspect it does. Any ideas and opinions?

February 25 2013

Big data is dead, long live big data: Thoughts heading to Strata

A recent VentureBeat article argues that “Big Data” is dead. It’s been killed by marketers. That’s an understandable frustration (and a little ironic to read about it in that particular venue). As I said sarcastically the other day, “Put your Big Data in the Cloud with a Hadoop.”

You don’t have to read much industry news to get the sense that “big data” is sliding into the trough of Gartner’s hype curve. That’s natural. Regardless of the technology, the trough of the hype cycle is driven by a familiar set of causes: it’s fed by over-aggressive marketing, the longing for a silver bullet that doesn’t exist, and the desire to spout the newest buzzwords. All of these phenomena breed cynicism. Perhaps the most dangerous is the technologist who never understands the limitations of data, never understands what data isn’t telling you, or never understands that if you ask the wrong questions, you’ll certainly get the wrong answers.

Big data is not a term I’m particularly fond of. It’s just data, regardless of the size. But I do like Roger Magoulas’ definition of “big data”: big data is when the size of the data becomes part of the problem. I like that definition because it scales. It was meaningful in 1960, when “big data” was a couple of megabytes. It will be meaningful in 2030, when we all have petabyte laptops, or eyeglasses connected directly to Google’s yottabyte cloud. It’s not convenient for marketing, I admit; today’s “Big Data!!! With Hadoop And Other Essential Nutrients Added” is tomorrow’s “not so big data, small data actually.” Marketing, for better or for worse, will deal.

Whether or not Moore’s Law continues indefinitely, the real importance of the amazing increase in computing power over the last six decades isn’t that things have gotten faster; it’s that the size of the problems we can solve has gotten much, much larger. Or as Chris Gaun just wrote, big data is leading scientists to ask bigger questions. We’ve been a little too focused on Amdahl’s law, about making computing faster, and not focused enough on the reverse: how big a problem can you solve in a given time, given finite resources? Modern astronomy, physics, and genetics are all inconceivable without really big data, and I mean big on a scale that dwarfs Amazon’s inventory database. At the edges of research, data is, and always will be, part of the problem. Perhaps even the biggest part of the problem.

In the next year, we’ll slog through the cynicism that’s a natural outcome of the hype cycle. But I’m not worrying about cynicism. Data isn’t like Java, or Rails, or any of a million other technologies; data has been with us since before computers were invented, and it will still be with us when we move onto whatever comes after digital computing. Data, and specifically “big data,” will always be at the edges of research and understanding. Whether we’re mapping the brain or figuring out how the universe works, the biggest problems will almost always be the ones for which the size of the data is part of the problem. That’s an invariant. That’s why I’m excited about data.

August 29 2012

Analyzing health care data to empower patients

The stress of falling seriously ill often drags along the frustration of having no idea what the treatment will cost. We’ve all experienced the maddening stream of seemingly endless hospital bills, and testimony by E-patient Dave DeBronkart and others shows just how absurd U.S. payment systems are.

So I was happy to seize the opportunity to ask questions of three researchers from Castlight Health about the service they’ll discuss at the upcoming Strata Rx conference about data in health care.

Castlight casts its work in the framework of a service to employers and consumers. But make no mistake about it: they are a data-rich research operation, and their consumers become empowered patients (e-patients) who can make better choices.

As Arjun Kulothungun, John Zedlewski, and Eugenia Bisignani wrote to me, “Patients become empowered when actionable information is made available to them. In health care, like any other industry, people want high quality services at competitive prices. But in health care, quality and cost are often impossible for an average consumer to determine. We are proud to do the heavy lifting to bring this information to our users.”

Following are more questions and answers from the speakers:

1. Tell me a bit about what you do at Castlight and at whom you aim your services.

We work together in the Research team at Castlight Health. We provide price and quality information to our users for most common health care services, including those provided by doctors, hospitals, labs, and imaging facilities. This information is provided to patients through a user-friendly web interface and mobile app that shows their different healthcare options customized to their health care plan. Our research team has built a sophisticated pricing system that factors in a wide variety of data sources to produce accurate prices for our users.

At a higher level, this fits into our company’s drive toward healthcare transparency, to help users better understand and navigate their healthcare options. Currently, we sell this product to employers to be offered as a benefit to their employees and their dependents. Our product is attractive to self-insured employers who operate a high-deductible health plan. High-deductible health plans motivate employees to explore their options, since doing so helps them save on their healthcare costs and find higher quality care. Our product helps patients easily explore those options.

2. What kinds of data do you use? What are the challenges involved in working with this data and making it available to patients?

We bring in data from a variety of sources to model the financial side of the healthcare industry, so that we can accurately represent the true cost of care to our users. One of the challenges we face is that the data is often messy. This is due to the complex ways that health care claims are adjudicated, and the largely manual methods of data entry. Additionally, provider data is not highly standardized, so it is often difficult to match data from different sources. Finally, in a lot of cases the data is sparse: some health care procedures are frequent, but others are only seldom performed, so it is more challenging to determine their prices.

The variability of care received also presents a challenge, because the exact care a patient receives during a visit cannot always be predicted ahead of time. A single visit to a doctor can yield a wide array of claim line items, and the patient is subsequently responsible for the sum of these services. Thus, our intent is to convey the full cost of the care patients are to receive. We believe patients are interested in understanding their options in a straightforward way, and that they don’t think in terms of claim line items and provider billing codes. So we spend a lot of time determining the best way to reflect the total cost of care to our users.

3. How much could a patient save if they used Castlight effectively? What would this mean for larger groups?

For a given procedure or service, prices in a local area can vary by 100% or more. For instance, right here in San Francisco, we can see that the cost for a particular MRI varies from $450 to nearly $3000, depending on the facility that a patient chooses, while an office visit with a primary care doctor can range from $60 to $180. But a patient may not always wish to choose the lowest cost option. A number of different factors affect how much a patient could save: the availability of options in their vicinity, the quality of the services, the patient’s ability to change the current doctor/hospital for a service, personal preferences, and the insurance benefits provided. Among our customers, the empowerment of patients adds up to employer savings of around 13% in comparison to expected trends.

In addition to cost savings, Castlight also helps drive better quality care. We have shown a 38% reduction in gaps in care for chronic conditions such as diabetes and high blood pressure. This will help drive further savings as individuals adhere to clinically proven treatment schedules.

4. What other interesting data sets are out there for healthcare consumers to use? What sorts of data do you wish were available?

Unfortunately, data on prices of health care procedures is still not widely available from government sources and insurers. Data sources that are available publicly are typically too complex and arcane to be actionable for average health care consumers.

However, CMS has recently made a big push to provide data on hospital quality. Their “hospital compare” website is a great resource to access this data. We have integrated the Medicare statistics into the Castlight product, and we’re proud of the role that Castlight co-founder and current CTO of the United States Todd Park played in making it available to the public. Despite this progress on sharing hospital data, the federal government has not made the same degree of progress in sharing information for individual physicians, so we would love to see more publicly collected data in this area.

5. Are there crowdsourcing opportunities? If patients submitted data, could it be checked for quality, and how could it further improve care and costs?

We believe that engaging consumers by asking them to provide data is a great idea! The most obvious place for users to provide data is by writing reviews of their experiences with different providers, as well as rating those providers on various facets of care. Castlight and other organizations aggregate and report on these reviews as one measure of provider quality.

It is harder to use crowdsourced information to compute costs. There are significant challenges in matching crowdsourced data to providers and especially to services performed, because line items are not identified to consumers by their billing codes. Additionally, rates tend to depend on the consumer’s insurance plan. Nonetheless, we are exploring ways to use crowdsourced pricing data for Castlight.

August 14 2012

Solving the Wanamaker problem for health care

By Tim O’Reilly, Julie Steele, Mike Loukides and Colin Hill

“The best minds of my generation are thinking about how to make people click ads.” — Jeff Hammerbacher, early Facebook employee

“Work on stuff that matters.” — Tim O’Reilly

In the early days of the 20th century, department store magnate John Wanamaker famously said, “I know that half of my advertising doesn’t work. The problem is that I don’t know which half.”

The consumer Internet revolution was fueled by a search for the answer to Wanamaker’s question. Google AdWords and the pay-per-click model transformed a business in which advertisers paid for ad impressions into one in which they pay for results. “Cost per thousand impressions” (CPM) was replaced by “cost per click” (CPC), and a new industry was born. It’s important to understand why CPC replaced CPM, though. Superficially, it’s because Google was able to track when a user clicked on a link, and was therefore able to bill based on success. But billing based on success doesn’t fundamentally change anything unless you can also change the success rate, and that’s what Google was able to do. By using data to understand each user’s behavior, Google was able to place advertisements that an individual was likely to click. They knew “which half” of their advertising was more likely to be effective, and didn’t bother with the rest.

Since then, data and predictive analytics have driven ever deeper insight into user behavior such that companies like Google, Facebook, Twitter, Zynga, and LinkedIn are fundamentally data companies. And data isn’t just transforming the consumer Internet. It is transforming finance, design, and manufacturing — and perhaps most importantly, health care.

How is data science transforming health care? There are many ways in which health care is changing, and needs to change. We’re focusing on one particular issue: the problem Wanamaker described when talking about his advertising. How do you make sure you’re spending money effectively? Is it possible to know what will work in advance?

Too often, when doctors order a treatment, whether it’s surgery or an over-the-counter medication, they are applying a “standard of care” treatment or some variation that is based on their own intuition, effectively hoping for the best. The sad truth of medicine is that we don’t really understand the relationship between treatments and outcomes. We have studies to show that various treatments will work more often than placebos; but, like Wanamaker, we know that much of our medicine doesn’t work for half of our patients, we just don’t know which half. At least, not in advance. One of data science’s many promises is that, if we can collect data about medical treatments and use that data effectively, we’ll be able to predict more accurately which treatments will be effective for which patient, and which treatments won’t.

A better understanding of the relationship between treatments, outcomes, and patients will have a huge impact on the practice of medicine in the United States. Health care is expensive. The U.S. spends over $2.6 trillion on health care every year, an amount that constitutes a serious fiscal burden for government, businesses, and our society as a whole. These costs include over $600 billion of unexplained variations in treatments: treatments that cause no differences in outcomes, or even make the patient’s condition worse. We have reached a point at which our need to understand treatment effectiveness has become vital — to the health care system and to the health and sustainability of the economy overall.

Why do we believe that data science has the potential to revolutionize health care? After all, the medical industry has had data for generations: clinical studies, insurance data, hospital records. But the health care industry is now awash in data in a way that it has never been before: from biological data such as gene expression, next-generation DNA sequence data, proteomics, and metabolomics, to clinical data and health outcomes data contained in ever more prevalent electronic health records (EHRs) and longitudinal drug and medical claims. We have entered a new era in which we can work on massive datasets effectively, combining data from clinical trials and direct observation by practicing physicians (the records generated by our $2.6 trillion of medical expense). When we combine data with the resources needed to work on the data, we can start asking the important questions, the Wanamaker questions, about what treatments work and for whom.

The opportunities are huge: for entrepreneurs and data scientists looking to put their skills to work disrupting a large market, for researchers trying to make sense out of the flood of data they are now generating, and for existing companies (including health insurance companies, biotech, pharmaceutical, and medical device companies, hospitals and other care providers) that are looking to remake their businesses for the coming world of outcome-based payment models.

Making health care more effective

What, specifically, does data allow us to do that we couldn’t do before? For the past 60 or so years of medical history, we’ve treated patients as some sort of an average. A doctor would diagnose a condition and recommend a treatment based on what worked for most people, as reflected in large clinical studies. Over the years, we’ve become more sophisticated about what that average patient means, but that same statistical approach didn’t allow for differences between patients. A treatment was deemed effective or ineffective, safe or unsafe, based on double-blind studies that rarely took into account the differences between patients. The exceptions to this are relatively recent and have been dominated by cancer treatments, the first being Herceptin for breast cancer in women who over-express the Her2 receptor. With the data that’s now available, we can go much further for a broad range of diseases and interventions that are not just drugs but include surgery, disease management programs, medical devices, patient adherence, and care delivery.

For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more: we know that it’s 100% effective in 70 to 80% of the patients, and ineffective in the rest. That’s not word games, because we can now use genetic markers to tell whether it’s likely to be effective or ineffective for any given patient, and we can tell in advance whether to treat with Tamoxifen or to try something else.

Two factors lie behind this new approach to medicine: a different way of using data, and the availability of new kinds of data. It’s not just stating that the drug is effective on most patients, based on trials (indeed, 80% is an enviable success rate); it’s using artificial intelligence techniques to divide the patients into groups and then determine the difference between those groups. We’re not asking whether the drug is effective; we’re asking a fundamentally different question: “for which patients is this drug effective?” We’re asking about the patients, not just the treatments. A drug that’s only effective on 1% of patients might be very valuable if we can tell who that 1% is, though it would certainly be rejected by any traditional clinical trial.
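
To make the change of question concrete, here is a minimal sketch in Python. The ten patient records are synthetic and invented purely for illustration (they are not trial data), and the `has_marker` field is a hypothetical stand-in for whatever genetic marker separates the groups. The overall number answers "is the drug effective?"; the grouped numbers answer "for which patients?"

```python
from collections import defaultdict

# Synthetic, made-up records: (has_marker, responded). Not real trial data.
patients = [(True, True)] * 8 + [(False, False)] * 2

def response_rate(records):
    """Fraction of patients in `records` who responded to treatment."""
    return sum(1 for _, responded in records if responded) / len(records)

def response_rate_by_marker(records):
    """Response rate computed separately for each marker group."""
    groups = defaultdict(list)
    for has_marker, responded in records:
        groups[has_marker].append((has_marker, responded))
    return {marker: response_rate(group) for marker, group in groups.items()}

overall = response_rate(patients)
by_marker = response_rate_by_marker(patients)
print(overall)    # 0.8
print(by_marker)  # {True: 1.0, False: 0.0}
```

The same 80% figure hides two very different groups. In practice the hard part is discovering which attribute defines the groups; this sketch assumes it is already known.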

More than that, asking questions about patients is only possible because we’re using data that wasn’t available until recently: DNA sequencing was only invented in the mid-1970s, and is only now coming into its own as a medical tool. What we’ve seen with Tamoxifen is as clear a solution to the Wanamaker problem as you could ask for: we now know when that treatment will be effective. If you can do the same thing with millions of cancer patients, you will both improve outcomes and save money.

Dr. Lukas Wartman, a cancer researcher who was himself diagnosed with terminal leukemia, was successfully treated with sunitinib, a drug that was only approved for kidney cancer. Sequencing the genes of both the patient’s healthy cells and cancerous cells led to the discovery of a protein that was out of control and encouraging the spread of the cancer. The gene responsible for manufacturing this protein could potentially be inhibited by the kidney drug, although it had never been tested for this application. This unorthodox treatment was surprisingly effective: Wartman is now in remission.

While this treatment was exotic and expensive, what’s important isn’t the expense but the potential for new kinds of diagnosis. The price of gene sequencing has been plummeting; it will be a common doctor’s office procedure in a few years. And through Amazon and Google, you can now “rent” a cloud-based supercomputing cluster that can solve huge analytic problems for a few hundred dollars per hour. What is now exotic inevitably becomes routine.

But even more important: we’re looking at a completely different approach to treatment. Rather than a treatment that works 80% of the time, or even 100% of the time for 80% of the patients, a treatment might be effective for a small group. It might be entirely specific to the individual; the next cancer patient may have a different protein that’s out of control, an entirely different genetic cause for the disease. Treatments that are specific to one patient don’t exist in medicine as it’s currently practiced; how could you ever do an FDA trial for a medication that’s only going to be used once to treat a certain kind of cancer?

Foundation Medicine is at the forefront of this new era in cancer treatment. They use next-generation DNA sequencing to discover DNA sequence mutations and deletions that are currently used in standard of care treatments, as well as many other actionable mutations that are tied to drugs for other types of cancer. They are creating a patient-outcomes repository that will be the fuel for discovering the relation between mutations and drugs. In 50% of cancer cases, Foundation has identified DNA mutations for which drugs exist but are not currently used in the standard of care for the patient’s particular cancer (information via a private communication).

The ability to do large-scale computing on genetic data gives us the ability to understand the origins of disease. If we can understand why an anti-cancer drug is effective (what specific proteins it affects), and if we can understand what genetic factors are causing the cancer to spread, then we’re able to use the tools at our disposal much more effectively. Rather than using imprecise treatments organized around symptoms, we’ll be able to target the actual causes of disease, and design treatments tuned to the biology of the specific patient. Eventually, we’ll be able to treat 100% of the patients 100% of the time, precisely because we realize that each patient presents a unique problem.

Personalized treatment is just one area in which we can solve the Wanamaker problem with data. Hospital admissions are extremely expensive. Data can make hospital systems more efficient and help avoid preventable complications such as blood clots and hospital re-admissions. It can also help address the challenge of hot-spotting (a term coined by Atul Gawande): finding people who use an inordinate amount of health care resources. By looking at data from hospital visits, Dr. Jeffrey Brenner of Camden, NJ, was able to determine that “just one per cent of the hundred thousand people who made use of Camden’s medical facilities accounted for thirty per cent of its costs.” Furthermore, many of these people came from two apartment buildings. Designing more effective medical care for these patients was difficult; it doesn’t fit our health insurance system, the patients are often dealing with many serious medical issues (addiction and obesity are frequent complications), and they have trouble trusting doctors and social workers. It’s counter-intuitive, but spending more on some patients now results in spending less on them when they become really sick. While it’s a work in progress, it looks like building appropriate systems to target these high-risk patients and treat them before they’re hospitalized will bring significant savings.
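
The hot-spotting calculation itself is simple once the cost data is in hand. The following is a rough sketch, not Brenner's actual method: the `top_share` helper is a name I made up, and the cost figures are synthetic, chosen only so that one patient in a hundred accounts for 30% of the total.

```python
def top_share(costs, fraction):
    """Share of total cost attributable to the costliest `fraction` of patients."""
    ranked = sorted(costs, reverse=True)
    # Always count at least one patient, even for very small fractions.
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Synthetic per-patient annual costs (hypothetical units): one very
# expensive patient and ninety-nine inexpensive ones.
costs = [2970] + [70] * 99

share = top_share(costs, fraction=0.01)
print(f"Top 1% of patients account for {share:.0%} of costs")  # 30%
```

Ranking patients by cost is the easy part; as the Camden example shows, the real work is designing care for the people the ranking surfaces.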

Many poor health outcomes are attributable to patients who don’t take their medications. Eliza, a Boston-based company started by Alexandra Drane, has pioneered approaches to improve compliance through interactive communication with patients. Eliza improves patient drug compliance by tracking which types of reminders work on which types of people; it’s similar to the way companies like Google target advertisements to individual consumers. By using data to analyze each patient’s behavior, Eliza can generate reminders that are more likely to be effective. The results aren’t surprising: if patients take their medicine as prescribed, they are more likely to get better. And if they get better, they are less likely to require further, more expensive treatment. Again, we’re using data to solve Wanamaker’s problem in medicine: we’re spending our resources on what’s effective, on appropriate reminders that are most likely to get patients to take their medications.

More data, more sources

The examples we’ve looked at so far have been limited to traditional sources of medical data: hospitals, research centers, doctor’s offices, insurers. The Internet has enabled the formation of patient networks aimed at sharing data. Health social networks now are some of the largest patient communities. As of November 2011, PatientsLikeMe has over 120,000 patients in 500 different condition groups; ACOR has over 100,000 patients in 127 cancer support groups; 23andMe has over 100,000 members in their genomic database; and diabetes health social network SugarStats has over 10,000 members. These are just the larger communities; thousands of small communities have been created around rare diseases, or even uncommon experiences with common diseases. All of these communities are generating data that they voluntarily share with each other and the world.

Increasingly, what they share is not just anecdotal, but includes an array of clinical data. For this reason, these groups are being recruited for large-scale crowdsourced clinical outcomes research.

Thanks to ubiquitous data networking through the mobile network, we can take several steps further. In the past two or three years, there’s been a flood of personal fitness devices (such as the Fitbit) for monitoring your personal activity. There are mobile apps for taking your pulse, and an iPhone attachment for measuring your glucose. There has been talk of mobile applications that would constantly listen to a patient’s speech and detect changes that might be the precursor for a stroke, or would use the accelerometer to report falls. Tanzeem Choudhury has developed an app called Be Well that is intended primarily for victims of depression, though it can be used by anyone. Be Well monitors the user’s sleep cycles, the amount of time they spend talking, and the amount of time they spend walking. The data is scored, and the app makes appropriate recommendations, based both on the individual patient and data collected across all the app’s users.

Continuous monitoring of critical patients in hospitals has been normal for years; but we now have the tools to monitor patients constantly, in their home, at work, wherever they happen to be. And if this sounds like Big Brother, at this point most of the patients are willing. We don’t want to transform our lives into hospital experiences; far from it! But we can collect and use the data we constantly emit, our “data smog,” to maintain our health, to become conscious of our behavior, and to detect oncoming conditions before they become serious. The most effective medical care is the medical care you avoid because you don’t need it.

Paying for results

Once we’re on the road toward more effective health care, we can look at other ways in which Wanamaker’s problem shows up in the medical industry. It’s clear that we don’t want to pay for treatments that are ineffective. Wanamaker wanted to know which part of his advertising was effective, not just to make better ads, but also so that he wouldn’t have to buy the advertisements that wouldn’t work. He wanted to pay for results, not for ad placements. Now that we’re starting to understand how to make treatment effective, now that we understand that it’s more than rolling the dice and hoping that a treatment that works for a typical patient will be effective for you, we can take the next step: Can we change the underlying incentives in the medical system? Can we make the system better by paying for results, rather than paying for procedures?

It’s shocking just how badly the incentives in our current medical system are aligned with outcomes. If you see an orthopedist, you’re likely to get an MRI, most likely at a facility owned by the orthopedist’s practice. On one hand, it’s good medicine to know what you’re doing before you operate. But how often does that MRI result in a different treatment? How often is the MRI required just because it’s part of the protocol, when it’s perfectly obvious what the doctor needs to do? Many men have had PSA tests for prostate cancer; but in most cases, aggressive treatment of prostate cancer is a bigger risk than the disease itself. Yet the test itself is a significant profit center. Think again about Tamoxifen, and about the pharmaceutical company that makes it. In our current system, what does “100% effective in 80% of the patients” mean, except for a 20% loss in sales? That’s because the drug company is paid for the treatment, not for the result; it has no financial interest in whether any individual patient gets better. (Whether a statistically significant number of patients has side-effects is a different issue.) And at the same time, bringing a new drug to market is very expensive, and might not be worthwhile if it will only be used on the remaining 20% of the patients. And that’s assuming that one drug, not two, or 20, or 200 will be required to treat the unlucky 20% effectively.

It doesn’t have to be this way.

In the U.K., Johnson & Johnson, faced with the possibility of losing reimbursements for their multiple myeloma drug Velcade, agreed to refund the money for patients who did not respond to the drug. Several other pay-for-performance drug deals have followed since, paving the way for the ultimate transition in pharmaceutical company business models in which their product is health outcomes instead of pills. Such a transition would rely more heavily on real-world outcome data (are patients actually getting better?), rather than controlled clinical trials, and would use molecular diagnostics to create personalized “treatment algorithms.” Pharmaceutical companies would also focus more on drug compliance to ensure health outcomes were being achieved. This would ultimately align the interests of drug makers with patients, their providers, and payors.

Similarly, rather than paying for treatments and procedures, can we pay hospitals and doctors for results? That’s what Accountable Care Organizations (ACOs) are about. ACOs are a leap forward in business model design, where the provider shoulders any financial risk. ACOs represent a new framing of the much maligned HMO approaches from the ’90s, which did not work. HMOs tried to use statistics to predict and prevent unneeded care. The ACO model, rather than controlling doctors with what the data says they “should” do, uses data to measure how each doctor performs. Doctors are paid for successes, not for the procedures they administer. The main advantage that the ACO model has over the HMO model is how good the data is, and how that data is leveraged. The ACO model aligns incentives with outcomes: a practice that owns an MRI facility isn’t incentivized to order MRIs when they’re not necessary. It is incentivized to use all the data at its disposal to determine the most effective treatment for the patient, and to follow through on that treatment with a minimum of unnecessary testing.

When we know which procedures are likely to be successful, we’ll be in a position where we can pay only for the health care that works. When we can do that, we’ve solved Wanamaker’s problem for health care.

Enabling data

Data science is not optional in health care reform; it is the linchpin of the whole process. All of the examples we’ve seen, ranging from cancer treatment to detecting hot spots where additional intervention will make hospital admission unnecessary, depend on using data effectively: taking advantage of new data sources and new analytics techniques, in addition to the data the medical profession has had all along.

But it’s too simple just to say “we need data.” We’ve had data all along: handwritten records in manila folders on acres and acres of shelving. Insurance company records. But it’s all been locked up in silos: insurance silos, hospital silos, and many, many doctor’s office silos. Data doesn’t help if it can’t be moved, if data sources can’t be combined.

There are two big issues here. First, a surprising number of medical records are still either hand-written, or in digital formats that are scarcely better than hand-written (for example, scanned images of hand-written records). Getting medical records into a format that’s computable is a prerequisite for almost any kind of progress. Second, we need to break down those silos.

Anyone who has worked with data knows that, in any problem, 90% of the work is getting the data in a form in which it can be used; the analysis itself is often simple. We need electronic health records: patient data in a more-or-less standard form that can be shared efficiently, data that can be moved from one location to another at the speed of the Internet. Not all data formats are created equal, and some are certainly better than others: but at this point, any machine-readable format, even simple text files, is better than nothing. While there are currently hundreds of different formats for electronic health records, the fact that they’re electronic means that they can be converted from one form into another. Standardizing on a single format would make things much easier, but just getting the data into some electronic form, any form at all, is the first step.
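To make the point about conversion concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the pipe-delimited layout, the field names, and the sample patients are invented, and real health record formats are far richer. It only shows that once a record is machine-readable, moving it into another form is a few lines of work.

```python
import csv
import io
import json

# Hypothetical pipe-delimited export from an imaginary records system.
# The layout, field names, and values are invented for this example.
LEGACY_EXPORT = """patient_id|dob|medication|dose_mg
P001|1950-03-14|tamoxifen|20
P002|1972-11-02|metformin|500
"""

def legacy_to_records(text):
    """Parse the pipe-delimited export into a list of plain dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    records = []
    for row in reader:
        record = dict(row)
        record["dose_mg"] = int(record["dose_mg"])  # keep the dose as a number, not text
        records.append(record)
    return records

def records_to_json(records):
    """Serialize the same records as JSON, a second machine-readable form."""
    return json.dumps(records, indent=2, sort_keys=True)

records = legacy_to_records(LEGACY_EXPORT)
print(records_to_json(records))
```

The hard part in practice is not the parsing; it is agreeing on what the fields mean, which is why a common format would still make things much easier.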

Once we have electronic health records, we can link doctor’s offices, labs, hospitals, and insurers into a data network, so that all patient data is immediately stored in a data center: every prescription, every procedure, and whether that treatment was effective or not. This isn’t some futuristic dream; it’s technology we have now. Building this network would be substantially simpler and cheaper than building the networks and data centers now operated by Google, Facebook, Amazon, Apple, and many other large technology companies. It’s not even close to pushing the limits.

Electronic health records enable us to go far beyond the current mechanism of clinical trials. In the past, once a drug has been approved in trials, that’s effectively the end of the story: running more tests to determine whether it’s effective in practice would be a huge expense. A physician might get a sense for whether any treatment worked, but that evidence is essentially anecdotal: it’s easy to believe that something is effective because that’s what you want to see. And if it’s shared with other doctors, it’s shared while chatting at a medical convention. But with electronic health records, it’s possible (and not even terribly expensive) to collect documentation from thousands of physicians treating millions of patients. We can find out when and where a drug was prescribed, why, and whether there was a good outcome. We can ask questions that are never part of clinical trials: is the medication used in combination with anything else? What other conditions is the patient being treated for? We can use machine learning techniques to discover unexpected combinations of drugs that work well together, or to predict adverse reactions. We’re no longer limited by clinical trials; every patient can be part of an ongoing evaluation of whether his treatment is effective, and under what conditions. Technically, this isn’t hard. The only difficult part is getting the data to move, getting data in a form where it’s easily transferred from the doctor’s office to analytics centers.
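As a rough sketch of what that ongoing evaluation might look like, the Python below pools a handful of made-up prescription records and computes how often each drug, alone or in combination, was followed by a good outcome. The record layout and drug names are hypothetical, and a real analysis would need far more data and careful statistical handling; this only illustrates the bookkeeping.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration:
# (drug prescribed, other drug taken alongside it or None, good outcome?)
RECORDS = [
    ("drug_a", None, True),
    ("drug_a", None, False),
    ("drug_a", "drug_b", True),
    ("drug_a", "drug_b", True),
    ("drug_a", "drug_c", False),
    ("drug_a", None, True),
]

def outcome_rates(records):
    """Fraction of good outcomes for each (drug, co-medication) pair."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [good outcomes, total cases]
    for drug, other, good in records:
        pair = (drug, other)
        counts[pair][1] += 1
        if good:
            counts[pair][0] += 1
    return {pair: good / total for pair, (good, total) in counts.items()}

rates = outcome_rates(RECORDS)
for (drug, other), rate in sorted(rates.items(), key=lambda item: str(item[0])):
    label = drug if other is None else drug + " + " + other
    print(label, round(rate, 2))
```

With six records the rates mean nothing, of course; the point is that the same tally over millions of records is no harder to write.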

To solve problems of hot-spotting (individual patients or groups of patients consuming inordinate medical resources) requires a different combination of information. You can’t locate hot spots if you don’t have physical addresses. Physical addresses can be geocoded (converted from addresses to longitude and latitude, which is more useful for mapping problems) easily enough, once you have them, but you need access to patient records from all the hospitals operating in the area under study. And you need access to insurance records to determine how much health care patients are requiring, and to evaluate whether special interventions for these patients are effective. Not only does this require electronic records, it requires cooperation across different organizations (breaking down silos), and assurance that the data won’t be misused (patient privacy). Again, the enabling factor is our ability to combine data from different sources; once you have the data, the solutions come easily.
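Once the records are combined, the core of a hot-spotting calculation is simple. The sketch below is not Brenner's method; it assumes a hypothetical list of claims already merged across hospitals, with invented patients, addresses, and costs, and it skips geocoding entirely. It totals cost by patient and by address, then asks what share of all spending comes from the most expensive patients.

```python
from collections import defaultdict

# Hypothetical claims, assumed to be already merged from several hospitals:
# (patient_id, street address, cost in dollars). All values are invented.
CLAIMS = [
    ("p1", "100 Main St", 90_000),
    ("p2", "100 Main St", 60_000),
    ("p3", "22 Oak Ave", 1_000),
    ("p1", "100 Main St", 30_000),
    ("p4", "7 Elm Rd", 2_000),
    ("p5", "9 Pine Ct", 17_000),
]

def cost_by(claims, field):
    """Total cost keyed by patient (field=0) or by address (field=1)."""
    totals = defaultdict(int)
    for claim in claims:
        totals[claim[field]] += claim[2]
    return dict(totals)

def top_share(totals, fraction):
    """Share of all cost attributable to the most expensive `fraction` of keys."""
    ranked = sorted(totals.values(), reverse=True)
    n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:n]) / sum(ranked)

by_patient = cost_by(CLAIMS, 0)
by_address = cost_by(CLAIMS, 1)
hottest_address = max(by_address, key=by_address.get)

print("share of cost from top 20% of patients:", top_share(by_patient, 0.2))
print("most expensive address:", hottest_address)
```

The arithmetic is trivial; what makes it possible is having the claims from every hospital in one place, with addresses attached.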

Breaking down silos has a lot to do with aligning incentives. Currently, hospitals are trying to optimize their income from medical treatments, while insurance companies are trying to optimize their income by minimizing payments, and doctors are just trying to keep their heads above water. There’s little incentive to cooperate. But as financial pressures rise, it will become critically important for everyone in the health care system, from the patient to the insurance executive, to ensure that they are getting the most for their money. While there’s intense cultural resistance to be overcome (through our experience in data science, we’ve learned that it’s often difficult to break down silos within an organization, let alone between organizations), the pressure of delivering more effective health care for less money will eventually break the silos down. The old zero-sum game of winners and losers must end if we’re going to have a medical system that’s effective over the coming decades.

Data becomes infinitely more powerful when you can mix data from different sources: many doctor’s offices, hospital admission records, address databases, and even the rapidly increasing stream of data coming from personal fitness devices. The challenge isn’t employing our statistics more carefully, precisely, or guardedly. It’s about letting go of an old paradigm that starts by assuming only certain variables are key and ends by correlating only these variables. This paradigm worked well when data was scarce, but if you think about it, these assumptions arise precisely because data is scarce. We didn’t study the relationship between leukemia and kidney cancers because that would require asking a huge set of questions that would require collecting a lot of data; and a connection between leukemia and kidney cancer is no more likely than a connection between leukemia and flu. But the existence of data is no longer a problem: we’re collecting the data all the time. Electronic health records let us move the data around so that we can assemble a collection of cases that goes far beyond a particular practice, a particular hospital, a particular study. So now, we can use machine learning techniques to identify and test all possible hypotheses, rather than just the small set that intuition might suggest. And finally, with enough data, we can get beyond correlation to causation: rather than saying “A and B are correlated,” we’ll be able to say “A causes B,” and know what to do about it.

Building the health care system we want

The U.S. ranks 37th out of developed economies in life expectancy and other measures of health, while by far outspending other countries on per-capita health care costs. We spend 18% of GDP on health care, while other countries on average spend on the order of 10% of GDP. We spend a lot of money on treatments that don’t work, because we have a poor understanding at best of what will and won’t work.

Part of the problem is cultural. In a country where even pets can have hip replacement surgery, it’s hard to imagine not spending every penny you have to prolong Grandma’s life, or your own. The U.S. is a wealthy nation, and health care is something we choose to spend our money on. But wealthy or not, nobody wants ineffective treatments. Nobody wants to roll the dice and hope that their biology is similar enough to a hypothetical “average” patient. No one wants a “winner take all” payment system in which the patient is always the loser, paying for procedures whether or not they are helpful or necessary. Like Wanamaker with his advertisements, we want to know what works, and we want to pay for what works. We want a smarter system where treatments are designed to be effective on our individual biologies; where treatments are administered effectively; where our hospitals are used effectively; and where we pay for outcomes, not for procedures.

We’re on the verge of that new system now. We don’t have it yet, but we can see it around the corner. Ultra-cheap DNA sequencing in the doctor’s office, massive inexpensive computing power, the availability of EHRs to study whether treatments are effective even after the FDA trials are over, and improved techniques for analyzing data are the tools that will bring this new system about. The tools are here now; it’s up to us to put them into use.

August 03 2012

StrataRx: Data science and health(care)

By Mike Loukides and Jim Stogdill

We are launching a conference at the intersection of health, health care, and data. Why?

Our health care system is in crisis. We are experiencing epidemic levels of obesity, diabetes, and other preventable conditions while at the same time our health care system costs are spiraling higher. Most of us have experienced increasing health care costs in our businesses or have seen our personal share of insurance premiums rise rapidly. Worse, we may be living with a chronic or life-threatening disease while struggling to obtain effective therapies and interventions — finding ourselves lumped in with “average patients” instead of receiving effective care designed to work for our specific situation.

In short, particularly in the United States, we are paying too much for too much care of the wrong kind and getting poor results. All the while our diet and lifestyle failures are demanding even more from the system. In the past few decades we’ve dropped from the world’s best health care system to the 37th, and we seem likely to drop further if things don’t change.

The very public fight over the Affordable Care Act (ACA) has brought this to the fore of our attention, but this is a situation that has been brewing for a long time. With the ACA’s arrival, increasing costs and poor outcomes, at least in part, are going to be the responsibility of the federal government. The fiscal outlook for that responsibility doesn’t look good and solving this crisis is no longer optional; it’s urgent.

There are many reasons for the crisis, and there’s no silver bullet. Health and health care live at the confluence of diet and exercise norms, destructive business incentives, antiquated care models, and a system that has severe learning disabilities. We aren’t preventing the preventable, and once we’re sick we’re paying for procedures and tests instead of results; and because those interventions were designed for some non-existent average patient, much of that spending is wasted. Later we mostly ignore the data that could help the system learn and adapt.

It’s all too easy to be gloomy about the outlook for health and health care, but this is also a moment of great opportunity. We face this crisis armed with vast new data sources, the emerging tools and techniques to analyze them, an ACA policy framework that emphasizes outcomes over procedures, and a growing recognition that these are problems worth solving.

Data has a long history of being “unreasonably effective.” And at least from the technologist’s point of view it looks like we are on the cusp of something big. We have the opportunity to move from “Health IT” to an era of data-illuminated technology-enabled health.

For example, it is well known that poverty places a disproportionate burden on the health care system. Poor people don’t have medical insurance and can’t afford to see doctors; so when they’re sick they go to the emergency room at great cost and often after they are much sicker than they need to be. But what happens when you look deeper? One project showed that two apartment buildings in Camden, NJ accounted for a hugely disproportionate number of hospital admissions.

Targeting those buildings, and specific people within them, with integrated preventive care and medical intervention has led to significant savings.

That project was made possible by the analysis of hospital admissions, costs, and intervention outcomes — essentially, insurance claims data — across all the hospitals in Camden. Acting upon that analysis and analyzing the results of the action led to savings.

But claims data isn’t the only game in town anymore. Even more is possible as electronic medical records (EMR), genomic, mobile sensor, and other emerging data streams become available.

With mobile-enabled remote sensors like glucometers, blood pressure monitors, and futuristic tools like digital pills that broadcast their arrival in the stomach, we have the opportunity to completely revolutionize disease management. By moving from discrete and costly data events to a continuous stream of inexpensive remotely monitored data, care will improve for a broad range of chronic and life-threatening diseases. By involving fewer office visits, physician productivity will rise and costs will come down.

We are also beginning to see tantalizing hints of the future of personalized medicine in action. Cheap gene sequencing, better understanding of how drug molecules interact with our biology (and each other), and the tools and horsepower to analyze these complex interactions for a specific patient with specific biology in near real time will change how we do medicine. In the same way that Google’s AdSense took cost out of advertising by using data to target ads with precision, we’ll soon be able to make medical interventions that are much more patient-specific and cost effective.

StrataRx is based on the idea that data will improve health and health care, but we aren’t naive enough to believe that data alone solves all the problems we are facing. Health and health care are incredibly complex and multi-layered and big data analytics is only one piece of the puzzle. Solving our national crisis will also depend on policy and system changes, some of them to systems outside of health care. However, we know that data and its analysis have an important role to play in illuminating the current reality and creating those solutions.

StrataRx is a call for data scientists, technologists, health professionals, and the sector’s business leadership to convene, take part in the discussion, and make a difference!

Strata Rx — Strata Rx, being held Oct. 16-17 in San Francisco, is the first conference to bring data science to the urgent issues confronting healthcare.

Save 20% on registration with the code RADAR20


January 23 2012

Survey results: How businesses are adopting and dealing with data

On December 7, 2011, we held our fifth Strata Online Conference. This series of free web events brings together analysts, innovators and researchers from a variety of fields. Each conference, we look at a particular facet of the move to big data — from personal analytics, to disruptive startups, to enterprise adoption.

This time, we focused on how businesses are going to embrace big data, and where the challenges lie. It was a perfect opportunity to survey the attendees and get a glimpse into enterprise adoption of big data. Out of the roughly 350 attendees, approximately 100 agreed to give us their feedback on a number of questions we asked. Here are the results.

Some basic facts

While the attendees worked for a mix of commercial, educational, government, and non-profit organizations, the vast majority (82%) worked for a commercial, for-profit company.

What kind of organization do you work for?

Most of the attendees' organizations were also fairly large: more than half of them had more than 500 co-workers, and 22% of them had more than 10,000.

How big is your organization?

We used this demographic information to segment and better analyze the other three questions we asked.

Big data adoption and challenges

We then asked attendees about their journey to big data. Fewer than 20% of them already have a big data solution in place — which we clarified to mean some kind of massive-scale, sharded, NoSQL, parallel data query system that may employ interactivity and machine-assisted data exploration. More than a quarter said they have no plans at this time.

How soon do you expect to implement a big data solution?

While it's relatively early days for adoption, more than 60% of attendees said they were in the process of gathering information on big data and what it meant to them. This is a spurious result at best: we're of course selecting an audience that wants to be an audience. Nevertheless, the volume of attendees and their feedback suggests that deployment is ramping up: if you're a big data vendor, this is the time to be fighting for mindshare.

What's the biggest challenge you see with big data?

When it comes to actually deploying big data, companies have plenty of challenges. The big ones seem to be:

  • Data privacy and governance.
  • Defining what big data actually is.
  • Integrating big data with legacy systems.
  • A lack of big data skills.
  • The cost of tools.

Analyzing a bit further

These results might be informative, but what we really want to know is how they correlate. After all, Strata is a data conference: we'd be remiss if we didn't crunch things a bit!

First, we wondered whether there's a relationship between the size of a company and the kinds of problems it's experiencing with big data.

Obstacles by company size

Our results suggest that governance and skill shortages are problems for larger companies, and that smaller businesses worry much less about data privacy and integrating legacy systems. Cost concerns come largely from mid-sized businesses.
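For readers who want to try this kind of segmentation themselves, here is a minimal sketch of a cross-tabulation in Python. The responses are invented for illustration and are not the survey data; the size labels and obstacle names are likewise assumptions.

```python
from collections import Counter

# Hypothetical responses, invented for illustration (not the survey data):
# (company size, biggest obstacle named by the respondent)
RESPONSES = [
    ("small", "cost"),
    ("small", "definition"),
    ("mid", "cost"),
    ("mid", "cost"),
    ("large", "governance"),
    ("large", "skills"),
    ("large", "governance"),
]

def crosstab(responses):
    """Count responses for each (size, obstacle) pair."""
    return Counter(responses)

def row_shares(table, size):
    """Within one company size, the fraction of responses naming each obstacle."""
    row = {obstacle: n for (s, obstacle), n in table.items() if s == size}
    total = sum(row.values())
    return {obstacle: n / total for obstacle, n in row.items()}

table = crosstab(RESPONSES)
large = row_shares(table, "large")
print(table[("mid", "cost")])
print(large)
```

Comparing shares within each segment, rather than raw counts, keeps a large segment from drowning out a small one.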

Then we wondered whether adoption is tied to company size.

Big data adoption progress by company size

Among our attendees, smaller firms were ahead of the game: none of the companies larger than 500 employees said they had big data in place today.

We also found that educational, government, and NGO respondents didn't list cost as a top concern, suggesting that they may have a tolerance for open-source or home-grown approaches.

Obstacles by company type

Of course, the number of responses from these segments is too small to support statistically significant conclusions, but the pattern warrants further study, particularly for commercial offerings trying to sell outside the for-profit world.

Finally, we wondered whether the things a company worries about change as it goes from "just browsing" to "trying to build."

Obstacles by time to implement

Concerns do seem to shift over the course of adoption and maturity. Early on, companies struggle to define what big data is and worry about staffing. As they get closer to implementation, their attention shifts to legacy system integration. Once they have a system, talent shortages and a variety of other, more specific concerns emerge.

While not a hard-core study (respondents weren't randomly selected, the number of responses within some segments is too small to be statistically meaningful, and so on), this feedback does suggest that there's a large demand for clear information on what big data is and how it'll change business, and that as enterprises move to adopt these technologies they'll face integration headaches and staffing issues.

The next free Strata Online Conference will be held on January 25. We'll be taking a look at what's in store for the upcoming Strata Conference (Feb 28-March 1 in Santa Clara, Calif).

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


December 19 2011

Big crime meets big data

Marc Goodman (@futurecrimes) is a former Los Angeles police officer who started that department's first Internet crime unit in the mid-1990s. After two decades spent working with Interpol, the United Nations, and NATO, Goodman founded the Future Crimes Institute to track how criminals use technology.

Malicious types of software, like viruses, worms, and trojans, are the main tools used to harvest personal data. Cyber criminals also use social engineering techniques, such as phishing emails populated with data gleaned from social networks, to trick people into providing further details. In the interview below, Goodman outlines some of the other ways organized criminals and terrorists are harnessing data for nefarious ends.

What motivates data criminals?

Marc Goodman: Anything that would motivate someone to join a startup would motivate a criminal. They want money, shares in the business, a challenge. They don't want a 9-to-5 environment. They also want the respect of their peers. They have an us-against-them attitude; they're highly innovative and adaptive, and they never take the head-on approach. They always find clever and imaginative ways to go about something that a good person would never have considered.

What type of personal data is most valuable to criminals?

Marc Goodman: The best value is a bank account takeover. A standard credit card might cost a criminal only $10, but for $700 they could buy details of a bank account with $50,000 in it, money that could be stolen in just one transaction.

European credit cards tend to cost more than American credit cards since Europeans are much better at guarding their data. There's also a universal identifier for Americans — the social security number — but the same thing doesn't exist from a pan-European perspective.

How is data crime more scalable than traditional crime?

Marc Goodman: Data crime can be scripted and automated. If you were to take a gun or a knife and stand on a street corner, there are only so many people you can rob. You have to do the crime, run away from the scene, worry about the police, etc. You can't walk into Wembley Stadium with a gun and say, "Everybody, put your hands up," but you can do the equivalent from a cyber-crime perspective.

One of the reasons why cyber crime thrives is that it's totally international whereas law enforcement is totally national. Now, the person attacking you can be sitting in New York or Tokyo or Botswana. The ability to conduct business without getting on a plane is an awesome advantage for international organized crime.

How has cyber crime evolved?

Marc Goodman: In the 1970s, you had to be a clever hacker and create your own scripts. Now all of that stuff can be bought off the shelf. You can buy a package of crimeware and put in the email addresses or the domain that you want to attack via a nice user interface. It's really plug-and-play criminality.

You claim that the 2008 Mumbai attackers used real-time data gathering from social networks and other media. How do terrorists use data?

Marc Goodman: Since the Internet arrived, terrorists have been advertising, doing PR, recruiting, and fundraising, all online. But this was the first time that we had seen terrorists use technology to the full extent that this group did during the incident. They had mobile phones and satellite phones. The terrorist war room they set up to monitor the media and feed back information in real time to the attackers was a really significant innovation.

They re-engineered the attack mid-incident to kill more people. They were constantly looking for new hostages. Organizations like the BBC and CNN were tweeting to ask people on the ground in Mumbai to contact a producer. People trapped in hotels called the TV stations. All of that information was being tracked by the terrorist war room. There was an Indian minister who was doing a live interview on the Indian Broadcast Network (IBN) while hiding in the kitchen of the ballroom of the Taj Mahal hotel. The war room picked this up and directed the attackers to that part of the hotel where they could find the minister.

What can be done to combat cyber crime?

Marc Goodman: The terrorism problem is very different from the cyber crime problem. Most terrorism tends to have a basis in the real world whereas cyber crime tends to be purely online. Governments are pretty good at tracking the terrorists in their own countries, and there is decent international cooperation on terrorism.

What is making things more difficult for governments is that, in the old days, if you tapped somebody's home phone, you had a good picture of what was going on. Now you don't know where to look. Are they communicating on Facebook, on Twitter, or having a meeting in World of Warcraft?

Law enforcement needs to develop better systems to deal with the madness of social media in terrorist attacks. The public is getting involved in ways that are, frankly, unhealthy. There was a hostage situation in the U.S. a couple of months ago where a man took a hostage and was sexually assaulting her. He was trapped in a hotel room with guns and was posting live on Facebook and Twitter. Then the public started to interact with the hostage-taker, tweeting things like, "You wouldn't kill her. You are not brave enough to do it." In the past, police could close off several blocks, put up yellow crime scene tape, close the airspace over the scene, and bring in a trained negotiator. How does law enforcement intervene when there can be a completely disintermediated conversation between the criminal or terrorist and the general public?

Marc Goodman discussed the business of illegal data at Strata New York 2011. His full presentation is available in the following video:

This interview was edited and condensed.


November 30 2011

Big data goes to work

Companies that are slow to adopt data-driven practices don't need to worry about long-term plans — they'll be disrupted out of existence before those deadlines arrive. And even if your business is on the data bandwagon, you shouldn't get too comfortable. Shifts in consumer tolerances and expectations are quickly shaping how businesses apply big data.

Alistair Croll, Strata online program chair, explores these shifts and other data developments in the following interview. Many of these same topics will be discussed at "Moving to Big Data," a free Strata Online Conference being held Dec. 7.

How are consumer expectations about data influencing enterprises?

Alistair Croll: There are two dimensions. First, consumer tolerance for sharing data has gone way up. I think there's a general realization that shared information isn't always bad: we can use it to understand trends or fight diseases. Recent rulings by the Supreme Court and legislation like the Genetic Information Nondiscrimination Act (GINA) offer some degree of protection. This means it's easier for companies to learn about their customers.

Second, consumers expect that if a company knows about them, it will treat them personally. We're incensed when a vendor that claims to have a personal connection with us treats us anonymously. The pact of sharing is that we demand personalization in return. That means marketers are scrambling to turn what they know about their customers into changes in how they interact with them.

What's the relationship between traditional business intelligence (BI) and big data? Are they adversaries?

Alistair Croll: Big data is a successor to traditional BI, and in that respect, there's bound to be some bloodshed. But both BI and big data are trying to do the same thing: answer questions. If big data gets businesses asking better questions, it's good for everyone.

Big data is different from BI in three main ways:

  1. It's about more data than BI, and this is certainly a traditional definition of big data.
  2. It's about faster data than BI, which means exploration and interactivity, and in some cases delivering results in less time than it takes to load a web page.
  3. It's about unstructured data, which we only decide how to use after we've collected it and need algorithms and interactivity in order to find the patterns it contains.

When traditional BI bumps up against the edges of big, fast, or unstructured, that's when big data takes over. So, it's likely that in a few years we'll ask a business question, and the tools themselves will decide if they can use traditional relational databases and data warehouses or if they should send the task to a different architecture based on its processing requirements.

What's obvious to anyone on either side of the BI/big data fence is that the importance of asking the right questions — and the business value of doing so — has gone way, way up.
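The routing decision described above, where a tool picks an architecture based on whether a question is too big, too fast, or too unstructured for traditional BI, can be sketched in a few lines. This is only an illustration: the `Question` fields, the threshold values, and the backend names are all assumptions made up for the example, not anything from the interview or from a real product.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """A business question, described by its processing requirements (hypothetical fields)."""
    data_gb: float        # how much data must be scanned
    max_latency_s: float  # how quickly the answer is needed
    structured: bool      # does the data already fit a known schema?


# Illustrative thresholds only; real limits depend on the systems involved.
MAX_WAREHOUSE_GB = 1_000
MIN_WAREHOUSE_LATENCY_S = 1.0


def choose_backend(q: Question) -> str:
    """Send a question to the warehouse unless it hits one of the three edges."""
    if not q.structured:
        return "big data stack"  # unstructured: no schema decided up front
    if q.data_gb > MAX_WAREHOUSE_GB:
        return "big data stack"  # too big for the warehouse
    if q.max_latency_s < MIN_WAREHOUSE_LATENCY_S:
        return "big data stack"  # needs an answer faster than the warehouse responds
    return "data warehouse"


print(choose_backend(Question(data_gb=50, max_latency_s=60, structured=True)))
print(choose_backend(Question(data_gb=5_000, max_latency_s=60, structured=True)))
```

The point of the sketch is that the person asking never has to pick the architecture; the requirements of the question do.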

How can businesses unlock their data? What's involved in that process?

Alistair Croll: The first step is to ask the right questions. Before, a leader was someone who could convince people to act in the absence of clear evidence. Today, it's someone who knows what questions to ask.

Acting in the absence of clear evidence mattered because we lived in a world of risk and reward. Uncertainty meant we didn't know which course of action to take — and that if we waited until it was obvious, all the profit would have evaporated.

But today, everyone has access to more data than they can handle. There are simply too many possible actions, so the spoils go to the organization that can choose among them. This is similar to the open-source movement: Goldcorp took its geological data on gold deposits — considered the "crown jewels" in the mining industry — and shared it with the world, creating a contest to find rich veins to mine. Today, they're one of the most successful mining companies in the world. That comes from sharing and opening up data, not hoarding it.

Finally, the value often isn't in the data itself; it's in building an organization that can act on it swiftly. Military strategist John Boyd developed the observe, orient, decide and act (OODA) loop, which is a cycle of collecting information and acting that fighter pilots could use to outwit their opponents. Pilots talk of "getting inside" the enemy's OODA loop; companies need to do the same thing.

So, businesses need to do three things:

  1. Learn how to ask the right questions instead of leading by gut feel and politics.
  2. Change how they think about data, opening it up to make the best use of it when appropriate and realizing that there's a risk in being too private.
  3. Tune the organization to iterate more quickly than competitors by collecting, interpreting, and testing information on its markets and customers.

Moving to Big Data: Free Strata Online Conference — In this free online event, being held Dec. 7, 2011, at 9AM Pacific, we'll look at how big data stacks and analytical approaches are gradually finding their way into organizations as well as the roadblocks that can thwart efforts to become more data driven. (This Strata Online Conference is sponsored by Microsoft.)

Register to attend this free Strata Online Conference

What are the most common data roadblocks in companies?

Alistair Croll: Everyone I talk to says privacy, governance, and compliance. But if you really dig in, it's culture. Employees like being smart, or convincing, or compelling. They've learned soft skills like negotiation, instinct, and so on.

Until now, that's been enough to win friends and influence people. But the harsh light of data threatens existing hierarchies. When you have numbers and tests, you don't need arguments. All those gut instincts are merely hypotheses ripe for testing, and that means the biggest obstacle is actually company culture.

Are most businesses still in the data acquisition phase? Or are you seeing companies shift into data application?

Alistair Croll: These aren't really phases. Companies have a cycle — call it a data supply chain — that consists of collection, interpretation, sharing, and measuring. They've been doing it for structured data for decades: sales by quarter, by region, by product. But they're now collecting more data, without being sure how they'll use it.

We're also seeing them asking questions that can't be answered by traditional means, either because there's too much data to analyze in a timely manner, or because the tools they have can't answer the questions they have. That's bringing them to platforms like Hadoop.

One of the catalysts for this adoption has been web analytics, which is, for many firms, their first taste of big data. And now, marketers are asking, "If I have this kind of insight into my online channels, why can't I get it elsewhere?" Tools once used for loyalty programs and database marketing are being repurposed for campaign management and customer insight.

How will big data shape businesses over the next few years?

Alistair Croll: I like to ask people, "Why do you know more about your friends' vacations (through Facebook or Twitter) than about whether you're going to make your numbers this quarter or where your trucks are?" The consumer web is writing big data checks that enterprise BI simply can't cash.

Where I think we'll see real disruption and adoption is in horizontal applications. The big data limelight is focused on vertical stuff today — genomics, algorithmic trading, and so on. But when it's used to detect employee fraud or to hire and fire the right people, or to optimize a supply chain, then the benefits will be irresistible.

In the last decade, web analytics, CRM, and other applications have found their way into enterprise IT through the side door, in spite of the CIO's allergies to outside tools. These applications are often built on "big data," scale-out architectures.

Which companies are doing data right?

Alistair Croll: Unfortunately, the easy answer is "the new ones." Despite having all the data, Blockbuster lost to Netflix; Barnes & Noble lost to Amazon. It may be that, just like the switch from circuits to packets or from procedural to object-oriented programming, running a data-driven business requires a fundamentally different skill set.

Big firms need to realize that they're sitting on a massive amount of information but are unable to act on it unless they loosen up and start asking the right questions. And they need to realize that big data is a massive disintermediator, from which no industry is safe.

This interview was edited and condensed.


October 04 2011

Four short links: 4 October 2011

  1. -- Singaporean version of TechStars, with 100-day program ("the bootcamp") Jan-Apr 2012. Startups from anywhere in the world can apply, and will want to because Singapore is the gateway to Asia. They'll also have mentors from around the world.
  2. Oracle NoSQLdb -- Oracle want to sell you a distributed key-value store. It's called "Oracle NoSQL" (as opposed to PostgreSQL, which is SQL No-Oracle). (via Edd Dumbill)
  3. Facebook Browser -- interesting thoughts about why the browser might be a good play for Facebook. I'm not so sure: browsers don't lend themselves to small teams, and search advertising doesn't feel like a good fit with Facebook's existing work. Still, it makes me grumpy to see browsers become weapons again.
  4. Bitbucket -- a competitor to Github, from the folks behind the widely-respected Jira and Confluence tools. I'm a little puzzled, to be honest: Github doesn't seem to have weak spots (the way, for example, that Sourceforge did).

August 09 2011

There's no such thing as big data

“You know,” said a good friend of mine last week, “there’s really no such thing as big data.”

I sighed a bit inside. In the past few years, cloud computing critics have said similar things: that clouds are nothing new, that they’re just mainframes, that they’re just painting old technologies with a cloud brush to help sales. I’m wary of this sort of techno-Luddism. But this person is sharp, and not usually prone to verbal linkbait, so I dug deeper.

He’s a ridiculously heavy traveler, racking up hundreds of thousands of miles in the air each year. He’s the kind of flier airlines dream of: loyal, well-heeled, and prone to last-minute, business-class trips. He’s exactly the kind of person an airline needs to court aggressively, one who represents a disproportionately large amount of revenue. He’s an outlier of the best kind. He’d been a top-ranked passenger with United Airlines for nearly a decade, using their Mileage Plus program for everything from hotels to car rentals.

And then his company was acquired.

The acquiring firm had a contractual relationship with American Airlines, a competitor of United with a completely separate loyalty program. My friend’s air travel on United and its partner airlines dropped to nearly nothing.

He continued to book hotels in Shanghai, rent cars in Barcelona, and buy meals in Tahiti, and every one of those transactions was tied to his loyalty program with United. So the airline knew he was traveling -- just not with them.

Astonishingly, nobody ever called him to inquire about why he'd stopped flying with them. As a result, he’s far less loyal than he was. But more importantly, United has lost a huge opportunity to try to win over a large company’s business, with a passionate and motivated inside advocate.

And this was his point about big data: that given how much traditional companies put it to work, it might as well not exist. Companies have countless ways they might use the treasure troves of data they have on us. Yet all of this data lies buried, sitting in silos. It seldom sees the light of day.

When a company does put data to use, it’s usually a disruptive startup. Zappos and customer service. Amazon and retailing. Craigslist and classified ads. Zillow and house purchases. LinkedIn and recruiting. eBay and payments. Ryanair and air travel. One by one, industry incumbents are withering under the harsh light of data.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Big data and the innovator's dilemma

Large companies with entrenched business models tend to cling to their buggy-whips. They have a hard time breaking their own business models, as Clay Christensen so clearly stated in "The Innovator’s Dilemma," but it's too easy to point the finger at simple complacency.

Early-stage companies have a second advantage over more established ones: they can ask for forgiveness instead of permission. Because they have less to lose, they can make risky bets. In the early days of PayPal, the company could skirt regulations more easily than Visa or Mastercard, because it had far less to fear if it was shut down. This helped it gain market share while established credit-card companies were busy with paperwork.

The real problem is one of asking the right questions.

At a big data conference run by The Economist this spring, one of the speakers made a great point: Archimedes had taken baths before.

(Quick historical recap: In an almost certainly apocryphal tale, Hiero of Syracuse had asked Archimedes to devise a way of measuring density, an indicator of purity, in irregularly shaped objects like gold crowns. Archimedes realized that the level of water in a bath changed as he climbed in, making it an indicator of volume. Eureka!)

The speaker’s point was this: it was the question that prompted Archimedes’ realization.

Small, agile startups disrupt entire industries because they look at traditional problems with a new perspective. They’re fearless, because they have less to lose. But big, entrenched incumbents should still be able to compete, because they have massive amounts of data about their customers, their products, their employees, and their competitors. They fail because often they just don’t know how to ask the right questions.

In a recent study, McKinsey found that by 2018, the U.S. will face a shortage of 1.5 million managers who are fluent in data-based decision making. It’s a lesson not lost on leading business schools: several of them are introducing business courses in analytics.

Ultimately, this is what my friend’s airline example underscores. It takes an employee, deciding that the loss of high-value customers is important, to run a query of all their data and find him, and then turn that into a business advantage. Without the right questions, there really is no such thing as big data -- and today, it’s the upstarts that are asking all the good questions.
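To make that concrete, here is a rough sketch of the kind of question an employee could have asked: which heavy fliers have nearly stopped flying with us but are still booking hotels and cars through the loyalty program? Everything in it is an assumption for illustration: the `Member` fields, the thresholds, and the sample records are invented, and a real airline would run this against its own systems, not a Python list.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One loyalty-program member; all fields are hypothetical."""
    name: str
    miles_last_year: int         # miles flown with the airline last year
    miles_this_year: int         # miles flown with the airline this year
    partner_txns_this_year: int  # hotel, car, and dining transactions this year


def lapsed_high_value(members, min_prior_miles=100_000, max_retained=0.2):
    """Return former heavy fliers who stopped flying but still use partner services."""
    flagged = []
    for m in members:
        was_heavy = m.miles_last_year >= min_prior_miles
        stopped_flying = m.miles_this_year <= max_retained * m.miles_last_year
        still_traveling = m.partner_txns_this_year > 0
        if was_heavy and stopped_flying and still_traveling:
            flagged.append(m)
    # Biggest former fliers first, so the most valuable calls get made first.
    return sorted(flagged, key=lambda m: m.miles_last_year, reverse=True)


# Invented sample data.
members = [
    Member("frequent flier", 250_000, 5_000, 40),
    Member("steady flier", 120_000, 110_000, 12),
    Member("occasional flier", 8_000, 0, 2),
    Member("retired flier", 150_000, 0, 0),
]

for m in lapsed_high_value(members):
    print(m.name)
```

The query itself is trivial. What was missing in the airline's case was someone deciding that the question mattered.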

When it comes to big data, you either use it or lose.

This is what we’re hoping to explore at Strata JumpStart in New York next month. Rather than taking a vertical look at a particular industry, we’re looking at the basics of business administration through a big data lens. We'll be looking at applying big data to HR, strategic planning, risk management, competitive analysis, supply chain management, and so on. In a world flooded by too much data and too many answers, tomorrow's business leaders need to learn how to ask the right questions.

February 07 2011

Four short links: 7 February 2011

  1. UK Internet Entrepreneurs (Guardian) -- two things stood out for me. (1) A startup focused on 3D printing better dolls for boys and girls. (2) It seems easier for the government to start something new and impose its own vision than it is to understand and integrate with what already exists.
  2. TreeSaver.js -- MIT/GPLv2-licensed JavaScript framework for creating magazine-style layouts using standards-compliant HTML and CSS.
  3. Using git to Manage a Web Site -- This page describes how I set things up so that I can make changes live by running just "git push web".
  4. Strata Data Conference Recap -- Clean data > More Data > Fancy Math — this is the order which makes data easier and better to work with. Clean data will be easier to work with and provide best results. If your data isn't clean, it is better to have more data than having to resort to fancy math. Using higher order statistical processing, while workable as a last resort, will require longer to develop, difficult algorithms and harder to maintain. So best place to focus is to start with clean data.

January 19 2011

Data startups: we want you

The O'Reilly Strata Conference on making data work is almost upon us. There's one final opportunity to be a part of this epoch-defining event: the Startup Showcase.

We're seeking startups that want to pitch the attendees — a broad selection of leaders from the business and investment community, as well as elite developers and savvy data-heads. Successful applicants will receive a couple of Strata registrations gratis, and the chance to be one of three winners who get to give their company pitches on the main stage.

Our judges include Roger Ehrenberg of IA Ventures, one of the leading investors in data-driven companies, and Tim Guleri of Sierra Ventures, whose successes include big data star Greenplum, which was recently acquired by EMC.

You've got until the end of this week to tell us why your startup should be a part of Strata. Submissions close on Friday night (Jan. 21), so apply now.

January 04 2011

Everyone loves a science fair

Our first Strata conference is just a month away. It's exceeded our expectations in nearly every respect, from the breadth of topics, to the calibre of presenters, to the outpouring of support and interest in subjects like Big Data, visualization, ubiquitous computing, and new interfaces.

But there's one part of Strata I'm especially excited about: the science fair. There are hundreds of groundbreaking projects lurking in research labs, universities, and garages, and we want to showcase some of them. So we've set up an event at Strata to let attendees explore some of the weirder artifacts of an always-on world.

As William Gibson observed, the future is here—it's just not evenly distributed. A few short years ago, a portable, wirelessly connected, camera-and-microphone-equipped, touch-and-voice-operated device was science fiction; today, we discard them like digital chaff as soon as Steve Jobs announces a new phone. These cheap, powerful devices are a boon to hobbyists everywhere, from a hacked kinect to an Arduino.

All this big data isn't useful on its own. It needs to be crunched, massaged, and organized by clouds of machines; then it needs to be distributed to the corners of humanity. Ultimately, it has to be accessible and intuitively understandable. And it's this last part that I'm hoping the science fair will show us -- the next Homebrew Computer Club, playing with bytes and devices the way the previous one created the PC revolution from solder, chips, and obsessiveness.

We're still looking for things to show in the science fair (submissions are due by Jan. 14). If you have an interesting tool or technology to show -- the more beta, the better -- let us know. Let's show the world what the future will look like, one project at a time.

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD
