
August 24 2012

The Direct Project has teeth, but it needs pseudonymity

Yesterday, Meaningful Use Stage 2 was released.

You can read the final rule here and you can read the announcement here.

As we read and parse the 900 or so pages of government-issued goodness, you can expect lots of commentary and discussion. Geek Doctor already has a summary, and Motorcycle Guy can be expected to help us all parse the various health IT standards that have been newly blessed. Expect Brian Ahier to be worth reading over the next couple of days, too.

I just wanted to highlight one thing about the newly released rules. As suspected, the actual use of the Direct Project will be a requirement. That means certified electronic health record (EHR) systems will have to implement it, and doctors and hospitals will have to exchange data with it. Awesome.

More importantly, this will be the first health IT interoperability standard with teeth. The National Institute of Standards and Technology (NIST) will be setting up an interoperability test server. It will not be enough to say that you support Direct. People will have to prove it. I love it. This has been the problem with Health Level 7 et al for years. No central standard for testing always means an unreliable and weak standard. Make no mistake, this is a critical and important move from the Office of the National Coordinator for Health Information Technology (ONC).

(Have I mentioned that I love that Farzad Mostashari — our current National Coordinator — uses Twitter? I also love that he has a sense of humor!)

Now we just need to make sure that patient pseudonymity is supported on the Directed Exchange network. To do otherwise is to force patients to trust the whole network rather than merely their own doctors. I have already made that case, but it is really nice to see that both Arien Malec (founding coordinator of the Direct Project) and Sean Nolan (chief architect at Microsoft HealthVault) have weighed in with similar thoughts. Malec wrote a lovely piece that details how to translate patient pseudonymity into NIST assurance levels. Nolan talked about how difficult it would be for HealthVault to have to do identity proofing on patients.

In order to emphasize my point in a more public way, I have beaten everyone to the punch and registered the account DaffyDuck@direct.healthvault.com. Everyone seems to think this is just the kind of madness that we need to avoid. But this is exactly the kind of madness that patients need to really protect their privacy.

Here’s an example. Let’s imagine that I am a pain patient and I am seeking treatment from a pain specialist named Dr. John Doe who works at the Pain No More clinic. His Direct address might be john.doe@direct.painnomore.com.

Now if I provide DaffyDuck@direct.healthvault.com to Dr. Doe and Dr. Doe can be sure that he is always talking to me when he communicates with that address, then there is nothing else that needs to happen here. There never needs to be a formal cryptographic association between DaffyDuck@direct.healthvault.com and Fred Trotter. I know that there is a connection and my doctor knows that there is a connection and those are the only people that need to know.

If any cryptographic or otherwise published association were to exist, then anyone who had access to my public certificates and/or knew of communication between john.doe@direct.painnomore.com and DaffyDuck@direct.healthvault.com could make a pretty good guess about my health care status. I am not actually interested in trusting the Directed Exchange network. I am interested in trusting through the Directed Exchange network. Pseudonymity gives both me and my doctor that privilege. If a patient wants to give a different Direct email address to every doctor they work with, they should have that option.
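To make that option concrete, here is a minimal sketch of how a patient-side tool could mint a distinct pseudonymous Direct address for each provider. It is an illustration only: the address format, the use of the direct.healthvault.com domain, and the helper name are my assumptions, not anything in the Direct specification, and it presumes an address provider that lets patients register arbitrary local parts.

```python
import hashlib
import hmac

def pseudonymous_address(secret_key: bytes, provider: str,
                         domain: str = "direct.healthvault.com") -> str:
    """Derive a stable pseudonymous Direct address for one provider.

    Only the patient holds secret_key, so no one else can link the
    addresses handed to different doctors back to the same person.
    (Hypothetical scheme for illustration, not part of Direct.)
    """
    digest = hmac.new(secret_key, provider.encode(), hashlib.sha256).hexdigest()
    return f"patient-{digest[:16]}@{domain}"

key = b"patient-held secret, never published"
print(pseudonymous_address(key, "john.doe@direct.painnomore.com"))
print(pseudonymous_address(key, "jane.roe@direct.otherclinic.example"))
```

Because the derivation is deterministic, the patient always hands the same address to the same doctor, yet nothing published ties those addresses to each other or to the patient’s legal identity.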

This is a critical patient privacy feature of the Direct protocol and it was designed in from the beginning. It is critical that later policy makers not screw this up.


August 23 2012

Balancing health privacy with innovation will rely on improving informed consent

Society is now faced with how to balance the privacy of the individual patient against the immense social good that could come from greater health data sharing. Making health data more open and fluid has the potential to be both hugely beneficial for patients and enormously harmful. As my colleague Alistair Croll put it this summer, big data may well be a civil rights issue that much of the world doesn’t know about yet.

This tension will likely persist throughout my lifetime as technology spreads around the world. While big data breaches are likely to make headlines, more subtle uses of health data have the potential to enable employers, insurers or governments to discriminate — or worse. Analyzing shopping habits can allow a company to determine that a teenager is pregnant before her father knows. People simply don’t realize how much about their lives can be intuited through analysis of their data exhaust.

To unlock the potential of health data for the public good, informed consent must mean something. Patients must be given the information and context for how and why their health data will be used in clear, transparent ways. To do otherwise is to duck the responsibility that comes with the immense power of big data.

In search of an informed opinion on all of these issues, I called up Deven McGraw (@HealthPrivacy), the director of the Health Privacy Project at the Center for Democracy and Technology (CDT). Our interview, lightly edited for content and clarity, follows.

Should people feel better about, say, getting their genome decoded because the Patient Protection and Affordable Care Act (PPACA) was upheld by the Supreme Court? What about other health-data-based discrimination?

Deven McGraw: The reality that someone could get data and use it in a way that harms people, such as making them unable to get affordable health insurance or to get insurance at all, has been a significant driver of the concerns people have about health data for a very long time.

It’s not the only driver of people’s privacy concerns. Just removing the capacity for entities to do harm to individuals using their health data is probably not going to fully resolve the problem.

It’s important to pursue from a policy standpoint, but it’s also the case that people feel stigmatized by their health data. They feel health care is something they want to be able to pursue privately, even if the chances are very low that anybody could get the information and actually harm them with it by denying them insurance or denying them employment, which is an area we actually haven’t fully fixed. Your ability to get life insurance or disability insurance was not fixed by the Affordable Care Act.

Even if you fix all of those issues, privacy protections are about building an ecosystem in health care that people will trust. When they need to seek care that might be deemed to be sensitive to them, they feel like they can go get care and have some degree of confidence that that information isn’t going to be shared outside of those who have a need to know it, like health care providers or their insurance company if they are seeking to be reimbursed for care.

Obviously, public health can play a role. The average individual doesn’t realize that, often, their health data is sent to public health authorities if they have certain conditions or diseases, or even just as a matter of routine reporting for surveillance purposes.

Some of this is about keeping a trustworthy environment for individuals so they can seek the care they need. That’s a key goal for privacy. The other aspect of it is making sure we have the data available for important public purposes, but in a way that respects the fact that this data is sensitive.

We need to not be disrupting the trust people have in the health care system. If you can’t give people some reasonable assurance about how their data is used, there are lots of folks who will decline to seek care or will lie about health conditions when truthfulness is important.

Are health care providers and services being honest about health data use?

Deven McGraw: Transparency and openness about how we use health data in this country is seriously lacking. Part of it is the challenge of being up front with people, disclosing things they need to know but not overwhelming them with so much information in a consent form that they just sign on the bottom and don’t read it and don’t fully understand it.

It’s really hard to get notice and transparency right, and it’s a constant struggle. The FTC report on privacy talks a lot about how hard it is to be transparent with people about data sharing on the Internet or data collection on your mobile phone.

Ideally, for people to be truly informed, you’d give them an exhaustive amount of information, right? But if you give them too much information, the chances that they’ll read it and understand it are really low. So then people end up saying “yes” to things they don’t even realize they’re saying “yes” to.

On the other hand, we haven’t put enough effort into trying different ways of educating people. We, for too long, have assumed that, in a regulatory regime that provides permissive data sharing within the health care context, people will just trust their doctors.

I’ve been to a seminar on researchers getting access to data. The response of one of the researchers to the issue of “How do you regulate data uses for research?” and “What’s the role of consent?” and “What’s the role of institutional review boards?” was, “Well, people should just trust researchers.”

Maybe some people trust researchers, but that’s not really good enough. You have to earn trust. There’s a lot of room for innovative thinking along those lines. It’s something I have been increasingly itchy to try to dive into in more detail with folks who have expertise in other disciplines, like sociology, anthropology and community-building. What does it take to build trusted infrastructures that are transparent, open and that people are comfortable participating in?

There’s no magic endpoint for privacy, like, “Oh, we have privacy now,” versus, “Oh, we don’t have privacy.” To me, the magic endpoint is whether we have a health care data ecosystem that most people trust. It’s not perfect, but it’s good enough. I don’t think we’re quite there yet.

What specifically needs to happen on the openness and transparency side?

Deven McGraw: When I hear about state-based or community-based health information exchanges (HIE) going out and having town meetings with people in advance of building the HIE, working with the physicians in their communities to make sure they’re having conversations with their patients about what’s happening in the community, the electronic records movement and the HIE they’re building, that’s exactly the kind of work you need to do. When I hear about initiatives where people have actually spent the time and resources to educate patients, it warms my heart.

Yes, it’s fairly time- and resource-intensive, but in my view, it pays huge dividends on the backend, in terms of the level of trust and buy-in the community has to what you’re doing. It’s not that big of a leap. If you live in a community where people tend to go to church on Sundays, reach out to the churches. Ask pastors if you can speak to their congregations. Or bring them along and have them speak to their own congregations. Do tailored outreach to people through vehicles they already trust.

I think a lot of folks are pressed for time and resources, and feeling like digitization of the health care system should have happened yesterday. People are dying from errors in care and not getting their care coordinated. All of that is true. But this is a huge change in health care, and we have to do the hard work of outreach and engagement of patients in the community to do it right. In many ways, it’s a community-by-community effort. We’re not one great ad campaign away from solving the issue.

Is there mistrust for good reason? There have been many years of data breaches, coupled with new fears sparked by hacks enabled by electronic health record (EHR) adoption.

Deven McGraw: Part of it is when one organization has a breach, it’s like they all did. There is a collective sense that the health care industry, overall, doesn’t have its act together. It can’t quite figure out how to do electronic records right when we have breach after breach after breach. If breaches were rare, that would be one thing, but they’re still far too frequent. Institutions aren’t taking the basic steps they could take to reduce breaches. You’re never going to eliminate them, but you certainly can reduce them below where we are today.

In the context of certain secondary data uses, like when parents find out after the fact that blood spots collected from their infants at birth are being used for multiple purposes, you don’t want to surprise people about what you’re doing with their health information, the health information of their children, and that of other family members.

I think most people would be quite comfortable with many uses of health data, including those that do not necessarily directly benefit them but benefit human beings generally, or people who have the same disease, or people like them. In general, we’re actually a fairly generous people, but we don’t want to be surprised by unexpected use.

There’s a tremendous amount of work to do. We have a tendency to think issues like secondary use get resolved by asking for people’s consent ahead of time. Consent certainly plays an important role in protecting people’s privacy and giving them some sense of control over their health care information, but because consent in practice actually doesn’t do such a great job, we can’t over-rely on it to create a trust ecosystem. We have to do more on the openness and transparency side so that people are brought along with where we’re going with these health information technology initiatives.

What do doctors’ offices need to do to mitigate risks from EHR adoption?

Deven McGraw: It’s absolutely true that digitizing data in the absence of the adoption of technical security safeguards puts it much more at risk. You cannot hack into a paper file. If you lose a paper file, you’ve lost one paper file. If you lose a laptop, you’ve lost hundreds of thousands of records, if they’re on there and you didn’t encrypt the data.

Having said that, there are so many tools that you can adopt in technology with data in a digital form that are much stronger from a security standpoint than is true in paper. You can set role-based access controls for who can access a file and track who has accessed a file. You can’t do that with paper. You can use encryption technology. You can use stronger identity and authentication levels in order to make sure the person accessing the data is, in fact, authorized to do so and is the person they say they are on the other end of the transaction.
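As a toy illustration of the role-based access controls and audit trails McGraw describes, here is a minimal Python sketch. The roles, record IDs, and log format are invented for the example; a real EHR would enforce this in its data layer and keep tamper-evident logs.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_log = logging.getLogger("audit")

# Roles permitted to open a clinical record (illustrative only).
ALLOWED_ROLES = {"treating_physician", "nurse_on_care_team", "billing"}

def access_record(user: str, role: str, record_id: str) -> bool:
    """Grant or deny access by role, and log every attempt for audit."""
    granted = role in ALLOWED_ROLES
    audit_log.info("%s user=%s role=%s record=%s granted=%s",
                   datetime.now(timezone.utc).isoformat(),
                   user, role, record_id, granted)
    return granted

access_record("dr_doe", "treating_physician", "record-123")   # granted
access_record("front_desk", "receptionist", "record-123")     # denied
```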

We do need people to adopt those technologies and to use them. You’re talking about a health care industry that has stewardship over some of the most sensitive data we have out there. It’s not the nuclear codes, but for a lot of people, it’s incredibly sensitive data — and yet, we trust the security of that data to rank amateurs. Honestly, there’s no other way around that. The people who create the data are physicians. Most of them don’t have any experience in digital security.

We have to count on the vendors of those systems to build in security safeguards. Then, we have to count on giving physicians and their staffs as much guidance as we can so they can actually deploy those safeguards and don’t create workarounds to them that create bigger holes in the security of the data and potentially create patient safety issues. It’s an enormously complex problem, but it’s not the reason to say, “Well, we can’t do this.”

Due to the efforts of many advocates, as you well know, health data has become a big part of the discussion around open data. What are the risks and benefits?

Deven McGraw: Honestly, people throw the term “open data” around a lot, and I don’t think we have a clear, agreed-upon definition for what that is. It’s a mistake to think that open data means all health data, fully identifiable, available to anybody, for any purpose, for any reason. That would be a totally “open data” environment. No rules, no restrictions, you get what you need. It certainly would be transformative and disruptive. We’d probably learn an awful lot from the data. But at the same time, we’ve potentially completely blown trust in the system because we can give no guarantees to anybody about what’s going to happen with their data.

Open data means creating rules that provide greater access to data but with certain privacy protections in place, such as protections on minimizing the identifiability of the data. That typically has been the way government health data initiatives, for example, have been put forth: the data that’s open, that’s really widely accessible, is data with a very low risk of being identified with a particular patient. The focus is typically on the patient side, but I think, even in the government health data initiatives that I’m aware of, it’s also not identifiable to a particular provider. It’s aggregate data that says, “How often is that very expensive cardiac surgery done and in what populations of patients? What are the general outcomes?” That’s all valuable information but not data at the granular level, where it’s traceable to an individual and, therefore, puts at risk the notion they can confidentially receive care.

We have a legal regime that opens the doors to data use much wider if you mask identifiers in data, remove them from a dataset, or use statistical techniques to render data to have a very low risk of re-identification.
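Here is a rough sketch of what masking identifiers can look like in code, loosely modeled on the kinds of rules in HIPAA’s Safe Harbor method (drop direct identifiers, truncate ZIP codes, top-code extreme ages). The field names are made up, and this is an illustration, not a compliance tool.

```python
# Direct identifiers are dropped outright; quasi-identifiers are coarsened.
DIRECT_IDENTIFIERS = {"name", "ssn", "address", "phone", "email"}

def deidentify(record: dict) -> dict:
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in out:                      # keep only the 3-digit ZIP prefix
        out["zip"] = str(out["zip"])[:3] + "**"
    if "age" in out and out["age"] > 89:  # top-code very high ages
        out["age"] = "90+"
    return out

patient = {"name": "Jane Q. Patient", "ssn": "000-00-0000", "zip": "77002",
           "age": 93, "diagnosis": "hypertension"}
print(deidentify(patient))
# {'zip': '770**', 'age': '90+', 'diagnosis': 'hypertension'}
```

Even a scrubbed record like this can sometimes be re-identified by joining it with outside data, which is exactly the accountability gap McGraw describes next.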

We don’t have a perfect regulatory regime on that front. We don’t have any strict prohibitions against re-identifying that data. We don’t have any mechanisms to hold people accountable if they do re-identify the data, or if they release a dataset that then is subsequently re-identified because they were sloppy in how they de-identified it. We don’t have the regulatory regime that we need to create an open data ecosystem that loosens some of the regulatory constraints on data but in a way that still protects individual privacy to the maximum extent possible.

Again, it’s a balance. What we’re trying to achieve is a very low risk of re-identification; it’s impossible to achieve no risk of re-identification and still have any utility in the data whatsoever, or so I’m told by researchers.

It is absolutely the path we need to proceed down. Our health care system is so messed up and fails so many people so much of the time. If we don’t start using this data, learning from it and deploying testing initiatives more robustly, getting rid of the ones that don’t work and more aggressively pursuing the interventions that do, we’re never going to move the needle. And consumers suffer from that. They suffer as much — or more, quite frankly — than they do from violations of their privacy. The end goal here is we need to create a health care system that works and that people trust. You need to be pursuing both of those goals.

Congress hasn’t had much appetite for passing new health care legislation in this election year, aside from the House trying to repeal PPACA 33 times. That would seem to leave reform up to the U.S. Department of Health and Human Services (HHS), for now. Where do we stand with rulemaking around creating regulatory regimes like those you’ve described?

Deven McGraw: HHS certainly has made progress in some areas and is much more proactive on the issue of health privacy than I think they have been in the past. On the other hand, I’m not sure I can point to significant milestones that have been met.

Some of that isn’t completely their fault. Within an administration, there are multiple decision-makers. For any sort of policy matter where you want to move the ball forward, there’s a fair amount of process and approval up the food chain that has to happen. In an election year, in particular, that whole mechanism gets jammed up in ways that are often disappointing.

We still don’t have finalized HIPAA rules from the HITECH changes, which is really unfortunate. And I’m now thinking we won’t see them until November. Similarly, there was a study on de-identification that Congress called for in the HITECH legislation. It’s two years late, creeping up on three, and we still haven’t seen it.

You can point to those and you sort of throw up your hands and say, “What’s going on? Who’s minding the store?” If we know and appreciate that we need to build this trust environment to move the needle forward on using health IT to address quality and cost issues, then it starts to look very bad in terms of a report card for the agency on those elements.

On the other hand, you have the Office of the National Coordinator for Health IT doing more work through setting funding conditions on states to get them to adopt privacy frameworks for health information exchanges.

You have progress being made by the Office for Civil Rights on HIPAA enforcement. They’re doing audits. They have brought more enforcement actions in the past year than in all the years the regulations were in effect before that. They’re getting serious.

From a research perspective, the other thing I would mention is the efforts to try to make the common rule — the set of rules that governs federally funded research — more consistent with HIPAA and more workable for researchers. But there’s still a lot of work to be done on that initiative as well.

We started the conversation by saying these are really complex issues. They don’t get fixed overnight. In some respects, fast action is less important than getting it right, but we really should be making faster progress than we are.

What does the trend toward epatients and peer-to-peer health care mean for privacy, prevention and informed consent?

Deven McGraw: I think the epatient movement and increase in people’s use of Internet technologies, like social media, to connect with one another and to share data and experiences in order to improve their care is an enormously positive development. It’s a huge game-changer. And, of course, it will have an impact on privacy.

One of the things we’re going to have to keep an eye on is the fact that one out of six people, when they’re surveyed, say they practice what we call “privacy protective behaviors.” They lie to their physicians. They don’t go to seek the care they need, which is often the case with respect to mental illness. Or they seek care out of their area in order to prevent people they might know who work in their local hospital from seeing their data.

But that’s only one out of six people who say that, so there are an awful lot of people who, from the start, even when they’re healthy, are completely comfortable being open with their data. Certainly when you’re sick, your desire is to get better. And when you’re seriously sick, your desire is to save your life. Anything you can do to do that means whatever qualms you may have had about sharing your data, if they existed at all, go right out the window.

On the other hand, we have to build an ecosystem that the one out of six people can use as well. That’s what I’m focusing on, in particular, in the consumer-facing health space, the “Health 2.0 space” and on social media sites. It really should be the choice of the individual about how much data they share. There needs to be a lot of transparency about how that data is used.

When I look at a site like PatientsLikeMe, I know some privacy advocates think it’s horrifying and that those people are crazy for sharing the level of detail in their data on that site. On the other hand, I have read few privacy policies that are as transparent and open about what they do with data as PatientsLikeMe’s policy. They’re very up front about what happens with that data. I’m confident that people who go on the site absolutely know what they’re doing. It’s not my job to tell them they can’t do it.

But we also need to create environments so people can get the benefits of sharing their experiences with other patients who have their disease — because it’s enormously empowering and groundbreaking from a research standpoint — without telling people they have to throw all of their inhibitions out the door.

You clearly care about these issues deeply. How did you end up in your current position?

Deven McGraw: I was working at the National Partnership for Women and Families, which is another nonprofit advocacy organization here in town [Washington, D.C.], as their chief operating officer. I had been working on health information technology policy issues — specifically, the use of technology to improve health care quality and trying to normalize or reduce costs. I was getting increasingly involved in serving as a consumer representative at meetings on health information technology adoption, applauding that adoption, and thinking about what the benefits for consumers were and how we could make sure those happened.

The one issue that kept coming up in those conversations was that we know we need to build in privacy protections for this data and we know we have HIPAA — so where are the gaps? What do we need to do to move the ball forward? I never had enough time to really drill down on that issue because I was the chief operating officer of a nonprofit.

At the time, the Health Privacy Project was an independent nonprofit organization that had been founded and led by one dynamic woman, Janlori Goldman. She was living in New York and was ready to transition the work to somebody else. The CDT approached me about being the director of the Health Privacy Project; they were moving it into CDT to take advantage of all the technology and Internet expertise there at a time when we’re trying to move health care aggressively into the digital space. It was a perfect storm: I was wishing I had more time to think through the privacy issues, and then along came this job, aligned with the way I like to do policy work, which is to sit down with stakeholders and try to figure out a solution that ideally works for everybody.

From a timing perspective, it couldn’t have been more perfect. It was right during the consideration of bills on health IT. There were hearings on health information technology that we were invited to testify in. We wrote papers to put ourselves on the map, in terms of our theory about how to do privacy well in health IT and what the role of patient consent should be in privacy, because a lot of the debate was really spinning around that one issue. It’s been a terrific experience. It’s an enormous challenge.


August 14 2012

Solving the Wanamaker problem for health care

By Tim O’Reilly, Julie Steele, Mike Loukides and Colin Hill

“The best minds of my generation are thinking about how to make people click ads.” — Jeff Hammerbacher, early Facebook employee

“Work on stuff that matters.” — Tim O’Reilly

Doctors in operating room with data

In the early days of the 20th century, department store magnate John Wanamaker famously said, “I know that half of my advertising doesn’t work. The problem is that I don’t know which half.”

The consumer Internet revolution was fueled by a search for the answer to Wanamaker’s question. Google AdWords and the pay-per-click model transformed a business in which advertisers paid for ad impressions into one in which they pay for results. “Cost per thousand impressions” (CPM) was replaced by “cost per click” (CPC), and a new industry was born. It’s important to understand why CPC replaced CPM, though. Superficially, it’s because Google was able to track when a user clicked on a link, and was therefore able to bill based on success. But billing based on success doesn’t fundamentally change anything unless you can also change the success rate, and that’s what Google was able to do. By using data to understand each user’s behavior, Google was able to place advertisements that an individual was likely to click. They knew “which half” of their advertising was more likely to be effective, and didn’t bother with the rest.

Since then, data and predictive analytics have driven ever deeper insight into user behavior such that companies like Google, Facebook, Twitter, Zynga, and LinkedIn are fundamentally data companies. And data isn’t just transforming the consumer Internet. It is transforming finance, design, and manufacturing — and perhaps most importantly, health care.

How is data science transforming health care? There are many ways in which health care is changing, and needs to change. We’re focusing on one particular issue: the problem Wanamaker described when talking about his advertising. How do you make sure you’re spending money effectively? Is it possible to know what will work in advance?

Too often, when doctors order a treatment, whether it’s surgery or an over-the-counter medication, they are applying a “standard of care” treatment or some variation that is based on their own intuition, effectively hoping for the best. The sad truth of medicine is that we don’t really understand the relationship between treatments and outcomes. We have studies to show that various treatments will work more often than placebos; but, like Wanamaker, we know that much of our medicine doesn’t work for half of our patients; we just don’t know which half. At least, not in advance. One of data science’s many promises is that, if we can collect data about medical treatments and use that data effectively, we’ll be able to predict more accurately which treatments will be effective for which patient, and which treatments won’t.

A better understanding of the relationship between treatments, outcomes, and patients will have a huge impact on the practice of medicine in the United States. Health care is expensive. The U.S. spends over $2.6 trillion on health care every year, an amount that constitutes a serious fiscal burden for government, businesses, and our society as a whole. These costs include over $600 billion of unexplained variations in treatments: treatments that cause no differences in outcomes, or even make the patient’s condition worse. We have reached a point at which our need to understand treatment effectiveness has become vital — to the health care system and to the health and sustainability of the economy overall.

Why do we believe that data science has the potential to revolutionize health care? After all, the medical industry has had data for generations: clinical studies, insurance data, hospital records. But the health care industry is now awash in data in a way that it has never been before: from biological data such as gene expression, next-generation DNA sequence data, proteomics, and metabolomics, to clinical data and health outcomes data contained in ever more prevalent electronic health records (EHRs) and longitudinal drug and medical claims. We have entered a new era in which we can work on massive datasets effectively, combining data from clinical trials and direct observation by practicing physicians (the records generated by our $2.6 trillion of medical expense). When we combine data with the resources needed to work on the data, we can start asking the important questions, the Wanamaker questions, about what treatments work and for whom.

The opportunities are huge: for entrepreneurs and data scientists looking to put their skills to work disrupting a large market, for researchers trying to make sense out of the flood of data they are now generating, and for existing companies (including health insurance companies, biotech, pharmaceutical, and medical device companies, hospitals and other care providers) that are looking to remake their businesses for the coming world of outcome-based payment models.

Making health care more effective


What, specifically, does data allow us to do that we couldn’t do before? For the past 60 or so years of medical history, we’ve treated patients as some sort of an average. A doctor would diagnose a condition and recommend a treatment based on what worked for most people, as reflected in large clinical studies. Over the years, we’ve become more sophisticated about what that average patient means, but the same statistical approach didn’t allow for differences between patients. A treatment was deemed effective or ineffective, safe or unsafe, based on double-blind studies that rarely took into account the differences between patients. The exceptions are relatively recent and have been dominated by cancer treatments, the first being Herceptin for breast cancer in women who over-express the Her2 receptor. With the data that’s now available, we can go much further, for a broad range of diseases and interventions that are not just drugs but include surgery, disease management programs, medical devices, patient adherence, and care delivery.

For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more: we know that it’s 100% effective in 70 to 80% of the patients, and ineffective in the rest. That’s not word games, because we can now use genetic markers to tell whether it’s likely to be effective or ineffective for any given patient, and we can tell in advance whether to treat with Tamoxifen or to try something else.

Two factors lie behind this new approach to medicine: a different way of using data, and the availability of new kinds of data. It’s not just stating that the drug is effective on most patients, based on trials (indeed, 80% is an enviable success rate); it’s using artificial intelligence techniques to divide the patients into groups and then determine the difference between those groups. We’re not asking whether the drug is effective; we’re asking a fundamentally different question: “for which patients is this drug effective?” We’re asking about the patients, not just the treatments. A drug that’s only effective on 1% of patients might be very valuable if we can tell who that 1% is, though it would certainly be rejected by any traditional clinical trial.
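A minimal sketch of that change in the question, using made-up numbers: instead of computing one overall response rate, compute the rate within each marker-defined group.

```python
import pandas as pd

# Toy cohort: does each patient carry a genetic marker, and did they respond?
cohort = pd.DataFrame({
    "marker_positive": [True, True, True, False, False, True, False, True],
    "responded":       [True, True, True, False, False, True, False, False],
})

# One number hides the structure; per-group rates reveal it.
print("overall response rate:", cohort["responded"].mean())
print(cohort.groupby("marker_positive")["responded"].mean())
```

On this toy data the drug looks 50% effective overall, but it is 80% effective in marker-positive patients and 0% effective in the rest — the Tamoxifen story in miniature.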

More than that, asking questions about patients is only possible because we’re using data that wasn’t available until recently: DNA sequencing was only invented in the mid-1970s, and is only now coming into its own as a medical tool. What we’ve seen with Tamoxifen is as clear a solution to the Wanamaker problem as you could ask for: we now know when that treatment will be effective. If you can do the same thing with millions of cancer patients, you will both improve outcomes and save money.

Dr. Lukas Wartman, a cancer researcher who was himself diagnosed with terminal leukemia, was successfully treated with sunitinib, a drug that was only approved for kidney cancer. Sequencing the genes of both the patient’s healthy cells and cancerous cells led to the discovery of a protein that was out of control and encouraging the spread of the cancer. The gene responsible for manufacturing this protein could potentially be inhibited by the kidney drug, although it had never been tested for this application. This unorthodox treatment was surprisingly effective: Wartman is now in remission.

While this treatment was exotic and expensive, what’s important isn’t the expense but the potential for new kinds of diagnosis. The price of gene sequencing has been plummeting; it will be a common doctor’s office procedure in a few years. And through Amazon and Google, you can now “rent” a cloud-based supercomputing cluster that can solve huge analytic problems for a few hundred dollars per hour. What is now exotic inevitably becomes routine.

But even more important: we’re looking at a completely different approach to treatment. Rather than a treatment that works 80% of the time, or even 100% of the time for 80% of the patients, a treatment might be effective for a small group. It might be entirely specific to the individual; the next cancer patient may have a different protein that’s out of control, an entirely different genetic cause for the disease. Treatments that are specific to one patient don’t exist in medicine as it’s currently practiced; how could you ever do an FDA trial for a medication that’s only going to be used once to treat a certain kind of cancer?

Foundation Medicine is at the forefront of this new era in cancer treatment. They use next-generation DNA sequencing to discover DNA sequence mutations and deletions that are currently used in standard of care treatments, as well as many other actionable mutations that are tied to drugs for other types of cancer. They are creating a patient-outcomes repository that will be the fuel for discovering the relation between mutations and drugs. Foundation has identified DNA mutations in 50% of cancer cases for which drugs exist (information via a private communication), but are not currently used in the standard of care for the patient’s particular cancer.

The ability to do large-scale computing on genetic data gives us the ability to understand the origins of disease. If we can understand why an anti-cancer drug is effective (what specific proteins it affects), and if we can understand what genetic factors are causing the cancer to spread, then we’re able to use the tools at our disposal much more effectively. Rather than using imprecise treatments organized around symptoms, we’ll be able to target the actual causes of disease, and design treatments tuned to the biology of the specific patient. Eventually, we’ll be able to treat 100% of the patients 100% of the time, precisely because we realize that each patient presents a unique problem.

Personalized treatment is just one area in which we can solve the Wanamaker problem with data. Hospital admissions are extremely expensive. Data can make hospital systems more efficient and help avoid preventable complications such as blood clots and hospital re-admissions. It can also help address the challenge of hot-spotting (a term coined by Atul Gawande): finding people who use an inordinate amount of health care resources. By looking at data from hospital visits, Dr. Jeffrey Brenner of Camden, NJ, was able to determine that “just one per cent of the hundred thousand people who made use of Camden’s medical facilities accounted for thirty per cent of its costs.” Furthermore, many of these people came from two apartment buildings. Designing more effective medical care for these patients was difficult; it doesn’t fit our health insurance system, the patients are often dealing with many serious medical issues (addiction and obesity are frequent complications), and they have trouble trusting doctors and social workers. It’s counter-intuitive, but spending more on some patients now results in spending less on them later, because fewer of them become really sick. While it’s a work in progress, it looks like building appropriate systems to target these high-risk patients and treat them before they’re hospitalized will bring significant savings.

Many poor health outcomes are attributable to patients who don’t take their medications. Eliza, a Boston-based company started by Alexandra Drane, has pioneered approaches to improve compliance through interactive communication with patients. Eliza improves patient drug compliance by tracking which types of reminders work on which types of people; it’s similar to the way companies like Google target advertisements to individual consumers. By using data to analyze each patient’s behavior, Eliza can generate reminders that are more likely to be effective. The results aren’t surprising: if patients take their medicine as prescribed, they are more likely to get better. And if they get better, they are less likely to require further, more expensive treatment. Again, we’re using data to solve Wanamaker’s problem in medicine: we’re spending our resources on what’s effective, on the reminders that are most likely to get patients to take their medications.
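A toy version of that targeting logic, with invented segments and counts (Eliza’s actual models are not public): track refill rates per segment and reminder type, then pick the best performer for each segment.

```python
# (patient_segment, reminder_type) -> (reminders sent, refills that followed).
history = {
    ("under_40", "text"): (200, 120),
    ("under_40", "call"): (200, 60),
    ("over_65", "text"):  (200, 50),
    ("over_65", "call"):  (200, 130),
}

def best_reminder(segment: str) -> str:
    """Pick the reminder type with the highest observed refill rate."""
    rates = {rtype: refills / sent
             for (seg, rtype), (sent, refills) in history.items()
             if seg == segment}
    return max(rates, key=rates.get)

print(best_reminder("under_40"))  # text
print(best_reminder("over_65"))   # call
```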

More data, more sources

The examples we’ve looked at so far have been limited to traditional sources of medical data: hospitals, research centers, doctor’s offices, insurers. The Internet has enabled the formation of patient networks aimed at sharing data. Health social networks are now some of the largest patient communities. As of November 2011, PatientsLikeMe has over 120,000 patients in 500 different condition groups; ACOR has over 100,000 patients in 127 cancer support groups; 23andMe has over 100,000 members in their genomic database; and diabetes health social network SugarStats has over 10,000 members. These are just the larger communities; thousands of small communities have formed around rare diseases, or even uncommon experiences with common diseases. All of these communities are generating data that they voluntarily share with each other and the world.

Increasingly, what they share is not just anecdotal, but includes an array of clinical data. For this reason, these groups are being recruited for large-scale crowdsourced clinical outcomes research.

Thanks to ubiquitous data networking through the mobile network, we can take several steps further. In the past two or three years, there’s been a flood of personal fitness devices (such as the Fitbit) for monitoring your personal activity. There are mobile apps for taking your pulse, and an iPhone attachment for measuring your glucose. There has been talk of mobile applications that would constantly listen to a patient’s speech and detect changes that might be the precursor for a stroke, or would use the accelerometer to report falls. Tanzeem Choudhury has developed an app called Be Well that is intended primarily for victims of depression, though it can be used by anyone. Be Well monitors the user’s sleep cycles, the amount of time they spend talking, and the amount of time they spend walking. The data is scored, and the app makes appropriate recommendations, based both on the individual patient and data collected across all the app’s users.
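Be Well’s actual scoring method isn’t public, so as a hedged illustration, here is one way an app could score a day’s sleep, conversation, and walking against the user’s own baseline.

```python
def wellness_score(today: dict, baseline: dict) -> float:
    """Score today's behavior against a personal baseline, from 0 to 1.

    Each signal contributes the fraction of its baseline achieved,
    capped at 1 so overshooting one signal can't mask another.
    (Hypothetical scoring; not Be Well's real method.)
    """
    signals = ("sleep_hours", "talk_minutes", "walk_minutes")
    parts = [min(today[s] / baseline[s], 1.0) for s in signals]
    return sum(parts) / len(parts)

baseline = {"sleep_hours": 7.5, "talk_minutes": 90, "walk_minutes": 60}
today = {"sleep_hours": 5.0, "talk_minutes": 20, "walk_minutes": 45}
print(round(wellness_score(today, baseline), 2))  # 0.55, well below baseline
```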

Continuous monitoring of critical patients in hospitals has been normal for years; but we now have the tools to monitor patients constantly, in their home, at work, wherever they happen to be. And if this sounds like big brother, at this point most of the patients are willing. We don’t want to transform our lives into hospital experiences; far from it! But we can collect and use the data we constantly emit, our “data smog,” to maintain our health, to become conscious of our behavior, and to detect oncoming conditions before they become serious. The most effective medical care is the medical care you avoid because you don’t need it.

Paying for results

Once we’re on the road toward more effective health care, we can look at other ways in which Wanamaker’s problem shows up in the medical industry. It’s clear that we don’t want to pay for treatments that are ineffective. Wanamaker wanted to know which part of his advertising was effective, not just to make better ads, but also so that he wouldn’t have to buy the advertisements that wouldn’t work. He wanted to pay for results, not for ad placements. Now that we’re starting to understand how to make treatment effective, now that we understand that it’s more than rolling the dice and hoping that a treatment that works for a typical patient will be effective for you, we can take the next step: Can we change the underlying incentives in the medical system? Can we make the system better by paying for results, rather than paying for procedures?

It’s shocking just how badly the incentives in our current medical system are aligned with outcomes. If you see an orthopedist, you’re likely to get an MRI, most likely at a facility owned by the orthopedist’s practice. On one hand, it’s good medicine to know what you’re doing before you operate. But how often does that MRI result in a different treatment? How often is the MRI required just because it’s part of the protocol, when it’s perfectly obvious what the doctor needs to do? Many men have had PSA tests for prostate cancer; but in most cases, aggressive treatment of prostate cancer is a bigger risk than the disease itself. Yet the test itself is a significant profit center. Think again about Tamoxifen, and about the pharmaceutical company that makes it. In our current system, what does “100% effective in 80% of the patients” mean, except for a 20% loss in sales? That’s because the drug company is paid for the treatment, not for the result; it has no financial interest in whether any individual patient gets better. (Whether a statistically significant number of patients has side-effects is a different issue.) And at the same time, bringing a new drug to market is very expensive, and might not be worthwhile if it will only be used on the remaining 20% of the patients. And that’s assuming that one drug, not two, or 20, or 200 will be required to treat the unlucky 20% effectively.

It doesn’t have to be this way.

In the U.K., Johnson & Johnson, faced with the possibility of losing reimbursements for their multiple myeloma drug Velcade, agreed to refund the money for patients who did not respond to the drug. Several other pay-for-performance drug deals have followed since, paving the way for the ultimate transition in pharmaceutical company business models in which their product is health outcomes instead of pills. Such a transition would rely more heavily on real-world outcome data (are patients actually getting better?), rather than controlled clinical trials, and would use molecular diagnostics to create personalized “treatment algorithms.” Pharmaceutical companies would also focus more on drug compliance to ensure health outcomes were being achieved. This would ultimately align the interests of drug makers with patients, their providers, and payors.

Similarly, rather than paying for treatments and procedures, can we pay hospitals and doctors for results? That’s what Accountable Care Organizations (ACOs) are about. ACOs are a leap forward in business model design, where the provider shoulders any financial risk. ACOs represent a new framing of the much maligned HMO approaches from the ’90s, which did not work. HMOs tried to use statistics to predict and prevent unneeded care. The ACO model, rather than controlling doctors with what the data says they “should” do, uses data to measure how each doctor performs. Doctors are paid for successes, not for the procedures they administer. The main advantage that the ACO model has over the HMO model is how good the data is, and how that data is leveraged. The ACO model aligns incentives with outcomes: a practice that owns an MRI facility isn’t incentivized to order MRIs when they’re not necessary. It is incentivized to use all the data at its disposal to determine the most effective treatment for the patient, and to follow through on that treatment with a minimum of unnecessary testing.

When we know which procedures are likely to be successful, we’ll be in a position where we can pay only for the health care that works. When we can do that, we’ve solved Wanamaker’s problem for health care.

Enabling data

Data science is not optional in health care reform; it is the linchpin of the whole process. All of the examples we’ve seen, ranging from cancer treatment to detecting hot spots where additional intervention will make hospital admission unnecessary, depend on using data effectively: taking advantage of new data sources and new analytics techniques, in addition to the data the medical profession has had all along.

But it’s too simple just to say “we need data.” We’ve had data all along: handwritten records in manila folders on acres and acres of shelving. Insurance company records. But it’s all been locked up in silos: insurance silos, hospital silos, and many, many doctor’s office silos. Data doesn’t help if it can’t be moved, if data sources can’t be combined.

There are two big issues here. First, a surprising number of medical records are still either hand-written, or in digital formats that are scarcely better than hand-written (for example, scanned images of hand-written records). Getting medical records into a format that’s computable is a prerequisite for almost any kind of progress. Second, we need to break down those silos.

Anyone who has worked with data knows that, in any problem, 90% of the work is getting the data into a form in which it can be used; the analysis itself is often simple. We need electronic health records: patient data in a more-or-less standard form that can be shared efficiently, data that can be moved from one location to another at the speed of the Internet. Not all data formats are created equal, and some are certainly better than others; but at this point, any machine-readable format, even simple text files, is better than nothing. While there are currently hundreds of different formats for electronic health records, the fact that they’re electronic means that they can be converted from one form into another. Standardizing on a single format would make things much easier, but just getting the data into some electronic form, any form at all, is the first step.

Once we have electronic health records, we can link doctor’s offices, labs, hospitals, and insurers into a data network, so that all patient data is immediately stored in a data center: every prescription, every procedure, and whether that treatment was effective or not. This isn’t some futuristic dream; it’s technology we have now. Building this network would be substantially simpler and cheaper than building the networks and data centers now operated by Google, Facebook, Amazon, Apple, and many other large technology companies. It’s not even close to pushing the limits.

Electronic health records enable us to go far beyond the current mechanism of clinical trials. In the past, once a drug has been approved in trials, that’s effectively the end of the story: running more tests to determine whether it’s effective in practice would be a huge expense. A physician might get a sense for whether any treatment worked, but that evidence is essentially anecdotal: it’s easy to believe that something is effective because that’s what you want to see. And if it’s shared with other doctors, it’s shared while chatting at a medical convention. But with electronic health records, it’s possible (and not even terribly expensive) to collect documentation from thousands of physicians treating millions of patients. We can find out when and where a drug was prescribed, why, and whether there was a good outcome. We can ask questions that are never part of clinical trials: is the medication used in combination with anything else? What other conditions is the patient being treated for? We can use machine learning techniques to discover unexpected combinations of drugs that work well together, or to predict adverse reactions. We’re no longer limited by clinical trials; every patient can be part of an ongoing evaluation of whether his treatment is effective, and under what conditions. Technically, this isn’t hard. The only difficult part is getting the data to move, getting data in a form where it’s easily transferred from the doctor’s office to analytics centers.
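As a small illustration of the kind of question such pooled records can answer, take a toy extract with one row per treated patient. Grouping by drug and co-condition yields outcome rates that pre-approval trials rarely report; the data here is invented.

```python
import pandas as pd

# Toy pooled EHR extract: one row per treated patient.
records = pd.DataFrame({
    "drug":         ["A", "A", "A", "A", "B", "B", "B", "B"],
    "comorbidity":  ["none", "diabetes", "none", "diabetes"] * 2,
    "good_outcome": [1, 0, 1, 0, 1, 1, 0, 1],
})

# How does each drug do in patients who also have diabetes?
print(records.groupby(["drug", "comorbidity"])["good_outcome"].mean())
```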

Solving the problem of hot-spotting (individual patients or groups of patients consuming inordinate medical resources) requires a different combination of information. You can’t locate hot spots if you don’t have physical addresses. Physical addresses can be geocoded (converted from addresses to longitude and latitude, which is more useful for mapping problems) easily enough, once you have them, but you need access to patient records from all the hospitals operating in the area under study. And you need access to insurance records to determine how much health care patients are requiring, and to evaluate whether special interventions for these patients are effective. Not only does this require electronic records, it requires cooperation across different organizations (breaking down silos), and assurance that the data won’t be misused (patient privacy). Again, the enabling factor is our ability to combine data from different sources; once you have the data, the solutions come easily.
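A stripped-down sketch of the aggregation step, with invented claims data. A real pipeline would geocode addresses first; the core move is simply grouping costs by location and looking for concentration.

```python
from collections import Counter

# Hypothetical claims extract: (patient address, claim cost in dollars).
claims = [
    ("100 Main St, Apt 4", 12000), ("100 Main St, Apt 9", 9000),
    ("22 Oak Ave", 800), ("100 Main St, Apt 4", 15000),
    ("7 Pine Rd", 450), ("100 Main St, Apt 1", 11000),
]

cost_by_building = Counter()
for address, cost in claims:
    building = address.split(",")[0]   # collapse apartment units to buildings
    cost_by_building[building] += cost

total = sum(cost_by_building.values())
for building, cost in cost_by_building.most_common(3):
    print(f"{building}: ${cost} ({cost / total:.0%} of spend)")
```

On this toy data a single building accounts for nearly all the spending, which is the Camden pattern in miniature.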

Breaking down silos has a lot to do with aligning incentives. Currently, hospitals are trying to optimize their income from medical treatments, while insurance companies are trying to optimize their income by minimizing payments, and doctors are just trying to keep their heads above water. There’s little incentive to cooperate. But as financial pressures rise, it will become critically important for everyone in the health care system, from the patient to the insurance executive, to ensure that they are getting the most for their money. While there’s intense cultural resistance to be overcome (through our experience in data science, we’ve learned that it’s often difficult to break down silos within an organization, let alone between organizations), the pressure of delivering more effective health care for less money will eventually break the silos down. The old zero-sum game of winners and losers must end if we’re going to have a medical system that’s effective over the coming decades.

Data becomes infinitely more powerful when you can mix data from different sources: many doctor’s offices, hospital admission records, address databases, and even the rapidly increasing stream of data coming from personal fitness devices. The challenge isn’t employing our statistics more carefully, precisely, or guardedly. It’s about letting go of an old paradigm that starts by assuming only certain variables are key and ends by correlating only these variables. This paradigm worked well when data was scarce, but if you think about it, these assumptions arise precisely because data is scarce. We didn’t study the relationship between leukemia and kidney cancers because that would require asking a huge set of questions that would require collecting a lot of data; and a connection between leukemia and kidney cancer is no more likely than a connection between leukemia and flu. But the existence of data is no longer a problem: we’re collecting the data all the time. Electronic health records let us move the data around so that we can assemble a collection of cases that goes far beyond a particular practice, a particular hospital, a particular study. So now, we can use machine learning techniques to identify and test all possible hypotheses, rather than just the small set that intuition might suggest. And finally, with enough data, we can get beyond correlation to causation: rather than saying “A and B are correlated,” we’ll be able to say “A causes B,” and know what to do about it.
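A hedged sketch of that hypothesis scan on synthetic data: test every pair of conditions for association, then correct for the number of tests so the scan doesn’t drown in false positives. The condition names and prevalences are invented.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
conditions = ["leukemia", "kidney_cancer", "flu", "diabetes"]
# Synthetic patient-by-condition matrix; real data would come from pooled EHRs.
data = pd.DataFrame(rng.random((5000, 4)) < 0.05, columns=conditions)

tests = []
for a, b in combinations(conditions, 2):
    _, p = fisher_exact(pd.crosstab(data[a], data[b]))
    tests.append((a, b, p))

# Bonferroni correction: we tested every pair, so scale by the test count.
m = len(tests)
for a, b, p in sorted(tests, key=lambda t: t[2]):
    print(f"{a} ~ {b}: p={p:.3f}, significant after correction: {p * m < 0.05}")
```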

Building the health care system we want

The U.S. ranks 37th among developed economies in life expectancy and other measures of health, while far outspending other countries on per-capita health care. We spend 18% of GDP on health care, while other countries on average spend on the order of 10% of GDP. We spend a lot of money on treatments that don’t work, because we have, at best, a poor understanding of what will and won’t work.

Part of the problem is cultural. In a country where even pets can have hip replacement surgery, it’s hard to imagine not spending every penny you have to prolong Grandma’s life — or your own. The U.S. is a wealthy nation, and health care is something we choose to spend our money on. But wealthy or not, nobody wants ineffective treatments. Nobody wants to roll the dice and hope that their biology is similar enough to a hypothetical “average” patient. No one wants a “winner take all” payment system in which the patient is always the loser, paying for procedures whether or not they are helpful or necessary. Like Wanamaker with his advertisements, we want to know what works, and we want to pay for what works. We want a smarter system where treatments are designed to be effective on our individual biologies; where treatments are administered effectively; where our hospitals are used effectively; and where we pay for outcomes, not for procedures.

We’re on the verge of that new system now. We don’t have it yet, but we can see it around the corner. Ultra-cheap DNA sequencing in the doctor’s office, massive inexpensive computing power, the availability of EHRs to study whether treatments are effective even after the FDA trials are over, and improved techniques for analyzing data are the tools that will bring this new system about. The tools are here now; it’s up to us to put them into use.


August 13 2012

A grisly job for data scientists

Javier Reveron went missing from Ohio in 2004. His wallet turned up in New York City, but he was nowhere to be found. By the time his parents arrived to search for him and hand out fliers, his remains had already been buried in an unmarked indigent grave. In New York, where coroner’s resources are precious, remains wait a few months to be claimed before they’re buried by convicts in a potter’s field on uninhabited Hart Island, just off the Bronx in Long Island Sound.

The story, reported by the New York Times last week, has as happy an ending as it could given that beginning. In 2010 Reveron’s parents added him to a national database of missing persons. A month later police in New York matched him to an unidentified body and his remains were disinterred, cremated and given burial ceremonies in Ohio.

Reveron’s ordeal suggests an intriguing, and impactful, machine-learning problem. The Department of Justice maintains separate national, public databases for missing people, unidentified people and unclaimed people. Many records are full of rich data that is almost never a perfect match to data in other databases — hair color entered by a police department might differ from how it’s remembered by a missing person’s family; weights fluctuate; scars appear. Photos are provided for many missing people and some unidentified people, and matching them is difficult. Free-text fields in many entries describe the circumstances under which missing people lived and died; a predilection for hitchhiking could be linked to a death by the side of a road.
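To make the shape of the problem concrete, here is a toy Python sketch of soft record matching, using only the standard library. The fields, weights, and records are invented; a production matcher would weigh many more fields and learn its weights from resolved cases, which is exactly why the training-set question discussed next matters.

```python
# A toy sketch of soft matching between a missing-person record and an
# unidentified-person record. All fields and weights are invented.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """Similarity of two free-text fields, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(missing: dict, unidentified: dict) -> float:
    """Blend noisy evidence softly, since exact fields rarely agree."""
    score = 0.3 * field_sim(missing["hair"], unidentified["hair"])
    # Tolerate fluctuation: a 10 lb weight difference still scores well.
    score += 0.3 * max(0.0, 1 - abs(missing["weight"] - unidentified["weight"]) / 50)
    # Free-text circumstances carry a lot of signal, per the hitchhiking example.
    score += 0.4 * field_sim(missing["circumstances"], unidentified["circumstances"])
    return score

missing = {"hair": "Brown", "weight": 160,
           "circumstances": "known to hitchhike along interstate highways"}
found = {"hair": "brown", "weight": 172,
         "circumstances": "remains recovered by the side of a road"}
print(f"match score: {match_score(missing, found):.2f}")
```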

I’ve called the Department of Justice (DOJ) to ask about the extent to which they’ve worked with computer scientists to match missing and unidentified people, and will update when I hear back. One thing that’s not immediately apparent is the public availability of the necessary training set — cases that have been successfully matched and removed from the lists. The DOJ apparently doesn’t comment on resolved cases, which could make getting this data difficult. But perhaps there’s room for a coalition to request the anonymized data and manage it to the DOJ’s satisfaction while distributing it to capable data scientists.

Photo: Missing Person: Ai Weiwei by Daquella manera, on Flickr


With new maps and apps, the case for open transit gets stronger

Earlier this year, the news broke that Apple would be dropping default support for transit in iOS 6. For people (like me) who use the iPhone to check transit routes and times when they travel, that would mean losing a key feature. It also has the potential to decrease the demand for open transit data from cities, which has open government advocates like Clay Johnson concerned about public transportation and iOS 6.

This summer, New York City-based non-profit Open Plans launched a Kickstarter campaign to fund a new iPhone transit app to fill in the gap.

“From the public perspective, this campaign is about putting an important feature back on the iPhone,” wrote Kevin Webb, a principal at Open Plans, via email. “But for those of us in the open government community, this is about demonstrating why open data matters. There’s no reason why important civic infrastructure should get bound up in a fight between Apple and Google. And in communities with public GTFS, it won’t.”

Open Plans already had a head start in creating a patch for the problem: they’ve been working with transit agencies over the past few years to build OpenTripPlanner, an open source application that uses open transit data to help citizens make transit decisions.

“We were already working on the back-end to support this application but decided to pursue the app development when we heard about Apple’s plans with iOS,” explained Webb. “We were surprised by the public response around this issue (the tens of thousands who joined Walkscore’s petition and wanted to offer a constructive response).”

Crowdfunding digital city infrastructure?

That’s where Kickstarter and crowdfunding come into the picture. The Kickstarter campaign would help Open Plans make OpenTripPlanner a native iPhone app, followed by Android and HTML5 apps down the road. Open Plans’ developers have decided that given mobile browser limitations in iOS, particularly the speed of JavaScript apps, an HTML5 app isn’t a replacement for a native app.

Kickstarter has emerged as a platform for more than backing ideas for cool iPod watches or services. Increasingly, it’s looking like Kickstarter could be a new way for communities to collectively fund the creation of civic apps or services for their towns that government isn’t agile enough to deliver. While that’s sure to make some people in traditional positions of power uneasy, it also might be a way to do an end-run around traditional procurement processes — contingent upon cities acting as platforms for civic startups to build upon.

“We get foundation and agency-based contract support for our work already,” wrote Webb. “However, we’ve discovered that foundations aren’t interested in these kinds of rider-facing tools, and most agencies don’t have the discretion or the budget to support the development of something universal. As a result, these kinds of projects require speculative investment. One of the awesome things about open data is that it lets folks respond directly and constructively by building something to solve a need, rather than waiting on others to fix it for them.

“Given our experience with transit and open data, we knew that this was a solvable problem; it just required someone to step up to the challenge. We were well positioned to take on that role. However, as a non-profit, we don’t have unlimited resources, so we’d ask for help. Kickstarter seems like the right fit, given the widespread public interest in the problem, and an interesting way to get the message out about our perspective. Not only do we get to raise a little money, but we’re also sharing the story about why open data and open source matter for public infrastructure with a new audience.”

Civic code in active re-use

Webb, who has previously staked out a position that iOS 6 will promote innovation in public transit, says that OpenTripPlanner is already a thriving open source project, with a recent open transit launch in New Orleans, a refresh in Portland and other betas soon to come.

In a welcome development for DC cyclists (including this writer), a version of OpenTripPlanner went live recently at BikePlanner.org. The web app, which notably uses OpenStreetMap as a base layer, lets users either plot a course for their own bike or tap into the Capital Bikeshare network in DC. BikePlanner is a responsive HTML5 app, which means that it looks good and works well on a laptop, iPad, iPhone or Android device.

Focusing on just open transit apps, however, would be to miss the larger picture of new opportunities to build improvements to digital city infrastructure.

There’s a lot more at stake than just rider-facing tools, in Webb’s view — from urban accessibility to extending the GTFS data ecosystem.

“There’s a real need to build a national (and eventually international) transit data infrastructure,” said Webb. “Right now, the USDOT has completely fallen down on the job. The GTFS support we see today is entirely organic, and there’s no clear guidance anywhere about making data public or even creating GTFS in the first place. That means building universal apps takes a lot of effort just wrangling data.”

August 09 2012

Five elements of reform that health providers would rather not hear about

The quantum leap we need in patient care requires a complete overhaul of record-keeping and health IT. Leaders of the health care field know this and have been urging the changes on health care providers for years, but the providers are having trouble accepting the changes for several reasons.

What’s holding them back? Change certainly costs money, but the industry is already groaning its way through enormous paradigm shifts to meet the current financial and regulatory climate, so the money might as well be directed to things that work. Training staff to handle patients differently is also difficult, but the staff on the floor of these institutions are experiencing burnout and can be inspired by a new direction. The fundamental resistance seems to be health providers’ and their vendors’ expectations about the control they need to conduct their business profitably.

A few months ago I wrote an article titled Five Tough Lessons I Had to Learn About Health Care. Here I’ll delineate some elements of a new health care system that are promoted by thought leaders, that echo the evolution of other industries, and that will seem utterly natural in a couple of decades, but that providers are loath to consider. I feel that leaders in the field are not confronting that resistance with an equivalent sense of conviction that these changes are crucial.

1. Reform will not succeed unless electronic records standardize on a common, robust format

Records are not static. They must be combined, parsed, and analyzed to be useful. In the health care field, records must travel with the patient. Furthermore, we need an explosion of data analysis applications in order to drive diagnosis, public health planning, and research into new treatments.

Interoperability is a common mantra these days in talking about electronic health records, but I don’t think the power and urgency of record formats can be conveyed in eight-syllable words. It can be conveyed better by a site that uses data about hospital procedures, costs, and patient satisfaction to help consumers choose a desirable hospital. Or an app that might prevent a million heart attacks and strokes.

Data-wise (or data-ignorant), doctors are stuck in the 1980s, buying proprietary record systems that don’t work together even between different departments in a hospital, or between outpatient clinics and their affiliated hospitals. Now the vendors are responding to pressures from both government and the market by promising interoperability. The federal government has taken this promise as good coin, hoping that vendors will provide windows onto their data. It never really happens. Every baby step toward opening up one field or another requires additional payments to vendors or consultants.

That’s why exchanging patient data (health information exchange) requires a multi-million dollar investment, year after year, and why most HIEs go under. And that’s why the HL7 committee, putatively responsible for defining standards for electronic health records, keeps on putting out new, complicated variations on a long history of formats that were not well enough defined to ensure compatibility among vendors.

The Direct project and perhaps the nascent RHEx RESTful exchange standard will let hospitals exchange the limited types of information that the government forces them to exchange. But they won’t create a platform (as suggested in this PDF slideshow) for the hundreds of applications we need to extract useful data from records. Nor will they open the records to the masses of data we need to start collecting. It remains to be seen whether Accountable Care Organizations, which are the latest reform in U.S. health care and are described in this video, will be able to use current standards to exchange the data that each member institution needs to coordinate care. Shahid Shah has laid out in glorious detail the elements of open data exchange in health care.

2. Reform will not succeed unless massive amounts of patient data are collected

We aren’t giving patients the most effective treatments because we just don’t know enough about what works. This extends throughout the health care system:

  • We can’t prescribe a drug tailored to the patient because we don’t collect enough data about patients and their reactions to the drug.

  • We can’t be sure drugs are safe and effective because we don’t collect data about how patients fare on those drugs.

  • We don’t see a heart attack or other crisis coming because we don’t track the vital signs of at-risk populations on a daily basis.

  • We don’t make sure patients follow through on treatment plans because we don’t track whether they take their medications and perform their exercises.

  • We don’t target people who need treatment because we don’t keep track of their risk factors.

Some institutions have adopted a holistic approach to health, but as a society there’s a huge amount more that we could do in this area. O’Reilly is hosting a conference called Strata Rx on this subject.

Leaders in the field know what health care providers could accomplish with data. A recent article even advises policy-makers to focus on the data instead of the electronic records. The question is whether providers are technically and organizationally prepped to accept it in such quantities and variety. When doctors and hospitals think they own the patients’ records, they resist putting in anything but their own notes and observations, along with lab results they order. We’ve got to change the concept of ownership, which strikes deep into their culture.

3. Reform will not succeed unless patients are in charge of their records

Doctors are currently acting in isolation, occasionally consulting with the other providers seen by their patients but rarely sharing detailed information. It falls on the patient, or a family advocate, to remember that one drug or treatment interferes with another or to remind treatment centers of follow-up plans. And any data collected by the patient remains confined to scribbled notes or (in the modern Quantified Self equivalent) a web site that’s disconnected from the official records.

Doctors don’t trust patients. They have some good reasons for this: medical records are complicated documents in which a slight rewording or typographical error can change the meaning enough to risk a life. But walling off patients from records doesn’t insulate them against errors; on the contrary, patients catch errors entered by staff all the time. So ultimately it’s better to bring the patient onto the team and educate her. If problems with records altered by patients, whether deliberately or through accidental misuse, turn up down the line, digital certificates can be deployed to sign doctor records and output from devices.
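For readers unfamiliar with the mechanism, here is a minimal sketch of signing a record so that a patient-held copy can be verified later. It uses the third-party Python cryptography package (pip install cryptography); the record text and key handling are purely illustrative, not any particular EHR’s design.

```python
# A minimal sketch of signing clinical data at the source so any later
# reader can detect tampering. Record content here is invented.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

doctor_key = ed25519.Ed25519PrivateKey.generate()   # held by the clinic
record = b"2012-08-09 BP 120/80, prescribed lisinopril 10mg"
signature = doctor_key.sign(record)

# Anyone with the doctor's public key can verify the record is unchanged.
public_key = doctor_key.public_key()
public_key.verify(signature, record)                # passes silently
try:
    public_key.verify(signature, record + b" EDITED")
except InvalidSignature:
    print("record was altered after signing")
```

With signatures attached at the source, a patient can carry, and even annotate, her records while any downstream reader can still tell which parts left the clinic unmodified.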

The amounts of data we’re talking about get really big fast. Genomic information and radiological images, in particular, can occupy dozens of gigabytes of space. But hospitals are moving to the cloud anyway. Practice Fusion just announced that they serve 150,000 medical practitioners and that “One in four doctors selecting an EHR today chooses Practice Fusion.” So we can just hand over the keys to the patients and storage will grow along with need.

The movement for patient empowerment will take off, as experts in health reform told US government representatives, when patients are in charge of their records. To treat people, doctors will have to ask for the records, and the patients can offer the full range of treatment histories, vital signs, and observations of daily living they’ve collected. Applications will arise that can search the data for patterns and relevant facts.

Once again, the US government is trying to stimulate patient empowerment by requiring doctors to open their records to patients. But most institutions meet the formal requirements by providing portals that patients can log into, the way we can view flight reservations on airlines. We need the patients to become the pilots. We also need to give them the information they need to navigate.

4. Reform will not succeed unless providers conform to practice guidelines

Now that the government is forcing doctors to release information about outcomes, patients can start to choose doctors and hospitals that offer the best chances of success. The providers will have to apply more rigor to their activities, using checklists and more, to bring up the scores of the less successful providers. Medicine is both a science and an art, but many lag on the science (that is, doing what has been statistically proven to produce the best likely outcome), even at prestigious institutions.

Patient choice is restricted by arbitrary insurance rules, unfortunately. These also contribute to the utterly crazy difficulty of determining what a medical procedure will cost, as reported by e-Patient Dave and WBUR radio. Straightening out this problem goes way beyond the doctors and hospitals, and settling on a fair, predictable cost structure will benefit them almost as much as patients and taxpayers. Even some insurers have started to see that the system is reaching a dead end and are erecting new payment mechanisms.

5. Reform will not succeed unless providers and patients can form partnerships

I’m always talking about technologies and data in my articles, but none of that constitutes health. Just as student testing is a poor model for education, data collection is a poor model for medical care. What patients want is time to talk intensively with their providers about their needs, and providers voice the same desires.

Data and good record keeping can help us use our resources more efficiently and deal with the physician shortage, partly by spreading out jobs among other clinical staff. Computer systems can’t deal with complex and overlapping syndromes, or persuade patients to adopt practices that are good for them. Relationships will always have to be in the forefront. Health IT expert Fred Trotter says, “Time is the gas that makes the relationship go, but the technology should be focused on fuel efficiency.”

Arien Malec, former contractor for the Office of the National Coordinator, used to give a speech about the evolution of medical care. Before the revolution in antibiotics, doctors had few tools to actually cure patients, but they lived with their patients in the same community and knew their needs through and through. As we’ve improved the science of medicine, we’ve lost that personal connection. Malec argued that better records could help doctors really know their patients again. But conversations are necessary too.

The risks and rewards of a health data commons

As I wrote earlier this year in an ebook on data for the public good, while the idea of data as a currency is still in its infancy, it’s important to think about where the future is taking us and our personal data.

If the Obama administration’s smart disclosure initiatives gather steam, more citizens will be able to do more than think about personal data: they’ll be able to access their financial, health, education, or energy data. In the U.S. federal government, the Blue Button initiative, which initially enabled veterans to download personal health data, is now spreading to all federal employees, and it has also been adopted at private institutions like Aetna and Kaiser Permanente. Putting health data to work stands to benefit hundreds of millions of people. The Locker Project, which provides people with the ability to move and store personal data, is another approach to watch.

The promise of more access to personal data, however, is balanced by accompanying risks. Smartphones, tablets, and flash drives, after all, are lost or stolen every day. Given the potential of mobile health (mhealth), big data, and health care information technology, researchers and policy makers alike are moving forward with their applications. As they do so, conversations and rulemaking about health care privacy will need to take into account not just data collection or retention but context and use.

Put simply, businesses must confront the ethical issues tied to massive aggregation and data analysis. Given that context, Fred Trotter’s post on who owns health data is a crucial read. As Fred highlights, the real issue is not ownership, per se, but “What rights do patients have regarding health care data that refers to them?”

Would, for instance, those rights include the ability to donate personal data to a data commons, much in the same way organs are donated now for research? That question isn’t exactly hypothetical, as the following interview with John Wilbanks highlights.

Wilbanks, a senior fellow at the Kauffman Foundation and director of the Consent to Research Project, has been an advocate for open data and open access for years, including a stint at Creative Commons; a fellowship at the World Wide Web Consortium; and experience in the academic, business, and legislative worlds. Wilbanks will be speaking at the Strata Rx Conference in October.

Our interview, lightly edited for content and clarity, follows.

Where did you start your career? Where has it taken you?

John Wilbanks: I got into all of this, in many ways, because I studied philosophy 20 years ago. What I studied inside of philosophy was semantics. In the ’90s, that was actually sort of pointless because there wasn’t much semantic stuff happening computationally.

In the late ’90s, I started playing around with biotech data, mainly because I was dating a biologist. I was sort of shocked at how the data was being represented. It wasn’t being represented in a way that was very semantic, in my opinion. I started a software company and we ran that for a while, [and then] sold it during the crash.

I went to the World Wide Web Consortium, where I spent a year helping start their Semantic Web for Life Sciences project. While I was there, Creative Commons (CC) asked me to come and start their science project because I had known a lot of those guys. When I started my company, I was at the Berkman Center at Harvard Law School, and that’s where Creative Commons emerged from, so I knew the people. I knew the policy and I had gone off and had this bioinformatics software adventure.

I spent most of the last eight years at CC working on trying to build different commons in science. We looked at open access to scientific literature, which is probably where we had the most success because that’s copyright-centric. We looked at patents. We looked at physical laboratory materials, like stem cells in mice. We looked at different legal regimes to share those things. And we looked at data. We looked at both the technology aspects and legal aspects of sharing data and making it useful.

A couple of times over those years, we almost pivoted from science to health because science is so institutional that it’s really hard for any of the individual players to create sharing systems. It’s not like software, where anyone with a PC and an Internet connection can contribute to free software, or Flickr, where anybody with a digital camera can license something under CC. Most scientists are actually restricted by their institutions. They can’t share, even if they want to.

Health kept being interesting because it was the individual patients who had a motivation to actually create something different than the system did. At the same time, we were watching and seeing the capacity of individuals to capture data about themselves exploding. So, at the same time that the capacity of the system to capture data about you exploded, your own capacity to capture data exploded.

That, to me, started taking on some of the interesting contours that make Creative Commons successful, which was that you didn’t need a large number of people. You didn’t need a very large percentage of Wikipedia users to create Wikipedia. You didn’t need a large percentage of free software users to create free software. If this capacity to generate data about your health was exploding, you didn’t need a very large percentage of those people to create an awesome data resource: you needed to create the legal and technical systems for the people who did choose to share to make that sharing useful.

Since Creative Commons is really a copyright-centric organization, I left because the power on which you’re going to build a commons of health data is going to be privacy power, not copyright power. What I do now is work on informed consent, which is the legal system you need to work with instead of copyright licenses, as well as the technologies that then store, clean, and forward user-generated data to computational health and computational disease research.

What are the major barriers to people being able to donate their data in the same way they might donate their organs?

John Wilbanks: Right now, it looks an awful lot like getting onto the Internet before there was the web. The big ISPs kind of dominated the early adopters of computer technologies. You had AOL. You had CompuServe. You had Prodigy. And they didn’t communicate with each other. You couldn’t send email from AOL to CompuServe.

What you have now depends on the kind of data. If the data that interests you is your genotype, you’re probably a 23andMe customer and you’ve got a bunch of your data at 23andMe. If you are the kind of person who has a chronic illness and likes to share information about that illness, you’re probably a customer at PatientsLikeMe. But those two systems don’t interoperate. You can’t send data from one to the other very effectively or really at all.

On top of that, the system has data about you. Your insurance company has your billing records. Your physician has your medical records. Your pharmacy has your pharmacy records. And if you do quantified self, you’ve got your own set of data streams. You’ve got your Fitbit, the data coming off of your smartphone, and your meal data.

Almost all of these are basically populating different silos. In some cases, you have the right to download certain pieces of the data. For the most part, you don’t. It’s really hard for you, as an individual, to build your own, multidimensional picture of your data, whereas it’s actually fairly easy for all of those companies to sell your data to one another. There’s not a lot of technology that lets you share.

What are some of the early signals we’re seeing about data usage moving into actual regulatory language?

John Wilbanks: The regulatory language actually makes it fairly hard to do contextual privacy waiving, in a Creative Commons sense. It’s hard to do granular permissions around privacy in the way you can do granular conditional copyright grants because you don’t have intellectual property. The only legal tool you have is a contract, and the contracts don’t have a lot of teeth.

It’s pretty hard to do anything beyond a gift. It’s more like organ donation, where you don’t get to decide where the organs go. What I’m working on is basically a donation, not a conditional gift. The regulatory environment makes it quite hard to do anything besides that.

There was a public comment period that just finished. It’s an announcement of proposed rulemaking on what’s called the Common Rule, which is the Department of Health and Human Services privacy language. It was looking to re-examine the rules around letting de-identified data or anonymized data out for widespread use. They got a bunch of comments.

There’s controversy over how de-identified data can actually be while still being useful. There is going to be, probably, a three-to-five-year process where they rewrite the Common Rule and it’ll be more modern. No one knows how modern, but it will be at least more modern when that finishes.

Then there’s another piece in the US — HIPAA — which creates a totally separate regime. In some ways, it is the same as the Common Rule, but not always. I don’t think that’s going to get opened up. The way HIPAA works is that they have 17 direct identifiers that are labeled as identifying information. If you strip those out, it’s considered de-identified.

There’s an 18th bucket, which is anything else that can reasonably identify people. It’s really hard to hit. Right now, your genome is not considered to fall under that. I would be willing to bet within a year or two, it will be.
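As a rough illustration of the mechanics Wilbanks describes, here is a toy Python sketch of that kind of rule-based stripping. The field names are hypothetical stand-ins for the enumerated identifiers; the point is that the rule itself is syntactic, while the 18th bucket requires judgment.

```python
# Toy illustration of identifier stripping: drop the enumerated fields,
# keep the rest. Field names are hypothetical stand-ins for the rule's
# 17 direct identifiers.
DIRECT_IDENTIFIERS = {
    "name", "street_address", "dates", "phone", "fax", "email", "ssn",
    "medical_record_number", "health_plan_number", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
    "biometrics", "photo",
}

def deidentify(record: dict) -> dict:
    """Syntactic stripping: easy to automate, blind to the 18th bucket."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

row = {"name": "Jane Doe", "ssn": "000-00-0000",
       "diagnosis": "rare disease", "genome": "ACGTACGTACGT"}
print(deidentify(row))
# The genome sails straight through the filter, which is exactly the
# open question about what "can reasonably identify" someone.
```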

From a regulatory perspective, you’ve got these overlapping regimes that don’t quite fit and both of them are moving targets. That creates a lot of uncertainty from an investment perspective or from an analytics perspective.

How are you thinking about a “health data commons,” in terms of weighing potential risks against potential social good?

John Wilbanks: I think that that’s a personal judgment as to the risk-benefit decision. Part of the difficulty is that the regulations are very syntactic — “This is what re-identification is” — whereas the concept of harm, benefit, or risk is actually something that’s deeply personal. If you are sick, if you have cancer or a rare disease, you have a very different idea of what risk is compared to somebody who thinks of him or herself as healthy.

What we see — and this is borne out in the Framingham Heart Study and all sorts of other longitudinal surveys — is that people’s attitudes toward risk and benefit change depending on their circumstances. Their own context really affects what they think is risky and what they think isn’t risky.

I believe that the early data donors are likely to be people for whom there isn’t a lot of risk perceived because the health system already knows that they’re sick. The health system is already denying them coverage, denying their requests for PET scans, denying their requests for access to care. That’s based on actuarial tables, not on their personal data. It’s based on their medical history.

If you’re in that group of people, then the perceived risk is actually pretty low compared to the idea that your data might actually get used or to the idea that you’re no longer passive. Even if it’s just a donation, you’re doing something outside of the system that’s accelerating the odds of getting something discovered. I think that’s the natural group.

If you think back to the numbers of users who are required to create free software or Wikipedia, to create a cultural commons, a very low percentage is needed to create a useful resource.

Depending on who you talk to, somewhere between 5% and 10% of all Americans either have a rare disease, have one in their immediate family, or have a friend with a rare disease. Each individual disease might not have very many people suffering from it, but if you net them all up, it’s a lot of people. Getting several hundred thousand to a few million people enrolled is not an outrageous idea.

When you look at the existing examples of where such commons have come together, what have been the most important concrete positive outcomes for society?

John Wilbanks: I don’t think we have really even started to see them because most people don’t have computable data about themselves. Most people, if they have any data about themselves, have scans of their medical records.

What we really know is that there’s an opportunity cost to not trying: the existing system is really inefficient, very bad at discovering drugs, and very bad at getting those drugs to market on a timely basis.

That’s one of the reasons we’re doing this as an experiment. We would like to see exactly how effective big computational approaches are on health data. The problem is that there are two ways to get there.

One is through a set of monopoly companies coming together and working together. That’s how semiconductors work. The other is through an open network approach. There’s not a lot of evidence that things besides these two approaches work. Government intervention is probably not going to work.

Obviously, I come down on the open network side. But there’s an implicit belief, I think, both in the people who are pushing the cooperating monopolies approach and the people who are pushing the open networks approach, that there’s enormous power in the big-data-driven approach. We’re just leaving that on the table right now by not having enough data aggregated.

The benefits to health that will come out will be the ability to increasingly, by looking at a multidimensional picture of a person, predict with some confidence whether or not a drug will work, or whether they’re going to get sick, or how sick they’re going to get, or what lifestyle changes they can make to mitigate an illness. Right now, basically, we really don’t know very much.

Pretty Simple Data Privacy

John Wilbanks discussed “Pretty Simple Data Privacy” during a Strata Online Conference in January 2012. His presentation begins at the 7:18 mark in the following video:

Strata Rx — Strata Rx, being held Oct. 16-17 in San Francisco, is the first conference to bring data science to the urgent issues confronting health care.


Photo: Science Commons

August 03 2012

Palo Alto looks to use open data to embrace ‘city as a platform’

In the 21st century, one of the strategies cities around the world are embracing to improve services, increase accountability and stimulate economic activity is to publish open data online. The vision for New York City as a data platform earned wider attention last year, when the Big Apple’s first chief digital officer, Rachel Sterne, pitched the idea to the public.

This week, the city of Palo Alto in California joined over a dozen cities around the United States and the globe when it launched its own open data platform. The platform includes an application programming interface (API), which enables direct access through a RESTful interface to open government data published in JSON format. Datasets can also be embedded elsewhere on the web, much as YouTube videos are.

“We’re excited to bring the value of Open Data to our community. It is a natural complement to our goal of becoming a leading digital city and a connected community,” said James Keene, Palo Alto City Manager, in a prepared statement. “By making valuable datasets easily available to our residents, we’re further removing the barriers to a more inclusive and transparent local government here in Palo Alto.”

The city initially published open datasets that include the 2010 census data, pavement condition, city tree locations, park locations, bicycle paths and hiking trails, creek water level, rainfall and utility data. Open data about Palo Alto budgets, campaign finance, government salaries, regulations, licensing, or performance — which would all offer more insight into traditional metrics for government accountability — were not part of this first release.

“We are delighted to work with a local, innovative Silicon Valley start-up,” said Dr. Jonathan Reichental, Palo Alto’s chief information officer, in a prepared statement. (Junar’s U.S. offices are in Palo Alto.) “Rather than just publishing lists of datasets, the cloud-based Junar platform has enhancement and visualization capabilities that make the data useful even before it is downloaded or consumed by a software application.”

Notably, the city chose to use Junar, a Chilean software company that raised $1.2 million in funding in May 2012. Junar provides data access in the cloud through the software-as-a-service model. There’s now a more competitive marketplace for open data platforms than has existed in years past, with a new venture-backed startup joining the space.

“The City of Palo Alto joins a group of forward-thinking organizations that are using Open Data as a foundation for more efficient delivery of services, information, and enabling innovation,” said Diego May, CEO and co-founder of Junar, in a prepared statement. “By opening data with the Junar Platform, the City of Palo Alto is exposing and sharing valuable data assets and is also empowering citizens to use and create new applications and services.”

The success or failure of Palo Alto’s push to become a more digital city might be more fairly judged in a year, when measuring downstream consumption of its open data in applications and services by citizens — or by government in increasing productivity — will be possible.

In the meantime, Reichental (who may be familiar to Radar readers as O’Reilly Media’s former CIO) provided more perspective via email on what he’s up to in Palo Alto.

What does it mean for a “city to be a platform?”

Reichental: We think of this as both a broad metaphor and a practicality. Not only do our citizens want to be plugged in to our government operations — open data being one way to achieve this among others — but we want our community and other interested parties to build capability on top of our existing data and services. Recognizing the increasing limitations of local government means you have to find creative ways to extend it and engage with those that have the skills and resources to build a rich and seamless public-private partnership.

Why launch an open data initiative now? What success stories convinced you to make the investment?

Reichental: It’s a response to our community’s desire to easily access their data and our want as a City to unleash the data for better community decision-making and solution development.

We also believe that over time an open data portal will become a standard government offering. Palo Alto wants to be ahead of the curve and create a positive model for other communities.

Seldom does a week pass when a software engineer in our community doesn’t ask me for access to a large dataset to build an app. Earlier this year, the City participated in a hackathon at Stanford University that produced a prototype web application in less than 24 hours. We provided the data. They provided the skills. The results were so impressive, we were convinced then that we should scale this model.

How much work did it take to make your data more open? Is it machine-readable? What format? What cost was involved?

Reichental: We’re experimenting with running our IT department like a start-up, so we’re moving fast. We went from vendor selection to live in just a few weeks. The data in our platform can be exported as a CSV or to a Google Spreadsheet. In addition, we provide an API for direct access to the data. The bulk of the cost was internal staff time. The actual software, which is cloud-based, was under $5000 for the first year.
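To give a sense of what direct API access to such a platform looks like from the consumer side, here is a minimal Python sketch. The endpoint, parameters, and response shape are hypothetical stand-ins, not Junar’s actual interface; consult the platform’s documentation for the real ones.

```python
# Minimal sketch of pulling an open dataset as JSON. The URL, query
# parameters, and response envelope below are invented for illustration.
import requests

ENDPOINT = "https://api.example-city.gov/v1/datasets/city-trees"  # hypothetical

resp = requests.get(ENDPOINT, params={"format": "json", "auth_key": "YOUR_KEY"})
resp.raise_for_status()

dataset = resp.json()
for row in dataset.get("data", [])[:5]:   # assumed envelope field
    print(row)
```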

What are the best examples of open data initiatives delivering sustainable services to citizens?

Reichental: Too many to mention. I really like what they’re doing in San Francisco (http://apps.sfgov.org/showcase/) but there are amazing things happening on data.gov and in New York City. Lots of other cities in the US doing neat things. The UK has done some high-quality budget accountability work.

Are you consuming your own open data?

Reichental: You bet we are.

Why does having an API matter?

Reichental: We believe the main advantage of having an API is for app development. Of course, there will be other use cases that we can’t even think of right now.

Why did you choose Junar instead of Socrata, CKAN or the OGPL from the U.S. federal government?

Reichental: We did review most of the products in the marketplace including some open source solutions. Each had merits. We ultimately decided on Junar for a 1-year commitment, as it seemed to strike the right balance of features, cost, and vision alignment.

Palo Alto has a couple of developers in it. How are you engaging them to work with your data?

Reichental: That’s quite the understatement! The buzz already in the developer community is palpable. We’ve been swamped with requests and ideas already. We think one of the first places we’ll see good usage is in the myriad of hackathons/code jams held in the area.

What are the conditions for using your data or making apps?

Reichental: Our terms and conditions are straightforward. The data can be freely used by anyone for almost any purpose, but the condition of use is that the City has no liability or relationship with the use of the data or any derivative.

You told Mashable that you’re trying to act like a “lean startup.” What does that mean, in practice?

Reichental: This initiative is a good example. Rather than spend time making the go-live product perfect, we went for speed-to-market with the minimum viable solution to get community feedback. We’ll use that feedback to quickly improve on the solution.

With the recent go-live of our redesigned public website, we launched it initially as a beta site, warts and all. We received lots of valuable feedback, made many of the suggested changes, and then cut over from the beta to production. We ended up with a better product.

Our intent is to get more useful capability out to our community and City staff in less time. We want to work as closely as we can with the community that we serve. And that’s a lot of amazing start-ups.

August 01 2012

Big data is our generation’s civil rights issue, and we don’t know it

Data doesn’t invade people’s lives. Lack of control over how it’s used does.

What’s really driving so-called big data isn’t the volume of information. It turns out big data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data.

For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.

The big data trifecta

I’d first heard this as the “three V’s of data”: Volume, Variety, and Velocity. Traditionally, getting two was easy but getting three was very, very, very expensive.

The advent of clouds, platforms like Hadoop, and the inexorable march of Moore’s Law means that now, analyzing data is trivially inexpensive. And when things become so cheap that they’re practically free, big changes happen — just look at the advent of steam power, or the copying of digital music, or the rise of home printing. Abundance replaces scarcity, and we invent new business models.

In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.

That needs repeating:

You decide what data is about the moment you define its schema.

With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, big data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected — sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
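A small sketch makes the contrast concrete. Using only Python’s standard library, the first table fixes context at write time, while the second stores raw payloads and imposes a schema only when queried; all table layouts and payloads are invented.

```python
# Schema-on-write vs. collect-first, in miniature.
import json
import sqlite3

db = sqlite3.connect(":memory:")

# Old model: the schema is designed first, so context is fixed at write
# time. Anything outside color/region/size is simply never captured.
db.execute("CREATE TABLE widget_sales (color TEXT, region TEXT, size TEXT)")

# New model: store raw events now, decide what they mean later.
db.execute("CREATE TABLE events (payload TEXT)")
db.execute("INSERT INTO events VALUES (?)",
           (json.dumps({"user": "u1", "song": "song-42", "ts": 1343779200}),))

# A "schema-less query" applies context at read time instead.
rows = db.execute("SELECT payload FROM events").fetchall()
listens = [json.loads(p) for (p,) in rows if '"song"' in p]
print(listens)
```

The flexibility is real, and so is the danger discussed next: data stored without a declared purpose can later be given one its subjects never anticipated.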

And this is a dangerous thing.

When bank managers tried to restrict loans to residents of certain areas (a practice known as redlining), Congress stepped in to stop it with the Fair Housing Act of 1968. Lawmakers were able to legislate against the discrimination, making it illegal to change loan policy based on someone’s race.

Home Owners’ Loan Corporation map showing redlining of “hazardous” districts in 1936.


“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you — right? That’s just better service.

In one case, American Express used purchase history to adjust credit limits based on where a customer shopped, despite his excellent credit record:

Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”

Some of the things white men liked in 2010, according to OKCupid.

We’re seeing the start of this slippery slope everywhere, from tailored credit-card limits like this one to car insurance based on driver profiles. In this regard, big data is a civil rights issue, but it’s one that society in general is ill-equipped to deal with.

We’re great at using taste to predict things about people. OKCupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned — some of it was a bit too controversial). They simply looked at the words one group used that other groups didn’t often use. The result was a list of “trigger” words for a particular race or gender.

Now run this backwards. If I know you like these things, or see you mention them in blog posts, on Facebook, or in tweets, then there’s a good chance I know your gender and your race, and maybe even your religion and your sexual orientation. And that I can personalize my marketing efforts towards you.

That makes it a civil rights issue.

If I collect information on the music you listen to, you might assume I will use that data in order to suggest new songs, or share it with your friends. But instead, I could use it to guess at your racial background. And then I could use that data to deny you a loan.
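Here is a deliberately crude Python sketch of that mechanism: derive “trigger” words from the usage gap between two groups, then run the list backwards to score new text. Every word list is invented, and a real system would use proper statistics over millions of profiles.

```python
# Toy version of the "trigger words" idea. Corpora are invented.
from collections import Counter

group_a_words = "tom clancy golf nascar".split() * 5 + ["coffee", "music"]
group_b_words = "soul food gospel basketball".split() * 5 + ["coffee", "music"]

a, b = Counter(group_a_words), Counter(group_b_words)

# Ratio > 1 means a word skews toward group A; add-one smoothing keeps
# unseen words from producing divide-by-zero ratios.
skew = {w: (a[w] + 1) / (b[w] + 1) for w in set(a) | set(b)}

def guess_group(text: str) -> float:
    """Score well above 1 suggests group A's vocabulary; below 1, group B's."""
    score = 1.0
    for word in text.lower().split():
        score *= skew.get(word, 1.0)
    return score

print(guess_group("weekend plans: golf then nascar"))  # well above 1
print(guess_group("gospel brunch and basketball"))     # well below 1
```

Run forwards, this is a harmless curiosity about taste; run backwards against loan or insurance decisions, it becomes the proxy discrimination described above.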

Want another example? Check out Private Data In Public Ways, something I wrote a few months ago after seeing a talk at Big Data London, which discusses how publicly available last name information can be used to generate racial boundary maps:

Screen from the Mapping London project.


This TED talk by Malte Spitz does a great job of explaining the challenges of tracking citizens today, and he speculates about whether the Berlin Wall would ever have come down if the Stasi had access to phone records in the way today’s governments do.

So how do we regulate the way data is used?

The only way to deal with this properly is to somehow link what the data is with how it can be used. I might, for example, say that my musical tastes should be used for song recommendation, but not for banking decisions.
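As a thought experiment, here is a minimal Python sketch of what use-linked data could look like: each datum carries the purposes its owner granted, and every read must declare a purpose. This only expresses the idea; as the next paragraph argues, enforcing it is the hard part.

```python
# Minimal sketch of use-based permissions. Purely illustrative: nothing
# here stops a bad actor from copying the value onward, which is why
# enforcement, not expression, is the hard problem.
from dataclasses import dataclass, field

@dataclass
class Datum:
    value: object
    allowed_uses: set = field(default_factory=set)

def read(datum: Datum, purpose: str):
    if purpose not in datum.allowed_uses:
        raise PermissionError(f"datum not licensed for {purpose!r}")
    return datum.value

music_taste = Datum("mostly bluegrass and gospel", {"song_recommendation"})
print(read(music_taste, "song_recommendation"))   # allowed

try:
    read(music_taste, "loan_underwriting")        # not licensed
except PermissionError as err:
    print(err)
```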

Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation. Or it can be done through legislation, which has about as much chance of success as regulating spam: it feels great, but it’s damned hard to enforce.

There are brilliant examples of how a quantified society can improve the way we live, love, work, and play. Big data helps detect disease outbreaks, improve how students learn, reveal political partisanship, and save hundreds of millions of dollars for commuters — to pick just four examples. These are benefits we simply can’t ignore as we try to survive on a planet bursting with people and shaken by climate and energy crises.

But governments need to balance reliance on data with checks and balances about how this reliance erodes privacy and creates civil and moral issues we haven’t thought through. It’s something that most of the electorate isn’t thinking about, and yet it affects every purchase they make.

This should be fun.

This post originally appeared on Solve for Interesting. This version has been lightly edited.

Strata Conference + Hadoop World — The O’Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World.



July 31 2012

On email privacy, Twitter’s ToS and owning your own platform

The existential challenge for the Internet and society remains that the technology platforms that constitute what many people regard as the new public square are owned by private companies. If you missed the news, Guy Adams, a journalist at the Independent newspaper in England, was suspended by Twitter after he tweeted the corporate email address of an NBC executive, Gary Zenkel. Zenkel is in charge of NBC’s Olympics coverage.

Like many other observers, I assumed that NBC had seen the tweet and filed an objection with Twitter about the email address being tweeted. The email address, after all, was shared with the exhortation to Adams’ followers to write to Zenkel about frustrations with NBC’s coverage of the Olympics, a number of which Jim Stogdill memorably expressed here at Radar and Heidi Moore compared to Wall Street’s hubris.

Today, Guy Adams published two more columns. The first shared his correspondence with Twitter, including a copy of a written statement from an NBC spokesman named Christopher McCloskey that indicated NBC’s social media department was alerted to Adams’ tweet by Twitter itself. The second column, which followed the @GuyAdams account being reinstated, indicated that NBC had withdrawn its original complaint. Adams tweeted Twitter’s statement: “we have just received an update from the complainant retracting their original request. Therefore your account has been unsuspended.”

Since the account is back up, is the case over? A tempest in a Twitter teapot? Well, not so much. I see at least three important issues here, related to electronic privacy, Twitter’s terms of service, censorship, and how many people think about social media and the Web.

Is a corporate email address private?

Washington Post media critic Erik Wemple is at a loss to explain how tweeting this corporate email address rises to the level of disclosing private information.

Can a corporate email address based upon a known nomenclature and used by tens of thousands of people be “private”? A 2010 Supreme Court ruling on privacy established that electronic messages sent on a corporate server are not private, at least from the employer. But a corporate email address itself? Hmm. Yes, the corporate email address Adams tweeted was available online prior to the tweet, if you knew how to find it in a Web search. Danny Sullivan, however, made a strong case that the email address wasn’t widely available in Google, although Adams said he was able to find it in under a minute. There’s also an argument that because an address can be guessed, it is effectively public. Jeff Jarvis and other journalists are saying it isn’t private, on the logic that because NBC’s email nomenclature is standardized, any given address can be easily deduced. I “co-signed” Reuters’ Jack Shafer’s tweet making that assertion.

The question to ask privacy experts, then, is whether a corporate email address is “private” or not.

Fred Cate, a law professor at the Indiana University Maurer School of Law, however, commented via email that “a corporate email address can be private, in the sense that a company protects it and has a legitimate interest in it not being disclosed.” Can it lose its private character due to unauthorized disclosure online? “The answer is probably and regrettably ‘it depends,’” he wrote. “It depends on the breadth of the unauthorized dissemination and the sensitivity of the information and the likely harm if more widely disclosed. An email address that has been disclosed in public blogs would seem fairly widely available, the information is hardly sensitive, and any harm can be avoided by changing the address, so the argument for privacy seems pretty weak to me.”

Danielle Citron, professor of law at the University of Maryland, argues that because Zenkel did not publish his corporate email address on NBC’s site, there’s an argument, though a weak one, that its corporate email addresses are private information only disclosed to a select audience.

“Under privacy tort common law, an unpublished home address has been deemed by courts to be private for purposes of the public disclosure of private fact tort if the publication appeared online, even though many people know the address offline,” wrote Citron in an email. “This arose in a cyber harassment case involving privacy torts. Privacy is not a binary concept; that is, one can have privacy in public. In Nader v. GM, the NY [Court of Appeals] found that GM’s zealous surveillance of Ralph Nader, including looking over his shoulder while he took out money from the bank, constituted intrusion upon his seclusion, even though he was in public. Now, the court did not find surveillance itself a privacy violation. It was the fact that the surveillance yielded information Nader would have thought no one could see, that is, how much he took out of the bank machine.”

Email is, however, a different case than home addresses, as Citron allowed. “Far fewer people know one’s home address — neighbors and friends — if a home address is unlisted, whereas email addresses are shared with countless people and there is no analogous means to keep them unpublished like home addresses and phone numbers,” Citron wrote. “These qualities may indeed make it a tough sell to suggest that the email address is private.”

Perhaps ironically, the NBC executive’s email address has now been published by many major media outlets and blogs, making it one of the most public email addresses on the planet. Hello, Streisand effect.

Did Twitter break its own Terms of Service?

Specifically, was tweeting someone’s publicly available *work* email address a violation of Twitter’s rules? To a large extent, this hinges upon the answer to the first issue, of privacy.

If a given email address is already public — and this one had been available online for over a year — one line of thinking goes that it can’t be private. Twitter’s position is that it considers a corporate email address to be private and that sharing it therefore breaks the ToS. Alex Macgillivray, Twitter’s general counsel, clarified the company’s approach to trust and safety in a post on Twitter’s blog:

We’ve seen a lot of commentary about whether we should have considered a corporate email address to be private information. There are many individuals who may use their work email address for a variety of personal reasons — and they may not. Our Trust and Safety team does not have insight into the use of every user’s email address, and we need a policy that we can implement across all of our users in every instance.

“I do not think privacy can be defined for third parties by terms of service,” wrote Cate, via email. “If Twitter wants to say that the company will treat its users’ email addresses as private, it’s fine, but I don’t think it can convincingly say that other email addresses available in public are suddenly private.”

“If the corporate email was published online previously by the company or by himself, it likely would not amount to public disclosure of private fact under tort law, and likely would not meet the strict terms of the ToS, which says nonpublic. Twitter’s policy about email addresses stems from its judgment that people should not use its service to publicize non-public email addresses, even though such an address is not a secret and countless people in communication with the person know it,” wrote Citron. “Unless Twitter says explicitly, ‘we are adopting this rule for privacy reasons,’ there are reasons that have nothing to do with privacy that might animate that decision, such as preventing fraud.”

The bottom line is that Twitter is a private company with a Terms of Service. It’s not a public utility, as Dave Winer highlighted yesterday, following up today with another argument for a distributed, open system for microblogging. Simply put, there *are* principles for use of Twitter’s platform. They’re in the Rules, Terms of Service and strictures around its API, the evolution of which was recently walked through over at the real-time report.

Ultimately, private companies are bound by the regulations of the FTC or FCC or other relevant regulatory bodies, along with their own rules, not the wishes of users. If Twitter’s users don’t like them or lose trust, their option is to stop using the service or complain loudly. I certainly agree with Jillian C. York, who argues at the EFF that the Guy Adams case demonstrates that Twitter needs a more robust appeals process.

There’s also the question of how the ToS is applied to celebrities on Twitter, who are an attraction for millions of users. In the past, Justin Bieber tweeted someone else’s personal phone number, and Spike Lee tweeted a home address, causing someone in Florida to receive death threats. Neither celebrity was suspended. In one case, @QueenOfSpain had to get a court order to see any action taken on death threats made on Twitter. Twitter’s Safety team has absolutely taken action in some cases, but it certainly might look like there’s a different standard here. The question to ask is whether tickets were filed for Lee or Bieber by the person who was personally affected. Without a ticket, there would be no suspension. Twitter has not commented on that count, under its policy of not commenting about individual users.

Own your own platform

In the wake of this move, journalists who use Twitter should think carefully about where and how they do so. Macgillivray did explain where Twitter went awry, confirming that someone on the media partnership side of the house flagged a tweet to NBC, and reaffirming the principle that Twitter does not remove content on demand:

…we want to apologize for the part of this story that we did mess up. The team working closely with NBC around our Olympics partnership did proactively identify a Tweet that was in violation of the Twitter Rules and encouraged them to file a support ticket with our Trust and Safety team to report the violation, as has now been reported publicly.

Our Trust and Safety team did not know that part of the story and acted on the report as they would any other.

As I stated earlier, we do not proactively report or remove content on behalf of other users no matter who they are. This behavior is not acceptable and undermines the trust our users have in us. We should not and cannot be in the business of proactively monitoring and flagging content, no matter who the user is — whether a business partner, celebrity or friend. As of earlier today, the account has been unsuspended, and we will actively work to ensure this does not happen again.

As I’ve written elsewhere, looking at Twitter, censorship and Internet freedom, my sense is that, of all of the major social media players, Twitter has been one of the leaders in the technology community in sticking up for its users. It has taken some notable stands, particularly its fight to make public a U.S. Justice Department subpoena for its users’ data.

“Twitter is so hands off, only stepping in to ban people in really narrow circumstances like impersonation and tweeting personal information like non-public email addresses. It also bans impersonation and harassment understood VERY NARROWLY, as credible threats of imminent physical harm,” wrote Citron. “That is Twitter’s choice. By my lights, and from conversations with their safety folks, they are very deferential to speech. Indeed, their whole policy is a ‘we are a speech platform,’ implying that what transpires there is public speech and hence subject to great latitude.”

Much of the goodwill Twitter had built up, however, may have evaporated after this week. My perspective is that this episode absolutely drives home (again) the need to own your own platform online, particularly for media entities and governments. While there is clearly enormous utility in “going where the people are” online to participate in conversations, share news and listen to learn what’s happening, that activity doesn’t come without strings attached, terms of service among them.

To be clear, I don’t plan on leaving Twitter any time soon. I do think that Macgillivray’s explanation highlights the need for the company to get its internal house in order, specifically a church-and-state separation between its trust and safety team, which makes suspension decisions, and its media partnerships team, which works with parties that might be aggrieved by what Twitter users are tweeting. If Twitter becomes a media company, a future this NBC Olympics deal suggests, such distinctions could be just as important for it as the church-and-state separation between editorial and business operations at traditional newspapers and broadcasters.

Owning your own platform does mean that a media organization could be censored by a distributed denial of service (DDoS) attack (a tactic that has been used in Russia), and that it must register a domain name, set up Web hosting and install a content management system. But the barrier to entry on all three counts has fallen radically.
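How low has that barrier fallen? As one illustration, here is a minimal sketch in Python of static self-publishing: it turns a folder of plain-text posts into HTML pages and serves them. The folder layout and page template are assumptions for the example, not a recommendation of any particular stack; a registered domain pointed at any commodity host serving these files, or a full CMS such as WordPress, fills the same role.

    #!/usr/bin/env python3
    """Minimal self-publishing sketch: plain-text posts in, static HTML out."""

    import html
    import pathlib
    from functools import partial
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    POSTS = pathlib.Path("posts")  # one .txt file per post (hypothetical layout)
    SITE = pathlib.Path("site")    # generated HTML pages land here

    def build():
        """Wrap each text post in a bare-bones HTML page."""
        SITE.mkdir(exist_ok=True)
        for post in sorted(POSTS.glob("*.txt")):
            body = html.escape(post.read_text(encoding="utf-8"))
            page = ("<!doctype html><html><head><meta charset='utf-8'>"
                    f"<title>{post.stem}</title></head><body>"
                    f"<h1>{post.stem}</h1><pre>{body}</pre></body></html>")
            (SITE / f"{post.stem}.html").write_text(page, encoding="utf-8")

    if __name__ == "__main__":
        build()
        # Serve the generated pages locally; any static Web host would
        # serve the same files the same way.
        handler = partial(SimpleHTTPRequestHandler, directory=str(SITE))
        print("Serving on http://localhost:8000")
        HTTPServer(("", 8000), handler).serve_forever()

None of this is production hardening, of course; the point is only that publishing under a domain you control is now a commodity exercise.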

The open Internet and World Wide Web, fragile and insecure as they may seem at times, remain the surest way to publish what you want and have it stay online, accessible to the networked world. When you own your own platform, it’s much harder for a third-party company nervous about the reaction of advertisers or media partners to take your voice away.

July 30 2012

Mobile participatory budgeting helps raise tax revenues in Congo

In a world awash in data, connected by social networks and focused on the next big thing, stories about genuine innovation get buried behind the newest shiny app or global development initiative. For billions of people around the world, the reality is that unequal access to resources, education, clean water or functional local government remains a serious concern.

South Kivu, a province in the eastern Democratic Republic of Congo, has been devastated by the wars that have ravaged the region over the past decade.

Despite that grim context, a pilot program has borne unexpected fruit. Mobile technology, civic participation, smarter governance and systems thinking have combined not only to give citizens more of a voice in their government but also to increase tax revenues. Sometimes, positive change happens where one might least expect it. The video below tells the story. After the jump, World Bank experts talk about the story behind the story.

“Beyond creating a more inclusive environment, the beauty of the project in South Kivu is that citizen participation translates into demonstrated and measurable results on mobilizing more public funds for services for the poor,” said Boris Weber, team leader for ICT4Gov at the World Bank Institute for Open Government, in an interview in Washington. “This makes a strong case when we ask ourselves where the return on investment of open government approaches is.”

Gathering support

The World Bank acted as a convener in this context, said Tiago Peixoto, an open government specialist at the Bank, in an interview. It brought together the provincial government and local governments to identify governance issues and propose strategies to address them.

The challenge was straightforward: the South Kivu provincial government needed to relay revenues to the lower levels of government to fund services but wasn’t doing so, both because of a lack of incentives and because of concerns about how the funds would be spent.

What came out of a four-day meeting was a request for the World Bank to conduct a feasibility study on participatory budgeting, said Peixoto.

Initially, the Bank found good conditions, including a strong civil society, despite years of war. It brought in a participatory budgeting expert from Cameroon, who ran workshops with local governments on how the process would work. Some cities were chosen as control groups to introduce scientific rigor into the evaluation.

They shared scholarship on participatory budgeting with all of the stakeholders, emphasizing that research shows participation is more effective than penalties at improving tax compliance.

“It’s like the process of ownership,” said Peixoto in our interview. “Once you see where money is going, you see how government can work. When you see a wish list, where some things happen and others do not because people aren’t paying, it changes perspectives.”
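To make the mobile side concrete, here is a minimal sketch of the vote aggregation a system like this performs once budget preferences arrive by SMS. The message format and the priority categories are hypothetical illustrations for the example; the actual South Kivu pilot’s formats are not documented in this piece.

    from collections import Counter

    # Hypothetical format: each SMS body names one budget priority.
    # The real pilot's categories and keywords may differ; this only
    # illustrates aggregating preferences sent from basic phones.
    VALID_PRIORITIES = {"WATER", "ROADS", "CLINIC", "SCHOOL"}

    def tally_votes(messages):
        """Normalize incoming SMS bodies and count recognized votes."""
        votes = Counter(m.strip().upper() for m in messages)
        return {p: n for p, n in votes.items() if p in VALID_PRIORITIES}

    if __name__ == "__main__":
        incoming_sms = ["WATER", "roads", "Water", "CLINIC", "water", "ROADS"]
        for priority, count in sorted(tally_votes(incoming_sms).items(),
                                      key=lambda item: -item[1]):
            print(f"{priority}: {count} vote(s)")

The interesting design work in practice is everything around this core: publishing the tally back to citizens so they can see where the money goes, which is exactly the ownership effect Peixoto describes above.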

Hitting the books

When asked to provide more context on the scholarship in this area, Peixoto obliged, via email.

“As shown in a cross-national analysis by Torgler & Schneider (2009), citizens are more willing to pay taxes when they perceive that their preferences are properly taken into account by public institutions,” he wrote.

“Along these lines, the existing evidence suggests the existence of a causal relationship between citizen participation processes and levels of tax compliance. For instance, studies show that Swiss cantons with higher levels of democratic participation present lower tax evasion rates (Pommerehne & Weck-Hannemann 1996, Pommerehne & Frey 1992, Frey 1997). This effect is particularly strong when it comes to direct citizen participation in budgetary decisions, i.e. fiscal referendum (Frey & Feld 2002, Frey et al. 2004, Torgler 2005):

“The fiscal exchange relationship between taxpayers and the state also depends on the politico-economic framework within which the government acts. It has, in particular, been argued that the extent of citizens’ political participation rights systematically affects the kind of tax policy pursued by the government and its tax authority. (…) The more direct democratic the political decision-making procedures of a canton are, the lower is tax evasion according to these studies” (Feld & Frey 2005:29)

“According to his (Torgler) estimates, tax morale is significantly higher in direct democratic cantons. Distinguishing between different instruments of direct democracy, he finds that the fiscal referendum has the highest p