April 12 2012

Four short links: 12 April 2012

  1. Big Data in Finance (PDF, 9M) -- Algo trading systems have begun to resemble an arms race. Competition, data, and the race for real-time.
  2. A Parent's Guide to 21st Century Learning (Edutopia, free registration required to download) -- What should collaboration, creativity, communication, and critical thinking look like in a modern classroom? How can parents help educators accomplish their goals? We hope this guide helps bring more parents into the conversation about improving education. (via Derek Wenmoth)
  3. Chess Intelligence and Winning -- survey of IQ gaps between contestants needed to win competitions. We could view cops and killers as being involved in a grim contest. In the USA around 65% of all murders are solved. That converts to an average “murder” ELO rating difference between police and murderers of 108 ELO points. It is also known that the mean IQs of murderers and policemen are 87 and 102, respectively. So if successfully solving murders is a puzzle, then the “a” coefficient is 0.041, and each IQ point difference is worth 7.2 ELO points. I suspect this is masturbatory math extrapolation rather than anything significant or predictive, but the cops-vs-robbers IQ contest was an interesting angle. (via Dr Data's Blog)
  4. Etsy Hacker Grants: Supporting Women in Technology -- Today, in conjunction with Hacker School, Etsy is announcing a new scholarship and sponsorship program for women in technology: we’ll be hosting the summer 2012 session of Hacker School in the Etsy headquarters, and we’re providing ten Etsy Hacker Grants of $5,000 each — a total of $50,000 — to women who want to join but need financial support to do so. Our goal is to bring 20 women to New York to participate, and we hope this will be the first of many steps to encourage more women into engineering at Etsy and across the industry.
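The arithmetic quoted in item 3 can be checked in a few lines. This is only a sketch, and it assumes the article used the standard Elo expected-score curve (a logistic curve with a 400-point scale); the function name `elo_gap` is made up for illustration. The "a" coefficient is not reproduced, because the quote doesn't say how it is defined.

```python
import math

def elo_gap(win_rate: float) -> float:
    """Rating difference implied by a win rate under the standard Elo curve."""
    return 400 * math.log10(win_rate / (1 - win_rate))

gap = elo_gap(0.65)        # 65% of murders solved, per the quote
iq_gap = 102 - 87          # mean IQs quoted for policemen and murderers

print(round(gap))              # 108
print(round(gap / iq_gap, 1))  # 7.2
```

So the 108 and 7.2 figures are at least internally consistent; whether solve rates behave like chess results is another matter.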

April 10 2012

Four short links: 10 April 2012

  1. The Instagram Architecture (High Scalability) -- great summary of the Instagram team's post about the technology that runs Instagram. Lots of Python goodness in here.
  2. Mosh -- ssh that lets you roam and stay connected. UTF-8 native.
  3. Android Economics -- working back from Google's declared valuation of Android royalties to figure out how much they have and how it's growing. Error bars for Africa here, but can't argue with the conclusion: Whereas Android generates $1.70/device/year and thus an Android device with a two year life generates about $3.5 to Google over its life, Apple obtained $576.3 for each iOS device it sold in 2011.
  4. UK Govt Digital Service's Design Principles -- if only everything in government followed Principle 1: Start with Needs (User Needs not Government Needs).

Four short links: 9 April 2012

  1. E-Reading/E-Books Data (Luke Wroblewski) -- This past January, paperbacks outsold e-books by less than 6 million units; if e-book market growth continues, it will have far outpaced paperbacks to become the number-one category for U.S. publishers. Combine that with only 21% of American adults having read an ebook, the signs are there that readers of ebooks buy many more books.
  2. Web 2.0 Ends with Data Monopolies (Bryce Roberts) -- in the context of Google Googles: So you’re able to track every website someone sees, every conversation they have, every Ukulele book they purchase and you’re not thinking about business models, eh? Bryce is looking at online businesses as increasingly about exclusive access to data. This is all to feed the advertising behemoth.
  3. Building and Implementing Single Sign On -- nice run-through of the system changes and APIs they built for single-sign on.
  4. How Big are Porn Sites (ExtremeTech) -- porn sites cope with astronomical amounts of data. The only sites that really come close in terms of raw bandwidth are YouTube or Hulu, but even then YouPorn is something like six times larger than Hulu.

April 09 2012

Operations, machine learning and premature babies

Julie Steele and I recently had lunch with Etsy's John Allspaw and Kellan Elliott-McCrea. I'm not sure how we got there, but we made a connection that was (to me) astonishing between web operations and medical care for premature infants.

I've written several times about IBM's work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That's amazing in itself, but what's more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you'd intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM's Vice President of Big Data, the telltale signal wasn't spikes or irregularities, but the opposite. There's a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn't exhibit the variation. Their heart rate was too normal; it didn't change throughout the day as much as it should.

That observation strikes me as revolutionary. It's easy to detect problems when something goes out of spec: If you have a fever, you know you're sick. But how do you detect problems that don't set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?

In our conversation, we started wondering how this applied to web operations. We have gigabytes of data streaming off of our servers, but the state of system and network monitoring hasn't changed in years. We look for parameters that are out of spec, thresholds that are crossed. And that's good for a lot of problems: You need to know if the number of packets coming into an interface suddenly goes to zero. But what if the symptom we should look for is radically different? What if crossing a threshold isn't what indicates trouble, but the disappearance (or diminution) of some regular pattern? Is it possible that our computing infrastructure also exhibits symptoms that are too subtle for a human to notice but would easily be detectable via machine learning?
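To make the "disappearance of a pattern" idea concrete, here is a minimal sketch of an alarm that fires when a metric becomes too steady instead of too extreme. Everything in it is an assumption for illustration: the function name, the window size, the spread threshold, and the sample numbers. It is not how IBM's system works, and it skips the hard part of learning which pattern matters.

```python
from statistics import pstdev

def flatline_windows(readings, window, min_stdev):
    """Return the start index of each window whose spread is below min_stdev."""
    flagged = []
    for start in range(0, len(readings) - window + 1, window):
        chunk = readings[start:start + window]
        if pstdev(chunk) < min_stdev:
            flagged.append(start)
    return flagged

# Hypothetical requests-per-second samples: lively at first, then oddly steady.
samples = [120, 135, 110, 142, 118, 131, 125, 125, 126, 125, 125, 126]

print(flatline_windows(samples, window=6, min_stdev=2.0))  # [6]
```

Every value in the second window would pass an ordinary threshold check; only the lack of variation gives it away.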

We talked a bit about whether it was possible to alarm on the first (and second) derivatives of some key parameters, and of course it is. Doing so would require more sophistication than our current monitoring systems have, but it's not too hard to imagine. But it also misses the point. Once you know what to look for, it's relatively easy to figure out how to detect it. IBM's insight wasn't detecting the patterns that indicated a baby was about to become sick, but using machine learning to figure out what the patterns were. Can we do the same? It's not inconceivable, though it wouldn't be easy.
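For what it's worth, alarming on derivatives really is the easy part. The sketch below uses plain finite differences on evenly spaced samples; the function names, limits, and queue-depth numbers are hypothetical, and a real monitor would need smoothing to cope with noise.

```python
def differences(values):
    """Finite differences between consecutive, evenly spaced samples."""
    return [b - a for a, b in zip(values, values[1:])]

def derivative_alarms(values, max_rate, max_accel):
    """Return (sample index, kind, value) for each derivative over its limit."""
    rate = differences(values)    # first derivative, per sample
    accel = differences(rate)     # second derivative, per sample
    alarms = []
    for i, r in enumerate(rate):
        if abs(r) > max_rate:
            alarms.append((i + 1, "rate", r))
    for i, a in enumerate(accel):
        if abs(a) > max_accel:
            alarms.append((i + 2, "accel", a))
    return sorted(alarms)

# Hypothetical queue depths that start to run away.
queue_depth = [10, 11, 12, 14, 18, 26, 42]

print(derivative_alarms(queue_depth, max_rate=10, max_accel=3))
# [(5, 'accel', 4), (6, 'accel', 8), (6, 'rate', 16)]
```

In this made-up series the second derivative trips one sample before the first, which is the appeal. But someone still had to decide that queue depth was the parameter worth watching.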

Web operations has been on the forefront of "big data" since the beginning. Long before we were talking about sentiment analysis or recommendations engines, webmasters and system administrators were analyzing problems by looking through gigabytes of server and system logs, using tools that were primitive or non-existent. MRTG and HP's OpenView were early attempts to put together information dashboards for IT groups. But at most enterprises, operations hasn't taken the next step. Operations staff doesn't have the resources (neither computational nor human) to apply machine intelligence to our problems. We'd have to capture all the data coming off of our servers for extended periods, not just the server logs that we capture now, but any and every kind of data we can collect: network data, environmental data, I/O subsystem data, you name it. At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20


April 06 2012

Top Stories: April 2-6, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Privacy, contexts and Girls Around Me
The application of user data is pushing at the edges of cultural norms. That can be a positive, but finding "the line" requires adherence to a few simple and clear guidelines.

Data as seeds of content
Visualizations are one way to make sense of data, but they aren't the only way. Robbie Allen reveals six additional outputs that help users derive meaningful insights from data.


State of the Computer Book Market 2011
In his annual report, Mike Hendrickson analyzes tech book sales and industry data: Part 1, Overall Market; Part 2, The Categories; Part 3, The Publishers; Part 4, The Languages. (Part 5 is coming next week.)

The do's and don'ts of geo marketing
During his session at this week's Where Conference, Placecast CEO Alistair Goodman examined the layers of context that make for rich, geo-targeted messages.



Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference, May 29 - 31 in San Francisco. Save 20% on registration with the code RADAR20.

Fence photo: Fence Friday by DayTripper (Tom), on Flickr

April 05 2012

Editorial Radar with Mike Loukides & Mike Hendrickson

Mike Loukides and Mike Hendrickson, two of O'Reilly Media's editors, sat down recently to talk about what's on their editorial radars. Mike and Mike have almost 50 years of combined technical book publishing experience and I always enjoy listening to their insight.

In this session, they discuss what they see in the tech space including:

  • How 3D Printing and personal manufacturing will revolutionize the way business is conducted in the U.S. [Discussed at the 00:43 mark]
  • The rise of mobile and device sensors and how intelligence will be added to all sorts of devices. [Discussed at the 02:15 mark]
  • Clear winners in today's code space: JavaScript. With Node.js, D3, HTML5, JavaScript is stepping up to the plate. [Discussed at the 04:12 mark]
  • A discussion on the best first language to teach programming and how we need to provide learners with instruction for the things they want to do. [Discussed at the 06:03 mark]

You can view the entire interview in the following video.

Next month, Mike and Mike will be talking about functional languages.

April 04 2012

Privacy, contexts and Girls Around Me

Last weekend, I read two excellent articles on the problems that privacy presents in a mobile, digital age. The Atlantic presented a summary of Helen Nissenbaum's thoughts on privacy and social norms: When we discuss the use of online privacy, we too often forget the social context in which data exists, even when we're talking about social media. And Amit Runchal posted a TechCrunch article about the Girls Around Me fiasco, "Creating Victims and Blaming Them," where he points out that the victims of a service like Girls Around Me shouldn't be blamed for not understanding the arcane privacy settings of services like Facebook:

"But ... the women signed up to be a part of this when they signed up to be on Facebook. No. What they signed up for was to be on Facebook. Our identities change depending on our context, no matter what permissions we have given to the Big Blue Eye. Denying us the right to this creates victims who then get blamed for it. 'Well ... you shouldn't have been on Facebook if you didn't want to...' No. Please recognize them as a person. Please recognize what that means."

Runchal's powerful "no" underscores the problem: People sign up with Facebook and Foursquare (which quickly blocked Girls Around Me's access to their API) to communicate with friends, to play games, to find former classmates, and so on. They don't sign up to have their data sold to the highest bidder. And while Facebook and Foursquare have a legitimate right to run a profitable business, their users have a legitimate right to be treated with some respect, and it's hard to construe hundreds of inscrutable privacy settings as "respect." Even if you understand the settings, it's next to impossible to block apps that you don't even know about. Perhaps the only way to protect yourself is a complete retreat into privacy, which defeats the purpose of Facebook.

Runchal's article demonstrates the principles for which Nissenbaum is arguing. Privacy and data don't exist in the abstract. Privacy and data always exist in social contexts, and problems occur when data is taken out of that context. Users give data to Facebook all the time; that's normal, and the service couldn't exist without that happening. Hundreds of millions of people use and enjoy Facebook, so the company is clearly doing a lot of things right. However, handing that same data to another application rips it out of context: Facebook data on its own might be fine, Facebook data crossed with location data from Foursquare is getting fishy (almost any use of location data quickly becomes "fishy"), and that combination published via an app that's designed to encourage stalking has crossed the line. Nissenbaum has articulated the general principle; Runchal has provided an excellent case study.

In a similar vein, Tim O'Reilly has argued that we should regulate the use of data, and expect data collectors to obey cultural norms about reasonable and unreasonable uses of data. A doctor could share your medical history with researchers, but not with an insurance company that might use it to cancel your policy. That's the only way to get the medical progress that comes from sharing data without the chilling side effect of making medical care inaccessible to anyone who actually needs it. Tim has defended Facebook for being willing to push the limits of privacy because that's the only way to find out what the new norms should be and what benefits we can derive from new applications. That's fair enough, and in this case (as I already pointed out), Foursquare was quick to yank API access.

It's useful to imagine the same software with a slightly different configuration. Girls Around Me has undeniably crossed a line. But what if, instead of finding women, the app was Hackers Around Me? That might be borderline creepy, but most people could live with it, and it might even lead to some wonderful impromptu hackathons. EMTs Around Me could save lives. I doubt that you'd need to change a single line of code to implement either of these apps, just some search strings. The problem isn't the software itself, nor is it the victims, but what happens when you move data from one context into another. Moving data about EMTs into a context where EMTs are needed is socially acceptable; moving data into a context that facilitates stalking isn't acceptable, and shouldn't be.

The Atlantic's article about Nissenbaum ends with some pessimism about our ability to define social norms surrounding privacy: "It's quite difficult to figure out what the norms for a given situation might be." And that's true. We don't yet know what cultural norms for privacy are, let alone how to regulate for them, or how regulations should evolve as technology evolves and cultural norms change. Locking in our present norms through some badly thought out regulation strikes me as a recipe for disaster. I care much more about the TSA's scanners at an airport than about Google photographing my house for Street View, but I'd be ecstatically surprised to see legislation that reflected my priorities. The New York Times reports that cell phone tracking is routinely used by local law enforcement agencies, with little or no court oversight; and in the current climate, I'd be surprised to see privacy regulation that challenges the widespread use and abuse of surveillance by the police.

But this isn't the time to throw up our hands. It isn't as if we're completely lacking in clue. With that in mind, I'll give Amit Runchal the last word:

"The line is this: When you begin speaking for another person without their permission, you are doing something wrong. When you create another identity for them without their permission, you are doing something wrong. When you make people feel victimized who previously did not feel that way, you are doing something wrong."

Those are words I can live by.

April 03 2012

Data's next steps

Steve O'Grady (@sogrady), a developer-focused analyst from RedMonk, views large-scale data collection and aggregation as a problem that has largely been solved. The tools and techniques required for the Googles and Facebooks of the world to handle what he calls "datasets of extraordinary sizes" have matured. In O'Grady's analysis, what hasn't matured are methods for teasing meaning out of this data that are accessible to "ordinary users."

Among the other highlights from our interview:

  • O'Grady on the challenge of big data: "Kevin Weil (@kevinweil) from Twitter put it pretty well, saying that it's hard to ask the right question. One of the implications of that statement is that even if we had perfect access to perfect data, it's very difficult to determine what you would want to ask, how you would want to ask it. More importantly, once you get that answer, what are the questions that derive from that?"
  • O'Grady on the scarcity of data scientists: "The difficulty for basically every business on the planet is that there just aren't many of these people. This is, at present anyhow, a relatively rare skill set and therefore one that the market tends to place a pretty hefty premium on."
  • O'Grady on the reasons for using NoSQL: "If you are going down the NoSQL route for the sake of going down the NoSQL route, that's the wrong way to do things. You're likely to end up with a solution that may not even improve things. It may actively harm your production process moving forward because you didn't implement it for the right reasons in the first place."

The full interview is embedded below and available here. For the entire interview transcript, click here.

March 21 2012

Four short links: 21 March 2012

  1. S0rce -- gorgeous infographics. They purport to let you Think for Yourself which is bald-faced bullshit: the choice of which data to present, and the invisible collection and curation practices behind the data, is the choice of what story to tell and what it will say. That said, it's wonderful to see the numbers (and they are attributed) behind the Republican Primary and Copyright and Piracy Legislation.
  2. Modern HTTP Servers are Fast -- I remember when the best web engineering in the world would still fall over if a box got more than 10 hits/second. Yes, yes, I'm writing this on my grandpa box. Check out the hardware specs of the box these numbers are from.
  3. MIT App Inventor -- web-based app designer. Does not appear to be open source. There is no long-term sustainability for this kind of development environment: when MIT decide "nah screw it, not going to run this any more" or "hmm, maybe we'll charge for it", you're boned--you can download the "source" to your app in a zip file but AppInventor is the only dev environment which can consume it. I hope it'll become the awesome and easy dev environment that Android needs, but I hope they prevent it from being a dead end.
  4. Daily Deals: Prediction, Social Diffusion, and Reputational Ramifications -- we consider the effects of daily deals on the longer-term reputation of merchants, based on their Yelp reviews before and after they run a daily deal. Our analysis shows that while the number of reviews increases significantly due to daily deals, average rating scores from reviewers who mention daily deals are 10% lower than scores of their peers on average. (via Greg Linden)

March 20 2012

The unreasonable necessity of subject experts

One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition "In data science, domain expertise is more important than machine learning skill." If you weren't there, Mike Driscoll's summary is an excellent overview (full video of the debate is available here). To make the story short, the "cons" won; the audience was won over to the side that machine learning is more important. That's not surprising, given that we've all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding "good" pictures on Facebook, he ran a data mining contest at Kaggle.

The "Data Science Debate" panel at Strata California 2012. Watch the debate.

A good impromptu debate necessarily raises as many questions as it answers. Here's the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article, "The End of Theory," asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you've gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they're often closely coupled. Often, the only way to know you've put garbage in is that you've gotten garbage out.

By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. "Stupid Data Miner Tricks" is a hilarious send-up of the problems of data mining: It shows how to "predict" the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.
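It's easy to reproduce the flavor of that trick. The sketch below is not the method from "Stupid Data Miner Tricks"; it's a made-up illustration that generates one short random series to play the "target," then searches a thousand equally random, unrelated series for the best match. The series length, candidate count, and seed are arbitrary assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def random_walk(rng, steps):
    """A series of pure noise that drifts the way an index or a population does."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

rng = random.Random(0)                  # fixed seed so the run is repeatable
target = random_walk(rng, 10)           # ten "annual" values to be "predicted"
candidates = [random_walk(rng, 10) for _ in range(1000)]

best = max(abs(pearson(target, c)) for c in candidates)
print(f"best |r| among 1000 unrelated series: {best:.2f}")
```

With this few data points and this many candidates, some series will track the target closely by chance alone. Call it butter production and you have a forecast.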

Cherry picking and overfitting have particularly bad "smells" that are often fairly obvious: The Democrats never lose a Presidential election in a year when the Yankees win the World Series, for example. (Hmmm. The 2000 election was rather fishy.) Any reasonably experienced data scientist should be able to stay out of trouble, but what if you treat your data with care and it still spits out an unexpected result? Or an expected result that's too good to be true? After the data crunching has been done, it's the subject expert's job to ensure that your results are good, meaningful, and well-understood.

Let's say you're an audio equipment seller analyzing a lot of purchase data and you find out that people buy more orange juice just before replacing their home audio system. It's an unlikely, absurd (and completely made up) result, but stranger things have happened. I'd probably go and build an audio gear marketing campaign targeting bulk purchasers of orange juice. Sales would probably go up; data is "unreasonably effective," even if you don't know why. This is precisely where things get interesting, and precisely where I think subject matter expertise becomes important: after the fact. Data breeds data, and it's naive to think that marketing audio gear to OJ addicts wouldn't breed more datasets and more analysis. It's naive to think the OJ data wouldn't be used in combination with other datasets to produce second-, third-, and fourth-order results. That's when the unreasonable effectiveness of data isn't enough; that's when it's important to understand the results in ways that go beyond what data analysis alone can currently give us. We may have a useful result that we don't understand, but is it meaningful to combine that result with other results that we may (or may not) understand?



Let's look at a more realistic scenario. Pete Warden's Kaggle-based algorithm for finding quality pictures works well, despite giving the surprising result that pictures with "Michigan" in the caption are significantly better than average. (As are pictures from Peru, and pictures taken of tombs.) Why Michigan? Your guess is as good as mine. For Warden's application, building photo albums on the fly for his company Jetpac, that's fine. But if you're building a more complex system that plans vacations for photographers, you'd better know more than that. Why are the photographs good? Is Michigan a destination for birders? Is it a destination for people who like tombs? Is it a destination with artifacts from ancient civilizations? Or would you be better off recommending a trip to Peru?



Another realistic scenario: Target recently used purchase histories to target pregnant women with ads for baby-related products, with surprising success. I won't rehash that story. From that starting point, you can go a lot further. Pregnancies frequently lead to new car purchases. New car purchases lead to new insurance premiums, and I expect data will show that women with babies are safer drivers. At each step, you're compounding data with more data. It would certainly be nice to know you understood what was happening at each step of the way before offering a teenage driver a low insurance premium just because she thought a large black handbag (that happened to be appropriate for storing diapers) looked cool.



There's a limit to the value you can derive from correct but inexplicable results. (Whatever else one may say about the Target case, it looks like they made sure they understood the results.) It takes a subject matter expert to make the leap from correct results to understood results. In an email, Pete Warden said:

"My biggest worry is that we're making important decisions based on black-box algorithms that may have hidden and problematic biases. If we're deciding who to give a mortgage based on machine learning, and the system consistently turns down black people, how do we even notice it, let alone fix it, unless we understand what the rules are? A real-world case is trading systems. If you have a mass of tangled and inexplicable logic driving trades, how do you assign blame when something like the Flash Crash happens?

"For decades, we've had computer systems we don't understand making decisions for us, but at least when something went wrong we could go in afterward and figure out what the causes were. More and more, we're going to be left shrugging our shoulders when someone asks us for an explanation."

That's why you need subject matter experts to understand your results, rather than simply accepting them at face value. It's easy to imagine that subject matter expertise requires hiring a PhD in some arcane discipline. For many applications, though, it's much more effective to develop your own expertise. In an email exchange, DJ Patil (@dpatil) said that people often become subject experts just by playing with the data. As an undergrad, he had to analyze a dataset about sardine populations off the coast of California. Trying to understand some anomalies led him to ask questions about coastal currents, why biologists only count sardines at certain stages in their life cycle, and more. Patil said:

"... this is what makes an awesome data scientist. They use data to have a conversation. This way they learn and bring other data elements together, create tests, challenge hypothesis, and iterate."

By asking questions of the data, and using those questions to ask more questions, Patil became an expert in an esoteric branch of marine biology, and in the process greatly increased the value of his results.

When subject expertise really isn't available, it's possible to create a workaround through clever application design. One of my takeaways from Patil's "Data Jujitsu" talk was the clever way LinkedIn "crowdsourced" subject matter expertise to their membership. Rather than sending job recommendations directly to a member, they'd send them to a friend, and ask the friend to pass along any they thought appropriate. This trick doesn't solve problems with hidden biases, and it doesn't give LinkedIn insight into why any given recommendation is appropriate, but it does an effective job of filtering inappropriate recommendations.

Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes "unreasonably effective" through the conversation that takes place after the numbers have been crunched. At his Strata keynote, Avinash Kaushik (@avinash) revisited Donald Rumsfeld's statement about known knowns, known unknowns, and unknown unknowns, and argued that the "unknown unknowns" are where the most interesting and important results lie. That's the territory we're entering here: data-driven results we would never have expected. We can only take our inexplicable results at face value if we're just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they're based. And that's the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

March 14 2012

March 09 2012

OK, I Admit It. I have a mancrush on the new Federal CTO, Todd Park

I couldn't be more delighted by the announcement today that Todd Park has been named the new Chief Technology Officer for the United States, replacing Aneesh Chopra.

I first met Todd in 2008 at the urging of Mitch Kapor, who thought that Todd was the best exemplar in the healthcare world of my ideas about the power of data to transform business and society, and that I would find him to be a kindred spirit. And so it was. My lunch with Todd turned into a multi-hour brainstorm as we walked around the cliffs of Lands End in San Francisco. Todd was on fire with ideas about how to change healthcare, and the opportunity of the new job he'd just accepted, to become the CTO at HHS.

Subsequently, I helped Todd to organize a series of workshops and conferences at HHS to plan and execute their open data strategy. I met with Todd and told him how important it was not just to make data public and hope developers would come, but to actually do developer evangelism. I told him how various tech companies ran their developer programs, including some stories about Amazon's rollout of AWS: they had first held a small, private event to which they invited people and companies who'd been unofficially hacking on their data, told them their plans, and recruited them to build apps against the new APIs that were planned. Then, when they made their public announcement, they had cool apps to show, not just good intentions.

Todd immediately grasped the blueprint, and executed with astonishing speed. Before long, he held a workshop for an invited group of developers, entrepreneurs and health data wonks to map out useful data that could be liberated, and useful applications that could be built with it. Six months later, he held a public conference to showcase the 40-odd applications that had been developed. Now in its third year, the event has grown into what Todd calls the Health Datapalooza. As noted on GigaOm, the event has already led to several venture-backed startups. (Applications are open for startups to be showcased at this year's event, June 5-6 in Washington D.C.)

Since I introduced him to Eric Ries, author of The Lean Startup, Todd has been introducing the methodology to Washington, insisting on programs that can show real results (learning and pivots) in only 90 days. He just knows how to make stuff happen.

Todd is also an incredibly inspiring speaker. At my various Gov 2.0 events, he routinely got a standing ovation. His enthusiasm, insight, and optimism are infectious.

Todd Park

When Todd Park talks, I listen. (Photo by James Duncan Davidson from the 2010 Gov 2.0 Summit. http://www.flickr.com/photos/oreillyconf/4967787323/in/photostream/)

Many will ask about Todd's technical credentials. After all, he is trained as a healthcare economist, not an engineer or scientist. There are three good answers:

1. Economists are playing an incredibly important role at today's technology companies, as extracting meaning and monetization from massive amounts of data becomes one of the key levers of success and competitive advantage. (Think Hal Varian at Google, working to optimize the ad auction.) Healthcare in particular is one of those areas where science, human factors, and economics are on a collision course, but virtually every sector of our nation is undergoing a transformation as a result of intelligence derived from data analysis. That's why I put Todd on my list for Forbes.com of the world's most important data scientists.

2. Todd is an enormously successful technology entrepreneur, with two brilliant companies - Athenahealth and Castlight Health - under his belt. In each case, he was able to succeed by understanding the power of data to transform an industry.

3. He's an amazing learner. In a 1998 interview describing the founding of Athenahealth, he described his leadership philosophy: "Put enough of an idea together to inspire a team of really good people to jump with you into a general zone like medical practices. Then, just learn as much as you possibly can and what you really can do to be helpful and then act against that opportunity. No question."

Todd is one of the most remarkable people I've ever met, in a career filled with remarkable people. As Alex Howard notes, he should be an inspiration for more "retired" tech entrepreneurs to go into government. This is a guy who could do literally anything he put his mind to, and he's taking up the challenge of making our government smarter about technology. I want to put out a request to all my friends in the technology world: if Todd calls you and asks you for help, please take the call, and do whatever he asks.

Visualization of the Week: Kids Count in Washington, D.C.

At the recent DC Data Without Borders Datadive, a group came together to build a project for DC Action for Children, an advocacy group looking to improve the lives of the youngest citizens in Washington, D.C. The team, made up of Jason Hoekstra, Sisi Wei, and Jerzy Wieczorek, created a data visualization that shows detailed information about neighborhoods and schools in the DC area.

The visualization includes information about average family income, number of police stations, number of libraries, number of child care facilities, and percentage of families living beneath the poverty level. At the school level, the visualization also shows the percentage of students who receive free and reduced school lunches as well as how well students perform in math and reading compared to other DC schools.

DC Action for Children visualization
Screenshot from the D.C. Kids Count visualization. See the full interactive version.

You can explore the visualization here.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

February 24 2012

Top stories: February 20-24, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Data for the public good
The explosion of big data, open data and social data offers new opportunities to address humanity's biggest challenges. The open question is no longer if data can be used for the public good, but how.

Building the health information infrastructure for the modern epatient
The National Coordinator for Health IT, Dr. Farzad Mostashari, discusses patient empowerment, data access and ownership, and other important trends in healthcare.

Big data in the cloud
Big data and cloud technology go hand-in-hand, but it's comparatively early days. Strata conference chair Edd Dumbill explains the cloud landscape and compares the offerings of Amazon, Google and Microsoft.

Everyone has a big data problem
MetaLayer's Jonathan Gosier talks about the need to democratize data tools because everyone has a big data problem.

Three reasons why direct billing is ready for its close-up
David Sims looks at the state of direct billing and explains why it's poised to catch on beyond online games and media.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

Cloud photo: Big cloud by jojo nicdao, on Flickr

February 23 2012

Four short links: 23 February 2012

  1. Why Mobile Matters (Luke Wroblewski) -- great demonstration of the changes in desktop and mobile, the new power of Android, and the waning influence of old manufacturers.
  2. It's Called iBooks Author Not iMathTextbooks Author, And The Trouble That Results (Dan Meyer) -- It's curious that even though students own their iBooks forever (ie. they can't resell them or give them away), they can't write in them except in the most cursory ways. Even curiouser, these iBooks could all be wired to the Internet and wired to a classroom through iTunes U, but they'd still be invisible to each other. Your work on your iPad cannot benefit me on mine. At our school, we look for "software with holes in it"--software into which kids put their own answers, photos, stories.
  3. DepthCam -- It’s a live-streaming 3D point-cloud, carried over a binary WebSocket. It responds to movement in the scene by panning the (virtual) camera, and you can also pan and zoom around with the mouse. Very impressive hack with a Kinect! (via Pete Warden)
  4. Starting an Online Store is Not Easy in Greece -- At the health department, they were told that all the shareholders of the company would have to provide chest X-rays, and, in the most surreal demand of all, stool samples. Note to Greece: this is not how you check whether a business plan is full of shit. (via Hacker News)

February 22 2012

Data for the public good

Can data save the world? Not on its own. As an age of technology-fueled transparency, open innovation and big data dawns around the world, the success of new policy won't depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered.

Advocates, watchdogs and government officials now have new tools for data journalism and open government. Globally, there's a wave of transparency that will wash over every industry and government, from finance to healthcare to crime.

In that context, open government is about much more than open data: just look at the issues that flow around the #opengov hashtag on Twitter, including the nature of identity, privacy, security, procurement, culture, cloud computing, civic engagement, participatory democracy, corruption, civic entrepreneurship and transparency.

If we accept the premise that Gov 2.0 is a potent combination of open government, mobile, open data, social media, collective intelligence and connectivity, the lessons of the past year suggest that a tidal wave of technology-fueled change is still building worldwide.

The Economist's support for open government data remains salient today:

"Public access to government figures is certain to release economic value and encourage entrepreneurship. That has already happened with weather data and with America's GPS satellite-navigation system that was opened for full commercial use a decade ago. And many firms make a good living out of searching for or repackaging patent filings."

As Clive Thompson reported at Wired last year, public sector data can help fuel jobs, and "shoving more public data into the commons could kick-start billions in economic activity." In the transportation sector, for instance, transit data is open government fuel for economic growth.

There is a tremendous amount of work ahead in building upon the foundations that civil society has constructed over decades. If you want a deep look at what the work of digitizing data really looks like, read Carl Malamud's interview with Slashdot on opening government data.

Data for the public good, however, goes far beyond government's own actions. In many cases, it will happen despite government action — or, often, inaction — as civic developers, data scientists and clinicians pioneer better analysis, visualization and feedback loops.

For every civic startup or regulation, there's a backstory that often involves a broad number of stakeholders. Governments have to commit to open up themselves but will, in many cases, need external expertise or even funding to do so. Citizens, industry and developers have to show up to use the data, demonstrating that there's not only demand, but also skill outside of government to put open data to work in service of accountability, citizen utility and economic opportunity. Galvanizing the co-creation of civic services, policies or apps isn't easy, but tapping the potential of the civic surplus has attracted the attention of governments around the world.

There are many challenges to overcome before that vision comes to pass. For one, data quality and access remain poor. Socrata's open data study identified progress, but also pointed to a clear need for improvement: Only 30% of developers surveyed said that government data was available, and of that, 50% of the data was unusable.

Open data will not be a silver bullet to all of society's ills, but an increasing number of states are assembling platforms and stimulating an app economy.

Results-oriented mayors like Rahm Emanuel and Mike Bloomberg are committing to opening Chicago and opening government data in New York City, respectively.

Following are examples of where data for the public good is already having an impact upon the world we live in, along with some ideas about what lies ahead.

Financial good

Anyone looking for civic entrepreneurship will be hard pressed to find a better recent example than BrightScope. The efforts of Mike and Ryan Alfred are in line with traditional entrepreneurship: identifying an opportunity in a market that no one else has created value around, building a team to capitalize on it, and then investing years of hard work to execute on that vision. In the process, BrightScope has made government data about the financial industry more usable, searchable and open to the public.

Due to the efforts of these two entrepreneurs and their California-based startup, anyone who wants to learn more about financial advisers before tapping one to manage their assets can do so online.

Prior to BrightScope, the adviser data was locked up at the Securities and Exchange Commission (SEC) and the Financial Industry Regulatory Authority (FINRA).

"Ryan and I knew this data was there because we were advisers," said BrightScope co-founder Mike Alfred in a 2011 interview. "We knew data had been filed, but it wasn't clear what was being done with it. We'd never seen it liberated from the government databases."

While they knew the public data existed and had their idea years ago, Alfred said it didn't happen because they "weren't in the mindset of being data entrepreneurs" yet. "By going after 401(k) first, we could build the capacity to process large amounts of data," Alfred said. "We could take that data and present it on the web in a way that would be usable to the consumer."

Notably, the government data that BrightScope has gathered on financial advisers goes further than a given profile page. Over time, as search engines like Google and Bing index the information, the data has become searchable in places consumers are actually looking for it. That's aligned with one of the laws for open data that Tim O'Reilly has been sharing for years: Don't make people find data. Make data find the people.

As agencies adapt to new business relationships, consumers are starting to see increased access to government data. Now, more data that the nation's regulatory agencies collected on behalf of the public can be searched and understood by the public. Open data can improve lives, not least through adding more transparency into a financial sector that desperately needs more of it. This kind of data transparency will give the best financial advisers the advantage they deserve and make it much harder for your Aunt Betty to choose someone with a history of financial malpractice.

The next phase of financial data for good will use big data analysis and algorithmic consumer advice tools, or "choice engines," to help consumers make better decisions. The vast majority of consumers are unlikely to ever look directly at raw datasets themselves. Instead, they'll use mobile applications, search engines and social recommendations to make smarter choices.

There are already early examples of such services emerging. Billshrink, for example, lets consumers get personalized recommendations for a cheaper cell phone plan based on calling histories. Mint makes specific recommendations on how a citizen can save money based upon data analysis of the accounts added. Moreover, much of the innovation in this area is enabled by the ability of entrepreneurs and developers to go directly to data aggregation intermediaries like Yodlee or CashEdge to license the data.

Transit data as economic fuel

Transit data continues to be one of the richest and most dynamic areas for co-creation of services. Around the United States and beyond, there has been a blossoming of innovation in the city transit sector, driven by the passion of citizens and fueled by the release of real-time transit data by city governments.

Francisca Rojas, research director at the Harvard Kennedy School's Transparency Policy Project, has investigated the dynamics behind the disclosure of data by transit agencies in the United States, which she calls one of the most successful implementations of open government. "In just a few years, a rich community has developed around this data, with visionary champions for disclosure inside transit agencies collaborating with eager software developers to deliver multiple ways for riders to access real-time information about transit," wrote Rojas.

The Massachusetts Bay Transportation Authority (MBTA) learned from TriMet in Portland, Oregon, that open data is better. "This was the best thing the MBTA had done in its history," said Laurel Ruma, O'Reilly's director of talent and a long-time resident in greater Boston, in her 2010 Ignite talk on real-time transit data. The MBTA's move to make real-time data available and support it has spawned a new ecosystem of mobile applications, many of which are featured at MBTA.com.

There are now 44 different consumer-facing applications for the TriMet system. Chicago, Washington and New York City also have a growing ecosystem of applications.

As more sensors go online in smarter cities, tracking traffic patterns will enable public administrators to optimize routes, schedules and capacity, driving efficiency and a better allocation of resources.

Transparency and civic goods

As John Wonderlich, policy director at the Sunlight Foundation, observed last year, access to legislative data brings citizens closer to their representatives. "When developers and programmers have better access to the data of Congress, they can better build the databases and tools that let the rest of us connect with the legislature."

That's the promise of the Sunlight Foundation's work, in general: Technology-fueled transparency will help fight corruption and fraud and reveal the influence behind policies. That work is guided by data generated, scraped and aggregated from government and regulatory bodies. The Sunlight Foundation has been focused on opening up Congress through technology since the organization was founded. Some of its efforts culminated recently with the publication of a live XML feed for the House floor and a transparency portal for House legislative documents.

There are other horizons for transparency through open government data, which broadly refers to public sector records that have been made available to citizens. For a canonical resource on what makes such releases truly "open," consult the "8 Principles of Open Government Data."

For instance, while gerrymandering has been part of American civic life since the birth of the republic, one of the best policy innovations of 2011 may offer hope for improving the redistricting process. DistrictBuilder, an open-source tool created by the Public Mapping Project, allows anyone to easily create legal districts.

"During the last year, thousands of members of the public have participated in online redistricting and have created hundreds of valid public plans," said Micah Altman, senior research scientist at Harvard University Institute for Quantitative Social Science, via an email last year.

"In substantial part, this is due to the project's effort and software. This year represents a huge increase in participation compared to previous rounds of redistricting; for example, the number of plans produced and shared by members of the public this year is roughly 100 times the number of plans submitted by the public in the last round of redistricting 10 years ago," Altman said. "Furthermore, the extensive news coverage has helped make a whole new set of people aware of the issue and has reframed it as a problem that citizens can actively participate in to solve, rather than simply complain about."

Principles for data in the public good

As a result of digital technology, our collective public memory can now be shared and expanded upon daily. In a recent lecture on public data for public good at Code for America, Michal Migurski of Stamen Design made the point that part of the global financial crisis came through a crisis in public knowledge, citing "The Destruction of Economic Facts," by Hernando de Soto.

Citizens, regulators, executives and elected leaders are inundated with information. To arrive at virtuous feedback loops that amplify the signals they need to make better decisions, data providers and infomediaries will need to embrace key principles, as Migurski's lecture outlined.

First, "data drives demand," wrote Tim O'Reilly, who attended the lecture and distilled Migurski's insights. "When Stamen launched crimespotting.org, it made people aware that the data existed. It was there, but until they put visualization front and center, it might as well not have been."

Second, "public demand drives better data," wrote O'Reilly. "Crimespotting led Oakland to improve their data publishing practices. The stability of the data and publishing on the web made it possible to have this data addressable with public links. There's an 'official version,' and that version is public, rather than hidden."

Third, "version control adds dimension to data," wrote O'Reilly. "Part of what matters so much when open source, the web, and open data meet government is that practices that developers take for granted become part of the way the public gets access to data. Rather than static snapshots, there's a sense that you can expect to move through time with the data."
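One way to picture "moving through time with the data" is to compare two published snapshots of the same dataset. The following is only a minimal Python sketch under stated assumptions: each snapshot is a dictionary keyed by a record identifier, and the function name `diff_snapshots` and the sample records are hypothetical illustrations, not anything Stamen or Oakland actually used.

```python
def diff_snapshots(old, new):
    """Compare two snapshots of a dataset, each a dict keyed by record id.

    Returns three dicts: records added, records removed, and records whose
    value changed (mapped to an (old_value, new_value) pair).
    """
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return added, removed, changed


# Invented sample records, for illustration only.
last_week = {"A1": "open", "A2": "open"}
this_week = {"A1": "closed", "A3": "open"}

added, removed, changed = diff_snapshots(last_week, this_week)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

A real publishing pipeline would more likely keep the data in an actual version control system or a database with history; the point of the sketch is only that stable identifiers let the public see what changed between releases.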

The case for open data

Accountability and transparency are important civic goods, but adopting open data requires grounded arguments for a city chief financial officer to support these initiatives. When it comes to making a business case for open data, John Tolva, the chief technology officer for Chicago, identified four areas that support the investment in open government:

  1. Trust — "Open data can build or rebuild trust in the people we serve," Tolva said. "That pays dividends over time."
  2. Accountability of the work force — "We've built a performance dashboard with KPIs [key performance indicators] that track where the city directly touches a resident."
  3. Business building — "Weather apps, transit apps ... that's the easy stuff," he said. "Companies built on reading vital signs of the human body could be reading the vital signs of the city."
  4. Urban analytics — "Brett [Goldstein] established probability curves for violent crime. Now we're trying to do that elsewhere, uncovering cost savings, intervention points, and efficiencies."

New York City is also using data internally. The city is doing things like applying predictive analytics to building code violations and housing data to try to understand where potential fire risks might exist.

"The thing that's really exciting to me, better than internal data, of course, is open data," said New York City chief digital officer Rachel Sterne during her talk at Strata New York 2011. "This, I think, is where we really start to reach the potential of New York City becoming a platform like some of the bigger commercial platforms and open data platforms. How can New York City, with the enormous amount of data and resources we have, think of itself the same way Facebook has an API ecosystem or Twitter does? This can enable us to produce a more user-centric experience of government. It democratizes the exchange of information and services. If someone wants to do a better job than we are in communicating something, it's all out there. It empowers citizens to collaboratively create solutions. It's not just the consumption but the co-production of government services and democracy."

The promise of data journalism

NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr

The ascendance of data journalism in media and government will continue to gather force in the years ahead.

Journalists and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news. Developments often break first on social networks, and that information is then curated by a combination of professionals and amateurs. News is then analyzed and synthesized into contextualized journalism.

Data is being scraped by journalists, generated from citizen reporting, or gleaned from massive information dumps — such as with the Guardian's formidable data journalism, as detailed in a recent ebook. ScraperWiki, a favorite tool of civic coders at Code for America and elsewhere, enables anyone to collect, store and publish public data. As we grapple with the consumption challenges presented by this deluge of data, new publishing platforms are also empowering us to gather, refine, analyze and share data ourselves, turning it into information.

There are a growing number of data journalism efforts around the world, from New York Times interactive features to the award-winning investigative work of ProPublica. Here are just a few promising examples:

  • Spending Stories, from the Open Knowledge Foundation, is designed to add context to news stories based upon government data by connecting stories to the data used.
  • Poderopedia is trying to bring more transparency to Chile, using data visualizations that draw upon a database of editorial and crowdsourced data.
  • The State Decoded is working to make the law more user-friendly.
  • Public Laboratory is a tool kit and online community for grassroots data gathering and research that builds upon the success of Grassroots Mapping.
  • Internews and its local partner Nai Mediawatch launched a new website that shows incidents of violence against journalists in Afghanistan.

Open aid and development

The World Bank has been taking unprecedented steps to make its data more open and usable to everyone. The data.worldbank.org website that launched in September 2010 was designed to make the bank's open data easier to use. In the months since, more than 100 applications have been built using the data.

"Up until very recently, there was almost no way to figure out where a development project was," said Aleem Walji, practice manager for innovation and technology at the World Bank Institute, in an interview last year. "That was true for all donors, including us. You could go into a data bank, find a project ID, download a 100-page document, and somewhere it might mention it. To look at it all on a country level was impossible. That's exactly the kind of organization-centric search that's possible now with extracted information on a map, mashed up with indicators. All of a sudden, donors and recipients can both look at relationships."

Open data efforts are not limited to development. More data-driven transparency in aid spending is also going online. Last year, the United States Agency for International Development (USAID) launched a public engagement effort to raise awareness about the devastating famine in the Horn of Africa. The FWD campaign includes a combination of open data, mapping and citizen engagement.

"Frankly, it's the first foray the agency is taking into open government, open data, and citizen engagement online," said Haley Van Dyck, director of digital strategy at USAID, in an interview last year.

"We recognize there is a lot more to do on this front, but are happy to start moving the ball forward. This campaign is different than anything USAID has done in the past. It is based on informing, engaging, and connecting with the American people to partner with us on these dire but solvable problems. We want to change not only the way USAID communicates with the American public, but also the way we share information."

USAID built and embedded interactive maps on the FWD site. The agency created the maps with open source mapping tools and published the datasets it used to make these maps on data.gov. All are available to the public and media to download and embed as well.

Publishing maps online together with the open data that drives them is a significant step forward for any government agency, and it sets a worthy bar for future efforts to meet. USAID accomplished this by migrating its data to an open, machine-readable format.

"In the past, we released our data in inaccessible formats, mostly PDFs, that are often unable to be used effectively," said Van Dyck. "USAID is one of the premier data collectors in the international development space. We want to start making that data open, making that data sharable, and using that data to tell stories about the crisis and the work we are doing on the ground in an interactive way."

Crisis data and emergency response

Unprecedented levels of connectivity now exist around the world. According to a 2011 survey from the Pew Internet & American Life Project, more than 50% of American adults use social networks, 35% of American adults have smartphones, and 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. Networked publics can now share the effects of disasters in real time, providing officials with unprecedented insight into what's happening. Citizens act as sensors in the midst of the storm, creating an ad hoc system of networked accountability through data.

The growth of an Internet of Things is an important evolution. What we saw during Hurricane Irene in 2011 was the increasing importance of an Internet of people, where citizens act as sensors during an emergency. Emergency management practitioners and first responders have woken up to the potential of using social data for enhanced situational awareness and resource allocation.

An historic emergency social data summit in Washington in 2010 highlighted how relevant this area has become. And last year's hearing in the United States Senate on the role of social media in emergency management was "a turning point in Gov 2.0," said Brian Humphrey of the Los Angeles Fire Department.

The Red Cross has been at the forefront of using social data in a time of need. That's not entirely by choice, given that news of disasters has consistently broken first on Twitter. The challenge is for the men and women entrusted with coordinating response to identify signals in the noise.

First responders and crisis managers are using a growing suite of tools for gathering information and sharing crucial messages internally and with the public. Structured social data and geospatial mapping suggest one direction where these tools are evolving in the field.

A web application from ESRI deployed during historic floods in Australia demonstrated how crowdsourced social intelligence provided by Ushahidi can enable emergency social data to be integrated into crisis response in a meaningful way.

The Australian flooding web app includes the ability to toggle layers from OpenStreetMap, satellite imagery, and topography, and then filter by time or report type. By adding structured social data, the web app provides geospatial information system (GIS) operators with valuable situational awareness that goes beyond standard reporting, including the locations of property damage, roads affected, hazards, evacuations and power outages.
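The filtering described above, by time or by report type, can be illustrated with a few lines of Python. This is a rough sketch only: the `Report` fields, the `filter_reports` function and the sample records are hypothetical and invented for illustration, and do not reflect the actual ESRI or Ushahidi data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Report:
    # Hypothetical fields for illustration; not an actual ESRI or Ushahidi schema.
    kind: str              # e.g. "road_affected", "power_outage"
    lat: float
    lon: float
    reported_at: datetime


def filter_reports(reports, kind=None, start=None, end=None):
    """Return reports matching an optional type and an optional time window."""
    selected = []
    for report in reports:
        if kind is not None and report.kind != kind:
            continue
        if start is not None and report.reported_at < start:
            continue
        if end is not None and report.reported_at > end:
            continue
        selected.append(report)
    return selected


# Invented sample records, for illustration only.
reports = [
    Report("road_affected", 10.0, 20.0, datetime(2011, 1, 11, 9, 0)),
    Report("power_outage", 10.1, 20.1, datetime(2011, 1, 12, 14, 30)),
    Report("road_affected", 10.2, 20.2, datetime(2011, 1, 13, 8, 15)),
]

recent_roads = filter_reports(
    reports, kind="road_affected", start=datetime(2011, 1, 12)
)
for report in recent_roads:
    print(report.kind, report.lat, report.lon, report.reported_at.isoformat())
```

Once crowdsourced reports carry a type, a location and a timestamp, this kind of selection becomes a simple query, which is what lets a GIS operator layer the results on a map.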

Long before the floods, and before the Red Cross joined Twitter, Brian Humphrey of the Los Angeles Fire Department (LAFD) was already online, listening. "The biggest gap directly involves response agencies and the Red Cross," said Humphrey, who currently serves as the LAFD's public affairs officer. "Through social media, we're trying to narrow that gap between response and recovery to offer real-time relief."

After the devastating 2010 earthquake in Haiti, the evolution of volunteers working collaboratively online also offered a glimpse into the potential of citizen-generated data. Crisis Commons has acted as a sort of "geeks without borders." Around the world, developers, GIS engineers, online media professionals and volunteers collaborated on information technology projects to support disaster relief for post-earthquake Haiti, mapping streets on OpenStreetMap and collecting crisis data on Ushahidi.

Healthcare

What happens when patients find out how good their doctors really are? That was the question that Harvard Medical School professor Dr. Atul Gawande asked in the New Yorker, nearly a decade ago.

The narrative he told in that essay makes the history of quality improvement in medicine compelling, connecting it to the creation of a data registry at the Cystic Fibrosis Foundation in the 1950s. As Gawande detailed, that data was privately held. After it became open, life expectancy for cystic fibrosis patients tripled.

In 2012, the new hope is in big data, where techniques for finding meaning in the huge amounts of unstructured data generated by healthcare diagnostics offer immense promise.

The trouble, say medical experts, is that data availability and quality remain significant pain points that are holding back existing programs.

There are, literally, bright spots that suggest what's possible. Dr. Gawande's 2011 essay, which considered whether "hotspotting" using health data could help lower medical costs by giving the neediest patients better care, offered another perspective on the issue. Early outcomes made the approach look compelling. As Dr. Gawande detailed, when a Medicare demonstration program offered medical institutions payments that financed the coordination of care for its most chronically expensive beneficiaries, hospital stays and trips to emergency rooms dropped more than 15% over the course of three years. A test program adopting a similar approach in Atlantic City saw a 25% drop in costs.

Through sharing data and knowledge, and then creating a system to convert ideas into practice, clinicians in the ImproveCareNow network were able to improve the remission rate for Crohn's disease from 49% to 67% without the introduction of new drugs.

In Britain, researchers found that the outcomes for adult cardiac patients improved after the publication of information on death rates. With the release of meaningful new open government data about performance and outcomes from the British national healthcare system, similar improvements may be on the way.

"I do believe we are at the beginning of a revolutionary moment in health care, when patients and clinicians collect and share data, working together to create more effective health care systems," said Susannah Fox, associate director for digital strategy at the Pew Internet and Life Project, in an interview in January. Fox's research has documented the social life of health information, the concept of peer-to-peer healthcare, and the role of the Internet among people living with chronic disease.

In the past few years, entrepreneurs, developers and government agencies have been collaboratively exploring the power of open data to improve health. In the United States, the open data story in healthcare is evolving quickly, from new mobile apps that lead to better health decisions to data spurring changes in care at the U.S. Department of Veterans Affairs.

Since he entered public service, Todd Park, the first chief technology officer of the U.S. Department of Health and Human Services (HHS), has focused on unleashing the power of open data to improve health. If you aren't familiar with this story, read the Atlantic's feature article that explores Park's efforts to revolutionize the healthcare industry through better use of data.

Park has focused on releasing data at Health.Data.Gov. In a speech to a Hacks and Hackers meetup in New York City in 2011, Park emphasized that HHS wasn't just releasing new data: "[We're] also making existing data truly accessible or usable," he said, taking "stuff that's in a book or on a website and turning it into machine-readable data or an API."

Park said it's still quite early in the project and that the work isn't just about data — it's about how and where it's used. "Data by itself isn't useful. You don't go and download data and slather data on yourself and get healed," he said. "Data is useful when it's integrated with other stuff that does useful jobs for doctors, patients and consumers."

What lies ahead

There are four trends that warrant special attention as we look to the future of data for public good: civic network effects, hybridized data models, personal data ownership and smart disclosure.

Civic network effects

Community is a key ingredient in successful open government data initiatives. It's not enough to simply release data and hope that venture capitalists and developers magically become aware of the opportunity to put it to work. Marketing open government data is what repeatedly brought federal Chief Technology Officer Aneesh Chopra and Park out to Silicon Valley, New York City and other business and tech hubs.

Despite the addition of topical communities to Data.gov, conferences and new media efforts, government's attempts to act as an "impatient convener" can only go so far. Civic developer and startup communities are creating a new distributed ecosystem that will help create that community, from BuzzData to Socrata to new efforts like Max Ogden's DataCouch.

Smart disclosure

There are enormous economic and civic good opportunities in the "smart disclosure" of personal data, whereby a private company or government institution provides a person with access to his or her own data in open formats. Smart disclosure is defined by Cass Sunstein, Administrator of the White House Office of Information and Regulatory Affairs, as a process that "refers to the timely release of complex information and data in standardized, machine-readable formats in ways that enable consumers to make informed decisions."

For instance, the quarterly financial statements of the top public companies in the world are now available online through the Securities and Exchange Commission.

Why does it matter? The interactions of citizens with companies or government entities generate a huge amount of economically valuable data. If consumers and regulators had access to that data, they could tap it to make better choices about everything from finance to healthcare to real estate, much in the same way that web applications like Hipmunk and Zillow let consumers make more informed decisions.

Personal data assets

When a trend makes it to the World Economic Forum (WEF) in Davos, it's generally evidence that the trend is gathering steam. A report titled "Personal Data Ownership: The Emergence of a New Asset Class" suggests that 2012 will be the year when citizens start thinking more about data ownership, whether that data is generated by private companies or the public sector.

"Increasing the control that individuals have over the manner in which their personal data is collected, managed and shared will spur a host of new services and applications," wrote the paper's authors. "As some put it, personal data will be the new 'oil' — a valuable resource of the 21st century. It will emerge as a new asset class touching all aspects of society."

The idea of data as a currency is still in its infancy, as Strata Conference chair Edd Dumbill has emphasized. The Locker Project, which provides people with the ability to move their own data around, is one of many approaches.

The growth of the Quantified Self movement and of online communities like PatientsLikeMe and 23andMe shows how much momentum personal data ownership has gained. In the U.S. federal government, the Blue Button initiative, which enables veterans to download personal health data, has now spread to all federal employees and earned adoption at Aetna and Kaiser Permanente.

In early 2012, a Green Button was launched to unleash energy data in the same way. Venture capitalist Fred Wilson called the Green Button an "OAuth for energy data."

Wilson wrote:

"It is a simple standard that the utilities can implement on one side and web/mobile developers can implement on the other side. And the result is a ton of information sharing about energy consumption and, in all likelihood, energy savings that result from more informed consumers."

Hybridized public-private data

Free or low-cost online tools are empowering citizens to do more than donate money or blood: Now they can donate time and expertise, or even act as sensors. In the United States, we saw a leading edge of this phenomenon in the Gulf of Mexico, where Oil Reporter, an open source oil spill reporting app, provided a prototype for data collection via smartphone. In Japan, an analogous effort called Safecast grew and matured in the wake of the nuclear disaster that resulted from a massive earthquake and subsequent tsunami in 2011.

Open source software and citizens acting as sensors have steadily been integrated into journalism over the past few years, most dramatically in the videos and pictures uploaded after the 2009 Iran election and during 2011's Arab Spring.

Citizen science looks like the next frontier. Safecast is combining open data collected by citizen science with academic, NGO and open government data (where available), and then making it widely available. It's similar to other projects, where public data and experimental data are percolating.

Public data is a public good

Despite the myriad challenges presented by legitimate concerns about privacy, security, intellectual property and liability, the promise of more informed citizens is significant. McKinsey's 2011 report dubbed big data the next frontier for innovation, with billions of dollars of economic value yet to be created. When that innovation is applied on behalf of the public good, whether it's in city planning, transit, healthcare, government accountability or situational awareness, its effects will reach even further.

We're entering the feedback economy, where dynamic feedback loops between customers and corporations, partners and providers, citizens and governments, or regulators and companies can drive both efficiencies and leaner, smarter governments.

The exabyte age will bring with it the twin challenges of information overload and overconsumption, both of which will require organizations of all sizes to use the emerging toolboxes for filtering, analysis and action. To create public good from public goods — the public sector data that governments collect, the private sector data that is being collected and the social data that we generate ourselves — we will need to collectively forge new compacts that honor existing laws and visionary agreements that enable the new data science to put the data to work.

Photo: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr

February 15 2012

What the data can tell us about dating and other social congregation

Valentine's Day turned out to be a good time to discuss data crunching of online dating. Kevin Lewis, a PhD candidate in sociology and Berkman Center Fellow, drew an overflow room today for his talk Mate Choice in an Online Dating Site. It's yet another example of how, as people go online, they leave a trail of data that could never be captured before.

Here are some examples of how traditional researchers are restricted:

  • They can get marriage data, but have much less data about dating, cohabitation without marriage, and other non-traditional arrangements that are increasingly common. Dating sites let us in at a much earlier stage in a relationship that may or may not lead to marriage.

  • They can measure certain recorded demographics such as age and race, but miss a huge range of criteria by which people evaluate potential mates. People enter lots of interesting facts about themselves and their hoped-for mates on dating sites.

  • Because researchers miss the initial contacts, they have trouble tracing back from a result (marriage) to the criteria used by the dating couples.

As an example of the last problem, Lewis mentioned the observation that people usually date and marry others with similar levels of formal education. Actually, researchers have long hypothesized that men don't care much about women's educational levels. They would be willing to date and marry outside their educational levels. It's the women who care, and since they rule out men with much higher or lower educational levels, we end up with the current results.

Now Lewis can cite concrete data proving that hypothesis. On a dating site, men initiate and respond to contacts with women of many different levels. But the women don't initiate many contacts outside their own level, and don't respond to contacts from men outside that level.

How did Lewis conduct his research? Briefly, he persuaded OkCupid to give him a large data set stripped of free-text fields, but containing information on race, religion, and several other criteria. He chose data in the New York City area for heterosexual couples. Considering that 22% of heterosexual adults have found their current partners through online sites (the figure is even higher for same-sex couples: 61%), this is a lot of valuable data.

Of course, there are risks in extrapolating from this data set. Admittedly, OkCupid users tend to be younger and more Internet-savvy than the overall dating population. It's hard to tell whether some criterion is truly a determining factor or a consequence of some other factor (for instance, educational level is correlated with age). Still, Lewis controlled for variables a good deal and feels there is a lot of statistical validity to his findings.

As just one other example, he documented a lot of contacts across racial lines, more than one might expect. But there were definite patterns. For instance, black women received far fewer contacts from other races than most groups did. In this way, the data on dating gives us a look at our values and choices in other forms of social interaction, not just romance.

February 10 2012

O'Reilly Radar Show 2/10/12: The 5 trends that will shape the data world

Below you'll find the script and associated links from the February 10, 2012 episode of O'Reilly Radar. An archive of past shows is available through O'Reilly Media's YouTube channel and you can subscribe to episodes of O'Reilly Radar via iTunes.


Introduction

There are five major trends that will shape the data world in the months to come. Strata Conference chair Edd Dumbill reveals them in this episode of O'Reilly Radar. [Starts 12 seconds in.]

Also in this episode: We revisit a conversation with Wired's Kevin Kelly in which he discusses freemium models and why digital rights management will likely persist in some form or another. [Interview begins at 11:04.]

Radar posts of note

[This segment begins at the 10:06 mark.]

For now, legislators have backed off of the Stop Online Piracy Act and the Protect IP Act, but the friction between media companies and online piracy persists. In his piece "SOPA and PIPA are bad industrial policy," Tim O'Reilly explains why these efforts — and those sure to emerge down the road — hold back innovative business models that grow the overall market.

It's the hot trend in software right now, but what does big data mean, and how can you exploit it? In "What is big data?," Strata chair Edd Dumbill presents an introduction and orientation to the big data landscape.

Finally, books, publishing processes and readers have all made the jump to digital, and that's creating considerable opportunities for publishing startups. Justo Hidalgo explores the digital shift in his piece, "Three reasons why we're in a golden age of publishing entrepreneurship."

As always, links to these stories and other resources mentioned during this episode are available at radar.oreilly.com/show.

Radar video spotlight

At the 2011 Tools of Change for Publishing conference I had a chance to interview Wired's Kevin Kelly about two topics that continue to play big roles in the content world: the freemium model and digital rights management.

As you'll see in the following video, Kelly has a unique, long-view perspective on both of these issues.

[Interview begins at 11:04.]

Closing

Just a reminder that you can always catch episodes of O'Reilly Radar at youtube.com/oreillymedia and subscribe to episodes through iTunes.

All of the links and resources mentioned during this episode are posted at radar.oreilly.com/show.

That's all we have for this episode. Thanks for joining us and we'll see you again soon.

February 07 2012

Four short links: 7 February 2012

  1. Integrated Content Editor (GitHub) -- a track changes implementation, built in JavaScript, for anything that is contenteditable on the web, written by the NY Times team and open sourced.
  2. Data Tables -- featureful jQuery plugin for tables of data. (via Javascript Weekly)
  3. Creating a Developer Community (Slideshare) -- treat the problem like a channel conversion funnel: turn visitors into downloaders, downloaders into users, users into contributors. His screenshots of shitty conversions are great! (via Kohsuke Kawaguchi)
  4. Sex Differences in Intimate Relationships (PDF) -- Albert-Laszlo Barabasi and others use social graph analysis to analyze communications patterns in relationships. Notice that not only does the preference for an opposite-sex “best friend” kick in significantly earlier for females than for males (~18 years vs mid-20s, respectively), but females maintain a higher plateau value for much longer. More reality mining to understand ourselves. (via Sean Gourley)

February 01 2012

Why Hadoop caught on

Doug Cutting (@cutting) is a founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. When Cutting expresses surprise at Hadoop's growth — as he does below — that carries a lot of weight.

In the following interview, Cutting explains why he's surprised at Hadoop's ascendance, and he looks at the factors that helped Hadoop catch on. He'll expand on some of these points during his Hadoop session at the upcoming Strata Conference.

Why do you think Hadoop has caught on?

Doug Cutting: Hadoop is a technology whose time had come. As computer use has spread, institutions are generating vastly more data. While commodity hardware offers affordable raw storage and compute horsepower, before Hadoop, there was no commodity software to harness it. Without tools, useful data was simply discarded.

Open source is a methodology for commoditizing software. Google published its technological solutions, and the Hadoop community at Apache brought these to the rest of the world. Commodity hardware combined with the latent demand for data analysis formed the fuel that Hadoop ignited.

Are you surprised at its growth?

Doug Cutting: Yes. I didn't expect Hadoop to become such a central component of data processing. I recognized that Google's techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.

What role do you see Hadoop playing in the near-term future of data science and big data?

Doug Cutting: Hadoop is a central technology of big data and data science. HDFS is where folks store most of their data, and MapReduce is how they execute most of their analysis. There are some storage alternatives — for example, Cassandra and CouchDB, and useful computing alternatives, like S4, Giraph, etc. — but I don't see any of these replacing HDFS or MapReduce soon as the primary tools for big data.

Long term, we'll see. The ecosystem at Apache is a loosely-coupled set of separate projects. New components are regularly added to augment or replace incumbents. Such an ecosystem can survive the obsolescence of even its most central components.

In your Strata session description, you note that "Apache Hadoop forms the kernel of an operating system for big data." What else is in that operating system? How is that OS being put to use?

Doug Cutting: Operating systems permit folks to share resources, managing permissions and allocations. The two primary resources are storage and computation. Hadoop provides scalable storage through HDFS and scalable computation through MapReduce. It supports authorization, authentication, permissions, quotas and other operating system features. So, narrowly speaking, Hadoop alone is an operating system.

But no one uses Hadoop alone. Rather, folks also use HBase, Hive, Pig, Flume, Sqoop and many other ecosystem components. So, just as folks refer to more than the Linux kernel when they say "Linux," folks often refer to the entire Hadoop ecosystem when they say "Hadoop." Apache BigTop combines many of these ecosystem projects together into a distribution, much like RHL and Ubuntu do for Linux.
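For readers who haven't seen the MapReduce model Cutting refers to, here is a minimal single-process sketch of the idea in Python, using the classic word-count example. It is an illustration only and does not use Hadoop's API: the function names map_phase, shuffle, reduce_phase and word_count are hypothetical, and a real cluster would spread each phase across many machines reading from HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "big deal"])
```

Because each map call looks at only one document and each reduce call at only one key, both phases can run in parallel, which is what lets the same pattern scale across commodity hardware.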

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20