
June 14 2012

Stories over spreadsheets

I didn't realize how much I dislike spreadsheets until I was presented with a vision of the future where their dominance isn't guaranteed.

That eye-opener came courtesy of Narrative Science CTO Kris Hammond (@whisperspace) during a recent interview. Hammond's company turns data into stories: It provides sentences and paragraphs instead of rows and columns. To date, much of the attention Narrative Science has received has focused on the media applications. That's a natural starting point. Heck, I asked him about those very same things when I first met Hammond at Strata in New York last fall. But during our most recent chat, Hammond explored the other applications of narrative-driven data analysis.

"Companies, God bless them, had a great insight: They wanted to make decisions based upon the data that's out there and the evidence in front of them," Hammond said. "So they started gathering that data up. It quickly exploded. And they ended up with huge data repositories they had to manage. A lot of their effort ended up being focused on gathering that data, managing that data, doing analytics across that data, and then the question was: What do we do with it?"

Hammond sees an opportunity to extract and communicate the insights locked within company data. "We'll be the bridge between the data you have, the insights that are in there, or insights we can gather, and communicating that information to your clients, to your management, and to your different product teams. We'll turn it into something that's intelligible instead of a list of numbers, a spreadsheet, or a graph or two. You get a real narrative; a real story in that data."

My takeaway: The journalism applications of this are intriguing, but these other use cases are empowering.

Why? Because most people don't speak fluent "spreadsheet." They see all those neat rows and columns and charts, and they know something important is tucked in there, but what that something is and how to extract it aren't immediately clear. Spreadsheets require effort. That's doubly true if you don't know what you're looking for. And if data analysis is an adjacent part of a person's job, more effort means those spreadsheets will always be pushed to the side. "I'll get to those next week when I've got more time ..."

We all know how that plays out.

But what if the spreadsheet wasn't our default output anymore? What if we could take the things most of us are hard-wired to understand — stories, sentences, clear guidance — and layer them over all that vital data? Hammond touched on that:

"For some people, a spreadsheet is a great device. For most people, not so much so. The story. The paragraph. The report. The prediction. The advisory. Those are much more powerful objects in our world, and they're what we're used to."

He's right. Spreadsheets push us (well, most of us) into a cognitive corner. Open a spreadsheet and you're forced to recalibrate your focus to see the data. Then you have to work even harder to extract meaning. This is the best we can do?

With that in mind, I asked Hammond if the spreadsheet's days are numbered.

"There will always be someone who uses a spreadsheet," Hammond said. "But, I think what we're finding is that the story is really going to be the endpoint. If you think about it, the spreadsheet is for somebody who really embraces the data. And usually what that person does is they reduce that data down to something that they're going to use to communicate with someone else."

A thought on dashboards

I used to view dashboards as the logical step beyond raw data and spreadsheets. I'm not so sure about that anymore, at least in terms of broad adoption. Dashboards are good tools, and I anticipate we'll have them from now until the end of time, but they're still weighed down by a complexity that makes them inaccessible.

It's not that people can't master the buttons and custom reports in dashboards; they simply don't have time. These people — and I include myself among them — need something faster and knob-free. Simplicity is the thing that will ultimately democratize data reporting and data insights. That's why the expansion of data analysis requires a refinement beyond our current dashboards. There's a next step that hasn't been addressed.

Does the answer lie in narrative? Will visualizations lead the way? Will a hybrid format take root? I don't know what the final outputs will look like, but the importance of data reporting means someone will eventually crack the problem.

Full interview

You can see the entire discussion with Hammond in the following video.

June 06 2012

Who owns patient data?

Who owns a patient's health information?

  • The patient to whom it refers?
  • The health provider that created it?
  • The IT specialist who has the greatest control over it?

The notion of ownership is inadequate for health information. For instance, no one has an absolute right to destroy health information. But we all understand what it means to own an automobile: You can drive the car you own into a tree or into the ocean if you want to. No one has the legal right to do things like that to a "master copy" of health information.

All of the groups above have a complex series of rights and responsibilities relating to health information that should never be trivialized into ownership.

Raising the question of ownership at all is a hash argument. What is a hash argument? Here's how Julian Sanchez describes it:

"Come to think of it, there's a certain class of rhetoric I'm going to call the 'one-way hash' argument. Most modern cryptographic systems in wide use are based on a certain mathematical asymmetry: You can multiply a couple of large prime numbers much (much, much, much, much) more quickly than you can factor the product back into primes. A one-way hash is a kind of 'fingerprint' for messages based on the same mathematical idea: It's really easy to run the algorithm in one direction, but much harder and more time consuming to undo. Certain bad arguments work the same way — skim online debates between biologists and earnest ID (Intelligent Design) aficionados armed with talking points if you want a few examples: The talking point on one side is just complex enough that it's both intelligible — even somewhat intuitive — to the layman and sounds as though it might qualify as some kind of insight ... The rebuttal, by contrast, may require explaining a whole series of preliminary concepts before it's really possible to explain why the talking point is wrong."

The question "Who owns the data?" presumes that the notion of ownership is valid, and it jettisons those foolish enough to try to answer the question into a needless circular debate. Once you mistakenly assume that the question is answerable, you cannot help but back an unintelligible position.

Ownership is a poor starting point for health data because the concept itself doesn't map well to the people and organizations that have relationships with that data. The following breakdown shows which privileges each role actually has.

Sourcing Provider

  • Delete their copy of the data: No. HIPAA mandates that the provider who creates HIPAA-covered data must ensure that a copy of the record is available. Mere deletion is not a privilege providers have with their copies of patient records. Most EHR systems enforce this rule for providers.
  • Arbitrarily (without logs) edit their copy of the data: No. While providers can change the contents of the EHR, they are not allowed to do so without a log of those changes being maintained. Many EHRs include the concept of "signing" EHR data, which means the patient data has entered a state where it can no longer be changed without logging.
  • Correct the provider's copy of the data: Yes. Providers can correct their copy of the EHR data, provided they maintain a copy of the incorrect version. Again, EHR software enforces this rule.
  • Append to the provider's copy of the data: Yes. Providers can add to the data without changing the "correctness" of previous instances of it. EHR systems should handle this case seamlessly.
  • Acquire copies of HIPAA-covered data: Sometimes. Depending on the ongoing "treatment" status of the patient, providers typically have the right to acquire copies of treatment data from other treating providers. If they are "fired," they can lose this right.

Patient

  • Delete their copy of the data: Yes. Patients can delete their own copies of their records, but requests that providers delete their charts will be denied.
  • Arbitrarily (without logs) edit their copy of the data: No. Patients cannot change the "canonical" version of a patient record.
  • Correct the provider's copy of the data: No. While patients have the right to comment on and amend the file, they can merely suggest that the "canonical" version of the patient record be updated.
  • Append to the provider's copy of the data: Yes. The patient has the right to append to EHR records under HIPAA. HIPAA does not require that this amendment affect the "canonical" version of the record, but the additions must be present somewhere, and there is likely to be substantial civil liability for providers who fail to act in a clinically responsible manner on the amended data. The relationship between "patient amendments" and the "canonical version" is a complex procedural and technical issue that will see lots of attention in the years to come.
  • Acquire copies of HIPAA-covered data: Usually. Patients typically have the right to access the contents of an EHR system, assuming they pay a copying cost. EHRs frequently make this copying cost unreasonable, and the results are often so dense that they are not useful. There are also exceptions to this "right to read," including psychiatric notes and legal investigations.

True Copyright Ownership (i.e., the relationship you have with a paper you have written or a photo you have taken)

  • Delete their copy of the data: Yes. You can destroy things you own.
  • Arbitrarily (without logs) edit their copy of the data: Yes. You can change things you own without recording what changes you made.
  • Correct the provider's copy of the data: No. If you hold copyright to material and someone has purchased a right to a copy of that material, you cannot make them change it, even if you make "corrections." Sometimes people use licensing rather than mere "copy sales" to enforce this right (e.g., Microsoft might retain the right to change your copy of Windows).
  • Append to the provider's copy of the data: No. Again, you have no right to change another person's copy of something you hold the copyright to. Again, some people use licensing as a means to gain this power rather than just "sale of a copy."
  • Acquire copies of HIPAA-covered data: No. You do not have an automatic right to copies of other people's copyrighted works, even if they depict you somehow. (This is why your family photographer can gouge you on reprints.)

IT Specialist

  • Delete their copy of the data: Kind of. Regulations dictate that IT specialists and vendors should not have the right to delete patient records, but root (or admin) access to the underlying EHR databases means that only people with backend access can truly delete patient records. Only people with direct access to source code or to the database can completely circumvent EHR logging systems. The "delete" privilege is difficult to exercise entirely without detection, however, since someone (e.g., the patient) is likely to know that the record should be present.
  • Arbitrarily (without logs) edit their copy of the data: Yes. Source code or database-level access makes it possible to modify patient records without logging.
  • Correct the provider's copy of the data: Yes. Source code or database-level access makes it possible to modify patient records without logging.
  • Append to the provider's copy of the data: Yes. Source code or database-level access makes it possible to modify patient records without logging.
  • Acquire copies of HIPAA-covered data: No. Typically, database administrators and programmers do not have the standing to request medical records from other sources.

Ergo, neither the patient, nor the doctor, nor the programmer has an "ownership" relationship with patient data. Each of them has a unique set of privileges that do not line up exactly with any traditional notion of "ownership." Ironically, it is neither the patient nor the provider (when I say "provider," I usually mean a doctor) who comes closest to "owning" the data. The programmer has the most complete access and is the only role able to circumvent the rules that electronic health record (EHR) software enforces automatically.
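One way to see why "ownership" is the wrong frame is to write the relationships down as data. Below is a rough Python sketch of the privilege matrix above; the role and privilege names are my own shorthand, not terms from HIPAA or any EHR product.

```python
# A sketch of the privilege matrix above, expressed as data rather than "ownership".
# Boolean values summarize the table; "sometimes"/"usually" entries are kept as
# short strings so the nuance isn't silently lost.
PRIVILEGES = {
    "sourcing_provider": {
        "delete_own_copy": False,
        "edit_without_logs": False,
        "correct_provider_copy": True,
        "append_to_provider_copy": True,
        "acquire_hipaa_copies": "sometimes (while treating the patient)",
    },
    "patient": {
        "delete_own_copy": True,          # their own copies, not the provider's chart
        "edit_without_logs": False,
        "correct_provider_copy": False,   # may only suggest corrections
        "append_to_provider_copy": True,
        "acquire_hipaa_copies": "usually (copying costs and exceptions apply)",
    },
    "copyright_owner": {
        "delete_own_copy": True,
        "edit_without_logs": True,
        "correct_provider_copy": False,
        "append_to_provider_copy": False,
        "acquire_hipaa_copies": False,
    },
    "it_specialist": {
        "delete_own_copy": "kind of (backend access can bypass EHR rules)",
        "edit_without_logs": True,
        "correct_provider_copy": True,
        "append_to_provider_copy": True,
        "acquire_hipaa_copies": False,
    },
}

def can(role, privilege):
    """Return the privilege entry for a role; string values mean 'with caveats'."""
    return PRIVILEGES[role][privilege]

print(can("patient", "correct_provider_copy"))    # False: patients can only suggest changes
print(can("it_specialist", "edit_without_logs"))  # True: no single role "owns" the data outright
```

No row in that structure looks like ownership; every role gets a different, partial bundle of rights.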

So, asking "who owns the data?" is a meaningless, time-wasting, and shallow conceptualization of the issue at hand.

The real issue is: "What rights do patients have regarding healthcare data that refers to them?" This is a deep question because patient rights to data vary depending on how the data was acquired. For instance, a standalone personal health record (PHR) is governed primarily by the end-user license agreement (EULA) between the patient and the PHR provider (which gives patients wildly varying rights), while rights to a doctor's EHR data are dictated by both HIPAA and Meaningful Use standards.

Usually, what people really mean when they say "The patient owns the data" is "The patient's needs and desires regarding data should be respected." That is a wonderful instinct, but unless we are going to talk about specific privileges enabled by regulation or law, it really means "whatever the provider/programmer holding the data thinks it means."

For instance, while current Meaningful Use does require providers to give patients digital access to summary documents, there is no requirement for "complete" and "instant" access to the full contents of the EHR. While HIPAA mandates "complete" access, the EHR tends to deliver that access as printed copies of digitized patient data, which are close to useless. The devil is in the details here, and when people start going on about "the patient owning the data," what they are really doing is encouraging a mental shortcut that cannot readily be undone.

Note: This is a refresh of an article originally published here.



June 04 2012

Can Future Advisor be the self-driving car for financial advice?

Last year, venture capitalist Marc Andreessen famously wrote that software is eating the world. The impact of algorithms upon media, education, healthcare and government, among many other verticals, is just beginning to be felt, and with still unfolding consequences for the industries disrupted.

Whether it's the prospect of IBM's Watson offering a diagnosis to a patient or Google's self-driving car taking over on the morning commute, there are going to be serious concerns raised about safety, power, control and influence.

Doctors and lawyers note, for good reason, that their public appearances on radio, television and the Internet should not be viewed as medical or legal advice. While financial advice may not pose the same threat to a citizen as an incorrect medical diagnosis or treatment, poor advice could have pretty significant downstream outcomes.

That risk isn't stopping a new crop of startups from looking for a piece of the billions of dollars paid every year to financial advisors. Future Advisor launched in 2010 with the goal of providing better financial advice through the Internet using data and algorithms. They're competing against startups like Wealthfront and Betterment, among others.

Not everyone is convinced of the validity of this algorithmically mediated approach to financial advice. Mike Alfred, the co-founder of BrightScope (which has liberated financial advisor data itself), wrote in Forbes this spring that online investment firms are wrong about financial advisors:

"While singularity proponents may disagree with me here, I believe that some professions have a fundamentally human component that will never be replaced by computers, machines, or algorithms. Josh Brown, an independent advisor at Fusion Analytics Investment Partners in NYC, recently wrote that 'for 12,000 years, anywhere someone has had wealth through the history of civilization, there's been a desire to pay others for advice in managing it.' In some ways, it's no different from the reason why many seek out the help of a psychiatrist. People want the comfort of a human presence when things aren't going well. A computer arguably may know how to allocate funds in a normal market environment, but can it talk you off the cliff when things go to hell? I don't think so. Ric Edelman, Chairman & CEO of Edelman Financial Services, brings up another important point. According to him, 'most consumers are delegators and procrastinators, and need the advisor to get them to do what they know they need to do but won't do if left on their own'."

To get the other side of this story, I recently talked with Bo Lu (@bolu), one of the two co-founders of Future Advisor. Lu explained how the service works, where the data comes from and whether we should fear the dispassionate influence of our new robotic financial advisor overlords.

Where did the idea for Future Advisor come from?

Lu: The story behind Future Advisor is one of personal frustration. We started the company in 2010 when my co-founder and I were working at Microsoft. Our friends who had reached their mid-20s were really making money for the first time in their lives. They were now being asked to make decisions, such as "Where do I open an IRA? What do I do with my 401K?" As is often the case, they went to the friend who had the most experience, which in this case turned out to be me. So I said, "Well, let's just find you guys a good financial advisor and then we'll do this," because somehow in my mind, I thought, "Financial advisors do this."

It turned out that all of the financial advisors we found fell into two distinct classes. One was folks who were really nice but who essentially, in very kind words, said, "Maybe you'd be more comfortable at the lower-stakes table." We didn't meet any of their minimums. You needed a million dollars, or at least half a million, to get their services.

The other kinds of financial advisors who didn't have minimums immediately started trying to sell my friends term life insurance and annuities. I'm like, "These guys are 25. There's no reason for you to be doing this." Then I realized there was a misalignment of incentives there. We noticed that our friends were making a small set of the same mistakes over and over again, such as not having the right diversification for their age and their portfolio, or paying too much in mutual fund fees. Most people didn't understand that mutual funds charged fees and were not being tax efficient. We said, "Okay, this looks like a data problem that we can help solve for you guys." That's the genesis out of which Future Advisor was born.

What problem are you working on solving?

Bo Lu: Future Advisor is really trying to do one single thing: deliver on the vision that high-quality financial advice can be produced cheaply and, thus, be broadly accessible to everyone.

If you look at the current U.S. market of financial advisors and you multiply the number of financial advisors in the U.S. — which is roughly a quarter-million people — by what is generally accepted to be a full book of clients, you'll realize that even at full capacity, the U.S. advisor market can serve only about 11% of U.S. households.

In serving that 11% of U.S. households, the advisory market for retail investing makes about $20 billion. This is a classic market where a service is extremely expensive and, in being so, can only serve a small percentage of the addressable market. As we walked into this, we realized that we're part of something bigger. If you look back 60 years, a big problem was that everyone wanted a color television and they just weren't being manufactured quickly or cheaply enough. Manufacturing scale has caught up to us. Now, everything you want you generally can have because manufactured things are cheap. Creating services is still extremely expensive and non-scalable. Healthcare as a service, education as a service and, of course, financial advising come to mind. What we're doing is taking information technology, like computer science, to scale a service the way the electrical engineering of our forefathers scaled manufacturing.
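Lu's capacity argument is simple arithmetic. Here is a back-of-envelope sketch in Python: the advisor count comes from the interview, the household count is the rough U.S. Census ballpark, and the clients-per-full-book value is an assumption chosen for illustration rather than a figure Lu cites.

```python
# Back-of-envelope version of Lu's capacity argument. The advisor count comes
# from the interview ("roughly a quarter-million"); the household count is the
# approximate U.S. Census figure; clients_per_full_book is an assumed value for
# illustration -- the interview does not state the figure Lu used.
advisors = 250_000
households = 115_000_000          # approximate number of U.S. households
clients_per_full_book = 50        # assumption, households served per advisor at capacity

served = advisors * clients_per_full_book
print(f"Coverage at full capacity: {served / households:.0%}")  # roughly 11%
```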

How big is the team? How are you working together?

Bo Lu: The team has eight people in Seattle. It's almost exactly half finance and half engineering. We unabashedly have a bunch of engineers from MIT, which is where my co-founder went to school, essentially sucking the brains out of the finance team and putting them in software. It's really funny because a lot of the time when we design an algorithm, we actually just sit down and say, "Okay, let's look at a bunch of examples, see what the finance people's intuitive decisions are, and then try to encode them."

We rely heavily on the existing academic literature in both computational finance and economics because a lot of this work has been done. The interesting thing is that the knowledge is not the problem. The knowledge exists, and it's unequivocal in the things that are good for investors. Paying less in fees is good for investors. Being more tax efficient is good for investors. How to do that is relatively easy. What's hard for the industry for a long time has been to scalably apply those principles in a nuanced way to everybody's unique situation. That's something that software is uniquely good at doing.

How do you think about the responsibility of providing financial advice that traditionally has been offered by highly certified professionals who've taken exams, worked at banks, and are expensive to get to because of that professional experience?

Bo Lu: There's a couple of answers to that question, one of which is the folks on our team have the certifications that people look for. We've got certified financial advisors*, CFAs, which is a private designation on the team. We have math PhDs from the University of Washington on the team. The people who create the software are the caliber of people that you would want to be sitting down with you and helping you with your finances in the first place.

The second part of that is that we ourselves are a registered investment advisor. You'll see many websites that on the bottom say, "This is not intended to be financial advice." We don't say that. This is intended to be financial advice. We're registered federally with the SEC as a registered investment advisor and have passed all of the exams necessary.

*In the interview, Lu said that FutureAdvisor has "certified financial advisors." In this context, CFA stands for something else: the Future Advisor team includes Simon Moore, a chartered financial analyst, who advises the startup on the design of its investing algorithms.

Where does the financial data behind the site come from?

Bo Lu: From the consumer side, the site has only four steps, and they'll be familiar to anyone who's used a financial advisor before. A client signs up for the product, which is a free web service designed to help everyone. In step one, they answer a couple of questions about their personal situation: age, how much they make, when they want to retire. Then they're asked the kinds of questions that good financial advisors ask, such as questions about risk tolerance. Here, you start to see that we rely on academic work as much as possible.

There is a great set of work out of the University of Kentucky on risk tolerance questionnaires. Whereas most companies just use some questionnaire they came up with internally, we scoured the literature to find questions whose exact wordings have been tested and shown to yield statistically significant differences in measured risk tolerance. So we use those questions. With that information, the algorithm can then come up with a target portfolio allocation for the customer.
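To make that step concrete, here is a toy Python sketch of how questionnaire output might feed a target allocation. This is not Future Advisor's actual model; it simply combines the familiar "age in bonds" heuristic with an assumed 1-10 risk score from a questionnaire.

```python
# Toy target-allocation rule, NOT Future Advisor's model: it nudges the classic
# "age in bonds" baseline by a questionnaire-derived risk score
# (assume 1 = very conservative, 10 = very aggressive).
def target_allocation(age, risk_score):
    """Return a stock/bond split in percent."""
    stocks = 110 - age + 2 * (risk_score - 5)   # shift the baseline by risk tolerance
    stocks = max(20, min(95, stocks))           # keep the split within sane bounds
    return {"stocks_pct": stocks, "bonds_pct": 100 - stocks}

print(target_allocation(age=27, risk_score=7))  # {'stocks_pct': 87, 'bonds_pct': 13}
```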

In step two, the customer can synchronize or import data from their existing financial institutions into the software. We use Yodlee, which you've written about before. It's the same technology that Mint used to import detailed data about what you already hold in your 401K, in your IRA, and in all of your other investment accounts.

Step three is the dashboard. The dashboard shows your investments at a level that makes sense, unlike current brokerages, which, when you log in, tell you how much money you have, list the funds you hold, and show how much they've changed over the last 24 hours of trading. We answer four questions on the dashboard.

  1. Am I on track?
  2. Am I well-diversified for this goal?
  3. Am I overpaying in hidden fees in my mutual funds?
  4. Am I as tax efficient as I could be?

We answer those four questions, and then in the final step of the process, we give algorithmically generated, step-by-step instructions on how to improve your portfolio. This includes specific advice like "sell this many shares of Fund X to buy this many shares of Fund Y" in your IRA. With this help, consumers can go and clean up their portfolios. It's kind of like diagnosis and prescription for your portfolio.

There are three separate streams of data underlying the product. One is the Yodlee stream, which is detailed holdings data from hundreds of financial institutions. Two is data about what's in a fund. That comes from Morningstar. Morningstar, of course, gets it from the SEC because mutual funds are required to disclose this. So we can tell, for example, if a fund is an international fund or a domestic fund, what the fees are, and what it holds. The third is a dataset we have to bring in ourselves: 401K data from the Department of Labor.

On top of this triad of datasets sits our algorithm, which has undergone six to eight months of beta testing with customers. (We launched the product in March 2012.) That algorithm asks, "Okay, given these three datasets, what is the current state of your portfolio? What is the minimum number of moves to reduce both transaction costs and any capital gains that you might incur to get you from where you are to roughly where you need to be?" That's how the product works under the covers.
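A stripped-down sketch of that last step, in Python: Future Advisor's actual algorithm also weighs transaction costs and capital gains, which this toy version ignores. It only computes the dollar moves from the current holdings to an assumed target allocation; all names and numbers are invented.

```python
# Illustrative rebalancing sketch, not the production algorithm: it computes the
# dollar moves needed to reach a target allocation, ignoring transaction costs
# and capital-gains consequences.
def rebalance(holdings, target):
    """holdings: {asset: dollars}; target: {asset: fraction of portfolio, summing to 1}."""
    total = sum(holdings.values())
    moves = []
    for asset, fraction in target.items():
        delta = fraction * total - holdings.get(asset, 0.0)
        if abs(delta) > 1:  # ignore trivial differences
            action = "buy" if delta > 0 else "sell"
            moves.append(f"{action} ${abs(delta):,.0f} of {asset}")
    return moves

current = {"US stocks": 42_000, "Intl stocks": 3_000, "Bonds": 5_000}
goal = {"US stocks": 0.55, "Intl stocks": 0.25, "Bonds": 0.20}
print(rebalance(current, goal))
# ['sell $14,500 of US stocks', 'buy $9,500 of Intl stocks', 'buy $5,000 of Bonds']
```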

What's the business model?

Bo Lu: You can think of it as similar to Redfin. Redfin allows individual realtors to do more work by using algorithms to help them do all of the repetitive parts. Our product and the web service is free and will always be free. Information wants to be free. That's how we work in software. It doesn't cost us anything for an additional person to come and use the website.

The way that Future Advisor makes money is that we charge for advisor time. A small percentage of customers will have individual questions about their specific situation or want to talk to a human being and have them answer some questions. This is actually good in two ways.

One, it helps the transition from a purely human service to what we think will eventually be an almost purely digital service. People who are somewhere along that continuum of wanting someone to talk to but don't need someone full-time to talk to can still do that.

Two, those conversations are a great way for us to find out, in aggregate, what the things are that the software doesn't yet do or doesn't do well. Overall, if we take a ton of calls that are all the same, then it means there's an opportunity for the software to step in, scale that process, and help people who don't want to call us or who can't afford to call us to get that information.

What's the next step?

Bo Lu: This is a problem that has a dramatic possible impact attached to it. Personal investing, what the industry calls "retail investing," is a closed-loop system. Money goes in, and it's your money, and it stays there for a while. Then it comes out, and it's still your money. There's very little additional value creation by the financial advisory industry.

It may sound like I'm going out on a limb to say this, but it's generally accepted that the value creation of you and I putting our hard-earned money into the market is actually done by companies. Companies deploy that capital, they grow, and they return that capital in the form of higher stock prices or dividends, fueling the engine of our economic growth.

There are companies across the country and across the world adding value to people's lives. There's little to no value to be added by financial advisors trying to pick stocks. It's actually academically proven that there's negative value to be added there because it turns out the only people who make money are financial advisors.

This is a $20 billion market. But really what that means is that it's a $20 billion tax on individual American investors. If we're successful, we're going to reduce that $20 billion tax to a much smaller number by orders of magnitude. The money that's saved is kept by individual investors, and they keep more of what's theirs.

Because of the size of this market and the size of the possible impact, we are venture-backed; we can really change the world for the better if we're successful. There are a bunch of great folks in the Valley who have done a lot of work on money and on democratizing software and money tools.

What's the vision for the future of your startup?

Bo Lu: I was just reading your story about smart disclosure a little while ago. There's a great analogy in there that I think applies aptly to us. It's maps. The first maps were paper. Today if you look at the way a retail investor absorbs information, it's mostly paper. They get a prospectus in the mail. They have a bunch of disclosures they have to sign — and the paper is extremely hard to read. I don't know if you've ever tried to read a prospectus; it's something that very few of us enjoy. (I happen to be one of them, but I understand if not everyone's me.) They're extremely hard to parse.

Then we moved on to the digital age of folks taking the data embedded in those prospectuses and making them available. That was Morningstar, right? Now we're moving into the age of folks taking that data and mating it with other data, such as 401K data and your own personal financial holdings data, to make individual personalized recommendations. That's Future Advisor the way it is today.

But just as maps moved from paper to Google Maps, it didn't stop there. It is moving, and has moved, to autonomous cars. There will be a day when you and I don't ever have to look at a map because, rather than the map being a tool that helps me decide how to get somewhere, the map will be part of a service I use that just gets the job done. It gets me from point A to point B.

In finance, the job is to invest my money properly. Steward it so that it grows, so that it's there for me when I retire. That's our vision as well. We're going to move from being an information service to actually doing it for you. It's just a default way so that if you do nothing, your financial assets are well taken care of. That's what we think is the ultimate vision of this: Everything works beautifully and you no longer have to think about it.

We're now asked to make ridiculous decisions about spreading money between a checking account, an IRA, a savings account and a 401K, which really make no sense to most of us. The vision is to have one pot of money that invests itself correctly, that you put money into when you earn money. You take money out when you spend it. You don't have to make any decisions that you were never trained nor educated to make about your own personal finances because it just does the right thing. The self-driving car is our vision.

Connecting the future of personal finance with an autonomous car is an interesting perspective. Just as with outsourcing driving, however, there's the potential for negative outcomes. Do you have any concerns about the algorithm going awry?

Bo Lu: We are extremely cognizant of the weighty matters that we are working with here. We have a ton of testing that happens internally. You could even criticize us, as a software development firm, in that we're moving slower than other software development firms. We're not going to move as quickly as Twitter or Foursquare because, to be honest, if they mess up, it's not that big a deal. We're extremely careful about it.

At the same time, I think the Google self-driving car analogy is apt because people immediately say, "Well, what if the car gets into an accident?" Those kinds of fears exist in all fields that matter.


Analysis: Why this matters

"The analogy that comes to mind for me isn't the self-driving car," commented Mike Loukides, via email. "It's personalized medicine."

One of the big problems in health care is that to qualify treatments, we do testing over a very wide sample, and reject it if it doesn't work better than a placebo. But what about drugs that are 100% effective on 10% of the population, but 0% effective on 90%? They're almost certainly rejected. It strikes me that what Future Advisor is doing isn't so much helping you to go on autopilot, but getting beyond generic prescriptions and generating customized advice, just as a future MD might be able to do a DNA sequence in his office and generate a custom treatment.

The secret sauce for Future Advisor is the combination of personal data, open government data and proprietary algorithms. The key to realizing value, in this context, is combining multiple data streams with a user interface that's easy for a consumer to navigate. That combination has long been known by another name: It's a mashup. But the mashups of 2012 have something that those of 2002 didn't have, at least in volume or quality: data.

Future Advisor, Redfin (real estate) and Castlight (healthcare) are all interesting examples of entrepreneurs creating data products from democratized government data. Future Advisor uses data from consumers and the U.S. Department of Labor, Redfin synthesizes data from economists and government agencies, and Castlight uses health data from the U.S. Department of Health and Human Services. In each case, they provide a valuable service and/or product by making sense of that data deluge.

May 31 2012

Strata Week: MIT and Massachusetts bet on big data

Here are a few of the big data stories that caught my attention this week.

MIT makes a big data push

MIT unveiled its big data research plans this week with a new initiative: bigdata@csail. CSAIL is the university's Computer Science and Artificial Intelligence Laboratory. According to the initiative's website, the project will "identify and develop the technologies needed to solve the next generation data challenges which require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide."

The research will be funded in part by Intel, which will contribute $2.5 million per year for up to five years. As part of the announcement, Massachusetts Governor Deval Patrick added that his state was forming a Massachusetts Big Data initiative that would provide matching grants for big data research, something he hopes will make the state "well-known for big data research."

Cisco's predictions for the Internet

Cisco released its annual forecast for Internet networking. Not surprisingly, Cisco projects massive growth in networking, with annual global IP traffic reaching 1.3 zettabytes by 2016. "The projected increase of global IP traffic between 2015 and 2016 alone is more than 330 exabytes," according to the company's press release, "which is almost equal to the total amount of global IP traffic generated in 2011 (369 exabytes)."

Cisco points to a number of factors contributing to the explosion, including more Internet-connected devices, more users, faster Internet speeds, and more video.

Open data startup Junar raises funding

The Chilean data startup Junar announced this week that it had raised a seed round of funding. The startup is an open data platform with the goal of making it easy for anyone to collect, analyze, and publish data. GigaOm's Barb Darrow writes:

"Junar's Open Data Platform promises to make it easier for users to find the right data (regardless of its underlying format); enhance it with analytics; publish it; enable interaction with comments and annotation; and generate reports. Throughout the process it also lets user manage the workflow and track who has accessed and downloaded what, determine which data sets are getting the most traction etc."

Junar joins a number of open data startups and marketplaces that offer similar or related services, including Socrata and DataMarket.

Have data news to share?

Feel free to email me.

May 18 2012

Visualization of the Week: Urban metabolism

This week's visualization comes from PhD candidates David Quinn and Daniel Wiesmann, who've built an interactive web-mapping tool that lets you explore the "urban metabolism" of major U.S. cities. The map includes data on cities' and neighborhoods' energy use (kilowatt-hours per person) and material intensity (kilograms per person). You can also view population density.

Resource intensity map (click to see the full interactive version)

Quinn writes that "one of the objectives of this work is to share the results of our analysis. We would like to help provide better urban data to researchers." The map allows users to analyze information on the screen, draw out an area to analyze, compare multiple areas, and generate a report (downloadable as a PDF) with more details, including information about the specific data sources.

Quinn is a graduate student at MIT; Wiesmann is a PhD candidate at the Instituto Superior Técnico in Lisbon, Portugal.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.


May 15 2012

Profile of the Data Journalist: The Data News Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society. (You can learn more about this world and the emerging leaders of this discipline in the newly released "Data Journalism Handbook.")

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

John Keefe (@jkeefe) is a senior editor for data news and journalism technology at WNYC public radio, based in New York City, NY. He attracted widespread attention when an online map he built using available data beat the Associated Press with Iowa caucus results earlier this year. He's posted numerous tutorials and resources for budding data journalists, including how to map data onto county districts, use APIs, create news apps without a backend content management system and make election results maps. As you'll read below, Keefe is a great example of a journalist who picked up these skills from the data journalism community and the Hacks/Hackers group.

Our interview follows, lightly edited for content and clarity. (I've also added a Twitter list of data journalists from the New York Times' Jacob Harris.)

Where do you work now? What is a day in your life like?

I work in the middle of the WNYC newsroom -- quite literally. So throughout the day, I have dozens of impromptu conversations with reporters and editors about their ideas for maps and data projects, or answer questions about how to find or download data.

Our team works almost entirely on "news time," which means our creations hit the Web in hours and days more often than weeks and months. So I'm often at my laptop creating or tweaking maps and charts to go with online stories. That said, Wednesday mornings it's breakfast at a Chelsea cafe with collaborators at Balance Media to update each other on longer-range projects and tools we make for the newsroom and then open source, like Tabletop.js and our new vertical timeline.

Then there are key meetings, such as the newsroom's daily and weekly editorial discussions, where I look for ways to contribute and help. And because there's a lot of interest and support for data news at the station, I'm also invited to larger strategy and planning meetings.

How did you get started in data journalism? Did you get any special degrees or certificates?

I've been fascinated with the intersection of information, design and technology since I was a kid. In the last couple of years, I've marveled at what journalists at the New York Times, ProPublica and the Chicago Tribune were doing online. I thought the public radio audience, which includes a lot of educated, curious people, would appreciate such data projects at WNYC, where I was news director.

Then I saw that Aron Pilhofer of the New York Times would be teaching a programming workshop at the 2009 Online News Association annual meeting. I signed up. In preparation, I installed Django on my laptop and started following the beginner's tutorial on my subway commute. I made my first "Hello World!" web app on the A Train.

I also started hanging out at Hacks/Hackers meetups and hackathons, where I'd watch people code and ask questions along the way.

Some of my experimentation made it onto WNYC's website -- including our 2010 Census maps and the NYC Hurricane Evacuation map ahead of Hurricane Irene. Shortly thereafter, WNYC management asked me to focus on data news full-time.

Did you have any mentors? Who? What were the most important resources they shared with you?

I could not have done so much so fast without kindness, encouragement and inspiration from Pilhofer at the Times; Scott Klein, Al Shaw, Jennifer LaFleur and Jeff Larson at ProPublica; Chris Groskopf, Joe Germuska and Brian Boyer at the Chicago Tribune; and Jenny 8. Lee of, well, everywhere.

Each has unstuck me at various key moments and all have demonstrated in their own work what amazing things were possible. And they have put a premium on sharing what they know -- something I try to carry forward.

The moment I may remember most was at an afternoon geek talk aimed mainly at programmers. After seeing a demo of a phone app called Twilio, I turned to Al Shaw, sitting next to me, and lamented that I had no idea how to play with such things.

"You absolutely can do this," he said.

He encouraged me to pick up Sinatra, a surprisingly easy way to use the Ruby programming language. And I was off.

What does your personal data journalism "stack" look like? What tools could you not live without?

Google Maps - Much of what I can turn around quickly is possible because of Google Maps. I'm also experimenting with MapBox and Geocommons for more data-intensive mapping projects, like our NYC diversity map.

Google Fusion Tables - Essential for my wrangling, merging and mapping of data sets on the fly.

Google Spreadsheets - These have become the "backend" to many of our data projects, giving reporters and editors direct access to the data driving an application, chart or map. We wire them to our apps using Tabletop.js, an open-source program we helped to develop.

TextMate - A programmer's text editor for Mac. There are several out there, and some are free. TextMate is my fave.

The JavaScript Tools Bundle for TextMate - It checks my JavaScript code every time I save, flagging near-invisible, infuriating errors such as a stray comma or a missing parenthesis. I'm certain this one piece of software has given me more days with my kids.

Firebug for Firefox - Lets you see what your code is doing in the browser. Essential for troubleshooting CSS and JavaScript, and great for learning how the heck other people make cool stuff.

Amazon S3 - Most of what we build are static pages of HTML and JavaScript, which we host in the Amazon cloud and embed into article pages on our CMS.

census.ire.org - A fabulous, easy-to-navigate presentation of US Census data made by a bunch of journo-programmers for Investigative Reporters and Editors. I send someone there probably once a week.

What data journalism project are you the most proud of working on or creating?

I'd have to say our GOP Iowa Caucuses feature. It has several qualities I like:

  • Mashed-up data -- It mixes live county vote results with Patchwork Nation community types.
  • A new take -- We knew other news sites would shade Iowa's counties by the winner; we shaded them by community type and showed who won which categories.
  • Complete sharability -- We made it super-easy for anyone to embed the map into their own site, which was possible because the results came license-free from the state GOP via Google.
  • Key code from another journalist -- The map-rollover coolness comes from code built by Albert Sun, then of the Wall Street Journal and now at the New York Times.
  • Rapid learning -- I taught myself a LOT of JavaScript quickly.
  • Reusability -- We reused the approach for each state's contest until Santorum bowed out.


Bonus: I love that I made most of it sitting at my mom's kitchen table over winter break.

Where do you turn to keep your skills updated or learn new things?

WNYC's editors and reporters. They have the bug, and they keep coming up with new and interesting projects. And I find project-driven learning is the most effective way to discover new things. New York Public Radio -- which runs WNYC along with classical radio station WQXR, New Jersey Public Radio and a street-level performance space -- also has a growing stable of programmers and designers, who help me build things, teach me amazing tricks and spot my frequent mistakes.

The IRE/NICAR annual conference. It's a meetup of the best journo-programmers in the country, and it truly seems each person is committed to helping others learn. They're also excellent at celebrating the successes of others.

Twitter. I follow a bunch of folks who seem to tweet the best stuff, and try to keep a close eye on 'em.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Candidates, companies, municipalities, agencies and non-profit organizations all are using data. And a lot of that data is about you, me and the people we cover.

So first off, journalism needs an understanding of the data available and what it can do. It's just part of covering the story now. To skip that part of the world would shortchange our audience, and our democracy. Really.

And the better we can both present data to the general public and tell data-driven (or -supported) stories with impact, the better journalism we can do.

May 07 2012

A brief history of data journalism

The following is an excerpt from "The Data Journalism Handbook," a collection of essays and resources covering the growing field of data journalism.


In August 2010 some colleagues and I organised what we believe was one of the first international "data journalism" conferences, which took place in Amsterdam. At this time there wasn't a great deal of discussion around this topic and there were only a couple of organizations that were widely known for their work in this area.

The way that media organizations like the Guardian and the New York Times handled the large amounts of data released by WikiLeaks was one of the major steps that brought the term into prominence. Around that time the term started to enter more widespread usage, alongside "computer-assisted reporting," to describe how journalists were using data to improve their coverage and to augment in-depth investigations into a given topic.

Speaking to experienced data journalists and journalism scholars on Twitter, it seems that one of the earliest formulations of what we now recognise as data journalism was in 2006 by Adrian Holovaty, founder of EveryBlock — an information service which enables users to find out what has been happening in their area, on their block. In his short essay "A fundamental way newspaper sites need to change," he argues that journalists should publish structured, machine-readable data, alongside the traditional "big blob of text":

"For example, say a newspaper has written a story about a local fire. Being able to read that story on a cell phone is fine and dandy. Hooray, technology! But what I really want to be able to do is explore the raw facts of that story, one by one, with layers of attribution, and an infrastructure for comparing the details of the fire — date, time, place, victims, fire station number, distance from fire department, names and years experience of firemen on the scene, time it took for firemen to arrive — with the details of previous fires. And subsequent fires, whenever they happen."

But what makes this distinctive from other forms of journalism which use databases or computers? How — and to what extent — is data journalism different from other forms of journalism from the past?

"Computer-Assisted Reporting" and "Precision Journalism"

Using data to improve reportage and delivering structured (if not machine readable) information to the public has a long history. Perhaps most immediately relevant to what we now call data journalism is "computer-assisted reporting" or "CAR," which was the first organised, systematic approach to using computers to collect and analyze data to improve the news.

CAR was first used in 1952 by CBS to predict the result of the presidential election. Since the 1960s, (mainly investigative, mainly U.S.-based) journalists have sought to independently monitor power by analyzing databases of public records with scientific methods. Also known as "public service journalism," advocates of these computer-assisted techniques have sought to reveal trends, debunk popular knowledge and reveal injustices perpetrated by public authorities and private corporations. For example, Philip Meyer tried to debunk received readings of the 1967 riots in Detroit — to show that it was not just less-educated Southerners who were participating. Bill Dedman's "The Color of Money" stories in the 1980s revealed systemic racial bias in the lending policies of major financial institutions. In his "What Went Wrong," Steve Doig analyzed the damage patterns from Hurricane Andrew in the early 1990s to understand the effect of flawed urban development policies and practices. Data-driven reporting of this kind has delivered valuable public service and has won journalists prestigious prizes.

In the early 1970s the term "precision journalism" was coined to describe this type of news-gathering: "the application of social and behavioral science research methods to the practice of journalism." Precision journalism was envisioned to be practiced in mainstream media institutions by professionals trained in journalism and social sciences. It was born in response to "new journalism," a form of journalism in which fiction techniques were applied to reporting. Meyer suggests that scientific techniques of data collection and analysis rather than literary techniques are what is needed for journalism to accomplish its search for objectivity and truth.

Precision journalism can be understood as a reaction to some of journalism's commonly cited inadequacies and weaknesses: dependence on press releases (later described as "churnalism"), bias towards authoritative sources, and so on. These are seen by Meyer as stemming from a lack of application of information science techniques and scientific methods such as polls and public records. As practiced in the 1960s, precision journalism was used to represent marginal groups and their stories. According to Meyer:

"Precision journalism was a way to expand the tool kit of the reporter to make topics that were previously inaccessible, or only crudely accessible, subject to journalistic scrutiny. It was especially useful in giving a hearing to minority and dissident groups that were struggling for representation."

An influential article published in the 1980s about the relationship between journalism and social science echoes current discourse around data journalism. The authors, two U.S. journalism professors, suggest that in the 1970s and 1980s the public's understanding of what news is broadens from a narrower conception of "news events" to "situational reporting," or reporting on social trends. By using databases of — for example — census data or survey data, journalists are able to "move beyond the reporting of specific, isolated events to providing a context which gives them meaning."

As we might expect, the practice of using data to improve reportage goes back as far as "data" has been around. As Simon Rogers points out, the first example of data journalism at the Guardian dates from 1821. It is a leaked table of schools in Manchester listing the number of students who attended each one and the costs per school. According to Rogers, this helped to show for the first time the real number of students receiving free education, which was much higher than official figures suggested.

Data Journalism in the Guardian in 1821
Data Journalism in the Guardian in 1821 (The Guardian)

Another early example in Europe is Florence Nightingale and her key report, "Mortality of the British Army," published in 1858. In her report to the parliament she used graphics to advocate improvements in health services for the British army. The most famous is her "coxcomb," a spiral of sections, each representing deaths per month, which highlighted that the vast majority of deaths were from preventable diseases rather than bullets.

Mortality of the British Army by Florence Nightingale
Mortality of the British Army by Florence Nightingale (Image from Wikipedia)

Data journalism and Computer-Assisted Reporting

At the moment there is a "continuity and change" debate going on around the label "data journalism" and its relationship with these previous journalistic practices which employ computational techniques to analyze datasets.

Some argue that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense data journalism pays as much — and sometimes more — attention to the data itself, rather than using data simply as a means to find or enhance stories. Hence we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.

Another difference is that in the past investigative reporters would suffer from a poverty of information relating to a question they were trying to answer or an issue that they were trying to address. While this is of course still the case, there is also an overwhelming abundance of information that journalists don't necessarily know what to do with. They don't know how to get value out of data. A recent example is the Combined Online Information System, the U.K.'s biggest database of spending information — which was long sought after by transparency advocates, but which baffled and stumped many journalists upon its release. As Philip Meyer recently wrote to me: "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important."

On the other hand, some argue that there is no meaningful difference between data journalism and computer-assisted reporting. It is by now common sense that even the most recent media practices have histories, as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition, but responding to new circumstances and conditions. Even if there is no difference in goals and techniques, the emergence of the label "data journalism" at the beginning of the century indicates a new phase, in which the sheer volume of data freely available online — combined with sophisticated user-centric tools for self-publishing and crowdsourcing — enables more people to work with more data more easily than ever before.

Data journalism is about mass data literacy

Digital technologies and the web are fundamentally changing the way information is published. Data journalism is one part of the ecosystem of tools and practices that have sprung up around data sites and services. Quoting and sharing source materials is in the nature of the hyperlink structure of the web and the way we are accustomed to navigating information today. Going further back, the principle at the foundation of the hyperlinked structure of the web is the citation principle used in academic works. Quoting and sharing the source materials and the data behind the story is one of the basic ways in which data journalism can improve journalism, what WikiLeaks founder Julian Assange calls "scientific journalism."

By enabling anyone to drill down into data sources and find information that is relevant to them, as well as to verify assertions and challenge commonly received assumptions, data journalism effectively represents the mass democratisation of resources, tools, techniques and methodologies that were previously used by specialists — whether investigative reporters, social scientists, statisticians, analysts or other experts. While quoting and linking to data sources is currently particular to data journalism, we are moving towards a world in which data is seamlessly integrated into the fabric of media. Data journalists have an important role in helping to lower the barriers to understanding and interrogating data, and in increasing the data literacy of their readers on a mass scale.

At the moment the nascent community of people who call themselves data journalists is largely distinct from the more mature CAR community. Hopefully in the future we will see stronger ties between these two communities, in much the same way that we see new NGOs and citizen media organizations like ProPublica and the Bureau of Investigative Journalism working hand in hand with traditional news media on investigations. While the data journalism community may have more innovative ways of delivering data and presenting stories, the deeply analytical and critical approach of the CAR community is something that data journalism could certainly learn from.

This excerpt was lightly edited. Links were added for EveryBlock, the Guardian Datablog, Texas Tribune datasets, the Combined Online Information System, and Julian Assange's reference to "scientific journalism."

The Data Journalism Handbook (Early Release) — This collaborative book aims to answer questions like: Where can I find data? What tools can I use? How can I find stories in data? (The digital Early Release edition includes raw and unedited content. You'll receive updates when significant changes are made, as well as the final ebook version.)




April 16 2012

What it takes to build great machine learning products

Machine learning (ML) is all the rage, riding tight on the coattails of the "big data" wave. Like most technology hype, the enthusiasm far exceeds the realization of actual products. Arguably, not since Google's tremendous innovations in the late '90s/early 2000s has algorithmic technology led to a product that has permeated the popular culture. That's not to say there haven't been great ML wins since, but none has been as impactful or had computational algorithms so squarely at its core. Netflix may use recommendation technology, but Netflix is still Netflix without it. There would be no Google if Page, Brin, et al., hadn't exploited the graph structure of the web and anchor text to improve search.

So why is this? It's not for lack of trying. How many startups have aimed to bring natural language processing (NLP) technology to the masses, only to fade into oblivion after people actually try their products? The challenge in building great products with ML lies not just in understanding basic ML theory, but in understanding the domain and problem sufficiently to operationalize intuitions into model design. Interesting problems don't have simple off-the-shelf ML solutions. Progress in important ML application areas, like NLP, comes from insights specific to these problems, rather than generic ML machinery. Often, specific insights into a problem and careful model design make the difference between a system that doesn't work at all and one that people will actually use.

The goal of this essay is not to discourage people from building amazing products with ML at their cores, but to be clear about where I think the difficulty lies.

Progress in machine learning

Machine learning has come a long way over the last decade. Before I started grad school, training a large-margin classifier (e.g., an SVM) was done via John Platt's batch SMO algorithm. In that case, training time scaled poorly with the amount of training data. Writing the algorithm itself required understanding quadratic programming and was riddled with heuristics for selecting active constraints and black-art parameter tuning. Now, we know how to train a nearly performance-equivalent large-margin classifier in linear time using a (relatively) simple online algorithm (PDF). Similar strides have been made in (probabilistic) graphical models: Markov chain Monte Carlo (MCMC) and variational methods have facilitated inference for arbitrarily complex graphical models [1]. Anecdotally, take a look at papers over the last eight years in the proceedings of the Association for Computational Linguistics (ACL), the premier natural language processing publication. A top paper from 2011 has orders of magnitude more technical ML sophistication than one from 2003.
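
To make the "simple online algorithm" point concrete, here is a minimal sketch of a Pegasos-style stochastic subgradient update for a linear hinge-loss classifier. It illustrates the general idea rather than the exact algorithm in the linked paper, and the toy data, step-size schedule and regularization constant are arbitrary choices.

import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=10, seed=0):
    """Pegasos-style online training of a linear hinge-loss classifier.
    X: (n_samples, n_features) array; y: labels in {-1, +1}.
    Each update touches a single example, so cost grows linearly with data size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            if y[i] * (w @ X[i]) < 1:          # hinge loss is active: step toward the example
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                              # otherwise just shrink (regularize)
                w = (1 - eta * lam) * w
    return w

# Toy usage: two linearly separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))

The whole training loop fits in a handful of lines, which is roughly the shift from batch solvers that the paragraph above describes.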

On the education front, we've come a long way as well. As an undergrad at Stanford in the early-to-mid 2000s, I took Andrew Ng's ML course and Daphne Koller's probabilistic graphical model course. Both of these classes were among the best I took at Stanford and were only available to about 100 students a year. Koller's course in particular was not only the best course I took at Stanford, but the one that taught me the most about teaching. Now, anyone can take these courses online.

As an applied ML person — specifically, one working in natural language processing — I've found that much of this progress has made aspects of research significantly easier. However, the core decisions I make are not which abstract ML algorithm, loss-function, or objective to use, but what features and structure are relevant to solving my problem. This skill only comes with practice. So, while it's great that a much wider audience will have an understanding of basic ML, it's not the most difficult part of building intelligent systems.

Interesting problems are never off the shelf

The interesting problems that you'd actually want to solve are far messier than the abstractions used to describe standard ML problems. Take machine translation (MT), for example. Naively, MT looks like a statistical classification problem: You get an input foreign sentence and have to predict a target English sentence. Unfortunately, because the space of possible English is combinatorially large, you can't treat MT as a black-box classification problem. Instead, like most interesting ML applications, MT problems have a lot of structure and part of the job of a good researcher is decomposing the problem into smaller pieces that can be learned or encoded deterministically. My claim is that progress in complex problems like MT comes mostly from how we decompose and structure the solution space, rather than ML techniques used to learn within this space.

Machine translation has improved by leaps and bounds throughout the last decade. I think this progress has largely, but not entirely, come from keen insights into the specific problem, rather than generic ML improvements. Modern statistical MT originates from an amazing paper, "The mathematics of statistical machine translation" (PDF), which introduced the noisy-channel architecture on which future MT systems would be based. At a very simplistic level, this is how the model works [2]: For each foreign word, there are potential English translations (including the null word for foreign words that have no English equivalent). Think of this as a probabilistic dictionary. These candidate translation words are then re-ordered to create a plausible English translation. There are many intricacies being glossed over: how to efficiently consider candidate English sentences and their permutations, what model is used to learn the systematic ways in which reordering occurs between languages, and the details about how to score the plausibility of the English candidate (the language model).
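
A deliberately tiny sketch may help fix the noisy-channel idea: a hand-written "probabilistic dictionary" and a toy bigram language model score candidate English outputs for a three-word German sentence, brute-forcing the reordering step. Every probability, vocabulary entry and the exhaustive search are illustrative assumptions; real systems use learned models and much smarter search.

from itertools import product, permutations

# Toy "probabilistic dictionary" P(english | foreign); numbers are made up.
t_table = {
    "ich":  {"i": 0.9},
    "sehe": {"see": 0.8, "look": 0.2},
    "dich": {"you": 0.9},
}

# Toy bigram language model P(next | prev), with "<s>" as the start symbol.
lm = {
    ("<s>", "i"): 0.5, ("i", "see"): 0.4, ("see", "you"): 0.5,
    ("<s>", "you"): 0.2, ("you", "see"): 0.1, ("see", "i"): 0.1,
    ("i", "look"): 0.1, ("look", "you"): 0.2,
}

def lm_score(words, floor=1e-4):
    score, prev = 1.0, "<s>"
    for w in words:
        score *= lm.get((prev, w), floor)
        prev = w
    return score

def translate(foreign):
    """Brute-force decoding: channel model (word translations) times language model."""
    best, best_score = None, 0.0
    options = [list(t_table[f].items()) for f in foreign]
    for choice in product(*options):                 # one translation per foreign word
        words = [eng for eng, _ in choice]
        trans_prob = 1.0
        for _, p in choice:
            trans_prob *= p
        for order in permutations(words):            # consider every reordering
            score = trans_prob * lm_score(order)
            if score > best_score:
                best, best_score = order, score
    return " ".join(best), best_score

print(translate(["ich", "sehe", "dich"]))            # -> ('i see you', ...)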

The core improvement in MT came from changing this model: rather than learning translation probabilities for individual words, systems learn models of how to translate foreign phrases to English phrases. For instance, the German word "abends" translates roughly to the English prepositional phrase "in the evening." Before phrase-based translation (PDF), a word-based model would only get to translate to a single English word, making it unlikely to arrive at the correct English translation [3]. Phrase-based translation generally results in more accurate translations with fluid, idiomatic English output. Of course, adding phrase-based emissions introduces several additional complexities, including how to estimate phrase emissions given that we never observe phrase segmentation; no one tells us that "in the evening" is a phrase that should match up to some foreign phrase. What's surprising here is that it isn't general ML improvements that are making this difference, but problem-specific model design. People can and have implemented more sophisticated ML techniques for various pieces of an MT system. And these do yield improvements, but typically far smaller than good problem-specific research insights.
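
The word-versus-phrase distinction is easy to see in miniature. In the sketch below, a hypothetical phrase table lets "abends" emit the multi-word phrase "in the evening" directly, which a strictly word-based table cannot do. The table entries, scores and the greedy longest-match decoding are all made up for illustration.

# A word-based table maps each foreign word to single English words.
word_table = {"abends": [("evening", 0.5), ("tonight", 0.3)]}

# A phrase table may map a foreign phrase to an English phrase of any length.
phrase_table = {
    ("abends",): [("in the evening", 0.6), ("at night", 0.3)],
    ("heute", "abends"): [("this evening", 0.7)],
}

def best_phrase_translation(foreign_tokens):
    """Greedy left-to-right decoding over the toy phrase table,
    trying the longest matching foreign phrase first. A sketch, not a real decoder."""
    out, i = [], 0
    while i < len(foreign_tokens):
        for span in range(len(foreign_tokens) - i, 0, -1):
            key = tuple(foreign_tokens[i:i + span])
            if key in phrase_table:
                english, _ = max(phrase_table[key], key=lambda kv: kv[1])
                out.append(english)
                i += span
                break
        else:
            out.append(foreign_tokens[i])    # pass unknown words through untranslated
            i += 1
    return " ".join(out)

print(best_phrase_translation(["heute", "abends"]))  # -> "this evening"
print(best_phrase_translation(["abends"]))           # -> "in the evening"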

Franz Och, one of the authors of the original phrase-based translation papers, went on to Google and became the principal person behind the search company's translation efforts. While the intellectual underpinnings of Google's system go back to Och's days as a research scientist at the Information Sciences Institute (and earlier as a graduate student), much of the gain beyond the insights underlying phrase-based translation (and minimum error-rate training, another of Och's innovations) came from a massive software engineering effort to scale these ideas to the web. That effort itself yielded impressive research into large-scale language models and other areas of NLP. It's important to note that Och, in addition to being a world-class researcher, is also, by all accounts, an incredibly impressive hacker and builder. It's this rare combination of skills that can bring ideas all the way from a research project to where Google Translate is today.

Defining the problem

But I think there's an even bigger barrier beyond ingenious model design and engineering skills. In the case of machine translation and speech recognition, the problem being solved is straightforward to understand and well-specified. Many of the NLP technologies that I think will revolutionize consumer products over the next decade are much vaguer. How, exactly, can we take the excellent research in structured topic models, discourse processing, or sentiment analysis and make a mass-appeal consumer product?

Consider summarization. We all know that in some way, we'll want products that summarize and structure content. However, for computational and research reasons, you need to restrict the scope of this problem to something for which you can build a model and an algorithm, and that you can ultimately evaluate. For instance, in the summarization literature, the problem of multi-document summarization is typically formulated as selecting a subset of sentences from the document collection and ordering them. Is this the right problem to be solving? Is the best way to summarize a piece of text a handful of full-length sentences? Even if a summary is accurate, does the Franken-sentence structure yield summaries that feel inorganic to users?
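
The sentence-selection formulation itself fits in a few lines. Here is a toy frequency-based extractive summarizer in that spirit; the scoring scheme is an illustrative stand-in for the far richer models in the literature, and it makes the "Franken-sentence" concern easy to see, since the output is just original sentences stitched together.

import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the average corpus frequency of its words,
    then return the top-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in chosen)

doc = ("The council approved the new budget on Tuesday. "
       "The budget increases school funding by ten percent. "
       "Several council members praised the school funding increase. "
       "A local festival was also discussed briefly.")
print(extractive_summary(doc))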

Or, consider sentiment analysis. Do people really just want a coarse-grained thumbs-up or thumbs-down on a product or event? Or do they want a richer picture of sentiments toward individual aspects of an item (e.g., loved the food, hated the decor)? Do people care about determining sentiment attitudes of individual reviewers/utterances, or producing an accurate assessment of aggregate sentiment?
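
As a contrast to a single thumbs-up/thumbs-down score, here is a toy sketch of aspect-level sentiment: split a review into clauses and credit each aspect word with the opinion words around it. The aspect list, lexicon and clause-splitting rule are crude illustrative assumptions, not a serious system.

import re

ASPECTS = {"food", "decor", "service", "price"}
LEXICON = {"loved": 1, "great": 1, "delicious": 1, "hated": -1, "awful": -1, "slow": -1}

def aspect_sentiment(review):
    """Split the review into clauses and credit each aspect word
    with the net polarity of the opinion words in the same clause."""
    scores = {}
    for clause in re.split(r",|\band\b|\bbut\b|\.", review.lower()):
        tokens = re.findall(r"[a-z]+", clause)
        polarity = sum(LEXICON.get(t, 0) for t in tokens)
        for aspect in (t for t in tokens if t in ASPECTS):
            scores[aspect] = scores.get(aspect, 0) + polarity
    return scores

print(aspect_sentiment("Loved the food, hated the decor, and the service was slow."))
# -> {'food': 1, 'decor': -1, 'service': -1}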

Typically, these decisions are made by a product person and are passed off to researchers and engineers to implement. The problem with this approach is that ML-core products are intimately constrained by what is technically and algorithmically feasible. In my experience, having a technical understanding of the range of related ML problems can inspire product ideas that might not occur to someone without this understanding. To draw a loose analogy, it's like architecture. So much of the construction of a bridge is constrained by material resources and physics that it doesn't make sense to have people without that technical background design a bridge.

The goal of all this is to say that if you want to build a rich ML product, you need a rich product/design/research/engineering team: everything from the nitty-gritty of how ML theory works, to systems building, to domain knowledge, to higher-level product thinking, to interaction and graphic design; preferably with people who are world-class in one of these areas and good in several. Small, talented teams with all of these skills are better equipped to navigate the joint uncertainty of product vision and model design. Large companies that keep research and product people in entirely different buildings are ill-equipped to tackle these kinds of problems. The ML products of the future will come from startups with small founding teams that have this full context and can all fit in the proverbial garage.


[1]: Although MCMC is a much older statistical technique, its broad use in large-scale machine learning applications is relatively recent.

[2]: The model is generative, so what's being described here is from the point-of-view of inference; the model's generative story works in reverse.

[3]: IBM model 3 introduced the concept of fertility to allow a given word to generate multiple independent target translation words. While this could generate the required translation, the probability of the model doing so is relatively low.


April 05 2012

Data as seeds of content

Despite the attention big data has received in the media and among the technology community, it is surprising that we are still shortchanging the full capabilities of what data can do for us. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: to derive meaningful insights that can help us make informed decisions and take action to improve our businesses and our lives.

I recently spoke on the topic of automating content at the O'Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion's share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I'll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.

To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that — visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is the quality of visualizations in the real world varies dramatically. Even for the visualizations that do make intuitive sense, they often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won't be able to gain anything useful from it because they don't know how to interpret the information.



To be clear, I'm a big fan of visualizations, but they are not the end-all of data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds of content: data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can sprout as large a content tree as your imagination will allow.

Below, I describe each limb of the content tree. The examples I cite are sports related because that's what we've primarily focused on at my company, Automated Insights. But we've done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.

Long-form content

By long-form, I mean three or more paragraphs — although it could be several pages or even book length — that use human-readable language to reveal key trends, records and deltas in data. This is the hardest form of content to automate, but technology in this space is rapidly improving. For example, here is a recap of an NFL game generated out of box score and play-by-play data.

A long-form sports recap driven by data. See the full story.

Short-form content

These are bullets, headlines, and tweets of insights that can boil a huge dataset into very actionable bits of language. For example, here is a game notes article that was created automatically out of an NCAA basketball box score and historical stats.
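
To give a rough sense of how structured sports data becomes sentences, here is a toy template-based generator that turns a tiny box score into a one-line game note. The box-score fields, thresholds and templates are invented for illustration and are not Automated Insights' actual system.

def game_note(box):
    """Turn a tiny box score dict into a one-sentence game note."""
    margin = box["home_score"] - box["away_score"]
    winner, loser = (box["home"], box["away"]) if margin > 0 else (box["away"], box["home"])
    top = max(box["players"], key=lambda p: p["points"])

    if abs(margin) >= 20:
        verb = "rolled past"
    elif abs(margin) <= 3:
        verb = "edged"
    else:
        verb = "beat"

    high, low = max(box["home_score"], box["away_score"]), min(box["home_score"], box["away_score"])
    return f"{winner} {verb} {loser} {high}-{low}, led by {top['name']}'s {top['points']} points."

box = {
    "home": "UNC", "away": "Duke", "home_score": 88, "away_score": 70,
    "players": [{"name": "Smith", "points": 27}, {"name": "Jones", "points": 15}],
}
print(game_note(box))   # -> "UNC beat Duke 88-70, led by Smith's 27 points."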

Mobile and social content

We've done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.

Metrics

By metrics, I'm referring to the process of creating a single number that's representative of a larger dataset. Metrics are shortcuts to boil data into something easier to understand. For instance, we've created metrics for various sports, such as a quarterback ranking system that's based on player performance.
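
In spirit, a metric like this collapses several performance columns into one number with a weighted formula. The sketch below uses purely hypothetical stats and weights; it is not the actual ranking system mentioned above, just an illustration of the idea.

def qb_score(stats):
    """Collapse a quarterback's game line into a single number.
    Weights and the 0-100 clamp are arbitrary illustrative choices."""
    raw = (0.05 * stats["pass_yards"]
           + 4.0 * stats["touchdowns"]
           - 5.0 * stats["interceptions"]
           + 20.0 * stats["completions"] / stats["attempts"])
    return round(max(0.0, min(100.0, raw)), 1)

print(qb_score({"pass_yards": 310, "touchdowns": 3, "interceptions": 1,
                "completions": 24, "attempts": 33}))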

Real-time updates

Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.

Content applications

This is a format few people consider, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, an app that allows sports fans to discover 10-20+ statistically based "slams" they can use to talk trash about any team.

A variation on visualizations

Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted, text-based insights with the graphical visualization, allowing them to work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series. It shows the ebb and flow of the game.

Play-by-play win probability from game seven of the 2011 World Series.

What now?

As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.



April 03 2012

Data's next steps

Steve O'Grady (@sogrady), a developer-focused analyst from RedMonk, views large-scale data collection and aggregation as a problem that has largely been solved. The tools and techniques required for the Googles and Facebooks of the world to handle what he calls "datasets of extraordinary sizes" have matured. In O'Grady's analysis, what hasn't matured are methods for teasing meaning out of this data that are accessible to "ordinary users."

Among the other highlights from our interview:

  • O'Grady on the challenge of big data: "Kevin Weil (@kevinweil) from Twitter put it pretty well, saying that it's hard to ask the right question. One of the implications of that statement is that even if we had perfect access to perfect data, it's very difficult to determine what you would want to ask, how you would want to ask it. More importantly, once you get that answer, what are the questions that derive from that?"
  • O'Grady on the scarcity of data scientists: "The difficulty for basically every business on the planet is that there just aren't many of these people. This is, at present anyhow, a relatively rare skill set and therefore one that the market tends to place a pretty hefty premium on."
  • O'Grady on the reasons for using NoSQL: "If you are going down the NoSQL route for the sake of going down the NoSQL route, that's the wrong way to do things. You're likely to end up with a solution that may not even improve things. It may actively harm your production process moving forward because you didn't implement it for the right reasons in the first place."

The full interview is embedded below; a complete transcript is also available.



March 30 2012

Top Stories: March 26-30, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Designing great data products
Data scientists need a systematic design process to build increasingly sophisticated products. That's where the Drivetrain Approach comes in. (This report is also available as a free ebook.)


Five tough lessons I had to learn about health care
Despite the disappointments Andy Oram has experienced while learning about health care, he expects the system to change for the better.


A huge competitive advantage awaits bold publishers
"The Lean Startup" author Eric Ries talks about his experiences working with traditional publishing structures and how they can benefit from lean startup principles.

The Reading Glove engages senses and objects to tell a story
What if you mashed up a non-linear narrative, a tangible computing environment and a hint of a haunted house experience? You might get the Reading Glove, a novel way to experience a story.


Passwords and interviews
A candidate who forks over a social media password during an interview could become an employee who gives out a password in other situations. Employers aren't making that connection.




March 28 2012

Designing great data products

By Jeremy Howard, Margit Zwemer and Mike Loukides

In the past few years, we've seen many data products based on predictive modeling. These products range from weather forecasting to recommendation engines to services that predict airline flight times more accurately than the airline itself. But these products are still just making predictions, rather than asking what action they want someone to take as a result of a prediction. Prediction technology can be interesting and mathematically elegant, but we need to take the next step. The technology exists to build data products that can revolutionize entire industries. So, why aren't we building them?

To jump-start this process, we suggest a four-step approach that has already transformed the insurance industry. We call it the Drivetrain Approach, inspired by the emerging field of self-driving vehicles. Engineers start by defining a clear objective: They want a car to drive safely from point A to point B without human intervention. Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing. Someone using Google's self-driving car is completely unaware of the hundreds (if not thousands) of models and the petabytes of data that make it work. But as data scientists build increasingly sophisticated products, they need a systematic design approach. We don't claim that the Drivetrain Approach is the best or only method; our goal is to start a dialog within the data science and business communities to advance our collective vision.

Objective-based data products

We are entering the era of data as drivetrain, where we use data not just to generate more data (in the form of predictions), but use data to produce actionable outcomes. That is the goal of the Drivetrain Approach. The best way to illustrate this process is with a familiar data product: search engines. Back in 1997, AltaVista was king of the algorithmic search world. While their models were good at finding relevant websites, the answer the user was most interested in was often buried on page 100 of the search results. Then, Google came along and transformed online search by beginning with a simple question: What is the user's main objective in typing in a search query?

The four steps in the Drivetrain Approach.

Google realized that the objective was to show the most relevant search result; for other companies, it might be increasing profit, improving the customer experience, finding the best path for a robot, or balancing the load in a data center. Once we have specified the goal, the second step is to specify what inputs of the system we can control, the levers we can pull to influence the final outcome. In Google's case, they could control the ranking of the search results. The third step was to consider what new data they would need to produce such a ranking; they realized that the implicit information regarding which pages linked to which other pages could be used for this purpose. Only after these first three steps do we begin thinking about building the predictive models. Our objective and available levers, what data we already have and what additional data we will need to collect, determine the models we can build. The models will take both the levers and any uncontrollable variables as their inputs; the outputs from the models can be combined to predict the final state for our objective.

Step 4 of the Drivetrain Approach for Google is now part of tech history: Larry Page and Sergey Brin invented the graph traversal algorithm PageRank and built an engine on top of it that revolutionized search. But you don't have to invent the next PageRank to build a great data product. We will show a systematic approach to step 4 that doesn't require a PhD in computer science.

The Model Assembly Line: A case study of Optimal Decisions Group

Optimizing for an actionable outcome over the right predictive models can be a company's most important strategic decision. For an insurance company, policy price is the product, so an optimal pricing model is to them what the assembly line is to automobile manufacturing. Insurers have centuries of experience in prediction, but as recently as 10 years ago, the insurance companies often failed to make optimal business decisions about what price to charge each new customer. Their actuaries could build models to predict a customer's likelihood of being in an accident and the expected value of claims. But those models did not solve the pricing problem, so the insurance companies would set a price based on a combination of guesswork and market studies.

This situation changed in 1999 with a company called Optimal Decisions Group (ODG). ODG approached this problem with an early use of the Drivetrain Approach and a practical take on step 4 that can be applied to a wide range of problems. They began by defining the objective that the insurance company was trying to achieve: setting a price that maximizes the net-present value of the profit from a new customer over a multi-year time horizon, subject to certain constraints such as maintaining market share. From there, they developed an optimized pricing process that added hundreds of millions of dollars to the insurers' bottom lines. [Note: Co-author Jeremy Howard founded ODG.]

ODG identified which levers the insurance company could control: what price to charge each customer, what types of accidents to cover, how much to spend on marketing and customer service, and how to react to their competitors' pricing decisions. They also considered inputs outside of their control, like competitors' strategies, macroeconomic conditions, natural disasters, and customer "stickiness." They considered what additional data they would need to predict a customer's reaction to changes in price. It was necessary to build this dataset by randomly changing the prices of hundreds of thousands of policies over many months. While the insurers were reluctant to conduct these experiments on real customers, as they'd certainly lose some customers as a result, they were swayed by the huge gains that optimized policy pricing might deliver. Finally, ODG started to design the models that could be used to optimize the insurer's profit.

Drivetrain Step 4: The Model Assembly Line. Picture a Model Assembly Line for data products that transforms raw data into an actionable outcome. The Modeler takes the raw data and converts it into slightly more refined predicted data.

The first component of ODG's Modeler was a model of price elasticity (the probability that a customer will accept a given price) for new policies and for renewals. The price elasticity model is a curve of price versus the probability of the customer accepting the policy conditional on that price. This curve moves from almost certain acceptance at very low prices to almost never at high prices.

The second component of ODG's Modeler related price to the insurance company's profit, conditional on the customer accepting this price. The profit for a very low price will be in the red by the value of expected claims in the first year, plus any overhead for acquiring and servicing the new customer. Multiplying these two curves creates a final curve that shows price versus expected profit (see Expected Profit figure, below). The final curve has a clearly identifiable local maximum that represents the best price to charge a customer for the first year.

Expected profit.
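
Those two curves and their product are easy to mock up. In the sketch below, a logistic acceptance curve stands in for the price-elasticity model, a linear curve stands in for profit conditional on acceptance, and taking the argmax over a grid of candidate prices plays the role of a one-lever Simulator plus Optimizer. Every parameter value is invented for illustration.

import numpy as np

prices = np.linspace(200, 1200, 201)                 # candidate annual premiums (the lever)

# Stand-in elasticity model: probability the customer accepts a given price.
accept_prob = 1.0 / (1.0 + np.exp((prices - 700) / 120.0))

# Stand-in profit model: profit if accepted (expected claims plus overhead assumed to be 600).
profit_if_accepted = prices - 600.0

# Simulator + Optimizer for this single lever: expected profit curve and its argmax.
expected_profit = accept_prob * profit_if_accepted
best = np.argmax(expected_profit)
print(f"best price: {prices[best]:.0f}, expected profit per customer: {expected_profit[best]:.1f}")

The clearly identifiable local maximum described above is exactly the peak of this product curve; ODG's real models differ, but the shape of the calculation is the same.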

ODG also built models for customer retention. These models predicted whether customers would renew their policies in one year, allowing for changes in price and willingness to jump to a competitor. These additional models allow the annual models to be combined to predict profit from a new customer over the next five years.

This new suite of models is not a final answer because it only identifies the outcome for a given set of inputs. The next machine on the assembly line is a Simulator, which lets ODG ask the "what if" questions to see how the levers affect the distribution of the final outcome. The expected profit curve is just a slice of the surface of possible outcomes. To build that entire surface, the Simulator runs the models over a wide range of inputs. The operator can adjust the input levers to answer specific questions like, "What will happen if our company offers the customer a low teaser price in year one but then raises the premiums in year two?" They can also explore how the distribution of profit is shaped by the inputs outside of the insurer's control: "What if the economy crashes and the customer loses his job? What if a 100-year flood hits his home? If a new competitor enters the market and our company does not react, what will be the impact on our bottom line?" Because the simulation is at a per-policy level, the insurer can view the impact of a given set of price changes on revenue, market share, and other metrics over time.

The Simulator's result is fed to an Optimizer, which takes the surface of possible outcomes and identifies the highest point. The Optimizer not only finds the best outcomes, it can also identify catastrophic outcomes and show how to avoid them. There are many different optimization techniques to choose from (see sidebar, below), but it is a well-understood field with robust and accessible solutions. ODG's competitors use different techniques to find an optimal price, but they are shipping the same overall data product. What matters is that using a Drivetrain Approach combined with a Model Assembly Line bridges the gap between predictive models and actionable outcomes. Irfan Ahmed of CloudPhysics provides a good taxonomy of predictive modeling that describes this entire assembly line process:

"When dealing with hundreds or thousands of individual components models to understand the behavior of the full-system, a 'search' has to be done. I think of this as a complicated machine (full-system) where the curtain is withdrawn and you get to model each significant part of the machine under controlled experiments and then simulate the interactions. Note here the different levels: models of individual components, tied together in a simulation given a set of inputs, iterated through over different input sets in a search optimizer."



Sidebar: Optimization in the real world

Optimization is a classic problem that has been studied by Newton and Gauss all the way up to mathematicians and engineers in the present day. Many optimization procedures are iterative; they can be thought of as taking a small step, checking our elevation and then taking another small uphill step until we reach a point from which there is no direction in which we can climb any higher. The danger in this hill-climbing approach is that if the steps are too small, we may get stuck at one of the many local maxima in the foothills, which will not tell us the best set of controllable inputs. There are many techniques to avoid this problem, some based on statistics and spreading our bets widely, and others based on systems seen in nature, like biological evolution or the cooling of atoms in glass.
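
A bare-bones version of that hill-climbing idea, with random restarts as one of the simplest guards against getting stuck in a foothill, looks like this. The bumpy objective function is arbitrary and chosen only so that local maxima exist.

import math
import random

def objective(x):
    # An arbitrary bumpy surface with several local maxima.
    return math.sin(3 * x) + 0.5 * math.cos(7 * x) - 0.05 * (x - 2) ** 2

def hill_climb(start, step=0.01, iters=5000):
    x, best = start, objective(start)
    for _ in range(iters):
        candidate = x + random.uniform(-step, step)   # take a small step
        value = objective(candidate)
        if value > best:                              # only ever move uphill
            x, best = candidate, value
    return x, best

random.seed(0)
# Random restarts: climb from several starting points and keep the best summit found.
results = [hill_climb(random.uniform(-5, 5)) for _ in range(20)]
x_best, f_best = max(results, key=lambda r: r[1])
print("best x: %.3f, objective: %.3f" % (x_best, f_best))

Techniques like simulated annealing or evolutionary search replace the restart trick with more principled ways of escaping the foothills.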

Optimization is a process we are all familiar with in our daily lives, even if we have never used algorithms like gradient descent or simulated annealing. A great image for optimization in the real world comes up in a recent TechZing podcast with the co-founders of data-mining competition platform Kaggle. One of the authors of this paper was explaining an iterative optimization technique, and the host says, "So, in a sense Jeremy, your approach was like that of doing a startup, which is just get something out there and iterate and iterate and iterate." The takeaway, whether you are a tiny startup or a giant insurance company, is that we unconsciously use optimization whenever we decide how to get to where we want to go.

Drivetrain Approach to recommender systems

Let's look at how we could apply this process to another industry: marketing. We begin by applying the Drivetrain Approach to a familiar example, recommendation engines, and then building this up into an entire optimized marketing strategy.

Recommendation engines are a familiar example of a data product based on well-built predictive models that do not achieve an optimal objective. The current algorithms predict what products a customer will like, based on purchase history and the histories of similar customers. A company like Amazon represents every purchase that has ever been made as a giant sparse matrix, with customers as the rows and products as the columns. Once they have the data in this format, data scientists apply some form of collaborative filtering to "fill in the matrix." For example, if customer A buys products 1 and 10, and customer B buys products 1, 2, 4, and 10, the engine will recommend that A buy 2 and 4. These models are good at predicting whether a customer will like a given product, but they often suggest products that the customer already knows about or has already decided not to buy. Amazon's recommendation engine is probably the best one out there, but it's easy to get it to show its warts. Here is a screenshot of the "Customers Who Bought This Item Also Bought" feed on Amazon from a search for the latest book in Terry Pratchett's Discworld series:

Terry Pratchett books at Amazon

All of the recommendations are for other books in the same series, but it's a good assumption that a customer who searched for "Terry Pratchett" is already aware of these books. There may be some unexpected recommendations on pages 2 through 14 of the feed, but how many customers are going to bother clicking through?
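
The "fill in the matrix" step in the customer A / customer B example above can be sketched with simple item co-occurrence counts, one of the crudest forms of collaborative filtering; systems at Amazon's scale are vastly more sophisticated, but the basic mechanics are the same.

from collections import Counter

purchases = {
    "A": {1, 10},
    "B": {1, 2, 4, 10},
    "C": {2, 4, 7},
}

def recommend(customer, purchases, k=3):
    """Recommend items bought by customers with overlapping purchase histories,
    weighted by the size of the overlap and excluding items already owned."""
    owned = purchases[customer]
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue
        for item in items - owned:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(k)]

print(recommend("A", purchases))   # -> [2, 4]

Notice that nothing in this calculation asks whether the recommendation will actually change what the customer buys, which is exactly the gap the Drivetrain Approach addresses next.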

Instead, let's design an improved recommendation engine using the Drivetrain Approach, starting by reconsidering our objective. The objective of a recommendation engine is to drive additional sales by surprising and delighting the customer with books he or she would not have purchased without the recommendation. What we would really like to do is emulate the experience of Mark Johnson, CEO of Zite, who gave a perfect example of what a customer's recommendation experience should be like in a recent TOC talk. He went into Strand bookstore in New York City and asked for a book similar to Toni Morrison's "Beloved." The girl behind the counter recommended William Faulkner's "Absalom, Absalom!" On Amazon, the top results for a similar query lead to another book by Toni Morrison and several books by well-known female authors of color. The Strand bookseller made a brilliant but far-fetched recommendation, probably based more on the character of Morrison's writing than on superficial similarities between Morrison and other authors. She cut through the chaff of the obvious to make a recommendation that will send the customer home with a new book and have her returning to Strand again and again in the future.

This is not to say that Amazon's recommendation engine could not have made the same connection; the problem is that this helpful recommendation will be buried far down in the recommendation feed, beneath books that have more obvious similarities to "Beloved." The objective is to escape a recommendation filter bubble, a term which was originally coined by Eli Pariser to describe the tendency of personalized news feeds to only display articles that are blandly popular or further confirm the readers' existing biases.

As with the AltaVista-Google example, the lever a bookseller can control is the ranking of the recommendations. New data must also be collected to generate recommendations that will cause new sales. This will require conducting many randomized experiments in order to collect data about a wide range of recommendations for a wide range of customers.

The final step in the drivetrain process is to build the Model Assembly Line. One way to escape the recommendation bubble would be to build a Modeler containing two models for purchase probabilities, conditional on seeing or not seeing a recommendation. The difference between these two probabilities is a utility function for a given recommendation to a customer (see Recommendation Engine figure, below). It will be low in cases where the algorithm recommends a familiar book that the customer has already rejected (both components are small) or a book that he or she would have bought even without the recommendation (both components are large and cancel each other out). We can build a Simulator to test the utility of each of the many possible books we have in stock, or perhaps just over all the outputs of a collaborative filtering model of similar customer purchases, and then build a simple Optimizer that ranks and displays the recommended books based on their simulated utility. In general, when choosing an objective function to optimize, we need less emphasis on the "function" and more on the "objective." What is the objective of the person using our data product? What choice are we actually helping him or her make?

Recommendation Engine.
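
Stripped to its core, the Modeler/Simulator/Optimizer chain described above amounts to ranking candidate books by the lift a recommendation provides. In the sketch below, the two purchase-probability models are stubbed out with invented numbers; in practice they would be learned from the randomized experiments described earlier.

# Stub predictions for one customer: P(buy | shown rec) and P(buy | not shown).
# In a real system these come from two trained models; the numbers are invented.
candidates = {
    "Next book in the same series":   {"p_with": 0.60, "p_without": 0.58},  # would buy anyway
    "Already-rejected familiar book": {"p_with": 0.05, "p_without": 0.04},  # both tiny
    "Surprising but fitting novel":   {"p_with": 0.35, "p_without": 0.05},  # real lift
}

def rank_recommendations(candidates):
    """Optimizer step: order candidate books by the difference the recommendation itself makes."""
    utility = {book: p["p_with"] - p["p_without"] for book, p in candidates.items()}
    return sorted(utility.items(), key=lambda kv: kv[1], reverse=True)

for book, lift in rank_recommendations(candidates):
    print("%+.2f  %s" % (lift, book))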

Optimizing lifetime customer value

This same systematic approach can be used to optimize the entire marketing strategy. This encompasses all the interactions that a retailer has with its customers outside of the actual buy-sell transaction, whether making a product recommendation, encouraging the customer to check out a new feature of the online store, or sending sales promotions. Making the wrong choices comes at a cost to the retailer in the form of reduced margins (discounts that do not drive extra sales), opportunity costs for the scarce real-estate on their homepage (taking up space in the recommendation feed with products the customer doesn't like or would have bought without a recommendation) or the customer tuning out (sending so many unhelpful email promotions that the customer filters all future communications as spam). We will show how to go about building an optimized marketing strategy that mitigates these effects.

As in each of the previous examples, we begin by asking: "What objective is the marketing strategy trying to achieve?" Simple: we want to optimize the lifetime value from each customer. Second question: "What levers do we have at our disposal to achieve this objective?" Quite a few. For example:

  1. We can make product recommendations that surprise and delight (using the optimized recommendation outlined in the previous section).
  2. We could offer tailored discounts or special offers on products the customer was not quite ready to buy or would have bought elsewhere.
  3. We can even make customer-care calls just to see how the user is enjoying our site and make them feel that their feedback is valued.

What new data do we need to collect? This can vary case by case, but a few online retailers are taking creative approaches to this step. Online fashion retailer Zafu shows how to encourage the customer to participate in this collection process. Plenty of websites sell designer denim, but for many women, high-end jeans are the one item of clothing they never buy online because it's hard to find the right pair without trying them on. Zafu's approach is not to send their customers directly to the clothes, but to begin by asking a series of simple questions about the customers' body type, how well their other jeans fit, and their fashion preferences. Only then does the customer get to browse a recommended selection of Zafu's inventory. The data collection and recommendation steps are not an add-on; they are Zafu's entire business model — women's jeans are now a data product. Zafu can tailor their recommendations to fit as well as their jeans because their system is asking the right questions.

Zafu screenshots

Starting with the objective forces data scientists to consider what additional models they need to build for the Modeler. We can keep the "like" model that we have already built as well as the causality model for purchases with and without recommendations, and then take a staged approach to adding additional models that we think will improve the marketing effectiveness. We could add a price elasticity model to test how offering a discount might change the probability that the customer will buy the item. We could construct a patience model for the customers' tolerance for poorly targeted communications: When do they tune them out and filter our messages straight to spam? ("If Hulu shows me that same dog food ad one more time, I'm gonna stop watching!") A purchase sequence causality model can be used to identify key "entry products." For example, a pair of jeans that is often paired with a particular top, or the first part of a series of novels that often leads to a sale of the whole set.

Once we have these models, we construct a Simulator and an Optimizer and run them over the combined models to find out what recommendations will achieve our objectives: driving sales and improving the customer experience.

A look inside the Modeler
A look inside the Modeler. Click to enlarge.

Best practices from physical data products

It is easy to stumble into the trap of thinking that since data exists somewhere abstract, on a spreadsheet or in the cloud, that data products are just abstract algorithms. So, we would like to conclude by showing you how objective-based data products are already a part of the tangible world. What is most important about these examples is that the engineers who designed these data products didn't start by building a neato robot and then looking for something to do with it. They started with an objective like, "I want my car to drive me places," and then designed a covert data product to accomplish that task. Engineers are often quietly on the leading edge of algorithmic applications because they have long been thinking about their own modeling challenges in an objective-based way. Industrial engineers were among the first to begin using neural networks, applying them to problems like the optimal design of assembly lines and quality control. Brian Ripley's seminal book on pattern recognition gives credit for many ideas and techniques to largely forgotten engineering papers from the 1970s.

When designing a product or manufacturing process, a drivetrain-like process followed by model integration, simulation and optimization is a familiar part of the toolkit of systems engineers. In engineering, it is often necessary to link many component models together so that they can be simulated and optimized in tandem. These firms have plenty of experience building models of each of the components and systems in their final product, whether they're building a server farm or a fighter jet. There may be one detailed model for mechanical systems, a separate model for thermal systems, and yet another for electrical systems, etc. All of these systems have critical interactions. For example, resistance in the electrical system produces heat, which needs to be included as an input for the thermal diffusion and cooling model. That excess heat could cause mechanical components to warp, producing stresses that should be inputs to the mechanical models.

The screenshot below is taken from a model integration tool designed by Phoenix Integration. Although it's from a completely different engineering discipline, this diagram is very similar to the Drivetrain Approach we've recommended for data products. The objective is clearly defined: build an airplane wing. The wing box includes the design levers like span, taper ratio and sweep. The data is in the wing materials' physical properties; costs are listed in another tab of the application. There is a Modeler for aerodynamics and mechanical structure that can then be fed to a Simulator to produce the Key Wing Outputs of cost, weight, lift coefficient and induced drag. These outcomes can be fed to an Optimizer to build a functioning and cost-effective airplane wing.

Screenshot from a model integration tool designed by Phoenix Integration.

As predictive modeling and optimization become more vital to a wide variety of activities, look out for the engineers to disrupt industries that wouldn't immediately appear to be in the data business. The inspiration for the phrase "Drivetrain Approach," for example, is already on the streets of Mountain View. Instead of being data driven, we can now let the data drive us.

Suppose we wanted to get from San Francisco to the Strata 2012 Conference in Santa Clara. We could just build a simple model of distance / speed-limit to predict arrival time with little more than a ruler and a road map. If we want a more sophisticated system, we can build another model for traffic congestion and yet another model to forecast weather conditions and their effect on the safest maximum speed. There are plenty of cool challenges in building these models, but by themselves, they do not take us to our destination. These days, it is trivial to use some type of heuristic search algorithm to predict the drive times along various routes (a Simulator) and then pick the shortest one (an Optimizer) subject to constraints like avoiding bridge tolls or maximizing gas mileage. But why not think bigger? Instead of the femme-bot voice of the GPS unit telling us which route to take and where to turn, what would it take to build a car that would make those decisions by itself? Why not bundle simulation and optimization engines with a physical engine, all inside the black box of a car?
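
That Simulator-plus-Optimizer framing of route choice fits in a few lines: estimate a drive time for each candidate route, drop routes that violate a constraint, and keep the fastest. The routes, speeds and toll flags below are invented for illustration.

routes = [
    {"name": "US-101", "miles": 46, "avg_mph": 52, "toll": False},
    {"name": "I-280", "miles": 49, "avg_mph": 60, "toll": False},
    {"name": "CA-84 via toll bridge", "miles": 43, "avg_mph": 48, "toll": True},
]

def simulate_minutes(route):
    """Simulator: a crude drive-time model (distance over average speed)."""
    return 60.0 * route["miles"] / route["avg_mph"]

def best_route(routes, avoid_tolls=True):
    """Optimizer: minimize simulated drive time subject to a constraint."""
    feasible = [r for r in routes if not (avoid_tolls and r["toll"])]
    return min(feasible, key=simulate_minutes)

choice = best_route(routes)
print(choice["name"], round(simulate_minutes(choice), 1), "minutes")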

Let's consider how this is an application of the Drivetrain Approach. We have already defined our objective: building a car that drives itself. The levers are the vehicle controls we are all familiar with: steering wheel, accelerator, brakes, etc. Next, we consider what data the car needs to collect; it needs sensors that gather data about the road as well as cameras that can detect road signs, red or green lights, and unexpected obstacles (including pedestrians). We need to define the models we will need, such as physics models to predict the effects of steering, braking and acceleration, and pattern recognition algorithms to interpret data from the road signs.

As one engineer on the Google self-driving car project put it in a recent Wired article, "We're analyzing and predicting the world 20 times a second." What gets lost in the quote is what happens as a result of that prediction. The vehicle needs to use a simulator to examine the results of the possible actions it could take. If it turns left now, will it hit that pedestrian? If it makes a right turn at 55 mph in these weather conditions, will it skid off the road? Merely predicting what will happen isn't good enough. The self-driving car needs to take the next step: after simulating all the possibilities, it must optimize the results of the simulation to pick the best combination of acceleration and braking, steering and signaling, to get us safely to Santa Clara. Prediction only tells us that there is going to be an accident. An optimizer tells us how to avoid accidents.

Improving the data collection and predictive models is very important, but we want to emphasize the importance of beginning by defining a clear objective with levers that produce actionable outcomes. Data science is beginning to pervade even the most bricks-and-mortar elements of our lives. As scientists and engineers become more adept at applying prediction and optimization to everyday problems, they are expanding the art of the possible, optimizing everything from our personal health to the houses and cities we live in. Models developed to simulate fluid dynamics and turbulence have been applied to improving traffic and pedestrian flows by using the placement of exits and crowd control barriers as levers. This has improved emergency evacuation procedures for subway stations and reduced the danger of crowd stampedes and trampling during sporting events. Nest is designing smart thermostats that learn the homeowner's temperature preferences and then optimize energy consumption. For motor vehicle traffic, IBM performed a project with the city of Stockholm to optimize traffic flows, reducing congestion by nearly a quarter and increasing the air quality in the inner city by 25%. What is particularly interesting is that there was no need to build an elaborate new data collection system. Any city with metered stoplights already has all the necessary information; they just haven't found a way to suck the meaning out of it.

In another area where objective-based data products have the power to change lives, the CMU extension in Silicon Valley has an active project for building data products to help first responders after natural or man-made disasters. Jeannie Stamberger of Carnegie Mellon University Silicon Valley explained to us many of the possible applications of predictive algorithms to disaster response, from text-mining and sentiment analysis of tweets to determine the extent of the damage, to swarms of autonomous robots for reconnaissance and rescue, to logistic optimization tools that help multiple jurisdictions coordinate their responses. These disaster applications are a particularly good example of why data products need simple, well-designed interfaces that produce concrete recommendations. In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes.

The future for data products

We introduced the Drivetrain Approach to provide a framework for designing the next generation of great data products and described how it relies at its heart on optimization. In the future, we hope to see optimization taught in business schools as well as in statistics departments. We hope to see data scientists ship products that are designed to produce desirable business outcomes. This is still the dawn of data science. We don't know what design approaches will be developed in the future, but right now, there is a need for the data science community to coalesce around a shared vocabulary and product design process that can be used to educate others on how to derive value from their predictive models. If we do not do this, we will find that our models only use data to create more data, rather than using data to create actions, disrupt industries and transform lives.


Do we want products that deliver data, or do we want products that deliver results based on data? Jeremy Howard examined these questions in his Strata CA 12 session, "From Predictive Modelling to Optimization: The Next Frontier." Full video from that session is embedded below:



March 22 2012

Direct sales uncover hidden trends for publishers

One of the most important reasons publishers should invest in a direct channel is all the data it provides. Retailers are only going to share a certain amount of customer information with you, but when you make the sale yourself, you have full access to the resulting data stream.

As you may already know, when you buy an ebook from oreilly.com, you end up with access to multiple formats of that product. Unlike Amazon, where you only get a Mobi file, or Apple, where you only get an EPUB file, oreilly.com provides both (as well as PDF and oftentimes a couple of others). This gives the customer the freedom of format choice, but it also gives us insight into what our customers prefer. We often look at download trends to see whether PDF is still the most popular format (it is) and whether Mobi or EPUB are gaining momentum (they are). But what we hadn't done was ask our customers a few simple questions to help us better understand their e-reading habits. We addressed those habits in a recent survey. Here are the questions we asked:

  • If you purchase an ebook from oreilly.com, which of the following is the primary device you will read it on? [Choices included laptop, desktop, iOS devices, Android devices, various Kindle models, and other ereaders/tablets.]
  • On which other devices do you plan to view your ebook?
  • If you purchase an ebook from oreilly.com, which of the following is the primary format in which you plan to read the book? [Choices included PDF, EPUB, Mobi, APK and Daisy formats.]
  • What other ebook formats, if any, do you plan to use?

We ran the survey for about a month and the answers might surprise you. Bear in mind that we realize our audience is unique. O'Reilly caters to technology professionals and enthusiasts. Our customers are also often among the earliest of early adopters.

So, what's the primary ereading device used by these early adopters and techno-enthusiasts? Their iPads. That's not shocking, but what's interesting is how only 25% of respondents said the iPad is their primary device. A whopping 46% said their laptop or desktop computer was their primary ereading device.

Despite all the fanfare about Kindles, iPads, tablets and E Ink devices, the bulk of our customers are still reading their ebooks on an old-fashioned laptop or desktop computer. It's also important to note that the most popular format isn't EPUB or Mobi. Approximately half the respondents said PDF is their primary format. When you think about it, this makes a lot of sense. Again, our audience is largely IT practitioners, coding or solving other problems in front of their laptops/desktops, so they like having the content on that same screen. And just about everyone has Adobe Acrobat on their computer, so the PDF format is immediately readable on most of the laptops/desktops our customers touch.
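
For anyone curious how such a tally works in practice, here is a minimal sketch of counting primary-format answers from a survey export; the file name and column header are hypothetical, since the post doesn't describe O'Reilly's actual survey tooling.

```python
# A minimal sketch of tallying primary-format answers from a survey export.
# The file name and column header are hypothetical; the post doesn't describe
# O'Reilly's actual survey tooling.
import csv
from collections import Counter

def format_shares(path="ereader_survey.csv"):   # hypothetical export file
    with open(path, newline="") as f:
        counts = Counter(row["primary_format"].strip() for row in csv.DictReader(f))
    total = sum(counts.values())
    # Return each format's share of respondents, largest first.
    return {fmt: round(100 * n / total, 1) for fmt, n in counts.most_common()}

if __name__ == "__main__":
    for fmt, pct in format_shares().items():
        print(f"{fmt}: {pct}%")
```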

I've spoken with a number of publishers who rely almost exclusively on Amazon data and trends to figure out what their customers want. What a huge mistake. Even though your audience might be considerably different than O'Reilly's, how do you truly know what they want and need if you're relying on an intermediary (with an agenda) to tell you? Your hidden trend might not have anything to do with devices or formats but rather reader/app features or content delivery. If you don't take the time to build a direct channel, you may never know the answers. In fact, without a direct channel, you might not even know the questions that need to be asked.

Joe Wikert (@joewikert) tweeted select stats and findings from O'Reilly's ereader survey.




March 20 2012

The unreasonable necessity of subject experts

One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition "In data science, domain expertise is more important than machine learning skill." If you weren't there, Mike Driscoll's summary is an excellent overview (full video of the debate is available here). To make a long story short, the "cons" won: the audience was persuaded that machine learning is more important than domain expertise. That's not surprising, given that we've all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge of those domains. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding "good" pictures on Facebook, he ran a data mining contest at Kaggle.

Data Science Debate panel at Strata CA 12
The "Data Science Debate" panel at Strata California 2012. Watch the debate.

A good impromptu debate necessarily raises as many questions as it answers. Here's the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article, "The End of Theory," asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you've gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they're often closely coupled. Often, the only way to know you've put garbage in is that you've gotten garbage out.

By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. "Stupid Data Miner Tricks" is a hilarious send-up of the problems of data mining: It shows how to "predict" the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.
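
To see how easily that trap springs, here is a toy sketch (not taken from "Stupid Data Miner Tricks" itself) that mines thousands of random series against a random target; with only a handful of observations, something always appears to "predict" it.

```python
# A toy demonstration of the "Stupid Data Miner Tricks" trap: search enough
# unrelated series and some of them will "predict" your target purely by
# chance. Everything here is synthetic; nothing is drawn from real markets.
import numpy as np

rng = np.random.default_rng(0)
years = 10                       # ten annual observations, as in the S&P example
target = rng.normal(size=years)  # stand-in for the index we want to "explain"

best_r, best_i = 0.0, -1
for i in range(10_000):          # mine ten thousand random candidate predictors
    candidate = rng.normal(size=years)
    r = abs(np.corrcoef(target, candidate)[0, 1])
    if r > best_r:
        best_r, best_i = r, i

# With only ten data points, the winner routinely shows |r| > 0.9: an
# impressive-looking "signal" with no predictive value at all.
print(f"Best of 10,000 random series: |r| = {best_r:.2f} (series #{best_i})")
```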

Cherry picking and overfitting have particularly bad "smells" that are often fairly obvious: The Democrats never lose a Presidential election in a year when the Yankees win the World Series, for example. (Hmmm. The 2000 election was rather fishy.) Any reasonably experienced data scientist should be able to stay out of trouble, but what if you treat your data with care and it still spits out an unexpected result? Or an expected result that's too good to be true? After the data crunching has been done, it's the subject expert's job to ensure that the results are good, meaningful, and well-understood.

Let's say you're an audio equipment seller analyzing a lot of purchase data and you find out that people buy more orange juice just before replacing their home audio system. It's an unlikely, absurd (and completely made up) result, but stranger things have happened. I'd probably go and build an audio gear marketing campaign targeting bulk purchasers of orange juice. Sales would probably go up; data is "unreasonably effective," even if you don't know why. This is precisely where things get interesting, and precisely where I think subject matter expertise becomes important: after the fact. Data breeds data, and it's naive to think that marketing audio gear to OJ addicts wouldn't breed more datasets and more analysis. It's naive to think the OJ data wouldn't be used in combination with other datasets to produce second-, third-, and fourth-order results. That's when the unreasonable effectiveness of data isn't enough; that's when it's important to understand the results in ways that go beyond what data analysis alone can currently give us. We may have a useful result that we don't understand, but is it meaningful to combine that result with other results that we may (or may not) understand?



Let's look at a more realistic scenario. Pete Warden's Kaggle-based algorithm for finding quality pictures works well, despite giving the surprising result that pictures with "Michigan" in the caption are significantly better than average. (As are pictures from Peru, and pictures taken of tombs.) Why Michigan? Your guess is as good as mine. For Warden's application, building photo albums on the fly for his company Jetpac, that's fine. But if you're building a more complex system that plans vacations for photographers, you'd better know more than that. Why are the photographs good? Is Michigan a destination for birders? Is it a destination for people who like tombs? Is it a destination with artifacts from ancient civilizations? Or would you be better off recommending a trip to Peru?



Another realistic scenario: Target recently used purchase histories to target pregnant women with ads for baby-related products, with surprising success. I won't rehash that story. From that starting point, you can go a lot further. Pregnancies frequently lead to new car purchases. New car purchases lead to new insurance premiums, and I expect data will show that women with babies are safer drivers. At each step, you're compounding data with more data. It would certainly be nice to know you understood what was happening at each step of the way before offering a teenage driver a low insurance premium just because she thought a large black handbag (that happened to be appropriate for storing diapers) looked cool.



There's a limit to the value you can derive from correct but inexplicable results. (Whatever else one may say about the Target case, it looks like they made sure they understood the results.) It takes a subject matter expert to make the leap from correct results to understood results. In an email, Pete Warden said:

"My biggest worry is that we're making important decisions based on black-box algorithms that may have hidden and problematic biases. If we're deciding who to give a mortgage based on machine learning, and the system consistently turns down black people, how do we even notice it, let alone fix it, unless we understand what the rules are? A real-world case is trading systems. If you have a mass of tangled and inexplicable logic driving trades, how do you assign blame when something like the Flash Crash happens?

"For decades, we've had computer systems we don't understand making decisions for us, but at least when something went wrong we could go in afterward and figure out what the causes were. More and more, we're going to be left shrugging our shoulders when someone asks us for an explanation."

That's why you need subject matter experts to understand your results, rather than simply accepting them at face value. It's easy to imagine that subject matter expertise requires hiring a PhD in some arcane discipline. For many applications, though, it's much more effective to develop your own expertise. In an email exchange, DJ Patil (@dpatil) said that people often become subject experts just by playing with the data. As an undergrad, he had to analyze a dataset about sardine populations off the coast of California. Trying to understand some anomalies led him to ask questions about coastal currents, why biologists only count sardines at certain stages in their life cycle, and more. Patil said:

"... this is what makes an awesome data scientist. They use data to have a conversation. This way they learn and bring other data elements together, create tests, challenge hypothesis, and iterate."

By asking questions of the data, and using those questions to ask more questions, Patil became an expert in an esoteric branch of marine biology, and in the process greatly increased the value of his results.

When subject expertise really isn't available, it's possible to create a workaround through clever application design. One of my takeaways from Patil's "Data Jujitsu" talk was the clever way LinkedIn "crowdsourced" subject matter expertise to their membership. Rather than sending job recommendations directly to a member, they'd send them to a friend, and ask the friend to pass along any they thought appropriate. This trick doesn't solve problems with hidden biases, and it doesn't give LinkedIn insight into why any given recommendation is appropriate, but it does an effective job of filtering inappropriate recommendations.

Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes "unreasonably effective" through the conversation that takes place after the numbers have been crunched. At his Strata keynote, Avinash Kaushik (@avinash) revisited Donald Rumsfeld's statement about known knowns, known unknowns, and unknown unknowns, and argued that the "unknown unknowns" are where the most interesting and important results lie. That's the territory we're entering here: data-driven results we would never have expected. We can only take our inexplicable results at face value if we're just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they're based. And that's the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.



March 17 2012

Profile of the Data Journalist: The Homicide Watch

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Chris Amico (@eyeseast) is a journalist and web developer based in Washington, DC, where he works on NPR's State Impact project, building a platform for local reporters covering issues in their states. Laura Norton Amico (@LauraNorton) is the editor of Homicide Watch (@HomicideWatch), an online community news platform in Washington, D.C. that aspires to cover every homicide in the District of Columbia. And yes, the similar names aren't a coincidence: the Amicos were married in 2010.

Since Homicide Watch launched in 2009, it's been earning praise and interest from around the digital world, including a profile by the Nieman Lab at Harvard University that asked whether a local blog "could fill the gaps of DC's homicide coverage." Notably, Homicide Watch has turned up a number of unreported murders.

In the process, the site has also highlighted an important emerging source of data that other digital editors should consider: inbound search engine analytics for reporting. As Steve Myers reported for the Poynter Institute, Homicide Watch used clues in site search queries to ID a homicide victim. We'll see if the Knight Foundation thinks this idea has legs: the husband and wife team have applied for a Knight News Challenge grant to build a toolkit for real-time investigative reporting from site analytics.
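
For readers curious what reporting from site analytics might look like in code, here is a hedged sketch of the general idea: flag inbound search queries that look like names the case database doesn't already know. The queries, names, and matching heuristic are invented; Homicide Watch's actual method isn't detailed in this post.

```python
# A hedged sketch of the general idea behind "news tips in analytics":
# compare inbound site-search queries against names already in the case
# database and surface queries that mention someone the site doesn't know.
# The query list, known names, and matching heuristic are all invented.
import re

known_names = {"john doe", "jane roe"}   # names already in the case database

def candidate_tips(search_queries):
    tips = []
    for query in search_queries:
        q = query.lower().strip()
        # Crude heuristic: exactly two words that could be a first and last name.
        looks_like_name = re.fullmatch(r"[a-z]+ [a-z]+", q) is not None
        if looks_like_name and q not in known_names:
            tips.append(query)
    return tips

queries = ["john doe trial", "maria example", "shooting on 5th street", "jane roe"]
print(candidate_tips(queries))   # -> ['maria example']
```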

The Amicos' success with the site -- which saw big growth in 2011 -- offers an important case study in why organizing beats may well hold similar importance as investigative projects. It will also be a case study with respect to sustainability and business models for the "new news," as Homicide Watch looks to license its platform to news outlets across the country.

Below, I've embedded a presentation on Homicide Watch from the January 2012 meeting of the Online News Association. Our interview follows.

Watch live streaming video from onlinenewsassociation at livestream.com

Where do you work now? What is a day in your life like?

Laura: I work full time right now for Homicide Watch, a database driven beat publishing platform for covering homicides. Our flagship site is in DC, and I’m the editor and primary reporter on that site as well as running business operations for the brand.

My typical days start with reporting. First, news checks, and maybe posting some quick posts on anything that’s happened overnight. After that, it’s usually off to court to attend hearings and trials, get documents, reporting stuff. I usually have a to-do list for the day that includes business meetings, scheduling freelancers, mapping out long-term projects, doing interviews about the site, managing our accounting, dealing with awards applications, blogging about the start-up data journalism life on my personal blog and for ONA at journalists.org, guest teaching the occasional journalism class, and meeting deadlines for freelance stories. The work day never really ends; I’m online keeping an eye on things until I go to bed.

Chris: I work for NPR, on the State Impact project, where I build news apps and tools for journalists. With Homicide Watch, I work in short bursts, usually an hour before dinner and a few hours after. I’m a night owl, so if I let myself, I’ll work until 1 or 2 a.m., just hacking at small bugs on the site. I keep a long list of little things I can fix, so I can dip into the codebase, fix something and deploy it, then do something else. Big features, like tracking case outcomes, tend to come from weekend code sprints.

How did you get started in data journalism? Did you get any special degrees or certificates?

Laura: Homicide Watch DC was my first data project. I’ve learned everything I know now from conceiving of the site, managing it as Chris built it, and from working on it. Homicide Watch DC started as a spreadsheet. Our start-up kit for newsrooms starting Homicide Watch sites still includes filling out a spreadsheet. The best lesson I learned when I was starting out was to find out what all the pieces are and learn how to manage them in the simplest way possible.

Chris: My first job was covering local schools in southern California, and data kept creeping into my beat. I liked having firm answers to tough questions, so I made sure I knew, for example, how many graduates at a given high school met the minimum requirements for college. California just has this wealth of education data available, and when I started asking questions of the data, I got stories that were way more interesting.

I lived in Dalian, China for a while. I helped start a local news site with two other expats (Alex Bowman and Rick Martin). We put everything we knew about the city -- restaurant reviews, blog posts, photos from Flickr -- into one big database and mapped it all. It was this awakening moment when suddenly we had this resource where all the information we had was interlinked. When I came back to California, I sat down with a book on Python and Django and started teaching myself to code. I spent a year freelancing in the Bay Area, writing for newspapers by day, learning Python by night. Then the NewsHour hired me.

Did you have any mentors? Who? What were the most important resources they shared with you?

Laura: Chris really coached me through the complexities of data journalism when we were creating the site. He taught me that data questions are editorial questions. When I realized that data could be discussed as an editorial approach, it opened the crime beat up. I learned to ask questions of the information I was gathering in a new way.

Chris: My education has been really informal. I worked with a great reporter at my first job, Bob Wilson, who is a great interviewer of both people and spreadsheets. At NewsHour, I worked with Dante Chinni on Patchwork Nation, who taught me about reporting around a central organizing principle. Since I’ve started coding, I’ve ended up in this great little community of programmer-journalists where people bounce ideas around and help each other out.

What does your personal data journalism "stack" look like? What tools could you not live without?

Laura: The site itself and its database, which I report to and from, WordPress, WordPress analytics, Google Analytics, Google Calendar, Twitter, Facebook, Storify, DocumentCloud, VINELink, and DC Superior Court’s online case lookup.

Chris: Since I write more Python than prose these days, I spend most of my time in a text editor (usually TextMate) on a MacBook Pro. I try not to do anything without git.

What data journalism project are you the most proud of working on or creating?

Laura: Homicide Watch is the best thing I’ve ever done. It’s not just about the data, and it’s not just about the journalism, but it’s about meeting a community need in an innovative way. I started thinking about a Homicide Watch-type site when I was trying to follow a few local cases shortly after moving to DC. It was nearly impossible to find news sources for the information. I did find that family and friends of victims and suspects were posting newsy updates in unusual places -- online obituaries and Facebook memorial pages, for example. I thought a lot about how a news product could fit the expressed need for news, information, and a way for the community to stay in touch about cases.

The data part developed very naturally out of that. The earliest description of the site was “everything a reporter would have in their notebook or on their desk while covering a murder case from start to finish.” That’s still one of the guiding principles of the site, but it’s also meant that organizing that information is super important. What good is making court dates public if you’re not doing it on a calendar, for example?

We started, like I said, with a spreadsheet that listed everything we knew: victim name, age, race, gender, method of death, place of death, link to obituary, photo, suspect name, age, race, gender, case status, incarceration status, detective name, age, race, gender, phone number, judge assigned to case, attorneys connected to the case, co-defendants, connections to other murder cases.

And those are just the basics. Any reporter covering a murder case, crime to conviction, should have that information. What Homicide Watch does is organize it, make as much of it public as we can, and then report from it. It’s led to some pretty cool work, from developing a method to discover news tips in analytics, to simply building news packages that accomplish more than anyone else can.

Chris: Homicide Watch is really the project I wanted to build for years. It’s data-driven beat reporting, where the platform and the editorial direction are tightly coupled. In a lot of ways, it’s what I had in mind when I was writing about frameworks for reporting.

The site is built to be a crime reporter’s toolkit. It’s built around the way Laura works, based on our conversations over the dinner table for the first six months of the site’s existence. Building it meant understanding the legal system, doing reporting and modeling reality in ways I hadn’t done before, and that was a challenge on both the technical and editorial side.

Where do you turn to keep your skills updated or learn new things?

Laura: Assigning myself new projects and tasks is the best way for me to learn; it forces me to find solutions for what I want to do. I’m not great at seeking out resources on my own, but I keep a close eye on Twitter for what others are doing, saying about it, and reading.

Chris: Part of my usual morning news reading is a run through a bunch of programming blogs. I try to get exposed to technologies that have no immediate use to me, just so it keeps me thinking about other ways to approach a problem and to see what other problems people are trying to solve.

I spend a lot of time trying to reverse-engineer other people’s projects, too. Whenever someone launches a new news app, I’ll try to find the data behind it, take a dive through the source code if it’s available and generally see if I can reconstruct how it came together.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Laura: Working on Homicide Watch has taught me that news is about so much more than “stories.” If you think about a typical crime brief, for example, there’s a lot of information in there, starting with the "who-what-where-when." Once that brief is filed and published, though, all of that information disappears.

Working with news apps gives us the ability to harness that information and reuse/repackage it. It’s about slicing our reporting in as many ways as possible in order to make the most of it. On Homicide Watch, that means maintaining a database and creating features like victims’ and suspects’ pages. Those features help regroup, refocus, and curate the reporting into evergreen resources that benefit both reporters and the community.

Chris: Spend some time with your site analytics. You’ll find that there’s no one thing your audience wants. There isn’t even really one audience. Lots of people want lots of different things at different times, or at least different views of the information you have.

One of our design goals with Homicide Watch is “never hit a dead end.” A user may come in looking for information about a certain case, then decide she’s curious about a related issue, then wonder which cases are closed. We want users to be able to explore what we’ve gathered and to be able to answer their own questions. Stories are part of that, but stories are data, too.

March 14 2012

Now available: "Planning for Big Data"

Earlier this month, more than 2,500 people came together for the O'Reilly Strata Conference in Santa Clara, Calif. Though representing diverse fields, from insurance to media and high-tech to healthcare, attendees buzzed with a new-found common identity: they are data scientists. Entrepreneurial and resourceful, combining programming skills with math, data scientists have emerged as a new profession leading the march toward data-driven business.

This new profession rides on the wave of big data. Our businesses are creating ever more data, and as consumers we are sources of massive streams of information, thanks to social networks and smartphones. In this raw material lies much of value: insight about businesses and markets, and the scope to create new kinds of hyper-personalized products and services.

Five years ago, only big business could afford to profit from big data: Walmart and Google, specialized financial traders. Today, thanks to an open source project called Hadoop, commodity Linux hardware and cloud computing, this power is in reach for everyone. A data revolution is sweeping business, government and science, with consequences as far reaching and long lasting as the web itself.

Where to start?

Every revolution has to start somewhere, and the question for many is "how can data science and big data help my organization?" After years of data processing choices being straightforward, there's now a diverse landscape to negotiate. What's more, to become data driven, you must grapple with changes that are cultural as well as technological.

Our aim with Strata is to help you understand what big data is, why it matters, and where to get started. In the wake of the recent conference, we're delighted to announce the publication of our "Planning for Big Data" book. Available as a free download, the book contains the best insights from O'Reilly Radar authors over the past three months, including myself, Alistair Croll, Julie Steele and Mike Loukides.

"Planning for Big Data" is for anybody looking to get a concise overview of the opportunity and technologies associated with big data. If you're already working with big data, hand this book to your colleagues or executives to help them better appreciate the issues and possibilities.



March 08 2012

Profile of the Data Journalist: The Storyteller and The Teacher

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Sarah Cohen (@sarahduke), the Knight professor of the practice of journalism and public policy at Duke University, and Anthony DeBarros (@AnthonyDB), the senior database editor at USA Today, were both important sources of historical perspective for my feature on how data journalism is evolving from "computer-assisted reporting" (CAR) to a powerful Web-enabled practice that uses cloud computing, machine learning and algorithms to make sense of unstructured data.

The latter halves of our interviews, which focused upon their personal and professional experience, follow.

What data journalism project are you the most proud of working on or creating?

DeBarros: "In 2006, my USA TODAY colleague Robert Davis and I built a database of 620 students killed on or near college campuses and mined it to show how freshmen were uniquely vulnerable. It was a heart-breaking but vitally important story to tell. We won the 2007 Missouri Lifestyle Journalism Awards for the piece, and followed it with an equally wrenching look at student deaths from fires."

Cohen: "I'd have to say the Pulitzer-winning series on child deaths in DC, in which we documented that children were dying in predictable circumstances after key mistakes by people who knew that their agencies had specific flaws that could let them fall through the cracks.

I liked working on the Post's POTUS Tracker and Head Count. Those were Web projects that were geared at accumulating lots of little bits about Obama's schedule and his appointees, respectively, that we could share with our readers while simultaneously building an important dataset for use down the road. Some of the Post's Solyndra and related stories, I have heard, came partly from studying the president's trips in POTUS Tracker.

There was one story, called "Misplaced Trust," on DC's guardianship system, that created immediate change in Superior Court, which was gratifying. "Harvesting Cash," our 18-month project on farm subsidies, also helped point out important problems in that system.

The last one, I'll note, is a piece of a project I worked on, in which the DC water authority refused to release the results of a massive lead testing effort, which in turn had shown widespread contamination. We got the survey from a source, but it was on paper.

After scanning, parsing, and geocoding, we sent out a team of reporters to neighborhoods to spot check the data, and also do some reporting on the neighborhoods. We ended up with a story about people who didn't know what was near them.

We also had an interesting experience: the water authority called our editor to complain that we were going to put all of the addresses online -- they felt that it was violating people's privacy, even though we weren't identifying the owners or the residents. It was more important to them that we keep people in the dark about their blocks. Our editor at the time, Len Downie, said, "You're right. We shouldn't just put it on the Web." He also ordered up a special section to put them all in print.

Where do you turn to keep your skills updated or learn new things?

Cohen: "It's actually a little harder now that I'm out of the newsroom, surprisingly. Before, I would just dive into learning something when I'd heard it was possible and I wanted to use it to get to a story. Now I'm less driven, and I have to force myself a little more. I'm hoping to start doing more reporting again soon, and that the Reporters' Lab will help there too.

Lately, I've been spending more time with people from other disciplines to understand better what's possible, like machine learning and speech recognition at Carnegie Mellon and MIT, or natural language processing at Stanford. I can't DO them, but getting a chance to understand what's out there is useful. NewsFoo, SparkCamp and NICAR are the three places that had the best bang this year. I wish I could have gone to Strata, even if I didn't understand it all."

DeBarros: For surveillance, I follow really smart people on Twitter and have several key Google Reader subscriptions.

To learn, I spend a lot of time training after work hours. I've really been pushing myself in the last couple of years to up my game and stay relevant, particularly by learning Python, Linux and web development. Then I bring it back to the office and use it for web scraping and app building.
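
As an illustration of the kind of web scraping DeBarros describes, here is a minimal sketch using the requests and BeautifulSoup libraries; the URL and table layout are placeholders, not anything USA Today actually uses.

```python
# A minimal scraping sketch in the spirit described above, using requests and
# BeautifulSoup. The URL and table layout are placeholders; a real scrape
# should also respect the site's terms of service and robots.txt.
import requests
from bs4 import BeautifulSoup

def scrape_table(url="https://example.gov/school-data"):   # hypothetical URL
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tr")[1:]:   # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows

if __name__ == "__main__":
    for row in scrape_table():
        print(row)
```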

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Cohen: "I think anything that gets more leverage out of fewer people is important in this age, because fewer people are working full time holding government accountable. The news apps help get more eyes on what the government is doing by getting more of what we work with and let them see it. I also think it helps with credibility -- the 'show your work' ethos -- because it forces newsrooms to be more transparent with readers / viewers.

For instance, now, when I'm judging an investigative prize, I am quite suspicious of any project that doesn't let you see each item; i.e., when they say, "there were 300 cases that followed this pattern," I want to see all 300 cases, or all cases with the 300 marked, so I can see whether I agree.

DeBarros: "They're important because we're living in a data-driven culture. A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not.

As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

These interviews were edited and condensed for clarity.

March 07 2012

Data markets survey


The sale of data is a venerable business, and has existed since the middle of the 19th century, when Paul Reuter began providing telegraphed stock exchange prices between Paris and London, and New York newspapers founded the Associated Press.

The web has facilitated a blossoming of information providers. As the ability to discover and exchange data improves, the need to rely on aggregators such as Bloomberg or Thomson Reuters is declining. This is a good thing: the business models of large aggregators do not readily scale to web startups, or casual use of data in analytics.

Instead, data is increasingly offered through online marketplaces: platforms that host data from publishers and offer it to consumers. This article provides an overview of the most mature data markets, and contrasts their different approaches and facilities.

What do marketplaces do?

Most of the consumers of data from today's marketplaces are developers. By adding another dataset to your own business data, you can create insight. To take an example from web analytics: by mixing an IP address database with the logs from your website, you can understand where your customers are coming from; add demographic data to the mix, and you have some idea of their socio-economic bracket and spending ability.

Such insight isn't limited to analytic use; you can also use it to provide value back to a customer, for instance by recommending restaurants near the location of a lunchtime appointment in their calendar. While many datasets are useful, few are as potent as location in the way it provides context to activity.
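
As a concrete illustration of that enrichment pattern, here is a simplified sketch that matches a log line's IP address against an IP-range table and then layers on demographic attributes; the ranges, regions, and figures are invented.

```python
# A simplified sketch of the log-enrichment idea described above: match each
# request's IP address against an IP-range table to attach a location, then
# layer demographic attributes onto that location. The ranges, regions, and
# figures below are invented; a real system would use a commercial IP
# intelligence database and a demographic dataset from a marketplace.
import ipaddress

IP_RANGES = [  # (start, end, region), as a vendor might supply them
    (ipaddress.ip_address("203.0.113.0"), ipaddress.ip_address("203.0.113.255"), "Springfield"),
    (ipaddress.ip_address("198.51.100.0"), ipaddress.ip_address("198.51.100.255"), "Shelbyville"),
]

DEMOGRAPHICS = {
    "Springfield": {"median_income": 48_000},
    "Shelbyville": {"median_income": 61_000},
}

def enrich(log_line):
    ip = ipaddress.ip_address(log_line.split()[0])   # assume the IP is the first field
    for start, end, region in IP_RANGES:
        if start <= ip <= end:
            return {"ip": str(ip), "region": region, **DEMOGRAPHICS.get(region, {})}
    return {"ip": str(ip), "region": None}

print(enrich('203.0.113.7 - - [14/Jun/2012] "GET / HTTP/1.1" 200'))
```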

Marketplaces are useful in three major ways. First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume.

In general, one of the important barriers to the development of the data marketplace economy is the ability of enterprises to store and make use of the data. A principle of big data is that it's often easier to move your computation to the data, rather than the reverse. Because of this, we're seeing increasing integration between cloud computing facilities and data markets: Microsoft's data market is tied to its Azure cloud, and Infochimps offers hosted compute facilities. In the short term, it's probably easier to export data from your business systems to a cloud platform than to try to expand internal systems to integrate external sources.

While cloud solutions offer a route forward, some marketplaces also make the effort to target end-users. Microsoft's data marketplace can be accessed directly through Excel, and DataMarket provides online visualization and exploration tools.

The four most established data marketplaces are Infochimps, Factual, Microsoft Windows Azure Data Marketplace, and DataMarket. A table comparing these providers is presented at the end of this article, and a brief discussion of each marketplace follows.

Infochimps

According to founder Flip Kromer, Infochimps was created to give data life in the same way that code hosting projects such as SourceForge or GitHub give life to code. You can improve code and share it: Kromer wanted the same for data. The driving goal behind Infochimps is to connect every public and commercially available database in the world to a common platform.

Infochimps realized that there's an important network effect of "data with the data," that the best way to build a data commons and a data marketplace is to put them together in the same place. The proximity of other data makes all the data more valuable, because of the ease with which it can be found and combined.

The biggest challenge in the two years Infochimps has been operating is that of bootstrapping: a data market needs both supply and demand. Infochimps' approach is to go for a broad horizontal range of data, rather than specialize. According to Kromer, this is because they view data's value as being in the context it provides: in giving users more insight about their own data. To join up data points into a context, common identities are required (for example, a web page view can be given a geographical location by joining up the IP address of the page request with that from the IP address in an IP intelligence database). The benefit of common identities and data integration is where hosting data together really shines, as Infochimps only needs to integrate the data once for customers to reap continued benefit: Infochimps sells datasets which are pre-cleaned and integrated mash-ups of those from their providers.

By launching a big data cloud hosting platform alongside its marketplace, Infochimps is seeking to build on the importance of data locality.


Factual

Factual was envisioned by founder and CEO Gil Elbaz as an open data platform, with tools that could be leveraged by community contributors to improve data quality. The vision is very similar to that of Infochimps, but in late 2010 Factual elected to concentrate on one area of the market: geographical and place data. Rather than pursue a broad strategy, the idea is to become a proven and trusted supplier in one vertical, then expand. With customers such as Facebook, Factual's strategy is paying off.

According to Elbaz, Factual will look to expand into verticals other than local information in 2012. It is moving one vertical at a time due to the marketing effort required in building quality community and relationships around the data.

Unlike the other main data markets, Factual does not offer reselling facilities for data publishers. Elbaz hasn't found that the cash on offer is attractive enough for many organizations to want to share their data. Instead, he believes that the best way to get data you want is to trade other data, which could provide business value far beyond the returns of publishing data in exchange for cash. Factual offers incentives to its customers to share data back, improving the quality of the data for everybody.

Windows Azure Data Marketplace

Launched in 2010, Microsoft's Windows Azure Data Marketplace sits alongside the company's Applications marketplace as part of the Azure cloud platform. Microsoft's data market is positioned with a very strong integration story, both at the cloud level and with end-user tooling.

Through use of a standard data protocol, OData, Microsoft offers a well-defined web interface for data access, including queries. As a result, programs such as Excel and PowerPivot can directly access marketplace data: giving Microsoft a strong capability to integrate external data into the existing tooling of the enterprise. In addition, OData support is available for a broad array of programming languages.
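
For developers working outside Microsoft's tooling, the same feeds can be queried over plain HTTP. The sketch below is illustrative only: the service URL is a placeholder, and real marketplace datasets require the account key issued with a subscription.

```python
# A hedged sketch of consuming an OData feed from code rather than Excel.
# The service URL is a placeholder, and real Azure Data Marketplace feeds
# require the account key issued with your subscription, passed here as the
# password in HTTP Basic authentication.
import requests

SERVICE_URL = "https://api.datamarket.example/v1/SomeDataset"   # placeholder
ACCOUNT_KEY = "YOUR-ACCOUNT-KEY"                                # placeholder

resp = requests.get(
    SERVICE_URL,
    params={"$top": 10, "$format": "json"},   # standard OData query options
    auth=("", ACCOUNT_KEY),
    timeout=30,
)
resp.raise_for_status()

# OData's verbose JSON wraps rows in a "d" envelope with a "results" list.
for row in resp.json().get("d", {}).get("results", []):
    print(row)
```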

Azure Data Marketplace has a strong emphasis on connecting data consumers to publishers, and most closely approximates the popular concept of an "iTunes for Data." Big name data suppliers such as Dun & Bradstreet and ESRI can be found among the publishers. The marketplace contains a good range of data across many commercial use cases, and tends to be limited to one provider per dataset — Microsoft has maintained a strong filter on the reliability and reputation of its suppliers.

DataMarket

Where the other three main data marketplaces put a strong focus on the developer and IT customers, DataMarket caters to the end-user as well. Realizing that interacting with bland tables wasn't engaging users, founder Hjalmar Gislason worked to add interactive visualization to his platform.

The result is a data marketplace that is immediately useful for researchers and analysts. The range of DataMarket's data follows this audience too, with a strong emphasis on country data and economic indicators. Much of the data is available for free, with premium data paid at the point of use.

DataMarket has recently made a significant play for data publishers, with the emphasis on publishing, not just selling data. Through a variety of plans, customers can use DataMarket's platform to publish and sell their data, and embed charts in their own pages. At the enterprise end of their packages, DataMarket offers an interactive branded data portal integrated with the publisher's own web site and user authentication system. Initial customers of this plan include Yankee Group and Lux Research.


Data markets compared

  • Azure Data Marketplace (launched 2010) -- Data sources: broad range. Free data: yes. Free trials of paid data: yes. Delivery: OData API. Application hosting: Windows Azure. Previewing: Service Explorer. Tool integration: Excel, PowerPivot, Tableau and other OData consumers. Data publishing: via database connection or web service. Data reselling: yes, 20% commission on non-free datasets.
  • DataMarket (launched 2010) -- Data sources: range, with a focus on country and industry stats. Free data: yes. Free trials of paid data: no. Delivery: API, downloads. Application hosting: none. Previewing: interactive visualization. Tool integration: none. Data publishing: upload or web/database connection. Data reselling: yes; fees and commissions vary, with the ability to create a branded data market.
  • Factual (launched 2007) -- Data sources: geo-specialized, with some other datasets. Free data: no. Free trials of paid data: yes, limited free use of APIs. Delivery: API, downloads for heavy users. Application hosting: none. Previewing: interactive search. Tool integration: developer tool integrations. Data publishing: via upload or web service. Data reselling: no.
  • Infochimps (launched 2009) -- Data sources: range, with a focus on geo, social and web sources. Free data: yes. Free trials of paid data: no. Delivery: API, downloads. Application hosting: Infochimps Platform. Previewing: none. Tool integration: none. Data publishing: upload. Data reselling: yes, 30% commission on non-free datasets.


Other data suppliers

While this article has focused on the more general purpose marketplaces, several other data suppliers are worthy of note.

Social data: Gnip and DataSift specialize in offering social media data streams, in particular Twitter.

Linked data: Kasabi, currently in beta, is a marketplace that is distinctive for hosting all its data as Linked Data, accessible via web standards such as SPARQL and RDF.

Wolfram Alpha: Perhaps the most prolific integrator of diverse databases, Wolfram Alpha recently added a Pro subscription level that permits the end user to download the data resulting from a computation.


March 06 2012

Profile of the Data Journalist: The Data Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Meghan Hoyer (@MeghanHoyer) is a data editor based in Virginia. Our interview follows.

Where do you work now? What is a day in your life like?

I work in an office within The Virginian-Pilot’s newsroom. I’m a one-person team, so there’s no such thing as typical.

What I might do: Help a reporter pull Census data, work with IT on improving our online crime report app, create a DataTable of city property assessment changes, and plan training for a group of co-workers who’d like to grow their online skills. At least, that’s what I’m doing today.

Tomorrow, it’ll be helping with our online election report, planning a strategy to clean a dirty database, and working with a reporter to crunch data for a crime trend story.

How did you get started in data journalism? Did you get any special degrees or certificates?

I have a journalism degree from Northwestern, but I got started the same way most reporters probably got started - I had questions about my community and I wanted quantifiable answers. How had the voting population in a booming suburb changed? Who was the region’s worst landlord? Were our localities going after delinquent taxpayers? Anecdotes are nice, but it’s an amazingly powerful thing to be able to get the true measure of a situation. Numbers and analysis help provide a better focus - and sometimes, they upend entirely your initial theory.

Did you have any mentors? Who? What were the most important resources they shared with you?

I haven’t collected a singular mentor as much as a group of people whose work I keep tabs on, for inspiration and follow-up. The news community is pretty small. A lot of people have offered suggestions, guidance, cheat sheets and help over the years. Data journalism - from analysis to building apps -- is definitely not something you can or need to learn in a bubble all on your own.

What does your personal data journalism "stack" look like? What tools could you not live without?

In terms of daily tools, I keep it basic: Google docs, Fusion Tables and Refine, QGIS, SQLite and Excel are all in use pretty much every day.

I’ve learned some Python and JavaScript for specific projects and to automate some of the newsroom’s daily tasks, but I definitely don’t have the programming or technical background that a lot of people in this field have. That’s left me trying to learn as much as I can as quick as I can.

In terms of a data stack, we keep information such as public employee salaries, land assessment databases and court record databases (among others) updated in a shared drive in our newsroom. It’s amazing how often reporters use them, even if it’s just to find out which properties a candidate owns or how long a police officer caught at a DUI checkpoint has been on the force.
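
As an illustration of that kind of routine, here is a hedged sketch of a daily refresh job that downloads a public CSV and rebuilds a table in a shared SQLite file; the URL, file path, and columns are hypothetical rather than the newsroom's actual setup.

```python
# A hedged sketch of the kind of daily automation described above: pull a
# public CSV and rebuild a table in a shared SQLite file so reporters always
# query current data. The URL, file path, and column names are hypothetical.
import csv
import io
import sqlite3

import requests

SOURCE_URL = "https://example.gov/data/employee_salaries.csv"   # placeholder
DB_PATH = "shared_drive/newsroom.db"                            # placeholder

def refresh_salaries():
    text = requests.get(SOURCE_URL, timeout=60).text
    rows = list(csv.DictReader(io.StringIO(text)))
    con = sqlite3.connect(DB_PATH)
    with con:  # wraps the statements in a single transaction
        con.execute("DROP TABLE IF EXISTS salaries")
        con.execute("CREATE TABLE salaries (name TEXT, agency TEXT, salary REAL)")
        con.executemany(
            "INSERT INTO salaries VALUES (?, ?, ?)",
            [(r["name"], r["agency"], r["salary"]) for r in rows],
        )
    con.close()

if __name__ == "__main__":
    refresh_salaries()   # run once a day from cron or Task Scheduler
```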

What data journalism project are you the most proud of working on or creating?

A few years ago, I combined property ownership records, code enforcement citations, real estate tax records and rental inspection information from all our local cities and found a company with hundreds of derelict properties.

Their properties seemed to change hands often, so a partner and I then hand-built a database from thousands of land deeds that proved the company was flipping houses among investors in a $26 million mortgage fraud scheme. None of the cities in our region had any idea this was going on because they were dealing with each parcel as a separate entity.

That’s what combining sets of data can get you - a better overall view of what’s really happening. While government agencies are great at collecting piles of data, it’s that kind of larger analysis that’s missing.

Where do you turn to keep your skills updated or learn new things?

To be honest - Twitter. I get a lot of ideas and updates on new tools there. And the NICAR conference and listserv. Usually when you hit up against a problem - whether it’s dealing with a dirty dataset or figuring out how to best visualize your data -- it’s something that someone else has already faced.

I also learn a lot from the people within our newsroom. We have a talented group of web producers who all are eager to try new things and learn.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Data is everywhere, but in most cases it’s just stockpiled and warehoused without a second thought to analysis or using it to solve larger problems.

Journalists are in a unique position to make sense of it, to find the stories in it, to make sure that governments and organizations are considering the larger picture.

I think, too, that people in our field need to truly push for open government in the sense not of government building interfaces for data, but for just releasing raw data streams. Government is still far too stuck in the “Here’s a PDF of a spreadsheet” mentality. That doesn’t create informed citizens, and it doesn’t lead to innovative ways of thinking about government.

I’ve been involved recently in a community effort to create an API and then apps out of the regional transit authority’s live bus GPS stream. It has been a really fun project - and something that I hope other local governments in our area take note of.
