
March 01 2012

Profile of the Data Journalist: The Elections Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Derek Willis (@derekwillis) is a news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work for The New York Times as a developer in the Interactive News Technologies group. A day in my work life usually includes building or improving web applications relating to politics, elections and Congress, although I also get the chance to branch out to do other things. Since elections are such an important subject, I try to think of ways to collect information we might want to display and of ways to get that data in front of readers in an intelligent and creative manner.

How did you get started in data journalism? Did you get any special degrees or certificates?

No, I started working with databases in graduate school at the University of Florida (I left for a job before finishing my master's degree). I had an assistantship at an environmental occupations training center and part of my responsibilities was to maintain the mailing list database. And I just took to it - I really enjoyed working with data, and once I found Investigative Reporters & Editors, things just took off for me.

Did you have any mentors? Who? What were the most important resources they shared with you?

A ton of mentors, mostly met through IRE but also people at my first newspaper job at The Palm Beach Post. A researcher there, Michelle Quigley, taught me how to find information online and how sometimes you might need to take an indirect route to locating the stuff you want. Kinsey Wilson, now the chief content officer at NPR, hired me at Congressional Quarterly and constantly challenged me to think bigger about data and the news. And my current and former colleagues at The Times and The Washington Post are an incredible source of advice, counsel and inspiration.

What does your personal data journalism "stack" look like? What tools could you not live without?

It's pretty basic: spreadsheets, databases (MySQL, PostgreSQL, SQLite) and a programming language like Python or, these days, Ruby. I've been lucky to find excellent tools in the Ruby world, such as the Remote Table gem by Brighter Planet, and a host of others. I like PostGIS for mapping stuff.
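
As a concrete illustration of that stack, here is a minimal sketch, using only the Python standard library, of loading a CSV into SQLite and querying it. The file name and columns are hypothetical, not from any Times project.

```python
import csv
import sqlite3

# Load a (hypothetical) CSV of campaign contributions into SQLite,
# then ask a simple question of it -- the kind of quick query this
# spreadsheets-plus-SQL workflow makes routine.
conn = sqlite3.connect("elections.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS contributions (
        candidate TEXT,
        donor     TEXT,
        amount    REAL
    )
""")

with open("contributions.csv", newline="") as f:
    rows = [(r["candidate"], r["donor"], float(r["amount"]))
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO contributions VALUES (?, ?, ?)", rows)
conn.commit()

# Total raised per candidate, largest first.
for candidate, total in conn.execute(
        "SELECT candidate, SUM(amount) FROM contributions "
        "GROUP BY candidate ORDER BY SUM(amount) DESC"):
    print(candidate, total)
```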

What data journalism project are you the most proud of working on or creating?

I'm really proud of the elections work at The Times, but can't take credit for how good it looks. A project called Toxic Waters was also incredibly challenging and rewarding to work on. But my favorite might be the first one: the Congressional Votes Database that Adrian Holovaty, Alyson Hurt and I created at The Post in late 2005. It was a milestone for me and for The Post, and helped set the bar for what news organizations could do with data on the web.

Where do you turn to keep your skills updated or learn new things?

My colleagues are my first source. When you work with Jeremy Ashkenas, the author of the Backbone and Underscore JavaScript libraries, you see and learn new things all the time. Our team is constantly bouncing new concepts around. I wish I had more time to learn new things; maybe after the elections!

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

A couple of reasons: one is that we live in an age where information is plentiful. Tools that can help distill and make sense of it are valuable. They save time and convey important insights. News organizations can't afford to cede that role. The second is that they really force you to think about how the reader/user is getting this information and why. I think news apps demand that you don't just build something because you like it; you build it so that others might find it useful.

This email interview has been edited and condensed for clarity.

Profile of the Data Journalist: The Long Form Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Dan Nguyen (@dancow) is an investigative developer/journalist based in Manhattan. Our interview follows.

Where do you work now? What is a day in your life like?

I'm a news app developer at ProPublica, where I've worked for about 3.5 years. It's hard to say what a typical day is like. Ideally, I either have a project or am writing code to collect the data to determine whether a project is worth doing (or just doing old-fashioned reading of articles and papers that may spark ideas for things to look at). We're a small operation, so we also have our hands in the daily news production, including helping reporters put together online features for their more print-focused work.

How did you get started in data journalism? Did you get any special degrees or certificates?

I stumbled into data journalism because I had always been interested in being a journalist, but I double majored in journalism and computer engineering just in case the job market didn't work out. Out of college, I got a good job as a traditional print reporter at a regional newspaper but was eventually asked to help with the newsroom's online side. I got back into programming and started to realize there was a role for programming in important journalism.

Did you have any mentors? Who? What were the most important resources they shared with you?

The mix of programming and journalism is still relatively new, so I didn't have any formal mentors in it. I was of course lucky that my boss at ProPublica, Scott Klein, had a great vision for the role of news applications in our investigative journalism. We were also fortunate to have Brian Boyer (now the news applications editor at the Tribune Company) work with us as we started doing news apps with Ruby on Rails, as he had come into journalism from being a professional developer.

What does your personal data journalism "stack" look like? What tools could you not live without?

In terms of day-to-day tools, I use RVM (Ruby Version Manager) to run multiple versions of Ruby, which is my all-purpose tool for any kind of batch task work, text processing and parsing, number crunching, and of course Ruby on Rails development. Git, of course, is essential, and I combine that with Dropbox to keep versioned copies of personal projects and data work. On top of that, my most frequently used tool is Google Refine, which takes the tedium out of exploring new data sets, especially if I have to clean them.
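
For a sense of the kind of cleanup Refine takes the tedium out of, here is a minimal Python sketch that normalizes messy name strings before analysis; the sample values and the normalization rules are hypothetical.

```python
import re

# Hypothetical messy values of the kind a new dataset often contains.
raw_names = ["Pfizer Inc.", "PFIZER, INC", " pfizer inc ", "Pfizer"]

def normalize(name):
    """Lowercase, strip punctuation/extra whitespace, and drop a trailing 'inc'."""
    cleaned = re.sub(r"[^\w\s]", "", name).strip().lower()
    cleaned = re.sub(r"\s+", " ", cleaned)
    return re.sub(r"\binc\b$", "", cleaned).strip()

print({normalize(n) for n in raw_names})  # all four variants collapse to 'pfizer'
```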

What data journalism project are you the most proud of working on or creating?

The project I'm most proud of is something I did before SOPA Opera, which was our Dollars for Docs project in 2010. It started off with just a blog post I wrote to teach other journalists how web scraping was useful. In this case, I scraped a website Pfizer used to disclose what it paid doctors to do promotional and consulting work. My colleagues noticed and said that we could do that for every company that had been disclosing payments. Because each company disclosed these payments in a variety of formats, including Flash containers and PDFs, few people had tried to analyze these disclosures in bulk, to see nationwide trends in these financial relationships.
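
The scraping step described above might look roughly like the following sketch, using requests and BeautifulSoup; the URL and table layout are hypothetical stand-ins, not the actual Pfizer disclosure site.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical disclosure page containing an HTML table of payments.
URL = "https://example.com/doctor-payments"

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:          # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 3:                         # doctor, category, amount
        rows.append(cells[:3])

with open("payments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doctor", "category", "amount"])
    writer.writerows(rows)

print(f"Scraped {len(rows)} payment records")
```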

A lot of the data work happened behind the scenes, including writing dozens of scrapers to cross-reference our database of payments with state medical board and med school listings. For the initial story, we teamed up with five other newsrooms, including NPR and the Boston Globe, which required programmatically creating a system in which we could coordinate data and research. With all the data we had, and the number of reporters and editors working on this outside of our walls, this wasn't a project that would've succeeded by just sending Excel files back and forth.

The website we built from that data is our most visited project yet, as millions of people used it to look up their doctors. Afterwards, we shared our data with any news outlet that asked, and hundreds of independently reported stories came from our data. Among the results: the drug companies and the med schools revisited their screening and conflict-of-interest policies.

So, in terms of impact, Dollars for Docs is the project I'm proudest of. But it shares something in common with SOPA Opera (which was mostly a solo project that took a couple of weeks), in that both projects were based on already well-known and long-ago-publicized data. But with data journalism techniques, there are countless new angles to important issues, and countless new and interesting ways to tell their stories.

Where do you turn to keep your skills updated or learn new things?

I check Hacker News and the programming subreddit constantly to see what new hacks, projects, and plugins the community is putting out. I also have a huge backlog of programming books on my Kindle, some of them free ones posted on HN.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

I went into journalism because I wanted to be a longform writer in the tradition of the New Yorker. But I'm fortunate that I stumbled onto the path of using programming to do journalism; more and more, I'm seeing how important stories aren't being done even though the data and information are out in broad daylight (as they were in D4D and SOPA Opera) because we have relatively few journalists with the skills or mindset to process and understand that data. Of course, doing this work doesn't preclude me from presenting in a longform article; it just happens that programming also provides even more ways to present a story when narrative isn't the only (or the ideal) way to do so.

Strata Week: DataSift lets you mine two years of Twitter data

Here are a few of the data stories that caught my attention this week.

Twitter's historical archives, via DataSift

DataSift, one of the two companies with official access to the Twitter firehose (the other being Gnip), announced its new Historics service this week, giving customers access to up to two years' worth of historical Tweets. (By comparison, Gnip offers 30 days of Twitter data, and other developers and users have access to roughly a week's worth of Tweets.)

GigaOm's Barb Darrow responded to those who might be skeptical about the relevance of this sort of historical Twitter data from a service that emphasizes real-time. Darrow noted that DataSift CEO Rob Bailey said companies planning new products, promotions or price changes would do well to study the impact of their past actions before proceeding, and that Twitter is the perfect venue for that.

Another indication of the desirability of this new Twitter data: the waiting list for Historics already includes a number of Fortune 500 companies. The service will get its official launch in April.

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

Building a school of data

Although there are plenty of ways to receive formal training in math, statistics and engineering, there aren't a lot of options when it comes to an education specifically in data science.

To that end, the Open Knowledge Foundation and Peer to Peer University (P2PU) have proposed a School of Data, arguing that:

"It will be years before data specialist degree paths become broadly available and accepted, and even then, time-intensive degree courses may not be the right option for journalists, activists, or computer programmers who just need to add data skills to their existing expertise. What is needed are flexible, on-demand, shorter learning options for people who are actively working in areas that benefit from data skills, particularly those who may have already left formal education programmes."

The organizations are seeking volunteers to help develop the project, whether that's in the form of educational materials, learning challenges, mentorship, or a potential student body.

Strata in California

The Strata Conference wraps up today in Santa Clara, Calif. If you missed Strata this year and weren't able to catch the livestream of the conference, look for excerpts and videos posted here on Radar and through the O'Reilly YouTube channel in the coming weeks.

And be sure to make plans for Strata New York, being held October 23-25. That event will mark the merger with Hadoop World. The call for speaker proposals for Strata NY is now open.

Got data news?

Feel free to email me.


February 14 2012

The bond between data and journalism grows stronger

While reporters and editors have been the traditional vectors for information gathering and dissemination, the flattened information environment of 2012 now has news breaking first online, not on the newsdesk.

That doesn't mean that the integrated media organizations of today don't play a crucial role. Far from it. In the information age, journalists are needed more than ever to curate, verify, analyze and synthesize the wash of data.

To learn more about the shifting world of data journalism, I interviewed Liliana Bounegru (@bb_liliana), project coordinator of SYNC3 and Data Driven Journalism at the European Journalism Centre.

What's the difference between the data journalism of today and the computer-assisted reporting (CAR) of the past?

Liliana Bounegru: There is a "continuity and change" debate going on around the label "data journalism" and its relationship with previous journalistic practices that employ computational techniques to analyze datasets.

Some argue [PDF] that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense, data journalism pays equal attention to finding stories and to the data itself. Hence, we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.

Another difference is that in the past, investigative reporters would suffer from a poverty of information relating to a question they were trying to answer or an issue that they were trying to address. While this is, of course, still the case, there is also an overwhelming abundance of information that journalists don't necessarily know what to do with. They don't know how to get value out of data. As Philip Meyer recently wrote to me: "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important."

On the other hand, some argue that there is no difference between data journalism and computer-assisted reporting. It is by now common sense that even the most recent media practices have histories as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition but responding to new circumstances and conditions. Even if there might not be a difference in goals and techniques, the emergence of the label "data journalism" at the beginning of the century indicates a new phase wherein the sheer volume of data that is freely available online combined with sophisticated user-centric tools enables more people to work with more data more easily than ever before. Data journalism is about mass data literacy.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

What does data journalism mean for the future of journalism? Are there new business models here?

Liliana Bounegru: There are all kinds of interesting new business models emerging with data journalism. Media companies are becoming increasingly innovative in the way they produce revenue, moving away from subscription-based models and advertising toward offering consultancy services, as in the case of the award-winning German organization OpenDataCity.

Digital technologies and the web are fundamentally changing the way we do journalism. Data journalism is one part in the ecosystem of tools and practices that have sprung up around data sites and services. Quoting and sharing source materials (structured data) is in the nature of the hyperlink structure of the web and in the way we are accustomed to navigating information today. By enabling anyone to drill down into data sources and find information that is relevant to them as individuals or to their community, as well as to do fact checking, data journalism provides a much needed service coming from a trustworthy source. Quoting and linking to data sources is specific to data journalism at the moment, but seamless integration of data in the fabric of media is increasingly the direction journalism is going in the future. As Tim Berners-Lee says, "data-driven journalism is the future".

What data-driven journalism initiatives have caught your attention?

Liliana Bounegru: The data journalism project FarmSubsidy.org is one of my favorites. It addresses a real problem: The European Union (EU) is spending 48% of its budget on agriculture subsidies, yet the money doesn't reach those who need it.

Tracking payments and recipients of agriculture subsidies from the European Union to all member states is a difficult task. The data is scattered in different places in different formats, with some missing and some scanned in from paper records. It is hard to piece it together to form a comprehensive picture of how funds are distributed. The project not only made the data available to anyone in an easy to understand way, but it also advocated for policy changes and better transparency laws.

LRA Crisis Tracker

Another of my favorite examples is the LRA Crisis Tracker, a real-time crisis mapping platform and data collection system. The tracker makes information about the attacks and movements of the Lord's Resistance Army (LRA) in Africa publicly available. It helps to inform local communities, as well as the organizations that support the affected communities, about the activities of the LRA through an early-warning radio network in order to reduce their response time to incidents.

I am also a big fan of much of the work done by the Guardian Datablog. You can find lots of other examples featured on datadrivenjournalism.net, along with interviews, case studies and tutorials.

I've talked to people like Chicago Tribune news app developer Brian Boyer about the emerging "newsroom stack." What do you feel are the key tools of the data journalist?

Liliana Bounegru: Experienced data journalists list spreadsheets as a top data journalism tool. Open source tools and web-based applications for data cleaning, analysis and visualization play very important roles in finding and presenting data stories. I have been involved in organizing several workshops on ScraperWiki and Google Refine for data collection and analysis. We found that participants were quickly able to ask and answer new kinds of questions with these tools.

How does data journalism relate to open data and open government?

Liliana Bounegru: Open government data means that more people can access and reuse official information published by government bodies. This in itself is not enough. It is increasingly important that journalists can keep up and are equipped with skills and resources to understand open government data. Journalists need to know what official data means, what it says and what it leaves out. They need to know what kind of picture is being presented of an issue.

Public bodies are very experienced in presenting data to the public in support of official policies and practices. Journalists, however, will often not have this level of literacy. Only by equipping journalists with the skills to use data more effectively can we break the current asymmetry, where our understanding of the information that matters is mediated by governments, companies and other experts. In a nutshell, open data advocates push for more data, and data journalists help the public to use, explore and evaluate it.

This interview has been edited and condensed for clarity.

Photo on associated home and category pages: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr.


February 09 2012

Strata Week: Your personal automated data scientist

Here are a few of the data stories that caught my attention this week:

Wolfram|Alpha Pro: An on-call data scientist

The computational knowledge engine Wolfram|Alpha unveiled a pro version this week. For $4.99 per month ($2.99 for students), Wolfram|Alpha Pro offers access to more of the computational power "under the hood" of the site, in part by allowing users to upload their own datasets, which Wolfram|Alpha will in turn analyze.

This includes:

  • Text files — Wolfram|Alpha will respond with the character and word count, provide an estimate on how long it would take to read aloud, and reveal the most common word, average sentence length and more.
  • Spreadsheets — It will crunch the numbers and return a variety of statistics and graphs.
  • Image files — It will analyze the image's dimensions, size, and colors, and let you apply several different filters.
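
To make the text-file case concrete, here is a minimal Python sketch that computes a few of the same statistics; it illustrates the idea only and says nothing about how Wolfram|Alpha actually implements it.

```python
import re
from collections import Counter

def summarize(text, words_per_minute=150):
    """Character/word counts, most common word, average sentence length,
    and a rough read-aloud estimate -- the kind of summary described above."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "most_common_word": Counter(words).most_common(1)[0][0] if words else None,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0,
        "minutes_to_read_aloud": len(words) / words_per_minute,
    }

print(summarize("Big data is data that exceeds the processing capacity "
                "of conventional systems. It moves fast. It is messy."))
```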

Wolfram|Alpha Pro subscribers can upload and analyze their own datasets.

There's also a new extended keyboard that contains the Greek alphabet and other special characters for manually entering data. Data and analysis from these entries and any queries can also be downloaded.

"In a sense," writes Wolfram's founder Stephen Wolfram, "the concept is to imagine what a good data scientist would do if confronted with your data, then just immediately and automatically do that — and show you the results."


Crisis-mapping and data protection standards

Ushahidi's Patrick Meier takes a look at the recently released Data Protection Manual issued by the International Organization for Migration (IOM). According to the IOM, the manual is meant to serve as a guide to help:

" ... protect the personal data of the migrants in its care. It follows concerns about the general increase in data theft and loss and the recognition that hackers are finding ever more sophisticated ways of breaking into personal files. The IOM Data Protection Manual aims to protect the integrity and confidentiality of personal data and to prevent inappropriate disclosure."

Meier describes the manual as "required reading" but notes that there is no mention of social media in the 150-page document. "This is perfectly understandable given IOM's work," he writes, "but there is no denying that disaster-affected communities are becoming more digitally-enabled — and thus, increasingly the source of important, user-generated information."

Meier moves through the Data Protection Manual's principles, highlighting the ones that may be challenged when it comes to user-generated, crowdsourced data and raising important questions about consent, privacy, and security.

Doubting the dating industry's algorithms

Many online dating websites claim that their algorithms are able to help match singles with their perfect mate. But a forthcoming article in "Psychological Science in the Public Interest," a journal of the Association for Psychological Science, casts some doubt on the data science of dating.

According to the article's lead author Eli Finkel, associate professor of social psychology at Northwestern University, "there is no compelling evidence that any online dating matching algorithm actually works." Finkel argues that dating sites' algorithms do not "adhere to the standards of science," and adds that "it is unlikely that their algorithms can work, even in principle, given the limitations of the sorts of matching procedures that these sites use."

It's "relationship science" versus the in-take questions that most dating sites ask in order to help users create their profiles and suggest matches. Finkel and his coauthors note that some of the strongest predictors for good relationships — such as how couples interact under pressure — aren't assessed by dating sites.

The paper calls for the creation of a panel to grade the scientific credibility of each online dating site.

Got data news?

Feel free to email me.


January 31 2012

Embracing the chaos of data

A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he'll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.

What do you mean by asking data scientists to embrace the chaos of data?

Pete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.

Are we wasting time trying to make unstructured data structured?

Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can't get it. Most structured data is the result of years of effort, so it is only available with a lot of strings, either financial or through usage restrictions.

The first advantage of unstructured data is that it's widely available because the producers don't see much value in it. The second advantage is that because there's no "structuring" work required, there's usually a lot more of it, so you get much broader coverage.

A good comparison is Yahoo's highly structured web directory versus Google's search index built on unstructured HTML soup. If you were looking for something covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google's broad coverage made it more useful. For example, I hear that 30% of search queries are "once in history" events — unique combinations of terms that never occur again.

Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it's harder to get started, but the potential rewards are much greater.

How do you see data tools developing over the next few years? Will they become more accessible to more people?

Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.
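
With an open-source geocoding library such as geopy (my example; Warden doesn't name one), that address-to-coordinates step reduces to a few lines:

```python
from geopy.geocoders import Nominatim

# Nominatim is OpenStreetMap's free geocoder; a descriptive user_agent is required.
geolocator = Nominatim(user_agent="address-demo")

location = geolocator.geocode("1600 Pennsylvania Ave NW, Washington, DC")
if location:
    print(location.latitude, location.longitude)
```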


Associated photo on home and category pages: "mess with graphviz" by Toms Bauģis, on Flickr.


January 13 2012

Top Stories: January 9-14, 2012

Here's a look at the top stories published across O'Reilly sites this week.

What is big data?
It's the hot trend in software right now, but what does big data mean, and how can you exploit it? Strata chair Edd Dumbill presents an introduction and orientation to the big data landscape.

Can Maryland's other "CIO" cultivate innovation in government?
Maryland's first chief innovation officer, Bryan Sivak, is looking for the levers that will help state government to be smarter, not bigger. From embracing collective intelligence to data-driven policy, Sivak is defining what it means to be innovative in government.

Three reasons why we're in a golden age of publishing entrepreneurship
Books, publishing processes and readers have all made the jump to digital, and that's creating considerable opportunities for publishing startups.

The rise of programmable self
Taking a cue from the Quantified Self movement, the programmable self is the combination of a digital motivation hack with a digital system that tracks behavior. Fred Trotter looks at companies and projects relevant to the programmable-self space.

A venture into self-publishing
Scott Berkun turned to self-publishing with his latest book, "Mindfire." In this TOC podcast, Berkun discusses the experience and says the biggest surprise was the required PR effort.


Tools of Change for Publishing, being held February 13-15 in New York, is where the publishing and tech industries converge. Register to attend TOC 2012.

January 11 2012

What is big data?

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today's commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. Big data processing is eminently feasible even for small garage startups, which can cheaply rent server time in the cloud.

The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers' transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.

The past decade's successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It's no coincidence that the lion's share of ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook.

The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. Whether creating new products or looking for ways to gain competitive advantage, the job calls for curiosity and an entrepreneurial outlook.


What does big data look like?

As a catch-all term, "big data" can be pretty nebulous, in the same way that the term "cloud" covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing?

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They're a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.


Volume

The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. Having more data beats out having better models: simple bits of math can be unreasonably effective given large amounts of data. If you could run that forecast taking into account 300 factors rather than 6, could you predict demand better?

This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage, and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into a choice between massively parallel processing architectures — data warehouses or databases such as Greenplum — and Apache Hadoop-based solutions. This choice is often informed by the degree to which one of the other "Vs" — variety — comes into play. Typically, data warehousing approaches involve predetermined schemas, suiting a regular and slowly evolving dataset. Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process.

At its core, Hadoop is a platform for distributing computing problems across a number of servers. First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data: the "map" stage. The partial results are then recombined: the "reduce" stage.
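
As a minimal sketch of those two stages, here is the classic word count in Python. In a real Hadoop Streaming job the map and reduce functions would run as separate scripts across many servers, but the logic is the same.

```python
from itertools import groupby
from operator import itemgetter

def map_stage(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_stage(pairs):
    """Reduce: sum the counts for each word after the shuffle/sort."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data is data", "data moves fast"]
print(dict(reduce_stage(map_stage(lines))))
# {'big': 1, 'data': 3, 'fast': 1, 'is': 1, 'moves': 1}
```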

To store data, Hadoop utilizes its own distributed filesystem, HDFS, which makes data available to multiple computing nodes. A typical Hadoop usage pattern involves three stages:

  • loading data into HDFS,
  • MapReduce operations, and
  • retrieving results from HDFS.

This process is by nature a batch operation, suited for analytical or non-interactive computing tasks. Because of this, Hadoop is not itself a database or data warehouse solution, but can act as an analytical adjunct to one.

One of the most well-known Hadoop users is Facebook, whose model follows this pattern. A MySQL database stores the core data. This is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends' interests. Facebook then transfers the results back into MySQL, for use in pages served to users.


Velocity

The importance of data's velocity — the increasing rate at which data flows into an organization — has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. Now it's our turn.

Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Online retailers are able to compile large histories of customers' every click and interaction: not just the final sales. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. The smartphone era increases again the rate of data inflow, as consumers carry with them a streaming source of geolocated imagery and audio data.

It's not just the velocity of the incoming data that's the issue: it's possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision. A commercial from IBM makes the point that you wouldn't cross the road if all you had was a five-minute old snapshot of traffic location. There are times when you simply won't be able to wait for a report to run or a Hadoop job to complete.

Industry terminology for such fast-moving data tends to be either "streaming data" or "complex event processing." The latter term was more established in product categories before streaming data processing gained wider relevance, and it seems likely to diminish in favor of "streaming."

There are two main reasons to consider streaming processing. The first is when the input data are too fast to store in their entirety: in order to keep storage requirements practical some level of analysis must occur as the data streams in. At the extreme end of the scale, the Large Hadron Collider at CERN generates so much data that scientists must discard the overwhelming majority of it — hoping hard they've not thrown away anything useful. The second reason to consider streaming is where the application mandates immediate response to the data. Thanks to the rise of mobile applications and online gaming this is an increasingly common situation.
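
A toy sketch of the first case: keep only a small running aggregate as records stream past, rather than storing every record for a later batch job. The event stream here is simulated.

```python
from collections import Counter
import random

def event_stream(n):
    """Simulate a fast stream of events that is too big to keep in full."""
    for _ in range(n):
        yield {"page": random.choice(["/home", "/signup", "/pricing"])}

# Analyze on the fly: update a small summary and discard each raw event.
page_counts = Counter()
for event in event_stream(1_000_000):
    page_counts[event["page"]] += 1

print(page_counts.most_common(3))
```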

Product categories for handling streaming data divide into established proprietary products such as IBM's InfoSphere Streams, and the less-polished and still emergent open source frameworks originating in the web industry: Twitter's Storm and Yahoo's S4.

As mentioned above, it's not just about input data. The velocity of a system's outputs can matter too. The tighter the feedback loop, the greater the competitive advantage. The results might go directly into a product, such as Facebook's recommendations, or into dashboards used to drive decision-making.

It's this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of precomputed information. These databases form part of an umbrella category known as NoSQL, used when relational models aren't the right fit.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.

Variety

Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn't fall into neat relational structures. It could be text from social networks, image data, a raw feed directly from a sensor source. None of these things come ready for integration into an application.

Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. Different browsers send different data, users withhold information, they may be using differing software versions or vendors to communicate with you. And you can bet that if part of the process involves a human, there will be error and inconsistency.

A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don't want to be guessing.

The process of moving from source data to processed application data involves the loss of information. When you tidy up, you end up throwing stuff away. This underlines a principle of big data: when you can, keep everything. There may well be useful signals in the bits you throw away. If you lose the source data, there's no going back.

Despite the popularity and well understood nature of relational databases, it is not the case that they should always be the destination for data, even when tidied up. Certain data types suit certain classes of database better. For instance, documents encoded as XML are most versatile when stored in a dedicated XML store such as MarkLogic. Social network relations are graphs by nature, and graph databases such as Neo4J make operations on them simpler and more efficient.

Even where there's not a radical data type mismatch, a disadvantage of the relational database is the static nature of its schemas. In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.

In practice

We have explored the nature of big data, and surveyed the landscape of big data from a high level. As usual, when it comes to deployment there are dimensions to consider over and above tool selection.

Cloud or in-house?

The majority of big data solutions are now provided in three forms: software-only, as an appliance, or cloud-based. Decisions about which route to take will depend, among other things, on issues of data locality, privacy and regulation, human resources and project requirements. Many organizations opt for a hybrid solution: using on-demand cloud resources to supplement in-house deployments.

Big data is big

It is a fundamental fact that data that is too big to process conventionally is also too big to transport anywhere. IT is undergoing an inversion of priorities: it's the program that needs to move, not the data. If you want to analyze data from the U.S. Census, it's a lot easier to run your code on Amazon's web services platform, which hosts such data locally, and won't cost you time or money to transfer it.

Even if the data isn't too big to move, locality can still be an issue, especially with rapidly updating data. Financial trading systems crowd into data centers to get the fastest connection to source data, because that millisecond difference in processing time equates to competitive advantage.

Big data is messy

It's not all about infrastructure. Big data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place, as Pete Warden observes in his Big Data Glossary: "I probably spend more time turning messy source data into something usable than I do on the rest of the data analysis process combined."

Because of the high cost of data acquisition and cleaning, it's worth considering what you actually need to source yourself. Data marketplaces are a means of obtaining common data, and you are often able to contribute improvements back. Quality can of course be variable, but will increasingly be a benchmark on which data marketplaces compete.

Culture

The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming and scientific instinct. Benefiting from big data means investing in teams with this skillset, and surrounding them with an organizational willingness to understand and use data for advantage.

In his report, "Building Data Science Teams," D.J. Patil characterizes data scientists as having the following qualities:

  • Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
  • Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
  • Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
  • Cleverness: the ability to look at a problem in different, creative ways.

The far-reaching nature of big data analytics projects can have uncomfortable aspects: data must be broken out of silos in order to be mined, and the organization must learn how to communicate and interpret the results of analysis.

Those skills of storytelling and cleverness are the gateway factors that ultimately dictate whether the benefits of analytical labors are absorbed by an organization. The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.


Know where you want to go

Finally, remember that big data is no panacea. You can find patterns and clues in your data, but then what? Christer Johnson, IBM's leader for advanced analytics in North America, gives this advice to businesses starting out with big data: first, decide what problem you want to solve.

If you pick a real business problem, such as how you can change your advertising strategy to increase spend per customer, it will guide your implementation. While big data work benefits from an enterprising spirit, it also benefits strongly from a concrete goal.



December 14 2011

Five big data predictions for 2012

As the "coming out" year for big data and data science draws to a close, what can we expect over the next 12 months?

More powerful and expressive tools for analysis

This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won't be long before that's dull, yet necessary, infrastructure.

Looking up the stack, there's already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there's a way to go in making big data more powerful: that is, to decrease the cost of creating experiments.

Here are two ways in which big data can be made more powerful.

  1. Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idioms that let us focus on the data, rather than abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we're doing with the data (see the sketch after this list). These abstractions will in turn lend themselves to the creation of better tools for non-programmers.
  2. We require better support for interactivity. If Hadoop has any weakness, it's in the batch-oriented nature of computation it fosters. The agile nature of data science will favor any tool that permits more interactivity.
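
A small illustration of the first point, contrasting machinery-heavy code with an idiom that keeps the data front and center (plain Python here, standing in for the higher-level tools the item imagines):

```python
from collections import Counter

lines = ["big data is data", "data moves fast"]

# Machinery-first: the bookkeeping obscures the question being asked.
counts = {}
for line in lines:
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1

# Data-first: the same computation, stated almost as a sentence.
counts = Counter(word for line in lines for word in line.split())

print(counts.most_common(2))  # [('data', 3), ('big', 1)]
```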

Streaming data processing

Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.

Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop was born out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.

For some applications, there just isn't enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data or make decisions about discarding it without having to go through the store-compute loop of map/reduce.

Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.

Rise of data marketplaces

Your own data can become that much more potent when mixed with other datasets. For instance, add in weather conditions to your customer data, and discover if there are weather related patterns to your customers' purchasing patterns. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft's direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.
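
A sketch of that weather-plus-sales idea using pandas; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical in-house sales data and purchased weather data, keyed by date.
sales = pd.DataFrame({
    "date": ["2012-01-02", "2012-01-02", "2012-01-03"],
    "amount": [120.0, 80.0, 40.0],
})
weather = pd.DataFrame({
    "date": ["2012-01-02", "2012-01-03"],
    "condition": ["rain", "sun"],
})

# Join the two datasets on date, then look for weather-related patterns.
merged = sales.merge(weather, on="date")
print(merged.groupby("condition")["amount"].agg(["count", "mean"]))
```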

Development of data science workflows and tools

As data science teams become a recognized part of companies, we'll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company's business operations, as opposed to being a sidecar analysis team.

Software developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners. Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum's Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.

Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the groundbreaking work data teams are doing at news organizations such as The Guardian and The New York Times: given short timescales, these teams take data from raw form to a finished product, working hand-in-hand with the journalists.

Increased understanding of and demand for visualization

Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.

If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.

Throughout a year dominated by businesses' constant demand for data scientists, I've repeatedly heard from data scientists about what they want most: people who know how to create visualizations.



December 13 2011

Tapping into a world of ambient data

More data was transmitted over the Internet in 2010 than in all other years combined. That's one reason why this year's Web 2.0 Summit used the "data frame" to explore the new landscape of digital business — from mobile to social to location to government.

Microsoft is part of this conversation about big data, given the immense resources and technical talent the Redmond-based software giant continues to hold. During Web 2.0 Summit, I interviewed Microsoft Fellow David Campbell about his big data work and thinking. Key excerpts from our interview follow.

What's Microsoft's role in the present and future of big data?

David Campbell: I've been a data geek for 25-plus years. You go back five to seven years ago, it was kind of hard to get some of the younger kids to think that the data space was interesting to solve problems. Databases are kind of boring stuff, but the data space is amazingly exciting right now.

It's a neat thing to have one part of the company that's processing petabytes of data on tens and hundreds of thousands of servers and then another part that's a commercial business. In the last couple of years, what's been interesting is to see them come together, with things that scale even on the commercial side. That's the cool part about it, and the cool part of being at Microsoft now.

What's happening now seems like it wasn't technically possible a few years ago. Is that the case?

David Campbell: Yes, for a variety of reasons. If you think about the costs just to acquire the data, you can still pay people to type stuff in. It's roughly $1 per kilobyte. But you go back 25 or 30 years and virtually all of the data that we were working with had come off human fingertips. Now it's just out there. Even inherently analog things like phone calls and pictures — they're just born digital. To store it, we've gone from $1,000-per-megabyte 25 years ago to $40-per-terabyte for raw storage. That's an incredible shift.
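
To put that shift in rough numbers (an illustrative editorial aside using decimal units, not part of the interview):

```python
# $1,000 per megabyte then vs. $40 per terabyte now (1 TB ~= 1,000,000 MB).
cost_per_tb_then = 1_000 * 1_000_000   # ~$1 billion per terabyte
cost_per_tb_now = 40
print(f"{cost_per_tb_then / cost_per_tb_now:,.0f}x cheaper")  # 25,000,000x
```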

How is Microsoft working with data startups?

David Campbell: The interesting thing about the data space is that we're talking about a lot of people with machine learning experience. They know a particular domain, but it's really hard for them to go find a set of customers. So, let's say that they've got an algorithm or a model that might be relevant to 5,000 people. It's really hard for them to go find those people.

We built this thing a couple of years ago called the DataMarket. The idea is to change the route to market. So, people can take their model and place it on the DataMarket and then others can go find it.

Here's the example I use inside the company, for those old enough to remember: When people were building Visual Basic controls, it was way harder to write one than it was to consume one. The guys writing the controls didn't have to go find the guy who was building the dentist app. They just published it in this thing from way back when it was actually, on paper, called "Programmer's Paradise," and then the guy who was writing the dentist's app would go there to find what he needed.

It's the same sort of thing here. How do you connect those people, those data scientists, who are going to remain a rare commodity with the set of people who can make use of the models they have?

How are the tools of data science changing?

David Campbell: Tooling is going to be a big challenge and a big opportunity here. We announced a tool recently that we call the Data Explorer, which lets people discover other forms of data — some in the DataMarket, some that they have. They can mash it up, turn it around and then republish it.

One of the things we looked at when we started building the tools is that people tend to do mashups today in what I was calling a "last-mile tool." They might use Access or Excel or some other tool. When they were done, they could share it with anyone else who had the same tool. The idea of the Data Explorer is to back up one step and produce something that is itself a data source that's then consumable by a large number of last-mile tools. You can program against the service itself to produce applications and whatnot.

How should companies collect and use data? What strategic advice would you offer?

David Campbell: From the data side, we've lived in what we'd call a world of scarcity. We thought that data was expensive to store, so we had to get rid of it as soon as possible. You don't want it unless you have a good use for it. Now we think about data from a perspective of abundance.

Part of the challenge, 10 or 15 years ago, was where do I go get the data? Where do I tap in? But in today's world, everything is so interconnected. It's just a matter of teeing into it. The phrase I've used instead of big data is "ambient data." It's just out there and available.

The recommendation would be to stop and think about the latent value in all that data that's there to be collected and that's fairly easy to store now. That's the challenge and the opportunity for all of us.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


This interview was edited and condensed.


December 08 2011

Strata Week: The looming data science talent shortage

Here are a few of the big-data stories that caught my attention this week.

Data scientists in demand

This week, EMC released (pdf) the findings of its recent survey of the data science community. Calling it the largest ever survey of its kind, the EMC Data Science Study included responses from more than 500 data scientists, information analysts, and data specialists from the U.S., U.K., France, Germany, India and China.

The majority of respondents (83%) said they believed that new technologies would increase the need for data scientists. But 64% also felt as though this new demand for data scientists would outstrip the supply (31% said demand would "significantly outpace" supply). Just 12% felt as though future data science jobs would be filled by current business intelligence professionals.

[Chart from the Data Science Revealed study.]

The source for future talent? College students, not surprisingly — 34% said future data science jobs would go to computer science grads; 24% said these jobs would go to those from other disciplines. And in the case of data scientists, those may well be college students with master's degrees or PhDs — some 40% of data scientists have an advanced degree, and nearly one in 10 have a doctorate. In comparison, less than 1% of business intelligence professionals have a PhD.

But the data science community's problems aren't limited to a future talent shortage. Just a third of respondents said they were confident in their company's ability to make data-driven business decisions. Respondents again pointed to a shortage of employees with the right training or skills (32%), and budget constraints were also an issue (32%).

Another problem uncovered by the survey: data accessibility. Just 12% of business intelligence analysts and 22% of data scientists say they "strongly believe" that employees have the access they need to run experiments on data.


Carrier IQ and big data

The mobile intelligence company Carrier IQ has gone from obscurity to infamy following the discovery by Android developer Trevor Eckhart that Carrier IQ's rootkit software could record all sorts of user data — texts, web browsing, keystrokes, and even phone calls.

The software is on an estimated 100 million phones — Android and iOS alike — and the news of it has prompted calls for an FTC investigation, questions from a Senator, and class-action lawsuits.

Carrier IQ issued a statement, explaining that "Our software makes your phone better by delivering intelligence on the performance of mobile devices and networks to help the operators provide optimal service efficiency."

But at GigaOm, Kevin Fitchard called Carrier IQ's relationships to handset makers and carriers a "bizarre big-data triangle":

This is big data for the mobile world — massive databases of consumer behavior delving into when, how and in what manner we use our devices. By Carrier IQ's own admission, its software is embedded in more than 150 million handsets. There are plenty of companies that would find that information enormously useful. The problem is Carrier IQ never got permission from all these smartphone users to collect that data, never told them it was gathering it, and never provided a way of opting out.

DataSift will soon offer access to historical tweets

It was April of last year when Twitter announced it was donating its entire archive to the Library of Congress, and since then, researchers have been waiting to get their hands on this older Twitter data.

As it currently stands, you can only search Twitter back as far as a week. And while you can get access to the Twitter firehose, that's little help in examining the historical record.

But starting soon, developers and researchers will have access to a bit more of that record when DataSift begins offering historical data. DataSift's alpha version will offer access to 60 days' worth of the Twitter feed, and when the service formally launches next year, DataSift promises more data.

It's not quite the Library of Congress, which, as we noted earlier this year, is working on the technology infrastructure to make the historical Tweets indexable and accessible. The Library of Congress does have access to the Twitter firehose (via the other stream provider, Gnip), so it looks like that's where the complete record will, for now at least, reside.

Got data news?

Feel free to email me.


November 08 2011

When good feedback leaves a bad impression

If a teacher is prone to hyperbole — lots of "greats!" and "excellents!" and "A+++" grades — it's natural for a student to perceive a mere "good" as an undesirable response. According to Panagiotis Ipeirotis, associate professor at New York University, the same perception applies to online reviews.

In a recent interview, Ipeirotis touched on the negative impact of good-enough reviews and a host of other data-related topics. Highlights from the interview (below) included:

  • Sentiment analysis is a commonly used tool for measuring what people are saying about a particular company or brand, but it has issues. "The problem with sentiment analysis," said Ipeirotis, "is that it tends to be rather generic, and it's not customized to the context in which people read." Ipeirotis pointed to Amazon as a good example here, where customer feedback about a merchant that says "good packaging" might initially appear as positive sentiment, but "good" feedback can have a negative effect on sales. "People tend to exaggerate a lot on Amazon. 'Excellent seller.' 'Super-duper service.' 'Lightning-fast delivery.' So when someone says 'good packaging,' it's perceived as, 'that's all you've got?'" [Discussed at the 0:42 mark.]
  • Ipeirotis suggested that people should challenge the initial conclusions they make from data. "Every time that something seems to confirm your intuition too much, I think it's good to ask for feedback." [Discussed at 2:24.]
  • Ipeirotis has done considerable research on Amazon's Mechanical Turk (MTurk) platform. He described MTurk as "an interesting example of a market that started with the wrong design." Amazon thought that its cloud-based labor service would be "yet another of its cloud services." But a market that "involves people who are strategic and responding to incentives," said Ipeirotis, "is very different than a market for CPUs and so on." Because Amazon didn't take this into consideration early on, the service has faced spam and reputation issues. Ipeirotis pointed to the site's use of anonymity as an example: Anonymity was supposed to protect privacy, but it's actually hurt some of the people who are good at what they do because anonymity is often associated with spammers. [Discussed at 2:55.]

The full interview is available in the following video:


Some quotes from this interview were edited and condensed for clarity.


November 03 2011

The number one trait you want in a data scientist

"Data scientist" is an on-the-rise job title, but what are the skills that make a good one? And how can both data scientists and the companies they work for make sure data-driven insights become actionable?

In a recent interview, DJ Patil (@dpatil), formerly the chief scientist at LinkedIn and now the data scientist in residence at Greylock Partners, discussed common data scientist traits and the challenges that those in the profession face getting their work onto company roadmaps.

Highlights from the interview (below) included:

  • What makes a good data scientist? According to Patil, the number one trait of data scientists is "a passion for really getting to an answer." This does mean, Patil said, that personality might trump skills. Pointing to what he calls "data jujitsu" — the art of turning data into products — he noted that some people can approach a problem "very heavily and very aggressively" using all sorts of computing tools. "But one data scientist who's clever can get results far faster. And typically in a business situation, that's going to have better payoff." Patil pointed to a site like Kaggle, where people compete to solve data problems, and noted that despite the number of data scientists there using machine learning and artificial intelligence, some of them are "getting beat by people who just have good, interesting insights." [Discussed at the 1:34 mark.]
  • Despite the "data smarts and street smarts" that Patil sees as key to data science, data scientists sometimes struggle to get companies to pay attention to the insights data science can provide. The good news is that Patil anticipates this attention issue will fade in the future. Once organizations recognize the importance of data, they'll identify and handle data in better ways. Furthermore, we'll see "a new generation of designers, product managers and GMs who are also data scientists and not just former engineers." [Discussed at 4:09.]

The full interview is available in the following video:




October 26 2011

Four short links: 26 October 2011

  1. CPAN Turns 0x10 -- sixteenth anniversary of the creation of the Comprehensive Perl Archive Network. Now holds 480k objects.
  2. Subtext -- social bookreading by adding chat, links, etc. to a book. I haven't tried the implementation yet but I've wanted this for years. (Just haven't wanted to jump into the cesspool of rights negotiations enough to actually build it :-) (via David Eagleman)
  3. Questions to Ask about Election Polls -- information to help you critically consume data analysis. (via Rachel Cunliffe)
  4. Technologies, Potential, and Implications of Additive Manufacturing (PDF) -- AM is a group of emerging technologies that create objects from the bottom-up by adding material one cross-sectional layer at a time. [...] Ultimately, AM has the potential to be as disruptive as the personal computer and the internet. The digitization of physical artifacts allows for global sharing and distribution of designed solutions. It enables crowd-sourced design (and individual fabrication) of physical hardware. It lowers the barriers to manufacturing, and allows everyone to become an entrepreneur. (via Bruce Sterling)

September 16 2011

Putting innovation and tech to work against breast cancer

In April, Jeff Hammerbacher looked around Silicon Valley and made an observation to Businessweek that spread like wildfire: "The best minds of my generation are thinking about how to make people click ads," he said. "That sucks."

With the launch of General Electric's Healthymagination Cancer Challenge, the best and brightest technical minds have been called to work on something that matters: fighting breast cancer.

The open innovation challenge was launched yesterday in New York City. GE and a number of venture capitalists are putting $100 million behind the challenge as part of GE's larger billion-dollar commitment to fund cancer-related R&D over the next five years.

Tim O'Reilly moderated two panels during the launch yesterday that highlighted some of the challenges and opportunities in the fight against breast cancer. Video of the event is embedded below.

[Disclosure: Tim O'Reilly will be one of the judges in GE's investment challenge.]

A moment of convergence

While the Internet is changing healthcare, what happens next is immensely important to everyone.

"I turned to healthcare partly because I saw an immense hunger among the developers that I work with to start working on stuff that matters," said O'Reilly at the launch.

O'Reilly noted the combination of medical data and data tools is enticing to developers. "As we've been hearing, there are new diagnostic technologies that are producing massive amounts of data," he said. "And of course, crunching data and extracting meaning is something that the big Silicon Valley companies have worked to perfect. We're at a moment of convergence and I'm fascinated by what is happening as these two worlds come together."

Bob Kocher of VenRock cited three reasons why "cancer won't know what happened when we've finished":

  1. New data — "We are great at making sense out of data and we're getting better every day," Kocher said.
  2. New demand — "Thank God screening will be available to all Americans," he said. "Hopefully, we will reach them where they are, with technologies that are more sensitive, more reliable, more pleasant, and making it more pervasive. We'll catch cancer at a point where we can absolutely take care of it."
  3. New economics — "Our health system economics are changing in ways that I think actually will foster much better treatment of patients, more reliably, with drugs that work better with fewer side effects," Kocher said.

What's required for innovation? Beth Comstock, senior VP and CMO at General Electric, said that a global survey by GE returned three simple truths for what's needed: collaboration, the role of the creative individual, and profit with a purpose. When it comes to the latter, "there's nothing more relevant than healthcare."

Applying that care to where it's needed most was a point of agreement for all of the panelists. "Open innovation in health doesn't matter if we can't get it to the patient and deliver it," said O'Reilly.

Atul Gawande has written about lowering medical costs by giving the neediest patients better care with a process called "hotspotting." Given the success of the approach in Camden, New Jersey, similar data-driven measures for providing healthcare in communities may be in our future.

Personalized medicine and molecular biology

Personalized medicine, driven by the ongoing discoveries in molecular biology, is "just what's next," said GE chairman and CEO Jeffrey Immelt. Taking on the immense challenge that breast cancer presents will require systems thinking that addresses both outcomes and cost over time.

Immelt is not the only executive bullish on the potential of new technologies to help breast cancer patients. "We'll see more innovation in the next five years in cancer research and development than we saw in the last 50 years," said Ron Andrews, CEO of Clarient.

Innovation needs partnership to scale, however, said Sue Siegel, a general partner at Mohr Davidow. The ideas submitted to the GE challenge need to be open and scalable to have the biggest impact, she noted.

Siegel posited that the road to a cure will be through molecular diagnostics. The challenge is that less than 1% of spending is on diagnostics, said Siegel, in the context of a healthcare industry that represents $2.6 trillion of the U.S. GDP — and yet most clinical decisions are based on diagnostics. In that context, diagnostic data appears to be a significantly undervalued resource.

"We need to value the diagnostic data as much as we do the therapies," said Risa Stack, a general partner at Kleiner Perkins Caufield & Byers. Stack said that they're thinking of a "diagnostics registry," a website that would enable people to know the different kinds of diagnostics available to patients.

"The time for personalized healthcare is now in oncology, said Greg Plowman, senior vice president for research at ImClone Systems, a subsidiary of Eli Lilly. "What's best for the patient is knowing that this drug is best for them," he said. According to Plowman, Eli Lilly is investing heavily in new diagnostics and looking for partnerships.

Susan Love of UCLA noted that screening for breast cancer, however, is still one size fits all. Breast cancer for young women is more aggressive and less likely to be picked up by traditional mechanisms, she said. "We need to focus on screening — not just personalized medicine at the end. Do it at the beginning."


Obstacles to innovation in healthcare

For entrepreneurs, there are always obstacles to building any company. It is, however, 100 times harder to be an entrepreneur inside health and wellness, said Steve Krein, co-founder of StartUp Health. "Everything is stacked against you," he said, from regulations to the patient feedback cycle.

Krein sees an "incredible amount" of people who are interested in the healthcare space but are frustrated by barriers. He emphasized that there are important opportunities for entrepreneurs to seize, particularly in the "gap" between the Internet and a doctor's visit, where patients are left alone with a search box.

There are two things that take too long, said Kocher: regulations and reimbursement. In his view, the Food and Drug Administration needs to get involved earlier to help startups navigate the system.

In a larger sense, O'Reilly suggested the healthcare industry apply a lesson from Google's playbook. The search giant solved a problem that John Wanamaker famously articulated about advertising: he knew that half of his advertising worked, but not which half. By applying data-driven approaches to healthcare, there might be huge potential to know more about what's working and create feedback loops that allow physicians and regulators to iterate quickly.

We now have the ability to move to much more real-time monitoring of what works, O'Reilly said, suggesting that "regulations need to move from a stack of paper to a set of processes for monitoring in real-time."

That could become particularly important if more health data was voluntarily introduced into the startup ecosystem through the Blue Button, a technical mechanism for enabling citizens to download their personal health information and take it with them. "Once patients have their own data, they're much more willing to share than the law will allow," said O'Reilly, but they "will tend to share if they think it will solve their health crisis."

As entrepreneurs consider how to innovate, O'Reilly said, it's important to recognize that the "change in business model is often as important as the change in technology."

A mobile revolution is coming to healthcare

After the forum, O'Reilly tweeted that healthcare is due for a "UI revolution." He cited a statistic that 1 in 5 physicians now owns an iPad and that by 2014, virtually all physicians are expected to have a tablet.

Over the past five years, said MedHelp CEO John de Souza during the launch event, monthly visitors to MedHelp.com have grown from 1 million to 12 million, and mobile visitors have grown from 3% to 30% of that traffic.

The "mobile phone is becoming a health hub," said Souza, with the ability to transmit and collect data. The two big impediments to growth are manual entry and data monitoring. Data needs to be automatically collected and sent on to someone else looking at data through tele-monitoring, where they can analyze it and inform a physician.

Krein cited the iPad as one of the most transformative technologies in healthcare because the simplified user experience has opened the door to different thinking. Krein said that when they opened up StartupAcademy, 125 entrepreneurs applied, and half of the proposals included some element of mobile health built around an iOS or Android device.


The future of healthcare is social

As reported elsewhere, social media is changing healthcare by connecting patients to information and, increasingly, each other.

As the panelists acknowledged, advocates have built huge communities and created seminal change both online and offline.

There is an opportunity for people to share actual outcomes, said O'Reilly. Given that people are using the Internet to share that information, it becomes a useful source for patients and physicians. "We do see people looking for answers in the Internet," he said. "The key thing in patient's education is teaching people how to ask better questions."

Love went beyond peer-to-peer healthcare: we can really educate the public not just about the treatment but about the research too, she said, including how to get it done and how to participate. "That's the only way to get the cause, not just the cure."

A personal challenge

I can't claim to be unbiased about breast cancer. Both my mother and grandmother have had it and survived. Through their experiences, I learned just how many other women are affected. Breast cancer statistics are stark: about 1 in 8 women in the United States will develop invasive breast cancer over the course of their lifetime. More than 200,000 new cases of breast cancer are detected every year in the U.S. alone. Globally, breast cancer is the number one cancer for women in both the developing world and developed world, according to the World Health Organization. Hundreds of thousands of those diagnosed die.

Nancy Brinker, the founder of the Susan G. Komen Foundation, lost her sister to breast cancer at the age of 36. We've moved from a society where breast cancer couldn't be said on television to one where billions are invested worldwide, she noted at the launch.

"We don't have the knowledge of how to defeat it but do know more about the biology," Brinkler said. While relative survival rates have improved for those who have access to early screening and treatment, "where a woman lives or how many resources she has should never determine whether she lives." To move forward "will require a bridge between science and society."

If healthcare data and the energy of innovation can be harnessed to create earlier detection and targeted therapies, more women diagnosed with breast cancer will join the millions of survivors.

Top Stories: September 12-16, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Building data science teams
A data science team needs people with the right skills and perspectives, and it also requires strong tools, processes, and interaction between the team and the rest of the company.

The evolution of data products
The real changes in our lives will come from products that have the richness of data without calling attention to the data.

The work of data journalism: Find, clean, analyze, create ... repeat
Simon Rogers discusses the grunt work and tools behind The Guardian's data stories.

Social data: A better way to track TV
PeopleBrowsr CEO Jodee Rich says social data offers a better way to see what TV audiences watch and what they care about.


When media rebooted, it brought marketing with it
In this TOC podcast, Twist Image president Mitch Joel talks about some of the common challenges facing the music, magazine and book publishing sectors.




Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. Save 30% on registration with the code ORM30.

Building data science teams

Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialization (see the "What makes a data scientist" section of this report for the story on how we came up with the title "Data Scientist"). Since then, data science has taken on a life of its own. The hugely positive response to "What Is Data Science?," a great introduction to the meaning of data science in today's world, showed that we were at the start of a movement. There are now regular meetups, well-established startups, and even college curricula focusing on data science. As McKinsey's big data research report and LinkedIn's data indicate, data science talent is in high demand.

This increase in the demand for data scientists has been driven by the success of the major Internet companies. Google, Facebook, LinkedIn, and Amazon have all made their marks by using data creatively: not just warehousing data, but turning it into something of value. Whether that value is a search result, a targeted advertisement, or a list of possible acquaintances, data science is producing products that people want and value. And it's not just Internet companies: Walmart doesn't produce "data products" as such, but they're well known for using data to optimize every aspect of their retail operations.

Given how important data science has grown, it's important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.

[Chart: Analytics and Data Science Job Growth. Courtesy LinkedIn Corp.]

Being data driven

Everyone wants to build a data-driven organization. It's a popular phrase and there are plenty of books, journals, and technical blogs on the topic. But what does it really mean to be "data driven"? My definition is:

A data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

There are many ways to assess whether an organization is data driven. Some like to talk about how much data they generate. Others like to talk about the sophistication of data they use, or the process of internalizing data. I prefer to start by highlighting organizations that use data effectively.

Ecommerce companies have a long history of using data to benefit their organizations. Any good salesman instinctively knows how to suggest further purchases to a customer. With "People who viewed this item also viewed ...," Amazon moved this technique online. This simple implementation of collaborative filtering is one of their most used features; it is a powerful mechanism for serendipity outside of traditional search. This feature has become so popular that there are now variants such as "People who viewed this item bought ... ." If a customer isn't quite satisfied with the product he's looking at, suggest something similar that might be more to his taste. The value to a master retailer is obvious: close the deal if at all possible, and instead of a single purchase, get customers to make two or more purchases by suggesting things they're likely to want. Amazon revolutionized electronic commerce by bringing these techniques online.
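
To make the co-viewing idea concrete, here is a minimal sketch, assuming view logs arrive as (user, item) pairs; it simply counts items viewed together and is only an illustration of the technique, not Amazon's actual recommendation system.

    # Minimal co-viewing recommender sketch (illustrative only, not Amazon's system).
    # Assumes view logs are available as (user, item) pairs; the data is invented.
    from collections import defaultdict
    from itertools import combinations

    views = [
        ("alice", "toaster"), ("alice", "kettle"), ("alice", "mug"),
        ("bob", "toaster"), ("bob", "kettle"),
        ("carol", "kettle"), ("carol", "mug"),
    ]

    # Group the items each user viewed.
    items_by_user = defaultdict(set)
    for user, item in views:
        items_by_user[user].add(item)

    # Count how often each ordered pair of items was viewed by the same user.
    co_views = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            co_views[(a, b)] += 1
            co_views[(b, a)] += 1

    def also_viewed(item, top_n=3):
        """Items most often co-viewed with `item`, best first."""
        scores = {b: n for (a, b), n in co_views.items() if a == item}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    print(also_viewed("toaster"))  # ['kettle', 'mug']

At production scale the counting would be done offline over billions of events, but the core idea is the same: co-occurrence statistics turned into a suggestion.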



Data products are at the heart of social networks. After all, what is a social network if not a huge dataset of users with connections to each other, forming a graph? Perhaps the most important product for a social network is something to help users connect with others. Any new user needs to find friends, acquaintances, or contacts. It's not a good user experience to force users to search for their friends, which is often a surprisingly difficult task. At LinkedIn, we invented People You May Know (PYMK) to solve this problem. It's easy for software to predict that if James knows Mary, and Mary knows John Smith, then James may know John Smith. (Well, conceptually easy. Finding connections in graphs gets tough quickly as the endpoints get farther apart. But solving that problem is what data scientists are for.) But imagine searching for John Smith by name on a network with hundreds of millions of users!
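
As a rough illustration of the triadic-closure idea behind PYMK, the toy sketch below ranks friend-of-friend candidates by the number of mutual connections; the graph and names are invented, and a production system is of course far more sophisticated.

    # Friend-of-friend candidates via triadic closure -- a toy sketch of the
    # idea behind "People You May Know", not LinkedIn's production algorithm.
    from collections import defaultdict

    friends = {                      # invented social graph
        "james": {"mary"},
        "mary": {"james", "john"},
        "john": {"mary"},
    }

    def people_you_may_know(user):
        """Rank non-friends by the number of mutual friends shared with `user`."""
        mutual_counts = defaultdict(int)
        for friend in friends.get(user, set()):
            for fof in friends.get(friend, set()):
                if fof != user and fof not in friends.get(user, set()):
                    mutual_counts[fof] += 1
        return sorted(mutual_counts, key=mutual_counts.get, reverse=True)

    print(people_you_may_know("james"))  # ['john'] -- they share Mary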

Although PYMK was novel at the time, it has become a critical part of every social network's offering. Facebook not only supports its own version of PYMK, it also monitors the time it takes for users to acquire friends. Using sophisticated tracking and analysis technologies, Facebook has identified the time and number of connections it takes to get a user to long-term engagement. If you connect with only a few friends, or add friends slowly, you won't stick around for long. By studying the activity levels that lead to commitment, they have designed the site to decrease the time it takes for new users to connect with the critical number of friends.

Netflix does something similar in their online movie business. When you sign up, they strongly encourage you to add to the queue of movies you intend to watch. Their data team has discovered that once you add more than a certain number of movies, the probability that you will be a long-term customer is significantly higher. With this data, Netflix can construct, test, and monitor product flows to maximize the number of new users who exceed the magic number and become long-term customers. They've built a highly optimized registration/trial service that leverages this information to engage the user quickly and efficiently.
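
A toy version of that "magic number" analysis might look like the following sketch: it bins invented signup data by early activity and finds the smallest count at which long-term retention clears a chosen threshold. It is not Netflix's or Facebook's actual methodology, just the shape of the analysis.

    # A toy "magic number" analysis: find the smallest early-engagement count
    # at which long-term retention clears a target rate. All data is invented.
    from collections import defaultdict

    # (titles_queued_in_first_week, still_a_customer_six_months_later)
    signups = [(0, False), (1, False), (2, False), (3, True), (3, False),
               (5, True), (6, True), (7, True), (8, True), (9, True)]

    totals, retained = defaultdict(int), defaultdict(int)
    for count, active in signups:
        totals[count] += 1
        retained[count] += int(active)

    # Retention rate by first-week queue size, in ascending order of queue size.
    retention = {c: retained[c] / totals[c] for c in sorted(totals)}
    magic_number = next((c for c, rate in retention.items() if rate >= 0.8), None)

    print(retention)
    print(magic_number)  # 5 -- the threshold to design the signup flow around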

Netflix, LinkedIn, and Facebook aren't alone in using customer data to encourage long-term engagement — Zynga isn't just about games. Zynga constantly monitors who their users are and what they are doing, generating an incredible amount of data in the process. By analyzing how people interact with a game over time, they have identified tipping points that lead to a successful game. They know how the probability that users will become long-term users changes based on the number of interactions they have with others, the number of buildings they build in the first n days, the number of mobsters they kill in the first m hours, etc. They have figured out the keys to the engagement challenge and have built their product to encourage users to reach those goals. Through continued testing and monitoring, they refined their understanding of these key metrics.


Google and Amazon pioneered the use of A/B testing to optimize the layout of a web page. For much of the web's history, web designers worked by intuition and instinct. There's nothing wrong with that, but if you make a change to a page, you owe it to yourself to ensure that the change is effective. Do you sell more product? How long does it take for users to find the result they're looking for? How many users give up and go to another site? These questions can only be answered by experimenting, collecting the data, and doing the analysis, all of which are second nature to a data-driven company.
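
For a flavor of the analysis behind such experiments, here is a bare-bones sketch that compares two conversion rates with a two-proportion z-test on invented traffic numbers; real experimentation platforms add much more (power analysis, multiple variants, guardrail metrics).

    # Bare-bones A/B comparison of two conversion rates with a two-proportion
    # z-test. The traffic and conversion numbers are invented.
    from math import erf, sqrt

    def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
        p_a = conversions_a / visitors_a
        p_b = conversions_b / visitors_b
        pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
        se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
        z = (p_b - p_a) / se
        # Two-sided p-value under the normal approximation.
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Did the new layout (B) really sell more than the old one (A)?
    z, p = two_proportion_z(200, 4000, 230, 4000)
    print(f"z = {z:.2f}, p = {p:.3f}")  # roughly z = 1.49, p = 0.14: not convincing yet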

Yahoo has made many important contributions to data science. After observing Google's use of MapReduce to analyze huge datasets, they realized that they needed similar tools for their own business. The result was Hadoop, now one of the most important tools in any data scientist's repertoire. Hadoop has since been commercialized by Cloudera, Hortonworks (a Yahoo spin-off), MapR, and several other companies. Yahoo didn't stop with Hadoop; they have observed the importance of streaming data, an application that Hadoop doesn't handle well, and are working on an open source tool called S4 (still in the early stages) to handle streams effectively.

Payment services, such as PayPal, Visa, American Express, and Square, live and die by their abilities to stay one step ahead of the bad guys. To do so, they use sophisticated fraud detection systems to look for abnormal patterns in incoming data. These systems must be able to react in milliseconds, and their models need to be updated in real time as additional data becomes available. It amounts to looking for a needle in a haystack while the workers keep piling on more hay. We'll go into more details about fraud and security later in this article.

Google and other search engines constantly monitor search relevance metrics to identify areas where people are trying to game the system or where tuning is required to provide a better user experience. The challenge of moving and processing data on Google's scale is immense, perhaps larger than any other company today. To support this challenge, they have had to invent novel technical solutions that range from hardware (e.g., custom computers) to software (e.g., MapReduce) to algorithms (PageRank), much of which has now percolated into open source software projects.

I've found that the strongest data-driven organizations all live by the motto "if you can't measure it, you can't fix it" (a motto I learned from one of the best operations people I've worked with). This mindset gives you a fantastic ability to deliver value to your company by:

  • Instrumenting and collecting as much data as you can. Whether you're doing business intelligence or building products, if you don't collect the data, you can't use it.
  • Measuring in a proactive and timely way. Are your products and strategies succeeding? If you don't measure the results, how do you know?
  • Getting many people to look at data. Any problems that may be present will become obvious more quickly — "with enough eyes all bugs are shallow."
  • Fostering increased curiosity about why the data has changed or is not changing. In a data-driven organization, everyone is thinking about the data.

It's easy to pretend that you're data driven. But if you get into the mindset to collect and measure everything you can, and think about what the data you've collected means, you'll be ahead of most of the organizations that claim to be data driven. And while I have a lot to say about professional data scientists later in this post, keep in mind that data isn't just for the professionals. Everyone should be looking at the data.


The roles of a data scientist

In every organization I've worked with or advised, I've always found that data scientists have an influence out of proportion to their numbers. The many roles that data scientists can play fall into the following domains.

Decision sciences and business intelligence

Data has long played a role in advising and assisting operational and strategic thinking. One critical aspect of decision-making support is defining, monitoring, and reporting on key metrics. While that may sound easy, there is a real art to defining metrics that help a business better understand its "levers and control knobs." Poorly-chosen metrics can lead to blind spots. Furthermore, metrics must always be used in context with each other. For example, when looking at percentages, it is still important to see the raw numbers. It is also essential that metrics evolve as the sophistication of the business increases. As an analogy, imagine a meteorologist who can only measure temperature. This person's forecast is always going to be of lower quality than the meteorologist who knows how to measure air pressure. And the meteorologist who knows how to use humidity will do even better, and so on.

Once metrics and reporting are established, the dissemination of data is essential. There's a wide array of tools for publishing data, ranging from simple spreadsheets and web forms, to more sophisticated business intelligence products. As tools get more sophisticated, they typically add the ability to annotate and manipulate (e.g., pivot with other data elements) to provide additional insights.

More sophisticated data-driven organizations thrive on the "democratization" of data. Data isn't just the property of an analytics group or senior management. Everyone should have access to as much data as legally possible. Facebook has been a pioneer in this area. They allow anyone to query the company's massive Hadoop-based data store using a language called Hive. This way, nearly anyone can create a personal dashboard by running scripts at regular intervals. Zynga has built something similar, using a completely different set of technologies. They have two copies of their data warehouses. One copy is used for operations where there are strict service-level agreements (SLAs) in place to ensure reports and key metrics are always accessible. The other data store can be accessed by many people within the company, with the understanding that performance may not always be optimal. A more traditional model is used by eBay, which uses technologies like Teradata to create cubes of data for each team. These cubes act like self-contained datasets and data stores that the teams can interact with.

As organizations have become increasingly adept with reporting and analysis, there has been increased demand for strategic decision-making using data. We have been calling this new area "decision sciences." These teams delve into existing data sources and meld them with external data sources to understand the competitive landscape, prioritize strategy and tactics, and provide clarity about hypotheses that may arise during strategic planning. A decision sciences team might take on a problem, like which country to expand into next, or it might investigate whether a particular market is saturated. This analysis might, for example, require mixing census data with internal data and then building predictive models that can be tested against existing data or data that needs to be acquired.

One word of caution: people new to data science frequently look for a "silver bullet," some magic number around which they can build their entire system. If you find it, fantastic, but few are so lucky. The best organizations look for levers that they can lean on to maximize utility, and then move on to find additional levers that increase the value of their business.


Product and marketing analytics

Product analytics represents a relatively new use of data. Teams create applications that interact directly with customers, such as:

  • Products that provide highly personalized content (e.g., the ordering/ranking of information in a news feed).
  • Products that help drive the company's value proposition (e.g., "People You May Know" and other applications that suggest friends or other types of connections).
  • Products that facilitate the introduction into other products (e.g., "Groups You May Like," which funnels you into LinkedIn's Groups product area).
  • Products that prevent dead ends (e.g., collaborative filters that suggest further purchases, such as Amazon's "People who viewed this item also viewed ...").
  • Products that are stand alone (e.g., news relevancy products like Google News, LinkedIn Today, etc.).

Given the rapidly decreasing cost of computation, it is easier than ever to use common algorithms and numerical techniques to test the effectiveness of these products.

Similar to product analytics, marketing analytics uses data to explain and showcase a service or product's value proposition. A great example of marketing analytics is OKCupid's blog, which uses internal and external data sources to discuss larger trends. For example, one well-known post correlates the number of sexual partners with smartphone brands. Do iPhone users have more fun? OKCupid knows. Another post studied what kinds of profile pictures are attractive, based on the number of new contacts they generated. In addition to a devoted following, these blog posts are regularly picked up by traditional media, and shared virally through social media channels. The result is a powerful marketing tactic that drives both new users and returning users. Other companies that have used data to drive blogging as a marketing strategy include Mint, LinkedIn, Facebook, and Uber.

Email has long been the basis for online communication with current and potential customers. Using analytics as a part of an email targeting strategy is not new, but powerful analytical technologies can help to create email marketing programs that provide rich content. For example, LinkedIn periodically sends customers updates about changes to their networks: new jobs, significant posts, new connections. This would be spam if it were just a LinkedIn advertisement. But it isn't — it's relevant information about people you already know. Similarly, Facebook uses email to encourage you to come back to the site if you have been inactive. Those emails highlight the activity of your most relevant friends. Since it is hard to delete an email that tells you what your friends are up to, it's extremely effective.

Fraud, abuse, risk and security

Online criminals don't want to be found. They try to hide in the data. There are several key components in the constantly evolving war between attackers and defenders: data collection, detection, mitigation, and forensics. The skills of data scientists are well suited to all of these components.

Any strategy for preventing and detecting fraud and abuse starts with data collection. Data collection is always a challenge, and it is tough to decide how much instrumentation is sufficient. Attackers are always looking to exploit the limitations of your data, but constraints such as cost and storage capacity mean that it's usually impossible to collect all the data you'd like. The ability to recognize which data needs to be collected is essential. There's an inevitable "if only" moment during an attack: "if only we had collected x and y, we'd be able to see what is going on."

Another aspect of incident response is the time required to process data. If an attack is evolving minute by minute, but your processing layer takes hours to analyze the data, you won't be able to respond effectively. Many organizations are finding that they need data scientists, along with sophisticated tooling, to process and analyze data quickly enough to act on it.

Once the attack is understood, the next phase is mitigation. Mitigation usually requires closing an exploit or developing a model that segments bad users from good users. Success in this area requires the ability to take existing data and transform it into new variables that can be acted upon. This is a subtle but critical point. As an example, consider IP addresses. Any logging infrastructure almost certainly collects the IP addresses that connect to your site. Addresses by themselves are of limited use. However, an IP address can be transformed into variables such as:

  • The number of bad actors seen from this address during some period of time.
  • The country from which the address originated, and other geographic information.
  • Whether the address is typical for this time of day.

From this data, we now have derived variables that can be built into a model for an actionable result. Domain experts who are data scientists understand how to make variables out of the data. And from those variables, you can build detectors to find the bad guys.
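
The sketch below derives the kinds of variables listed above from a toy request log; the log records, the geo lookup table, and the "typical hours" profile are hypothetical stand-ins for whatever enrichment sources a real fraud system would use.

    # Turning raw IP addresses from a request log into model-ready variables.
    # The log, geo table, and "typical hours" profile are invented stand-ins.
    from collections import Counter
    from datetime import datetime

    request_log = [
        {"ip": "203.0.113.7", "ts": datetime(2011, 9, 1, 3, 12), "flagged": True},
        {"ip": "203.0.113.7", "ts": datetime(2011, 9, 1, 3, 15), "flagged": True},
        {"ip": "198.51.100.9", "ts": datetime(2011, 9, 1, 14, 2), "flagged": False},
    ]

    geo_lookup = {"203.0.113.7": "XX", "198.51.100.9": "US"}   # hypothetical table
    typical_hours = {"198.51.100.9": range(8, 23)}             # hypothetical profile

    # How many flagged (bad) requests have we seen from each address?
    bad_actor_counts = Counter(r["ip"] for r in request_log if r["flagged"])

    def ip_features(ip, ts):
        """Derive per-request variables a fraud model could act on."""
        return {
            "bad_actors_seen": bad_actor_counts[ip],
            "country": geo_lookup.get(ip, "unknown"),
            "typical_time_of_day": ts.hour in typical_hours.get(ip, range(24)),
        }

    print(ip_features("203.0.113.7", datetime(2011, 9, 1, 3, 20)))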

Finally, forensics builds a case against the attackers and helps you learn about the true nature of the attack and how to prevent (or limit) such attacks in the future. Forensics can be a time-consuming process where the data scientists sift through all of the data to piece together a puzzle. Once the puzzle has been put together, new tooling, processes, and monitoring can be put in place.

Data services and operations

One of the foundational components of any data organization is data services and operations. This team is responsible for the databases, data stores, data structures (e.g., data schemas), and the data warehouse. They are also responsible for the monitoring and upkeep of these systems. The other functional areas cannot exist without a top-notch data services and operations group; you could even say that the other areas live on top of this area. In some organizations, these teams exist independently of traditional operations teams. In my opinion, as these systems increase in sophistication, they need even greater coordination with operations groups. The systems and services this functional area provides need to be deployed in traditional data centers or in the cloud, and they need to be monitored for stability; staff also must be on hand to respond when systems go down. Established operations groups have expertise in these areas, and it makes sense to take advantage of such skills.

As an organization builds out its reporting requirements, the data services and operations team should become responsible for the reporting layer. While team members may not focus on defining metrics, they are critical in ensuring that the reports are delivered in a timely manner. Therefore, collaboration between data services and decision sciences is absolutely essential. For example, while a metric may be easy to define on paper, implementing it as part of a regular report may be unrealistic: the database queries required to implement the metric may be too complex to run as frequently as needed.

Data engineering and infrastructure

It's hard to overstate the sophistication of the tools needed to instrument, track, move, and process data at scale. The development and implementation of these technologies is the responsibility of the data engineering and infrastructure team. The technologies have evolved tremendously over the past decade, with an incredible amount of collaboration taking place through open source projects. Here are just a few samples:

  • Kafka, Flume, and Scribe are tools for streaming data collection. While the models differ, the general idea is that these programs collect data from many sources; aggregate the data; and feed it to a database, a system like Hadoop, or other clients.
  • Hadoop is currently the most widely used framework for processing data. Hadoop is an open source implementation of the MapReduce programming model that Google popularized in 2004. It is inherently batch-oriented; several newer technologies are aimed at processing streaming data, such as S4 and Storm. (A miniature word count in the MapReduce style follows this list.)
  • Azkaban and Oozie are job schedulers. They manage and coordinate complex data flows.
  • Pig and Hive are languages for querying large non-relational datastores. Hive is very similar to SQL. Pig is a data-oriented scripting language.
  • Voldemort, Cassandra, and HBase are data stores that have been designed for good performance on very large datasets.
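
To make the MapReduce model concrete, here is a miniature, single-machine word count with the map, shuffle, and reduce phases spelled out; Hadoop's contribution is running these same phases reliably across thousands of machines.

    # The MapReduce model in miniature: a local word count with the map,
    # shuffle, and reduce phases written out explicitly.
    from collections import defaultdict

    documents = ["big data is big", "data science needs data"]

    # Map: emit (word, 1) for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: sum the counts for each word.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # {'big': 2, 'data': 3, 'is': 1, 'science': 1, 'needs': 1}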

Equally important is the ability to build monitoring and deployment technologies for these systems.

In addition to building the infrastructure, data engineering and infrastructure takes ideas developed by the product and marketing analytics group and implements them so they can operate in production at scale. For example, a recommendation engine for videos may be prototyped using SQL, Pig, or Hive. If testing shows that the recommendation engine is of value, it will need to be deployed so that it supports SLAs specifying appropriate availability and latencies. Migrating the product from prototype into production may require re-implementing it so it can deliver performance at scale. If SQL and a relational database prove to be too slow, you may need to move to HBase, queried by Hive or Pig. Once the application has been deployed, it must be monitored to ensure that it continues meeting its requirements. It must also be monitored to ensure that it is producing relevant results. Doing so requires more sophisticated software development.

Organizational and reporting alignment

Should an organization be structured according to the functional areas I've discussed, or via some other mechanism? There is no easy answer. Key things to consider include the people involved, the size and scale of the organization, and the organizational dynamics of the company (e.g., whether the company is product, marketing, or engineering driven).

In the early stages, people must wear multiple hats. For example, in a startup, you can't afford separate groups for analytics, security, operations, and infrastructure: one or two people may have to do everything. But as an organization grows, people naturally become more specialized. In addition, it's a good idea to remove any single points of failure. Some organizations use a "center-of-excellence model," where there is a centralized data team. Others use a hub-and-spoke model, where there is one central team and members are embedded within sponsoring teams (for example, the sales team may sponsor people in analytics to support their business needs). Some organizations are fully decentralized, and each team hires to fill its own requirements.

As vague as that answer is, here are the three lessons I've learned:

  1. If the team is small, its members should sit close to each other. There are many nuances to working with data, and high-speed interaction between team members resolves painful, trivial issues.
  2. Train people to fish — it only increases your organization's ability to be data driven. As previously discussed, organizations like Facebook and Zynga have democratized data effectively. As a result, these companies have more people conducting more analysis and looking at key metrics. This kind of access was nearly unheard of as little as five years ago. There is a down side: the increased demands on the infrastructure and need for training. The infrastructure challenge is largely a technical problem, and one of the easiest ways to manage training is to set up "office hours" and schedule data classes.
  3. All of the functional areas must stay in regular contact and communication. As the field of data science grows, technology and process innovations will also continue to grow. To keep up to date it is essential for all of these teams to share their experiences. Even if they are not part of the same reporting structure, there is a common bond of data that ties everyone together.

What makes a data scientist?

When Jeff Hammerbacher and I talked about our data science teams, we realized that as our organizations grew, we both had to figure out what to call the people on our teams. "Business analyst" seemed too limiting. "Data analyst" was a contender, but we felt that title might limit what people could do. After all, many of the people on our teams had deep engineering expertise. "Research scientist" was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo, and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams. It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.

(Note: Although the term "data science" has a long history — usually referring to business intelligence — "data scientist" appears to be new. Jeff and I have been asking if anyone else has used this term before we coined it, but we've yet to find anyone who has.)

But how do you find data scientists? Whenever someone asks that question, I refer them back to a more fundamental question: what makes a good data scientist? Here is what I look for:

  • Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
  • Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
  • Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
  • Cleverness: the ability to look at a problem in different, creative ways.

People often assume that data scientists need a background in computer science. In my experience, that hasn't been the case: my best data scientists have come from very different backgrounds. The inventor of LinkedIn's People You May Know was an experimental physicist. A computational chemist on my decision sciences team had solved a 100-year-old problem on energy states of water. An oceanographer made major impacts on the way we identify fraud. Perhaps most surprising was the neurosurgeon who turned out to be a wizard at identifying rich underlying trends in the data.

All the top data scientists share an innate sense of curiosity. Their curiosity is broad, and extends well beyond their day-to-day activities. They are interested in understanding many different areas of the company, business, industry, and technology. As a result, they are often able to bring disparate areas together in a novel way. For example, I've seen data scientists look at sales processes and realize that by using data in new ways they can make the sales team far more efficient. I've seen data scientists apply novel DNA sequencing techniques to find patterns of fraud.

What unifies all these people? They all have strong technical backgrounds. Most have advanced degrees (although I've worked with several outstanding data scientists who haven't graduated from college). But the real unifying thread is that all have had to work with a tremendous amount of data before starting to work on the "real" problem. When I was a first-year graduate student, I was interested in weather forecasting. I had an idea about how to understand the complexity of weather, but needed lots of data. Most of the data was available online, but due to its size, the data was in special formats and spread out over many different systems. To make that data useful for my research, I created a system that took over every computer in the department from 1 AM to 8 AM. During that time, it acquired, cleaned, and processed that data. Once done, my final dataset could easily fit in a single computer's RAM. And that's the whole point. The heavy lifting was required before I could start my research. Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem.

These are some examples of training that hone the skills a data scientist needs to be successful (a small worked example of the cleaning and melding steps follows the list):

  • Finding rich data sources.
  • Working with large volumes of data despite hardware, software, and bandwidth constraints.
  • Cleaning the data and making sure that data is consistent.
  • Melding multiple datasets together.
  • Visualizing that data.
  • Building rich tooling that enables others to work with data effectively.
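
As a toy illustration of the cleaning and melding steps above, the sketch below normalizes inconsistent keys and categorical values in one invented dataset and joins it to another using pandas; real cleanup work is far messier, but the pattern is the same.

    # A toy pass at cleaning and melding: normalize inconsistent keys and
    # categorical values, then join two invented datasets with pandas.
    import pandas as pd

    signups = pd.DataFrame({
        "email": ["A@Example.com ", "b@example.com", "c@example.com"],
        "plan": ["Pro", "free", "FREE"],
    })
    payments = pd.DataFrame({
        "email": ["a@example.com", "b@example.com"],
        "amount": [49.0, 0.0],
    })

    # Clean: make the join key and the categorical column consistent.
    signups["email"] = signups["email"].str.strip().str.lower()
    signups["plan"] = signups["plan"].str.lower()

    # Meld: left-join payments onto the cleaned signups.
    merged = signups.merge(payments, on="email", how="left")
    print(merged)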

One of the challenges of identifying data scientists is that there aren't many of them (yet). There are a number of programs that are helping train people, but the demand outstrips the supply. And experiences like my own suggest that the best way to become a data scientist isn't to be trained as a data scientist, but to do serious, data-intensive work in some other discipline.

Hiring data scientists was such a challenge at every place I've worked that I've adopted two models for building and training new hires. First, hire people with diverse backgrounds who have histories of playing with data to create something novel. Second, take incredibly bright and creative people right out of college and put them through a very robust internship program.

Another way to find great data scientists is to run a competition, like Netflix did. The Netflix Prize was a contest organized to improve their ability to predict how much a customer would enjoy a movie. If you don't want to organize your own competition, you can look at people who have performed well in competitions run by others. Kaggle and Topcoder are great resources when looking for this kind of talent. Kaggle has found its own top talent by hiring the best performers from its own competitions.

Hiring and talent

Many people focus on hiring great data scientists but leave out the need for continued intellectual and career growth, which together I call talent growth. In the three years that I led LinkedIn's analytics and data teams, we developed a philosophy around three principles for hiring and talent growth.

Would we be willing to do a startup with you?

This is the first question we ask ourselves as a team when we meet to evaluate a candidate. It sums up a number of key criteria:

  • Time: If we're willing to do a startup with you, we're agreeing that we'd be willing to be locked in a small room with you for long periods of time. The ability to enjoy another person's company is critical to being able to invest in each other's growth.
  • Trust: Can we trust you? Will we have to look over your shoulder to make sure you're doing an A+ job? That may go without saying, but the reverse is also important: will you trust me? If you don't trust me, we're both in trouble.
  • Communication: Can we communicate with each other quickly and efficiently? If we're going to spend a tremendous amount of time together and if we need to trust each other, we'll need to communicate. Over time, we should be able to anticipate each other's needs in a way that allows us to be highly efficient.

Can you "knock the socks off" of the company in 90 days?

Once the first criterion has been met, it's critical to establish mechanisms to ensure that the candidate will succeed. We do this by setting expectations for the quality of the candidate's work, and by setting expectations for the velocity of his or her progress.

First, the "knock the socks off" part: by setting the goal high, we're asking whether you have the mettle to be part of an elite team. More importantly, it is a way of establishing a handshake for ensuring success. That's where the 90 days comes in. A new hire won't come up with something mind blowing if the team doesn't bring the new hire up to speed quickly. The team needs to orient new hires around existing systems and processes. Similarly, the new hire needs to make the effort to progress, quickly. Does this person ask questions when they get stuck? There are no dumb questions, and toughing it out because you're too proud or insecure to ask is counterproductive. Can the new hire bring a new system up in a day, or does it take a week or more? It's important to understand that doing something mind-blowing in 90 days is a team goal, as much as an individual goal. It is essential to pair the new hire with a successful member of the team. Success is shared.

This criterion sets new hires up for long-term success. Once they've passed the first milestone, they've done something that others in the company can recognize, and they have the confidence that will lead to future achievements. I've seen everyone from interns all the way to seasoned executives meet this criterion. And many of my top people have had multiple successes in their first 90 days.

In four to six years, will you be doing something amazing?

What does it mean to do something amazing? You might be running the team or the company. You might be doing something in a completely different discipline. You may have started a new company that's changing the industry. It's difficult to talk concretely because we're talking about potential and long-term futures. But we all want success to breed success, and I believe we can recognize the people who will help us to become mutually successful.

I don't necessarily expect a new hire to do something amazing while he or she works for me. The four- to six-year horizon allows members of the team to build long-term road maps. Many organizations make the time commitment amorphous by talking about vague, never-ending career ladders. But professionals no longer commit themselves to a single company for the bulk of their careers. With each new generation of professionals, the number of organizations, and even careers, a person moves through has increased. So rather than fight it, embrace the fact that people will leave, so long as they leave to do something amazing. What I'm interested in is the potential: if you have that potential, we all win and we all grow together, whether your biggest successes come with my team or somewhere else.

Finally, this criterion is mutual. A new hire won't do something amazing, now or in the future, if the organization he or she works for doesn't hold up its end of the bargain. The organization must provide a platform and opportunities for the individual to be successful. Throwing a new hire into the deep end and expecting success doesn't cut it. Similarly, the individual must make the company successful to elevate the platform that he or she will launch from.

Building the LinkedIn data science team

I'm proud of what we've accomplished in building the LinkedIn data team. However, when we started, it didn't look anything like the organization that is there today. We started with 1.5 engineers (who would later go on to invent Voldemort, Kafka, and the real-time recommendation engine systems), no data services team (there wasn't even a data warehouse), and five analysts (who would later become the core of LinkedIn's data science group) who supported everyone from the CFO to the product managers.

When we started to build the team, the first thing I did was go to many different technical organizations (the likes of Yahoo, eBay, Google, Facebook, Sun, etc.) to get their thoughts and opinions. What I found really surprised me. The companies all had fantastic sets of employees who could be considered "data scientists." However, they were uniformly discouraged. They did first-rate work that they considered critical, but that had very little impact on the organization. They'd finish some analysis or come up with some ideas, and the product managers would say "that's nice, but it's not on our roadmap." As a result, the data scientists developing these ideas were frustrated, and their organizations had trouble capitalizing on what they were capable of doing.

Our solution was to make the data group a full product team responsible for designing, implementing, and maintaining products. As a product team, data scientists could experiment, build, and add value directly to the company. This resulted not only in further development of LinkedIn products like PYMK and Who's Viewed My Profile, but also in features like Skills, which tracks various skills and assembles a picture of what's needed to succeed in any given area, and Career Explorer, which helps you explore different career trajectories.

It's important that our data team wasn't composed solely of mathematicians and other "data people." It's a fully integrated product group that includes people working in design, web development, engineering, product marketing, and operations. They all understand and work with data, and I consider them all data scientists. We intentionally kept the distinction between different roles in the group blurry. Often, an engineer can have the insight that makes it clear how the product's design should work, or vice versa — a designer can have the insight that helps the engineers understand how to better use the data. Or it may take someone from marketing to understand what a customer really wants to accomplish.

The silos that have traditionally separated data people from engineering, from design, and from marketing, don't work when you're building data products. I would contend that it is questionable whether those silos work for any kind of product development. But with data, it never works to have a waterfall process in which one group defines the product, another builds visual mock-ups, a data scientist preps the data, and finally a set of engineers builds it to some specification document. We're not building Microsoft Office, or some other product where there's 20-plus years of shared wisdom about how interfaces should work. Every data project is a new experiment, and design is a critical part of that experiment. It's similar for operations: data products present entirely different stresses on a network and storage infrastructure than traditional sites. They capture much more data: petabytes and even exabytes. They deliver results that mash up data from many sources, some internal, some not. You're unlikely to create a data product that is reliable and that performs reasonably well if the product team doesn't incorporate operations from the start. This isn't a simple matter of pushing the prototype from your laptop to a server farm.

Finally, quality assurance (QA) of data products requires a radically different approach. Building test datasets is nontrivial, and it is often impossible to test all of the use cases. As different data streams come together into a final product, all sorts of relevance and precision issues become apparent. To develop this kind of product effectively, the ability to adapt and iterate quickly throughout the product life cycle is essential. To ensure agility, we build small groups to work on specific products, projects, or analyses. When we can, I like to seat people whose work depends on one another in the same area.

A data science team isn't just people: it's tooling, processes, the interaction between the team and the rest of the company, and more. At LinkedIn, we couldn't have succeeded if it weren't for the tools we used. When you're working with petabytes of data, you need serious power tools to do the heavy lifting. Some, such as Kafka and Voldemort (now open source projects), were homegrown, not because we thought we should have our own technology, but because we didn't have a choice. Our products couldn't scale without them. In addition to these technologies, we use other open source technologies such as Hadoop and many vendor-supported solutions as well. Many of these are for data warehousing and traditional business intelligence.

Tools are important because they allow you to automate. Automation frees up time, and makes it possible to do the creative work that leads to great products. Something as simple as reducing the turnaround time on a complex query from "get the result in the morning" to "get the result after a cup of coffee" represents a huge increase in productivity. If queries run overnight, you can only afford to ask questions when you already think you know the answer. If queries run in minutes, you can experiment and be creative.

Interaction between the data science teams and the rest of corporate culture is another key factor. It's easy for a data team (any team, really) to be bombarded by questions and requests. But not all requests are equally important. How do you make sure there's time to think about the big questions and the big problems? How do you balance incoming requests (most of which are tagged "as soon as possible") with long-term goals and projects? It's important to have a culture of prioritization: everyone in the group needs to be able to ask about the priority of incoming requests. Everything can't be urgent.

The result of building a data team is, paradoxically, that you see data products being built in all parts of the company. When the company sees what can be created with data, when it sees the power of being data enabled, you'll see data products appearing everywhere. That's how you know when you've won.

Reinvention

Companies are always looking to reinvent themselves. There's never been a better time: from economic pressures that demand greater efficiency, to new kinds of products that weren't conceivable a few years ago, the opportunities presented by data are tremendous.

But it's a mistake to treat data science teams like any old product group. (It is probably a mistake to treat any old product group like any old product group, but that's another issue.) To build teams that create great data products, you have to find people with the skills and the curiosity to ask the big questions. You have to build cross-disciplinary groups with people who are comfortable creating together, who trust each other, and who are willing to help each other be amazing. It's not easy, but if it were easy, it wouldn't be as much fun.

September 15 2011

The evolution of data products

In "What is Data Science?," I started to talk about the nature of data products. Since then, we've seen a lot of exciting new products, most of which involve data analysis to an extent that we couldn't have imagined a few years ago. But that begs some important questions: What happens when data becomes a product, specifically, a consumer product? Where are data products headed? As computer engineers and data scientists, we tend to revel in the cool new ways we can work with data. But to the consumer, as long as the products are about the data, our job isn't finished. Proud as we may be about what we've accomplished, the products aren't about the data; they're about enabling their users to do whatever they want, which most often has little to do with data.

It's an old problem: the geeky engineer wants something cool with lots of knobs, dials, and fancy displays. The consumer wants an iPod, with one tiny screen, one jack for headphones, and one jack for charging. The engineer wants to customize and script it. The consumer wants a cool matte aluminum finish on a device that just works. If the consumer has to script it, something is very wrong. We're currently caught between the two worlds. We're looking for the Steve Jobs of data — someone who can design something that does what we want without getting us involved in the details.


Disappearing data

We've become accustomed to virtual products, but it's only appropriate to start by appreciating the extent to which data products have replaced physical products. Not that long ago, music was shipped as chunks of plastic that weighed roughly a pound. When the music was digitized and stored on CDs, it became a data product that weighed under an ounce, but was still a physical object. We've moved even further since: many of the readers of this article have bought their last CD, and now buy music exclusively in online form, through iTunes or Amazon. Video has followed the same path, as analog VHS videotapes became DVDs and are now streamed through Netflix, a pure data product.

But while we're accustomed to the displacement of physical products by virtual products, the question of how we take the next step — where data recedes into the background — is surprisingly tough. Do we want products that deliver data? Or do we want products that deliver results based on data? We're evolving toward the latter, though we're not there yet. The iPod may be the best example of a product that pushes the data into the background to deliver what the user wants, but its partner application, iTunes, may be the worst. The user interface to iTunes is essentially a spreadsheet that exposes all of your music collection's metadata. Similarly, the "People You May Know" feature on social sites such as LinkedIn and Facebook delivers recommendations: a list of people in the database who are close to you in one way or another. While that's much more friendly than iTunes' spreadsheet, it is still a list, a classic data structure. Products like these have a "data smell." I call them "overt" data products because the data is clearly visible as part of the deliverable.

A list may be an appropriate way to deliver potential contacts, and a spreadsheet may be an appropriate way to edit music metadata. But there are many other kinds of deliverables that help us to understand where data products are headed. At a recent event at IBM Research, IBM demonstrated an application that accurately predicts bus arrival times, based on real-time analysis of traffic data. (London is about to roll out something similar.) Another IBM project implemented a congestion management system for Stockholm that brought about significant decreases in traffic and air pollution. A newer initiative allows drivers to text their destinations to a service, and receive an optimized route, given current traffic and weather conditions. Is a bus arrival time data? Probably so. Is a route another list structure, like a list of potential Facebook friends? Yes, though the real deliverable here is reduced transit time and an improved environment. The data is still in the foreground, but we're starting to look beyond the data to the bigger picture: better quality of life.

These projects suggest the next step in the evolution toward data products that deliver results rather than data. Recently, Ford discussed some experimental work in which they used Google's prediction and mapping capabilities to optimize mileage in hybrid cars based on predictions about where the driver was going. It's clearly a data product: it's doing data analysis on historical driving data and knowledge about road conditions. But the deliverable isn't a route or anything the driver actually sees — it's optimized engine usage and lower fuel consumption. We might call such a product, in which the data is hidden, a "covert" data product.

We can push even further. The user really just wants to get from point A to point B. Google has demonstrated a self-driving car that solves this problem. A self-driving car is clearly not delivering data as the result, but there are massive amounts of data behind the scenes, including maps, Street View images of the roads (which, among other things, help it to compute the locations of curbs, traffic lights, and stop signs), and data from sensors on the car. If we ever find out everything that goes into the data processing for a self-driving car, I believe we'll see a masterpiece of extracting every bit of value from many data sources. A self-driving car clearly takes the next step to solving a user's real problem while making the data hide behind the scenes.

Once you start looking for data products that deliver real-world results rather than data, you start seeing them everywhere. One IBM project involved finding leaks in Dubuque, Iowa's, public water supply. Water is being used all the time, but sudden changes in usage could represent a leak. Leaks have a unique signature: they can appear at any time, particularly at times when you would expect usage to be low. Unlike someone watering his lawn, flushing a toilet, or filling a pool, leaks don't stop. What's the deliverable? Lower water bills and a more robust water system during droughts — not data, but the result of data.
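
The leak signature described above is simple enough to sketch. The readings, threshold, and overnight window below are all invented; the idea is just that legitimate use drops to near zero overnight, while a leak leaves a floor of consumption that never goes away.

    # Hypothetical hourly meter readings (gallons per hour), one list per day.
    daily_readings = [
        # A normal day: usage drops to near zero overnight.
        [0.1, 0.0, 0.1, 0.0, 0.2, 1.5, 3.0, 2.0, 0.5, 0.3, 0.4, 0.2,
         0.3, 0.2, 0.1, 0.2, 0.5, 2.5, 3.5, 2.0, 1.0, 0.5, 0.2, 0.1],
        # A day with a suspected leak: usage never falls back to baseline.
        [0.9, 0.8, 0.9, 0.8, 1.0, 2.3, 3.8, 2.8, 1.3, 1.1, 1.2, 1.0,
         1.1, 1.0, 0.9, 1.0, 1.3, 3.3, 4.3, 2.8, 1.8, 1.3, 1.0, 0.9],
    ]

    LEAK_FLOOR = 0.5  # gallons/hour that should not persist through the night

    def looks_like_leak(hourly, night_hours=range(1, 5), floor=LEAK_FLOOR):
        """Flag a day whose overnight minimum usage never falls below the floor."""
        overnight = [hourly[h] for h in night_hours]
        return min(overnight) > floor

    for day, hourly in enumerate(daily_readings):
        print(f"day {day}: leak suspected = {looks_like_leak(hourly)}")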

In medical care, doctors and nurses frequently have more data at their disposal than they know what to do with. The problem isn't the data, but seeing beyond the data to the medical issue. In a collaboration between IBM and the University of Ontario, researchers knew that most of the data streaming from the systems monitoring premature babies was discarded. While readings of a baby's vital signs might be taken every few milliseconds, they were being digested into a single reading that was checked once or twice an hour. By taking advantage of the entire data stream, it was possible to detect the onset of life-threatening infections as much as 24 hours before the symptoms were apparent to a human. Again, a covert data product; and the fact that it's covert is precisely what makes it valuable. A human can't deal with the raw data, and digesting the data into hourly summaries so that humans can use it makes it less useful, not more. What doctors and nurses need isn't data, they need to know that the sick baby is about to get sicker.
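
A toy illustration, with made-up numbers, of why the full stream matters: a brief episode of instability is obvious in the raw readings but vanishes once the hour is digested into a single average.

    import random
    from statistics import mean, stdev

    random.seed(0)

    # Hypothetical heart-rate samples over one hour: mostly stable around 150 bpm,
    # with a short run of erratic readings buried in the middle.
    stable = [random.gauss(150, 2) for _ in range(1650)]
    erratic = [random.gauss(150, 20) for _ in range(150)]
    samples = stable[:800] + erratic + stable[800:]

    hourly_mean = mean(samples)
    worst_minute = max(stdev(samples[i:i + 60]) for i in range(0, len(samples) - 60, 60))

    print(f"hourly summary: {hourly_mean:.1f} bpm (looks normal)")
    print(f"worst one-minute spread in the raw stream: {worst_minute:.1f} bpm")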

Eben Hewitt, author of "Cassandra: The Definitive Guide," works for a large hotel chain. He told me that the hotel chain considers itself a software company that delivers a data product. The company's real expertise lies in the reservation systems, the supply management systems, and the rest of the software that glues the whole enterprise together. It's not a small task. They're tracking huge numbers of customers making reservations for hundreds of thousands of rooms at tens of thousands of properties, along with various awards programs, special offers, rates that fluctuate with holidays and seasons, and so forth. The complexity of the system is certainly on par with LinkedIn, and the amount of data they manage isn't that much smaller. A hotel looks awfully concrete, but in fact, your reservation at Westin or Marriott or Day's Inn is data. You don't experience it as data, however — you experience it as a comfortable bed at the end of a long day. The data is hidden, as it should be.

I see another theme developing. Overt products tend to depend on overt data collection: LinkedIn and Facebook don't have any data that wasn't given to them explicitly, though they may be able to combine it in unexpected ways. With covert data products, not only is data invisible in the result, but it tends to be collected invisibly. It has to be collected invisibly: we would not find a self-driving car satisfactory if we had to feed it with our driving history. These products are frequently built from data that's discarded because nobody knows how to use it; sometimes it's the "data exhaust" that we leave behind as our cell phones, cars, and other devices collect information on our activities. Many cities have all the data they need to do real-time traffic analysis; many municipal water supplies have extensive data about water usage, but can't yet use the data to detect leaks; many hospitals connect patients to sensors, but can't digest the data that flows from those sensors. We live in an ocean of ambient data, much of which we're unaware of. The evolution of data products will center around discovering uses for these hidden sources of data.

The power of combining data

The first generation of data products, such as CDDB, were essentially single databases. More recent products, such as LinkedIn's Skills database, are composites: Skills incorporates databases of users, employers, job listings, skill descriptions, employment histories, and more. Indeed, the most important operation in data science may be a "join" between different databases to answer questions that couldn't be answered by any single database alone.
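
As a small, hypothetical illustration of such a join: neither table below can say which skills are most common at a given employer, but joining members to skills answers that in a couple of lines. The data and column names are invented.

    import pandas as pd

    members = pd.DataFrame({
        "member_id": [1, 2, 3, 4],
        "employer": ["Acme", "Acme", "Globex", "Globex"],
    })
    skills = pd.DataFrame({
        "member_id": [1, 1, 2, 3, 4, 4],
        "skill": ["python", "sql", "sql", "statistics", "sql", "hadoop"],
    })

    # The join is what creates the new fact: skill counts per employer.
    joined = members.merge(skills, on="member_id")
    print(joined.groupby(["employer", "skill"]).size().sort_values(ascending=False))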

Facebook's facial recognition provides an excellent example of the power in linked databases. In the most general case, identifying faces (matching a face to a picture, given millions of possible matches) is an extremely difficult problem. But that's not the problem Facebook has solved. In a reply to Tim O'Reilly, Jeff Jonas said that while one-to-many picture identification remains an extremely difficult problem, one-to-few identification is relatively easy. Facebook knows about social networks, and when it sees a picture, Facebook knows who took it and who that person's friends are. It's a reasonable guess that any faces in the picture belong to the taker's Facebook friends. So Facebook doesn't need to solve the difficult problem of matching against millions of pictures; it only needs to match against pictures of friends. The power doesn't come from a database of millions of photos; it comes from joining the photos to the social graph.
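
A sketch of why one-to-few is so much easier, using invented face "signatures" as plain vectors (a stand-in for the learned representations a real system would use): the unknown face is scored only against the uploader's handful of friends, not against every account.

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    # Hypothetical face signatures for the uploader's friends only.
    friend_signatures = {
        "alice": [0.9, 0.1, 0.3],
        "bob":   [0.2, 0.8, 0.4],
        "carol": [0.4, 0.4, 0.9],
    }

    unknown_face = [0.85, 0.15, 0.35]

    # One-to-few: rank a handful of friends instead of millions of accounts.
    best_match = max(friend_signatures,
                     key=lambda name: cosine(unknown_face, friend_signatures[name]))
    print("best guess:", best_match)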


The goal of discovery

Many current data products are recommendation engines, using collaborative filtering or other techniques to suggest what to buy, who to friend, etc. One of the holy grails of the "new media" is to build customized, personalized news services that automatically find what the user thinks is relevant and interesting. Tools like Apple's Genius look through your apps or your record collection to make recommendations about what else to buy. "People you may know," a feature common to many social sites, is effectively a recommendation engine.

But mere recommendation is a shallow goal. Recommendation engines aren't, and can't be, the end of the road. I recently spent some time talking to Bradford Cross (@bradfordcross), founder of Woven, and eventually realized that his language was slightly different from the language I was used to. Bradford consistently talked about "discovery," not recommendation. That's a huge difference. Discovery is the key to building great data products, as opposed to products that are merely good.

The problem with recommendation is that it's all about recommending something that the user will like, whether that's a news article, a song, or an app. But simply "liking" something is the wrong criterion. A couple months ago, I turned on Genius on my iPad, and it said things like "You have Flipboard, maybe you should try Zite." D'oh. It looked through all my apps, and recommended more apps that were like the apps I had. That's frustrating because I don't need more apps like the ones I have. I'd probably like the apps it recommended (in fact, I do like Zite), but the apps I have are fine. I need apps that do something different. I need software to tell me about things that are entirely new, ideally something I didn't know I'd like or might have thought I wouldn't like. That's where discovery takes over. What kind of insight are we talking about here? I might be delighted if Genius said, "I see you have ForScore, you must be a musician, why don't you try Smule's Magic Fiddle" (well worth trying, even if you're not a musician). That's where recommendation starts making the transition to discovery.
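
One way to see the difference in code: a plain recommender ranks candidates by similarity to what you already have, while a discovery-oriented one rewards quality and penalizes candidates that are too close to your existing collection. The scores here are invented placeholders, not any vendor's algorithm.

    # Hypothetical similarity of each candidate app to the user's current apps,
    # on a 0-1 scale, alongside a generic quality/popularity score.
    candidates = {
        "another-news-reader": {"similarity": 0.95, "quality": 0.8},
        "sheet-music-app":     {"similarity": 0.20, "quality": 0.9},
        "photo-filter-app":    {"similarity": 0.35, "quality": 0.7},
    }

    def recommend(apps):
        # "More like what you have": pure similarity ranking.
        return max(apps, key=lambda a: apps[a]["similarity"])

    def discover(apps, novelty_weight=0.7):
        # Reward quality, penalize being too similar to the existing collection.
        return max(apps, key=lambda a: apps[a]["quality"] - novelty_weight * apps[a]["similarity"])

    print("recommendation:", recommend(candidates))   # another-news-reader
    print("discovery:", discover(candidates))         # sheet-music-app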

Eli Pariser's "The Filter Bubble" is an excellent meditation on the danger of excessive personalization and a media diet consisting only of stuff selected because you will "like" it. If I only read news that has been preselected to be news I will "like," news that fits my personal convictions and biases, not only am I impoverished, but I can't take part in the kind of intelligent debate that is essential to a healthy democracy. If I only listen to music that has been chosen because I will "like" it, my music experience will be dull and boring. This is the world of E.M. Forster's story "The Machine Stops," where the machine provides a pleasing, innocuous cocoon in which to live. The machine offers music, art, and food — even water, air, and bedding; these provide a context for all "ideas" in an intellectual space where direct observation is devalued, even discouraged (and eventually forbidden). And it's no surprise that when the machine breaks down, the consequences are devastating.

I do not believe it is possible to navigate the enormous digital library that's available to us without filtering, nor does Pariser. Some kind of programmatic selection is an inevitable part of the future. Try doing Google searches in Chrome's Incognito mode, which suppresses any information that could be used to personalize search results. I did that experiment, and it's really tough to get useful search results when Google is not filtering based on its prior knowledge of your interests.

But if we're going to break out of the cocoon in which our experience of the world is filtered according to our likes and dislikes, we need to get beyond naïve recommendations to break through to discovery. I installed the iPad Zite app shortly after it launched, and I find that it occasionally breaks through to discovery. It can find articles for me that I wouldn't have found for myself, that I wouldn't have known to look for. I don't use the "thumbs up" and "thumbs down" buttons because I don't want Zite to turn into a parody of my tastes. Unfortunately, that seems to be happening anyway. I find that Zite is becoming less interesting over time: even without the buttons, I suspect that my Twitter stream is telling Zite altogether too much about what I like and degrading the results. Making the transition from recommendation to true discovery may be the toughest problem we face as we design the next generation of data products.

Interfaces

In the dark ages of data products, we accessed data through computers: laptops and desktops, and even minicomputers and mainframes if you go back far enough. When music and video first made the transition from physical products to data products, we listened and watched on our computers. But that's no longer the case: we listen to music on iPods; read books on Kindles, Nooks, and iPads; and watch online videos on our Internet-enabled televisions (whether the Internet interface is part of the TV itself or in an external box, like the Apple TV). This transition is inevitable. Computers make us aware of data as data: one disk failure will make you painfully aware that your favorite songs, movies, and photos are nothing more than bits on a disk drive.

It's important that Apple was at the core of this shift. Apple is a master of product design and user interface development. And it understood something about data that those of us who preferred listening to music through WinAmp or FreeAmp (now Zinf) missed: data products would never become part of our lives until the computer was designed out of the system. The user experience was designed into the product from the start. DJ Patil (@dpatil), Data Scientist in Residence at Greylock Partners, says that when building a data product, it is critical to integrate designers into the engineering team from the beginning. Data products frequently have special challenges around inputting or displaying data. It's not sufficient for engineers to mock up something first and toss it over to design. Nor is it sufficient for designers to draw pretty wireframes without understanding what the product is or how it works. The earlier design is integrated into the product group, and the deeper the understanding designers have of the product, the better the results will be. Patil suggested that FourSquare succeeded because it used GPS to make checking into a location trivially simple. That's a design decision as much as a technical decision. (Success isn't fair: as a Dodgeball review points out, position wasn't integrated into cell phones, so Dodgeball's user interface was fundamentally hobbled.) To listen to music, you don't want a laptop with a disk drive, a filesystem, and a user interface that looks like something from Microsoft Office; you want something as small and convenient as a 1960s transistor radio, but much more capable and flexible.

What else needs to go if we're going to get beyond a geeky obsession with the artifact of data to what the customer wants? Amazon has done an excellent job of packaging ebooks in a way that is unobtrusive: the Kindle reader is excellent, it supports note taking and sharing, and Amazon keeps your location in sync across all your devices. There's very little file management; it all happens in Amazon's cloud. And the quality is excellent. Nothing gives a product a data smell quite as much as typos and other errors. Remember Project Gutenberg?

Back to music: we've done away with ripping CDs and managing the music ourselves. We're also done with the low-quality metadata from CDDB (although I've praised CDDB's algorithm, the quality of its data is atrocious, as anyone with songs by John "Lennnon" knows). Moving music to the cloud in itself is a simplification: you don't need to worry about backups or keeping different devices in sync. It's almost as good as an old phonograph, where you could easily move a record from one room to another, or take it to a friend's house. But can the task of uploading and downloading music be eliminated completely? We're partway there, but not completely. Can the burden of file management be eliminated? I don't really care about the so-called "death of the filesystem," but I do care about shielding users from the underlying storage mechanism, whether local or in the cloud.

New interfaces for data products are all about hiding the data itself, and getting to what the user wants. The iPod revolutionized audio not by adding bells and whistles, but by eliminating knobs and controls. Music had become data. The iPod turned it back into music.

The drive toward human time

It's almost shocking that in the past, Google searches were based on indexes that were built as batch jobs, with possibly a few weeks before a given page made it into the index. But as human needs and requirements have driven the evolution of data products, batch processing has been replaced by "human time," a term coined by Justin Sheehy (@justinsheehy), CTO of Basho Technologies. We probably wouldn't complain about search results that are a few minutes late, or maybe even an hour, but having to wait until tomorrow to search today's Twitter stream would be out of the question. Many of my examples only make sense in human time. Bus arrival times don't make sense after the bus has left, and while making predictions based on the previous day's traffic might have some value, to do the job right you need live data. We'd laugh at a self-driving car that used yesterday's road conditions. Predicting the onset of infection in a premature infant is only helpful if you can make the prediction before the infection becomes apparent to human observers, and for that you need all the data streaming from the monitors.

To meet the demands of human time, we're entering a new era in data tooling. Last September, Google blogged about Caffeine and Percolator, its new framework for doing real-time analysis. Few details about Percolator are available, but we're starting to see new tools in the open source world: Apache Flume adds real-time data collection to Hadoop-based systems. A recently announced project, Storm, claims to be the Hadoop of real-time processing. It's a framework for assembling complex topologies of message processing pipelines and represents a major rethinking of how to build data products in a real-time, stream-processing context.
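
A minimal sketch of the batch-versus-human-time distinction (not Flume's or Storm's actual APIs): rather than collecting a day of events and summarizing them overnight, each event updates a running estimate the moment it arrives. The feed, route length, and speeds below are toy numbers.

    import time
    from collections import deque

    def event_stream():
        """Stand-in for a live feed: yields (timestamp, bus_position_km) as events arrive."""
        position = 0.0
        while position < 1.0:
            position += 0.05            # toy numbers: the bus advances a little each tick
            yield time.time(), position
            time.sleep(0.05)

    ROUTE_LENGTH_KM = 1.0
    recent = deque(maxlen=10)           # sliding window of recent observations

    # Each event updates the estimate immediately, instead of waiting for a batch job.
    for ts, position in event_stream():
        recent.append((ts, position))
        if len(recent) >= 2:
            (t0, p0), (t1, p1) = recent[0], recent[-1]
            speed = (p1 - p0) / (t1 - t0)          # km per second over the window
            eta_seconds = (ROUTE_LENGTH_KM - position) / speed
            print(f"bus arrives in about {eta_seconds:.1f} s")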

Conclusions

Data products are increasingly part of our lives. It's easy to look at the time spent in Facebook or Twitter, but the real changes in our lives will be driven by data that doesn't look like data: when it looks like a sign saying the next bus will arrive in 10 minutes, or that the price of a hotel reservation for next week is $97. That's certainly the tack that Apple is taking. If we're moving to a post-PC world, we're moving to a world where we interact with appliances that deliver the results of data, rather than the data itself. Music and video may be represented as a data stream, but we're interested in the music, not the bits, and we are already moving beyond interfaces that force us to deal with its "bitly-ness": laptops, files, backups, and all that. We've witnessed the transformation from vinyl to CD to digital media, but the process is ongoing. We rarely rip CDs anymore, and almost never have to haul out an MP3 encoder. The music just lives in the cloud (whether it's Amazon's, Apple's, Google's, or Spotify's). Music has made the transition from overt to covert. So have books. Will you have to back up your self-driving route-optimized car? I doubt it. Though that car is clearly a data product, the data that drives it will have disappeared from view.

Earlier this year Eric Schmidt said:

Google needs to move beyond the current search format of you entering a query and getting 10 results. The ideal would be us knowing what you want before you search for it...

This controversial and somewhat creepy statement actually captures the next stage in data evolution. We don't want lists or spreadsheets; we don't want data as data; we want results that are in tune with our human goals and that cause the data to recede into the background. We need data products that derive their power by mashing up many sources. We need products that deliver their results in human time, rather than as batch processes run at the convenience of a computing system. And most crucially, we need data products that go beyond mere recommendation to discovery. When we have these products, we will forget that we are dealing with data. We'll just see the results, which will be aligned with our needs.

We are seeing a transformation in data products similar to what we have seen in computer networking. In the '80s and '90s, you couldn't have a network without being intimately aware of the plumbing. You had to manage addresses, hosts files, shared filesystems, even wiring. The high end of technical geekery was wiring a house with Ethernet. But all that network plumbing hasn't just moved into the walls: it's moved into the ether and disappeared entirely. Someone with no technical background can now build a wireless network for a home or office by doing little more than calling the cable company. Data products are striving for the same goal: consumers don't want to, or need to, be aware that they are using data. When we achieve that, when data products have the richness of data without calling attention to themselves as data, we'll be ready for the next revolution.

September 07 2011

Look at Cook sets a high bar for open government data visualizations

Every month, more open government data is available online. Open government data is being used in mobile apps, baked into search engines or incorporated into powerful data visualizations. An important part of that trend is that local governments are becoming data suppliers.

For local, state and federal governments, however, releasing data is not enough. Someone has to put it to work, pulling the data together to create cohesive stories so citizens and other stakeholders can gain more knowledge. Sometimes this work is performed by public servants, though data visualization and user experience design have historically not been strong suits of government employees. In the hands of skilled developers and designers, however, open data can be used to tell powerful stories.

One of the best recent efforts at visualizing local open government data can be found at Look at Cook, which tracks government budgets and expenditures from 1993 to 2011 in Cook County, Illinois.

Look at Cook screenshot

The site was designed and developed by Derek Eder and Nick Rougeux, in collaboration with Cook County Commissioner John Fritchey. Below, Eder explains how they built the site, the civic stack tools they applied, and the problems Look at Cook aims to solve.

Why did you build Look at Cook?

Derek Eder: After being installed as a Cook County Commissioner, John Fritchey, along with the rest of the Board of Commissioners, had to tackle a very difficult budget season. He realized that even though the budget books were presented in the best accounting format possible and were also posted online in PDF format, this information was still not friendly to the public. After some internal discussion, one of his staff members, Seth Lavin, approached me and Nick Rougeux and asked that we develop a visualization that would let the public easily explore and understand the budget in greater detail. Seth and I had previously connected through some of Chicago's open government social functions, and we were looking for an opportunity for the county and the open government community to collaborate.

What problems does Look at Cook solve for government?

Derek Eder: Look at Cook shines a light on what's working in the system and what's not. Cook County, along with many other municipalities, has its fair share of problems, but before you can even try to fix any of them, you need to understand what they are. This visualization does exactly that. You can look at the Jail Diversion department in the Public Safety Fund and compare it to the Corrections and Juvenile Detention departments. They have an inverse relationship, and you can actually see one affecting the other between 2005 and 2007. There are probably dozens of other stories like these hidden within the budget data. All that was needed was an easy way to find and correlate them — which anyone can now do with our tool.
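
To give a sense of how little analysis that takes once the budget is in a flat table: the sketch below assumes a long-format export with invented column names and dollar figures, not the actual Look at Cook dataset.

    import pandas as pd

    # Hypothetical long-format budget export: one row per department per year.
    budget = pd.DataFrame({
        "year":          [2005, 2006, 2007] * 2,
        "department":    ["Jail Diversion"] * 3 + ["Corrections"] * 3,
        "appropriation": [4.0, 2.5, 1.5, 300.0, 320.0, 345.0],  # $ millions, invented
    })

    by_year = budget.pivot(index="year", columns="department", values="appropriation")
    print(by_year)
    print("correlation:", by_year["Jail Diversion"].corr(by_year["Corrections"]))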

Look at Cook visualization example
Is there a relationship between the lower funding for Cook County's Jail Diversion and Crime Prevention division and the higher funding levels for the Department of Corrections and the Juvenile Temporary Detention Center divisions? (Click to enlarge.)

What problems does Look at Cook solve for citizens?

Derek Eder: Working on and now using Look at Cook opened my eyes to what Cook County government does. In Chicago especially, there is a big disconnect between where the county begins and where the city ends. Now I can see that the county runs specific hospitals and jails, maintains highways, and manages dozens of other civic institutions. Additionally, I know how much money it is spending on each, and I can begin to understand just how $3.5 billion is spent every year. If I'm interested, I can take it a step further and start asking questions about why the county spends money on what it does and how it has been distributed over the last 18 years. Examples include:

  • Why did the Clerk of the Circuit Court get a 480% increase in its budget between 2007 and 2008? See the 2008 public safety fund.
  • How is the Cook County Board President going to deal with a 74% decrease in appropriations for 2011? See the 2011 president data.
  • What happened in 2008 when the Secretary of the Board of Commissioners got its funding reallocated to the individual District Commissioners? See the 2008 corporate fund.

As a citizen, I now have a powerful tool for asking these questions and being more involved in my local government.

What data did you use?

Derek Eder: We were given budget data in a fairly raw format as a basic spreadsheet broken down into appropriations and expenditures by department and year. That data went back to 1993. Collectively, we and Commissioner Fritchey's office agreed that clear descriptions of everything were crucial to the success of the site, so his office diligently spent the time to write and collect them. They also made connections between all the data points so we could see what control officer was in charge of what department, and they hunted down the official websites for each department.

What tools did you use to build Look at Cook?

Derek Eder: Our research began with basic charts in Excel to get an initial idea of what the data looked like. Considering the nature of the data, we knew we wanted to show trends over time and let people compare departments, funds, and control officers. This made line and bar charts a natural choice. From there, we created a couple iterations of wireframes and storyboards to get an idea of the visual layout and style. Given our prior technical experience building websites at Webitects, we decided to use free tools like jQuery for front-end functionality and Google Fusion Tables to house the data. We're also big fans of Google Analytics, so we're using it to track how people are using the site.

Specifically, we used:

  • jQuery for the front-end functionality.
  • Google Fusion Tables to house the data.
  • Google Analytics to track how people are using the site.

What design principles did you apply?

Derek Eder: Our guiding principles were clarity and transparency. We were already familiar with other popular visualizations, like the New York Times' federal budget and the Death and Taxes poster from WallStats. While they were intriguing, they seemed to lack some of these traits. We wanted to illustrate the budget in a way that anyone could explore without being an expert in county government. From a visual standpoint, the goal was to present the information professionally and essentially let the visuals get out of the way so the data could be the focus.

We feel that designing with data means that the data should do most of the talking. Effective design encourages people to explore information without making them feel overwhelmed. A good example of this is how we progressively expose more information as people drill down into departments and control officers. Effective design should also create some level of emotional connection with people so they understand what they're seeing. For example, someone may know one of the control officers or have had an experience with one of the departments. This small connection draws their attention to those areas and gets them to ask questions about why things are the way they are.

This interview was edited and condensed.

September 01 2011

Just published: "Big Data Now"

We've taken the wraps off of "Big Data Now," a free collection that brings together much of the data coverage we've featured here on Radar over the last year.

Mike Loukides kicked things off in June 2010 with "What is data science?" and from there we've pursued the various threads and themes that emerged. The interesting thing is that our coverage areas have evolved organically. We didn't set out to explore specific domains or technologies. Instead, we sought to chronicle the natural growth of the data science space.

That evolution allowed the content to — in a way — categorize itself, and those categories became evident when we were assembling the table of contents for "Big Data Now." Here's how the topics sorted out:

Data issues — The opportunities and ambiguities of the data space are evident in this segment's discussions around privacy, the implications of data-centric industries, and even in the debate about the phrase "data science" itself.

The application of data — An exploration of data applications showed that this segment is quickly expanding to include everything from data startups to established enterprises to media/journalism to education and research. A "data product" can emerge from virtually any domain.

Data science and data tools — The tools and technologies that drive data science are, of course, essential to this space, but the varied techniques being applied are also key to understanding the big data arena.

The business of data — This is all about the actions connected to data — the process of finding, organizing, and analyzing data that allows organizations of all sizes to improve and innovate.

To be clear, "Big Data Now" represents the story up to this point (hence, the "now" part of the title). In the weeks and months ahead we'll certainly see important shifts in the data landscape. That's why we'll continue to follow this space through ongoing Radar coverage and our series of online and in-person Strata events.

You can download "Big Data Now" for free here.
