
April 05 2012

Data as seeds of content

Despite the attention big data has received in the media and among the technology community, it is surprising how much of its potential we still leave untapped. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: deriving meaningful insights that help us make informed decisions and take action to improve our businesses and our lives.

I recently spoke on the topic of automating content at the O'Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion's share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I'll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.

To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that — visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is the quality of visualizations in the real world varies dramatically. Even for the visualizations that do make intuitive sense, they often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won't be able to gain anything useful from it because they don't know how to interpret the information.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20


To be clear, I'm a big fan of visualizations, but they are not the be-all and end-all of data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds for content, whereby data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can sprout as large a content tree as your imagination will allow.

Below, I describe each limb of the content tree. The examples I cite are sports related because that's what we've primarily focused on at my company, Automated Insights. But we've done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.

Long-form content

By long-form, I mean three or more paragraphs — although it could be several pages or even book length — that use human-readable language to reveal key trends, records and deltas in data. This is the hardest form of content to automate, but technology in this space is rapidly improving. For example, here is a recap of an NFL game generated out of box score and play-by-play data.

A long-form sports recap driven by data.

Short-form content

These are bullets, headlines, and tweets of insights that can boil a huge dataset into very actionable bits of language. For example, here is a game notes article that was created automatically out of an NCAA basketball box score and historical stats.
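A minimal sketch of the short-form idea, assuming a simple box-score dict (the teams, scores, and players below are hypothetical, and this is far simpler than any production system): templates are selected by the shape of the result, so the language varies the way a human editor's would.

```python
def short_form(box):
    """Turn a basketball box score into headline-style bullets."""
    margin = box["home_pts"] - box["away_pts"]
    winner, loser = (box["home"], box["away"]) if margin > 0 else (box["away"], box["home"])
    hi, lo = max(box["home_pts"], box["away_pts"]), min(box["home_pts"], box["away_pts"])
    # Vary the verb with the margin of victory.
    verb = "edges" if abs(margin) <= 3 else "tops" if abs(margin) <= 10 else "routs"
    bullets = [f"{winner} {verb} {loser} {hi}-{lo}"]
    # Flag individual performances worth a note.
    for player, pts in box["scorers"].items():
        if pts >= 30:
            bullets.append(f"{player} erupts for {pts} points")
        elif pts >= 20:
            bullets.append(f"{player} adds {pts} points")
    return bullets

# Hypothetical game, not a real box score.
game = {
    "home": "North Carolina", "away": "Duke",
    "home_pts": 88, "away_pts": 70,
    "scorers": {"Zeller": 23, "Barnes": 15},
}
for b in short_form(game):
    print("-", b)
```

Real systems layer historical context (streaks, records, season averages) on top of the raw box score, but the core move is the same: rules map data patterns to language.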

Mobile and social content

We've done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.

Metrics

By metrics, I'm referring to the process of creating a single number that's representative of a larger dataset. Metrics are shortcuts to boil data into something easier to understand. For instance, we've created metrics for various sports, such as a quarterback ranking system that's based on player performance.
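The article doesn't spell out Automated Insights' own ranking formula, but the NFL's published passer rating is a familiar instance of the same idea: four per-attempt ratios, each clamped to a fixed range, averaged into a single number.

```python
def passer_rating(comp, att, yds, td, ints):
    """NFL passer rating: four per-attempt components, each clamped
    to [0, 2.375], averaged and scaled to a 0-158.3 range."""
    clamp = lambda x: max(0.0, min(x, 2.375))
    a = clamp((comp / att - 0.3) * 5)       # completion percentage
    b = clamp((yds / att - 3) * 0.25)       # yards per attempt
    c = clamp(td / att * 20)                # touchdown rate
    d = clamp(2.375 - ints / att * 25)      # interception rate
    return (a + b + c + d) / 6 * 100

# A hypothetical stat line: 20-of-30 for 250 yards, 2 TD, 1 INT.
print(round(passer_rating(20, 30, 250, 2, 1), 1))
```

The value of a metric like this is exactly what the paragraph above describes: one number stands in for a whole dataset, at the cost of hiding the components that produced it.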

Real-time updates

Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.

Content applications

This is an approach few people consider, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, an app that lets sports fans discover 10-20+ statistically based "slams" they can use to talk trash about any team.
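StatSmack's actual logic was never published; as a toy sketch under that caveat, the core mechanic can be imagined as comparing two stat lines and emitting a line for each category where the rival is clearly worse (the teams and numbers below are invented).

```python
def stat_slams(team, rival, team_stats, rival_stats):
    """Compare two stat lines and produce trash-talk bullets for
    categories where `rival` trails. Purely illustrative."""
    # (stat name, whether a higher value is better)
    categories = [("ppg", True), ("turnovers", False)]
    slams = []
    for stat, higher_is_better in categories:
        a, b = team_stats[stat], rival_stats[stat]
        worse = (b < a) if higher_is_better else (b > a)
        if worse:
            slams.append(f"{rival} trails {team} in {stat}: {b} vs. {a}")
    return slams

# Invented season averages.
out = stat_slams("UNC", "Duke",
                 {"ppg": 81.2, "turnovers": 11.5},
                 {"ppg": 74.8, "turnovers": 14.1})
for s in out:
    print("-", s)
```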

A variation on visualizations

Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted text-based insights with the graphical visualization, allowing the two to work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series. It shows the ebb and flow of the game.

Play-by-play win probability from game seven of the 2011 World Series.
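The accompanying bullets can be automated as well. A sketch, using an invented probability series rather than the real game data: scan the play-by-play win-probability curve for its largest single-play swing and render it as a sentence.

```python
def biggest_swing(probs):
    """Given a home-team win-probability series (one value per play),
    return the index and size of the largest single-play swing."""
    i = max(range(1, len(probs)), key=lambda k: abs(probs[k] - probs[k - 1]))
    return i, probs[i] - probs[i - 1]

# Invented series; not the actual 2011 World Series data.
wp = [0.50, 0.55, 0.48, 0.30, 0.62, 0.58, 0.71]
play, delta = biggest_swing(wp)
print(f"Play {play}: win probability {'jumped' if delta > 0 else 'fell'} "
      f"{abs(delta) * 100:.0f} points to {wp[play] * 100:.0f}%")
```

Paired with the chart, a bullet like this points the reader at the moment worth examining, which is exactly the interpretation work a bare visualization leaves undone.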

What now?

As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.



April 03 2012

Data's next steps

Steve O'Grady (@sogrady), a developer-focused analyst from RedMonk, views large-scale data collection and aggregation as a problem that has largely been solved. The tools and techniques required for the Googles and Facebooks of the world to handle what he calls "datasets of extraordinary sizes" have matured. In O'Grady's analysis, what hasn't matured are methods for teasing meaning out of this data that are accessible to "ordinary users."

Among the other highlights from our interview:

  • O'Grady on the challenge of big data: "Kevin Weil (@kevinweil) from Twitter put it pretty well, saying that it's hard to ask the right question. One of the implications of that statement is that even if we had perfect access to perfect data, it's very difficult to determine what you would want to ask, how you would want to ask it. More importantly, once you get that answer, what are the questions that derive from that?"
  • O'Grady on the scarcity of data scientists: "The difficulty for basically every business on the planet is that there just aren't many of these people. This is, at present anyhow, a relatively rare skill set and therefore one that the market tends to place a pretty hefty premium on."
  • O'Grady on the reasons for using NoSQL: "If you are going down the NoSQL route for the sake of going down the NoSQL route, that's the wrong way to do things. You're likely to end up with a solution that may not even improve things. It may actively harm your production process moving forward because you didn't implement it for the right reasons in the first place."

The full interview and a complete transcript are also available.



February 24 2012

Top stories: February 20-24, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Data for the public good
The explosion of big data, open data and social data offers new opportunities to address humanity's biggest challenges. The open question is no longer if data can be used for the public good, but how.

Building the health information infrastructure for the modern epatient
The National Coordinator for Health IT, Dr. Farzad Mostashari, discusses patient empowerment, data access and ownership, and other important trends in healthcare.

Big data in the cloud
Big data and cloud technology go hand-in-hand, but it's comparatively early days. Strata conference chair Edd Dumbill explains the cloud landscape and compares the offerings of Amazon, Google and Microsoft.

Everyone has a big data problem
MetaLayer's Jonathan Gosier talks about the need to democratize data tools because everyone has a big data problem.

Three reasons why direct billing is ready for its close-up
David Sims looks at the state of direct billing and explains why it's poised to catch on beyond online games and media.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

Cloud photo: Big cloud by jojo nicdao, on Flickr
