Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

April 05 2012

Data as seeds of content

Despite the attention big data has received in the media and among the technology community, it is surprising that we are still shortchanging the full capabilities of what data can do for us. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: to derive meaningful insights that can help us make informed decisions and take action to improve our businesses and our lives.

I recently spoke on the topic of automating content at the O'Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion's share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I'll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.

To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that — visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is the quality of visualizations in the real world varies dramatically. Even for the visualizations that do make intuitive sense, they often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won't be able to gain anything useful from it because they don't know how to interpret the information.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20


To be clear, I'm a big fan of visualizations, but they are not the end-all in data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds for content, whereby data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can spout as large a content tree as your imagination will allow.

Below, I describe each limb of the content tree. The examples I cite are sports related because that's what we've primarily focused on at my company, Automated Insights. But we've done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.

Long-form content

By long-form, I mean three or more paragraphs — although it could be several pages or even book length — that use human-readable language to reveal key trends, records and deltas in data. This is the hardest form of content to automate, but technology in this space is rapidly improving. For example, here is a recap of an NFL game generated out of box score and play-by-play data.

A long-form sports recap driven by data
A long-form sports recap driven by data. See the full story.

Short-form content

These are bullets, headlines, and tweets of insights that can boil a huge dataset into very actionable bits of language. For example, here is a game notes article that was created automatically out of an NCAA basketball box score and historical stats.

Mobile and social content

We've done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.

Metrics

By metrics, I'm referring to the process of creating a single number that's representative of a larger dataset. Metrics are shortcuts to boil data into something easier to understand. For instance, we've created metrics for various sports, such as a quarterback ranking system that's based on player performance.

Real-time updates

Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.

Content applications

This is one few people consider, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, which is an app that allows sports fans to discover 10-20+ statistically based "slams" that enable them to talk trash about any team.

A variation on visualizations

Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted text-based insights with the graphical visualization to allow them to work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series game. It shows the ebb and flow of the game.

Win probability from World Series 2011 game 7
Play-by-play win probability from game seven of the 2011 World Series.

What now?

As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.

Related:


November 04 2011

Top Stories: October 31-November 4, 2011

Here's a look at the top stories published across O'Reilly sites this week.

How I automated my writing career
You scale content businesses by increasing the number of people who create the content ... or so conventional wisdom says. Learn how a former author is using software to simulate and expand human-quality writing.

What does privacy mean in an age of big data?
Ironclad digital privacy isn't realistic, argues "Privacy and Big Data" co-author Terence Craig. What we need instead are laws and commitments founded on transparency.

If your data practices were made public, would you be nervous?
Solon Barocas, a doctoral student at New York University, discusses consumer perceptions of data mining and how companies and data scientists can shape data mining's reputation.

Five ways to improve publishing conferences
Keynotes and panel discussions may not be the best way to program conferences. What if organizers instead structured events more like a great curriculum?


Anthropology extracts the true nature of tech
Genevieve Bell, director of interaction and experience research at Intel, talks about how anthropology can inform business decisions and product design.


Tools of Change for Publishing, being held February 13-15 in New York, is where the publishing and tech industries converge. Register to attend TOC 2012.

November 03 2011

How I automated my writing career

In 2001, I got an itch to write a book. Like many people, I naïvely thought, "I have a book or two in me," as if writing a book is as easy as putting pen to paper. It turns out to be very time consuming, and that's after you've spent countless hours learning and researching and organizing your topic of choice. But I marched on and wrote or co-wrote 10 books in a five-year period. I'm a glutton for punishment.

My day job during that time was programming. I've been programming for 16 years. My whole career I've focused on automating the un-automatable — essentially making computers do things people never thought they could do. By the time I started on my 10th book, I got another kind of itch — I wanted to automate my writing career. I was getting bored with the tedium of writing books, and the money wasn't that good.

But that's absurd, right? How can a computer possibly write something coherent and informative, much less entertaining? The "how can a computer possibly do X?" questions are the ones I've spent my career trying to answer. So, I set out on a quest to create software that could write. It took more effort than writing 10 books put together, but after building a team of 12 people, we were able to use our software to generate more than 100,000 sports-related stories in a nine-month period.

Before I get into specifics with what our software produces, I think it's worth highlighting some of the attributes that make software a great candidate to be a writer:

  • Software doesn't get writer's block, and it can work around the clock.
  • Software can't unionize or file class-action lawsuits because we don't pay enough (like many of the content farms have had to deal with).
  • Software doesn't get bored and start wondering how to automate itself.
  • Software can be reprogrammed, refactored and improved — continuously.
  • Software can benefit from the input of multiple people. This is unlike traditional writing, which tends to be a solitary event (+1 if you count the editor).
  • Perhaps most importantly, software can access and analyze significantly more data than what a single person (or even a group of people) can do on their own.

Software isn't a panacea, though. Not all content can be easily automated (yet). The type of content my company, Automated Insights, has automated is quantitatively oriented. That's the trick. We've automated content by applying meaning to numbers, to data. Sports was the first category we tackled. Sports by their nature are very data heavy. By our internal estimates, 70% of all sports-related articles are analyzing numbers in one form or another.

Our technology combines a large database of structured data, a real-time feed of stats, and a large database of phrases, and algorithms to tie it all together to produce articles from two to eight paragraphs in length. The algorithms look for interesting patterns in the data to determine what to write about.

In November of 2010, we launched the StatSheet Network, a collection of 345 websites (one for every Division-I NCAA Basketball team) that were fully automated. Check out my favorite team: UNC Tar Heels.

Automated game recap
Software mines data to construct short game recaps. (Click to see full story.)

We included the typical kind of stats you'd expect on a basketball site, but also embedded visualizations and our fully automated articles. We automated 14 different types of stories, everything from game recaps and previews to players of the week and historical retrospectives. Recently, we launched similar sites for every MLB team (check out the Detroit Tigers site), and soon we are launching sites for every NFL and NCAA Football team.

Sports is only one of many different categories we are working on. We've also done work in finance, real estate and a few other data-intensive industries. However, don't limit your thinking on what's possible. We get a steady stream of requests from non-obvious industries, such as pharmaceutical clinical trials and even domain name registrars. Any area that has large datasets where people are trying to derive meaning from the data are potential candidates for our technology.

Automation plus human, not automation versus human

Creating software that can write long-form narratives is very difficult, full of all sorts of interesting artificial intelligence, machine learning and natural language problems. But with the right mix of talent (and funding), we've been able to do it. It really does take a keen understanding of how software and the written word can work together.

I often hear it suggested that software-generated prose must be very bland and stilted. That's only the case if the folks behind the software write bland and stilted prose. Software can be just as opinionated as any writer.

A common, and funny, question I get from journalists is: "when will you automate me out out of a job?" I find the question humorous because built into the question is the assumption that if our software can write the perfect story on a particular topic, then no one else should attempt to write about it. That's just not going to happen. What's happening instead is that media companies are using our software to help scale their businesses. Initially, that takes the form of generating stories on topics a media outlet didn't have the resources to cover. In other cases, it means putting our stories through an editorial process that customizes the content to the specific needs of the publisher. You still need humans for that. There will be less of a need for folks to spend their time writing purely quantitative pieces, but that should be liberating. Now, they can focus on more qualitative, value-added commentary that humans are inherently good at. Quantitative stories can — and probably should — be mostly automated because computers are better at that.

Software will make hyperlocal content possible and even profitable. Many companies have tried to solve the "hyperlocal problem" with minimal success. It's just too hard to scale content creation out to every town in the U.S. (or the world, for that matter). For certain categories (e.g. high school sports), software-generated content makes perfect sense. You'll see automated content play a big role here in the coming years.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Software-generated books?

Because I've been so focused on running Automated Insights, I haven't had time to write any new books recently. I suggested to a colleague that we should turn our software loose and have it write my next book. He looked at me and asked, "How can it possibly do that?" That's what I like to hear.

But is a software-generated book even feasible? Our software can create eight paragraphs now, but is it possible to create eight chapters' worth of content? The answer is "yes," but not quite the same kind of technical books I used to write, at least right now. It would be easy for us to extend our technology to write even longer pieces. That's not the issue. Our software is good at quantitative analysis using structured data.

The kind of books I used to write were not based on data and were qualitative in nature. I pulled from my experience and did supplemental research, made a judgment on the best way to perform a task, then documented it. We are in the early stages of building software that will do more qualitative analysis such as this, but that's a much harder challenge. The main advantage of today's usage of software writing is to automate repetitive types of content. This is less applicable for books.

In the near term, the writers at O'Reilly and elsewhere have nothing to worry about. But I wouldn't count out automation in the long term.

Associated box score photo on home and category pages via Wikipedia.

Related:

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!

Schweinderl