Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

April 12 2012

Strata Week: Add structured data, lose local flavor?

Here are a few of the data stories that caught my attention this week:

A possible downside to Wikidata

Wikidata data model diagram
Screenshot from the Wikidata Data Model page.

The Wikimedia Foundation — the good folks behind Wikipedia — recently proposed a Wikidata initiative. It's a new project that would build out a free secondary database to collect structured data that could provide support in turn for Wikipedia and other Wikimedia projects. According to the proposal:

"Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata, you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today."

But in The Atlantic this week, Mark Graham, a research fellow at the Oxford Research Institute, takes a look at the proposal, calling these "changes that have worrying connotations for the diversity of knowledge in the world's sixth most popular website." Graham points to the different language editions of Wikipedia, noting that the encyclopedic knowledge contained therein is always highly diverse. "Not only does each language edition include different sets of topics, but when several editions do cover the same topic, they often put their own, unique spin on the topic. In particular, the ability of each language edition to exist independently has allowed each language community to contextualize knowledge for its audience."

Graham fears that emphasizing a standardized, machine-readable, semantic-oriented Wikipedia will lose this local flavor:

"The reason that Wikidata marks such a significant moment in Wikipedia's history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is that fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups."

His arguments raise questions about the perceived universality of data, when in fact what we might find instead is terribly nuanced and localized, particularly when that data is contributed by humans who are distributed globally.

The intricacies of Netflix personalization

Netflix suggestion buttonNetflix's recommendation engine is often cited as a premier example of how user data can be mined and analyzed to build a better service. This week, Netflix's Xavier Amatriain and Justin Basilico penned a blog post offering insights into the challenges that the company — and thanks to the Netflix Prize, the data mining and machine learning communities — have faced in improving the accuracy of movie recommendation engines.

The Netflix post raises some interesting questions about how the means of content delivery have changed recommendations. In other words, when Netflix refocused on its streaming product, viewing interests changed (and not just because the selection changed). The same holds true for the multitude of ways in which we can now watch movies via Netflix (there are hundreds of different device options for accessing and viewing content from the service).

Amatriain and Basilico write:

"Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.


Reposted byRK RK

January 31 2012

Embracing the chaos of data

A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he'll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.

What do you mean asking data scientists to embrace the chaos of data?

Pete WardenPete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.

Are we wasting time trying to make unstructured data structured?

Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can't get it. Most structured data is the result of years of effort, so it is only available with a lot of strings, either financial or through usage restrictions.

The first advantage of unstructured data is that it's widely available because the producers don't see much value in it. The second advantage is that because there's no "structuring" work required, there's usually a lot more of it, so you get much broader coverage.

A good comparison is Yahoo's highly-structured web directory versus Google's search index built on unstructured HTML soup. If you were looking for something that was covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google's broad coverage made it more useful. For example, I hear that 30% of search queries are "once in history" events — unique combinations of terms that never occur again.

Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it's harder to get started, but the potential rewards are much greater.

How do you see data tools developing over the next few years? Will they become more accessible to more people?

Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Associated photo on home and category pages: "mess with graphviz by Toms Bauģis, on Flickr


November 03 2011

How I automated my writing career

In 2001, I got an itch to write a book. Like many people, I naïvely thought, "I have a book or two in me," as if writing a book is as easy as putting pen to paper. It turns out to be very time consuming, and that's after you've spent countless hours learning and researching and organizing your topic of choice. But I marched on and wrote or co-wrote 10 books in a five-year period. I'm a glutton for punishment.

My day job during that time was programming. I've been programming for 16 years. My whole career I've focused on automating the un-automatable — essentially making computers do things people never thought they could do. By the time I started on my 10th book, I got another kind of itch — I wanted to automate my writing career. I was getting bored with the tedium of writing books, and the money wasn't that good.

But that's absurd, right? How can a computer possibly write something coherent and informative, much less entertaining? The "how can a computer possibly do X?" questions are the ones I've spent my career trying to answer. So, I set out on a quest to create software that could write. It took more effort than writing 10 books put together, but after building a team of 12 people, we were able to use our software to generate more than 100,000 sports-related stories in a nine-month period.

Before I get into specifics with what our software produces, I think it's worth highlighting some of the attributes that make software a great candidate to be a writer:

  • Software doesn't get writer's block, and it can work around the clock.
  • Software can't unionize or file class-action lawsuits because we don't pay enough (like many of the content farms have had to deal with).
  • Software doesn't get bored and start wondering how to automate itself.
  • Software can be reprogrammed, refactored and improved — continuously.
  • Software can benefit from the input of multiple people. This is unlike traditional writing, which tends to be a solitary event (+1 if you count the editor).
  • Perhaps most importantly, software can access and analyze significantly more data than what a single person (or even a group of people) can do on their own.

Software isn't a panacea, though. Not all content can be easily automated (yet). The type of content my company, Automated Insights, has automated is quantitatively oriented. That's the trick. We've automated content by applying meaning to numbers, to data. Sports was the first category we tackled. Sports by their nature are very data heavy. By our internal estimates, 70% of all sports-related articles are analyzing numbers in one form or another.

Our technology combines a large database of structured data, a real-time feed of stats, and a large database of phrases, and algorithms to tie it all together to produce articles from two to eight paragraphs in length. The algorithms look for interesting patterns in the data to determine what to write about.

In November of 2010, we launched the StatSheet Network, a collection of 345 websites (one for every Division-I NCAA Basketball team) that were fully automated. Check out my favorite team: UNC Tar Heels.

Automated game recap
Software mines data to construct short game recaps. (Click to see full story.)

We included the typical kind of stats you'd expect on a basketball site, but also embedded visualizations and our fully automated articles. We automated 14 different types of stories, everything from game recaps and previews to players of the week and historical retrospectives. Recently, we launched similar sites for every MLB team (check out the Detroit Tigers site), and soon we are launching sites for every NFL and NCAA Football team.

Sports is only one of many different categories we are working on. We've also done work in finance, real estate and a few other data-intensive industries. However, don't limit your thinking on what's possible. We get a steady stream of requests from non-obvious industries, such as pharmaceutical clinical trials and even domain name registrars. Any area that has large datasets where people are trying to derive meaning from the data are potential candidates for our technology.

Automation plus human, not automation versus human

Creating software that can write long-form narratives is very difficult, full of all sorts of interesting artificial intelligence, machine learning and natural language problems. But with the right mix of talent (and funding), we've been able to do it. It really does take a keen understanding of how software and the written word can work together.

I often hear it suggested that software-generated prose must be very bland and stilted. That's only the case if the folks behind the software write bland and stilted prose. Software can be just as opinionated as any writer.

A common, and funny, question I get from journalists is: "when will you automate me out out of a job?" I find the question humorous because built into the question is the assumption that if our software can write the perfect story on a particular topic, then no one else should attempt to write about it. That's just not going to happen. What's happening instead is that media companies are using our software to help scale their businesses. Initially, that takes the form of generating stories on topics a media outlet didn't have the resources to cover. In other cases, it means putting our stories through an editorial process that customizes the content to the specific needs of the publisher. You still need humans for that. There will be less of a need for folks to spend their time writing purely quantitative pieces, but that should be liberating. Now, they can focus on more qualitative, value-added commentary that humans are inherently good at. Quantitative stories can — and probably should — be mostly automated because computers are better at that.

Software will make hyperlocal content possible and even profitable. Many companies have tried to solve the "hyperlocal problem" with minimal success. It's just too hard to scale content creation out to every town in the U.S. (or the world, for that matter). For certain categories (e.g. high school sports), software-generated content makes perfect sense. You'll see automated content play a big role here in the coming years.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Software-generated books?

Because I've been so focused on running Automated Insights, I haven't had time to write any new books recently. I suggested to a colleague that we should turn our software loose and have it write my next book. He looked at me and asked, "How can it possibly do that?" That's what I like to hear.

But is a software-generated book even feasible? Our software can create eight paragraphs now, but is it possible to create eight chapters' worth of content? The answer is "yes," but not quite the same kind of technical books I used to write, at least right now. It would be easy for us to extend our technology to write even longer pieces. That's not the issue. Our software is good at quantitative analysis using structured data.

The kind of books I used to write were not based on data and were qualitative in nature. I pulled from my experience and did supplemental research, made a judgment on the best way to perform a task, then documented it. We are in the early stages of building software that will do more qualitative analysis such as this, but that's a much harder challenge. The main advantage of today's usage of software writing is to automate repetitive types of content. This is less applicable for books.

In the near term, the writers at O'Reilly and elsewhere have nothing to worry about. But I wouldn't count out automation in the long term.

Associated box score photo on home and category pages via Wikipedia.


July 26 2011

Real-time data needs to power the business side, not just tech

In 2005, real-time data analysis was being pioneered and predicted to "transform society." A few short years later, the technology is a reality and indeed is changing the way people do business. But Theo Schlossnagle (@postwait), principal and CEO of OmniTI, says we're not quite there yet.

In a recent interview, Schlossnagle said that not only does the current technology allow less-qualified people to analyze data, but that most of the analysis being done is strictly for technical benefit. The real benefit will be realized when the technology is capable of powering real-time business decisions.

Our interview follows.

How has data analysis evolved over the last few years?

TheoSchlossnagle.jpgTheo Schlossnagle: The general field of data analysis has actually devolved over the last few years because the barrier to entry is dramatically lower. You now have a lot of people attempting to analyze data with no sound mathematics background. I personally see a lot of "analysis" happening that is less mature than your run-of-the-mill graduate-level statistics course or even undergraduate-level signal analysis course.

But where does it need to evolve? Storage is cheaper and more readily available than ever before. This leads organizations to store data like its going out of style. This isn't a bad thing, but it causes a significantly lower signal-to-noise ratio. Data analysis techniques going forward will need to evolve much better noise reduction capabilities.

What does real-time data allow that wasn't available before?

Theo Schlossnagle: Real-time data has been around for a long time, so in a lot of ways, it isn't offering anything new. But the tools to process data in real-time have evolved quite a bit. CEP systems now provide a much more accessible approach to dealing with data in real time and building millisecond-granularity real-time systems. In a web application, imagine being about to observe something about a user and make an intelligent decision on that data combined with a larger aggregate data stream — all before before you've delivered the headers back to the user.

What's required to harness real-time analysis?

Theo Schlossnagle: Low-latency messaging infrastructure and a good CEP system. In my work we use either RabbitMQ or ZeroMQ and a whole lot of Esper.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Does there need to be a single person at a company who collects data, analyzes and makes recommendations, or is that something that can be done algorithmically?

Theo Schlossnagle: You need to have analysts, and I think it is critically important to have them report into the business side — marketing, product, CFO, COO — instead of into the engineering side. We should be doing data analysis to make better business decisions. It is vital to make sure we are always supplied with intelligent and rewarding business questions.

A lot of data analysis done today is technical analysis for technical benefit. The real value is when we can take this technology and expertise and start powering better real-time business decisions. Some of the areas doing real-time analysis well in this regard include finance, stock trading, and high-frequency traders.


July 14 2011

Why files need to die

Filing Cabinet by Robin Kearney, on FlickrFiles are an outdated concept. As we go about our daily lives, we don't open up a file for each of our friends or create folders full of detailed records about our shopping trips. Create, watch, socialize, share, and plan — these are the new verbs of the Internet age — not open, save, close and trash.

Clinging to outdated concepts stifles innovation. Consider the QWERTY keyboard. It was designed 133 years ago to slow down typists who were causing typewriter hammers to jam. The last typewriter factory in the world closed last month, and yet even the shiny new iPad 2 still uses the same layout. Creative alternatives like Dvorak and more recently Swype still struggle to compete with this deeply ingrained idea of how a keyboard should look.

Today we use computers for everything from booking travel to editing snapshots, and we accumulate many thousands of files. As a result, we've become digital librarians, devising naming schemes and folder systems just to cope with the mountains of digital "stuff" in our lives.

The file folder metaphor makes no sense in today's world. Gone are the smoky 1970s offices where secretaries bustled around fetching armfuls of paperwork for their bosses, archiving cardboard files in dusty cabinets. Our lives have gone digital and our data zips around the world in seconds as we buy goods online or chat with distant relatives.

A file is a snapshot of a moment in time. If I email you a document, I'm freezing it and making an identical copy. If either of us wants to change it, we have to keep our two separate versions in sync.

So it's no wonder that as we try and force this dated way of thinking onto today's digital landscape, we are virtually guaranteed the pains of lost data, version conflicts and failed uploads.

It's time for a new way to store data – a new mental model that reflects the way we use computers today.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

Flogging a dead horse

Microsoft, Apple and Linux have all failed to provide ways to work with our data in an intuitive way. Many new products have emerged to try and ease our pain, such as Dropbox and Infovark, but they're limited by the tired model of files and folders.

The emergence of Web 2.0 offered new hope, with much brouhaha over folksonomies. The idea was to harness "people power" by getting us to tag pictures or websites with meaningful labels, removing the need for folders. But Flickr and Delicious, poster boys of the tagging revolution, have fallen from favor and as the tools have stagnated and enthusiasm for tagging has dwindled.

Clearly, human knowledge is needed for computers to make sense of our data – but relying on human effort to digitize that knowledge by labeling files or entering data can only take us so far. Even Wikipedia has vast gaps in its coverage.

Instead, we need computers to interpret and organize data for us automatically. This means they'll store not only our data, but also information about that data and what it means – metadata. We need them to really understand our digital information as something more than a set of text documents and binary streams. Only then will we be freed from our filing frustrations.

I am not a machine, don't make me think like one

In all our efforts to interact with computers, we're forced to think like a machine: What device should I access? What format is that file? What application should I launch to read it? But that's not how the brain works. We form associations between related things, and that's how we access our memories:

Associative recall in the brain

Wouldn't it be nice if we could navigate digital data in this way? Isn't it about time that computers learned to express the world in our terms, not theirs?

It might seem like a far-off dream, but it's achievable. To do this, computers will need to know what our data relates to. They can learn this by capturing information automatically and using it to annotate our data at the point it is first stored — saving us from tedious data entry and filing later.

For example, camera manufacturers have realized that adding GPS to cameras provides valuable metadata for each photograph. Back at your PC, your geo-tagged images will be automatically grouped by time and location with zero effort.

Our digital lives are full of signals and sensors that can be similarly harnessed:

  • ReQall uses your calendar and to-do list activity to help deliver information at the right time.
  • RescueTime tracks the websites and programs you use to understand your working habits.
  • Lifelogging projects like MyLifeBits go further still, recording audio and video of your life to provide a permanent record.
  • A research project at Ryerson University demonstrates the idea of context-aware computing — combining live, local data and user information to deliver highly relevant, customized content.

Semantics: Teaching computers to understand human language

Metadata annotation via sensors and semantic annotation

As this diagram shows, hardware and software sensors can only tell half the story. Where computers stand to learn the most is by analyzing the meanings behind the 1s and 0s. Once computers understand our language, our documents and correspondence are no longer just isolated files. They become source material, full of facts and ready to be harvested.

This is the science of semantics — programs that can extract meaning from the written word.

Here's some of what we can do with semantic technology today:

Today, most semantic research is done by enterprises that can afford to spend time and money on enterprise content management (ECM) and content analytics systems to make sense of their vast digital troves. But soon consumers will reap the benefits of semantic technology too, as these applications show:

  • While surfing the web, we can chat and interact around particular movies, books or activities using the browser plug-in GetGlue, which scans the text in the web pages you visit to identify recognized social objects.
  • We will soon have our own intelligent agents, the first of which is Siri, an iPhone app that can book movie tickets or make restaurant reservations without us having to fill in laborious online forms.

This ability for computers to understand our content is critical as we move toward file-less computing. A new era of information-based applications is beginning, but its success requires a world where information isn't fragmented across different files.

Time for a new view of data

Let's use your summer vacation as an example: All the digital information relating to your vacation is scattered across hundreds of files, emails and transactions, often locked into different applications, services and formats.

No matter how many fancy applications you have for "seamlessly syncing" of all these files, any talk of interoperability is meaningless until you have a basic fabric for viewing and interacting with your data at a higher level.

If not files, then what? The answer is surprisingly simple.

What is the one thing all your data has in common?


Almost all data can be thought of as a stream, changing over time:

The streams of my digital life

Already we generate vast streams of data as we go about our lives: credit card purchases, web history, photographs, file edits. We never get to see them on screen like that though. Combining these streams into a single timeline — a personal life stream — brings everything together in a way that makes sense:

A personal life stream

Asking the computer "Show me everything I was doing at 3 p.m. yesterday." or "Where
are Thursday's figures?" is something we can't easily do today. Products such as AllOfMe are beginning to experiment in this space.

We can go further — time itself can be used to help associate things. For example: Since I can only be in one place at one time, everything that happens there and then must be related:

All data at the same time is related

The computer can easily help me access the most relevant information — it just needs to track back along the streams to the last time I was at a certain place or with a specific person:

Related data can be found by finding previous occurrences on each stream

The world — our lives — is interconnected, and data needs to be the same.

This timeline-based view of data is useful, but it becomes even more powerful when combined with the annotations and semantic metadata gathered earlier. With this much cross-linking between data, our information can now be associated with everything it relates to, automatically.

Finally, we can do away with files because we have a system that works like the brain does – giving us another new power — to traverse effortlessly from one related concept or entity to another until we reach the desired information:

Associative data navigation

In a system like this we navigate based on what the data means to us – not which file it is located in.

There will be technical challenges in maintaining data that resides on different devices and is held by different service providers, but cloud computing industry giants like Amazon and Google have already solved much more difficult problems.

A world without files

In the world of linked data and semantically indexed information, saving or losing data is not something we'll have to worry about. The stream is saved. Think about it: You'd never have to organize your emails or project plans because everything would be there, as connected as the thoughts in your head. Collaborating and sharing would simply mean giving other people access to read from or contribute to part of your stream.

We already see a glimpse of this world when we look at Facebook. It's no wonder that it's so successful; it lets us deal with people, events, messages and photos — the real fabric of our everyday lives — not artificial constructs like files, folders and programs

Files are a relic of a bygone age. Often, we hang onto ideas long past their due date because it's what we've always done. But if we're willing to let go of the past, a fascinating world of true human-computer interaction and easy-to-find information awaits.

Moving beyond files to associative and stream-based models will have profound implications. Data will be traceable, creators will be able to retain control of their works, and copies will know they are copies. Piracy and copyright debates will be turned on their heads, as the focus shifts from copying to the real question of who can access what. Data traceability could also help counter the spread of viral rumors and inaccurate news reports.

Issues like anonymity, data security and personal privacy will require a radical rethink. But wouldn't it be empowering to control your own information and who can access it? There's no reason why big corporations should have control of our data. With the right general-purpose operating system that makes hosting a piece of data, recording its metadata and managing access to it as easy as sharing a photo on Facebook, we will all be empowered to embrace our digital futures like never before.

Photo: Filing Cabinet by Robin Kearney, on Flickr


March 01 2011

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!