
April 06 2012

Top Stories: April 2-6, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Privacy, contexts and Girls Around Me
The application of user data is pushing at the edges of cultural norms. That can be a positive, but finding "the line" requires adherence to a few simple and clear guidelines.

Data as seeds of content
Visualizations are one way to make sense of data, but they aren't the only way. Robbie Allen reveals six additional outputs that help users derive meaningful insights from data.

State of the Computer Book Market 2011
In his annual report, Mike Hendrickson analyzes tech book sales and industry data: Part 1, Overall Market; Part 2, The Categories; Part 3, The Publishers; Part 4, The Languages. (Part 5 is coming next week.)

The do's and don'ts of geo marketing
During his session at this week's Where Conference, Placecast CEO Alistair Goodman examined the layers of context that make for rich, geo-targeted messages.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference, May 29 - 31 in San Francisco. Save 20% on registration with the code RADAR20.


September 15 2011

The evolution of data products

In "What is Data Science?," I started to talk about the nature of data products. Since then, we've seen a lot of exciting new products, most of which involve data analysis to an extent that we couldn't have imagined a few years ago. But that raises some important questions: What happens when data becomes a product, specifically, a consumer product? Where are data products headed? As computer engineers and data scientists, we tend to revel in the cool new ways we can work with data. But from the consumer's perspective, our job isn't finished as long as the products are about the data. Proud as we may be about what we've accomplished, the products aren't about the data; they're about enabling their users to do whatever they want, which most often has little to do with data.

It's an old problem: the geeky engineer wants something cool with lots of knobs, dials, and fancy displays. The consumer wants an iPod, with one tiny screen, one jack for headphones, and one jack for charging. The engineer wants to customize and script it. The consumer wants a cool matte aluminum finish on a device that just works. If the consumer has to script it, something is very wrong. We're currently caught between the two worlds. We're looking for the Steve Jobs of data — someone who can design something that does what we want without getting us involved in the details.

Disappearing data

We've become accustomed to virtual products, but it's only appropriate to start by appreciating the extent to which data products have replaced physical products. Not that long ago, music was shipped as chunks of plastic that weighed roughly a pound. When the music was digitized and stored on CDs, it became a data product that weighed under an ounce, but was still a physical object. We've moved even further since: many of the readers of this article have bought their last CD, and now buy music exclusively in online form, through iTunes or Amazon. Video has followed the same path, as analog VHS videotapes became DVDs and are now streamed through Netflix, a pure data product.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

But while we're accustomed to the displacement of physical products by
virtual products, the question of how we take the next step — where
data recedes into the background — is surprisingly tough.
Do we want products that deliver data?
Or do we want products that deliver results based on data? We're
evolving toward the latter, though we're not there yet. The iPod may
be the best example of a product that pushes the data into the
background to deliver what the user wants, but its partner
application, iTunes, may be the worst. The user interface to iTunes
is essentially a spreadsheet that exposes all of your music
collection's metadata. Similarly, the
"People You May Know" feature on social sites such as LinkedIn and
Facebook delivers recommendations: a list of people in the
database who are close to you in one way or another. While that's
much more friendly than iTunes' spreadsheet, it is still a
list, a classic data structure. Products like these have a "data smell." I call them "overt"
data products because the data is clearly visible as part
of the deliverable.

A list may be an appropriate way to deliver potential contacts, and a
spreadsheet may be an appropriate way to edit music metadata. But
there are many other kinds of deliverables that help us to understand
where data products are headed. At a recent event at IBM Research, IBM
demonstrated an application that accurately predicts bus arrival times,
based on real-time analysis of traffic data.
(London is about to roll out something similar.) Another IBM project implemented a

congestion management system
for Stockholm that brought about significant
decreases in traffic and air pollution. A newer
initiative allows drivers to text their destinations to a service, and receive an optimized route, given current traffic and weather conditions. Is a bus arrival time data?
Probably so. Is a route another list structure, like a list of potential Facebook friends? Yes, though the real deliverable here is reduced transit time and an improved
environment. The data is still in the foreground, but we're starting
to look beyond the data to the bigger picture: better quality of life.
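Neither IBM nor London has published its prediction model, but the core of a bus arrival-time predictor, summing per-segment travel times and preferring live speed readings over historical averages, can be sketched in a few lines. Everything below (segment names, lengths, and speeds) is illustrative:

```python
def eta_minutes(segments, live_speeds, fallback_speeds):
    """Estimate arrival time by summing per-segment travel times.

    segments: list of (segment_id, length_km) along the remaining route.
    live_speeds: sparse dict of segment_id -> current speed (km/h).
    fallback_speeds: dict of segment_id -> historical average speed (km/h).
    """
    total_hours = 0.0
    for seg_id, length_km in segments:
        # Prefer a live sensor reading; fall back to the historical average.
        speed = live_speeds.get(seg_id, fallback_speeds[seg_id])
        total_hours += length_km / speed
    return total_hours * 60.0

route = [("a", 2.0), ("b", 1.0)]
live = {"a": 16.0}                       # congestion detected on segment "a"
historical = {"a": 32.0, "b": 32.0}
print(eta_minutes(route, live, historical))  # 9.375
```

The interesting engineering is in keeping `live_speeds` fresh, which is exactly the human-time problem discussed later in this piece.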

These projects suggest the next step in the evolution toward data products that deliver results rather than data. Recently, Ford discussed some experimental work in which they used Google's prediction and mapping capabilities to optimize mileage in hybrid cars based on predictions about where the driver was going. It's clearly a data product: it's doing data analysis on historical driving data and knowledge about road conditions. But the deliverable isn't a route or anything the driver actually sees — it's optimized engine usage and lower fuel consumption. We might call such a product, in which the data is hidden, a "covert" data product.

We can push even further. The user really just wants to get from point A to point B. Google has demonstrated a self-driving car that solves this problem. A self-driving car is clearly not delivering data as the result, but there are massive amounts of data behind the scenes, including maps, Street View images of the roads (which, among other things, help it to compute the locations of curbs, traffic lights, and stop signs), and data from sensors on the car. If we ever find out everything that goes into the data processing for a self-driving car, I believe we'll see a masterpiece of extracting every bit of value from many data sources. A self-driving car clearly takes the next step: it solves the user's real problem while keeping the data hidden behind the scenes.

Once you start looking for data products that deliver real-world results rather than data, you start seeing them everywhere. One IBM project involved finding leaks in the public water supply of Dubuque, Iowa. Water is used around the clock, but changes in usage patterns can indicate a leak. Leaks have a distinctive signature: they can appear at any time, including times when you would expect usage to be low, and unlike someone watering a lawn, flushing a toilet, or filling a pool, they never stop. What's the deliverable? Lower water bills and a water system that is more robust during droughts: not data, but the result of data.
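A detector built on that signature can be surprisingly simple: watch the overnight minimum flow, when legitimate usage should fall to near zero, and flag meters where it never does. This is a hypothetical sketch, not IBM's system; the threshold and window are invented for illustration:

```python
def overnight_minimums(hourly_flow):
    """Minimum flow between 2 a.m. and 4 a.m. for each day.

    hourly_flow: list of days, each a list of 24 hourly readings (liters/hour).
    """
    return [min(day[2:5]) for day in hourly_flow]

def leak_suspected(hourly_flow, baseline=5.0, days=3):
    """Flag a leak when overnight flow never drops near zero for `days` straight days."""
    mins = overnight_minimums(hourly_flow)
    return len(mins) >= days and all(m > baseline for m in mins[-days:])

quiet = [[0.0] * 24 for _ in range(3)]   # normal household: no flow overnight
leaky = [[30.0] * 24 for _ in range(3)]  # constant flow, day and night
print(leak_suspected(quiet), leak_suspected(leaky))  # False True
```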

In medical care, doctors and nurses frequently have more data at their disposal than they know what to do with. The problem isn't the data, but seeing beyond the data to the medical issue. In a collaboration between IBM and the University of Ontario, researchers knew that most of the data streaming from the systems monitoring premature babies was discarded. While readings of a baby's vital signs might be taken every few milliseconds, they were being digested into a single reading that was checked once or twice an hour. By taking advantage of the entire data stream, it was possible to detect the onset of life-threatening infections as much as 24 hours before the symptoms were apparent to a human. This is again a covert data product, and the fact that it's covert is precisely what makes it valuable. A human can't deal with the raw data, and digesting the data into hourly summaries so that humans can use it makes it less useful, not more. What doctors and nurses need isn't data; they need to know that a sick baby is about to get sicker.
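The IBM and University of Ontario system is far more sophisticated than this, but the advantage of keeping the full stream can be sketched as a trailing-baseline detector: a spike that an hourly average would smooth away stands out against the recent window. The window size and threshold here are invented for illustration:

```python
from collections import deque
import statistics

def stream_alerts(readings, window=60, threshold=3.0):
    """Yield (index, value) for samples that deviate sharply from the
    trailing baseline. Hourly averaging would smooth these spikes away."""
    recent = deque(maxlen=window)
    for i, x in enumerate(readings):
        if len(recent) == window:
            mean = statistics.fmean(recent)
            sd = statistics.pstdev(recent)
            if sd > 0 and abs(x - mean) > threshold * sd:
                yield i, x
        recent.append(x)

# 80 ordinary heart-rate samples, then a sudden jump.
monitor = [119, 121] * 40 + [140]
print(list(stream_alerts(monitor)))  # [(80, 140)]
```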

Eben Hewitt, author of "Cassandra: The Definitive Guide," works for a large hotel chain. He told me that the hotel chain considers itself a software company that delivers a data product. The company's real expertise lies in the reservation systems, the supply management systems, and the rest of the software that glues the whole enterprise together. It's not a small task. They're tracking huge numbers of customers making reservations for hundreds of thousands of rooms at tens of thousands of properties, along with various awards programs, special offers, rates that fluctuate with holidays and seasons, and so forth. The complexity of the system is certainly on par with LinkedIn, and the amount of data they manage isn't that much smaller. A hotel looks awfully concrete, but in fact, your reservation at Westin or Marriott or Day's Inn is data. You don't experience it as data, however — you experience it as a comfortable bed at the end of a long day. The data is hidden, as it should be.

I see another theme developing. Overt products tend to depend on overt data collection: LinkedIn and Facebook don't have any data that wasn't given to them explicitly, though they may be able to combine it in unexpected ways. With covert data products, not only is data invisible in the result, but it tends to be collected invisibly. It has to be collected invisibly: we would not find a self-driving car satisfactory if we had to feed it with our driving history. These products are frequently built from data that's discarded because nobody knows how to use it; sometimes it's the "data exhaust" that we leave behind as our cell phones, cars, and other devices collect information on our activities. Many cities have all the data they need to do real-time traffic analysis; many municipal water supplies have extensive data about water usage, but can't yet use the data to detect leaks; many hospitals connect patients to sensors, but can't digest the data that flows from those sensors. We live in an ocean of ambient data, much of which we're unaware of. The evolution of data products will center around discovering uses for these hidden sources of data.

The power of combining data

The first generation of data products, such as CDDB, were essentially a single database. More recent products, such as LinkedIn's Skills database, are composites: Skills incorporates databases of users, employers, job listings, skill descriptions, employment histories, and more. Indeed, the most important operation in data science may be a "join" between different databases to answer questions that couldn't be answered by either database alone.
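A toy illustration of that kind of join, with invented records in the spirit of the Skills product: neither table alone can say which employers have people with a given skill, but the join can.

```python
# Two illustrative "databases": user profiles and a skills table.
users = {
    "u1": {"name": "Ada", "employer": "Acme"},
    "u2": {"name": "Grace", "employer": "Initech"},
}
skills = [
    ("u1", "machine learning"),
    ("u1", "statistics"),
    ("u2", "compilers"),
]

# The join on user id answers a question neither table can alone:
# which employers have people with a given skill?
employers_by_skill = {}
for user_id, skill in skills:
    employer = users[user_id]["employer"]
    employers_by_skill.setdefault(skill, set()).add(employer)

print(employers_by_skill["compilers"])  # {'Initech'}
```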

Facebook's facial recognition provides an excellent example of the power in linked databases. In the most general case, identifying faces (matching a face to a picture, given millions of possible matches) is an extremely difficult problem. But that's not the problem Facebook has solved. In a reply to Tim O'Reilly, Jeff Jonas said that while one-to-many picture identification remains an extremely difficult problem, one-to-few identification is relatively easy. Facebook knows about social networks, and when it sees a picture, Facebook knows who took it and who that person's friends are. It's a reasonable guess that any faces in the picture belong to the taker's Facebook friends. So Facebook doesn't need to solve the difficult problem of matching against millions of pictures; it only needs to match against pictures of friends. The power doesn't come from a database of millions of photos; it comes from joining the photos to the social graph.
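The one-to-few idea can be sketched as candidate narrowing: match a face only against the uploader's friends. The embeddings, distance metric, and threshold below are illustrative stand-ins, not Facebook's actual pipeline:

```python
def identify_face(photo_embedding, uploader, friends_of, face_db, max_distance=0.6):
    """Match a face against the uploader's friends only (one-to-few),
    instead of the entire photo database (one-to-many).

    friends_of: dict of user_id -> set of friend ids (the social graph).
    face_db: dict of user_id -> face embedding (list of floats).
    """
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    best_id, best_d = None, max_distance
    for friend in friends_of.get(uploader, set()):
        if friend in face_db:
            d = distance(photo_embedding, face_db[friend])
            if d < best_d:
                best_id, best_d = friend, d
    return best_id

friends_of = {"alice": {"bob", "carol"}}
face_db = {"bob": [1.0, 0.0], "carol": [0.0, 1.0], "mallory": [0.99, 0.01]}
# "mallory" is the closest embedding overall, but she isn't alice's friend,
# so the one-to-few match correctly returns "bob".
print(identify_face([0.95, 0.05], "alice", friends_of, face_db))  # bob
```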

The goal of discovery

Many current data products are recommendation engines, using collaborative filtering or other techniques to suggest what to buy, who to friend, etc. One of the holy grails of the "new media" is to build customized, personalized news services that automatically find what the user thinks is relevant and interesting. Tools like Apple's Genius look through your apps or your record collection to make recommendations about what else to buy. "People you may know," a feature common to many social sites, is effectively a recommendation engine.
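A minimal collaborative-filtering sketch shows the mechanics common to these engines: score items by how much their owners' histories overlap with yours, then suggest the top scorers you don't already have. This is an illustration, not any particular product's algorithm:

```python
from collections import Counter

def recommend(user, purchases, top_n=2):
    """Suggest items bought by users whose histories overlap with `user`'s.

    purchases: dict mapping user -> set of items they own.
    """
    mine = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(mine & items)       # similarity: shared items
        for item in items - mine:         # only suggest things the user lacks
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

purchases = {"ann": {"x", "y"}, "bob": {"x", "y", "z"}, "cal": {"y", "w"}}
print(recommend("ann", purchases))  # ['z', 'w']
```

Note the built-in limitation: the engine can only surface items similar to what you already have, which is exactly the weakness the next section takes up.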

But mere recommendation is a shallow goal. Recommendation engines aren't, and can't be, the end of the road. I recently spent some time talking to Bradford Cross (@bradfordcross), founder of Woven, and eventually realized that his language was slightly different from the language I was used to. Bradford consistently talked about "discovery," not recommendation. That's a huge difference. Discovery is the key to building great data products, as opposed to products that are merely good.

The problem with recommendation is that it's all about recommending something that the user will like, whether that's a news article, a song, or an app. But simply "liking" something is the wrong criterion. A couple months ago, I turned on Genius on my iPad, and it said things like "You have Flipboard, maybe you should try Zite." D'oh. It looked through all my apps, and recommended more apps that were like the apps I had. That's frustrating because I don't need more apps like the ones I have. I'd probably like the apps it recommended (in fact, I do like Zite), but the apps I have are fine. I need apps that do something different. I need software to tell me about things that are entirely new, ideally something I didn't know I'd like or might have thought I wouldn't like. That's where discovery takes over. What kind of insight are we talking about here? I might be delighted if Genius said, "I see you have ForScore, you must be a musician, why don't you try Smule's Magic Fiddle" (well worth trying, even if you're not a musician). That's where recommendation starts making the transition to discovery.

Eli Pariser's "The Filter Bubble" is an excellent meditation on the danger of excessive personalization and a media diet consisting only of stuff selected because you will "like" it. If I only read news that has been preselected to be news I will "like," news that fits my personal convictions and biases, not only am I impoverished, but I can't take part in the kind of intelligent debate that is essential to a healthy democracy. If I only listen to music that has been chosen because I will "like" it, my music experience will be dull and boring. This is the world of E.M. Forster's story "The Machine Stops," where the machine provides a pleasing, innocuous cocoon in which to live. The machine offers music, art, and food — even water, air, and bedding; these provide a context for all "ideas" in an intellectual space where direct observation is devalued, even discouraged (and eventually forbidden). And it's no surprise that when the machine breaks down, the consequences are devastating.

I do not believe it is possible to navigate the enormous digital library that's available to us without filtering, nor does Pariser. Some kind of programmatic selection is an inevitable part of the future. Try doing Google searches in Chrome's Incognito mode, which suppresses any information that could be used to personalize search results. I did that experiment, and it's really tough to get useful search results when Google is not filtering based on its prior knowledge of your interests.

But if we're going to break out of the cocoon in which our experience of the world is filtered according to our likes and dislikes, we need to get beyond naïve recommendations to break through to discovery. I installed the iPad Zite app shortly after it launched, and I find that it occasionally breaks through to discovery. It can find articles for me that I wouldn't have found for myself, that I wouldn't have known to look for. I don't use the "thumbs up" and "thumbs down" buttons because I don't want Zite to turn into a parody of my tastes. Unfortunately, that seems to be happening anyway. I find that Zite is becoming less interesting over time: even without the buttons, I suspect that my Twitter stream is telling Zite altogether too much about what I like and degrading the results. Making the transition from recommendation to true discovery may be the toughest problem we face as we design the next generation of data products.


In the dark ages of data products, we accessed data through computers: laptops and desktops, and even minicomputers and mainframes if you go back far enough. When music and video first made the transition from physical products to data products, we listened and watched on our computers. But that's no longer the case: we listen to music on iPods; read books on Kindles, Nooks, and iPads; and watch online videos on our Internet-enabled televisions (whether the Internet interface is part of the TV itself or in an external box, like the Apple TV). This transition is inevitable. Computers make us aware of data as data: one disk failure will make you painfully aware that your favorite songs, movies, and photos are nothing more than bits on a disk drive.

It's significant that Apple was at the core of this shift. Apple is a master of product design and user interface development. And it understood something about data that those of us who preferred listening to music through WinAmp or FreeAmp (now Zinf) missed: data products would never become part of our lives until the computer was designed out of the system. The user experience was designed into the product from the start. DJ Patil (@dpatil), Data Scientist in Residence at Greylock Partners, says that when building a data product, it is critical to integrate designers into the engineering team from the beginning. Data products frequently have special challenges around inputting or displaying data. It's not sufficient for engineers to mock up something first and toss it over to design. Nor is it sufficient for designers to draw pretty wireframes without understanding what the product is or how it works. The earlier design is integrated into the product group and the deeper the understanding designers have of the product, the better the results will be. Patil suggested that FourSquare succeeded because it used GPS to make checking into a location trivially simple. That's a design decision as much as a technical decision. (Success isn't fair: as a Dodgeball review points out, position wasn't integrated into cell phones, so Dodgeball's user interface was fundamentally hobbled.) To listen to music, you don't want a laptop with a disk drive, a filesystem, and a user interface that looks like something from Microsoft Office; you want something as small and convenient as a 1960s transistor radio, but much more capable and flexible.

What else needs to go if we're going to get beyond a geeky obsession with the artifact of data to what the customer wants? Amazon has done an excellent job of packaging ebooks in a way that is unobtrusive: the Kindle reader is excellent; it supports note taking and sharing, and Amazon keeps your location in sync across all your devices. There's very little file management; it all happens in Amazon's cloud. And the quality is excellent. Nothing gives a product a data smell quite as much as typos and other errors. Remember Project Gutenberg?

Back to music: we've done away with ripping CDs and managing the music ourselves. We're also done with the low-quality metadata from CDDB (although I've praised CDDB's algorithm, the quality of its data is atrocious, as anyone with songs by John "Lennnon" knows). Moving music to the cloud in itself is a simplification: you don't need to worry about backups or keeping different devices in sync. It's almost as good as an old phonograph, where you could easily move a record from one room to another, or take it to a friend's house. But can the task of uploading and downloading music be eliminated completely? We're partway there, but not completely. Can the burden of file management be eliminated? I don't really care about the so-called "death of the filesystem," but I do care about shielding users from the underlying storage mechanism, whether local or in the cloud.
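Cleaning up metadata like the "Lennnon" example is a classic fuzzy-matching problem: snap near-miss spellings to a canonical form while leaving genuinely unfamiliar names alone. A minimal sketch using Python's standard difflib; the canonical list and similarity cutoff are illustrative:

```python
import difflib

# Illustrative canonical artist list; a real system would use a curated database.
CANONICAL = ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr"]

def normalize_artist(name, cutoff=0.8):
    """Snap a near-miss spelling (e.g. 'John Lennnon') to its canonical form,
    leaving genuinely unfamiliar names untouched."""
    matches = difflib.get_close_matches(name, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else name

print(normalize_artist("John Lennnon"))  # John Lennon
print(normalize_artist("Miles Davis"))   # Miles Davis (no close match)
```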

New interfaces for data products are all about hiding the data itself, and getting to what the user wants. The iPod revolutionized audio not by adding bells and whistles, but by eliminating knobs and controls. Music had become data. The iPod turned it back into music.

The drive toward human time

It's almost shocking that in the past, Google searches were based on indexes that were built as batch jobs, with possibly a few weeks before a given page made it into the index. But as human needs and requirements have driven the evolution of data products, batch processing has been replaced by "human time," a term coined by Justin Sheehy (@justinsheehy), CTO of Basho Technologies. We probably wouldn't complain about search results that are a few minutes late, or maybe even an hour, but having to wait until tomorrow to search today's Twitter stream would be out of the question. Many of my examples only make sense in human time. Bus arrival times don't make sense after the bus has left, and while making predictions based on the previous day's traffic might have some value, to do the job right you need live data. We'd laugh at a self-driving car that used yesterday's road conditions. Predicting the onset of infection in a premature infant is only helpful if you can make the prediction before the infection becomes apparent to human observers, and for that you need all the data streaming from the monitors.

To meet the demands of human time, we're entering a new era in data tooling. Last September, Google blogged about Caffeine and Percolator, its new framework for doing real-time analysis. Few details about Percolator are available, but we're starting to see new tools in the open source world: Apache Flume adds real-time data collection to Hadoop-based systems. A recently announced project, Storm, claims to be the Hadoop of real-time processing. It's a framework for assembling complex topologies of message processing pipelines and represents a major rethinking of how to build data products in a real-time, stream-processing context.
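Storm's actual API (spouts, bolts, Java) is much richer, but the topology idea, independent stages wired into a pipeline that handles one tuple at a time rather than waiting for a batch, can be sketched with Python generators. The event format and stage names are invented:

```python
from collections import Counter

def parse(lines):
    # Source stage (a "spout" in Storm terms): raw records become tuples.
    for line in lines:
        user, action = line.split(",")
        yield user, action

def clicks_only(events):
    # Filtering stage (a "bolt"): pass through only click events.
    for user, action in events:
        if action == "click":
            yield user

def running_counts(users):
    # Stateful stage: emit an updated count as each tuple arrives,
    # instead of waiting for the stream to end.
    totals = Counter()
    for user in users:
        totals[user] += 1
        yield user, totals[user]

stream = ["a,click", "b,view", "a,click"]
print(list(running_counts(clicks_only(parse(stream)))))  # [('a', 1), ('a', 2)]
```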


Data products are increasingly part of our lives. It's easy to look at the time spent in Facebook or Twitter, but the real changes in our lives will be driven by data that doesn't look like data: when it looks like a sign saying the next bus will arrive in 10 minutes, or that the price of a hotel reservation for next week is $97. That's certainly the tack that Apple is taking. If we're moving to a post-PC world, we're moving to a world where we interact with appliances that deliver the results of data, rather than the data itself. Music and video may be represented as a data stream, but we're interested in the music, not the bits, and we are already moving beyond interfaces that force us to deal with its "bitly-ness": laptops, files, backups, and all that. We've witnessed the transformation from vinyl to CD to digital media, but the process is ongoing. We rarely rip CDs anymore, and almost never have to haul out an MP3 encoder. The music just lives in the cloud (whether it's Amazon's, Apple's, Google's, or Spotify's). Music has made the transition from overt to covert. So have books. Will you have to back up your self-driving route-optimized car? I doubt it. Though that car is clearly a data product, the data that drives it will have disappeared from view.

Earlier this year Eric Schmidt said:

Google needs to move beyond the current search format of you entering a query and getting 10 results. The ideal would be us knowing what you want before you search for it...

This controversial and somewhat creepy statement actually captures the next stage in data evolution. We don't want lists or spreadsheets; we don't want data as data; we want results that are in tune with our human goals and that cause the data to recede into the background. We need data products that derive their power by mashing up many sources. We need products that deliver their results in human time, rather than as batch processes run at the convenience of a computing system. And most crucially, we need data products that go beyond mere recommendation to discovery. When we have these products, we will forget that we are dealing with data. We'll just see the results, which will be aligned with our needs.

We are seeing a transformation in data products similar to what we have seen in computer networking. In the '80s and '90s, you couldn't have a network without being intimately aware of the plumbing. You had to manage addresses, hosts files, shared filesystems, even wiring. The high end of technical geekery was wiring a house with Ethernet. But all that network plumbing hasn't just moved into the walls: it's moved into the ether and disappeared entirely. Someone with no technical background can now build a wireless network for a home or office by doing little more than calling the cable company. Data products are striving for the same goal: consumers don't want to, or need to, be aware that they are using data. When we achieve that, when data products have the richness of data without calling attention to themselves as data, we'll be ready for the next revolution.



August 31 2011

New tools and techniques for applying climate data

Upon entering the New York Academy of Sciences (NYAS) foyer, guests are greeted by a huge bust of Darwin, along with wonderfully preserved samples and replicas of his early work adorning the walls. While Darwin revolutionized science with curiosity and the powers of observation, who knows what he could have accomplished with the informatics and computational resources that are available to scientists today?

It was fitting that the NYAS held its First International Workshop on Climate Informatics at its downtown offices last Friday, a beautiful day when everyone seemed to be clearing out of the city ahead of Hurricane Irene. Aside from being a wonderful venue for a workshop (I enjoyed reading the pages from Darwin's "Descent of Man" displayed on the wall), the discussions gave me much food for thought.

As with any small, single-track conference, the majority of talks were good, covering the range of climate data, statistical methods, and applications. And as is often the case, I was more impressed by the talks that addressed topics outside my own disciplines, particularly the machine learning discussion provided by Arindam Banerjee of the University of Minnesota.

But the highlight came during the breakout sessions, which provided in-depth discussions surrounding the challenges and opportunities in applying new methods to climate data management and analysis. Topics ranged from multiple-petabyte data management issues faced by paleoclimatologists to management and manipulation of large datasets associated with global climate modeling and Earth Observation (EO) technologies.

Overall, the workshop showed that we're seeing the early confluence of two communities: climate scientists looking for new tools and techniques are on one side, data scientists and statisticians looking for new problems to tackle are on the other.

Data poor to data rich

One of the event's more interesting side notes came from a breakout session where we explored the transition from being a data-poor field to a data-rich field. As an applied scientist, I certainly would say that climate researchers have been blessed with more data, both spatially and temporally. While the days of stitching various datasets together to test an idea may be behind us, the main issues tend to come down to scale. Is global coverage at 4 km resolution good enough for satellite observations? Can we build a robust model with data at this scale? Do interpolation methods for precipitation and temperature work across various physiographic environments?
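For readers outside the field, inverse-distance weighting is one of the simplest of the interpolation methods that last question alludes to: estimate the value at an ungauged point as a weighted average of nearby stations, with influence falling off with distance. A minimal sketch; the power parameter and gauge values are illustrative:

```python
def idw(stations, x, y, power=2.0):
    """Inverse-distance-weighted estimate at point (x, y).

    stations: list of (sx, sy, value) observations, e.g. rain gauges.
    `power` controls how fast influence decays with distance.
    """
    num = den = 0.0
    for sx, sy, value in stations:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:
            return value              # exactly at a station: return its reading
        w = 1.0 / d2 ** (power / 2.0)
        num += w * value
        den += w
    return num / den

gauges = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]  # two gauges, mm of rain
print(idw(gauges, 1.0, 0.0))  # 15.0 (equidistant, so a simple average)
```

Whether a scheme this simple holds up across varied physiographic environments is precisely the open question raised above.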

While more data helps alleviate some of the scientific challenges we have faced in the past, it also raises more questions. Further, each year of global observations builds the database of reanalysis data; as an example, look at the reanalysis data that's part of MERRA, maintained at NASA's Goddard Space Flight Center.

That said, I'll default to the position that too much data is a good problem to have.

Path forward for the data community

The timing of this event was also useful for another reason. The upcoming Strata Summit in New York will bring together data scientists and others in the data domain to address the challenges and strategies this growing community faces. I'll be giving a talk on new ways to collect, generate and apply atmospheric and oceanic data in a decision-making context under the rubric of atmospheric analytics. In addition to the talk, I'm eager to learn how I can better utilize the data I'm working with as well as bring back some new tools to share with my colleagues in other fields who may face similar big data challenges.



August 25 2011

Strata Week: Green pigs and data

Here are a few of the data stories that caught my attention this week:

Predicting Angry Birds

Angry Birds maker Rovio will begin using predictive analytics technology from the Seattle-based company Medio to help improve game play for its popular pig-smashing game.

According to the press release announcing the partnership, Angry Birds has been downloaded more than 300 million times and is on course to reach 1 billion downloads. But it isn't merely downloaded a lot; it's played a lot, too. The game, which sees up to 1.4 billion minutes of game play per week, generates an incredible amount of data: user demographics, location, and device information are just a few of the data points.

Users' data has always been important in gaming, as game developers must refine their games to maximize the amount of time players spend as well as track their willingness to spend money on extras or to click on related ads. As casual gaming becomes a bigger and more competitive industry, game makers like Rovio will rely on analytics to keep their customers engaged.

As GigaOm's Derrick Harris notes, quoting Zynga's recent S-1 filing, data analysis is already a crucial part of that gaming giant's business:

The extensive engagement of our players provides over 15 terabytes of game data per day that we use to enhance our games by designing, testing and releasing new features on an ongoing basis. We believe that combining data analytics with creative game design enables us to create a superior player experience.

By enlisting the help of Medio for predictive analytics, it's clear that Rovio is taking the same tack to improve the Angry Birds experience.

Unstructured data and HP's next chapter

HP made a number of big announcements last week as it revealed plans for an overhaul. These plans include ending production of its tablet and smartphones, putting the development of WebOS on hold, and spending some $10 billion to acquire the British enterprise software company Autonomy.

AutonomyThe New York Times described the shift in HP as a move to "refocus the company on business products and services," and the acquisition of Autonomy could help drive that via its big data analytics. HP's president and CEO Léo Apotheker said in a statement: "Autonomy presents an opportunity to accelerate our strategic vision to decisively and profitably lead a large and growing space ... Together with Autonomy, we plan to reinvent how both unstructured and structured data is processed, analyzed, optimized, automated and protected."

As MIT Technology Review's Tom Simonite puts it, HP wants Autonomy for its "math skills" and the acquisition will position HP to take advantage of the big data trend.

Founded in 1996, Autonomy has a lengthy history of analyzing data, with an emphasis on unstructured data. Citing an earlier Technology Review interview, Simonite quotes Autonomy founder Mike Lynch's estimate that about 85% of the information inside a business is unstructured. "[W]e are human beings, and unstructured information is at the core of everything we do," Lynch said. "Most business is done using this kind of human-friendly information."

Simonite argues that by acquiring Autonomy, HP could "take a much more dominant position in the growing market for what Autonomy's Lynch dubs 'meaning-based computing.'"

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

Using data to uncover stories for the Daily Dot

After several months of invitation-only testing, the web got its own official daily newspaper this week with the launch of The Daily Dot. CEO Nick White and founding editor Owen Thomas said the publication will focus on the news from various online communities and social networks.

GigaOm's Mathew Ingram gave The Daily Dot a mixed review, calling its focus on web communities "an interesting idea," but he questioned if the "home town newspaper" metaphor really makes sense. The number of kitten stories on the Daily Dot's front page aside, ReadWriteWeb's Marshall Kirkpatrick sees The Daily Dot as part of the larger trend toward data journalism, and he highlighted some of the technology that the publication is using to uncover the Web world's news, including Hadoop and assistance from Ravel Data.

"It's one thing to crawl, it's another to understand the community," Daily Dot CEO White told Kirkpatrick. "What we really offer is thinking about how the community ticks. The gestures and modalities on Reddit are very different from YouTube; it's sociological, not just math."

Got data news?

Feel free to email me.


August 02 2011

Data and the human-machine connection

Arnab Gupta is the CEO of Opera Solutions, an international company offering big data analytics services. I had the chance to chat with him recently about the massive task of managing big data and how humans and machines intersect. Our interview follows.

Tell me a bit about your approach to big data analytics.

Arnab Gupta: Our company is a science-oriented company, and the core belief is that behavior — human or otherwise — can be mathematically expressed. Yes, people make irrational value judgments, but they are driven by common motivation factors, and the math expresses that.

I look at the so-called "big data phenomenon" as the instantiation of human experience. Previously, we could not quantitatively measure human experience, because the data wasn't being captured. But Twitter recently announced that they now serve 350 billion tweets a day. What we say and what we do has a physical manifestation now. Once there is a physical manifestation of a phenomenon, then it can be mathematically expressed. And if you can express it, then you can shape business ideas around it, whether that's in government or health care or business.

How do you handle rapidly increasing amounts of data?

Arnab Gupta: It's an impossible battle when you think about it. The amount of data is going to grow exponentially every day, every week, every year, so capturing it all can't be done. In the economic ecosystem there is extraordinary waste. Companies spend vast amounts of money, and the ratio of investment to insight keeps growing, with much more investment required for similar levels of insight. This method just mathematically cannot work.

So, we don't look for data, we look for signal. What we've said is that the shortcut is a priori identifying the signals to know where the fish are swimming, instead of trying to dam the water to find out which fish are in it. We focus on the flow, not a static data capture.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

What role does visualization play in the search for signal?

Arnab Gupta: Visualization is essential. People dumb it down sometimes by calling it "UI" and "dashboards," and they don't apply science to the question of how people perceive. We need understanding that feeds into the left brain through the right brain via visual metaphor. At Opera Solutions, we are increasingly trying to figure out the ways in which the mind understands and transforms the visualization of algorithms and data into insights.

If understanding is a priority, then which do you prefer: a black-box model with better predictability, or a transparent model that may be less accurate?

Arnab Gupta: People bifurcate, and think in terms of black-box machines vs. the human mind. But the question is whether you can use machine learning to feed human insight. The power lies in taking the black box and making it transparent. You do this by stress testing it. For example, if you were looking at a model for mortgage defaults, you would ask, "What happens if home prices go down by X percent, or interest rates go up by X percent?" You build your own heuristics, so that when you make a bet you understand exactly how the machine is informing your bet.
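Gupta's stress-testing idea can be sketched in a few lines. Everything below is invented for illustration (the stand-in `default_model`, its weights, and the scenario values are hypothetical, not Opera Solutions' actual system); the point is the mechanic: hold the black box fixed and sweep named what-if scenarios through it to see how its output moves.

```python
# Hypothetical sketch: stress-testing an opaque default-risk model by
# perturbing its inputs. The model and its coefficients are stand-ins.

def default_model(home_price_change, rate_change):
    """Stand-in black box: returns a probability of mortgage default."""
    base = 0.05
    # Falling home prices and rising rates both push default risk up.
    return min(1.0, max(0.0, base - 0.4 * home_price_change + 2.0 * rate_change))

def stress_test(model, scenarios):
    """Run the model across named what-if scenarios."""
    return {name: model(**inputs) for name, inputs in scenarios.items()}

scenarios = {
    "baseline":        {"home_price_change": 0.00,  "rate_change": 0.00},
    "prices_down_10%": {"home_price_change": -0.10, "rate_change": 0.00},
    "rates_up_2pts":   {"home_price_change": 0.00,  "rate_change": 0.02},
}

for name, p in stress_test(default_model, scenarios).items():
    print(f"{name}: P(default) = {p:.3f}")
```

Comparing the scenario outputs against the baseline is the "heuristic" Gupta describes: the analyst learns the shape of the black box without needing its internals.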

Humans can do analysis very well, but the machine does it consistently well; it doesn't make mistakes. What the machine lacks is the ability to consider orthogonal factors, and the creativity to consider what could be. The human mind fills in those gaps and enhances the power of the machine's solution.

So you advocate a partnership between the model and the data scientist?

Arnab Gupta: We often create false dichotomies for ourselves, but the truth is it's never been man vs. machine; it has always been man plus machine. Increasingly, I think it's an article of faith that the machine beats the human in most large-scale problems, even chess. But though the predictive power of machines may be better on a large-scale basis, if the human mind is trained to use it powerfully, the possibilities are limitless. In the recent Jeopardy showdown with IBM's Watson, I would have had a three-way competition with Watson, a Jeopardy champion, and a combination of the two. Then you would have seen where the future lies.

Does this mean we need to change our approach to education, and train people to use machines differently?

Arnab Gupta: Absolutely. If you look back in time between now and the 1850s, everything in the world has changed except the classroom. But I think a phase shift is underway. Like most things, the inertia of power is very hard to shift. Change can take a long time and there will be a lot of debris in the process.

One major hurdle is that the language of machine-plus-human interaction has not yet begun to be developed. It's partly a silent language, with data visualization as a significant key. The trouble is that language is so powerful that the left brain easily starts dominating, but really almost all of our critical inputs come from non-verbal signals. We have no way of creating a new form of language to describe these things yet. We are at the beginning of trying to develop this.

Another open question is: What are the skill sets and capabilities necessary for this? At Opera we have focused on the ability to teach machines how to learn. We have 150-160 people working in that area, probably the largest private concentration outside IBM and Google. One of the reasons we are hiring all these scientists is to innovate at the level of core competencies and the science of comprehension.

The business outcome of that is simply practical. At the end of the day, much of what we do is prosaic; it makes money or it doesn't make money. It's a business. But the philosophical fountain from which we drink needs to be a deep one.


July 14 2011

Strata Week: There's money in data sifting

Here are a few of the data stories that caught my attention this week.

Big bucks for DataSift and for data from Twitter's firehose

The social media data mining platform DataSift — one of the two companies that has the rights to re-syndicate the data from Twitter's firehose — announced this week that it has raised $6 million in a Series A round. (The other company with those rights is Gnip, whose handling of the firehose we recently covered.) DataSift aggregates data from other social media streams as well as Twitter, including Facebook, WordPress, and Digg. While providing the tools to "sift" this content and layering it with other metadata makes DataSift compelling, it's the company's connection to Twitter that may have piqued the most interest.

DataSift grew out of the company MediaSift, the same business that created Tweetmeme, a tool that fell into disfavor when Twitter launched its own sharing button. Twitter's move to take over functions that third-party developers once provided has had negative implications for those in the Twitter ecosystem. At this stage, though, it seems Twitter is willing to leave some of the big data processing to other companies.

Investor Mark Suster of GRP Partners, whose firm was one of the leaders in this round of DataSift's investment, made the announcement that he was "doubling down on the Twitter ecosystem." For its part, DataSift "has a product that will turn the stream into a lake," says Suster. In other words, "The Twitter stream like most others is ephemeral. If you don't bottle it as it passes by you it's gone. DataSift has a product that builds a permanent database for you of just the information you want to capture."
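Suster's "turn the stream into a lake" metaphor can be made concrete with a toy sketch. The stream, the messages, and the keyword filter below are all invented for illustration (the real DataSift platform uses its own filtering language and infrastructure); the idea is simply that a standing filter persists only the matching items as the ephemeral stream flows past.

```python
# Illustrative sketch of "bottling" a stream: keep only the items that
# match a standing filter, and let everything else flow by unrecorded.
import json

def stream():
    """Stand-in for a firehose: yields messages as dicts."""
    yield {"user": "a", "text": "big data is eating the world"}
    yield {"user": "b", "text": "lunch was great"}
    yield {"user": "c", "text": "our data pipeline fell over"}

def bottle(source, keyword, sink):
    """Append matching messages to sink, building a permanent record."""
    for msg in source:
        if keyword in msg["text"]:
            sink.append(msg)

lake = []
bottle(stream(), "data", lake)
print(json.dumps(lake, indent=2))
```

In a real system the sink would be a durable store rather than a list, but the asymmetry is the same: the stream is gone once it passes; the lake is what you chose to keep.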

But Suster's announcement also reiterates the importance of Twitter, something that seems particularly relevant in light of the new Google Plus. Suster describes Twitter as real-time, open, asymmetric, social, viral, location-aware, a referral network, explicit, and implicit. But as the buzz over Google Plus continues, it's not clear that Twitter really holds the corner on all of these characteristics any longer.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

What's under the Google Plus hood?

Speaking of Google Plus, there's been lots of commentary and speculation about how successful the launch of the new social network has been, gauged in part by how quickly the network is growing. According to Ancestry.com founder Paul Allen, who has been publishing growth estimates, Google Plus was set to break the 10 million user mark on July 12, just two weeks after its launch. UberMedia's Bill Gross went so far as to predict that Google Plus would become the fastest-growing social network in history.

So how does Google do it (in terms of the technology)? According to the project's technical lead — and OSCON speaker — Joseph Smarr:

Our stack is pretty standard fare for Google apps these days: we use Java servlets for our server code and JavaScript for the browser-side of the UI, largely built with the (open-source) Closure framework, including Closure's JavaScript compiler and template system. A couple nifty tricks we do: we use the HTML5 History API to maintain pretty-looking URLs even though it's an AJAX app (falling back on hash-fragments for older browsers); and we often render our Closure templates server-side so the page renders before any JavaScript is loaded, then the JavaScript finds the right DOM nodes and hooks up event handlers, etc. to make it responsive (as a result, if you're on a slow connection and you click on stuff really fast, you may notice a lag before it does anything, but luckily most people don't run into this in practice). Our backends are built mostly on top of BigTable and Colossus/GFS, and we use a lot of other common Google technologies such as MapReduce (again, like many other Google apps do).

(Google's Joseph Smarr, a member of the Google+ team, will discuss the future of the social web at OSCON. Save 20% on registration with the code OS11RAD.)

Data products for education

The charitable giving site DonorsChoose has been running a contest called Hacking Education, and the contest's finalists have just been announced. DonorsChoose lets people make charitable contributions to public schools, supporting teachers' projects with a Kickstarter-like site for education. DonorsChoose opened up its data to developers — this data encompassed more than 300,000 classroom projects that have inspired some $80 million in charitable giving.

The finalists were chosen from over 50 apps and analyses and included a visualization of the kinds of projects teachers proposed and the kinds donors supported, a .NET Factbook, and an automatic press release system so that local journalists could be notified about projects. The grand prize winner has yet to be chosen, but that project will receive a trophy — and a big thumbs up — from Stephen Colbert.

Got data news?

Feel free to email me.


June 07 2011

How can we visualize the big players in the Web 2.0 data layer?

This post was originally published on John Battelle's Searchblog ("Web 2 Map: The Data Layer - Visualizing the Big Players in the Internet Economy").

As I wrote last month, I'm working with a team of folks to redesign the Web 2 Points of Control map along the lines of this year's theme: "The Data Frame." In the past few weeks I've been talking to scores of interesting people, including CEOs of data-driven start ups (TrialPay and Corda, for example), academics in the public data space, policy folks, and VCs. Along the way I've solidified my thinking about how best to visualize the "data layer" we'll be adding to the map, and I wanted to bounce it off all of you. So here, in my best narrative voice, is what I'm thinking.

First, of course, some data.

Data layer chart

On the left-hand side are eight major players in the Internet economy, along with two categories of players that are critical, but which I've lumped together — payment players such as Visa, Amex, and MasterCard, and carriers or ISP players such as Comcast, AT&T, and Verizon.

I've given each company my own "finger in the air" score for seven major data categories, which are shown across the top (I don't claim these are correct, rather, clay on the wheel for an ongoing dialog). The first six scores are in essence percentages, answering the question "What percentage of this company's holdings are in this type of data?" The seventh, which I've called Wildcard data, is a 1-10 ranking of the potency of that company's "wildcard" data that it's not currently leveraging, but might in the future. I'll get to more detail on each data category later.

Toward the far right, I've noted each company's overall global uniques (from Doubleclick, for now, save the carriers and payment guys — I've proxied their size with the reach of Google). There is also an "engagement" score (again, more on that soon). The final score is a very rough tabulation computing engagement over uniques against the sum of the data scores. There are pivots to be built from this data around each of the scores for various types of data, but I'll leave that for later. This is meant to be a relatively simple introduction to my rough thinking about the data layer. Hopefully, it'll spark some input from you.
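Battelle's "very rough tabulation" is ambiguous as described, so here is one plausible reading of it in code. The company names, scores, and the exact formula below are guesses made up for illustration, not his actual figures; the point is how a single "city size" number could fall out of category scores, wildcard rank, reach, and engagement.

```python
# One plausible reading of the rough tabulation: weight the sum of a
# company's data-category scores (plus its wildcard rank) by reach and
# engagement. All numbers here are invented for illustration.

def city_score(data_scores, wildcard, uniques_m, engagement):
    """Combine category scores, wildcard rank, uniques (millions), engagement."""
    return (sum(data_scores) + wildcard) * uniques_m * engagement

companies = {
    # name: (six category scores 0-1, wildcard 1-10, uniques in millions,
    #        engagement score)
    "SearchCo": ((0.6, 0.9, 0.2, 0.3, 0.4, 0.5), 8, 900, 1.2),
    "SocialCo": ((0.1, 0.1, 0.9, 0.7, 0.4, 0.6), 6, 700, 2.0),
}

ranked = sorted(companies, key=lambda c: city_score(*companies[c]), reverse=True)
print(ranked)
```

Under this particular weighting, a smaller network with much higher engagement can outrank a larger one with more raw reach, which matches the Yahoo-vs.-Facebook intuition in the post.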

Now, before you rip it apart, which I fully invite (especially those of you who are data quants, because I am clearly not, and I am likely mixing some apples and watermelons here), allow me to continue to narrate what I'm trying to visualize.

As you know, the map is a metaphor, showing key territories as "points of control." The companies I've highlighted in the chart all have "home territories" where they dominate a sector — Google in search, Facebook in social, Amazon and eBay in commerce, etc. What I plan to do is create a layer based on the data in the chart that, when activated, shows those companies' relative size and strength.

But how?

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Well, the best idea we've come up with so far is to show each as a small city of sorts, where the relative height of the buildings is determined by a corresponding data point. So Twitter, for example, will have a tall building in the middle of its city, representing "Interest data." Google's tallest building will be search. Facebook's will be social, and so on. And of course the cities can't be all on the same scale, hence our use of total global uniques, and total engagement. Yahoo may be nearly as big as Facebook, but it doesn't have nearly the engagement per user. So its city will be smaller, relatively, than Facebook's.

What is interesting about this approach is that each company's "cityscape" emerges as distinct. Microsoft's is wide but not tall — they have a lot of data in a number of areas. It will probably end up looking like a suburban office park — funnily enough, that's what Microsoft really looks like, for the most part. Amazon and eBay will have high towers of payment data, with a smattering of shorter buildings. And so on. I don't have a good visualization of this yet, but the designers I'm working with at Blend have sketched out a very rough early version just so you can get the idea (see image to the right). The structures will be more whimsical, and of course be keyed with color. But I think you get the idea.

I'm even thinking of adding other features, like "openness" — i.e., can you access, gain copies of, share, and mash up the data controlled by each company? If so, the city won't be walled. Apple, on the other hand, may well end up a walled city, with a moat, on top of a hill.

Now, a bit more detail on the data categories. You all gave me a lot of really good input on my earlier post, where I posited these original categories. But I've kept them the same, save the addition of the wildcard data. Why? Because I think each can be interpreted as larger buckets containing a lot of other data. I'll go through each briefly in turn:

Purchase Data: This is information about who buys what, in essence. But it's also who almost buys what (abandoned carts), when they buy, in what context, and so on.

Search Data: The original database of intentions — query data, path from query data, "intent" data, and tons more search signals.

Social Data: Social graph, but also identity data. Not to mention how people interact inside their graphs, etc.

Interest Data: This is data that describes what is generally called "the interest graph" — declarations of what people are interested in. It's related to content, but it's not just content consumption. It includes active production of interest datapoints, like tweets, status updates, checkins, etc.

Location Data: This is data about where people are, to be sure, but also data about how often we are there, and other correlated data — i.e., what apps we use in location context, who else is there and when, etc.

Content Data: Content is still a king in our world, and knowing patterns of content consumption is a powerful signal. This is data about who reads/watches/consumes what, when, and in what patterns.

Wildcard Data: This is data that is uncategorized, but could have huge implications. For example, Microsoft knows how people interact with their applications and OS. Microsoft and Google have a ton of language data (phonemes, etc.). Carriers see just about everything that passes across their servers, though their ability to use it might be regulated. Google, Yahoo and Microsoft have tons of email interaction data. And so on ...

Now, of course, all these data categories get more powerful as they are leveraged against one another, and of course, I've left tons of really big data players off the map entirely (small startups like Tynt, Quora, or ShareThis have massive amounts of data, as do very large companies like Nielsen, Quantcast, etc.). But you have to make choices to make something like this work.

So, that's where we are with the Web 2 Summit map data layer. Naturally, once the data layer is live, it will be driven by a database, so we can tweak the size and scope of the cities and buildings based on the collective intelligence of the map users' feedback.

What do you think? What's your input? We'll be building this over the next two months, and I'd love your feedback before we get too far down the line. Thanks!


May 11 2011

What are the key data categories companies want to control?

This post was originally published on John Battelle's Searchblog ("Building A New Map And I Need Your Help: What Are The Key Categories of Data In Today's Network Economy?").

Many of you probably remember the "Points of Control" Web 2 Summit Map from last year. It was very well received. Hundreds of thousands of folks came to check it out, and the average engagement time was north of six minutes per visitor. It was a really fun way to make the conference theme come to life, and given the work that went into its creation, we thought it'd be a shame to retire it simply because Web 2 has moved on to a new theme:

For 2011, our theme is "The Data Frame" — focusing on the impact of data in today's networked economy. We live in a world clothed in data, and as we interact with it, we create more — data is not only the web’s core resource, it is at once both renewable and boundless.

Consumers now create and consume extraordinary amounts of data. Hundreds of millions of mobile phones weave infinite tapestries of data, in real time. Each purchase, search, status update, and check-in layers our world with more of it. How our industries respond to this opportunity will define not only success and failure in the networked economy, but also the future texture of our culture. And as we're already seeing, these interactions raise complicated questions of consumer privacy, corporate trust, and our governments’ approach to balancing the two.

How, I wondered, might we update the Points of Control map such that it can express this theme? Well, first of all, it's clear the game is still afoot between the major players. Some boundaries may have moved, and progress has been made (Bing has gained search share, Facebook and Google have moved into social commerce, etc.), but the map in essence is intact as a thought piece.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Then it struck me — each of the major players, and most of the upstarts, have as a core asset in their arsenals data, often many types of it. In addition, most of them covet data that they've either not got access to, or are in the process of building out (think Google in social, for example, or in deals, which to my mind is a major play for local as well as purchase data.) Why not apply the "Data Frame" to the map itself, a lens of sorts that when overlaid upon the topography, shows the data assets and aspirations of each player?

So here's where you come in. If we're going to add a layer of data to each player on the map, the question becomes — what kind of data? And how should we visualize it? My initial thoughts on types of data hew somewhat to my post on the Database of Intentions, so that would include:

  • Purchase Data (including credit card info)
  • Search Data (query, path taken, history)
  • Social Graph Data (identity, friend data)
  • Interest Data (likes, tweets, recommendations, links)
  • Location Data (ambient as well as declared/checked in)
  • Content Data (journey through content, likes, engagement, "behavioral")

Those are some of the big buckets. Clearly, we can debate whether, for example, identity should be its own category, separate from social, and that's exactly the kind of argument I hope to spark. I'm sure I've missed huge swaths of landscape, but I'm writing this in a rush (have a meeting in five minutes!) and wanted to get the engine started, so to speak.

I'm gathering a small group of industry folks at my home in the next week to further this debate, but I most certainly want to invite my closest collaborators — readers here at Searchblog, to help us out as we build the next version of the map. Which, by the way, will be open sourced and ready for hacking ...

So please dive into the comments and tell me, what are the key categories of data that companies are looking to control?



February 03 2011

A new challenge looks for a smarter algorithm to improve healthcare

Starting on April 4, the Heritage Health Prize (@HPNHealthPrize) competition, funded by the Heritage Provider Network (HPN), will ask the world's scientists to submit an algorithm that will help them to identify patients at risk of hospitalization before they need to go to the emergency room.

"This competition is to literally predict the probability that someone will go to the hospital in the next year," said Anthony Goldbloom at the Strata Conference. Goldbloom is the founder and CEO of Kaggle, the Australian data mining company that has partnered with HPN on the competition. "The idea is to rank how at risk people are, go through the list and figure out which of the people on the list can be helped," he said.

If successful, HPN estimates that the algorithm produced by this competition could save them billions in healthcare costs. In the process, the development and deployment of the algorithm could provide the rest of the healthcare industry with a successful model for reducing costs.
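The competition's core task, scoring each patient's probability of hospitalization and then ranking the list, can be sketched minimally. Everything below is hypothetical: the features, weights, and patients are invented, and a real entry would learn its weights from HPN's de-identified claims history rather than hard-coding them.

```python
# Minimal sketch of risk ranking: a hand-rolled logistic score over a few
# illustrative claims-derived features. All values are made up.
import math

WEIGHTS = {"er_visits_last_year": 0.8, "chronic_conditions": 0.5, "age_over_65": 0.6}
BIAS = -2.0

def hospitalization_prob(patient):
    """Logistic score: higher feature values push the probability up."""
    z = BIAS + sum(WEIGHTS[f] * patient.get(f, 0) for f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

patients = {
    "p1": {"er_visits_last_year": 3, "chronic_conditions": 2, "age_over_65": 1},
    "p2": {"er_visits_last_year": 0, "chronic_conditions": 0, "age_over_65": 0},
    "p3": {"er_visits_last_year": 1, "chronic_conditions": 1, "age_over_65": 0},
}

# Rank patients from highest to lowest predicted risk, as Goldbloom
# describes: go through the list and decide who can be helped first.
ranked = sorted(patients, key=lambda p: hospitalization_prob(patients[p]), reverse=True)
print(ranked)
```

The ranking, not the absolute probabilities, is what matters for the intervention Goldbloom describes: care teams work down the list from the top.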

"Finally, we've got a data competition that has real world benefits," said Pete Warden, author of the "Data Source Handbook" and founder of OpenHeatMap. "This is like the Netflix Prize, but for something far more important."

The importance of reducing healthcare costs can't be overstated. Nationally, some $2.8 trillion is spent annually on healthcare in the United States, with that number expected to grow in the years ahead. "There are two problems with the healthcare reform law," said Jonathan Gluck, a senior executive at HPN. "We pay for quantity — the more services you consume, the more we're going to bill you — and it never addressed personal responsibility."

If patients who would benefit from receiving lower cost preventative care can receive relevant treatments and therapies earlier, the cost issue might be addressed.

Why a prize?

HPN is just the latest organization to turn to a prize to generate a solution to a big problem. The White House has been actively pursuing prizes and competitions as a means of catalyzing collaborative open-government innovation around grand national challenges. From the X-Prize to the Netflix Prize to a growing number of government-sponsored challenges, 2011 might just be the year this method for generating better answers hits the adoption tipping point.

Goldbloom noted that in the eight months that Kaggle has hosted competitions, they've never had one where the benchmark hasn't been outperformed. From tourism forecasting to chess ratings, each time the best method was quickly improved within a few weeks, said Goldbloom.

As David Zax highlighted in his Fast Company article on the competition, adding an algorithm to find patients at risk might suggest that doctors' diagnoses or clinical skills are being subtracted from the equation. The idea here is not necessarily to take away a doctor's skills. Rather, it's to provide them with predictive analytics that augment those capabilities. As Zax writes, that has to be taken in context with the current state of healthcare:

A shortage of primary care physicians in the U.S. means that doctors don't always have time to pick up on the subtle connections that might lead to a Gregory House-style epiphany of what's ailing a patient. More importantly, though, the algorithms may point to connections that a human mind simply would never make in the first place.

Balancing privacy with potential

One significant challenge with this competition, so to speak, is that the data set isn't just about what movies people are watching. It's about healthcare, and that introduces a host of complexities around privacy and compliance with regulations. The data has to be de-identified, which naturally impairs what can be done. Gluck emphasized that the competition is HIPAA-compliant. Avoiding a data breach has been prioritized ahead of a successful outcome in the competition. Not doing so, given the sanctions that exist for such a breach, might well have made the competition a non-starter.

Gluck said that Khaled El Emam, a professor at the University of Ottawa and a noted healthcare privacy expert, has been attempting to de-anonymize the test data sets. El Emam has been using public databases and other techniques to try to triangulate identities with records. To date he has not been successful.

Hotspotting the big picture

The potential of the Heritage Health Challenge will be familiar to readers of the New Yorker, where Dr. Atul Gawande published a feature on "healthcare hotspotting." In the article, Gawande examines the efforts of physicians like Dr. Jeffrey Brenner, of Camden, New Jersey, to use data to discover the neediest patients and deliver them better care.

The Camden Coalition has been able to measure its long-term effect on its first thirty-six super-utilizers. They averaged sixty-two hospital and E.R. visits per month before joining the program and thirty-seven visits after—a forty-per-cent reduction. Their hospital bills averaged $1.2 million per month before and just over half a million after—a fifty-six-per-cent reduction.

These results don’t take into account Brenner’s personnel costs, or the costs of the medications the patients are now taking as prescribed, or the fact that some of the patients might have improved on their own (or died, reducing their costs permanently). The net savings are undoubtedly lower, but they remain, almost certainly, revolutionary. Brenner and his team are out there on the boulevards of Camden demonstrating the possibilities of a strange new approach to health care: to look for the most expensive patients in the system and then direct resources and brainpower toward helping them.

The results of the approach taken in Camden are controversial, as Gawande's response to criticism of his article acknowledges. The promise of applying data science to identify patients at higher risk, however, comes at a time when the discipline's ability to deliver meaningful results has never been greater. If a smarter predictive algorithm emerges from this contest, the $3 million in prize money may turn out to have been a bargain.


December 02 2010

Six months after "What is data science?"

Mike Loukides examined the question "What is data science?" here on Radar six months ago.

That post kicked off considerable conversation. It also marked the beginning of our ongoing effort to track the companies, ideas, people and products shaping the data space.

The six-month point is a good time to check in: it's short enough to still feel that initial enthusiasm and long enough to sense deeper trends. With that in mind, below you'll find a handful of interviews and analysis posts that expand on the topics Mike surfaced in his report.

The stories fall into three broad categories: data science skills and technologies, broader applications of data science, and data products.

We'll continue to explore the data science space in the lead-up to February's Strata Conference -- see below for more information on that -- and through additional coverage on Radar and O'Reilly Answers. (Be sure to also check out the excellent "Strata Week" roundups from Edd Dumbill and Julie Steele.)

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% on registration with the code STR111RAD

Data science skills and technologies

What is data science? -- The future belongs to the companies who figure out how to collect and use data successfully. In this in-depth piece, O'Reilly editor Mike Loukides examines the unique skills and opportunities that flow from data science. (Related: A data science cheat sheet)

The SMAQ stack for big data -- We're at the beginning of a revolution in data-driven products and services, driven by a software stack that enables big data processing on commodity hardware. Learn about the SMAQ stack, and where today's big data tools fit in.

The data analysis path is built on curiosity, followed by action -- Precision and preparation define traditional data analysis, but author Philipp K. Janert believes there's more to it than just that. In this interview, he explains how simplicity, experimentation and action can shape data work.

Roger Magoulas, O'Reilly's director of research, offers his take on data science in the following short video:

Broader application of data science

Data as a service -- With "data as a service" APIs like InfoChimps, and embeddable data components like Google Public Data Explorer and WolframAlpha Widgets, we're seeing the democratization of data and data visualization: new ways to access data, new ways to play with data, and new ways to communicate the results to others.

Data science democratized -- Data science has utility -- and repercussions -- well beyond data scientists. New tools are making it easier for non-programmers to tap huge stores of information. Data science's democratizing moment will come when its associated tools can be picked up by tech-savvy non-programmers.

Data products

A new twist on "data-driven site" -- TripAdvisor is using data from its Facebook application to expand its website. In this Q&A, Sanjay Vakil discusses the inner workings of this app-website relationship and passes on advice for companies pursuing their own data-driven products.

Open health data: Spurring better decisions and new businesses -- The iTriage app marries open government data with private information. Peter Hudson, one of the co-founders of the company behind the app, discusses the business and patient opportunities government health data creates.

November 11 2010

Growing new data streams

A pleasantly surprising revelation came to me during the Agriculture Outlook Americas conference, which was held in rainy Boston this week.

The global agricultural community comprises big agribiz (think Cargill, Deere and ADM), family farms in Mato Grosso, bankers and hedge funds (many of which have no clue how things grow), and everything in between. As I do much of my work at the first link in the global supply chain, I attend numerous events like this each year. With such a diverse group, it is often difficult to achieve consensus on anything, and particular irreverence is often displayed toward new technologies. That is notable among the grower community, whose farms and farming techniques are often passed down through generations, just like an old watch or a wedding ring. But though I went into this latest conference expecting the usual pessimism about cooperation, consensus formed around the need for, and application of, new sources of data.

The innovations in agriculture that grab most headlines are usually related to technologies such as new seed varieties, super-combines, or physical infrastructure that increases the efficiency of drip irrigation. So, after one panel session made up of investors looking for opportunities in both hemispheres of the Americas, I asked about the "non-tangible" innovations that often fly under the radar: those that require access to large databases, creativity in data manipulation, and computational resources. The panel agreed that these are major focal points for the next generation of agricultural investments, and nearly every discussion that followed touched on this theme.

The nice thing about quantifiable data for this community is that it can come from subjective sources as well as from measurements repeatedly tested in a laboratory. A grower's logbook, for instance -- containing such information as how a particular crop responded to a specific weather pattern, the amount and type of pest-fighting application used in a given season, and local market offers -- can be assembled into an index: another quantifiable data stream that users have at their disposal. And while at first glance one might suppose that such data streams are closely guarded secrets, growers are probably among the most supportive advocates of open access and data sharing. What wiped out your neighbor's crop a decade ago may be the very thing that hits you this year.
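As a sketch of how such qualitative logbook entries might be folded into a single quantifiable index: score each observation on a fixed scale and take a weighted sum. The categories, score tables, and weights below are invented for illustration, not any real agricultural standard:

```python
# Illustrative sketch: turn qualitative logbook observations into a
# numeric season-stress index via scoring tables and a weighted sum.
# Categories, scores, and weights are assumptions, not a real standard.

SCORES = {
    "pest_pressure": {"low": 0.0, "moderate": 0.5, "severe": 1.0},
    "weather_stress": {"mild": 0.0, "dry_spell": 0.6, "drought": 1.0},
}
WEIGHTS = {"pest_pressure": 0.4, "weather_stress": 0.6}

def season_index(logbook):
    """logbook: dict mapping category -> observed level (a SCORES key).
    Returns a 0..1 stress index (higher = worse season)."""
    return sum(WEIGHTS[cat] * SCORES[cat][level]
               for cat, level in logbook.items())

idx = season_index({"pest_pressure": "moderate",
                    "weather_stress": "dry_spell"})  # ≈ 0.56
```

Once seasons are reduced to comparable numbers like this, they can be correlated with yields or shared across farms — which is exactly the kind of open data exchange the growers I spoke with were advocating.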

In several offline conversations during coffee breaks, I offered some insight based on projects that Weather Trends is pursuing. We're working with clients to quantify relationships between weather patterns, crop disease, and agricultural yields. The potential for collaboration, and a new growth sector for this industry, was evident to everyone.

Looking ahead, I expect numerous high-quality and high-margin products to come to market that have their "roots" in both the acquisition of new types of agricultural data (ranging from genomic to planetary weather), as well as in repackaging existing data. As global food supplies are routinely subject to a number of shocks via weather, foreign exchange or geopolitics, this will be a very important platform for the global agricultural community in the years to come.

