
March 07 2012

Data markets survey

The sale of data is a venerable business, and has existed since the
middle of the 19th century, when Paul Reuter began providing
telegraphed stock exchange prices between Paris and London, and New
York newspapers founded the Associated Press.

The web has facilitated a blossoming of information providers. As the ability to discover and exchange data improves, the need to rely on aggregators such as Bloomberg or Thomson Reuters is declining. This is a good thing: the business models of large aggregators do not readily scale to web startups, or casual use of data in analytics.

Instead, data is increasingly offered through online marketplaces: platforms that host data from publishers and offer it to consumers. This article provides an overview of the most mature data markets, and contrasts their different approaches and facilities.

What do marketplaces do?

Most of the consumers of data from today's marketplaces are developers. By combining another dataset with your own business data, you can create insight. To take an example from web analytics: by joining an IP address database with the logs from your website, you can understand where your customers are coming from; add demographic data to the mix, and you have some idea of their socio-economic bracket and spending ability.
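A minimal sketch of that kind of enrichment, assuming hypothetical in-memory IP-to-city and demographic lookup tables (a real IP intelligence product would provide far finer-grained data):

```python
import ipaddress

# Illustrative lookup tables; the networks and figures are invented.
IP_BLOCKS = {
    ipaddress.ip_network("203.0.113.0/24"): "London",
    ipaddress.ip_network("198.51.100.0/24"): "New York",
}
DEMOGRAPHICS = {
    "London": {"median_income_usd": 55000},
    "New York": {"median_income_usd": 64000},
}

def enrich(log_line):
    """Join a raw access-log line with geo and demographic data."""
    ip = ipaddress.ip_address(log_line.split()[0])
    city = next((c for net, c in IP_BLOCKS.items() if ip in net), None)
    return {"ip": str(ip), "city": city,
            "demographics": DEMOGRAPHICS.get(city)}

record = enrich('203.0.113.7 - - [07/Mar/2012] "GET / HTTP/1.1" 200')
```

The same join pattern scales from a dictionary lookup to a database or API call against a commercial IP intelligence feed.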

Such insight isn't limited to analytic use; you can also use it to provide value back to a customer, for instance by recommending restaurants near a lunchtime appointment in their calendar. While many datasets are useful, few are as potent as location in the way they provide context to activity.

Marketplaces are useful in three major ways. First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume.
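Cleaning is mundane but dominates integration effort. A toy sketch of the kind of normalization a marketplace performs before data is ready for use (the field names and rules are illustrative, not any marketplace's actual pipeline):

```python
from datetime import datetime

def clean(records):
    """Normalize raw supplier records: trim text, unify dates, drop duplicates."""
    seen, out = set(), []
    for r in records:
        name = r["name"].strip().title()
        # Accept either of two common date spellings.
        raw = r["date"].replace("/", "-")
        date = datetime.strptime(raw, "%Y-%m-%d").date().isoformat()
        key = (name, date)
        if key not in seen:
            seen.add(key)
            out.append({"name": name, "date": date})
    return out

rows = clean([
    {"name": "  acme corp ", "date": "2012/03/07"},
    {"name": "Acme Corp", "date": "2012-03-07"},   # duplicate after cleaning
])
```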

In general, one of the important barriers to the development of the data marketplace economy is the ability of enterprises to store and make use of the data. A principle of big data is that it's often easier to move your computation to the data, rather than the reverse. Because of this, we're seeing increasing integration between cloud computing facilities and data markets: Microsoft's data market is tied to its Azure cloud, and Infochimps offers hosted compute facilities. In the short term, it's probably easier to export data from your business systems to a cloud platform than to try to expand internal systems to integrate external sources.

While cloud solutions offer a route forward, some marketplaces also make the effort to target end-users. Microsoft's data marketplace can be accessed directly through Excel, and DataMarket provides online visualization and exploration tools.

The four most established data marketplaces are Infochimps, Factual, Microsoft Windows Azure Data Marketplace, and DataMarket. A table comparing these providers is presented at the end of this article, and a brief discussion of each marketplace follows.


Infochimps

According to founder Flip Kromer, Infochimps was created to give data life in the same way that code hosting projects such as SourceForge or GitHub give life to code. You can improve code and share it: Kromer wanted the same for data. The driving goal behind Infochimps is to connect every public and commercially available database in the world to a common platform.

Infochimps realized that there's an important network effect of "data with the data," that the best way to build a data commons and a data marketplace is to put them together in the same place. The proximity of other data makes all the data more valuable, because of the ease with which it can be found and combined.

The biggest challenge in the two years Infochimps has been operating is that of bootstrapping: a data market needs both supply and demand. Infochimps' approach is to go for a broad horizontal range of data, rather than specialize. According to Kromer, this is because they view data's value as being in the context it provides: in giving users more insight about their own data. To join up data points into a context, common identities are required (for example, a web page view can be given a geographical location by joining up the IP address of the page request with that from the IP address in an IP intelligence database). The benefit of common identities and data integration is where hosting data together really shines, as Infochimps only needs to integrate the data once for customers to reap continued benefit: Infochimps sells datasets which are pre-cleaned and integrated mash-ups of those from their providers.

By launching a big data cloud hosting platform alongside its marketplace, Infochimps is seeking to build on the importance of data locality.



Factual

Factual was envisioned by founder and CEO Gil Elbaz as an open data platform, with tools that could be leveraged by community contributors to improve data quality. The vision is very similar to that of Infochimps, but in late 2010 Factual elected to concentrate on one area of the market: geographical and place data. Rather than pursue a broad strategy, the idea is to become a proven and trusted supplier in one vertical, then expand. With customers such as Facebook, Factual's strategy is paying off.

According to Elbaz, Factual will look to expand into verticals other than local information in 2012. It is moving one vertical at a time due to the marketing effort required in building quality community and relationships around the data.

Unlike the other main data markets, Factual does not offer reselling facilities for data publishers. Elbaz hasn't found that the cash on offer is attractive enough for many organizations to want to share their data. Instead, he believes that the best way to get data you want is to trade other data, which could provide business value far beyond the returns of publishing data in exchange for cash. Factual offers incentives to its customers to share data back, improving the quality of the data for everybody.

Windows Azure Data Marketplace

Launched in 2010, Microsoft's Windows Azure Data Marketplace sits alongside the company's Applications marketplace as part of the Azure cloud platform. Microsoft's data market is positioned with a very strong integration story, both at the cloud level and with end-user tooling.

Through use of a standard data protocol, OData, Microsoft offers a well-defined web interface for data access, including queries. As a result, programs such as Excel and PowerPivot can directly access marketplace data: giving Microsoft a strong capability to integrate external data into the existing tooling of the enterprise. In addition, OData support is available for a broad array of programming languages.
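Because OData queries are expressed as URL parameters, any HTTP client can consume a marketplace feed. A sketch of composing such a request (the dataset path is hypothetical; real marketplace feeds also require an account key supplied via HTTP Basic authentication):

```python
from urllib.parse import urlencode

# Hypothetical feed URL for illustration only.
BASE = "https://api.datamarket.azure.com/SomePublisher/SomeDataset/v1/Prices"

def odata_url(filter_expr, top=10):
    """Build an OData query URL: $filter selects rows, $top limits the count."""
    params = urlencode({"$filter": filter_expr, "$top": top, "$format": "json"})
    return f"{BASE}?{params}"

url = odata_url("Region eq 'Europe'", top=5)
# The resulting URL can then be fetched with any HTTP client.
```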

Azure Data Marketplace has a strong emphasis on connecting data consumers to publishers, and most closely approximates the popular concept of an "iTunes for Data." Big name data suppliers such as Dun & Bradstreet and ESRI can be found among the publishers. The marketplace contains a good range of data across many commercial use cases, and tends to be limited to one provider per dataset — Microsoft has maintained a strong filter on the reliability and reputation of its suppliers.


DataMarket

Where the other three main data marketplaces put a strong focus on developer and IT customers, DataMarket caters to the end-user as well. Realizing that interacting with bland tables wasn't engaging users, founder Hjalmar Gislason worked to add interactive visualization to his platform.

The result is a data marketplace that is immediately useful for researchers and analysts. The range of DataMarket's data follows this audience too, with a strong emphasis on country data and economic indicators. Much of the data is available for free, with premium data paid at the point of use.

DataMarket has recently made a significant play for data publishers, with the emphasis on publishing, not just selling data. Through a variety of plans, customers can use DataMarket's platform to publish and sell their data, and embed charts in their own pages. At the enterprise end of their packages, DataMarket offers an interactive branded data portal integrated with the publisher's own web site and user authentication system. Initial customers of this plan include Yankee Group and Lux Research.

Data markets compared

Data sources
  Windows Azure Data Marketplace: broad range
  DataMarket: range, with a focus on country and industry stats
  Factual: geo-specialized, with some other datasets
  Infochimps: range, with a focus on geo, social and web sources

Free data and trials
  All four marketplaces offer free datasets, and free trials of paid data are available, including limited free use of APIs.

Data access
  Windows Azure Data Marketplace: API, downloads
  DataMarket: API, downloads for heavy users
  Infochimps: API, downloads

Application hosting
  Windows Azure Data Marketplace: Windows Azure
  Infochimps: Infochimps Platform

Data exploration tools
  Windows Azure Data Marketplace: Service Explorer
  DataMarket: interactive visualization
  Factual: interactive search

Tool integration
  Windows Azure Data Marketplace: Excel, PowerPivot, Tableau and other OData consumers
  Infochimps: developer tool integrations

Data publishing
  Windows Azure Data Marketplace: via database connection or web service
  DataMarket: via upload or web/database connection
  Infochimps: via upload or web service

Data reselling
  Windows Azure Data Marketplace: yes, 20% commission on non-free datasets
  DataMarket: yes; fees and commissions vary, with the ability to create a branded data market
  Factual: no reselling facilities
  Infochimps: yes, 30% commission on non-free datasets


Other data suppliers

While this article has focused on the more general purpose marketplaces, several other data suppliers are worthy of note.

Social data: Gnip and DataSift specialize in offering social media data streams, in particular Twitter.

Linked data: Kasabi, currently in beta, is a marketplace that is distinctive for hosting all its data as Linked Data, accessible via web standards such as SPARQL and RDF.

Wolfram Alpha: Perhaps the most prolific integrator of diverse databases, Wolfram Alpha recently added a Pro subscription level that permits the end user to download the data resulting from a computation.


December 13 2011

Tapping into a world of ambient data

More data was transmitted over the Internet in 2010 than in all other years combined. That's one reason why this year's Web 2.0 Summit used the "data frame" to explore the new landscape of digital business — from mobile to social to location to government.

Microsoft is part of this conversation about big data, given the immense resources and technical talent the Redmond-based software giant continues to hold. During Web 2.0 Summit, I interviewed Microsoft Fellow David Campbell about his big data work and thinking. Key excerpts from our interview follow.

What's Microsoft's role in the present and future of big data?

David Campbell: I've been a data geek for 25-plus years. You go back five to seven years ago, it was kind of hard to get some of the younger kids to think that the data space was interesting to solve problems. Databases are kind of boring stuff, but the data space is amazingly exciting right now.

It's a neat thing to have one part of the company that's processing petabytes of data on tens and hundreds of thousands of servers and then another part that's a commercial business. In the last couple of years, what's been interesting is to see them come together, with things that scale even on the commercial side. That's the cool part about it, and the cool part of being at Microsoft now.

What's happening now seems like it wasn't technically possible a few years ago. Is that the case?

David Campbell: Yes, for a variety of reasons. If you think about the costs just to acquire the data, you can still pay people to type stuff in. It's roughly $1 per kilobyte. But you go back 25 or 30 years and virtually all of the data that we were working with had come off human fingertips. Now it's just out there. Even inherently analog things like phone calls and pictures — they're just born digital. To store it, we've gone from $1,000-per-megabyte 25 years ago to $40-per-terabyte for raw storage. That's an incredible shift.
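Campbell's figures imply an enormous drop in the cost of raw storage. A quick back-of-the-envelope check (taking 1 TB = 10^6 MB):

```python
# Then: $1,000 per megabyte; now: $40 per terabyte (figures from the interview).
cost_then_per_mb = 1_000
cost_now_per_tb = 40

cost_then_per_tb = cost_then_per_mb * 10**6   # $1,000/MB -> $1e9/TB
ratio = cost_then_per_tb / cost_now_per_tb

print(f"Storage is {ratio:,.0f}x cheaper")    # prints: Storage is 25,000,000x cheaper
```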

How is Microsoft working with data startups?

David Campbell: The interesting thing about the data space is that we're talking about a lot of people with machine learning experience. They know a particular domain, but it's really hard for them to go find a set of customers. So, let's say that they've got an algorithm or a model that might be relevant to 5,000 people. It's really hard for them to go find those people.

We built this thing a couple of years ago called the DataMarket. The idea is to change the route to market. So, people can take their model and place it on the DataMarket and then others can go find it.

Here's the example I use inside the company, for those old enough to remember: When people were building Visual Basic controls, it was way harder to write one than it was to consume one. The guys writing the controls didn't have to go find the guy who was building the dentist app. They just published it in this thing from way back when it was actually, on paper, called "Programmer's Paradise," and then the guy who was writing the dentist's app would go there to find what he needed.

It's the same sort of thing here. How do you connect those people, those data scientists, who are going to remain a rare commodity with the set of people who can make use of the models they have?

How are the tools of data science changing?

David Campbell: Tooling is going to be a big challenge and a big opportunity here. We announced a tool recently that we call the Data Explorer, which lets people discover other forms of data — some in the DataMarket, some that they have. They can mash it up, turn it around and then republish it.

One of the things we looked at when we started building the tools is that people tend to do mashups today in what I was calling a "last-mile tool." They might use Access or Excel or some other tool. When they were done, they could share it with anyone else who had the same tool. The idea of the Data Explorer is to back up one step and produce something that is itself a data source that's then consumable by a large number of last-mile tools. You can program against the service itself to produce applications and whatnot.

How should companies collect and use data? What strategic advice would you offer?

David Campbell: From the data side, we've lived in what we'd call a world of scarcity. We thought that data was expensive to store, so we had to get rid of it as soon as possible. You don't want it unless you have a good use for it. Now we think about data from a perspective of abundance.

Part of the challenge, 10 or 15 years ago, was where do I go get the data? Where do I tap in? But in today's world, everything is so interconnected. It's just a matter of teeing into it. The phrase I've used instead of big data is "ambient data." It's just out there and available.

The recommendation would be to stop and think about the latent value in all that data that's there to be collected and that's fairly easy to store now. That's the challenge and the opportunity for all of us.


This interview was edited and condensed.


April 20 2011

An iTunes model for data

As we move toward a data economy, can we take the digital content model and apply it to data acquisition and sales? That's a suggestion that Gil Elbaz (@gilelbaz), CEO and co-founder of the data platform Factual, made in passing at his recent talk at Web 2.0 Expo.

Elbaz spoke about some of the hurdles that startups face with big data — not just the question of storage, but the question of access. But as he addressed the emerging data economy, Elbaz said we will likely see novel access methods and new marketplaces for data. Startups will be able to build value-added services on top of big data, rather than having to worry about gathering and storing the data themselves. "An iTunes for data," is how he described it.

So what would it mean to apply the iTunes model to data sales and distribution? I asked Elbaz to expand on his thoughts.

What problems does an iTunes model for data solve?

Gil Elbaz: One key framework that will catalyze data sharing, licensing and consumption will be an open data marketplace. It is a place where data can be programmatically searched, licensed, accessed, and integrated directly into a consumer application. One might call it the "eBay of data" or the "iTunes of data." iTunes might be the better metaphor because it's not just the content that is valuable, but also the convenience of the distribution channel and the ability to pay for only what you will consume.

How would an iTunes model for data address licensing and ownership?

Gil Elbaz: In the case of iTunes, in a single click I purchase a track, download it, establish licensing rights on my iPhone and up to four other authorized devices, and it's immediately integrated into my daily life. Similarly, the deepest value will come from a marketplace that, with a single click, allows a developer to license data and have it automatically integrated into their particular application development stack. That might mean having the data instantly accessible via API, automatically replicated to a MySQL server on EC2, or synchronized with or copied to Google App Engine.

An iTunes for data could be priced at anything from a single record or entity up to a complete dataset. And it could be licensed for single use, for caching up to 24 hours, or with perpetual rights for a specific application.

What needs to happen for us to move away from "buying the whole album" to buying the data equivalent of a single?

Gil Elbaz: The marketplace will eventually facilitate competitive bidding, which will bring the price down for developers. iTunes is based on a fairly simple set-pricing model. But, in a world of multiple data vendors with commodity data, only truly unique data will command a premium price. And, of course, we'll need great search technology to find the right data or data API based on the developer's codified requirements: specified data schema, data quality bar, licensing needs, and the bid price.

Another dimension that is relevant to Factual's current model: data as a currency. Some of our most interesting partnerships are based on an open exchange of information. Partners access our data and also contribute back streams of edits and other bulk data into our ecosystem. We highly value the contributions our partners make. "Currency" is a medium of exchange and a basis for accessing other scarce resources. In a world where not everyone is yet actively looking to license data, unique data is increasingly an important medium of exchange.

This interview was edited and condensed.


