
March 08 2012

Strata Week: Profiling data journalists

Here are a few of the data stories that caught my attention this week.

Profiling data journalists

Over the past week, O'Reilly's Alex Howard has profiled a number of practicing data journalists, following up on the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. Howard argues that data journalism has enormous importance, but "given the reality that those practicing data journalism remain a tiny percentage of the world's media, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact."

Howard's profiles include:


Surveying data marketplaces

Edd Dumbill takes a look at data marketplaces, the online platforms that host data from various publishers and offer it for sale to consumers. Dumbill compares four of the most mature data marketplaces — Infochimps, Factual, Windows Azure Data Marketplace, and DataMarket — and examines their different approaches and offerings.

Dumbill says marketplaces like these are useful in three ways:

"First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume."

Analyzing sports stats

The Atlantic's Dashiell Bennett examines the MIT Sloan Sports Analytics Conference, a "festival of sports statistics" that has grown over the past six years from 175 attendees to more than 2,200.

Bennett writes:

"For a sports conference, the event is noticeably athlete-free. While a couple of token pros do occasionally appear as panel guests, this is about the people behind the scenes — those who are trying to figure out how to pick those athletes for their team, how to use them on the field, and how much to pay them without looking like a fool. General managers and team owners are the stars of this show ... The difference between them and the CEOs of most companies is that the sports guys have better data about their employees ... and a lot of their customers have it memorized."

Got data news?

Feel free to email me.


January 26 2011

Data markets aren't coming. They're already here

Jud Valeski (@jvaleski) is cofounder and CEO of Gnip, a social media data provider that aggregates feeds from sites like Twitter, Facebook, Flickr, delicious, and others into one API.

Jud will be speaking at Strata next week on a panel titled "What's Mine is Yours: the Ethics of Big Data Ownership."

If you're attending Strata, you can also find out more about the growing business of data marketplaces at a "Data Marketplaces" panel with Ian White of Urban Mapping, Peter Marney of Thomson Reuters, Moe Khosravy of Microsoft, and Dennis Yang of Infochimps.

My interview with Jud follows.

Why is social media data important? What can we do with it or learn from it?

Jud Valeski: Social media today is the first time a reasonably large population has communicated digitally in relative public. The ability to programmatically analyze collective conversation has never really existed. Being able to analyze the collective human consciousness has been the dream of researchers and analysts since day one.

The data itself is important because it can be analyzed to assist in disaster detection and relief. It can be analyzed for profit in an industry that has always struggled to pinpoint how and where to spend money. It can be analyzed to determine financial market viability (stock trading, for example). It can be analyzed to understand community sentiment, which has political ramifications; we all want our voices heard in order to shape public policy.

What are some of the most common or surprising queries run through Gnip?

Jud Valeski: We don't look at the queries our customers use. One pattern we have seen, however, is that there are some people who try to use the software to siphon as much data as possible out of a given publisher. "More data, more data, more data." We hear that all the time. But how our customers configure the Gnip software is up to them.

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD

With Gnip, customers can choose the data sources they want not just by site but also by category within the site. Can you tell me more about the options for Twitter, which include Decahose, Halfhose, and Spritzer?

Jud Valeski: We tend to categorize social media sources into three buckets: Volume, Coverage, or Both. Volume streams provide a consumer with a sampled rate of volume (Decahose is 10%, for example, while a full firehose is 100% of some service's activities). Statisticians and analysts like the Volume stuff.

Coverage streams exist to provide full coverage of a certain set of things (e.g., keywords, or the User Mention Stream for Twitter). Advertisers like Coverage streams because their interests are very targeted. There are some products that fall into both categories, but Volume and Coverage tend to describe the overall view.

For Twitter in particular, we use their algorithm as described on their dev pages, adjusted for each particular volume rate desired.
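To make the Volume idea concrete, here is a minimal sketch of rate-based sampling. It is not Twitter's or Gnip's algorithm (the interview only points to Twitter's dev pages for that). The function name `in_sample`, the hash-into-buckets scheme, and the made-up activity IDs are all assumptions for illustration.

```python
import hashlib


def in_sample(activity_id: str, rate_percent: int) -> bool:
    """Return True if this activity belongs to a stream sampled at rate_percent.

    Hypothetical scheme: hash the ID into one of 100 buckets and keep the
    lowest `rate_percent` buckets. The decision is deterministic per ID, so
    a 10% stream is always a subset of any larger stream.
    """
    if not 0 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 0 and 100")
    digest = hashlib.sha256(activity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


# Made-up activity IDs standing in for a publisher's full stream.
ids = [f"activity-{n}" for n in range(10_000)]

firehose = [i for i in ids if in_sample(i, 100)]  # everything
decahose = [i for i in ids if in_sample(i, 10)]   # roughly one in ten

print(len(firehose), len(decahose))
```

Hashing the ID, rather than drawing a random number per activity, means two consumers at the same rate see the same subset, which is one plausible way to keep sampled streams consistent.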

Gnip is currently the only licensed reseller of the full Twitter firehose. Are there other partnerships coming up?

Jud Valeski: "Currently" is the operative word here. While we're enjoying the implied exclusivity of the current conditions, we fully expect Twitter to grow its VAR tier to ensure a more competitive marketplace.

From my perspective, Twitter enabling VARs allows them to focus on what is near and dear to their hearts — developer use cases, promoted Tweets, end users, and the display ecosystem — while enabling firms focused on the data-delivery business to distribute underlying data for non-display use. Gnip provides stream enrichments for all of the data that flows through our software. Those enrichments include format and protocol normalization, as well as stream augmentation features such as global URL unwinding. Those value-adds make social media API integration and data leverage much easier than doing a bunch of one-off integrations yourself.

We're certainly working on other partnerships of this level of significance, but we have nothing to announce at this time.

What do you wish more people understood about data markets and/or the way large datasets can be used?

Jud Valeski: First, data is not free, and there's always someone out there that wants to buy it. As an end-user, educate yourself with how the content you create using someone else's service could ultimately be used by the service-provider.

Second, black markets are a real problem, and just because "everyone else is doing it" doesn't mean it's okay. As an example, botnet-like distributed IP address polling infrastructure is commonly used to extract more data from a publisher's service than their API usage terms allow. While perhaps fun to build and run (sometimes), these approaches clearly result in aggregated pools of publisher data that the publisher never intended to promote. Once collected, the aggregated pools of data are sold to data-hungry analytics firms. This results in end-user frustration, in that the content they produced was used in a manner that flagrantly violated the terms under which they signed up. These databases are frequently called out as infringing on privacy.

Everyone loves a good Robin Hood story, and that's how I'd characterize the overall state of data collection today.

How has real-time data changed the field of customer relationship management (CRM)?

Jud Valeski: CRM firms have a new level of awareness. They no longer rely exclusively on dated user studies. A customer service rep may know about your social life through their dashboard the moment you are connected to them over the phone.

I ultimately see the power of understanding collective consciousness in responding to customer service issues. We haven't even scratched the surface here. Imagine if Company X reached out to you directly every time you had a problem with their product or service. Proactivity can pay huge dividends. Companies haven't tapped even 10% of the potential here, and part of that is because they're not spending enough money in the area yet.

Today, "social" is a checkbox that CRM tools attempt to check off just to keep the boss happy. Tomorrow, social data and metaphors will define the tools outright.

Have you learned anything as a social media user yourself from working on Gnip? Is there anything social media users should be more aware of?

Jud Valeski: Read the terms of service for social media services you're using before you complain about privacy policies or how and where your data is being used. Unless you are on a private network, your data is treated as public for all to use, see, sell, or buy. Don't kid yourself. Of course, this brings us all the way back around to black markets. Black markets — and publishers' generally lackadaisical response to them — cloud these waters.

If you can't make it to Strata, you can learn more about the architectural challenges of distributing social and location data across the web in real time, and how Gnip has evolved to address those challenges, in Jud's contribution to "Beautiful Data."


December 16 2010

Strata Week: Shop 'til you drop

Need a break from the holiday madness? You're not alone. Check out these items of interest from the land of data and see why even the big consumers face tough choices.

Does this place accept returns?

On Monday, Stack Overflow announced that they have moved the Stack Exchange Data Explorer (SEDE) off of the Windows Azure platform and onto in-house hardware.


SEDE is an open source, web-based tool for querying the monthly data dump of Creative Commons data from the company's four main Q&A sites (Stack Overflow, Server Fault, Super User, and Meta) as well as other sites in the Stack Exchange family. The primary reason given for the move (within a polite write-up by Jeff Atwood and SEDE lead Sam Saffron) was the desire to have fine-tuned control over the platform.

"When you are using a [Platform-as-a-Service] you are giving up a lot of control to the service provider. The service provider chooses which applications you can run and imposes a series of restrictions. ... It was disorienting moving to a platform where we had no idea what kind of hardware was running our app. Giving up control of basic tools and processes we use to tune our environment was extremely painful."

While the support that comes with Platform-as-a-Service was acknowledged, it seems that the ability to better automate, adjust, and perpetuate processes and systems with more fine-grained control won out as a bigger convenience.

Where did you get that lovely platform?

Of course, one company's headache is another's dream. Netflix, a company known for playing with big data and crowdsourcing solutions "before it was cool," posted on Tuesday the four reasons they've chosen to use Amazon Web Services (AWS) as their platform and have moved onto it over the last year.

Laudably, the company states that it viewed its tremendous recent growth (in terms of both members and streaming devices) as a license to question everything in the necessary process of re-architecting. Instead of building out their own data centers, etc., they decided to answer that set of questions by paying someone else to worry about it.

Also to their credit, Netflix has enough self-awareness to know what they are and aren't good at. Building top-notch recommendation systems and providing entertainment? You betcha. Predicting customer growth and device engagement? Not so much.

"How many subscribers would you guess used our Wii application the week it launched? How many would you guess will use it next month? We have to ask ourselves these questions for each device we launch because our software systems need to scale to the size of the business, every time."

Self-awareness is in fact the primary lesson in both Netflix's and Stack Exchange's platform decisions. If you feel your attention is better spent elsewhere, write a check. If you've got the time and expertise to hone your hardware, roll your own.

[Of course, Netflix doesn't go for the pre-packaged solutions every time. They also posted recently about why they love open source software, and listed among the projects they make use of and contribute back to: Hadoop, Hive, HBase, Honu, Ant, Tomcat, Hudson, Ivy, Cassandra, etc.]

With what shall we shop?

The New York Times this week released a cool group of interactive maps based on data collected in the Census Bureau's American Community Survey (ACS) from 2005 to 2009. Data is compared against the 2000 census to uncover rates of change.

[While similar to the census, the ACS is conducted every year instead of every 10 years. The ACS includes only a sampling of addresses instead of a comprehensive inventory. It covers much of the same ground on population (age, race, disability status, family relationships), but it also asks for information that is used to help make funding distribution decisions about community services and institutions.]

The Times maps explore education levels; rent, mortgage rates, and home values; household income; and racial distribution. Viewers can select among 22 maps in these four categories, and then pan and zoom to view national, state, or local trends down to the level of individual census tracts.

Above is the national view of the map that looks at change in median household income. The ACS website itself provides some maps displaying the survey numbers from the 2000 census and the 2005-2009 survey, as well as a listing of data tables.

The Times map shows the uneven way in which these numbers have gone up or down in various parts of the country, with some surprising results that are worth exploring. Note that the blue regions are places where income has dropped, and the yellow regions are places where it has increased. (No wonder a lot of us are getting creative with holiday shopping.)

If this kind of research floats your boat, check out Social Explorer, the mapping tool used to create the New York Times maps.

Even markets like to buy things

The emerging landscape of custom data markets is already shifting as Infochimps recently announced the acquisition of Data Marketplace, a start-up incubated at Y Combinator.

While Stewart Brand may be right in thinking information wants to be free, there's also enormous value to be added by aggregating, structuring, and packaging data, as well as in matching up buyers with sellers. That's the main service Data Marketplace aims to provide, particularly in the field of financial data.

At Infochimps, information is offered a la carte, and many of the site's datasets are offered for free. These include sets as diverse as "Word List - 100,000+ official crossword words (Excel readable)", "Measuring Worth: Interest Rates - US & UK 1790-2000", and "Retrosheet: Game Logs (play-by-play) for Major League Baseball Games." Data Marketplace is a bit different, in that it allows users to enter requests for data (with a deadline and budget, if desired) and then matches up would-be buyers with data providers.

Infochimps has said that Data Marketplace, which is less than a year old, will continue to operate as a standalone site, although its founders Steve DeWald and Matt Hodan will depart for new projects.

If you're interested in the burgeoning business of aggregated datasets, be sure to check out the Data Marketplaces panel I'll be moderating at Strata in February.

Not yet signed up for Strata? Register now and save 30% with the code STR11RAD.

December 05 2010

Strata Gems: Where to find data

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Quick starts for charts.

With the growth of both the open data movement and data marketplaces, there's now a wealth of public data (some free, some for sale) that you can use in your analyses and applications. It's not just about data dumps: increasingly you can get data through APIs, or even run your own code on servers provided by the data host.


Freebase

An icon of the open data movement, Freebase is a graph database full of "people, places and things". The data is contributed and edited by community volunteers. Freebase was recently acquired by Google.

Freebase both names real-world entities and stores structured data about their attributes and the relations between them. For example, see the page for the movie Harry Potter and the Deathly Hallows: Part I. It looks a bit like a Wikipedia page, but you can edit and retrieve the structured data for every page.

Developers have access to a variety of Freebase services, including dumps of the entire database and API access to the data. Of particular interest is "Acre", a hosted platform that lets you implement an application on Freebase servers, close to the data you need.

Screenshot from Freebase, showing activity in the most popular data sets

Amazon Public Data Sets

As a public service, Amazon Web Services hosts a variety of Public Data Sets available to users writing applications on its cloud services. By putting the data on servers next to its cloud computing platform EC2, Amazon helps avoid the difficulty of locating, cleaning and downloading data. The data never needs to travel: only your code. This is obviously valuable when data sets get particularly large, or are updated frequently.

Amazon's public data sets include annotated human genome data, a variety of data sets from the US Census Bureau, and dumps from services such as Freebase and Wikipedia.

Windows Azure Data Market

Launched publicly by Microsoft this year, the Azure Data Market offers a variety of data sets and sources, accessible by the OData protocol. OData offers uniform access to data, along with a standardized query interface. By using data from the market, a user can reduce the friction of parsing and importing data. Unsurprisingly, Microsoft's own tools such as Excel allow importing of data directly from the marketplace's OData endpoints.
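As a rough illustration of what a standardized query interface means in practice, the sketch below builds an OData query URL from the protocol's `$filter`, `$select` and `$top` query options. The service root, the `Games` entity set and its field names are hypothetical placeholders, not real Azure Data Market feeds, and the helper `build_odata_url` is an assumption made for this example.

```python
from urllib.parse import quote, urlencode


def build_odata_url(service_root, entity_set, *, filter_expr=None,
                    select=None, top=None):
    """Build a query URL using OData system query options.

    Only the URL is constructed; nothing is sent over the network.
    """
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr        # e.g. "Year eq 2010"
    if select:
        options["$select"] = ",".join(select)   # comma-separated field list
    if top is not None:
        options["$top"] = str(top)              # cap the number of rows
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Keep "$" and "," readable; percent-encode everything else.
        url += "?" + urlencode(options, quote_via=quote, safe="$,")
    return url


# Hypothetical feed: game-by-game results, five rows from 2010.
print(build_odata_url(
    "https://example.com/odata",
    "Games",
    filter_expr="Year eq 2010",
    select=["Team", "Runs"],
    top=5,
))
```

Because every feed answers the same query options, a client written once against this interface can, in principle, be pointed at a different data set by changing only the service root and entity set.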

Azure Data Market contains both free and for-pay data sets, offering a route to monetization for data publishers. Free data sets include government and international agency data. As an example of for-pay data, a sports data provider offers MLB game-by-game statistics through the marketplace.

The emergence of data marketplaces offers developers a legitimate route to data previously only obtainable at high cost, or through illicit web scraping.

Yahoo! Query Language (YQL)

YQL is a technology that presents web services in a way that lets familiar SQL-like queries be executed against them. SELECT, INSERT and DELETE operations can be performed against services such as Flickr.

In essence, YQL offers a technology similar to OData, providing an adapter layer that gives data consumers a uniform interface to data. Data providers must provide their data as an Open Data Table, or third parties can contribute adapter definitions, such as those for Foursquare, Github, and Google. The most limiting aspect of YQL is that queries must run through Yahoo's own servers.
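To give a feel for the SQL-like style, here is a small sketch that wraps a YQL statement in a request URL. The endpoint shown is assumed to be the public one Yahoo documented at the time, and the `flickr.photos.search` table and its `text` key are used purely for illustration. The helper `build_yql_url` is hypothetical and only builds the URL without contacting any server.

```python
from urllib.parse import urlencode

# Assumption: the public YQL endpoint as documented by Yahoo at the time.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(statement, fmt="json"):
    """Wrap a YQL statement in a GET URL: the statement travels in `q`."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


# A SELECT against a web service, written as if it were a database table.
statement = 'SELECT * FROM flickr.photos.search WHERE text="cat" LIMIT 5'
print(build_yql_url(statement))
```

The whole query is a single string handed to Yahoo's servers, which is the constraint noted above: the adapter layer lives on Yahoo's side, not in your application.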


Infochimps

Infochimps is another data marketplace and commons, founded by Strata speaker Flip Kromer.

Infochimps makes its data available either as downloadable data sets or via an API. For an example of commercial data available on Infochimps, check out the Twitter Census Conversation Metrics, which counts the occurrences of URLs, hashtags and smileys used over a year on Twitter.


DBpedia

A previous Strata Gem covered the use of Wikipedia as training data, but there's more than just free text content inside Wikipedia: many articles contain structured information. DBpedia is a community-led project to extract this structured information and make it available on the web.

DBpedia offers a variety of data sets, covering entities such as cities, countries, politicians, films and books. The data is available as dumps, queryable online or available as crawlable linked data in RDF format.
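"Queryable online" typically means a SPARQL endpoint. The sketch below prepares a request for a handful of films with English labels. It assumes the public endpoint at `https://dbpedia.org/sparql`, the `dbo:Film` class from the DBpedia ontology, and a `format` parameter accepted by that endpoint; the helper `build_sparql_url` is hypothetical and only builds the URL.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint.
SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

# Ask for five films and their English titles.
QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?title WHERE {
  ?film a dbo:Film ;
        rdfs:label ?title .
  FILTER (lang(?title) = "en")
}
LIMIT 5
"""


def build_sparql_url(query):
    """Encode a SPARQL query as a GET request URL (no network call here)."""
    params = {
        "query": query.strip(),
        # Assumption: the endpoint honours this result-format parameter.
        "format": "application/sparql-results+json",
    }
    return SPARQL_ENDPOINT + "?" + urlencode(params)


print(build_sparql_url(QUERY))
```

Fetching that URL with any HTTP client should return rows of film URIs and titles, the same structured facts that appear in the dumps, without downloading the full data set.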
