December 09 2011

Top Stories: December 5-9, 2011

Here's a look at the top stories published across O'Reilly sites this week.

The end of social
Mike Loukides: "If you want to tell me what you listen to, I care. But if sharing is nothing more than a social application feed that's constantly updated without your volition, then it's just another form of spam."

Why cloud services are a tempting target for attackers
Jeffrey Carr says before organizations embrace the efficiencies and cost savings of cloud services, they should also closely consider the security repercussions and liabilities attached to the cloud.

White House to open source Data.gov as open government data platform
The new "Data.gov in a box" could empower countries to build their own platforms. With this step forward, the prospects are brighter for stimulating economic activity, civic utility and accountability under a global open-government partnership.

Stickers as sensors
Put a GreenGoose sticker on an object, and just like that, you'll have an Internet-connected sensor. In this interview, GreenGoose founder Brian Krejcarek discusses stickers as sensors and the data that can be gathered from everyday activities.

What publishers can learn from Netflix's problems
Writer Tim Carmody examines the recent missteps of Netflix and takes a broad look at how technology shapes the reading experience.

Tools of Change for Publishing, being held February 13-15 in New York, is where the publishing and tech industries converge. Register to attend TOC 2012.

December 21 2010

The growing importance of data journalism

One of the themes from News Foo that continues to resonate with me is the importance of data journalism. That skillset has received renewed attention this winter after Tim Berners-Lee called analyzing data the future of journalism.

When you look at data journalism and the big picture, as USA Today's Anthony DeBarros did at his blog in November, it's clear the recent suite of technologies is part of a continuum of technologically enhanced storytelling that traces back to computer-assisted reporting (CAR).

As DeBarros pointed out, the message of CAR "was about finding stories and using simple tools to do it: spreadsheets, databases, maps, stats," like Microsoft Access, Excel, SPSS, and SQL Server. That's just as true today, even if data journalists now have powerful new tools like ScraperWiki and Needlebase for scraping data from the web, along with Perl, Ruby, Python, MySQL and Django for scripting and building applications.

Understanding the history of computer-assisted reporting is key to putting new tools in the proper context. "We use these tools to find and tell stories," DeBarros wrote. "We use them like we use a telephone. The story is still the thing."

The data journalism session at News Foo took place on the same day civic developers were participating in a global open data hackathon and the New York Times hosted its Times Open Hack Day. Many developers at contests like these are interested in working with open data, but the conversation at News Foo showed how much further government entities need to go to deliver on the promise open data holds for the future of journalism.

The issues that came up are significant. Government data is often "dirty," with missing metadata or incorrect fields. Journalists have to validate and clean up datasets with tools like Google Refine. ProPublica's Recovery Tracker for stimulus data and projects is one of the best examples of the practice in action.

A recent gold standard for data journalism is the Pulitzer Prize-winning Toxic Waters project from the New York Times. The scale of that project makes it a difficult act to follow, though Times developers are working hard on nifty projects like Inside Congress.

You can see a visualization of the Toxic Waters project and other examples of data journalism in this Ignite presentation from News Foo.

At ProPublica, the data journalism team is conscious of deep linking into news applications, with the perspective that the visualizations produced from such apps are themselves a form of narrative journalism. With great data visualizations, readers can find their own way and interrogate the data themselves. Moreover, distinctions between a news "story" and a news "app" are dissolving as readers increasingly consume media on mobile devices and tablets.

One approach to providing useful context is the "Ion" format at ProPublica, where a project like "Eye on the Stimulus" is a hybrid between a blog and an application. On one side of the web page, there's a news river. On the other, there are entry points into the data itself. The challenge to this approach is that a media outlet needs alignment between staff and story. A reporter has to be filing every day on a running story that's data sensitive.


The data journalism News Foo session featured a virtual component, bringing City Camp founder Kevin Curry, Data.gov evangelist Jeanne Holm, and Reynolds fellow David Herzog together with News Foo participants to talk about the value propositions for open government data and data journalism.

As the recent open data report showed, developers are not finding the government data they need or want. If other entrepreneurs are to follow the lead of BrightScope, open government datasets will need to be more relevant to business. The feedback for Data.gov and other government data repositories was clear: more data, better data, and cleaner data, please.

Improving media access to data at the county or state level of government faces structural barriers because of growing budget crises in statehouses around the United States. As Jeanne Holm observed during the News Foo session, open government initiatives will likely be done in a zero-sum budget environment in 2011. Officials have to make them sustainable and affordable.

There are some areas where the federal government can help. Holm said Data.gov has created cloud hosting that can be shared with state, local or tribal governments. Data.gov is also rolling out a set of tools that will help with data conversion, optical character recognition, and, down the road, better tools for structured data.

Those resources could make government data more readily available and accessible to the media. Kevin Curry said that data catalogs are popping up everywhere. He pointed to CivicApps in Portland, Ore., where Max Ogden's work on coding the middleware for open government led to translating government data into more useful forms for developers.

Data journalists also run into government's cultural challenges. It can be hard to find public information officers willing or able to address substantive questions about data. Holm said Data.gov may post more contact information online and create discussions around each dataset. That kind of information is a good start for addressing data concerns at the federal level, but fostering useful connections between journalists and data will still require improvement and effort.


August 26 2010

Earthquakes are HUGE on Data.gov

After launching just over a year ago with only 47 data sets, the catalog now has 2,326 entries that have been collectively downloaded almost three-quarters of a million times. Of course, even these sizable download counts understate the actual impact of this data, which is being embedded in a variety of sites and apps, like those being developed for the Health 2.0 Developer Challenge.

The big winner so far? The Department of the Interior's "Worldwide M1+ Earthquakes, Past 7 Days" data set. My guess is that there is some great app or visualization out there making daily use of this file -- if you know what it is, report it in the comments.

The top 10 downloads are:

  1. Worldwide M1+ Earthquakes, Past 7 Days. 122,888 downloads. Real-time, worldwide earthquake list for the past 7 days. Department of the Interior.

  2. Latest Volumes of Foreign Relations of the United States. 10,090 downloads. The feed for the latest ten volumes of the official historical documentary record of U.S. foreign policy in the Foreign Relations of the United States series. Department of State.

  3. U.S. Overseas Loans and Grants (Greenbook). 6,670 downloads. These data are U.S. Economic and Military Assistance by country from 1946 to the present. US Agency for International Development.

  4. Child-Related Product Recalls. 2,784 downloads. Lists recalls from CPSC, the agency charged with protecting the public from unreasonable risks of serious injury or death from thousands of types of consumer products. US Consumer Product Safety Commission.

  5. Airline On-Time Performance and Causes of Flight Delays. 2,716 downloads. On-time arrival data for non-stop domestic flights by major air carriers, as well as additional items, such as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance. Department of Transportation.

  6. 2005 Toxics Release Inventory data for American Samoa. 2,628 downloads. The Toxics Release Inventory (TRI) is a publicly available EPA database that contains information on toxic chemical releases and waste management activities reported annually by certain industries as well as federal facilities. Environmental Protection Agency.

  7. OSHA Data Initiative - Establishment Specific Injury and Illness Rates. 2,588 downloads. The data used by OSHA to calculate establishment-specific injury and illness incidence rates. Department of Labor.

  8. 2001 Federal Register in XML. 2,506 downloads. The official daily publication for rules, proposed rules, and notices of Federal agencies and organizations, as well as executive orders and other presidential documents. National Archives and Records Administration.

  9. 2007 National RCRA Hazardous Waste Biennial Report Data Files. 2,266 downloads. Data on the generation of hazardous waste from large-quantity generators and on waste management practices from treatment, storage, and disposal facilities. Environmental Protection Agency.

  10. Residential Energy Consumption Survey (RECS) Files, All Data, 2005. 2,000 downloads. Data on the use of energy in residential housing units including physical housing unit types, appliances utilized, demographics, fuels, and other energy-use information from the Residential Energy Consumption Survey (RECS), which is conducted every four years. Department of Energy.

Here's a breakdown of the contributions by agency:

Agency                                     Data sets contributed   Downloads
Environmental Protection Agency            474                     160,716
Department of Defense                      214                     44,837
Department of the Interior                 197                     157,273
Department of Commerce                     176                     37,430
Department of Health and Human Services    144                     43,697
Executive Office of the President          132                     7,569
Department of the Treasury                 93                      49,859
Department of Justice                      90                      16,392
Department of Energy                       86                      12,965
All remaining agencies                     740                     209,872
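
To put the agency breakdown in proportion, here is a minimal Python sketch that computes each agency's share of downloads and its average downloads per data set. The figures are copied from the breakdown above. The names AGENCIES, download_shares and downloads_per_data_set are made up for this illustration and are not part of any catalog API.

```python
# Figures copied from the agency breakdown: (data sets contributed, downloads).
AGENCIES = {
    "Environmental Protection Agency": (474, 160_716),
    "Department of Defense": (214, 44_837),
    "Department of the Interior": (197, 157_273),
    "Department of Commerce": (176, 37_430),
    "Department of Health and Human Services": (144, 43_697),
    "Executive Office of the President": (132, 7_569),
    "Department of the Treasury": (93, 49_859),
    "Department of Justice": (90, 16_392),
    "Department of Energy": (86, 12_965),
    "All remaining agencies": (740, 209_872),
}

# Sum of the downloads column across every row.
TOTAL_DOWNLOADS = sum(downloads for _, downloads in AGENCIES.values())


def download_shares(agencies):
    """Return each agency's fraction of all downloads in the table."""
    total = sum(downloads for _, downloads in agencies.values())
    return {name: downloads / total for name, (_, downloads) in agencies.items()}


def downloads_per_data_set(agencies):
    """Return the average number of downloads per contributed data set."""
    return {name: downloads / sets for name, (sets, downloads) in agencies.items()}


if __name__ == "__main__":
    shares = download_shares(AGENCIES)
    per_set = downloads_per_data_set(AGENCIES)
    print(f"Total downloads: {TOTAL_DOWNLOADS:,}")
    for name in sorted(AGENCIES, key=shares.get, reverse=True):
        print(f"{name}: {shares[name]:.1%} of downloads, "
              f"{per_set[name]:.0f} per data set")
```

Looking at downloads per data set rather than raw totals shows how much a single popular file, such as the earthquake feed, can lift one agency's numbers.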

Finally, here's a link to the catalog that includes the number of times the set has been downloaded. (If you're interested in how this was done, check out Use BeautifulSoup to parse over on O'Reilly Answers).
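
The BeautifulSoup write-up itself isn't reproduced here. As a rough sketch of the general idea, the snippet below pulls data set names and download counts out of an HTML table. It is not the original method: it swaps in Python's standard-library html.parser for BeautifulSoup so it runs without extra installs. SAMPLE_HTML and its table layout are assumptions for illustration, not the catalog's actual markup, and TableRows and parse_download_counts are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical markup: a two-column table of data set name and download count.
SAMPLE_HTML = """
<table>
  <tr><th>Data set</th><th>Downloads</th></tr>
  <tr><td>Worldwide M1+ Earthquakes, Past 7 Days</td><td>122,888</td></tr>
  <tr><td>Child-Related Product Recalls</td><td>2,784</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def parse_download_counts(html):
    """Map each data set name to its download count as an integer."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return {name: int(count.replace(",", "")) for name, count in parser.rows}


if __name__ == "__main__":
    for name, count in parse_download_counts(SAMPLE_HTML).items():
        print(f"{count:>8,}  {name}")
```

A real catalog page would need its own selectors and some politeness toward the server, such as caching pages and pausing between requests.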

Congrats to everyone at Data.gov for creating this incredible resource for developers-at-large.

Tags: data datagov gov2