
February 14 2012

The bond between data and journalism grows stronger

While reporters and editors have been the traditional vectors for information gathering and dissemination, the flattened information environment of 2012 now has news breaking first online, not on the newsdesk.

That doesn't mean that the integrated media organizations of today don't play a crucial role. Far from it. In the information age, journalists are needed more than ever to curate, verify, analyze and synthesize the wash of data.

To learn more about the shifting world of data journalism, I interviewed Liliana Bounegru (@bb_liliana), project coordinator of SYNC3 and Data Driven Journalism at the European Journalism Centre.

What's the difference between the data journalism of today and the computer-assisted reporting (CAR) of the past?

Liliana Bounegru: There is a "continuity and change" debate going on around the label "data journalism" and its relationship with previous journalistic practices that employ computational techniques to analyze datasets.

Some argue [PDF] that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense, data journalism pays equal attention to finding stories and to the data itself. Hence, we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.

Another difference is that in the past, investigative reporters would suffer from a poverty of information relating to a question they were trying to answer or an issue that they were trying to address. While this is, of course, still the case, there is also an overwhelming abundance of information that journalists don't necessarily know what to do with. They don't know how to get value out of data. As Philip Meyer recently wrote to me: "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important."

On the other hand, some argue that there is no difference between data journalism and computer-assisted reporting. It is by now common sense that even the most recent media practices have histories as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition but responding to new circumstances and conditions. Even if there might not be a difference in goals and techniques, the emergence of the label "data journalism" at the beginning of the century indicates a new phase wherein the sheer volume of data that is freely available online combined with sophisticated user-centric tools enables more people to work with more data more easily than ever before. Data journalism is about mass data literacy.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

What does data journalism mean for the future of journalism? Are there new business models here?

Liliana Bounegru: There are all kinds of interesting new business models emerging with data journalism. Media companies are becoming increasingly innovative with the way they produce revenues, moving away from subscription-based models and advertising to offering consultancy services, as in the case of the German award-winning OpenDataCity.

Digital technologies and the web are fundamentally changing the way we do journalism. Data journalism is one part in the ecosystem of tools and practices that have sprung up around data sites and services. Quoting and sharing source materials (structured data) is in the nature of the hyperlink structure of the web and in the way we are accustomed to navigating information today. By enabling anyone to drill down into data sources and find information that is relevant to them as individuals or to their community, as well as to do fact checking, data journalism provides a much needed service coming from a trustworthy source. Quoting and linking to data sources is specific to data journalism at the moment, but seamless integration of data in the fabric of media is increasingly the direction journalism is going in the future. As Tim Berners-Lee says, "data-driven journalism is the future".

What data-driven journalism initiatives have caught your attention?

Liliana Bounegru: The data journalism project FarmSubsidy.org is one of my favorites. It addresses a real problem: The European Union (EU) is spending 48% of its budget on agriculture subsidies, yet the money doesn't reach those who need it.

Tracking payments and recipients of agriculture subsidies from the European Union to all member states is a difficult task. The data is scattered in different places in different formats, with some missing and some scanned in from paper records. It is hard to piece it together to form a comprehensive picture of how funds are distributed. The project not only made the data available to anyone in an easy to understand way, but it also advocated for policy changes and better transparency laws.

LRA Crisis Tracker

Another of my favorite examples is the LRA Crisis Tracker, a real-time crisis mapping platform and data collection system. The tracker makes information about the attacks and movements of the Lord's Resistance Army (LRA) in Africa publicly available. Through an early-warning radio network, it helps inform local communities, and the organizations that support them, about LRA activity so they can respond to incidents more quickly.

I am also a big fan of much of the work done by the Guardian Datablog. You can find lots of other examples featured on datadrivenjournalism.net, along with interviews, case studies and tutorials.

I've talked to people like Chicago Tribune news app developer Brian Boyer about the emerging "newsroom stack." What do you feel are the key tools of the data journalist?

Liliana Bounegru: Experienced data journalists list spreadsheets as a top data journalism tool. Open source tools and web-based applications for data cleaning, analysis and visualization play very important roles in finding and presenting data stories. I have been involved in organizing several workshops on ScraperWiki and Google Refine for data collection and analysis. We found that participants were quickly able to ask and answer new kinds of questions with these tools.
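
Neither ScraperWiki nor Google Refine requires writing code to get started, but the collect-and-clean pattern behind those workshops is easy to sketch. Below is a minimal Python example, using only the standard library; the URL and the "amount" column are hypothetical placeholders, not material from the workshops themselves:

import csv
import io
import urllib.request

SOURCE_URL = "https://example.org/spending.csv"  # hypothetical source file

def fetch_rows(url):
    """Download a CSV and return its rows as dictionaries."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Normalise headers, strip stray whitespace and coerce amounts to numbers."""
    cleaned = []
    for row in rows:
        item = {key.strip().lower().replace(" ", "_"): (value or "").strip()
                for key, value in row.items()}
        if not any(item.values()):       # skip completely blank rows
            continue
        try:
            item["amount"] = float(item.get("amount", "").replace(",", ""))
        except ValueError:
            item["amount"] = None        # flag unparseable figures for review
        cleaned.append(item)
    return cleaned

rows = clean(fetch_rows(SOURCE_URL))
total = sum(r["amount"] for r in rows if r["amount"] is not None)
print(f"{len(rows)} rows, total amount: {total:,.2f}")

Even a small script like this captures the basic exercise: get the raw data into a structured, consistent form first, then start asking questions of it.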

How does data journalism relate to open data and open government?

Liliana Bounegru: Open government data means that more people can access and reuse official information published by government bodies. This in itself is not enough. It is increasingly important that journalists can keep up and are equipped with skills and resources to understand open government data. Journalists need to know what official data means, what it says and what it leaves out. They need to know what kind of picture is being presented of an issue.

Public bodies are very experienced in presenting data to the public in support of official policies and practices. Journalists, however, will often not have this level of literacy. Only by equipping journalists with the skills to use data more effectively can we break the current asymmetry, where our understanding of the information that matters is mediated by governments, companies and other experts. In a nutshell, open data advocates push for more data, and data journalists help the public to use, explore and evaluate it.

This interview has been edited and condensed for clarity.

Photo on associated home and category pages: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr.


October 21 2011

Top Stories: October 17-21, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Visualization deconstructed: Why animated geospatial data works
When you plot geographic data onto the scenery of a map and then create a shifting window into that scene through the sequence of time, you create a deep, data-driven story.

Jason Huggins' Angry Birds-playing Selenium robot
Jason Huggins explains how his Angry Birds-playing robot relates to the larger problems of mobile application testing and cloud-based infrastructure.

Data journalism and "Don Draper moments"
The Guardian's Alastair Dant discusses the organization's interactive stories, including its World Cup Twitter replay, along with the steps his team takes when starting a new data project.

Building books for platforms, from the ground up
Open Air Publishing's Jon Feldman says publishers aren't truly embracing digital. They're simply pushing out flat electronic versions of print books.


Open Question: What needs to happen for tablets to replace laptops?
What will it take for tablets to equal — or surpass — their laptop cousins? See specific wish lists and weigh in with your own thoughts.


Velocity Europe, being held November 8-9 in Berlin, brings together performance and site reliability experts who share the unique experiences that can only be gained by operating at scale. Save 20% on registration with the code RADAR20.

October 18 2011

Data journalism and "Don Draper moments"

Screenshot from The Guardian's World Cup Twitter replay

In a recent interview, Alastair Dant (@ajdant), lead interactive technologist for The Guardian News and Media, discussed some of the creative processes that go into The Guardian's data visualizations. Highlights from the interview include:

  • For last year's World Cup, The Guardian created a match replay that visualized each game based on its real-time Twitter data. Dant laughed when he described the Don Draper-like moment that started the project: "Seeing these pictures in my head of ... initially it was words, a bit like Wordle, but animated in real time, so that the actual 'roar of the crowd' could pass through Twitter." The project evolved from there, with the words morphing into the "balls" that you see in the published version. [Discussed at the beginning of the interview.]
  • How does The Guardian begin a new project? Dant said it's like a "Venn diagram with three circles: you have technology, you have data, and you have narrative. We generally start with one of those three things." In the case of the World Cup Twitter replay, the editorial people knew the narrative — the enthusiasm of the games. They also had a way to record the Tweets. "And it was then just a case of figuring out the tech to put all that together." [Discussed at 3:11.]
  • In the future, interactive news will settle on some of the "archetypes or standard types of interactive content we make again and again," Dant said. "In doing that, we're starting to refine our software so that it becomes easier to publish in a far shorter timeframe." In addition to having a better speed of response, Dant said there will be more access to big data as well as more interesting interfaces. [Discussed at 4:58.]

The full interview is available in the following video:



September 26 2011

For local news, TV is dominant but the Internet is our digital future

The days of relying on a print newspaper and a television anchor telling us "the way it is" are long gone. In 2011, Americans and citizens the world over consume news on multiple screens and platforms. Increasingly, we all contribute reports ourselves, using Internet-connected smartphones.

A new report on local news by the Pew Research Center's Project for Excellence in Journalism and the Pew Internet & American Life Project provides reason to be hopeful about new information platforms. But the report also reveals deep concern about the decay of local newspapers, and what that will mean for local government accountability.

"Research in the past about how people get information about their communities tended to focus on a single question: 'Where do you go most often to get local news?'," noted Tom Rosenstiel, Director of the Pew Excellence in Journalism project and co-author of the new report, in a prepared statement. "This research asked about 16 different local topics and found a much more complex ecosystem in which people rely on different platforms for different topics. It turns out that each piece of the local information system has special roles to play. Our research sorted that out and we found that for some things TV matters most, for others newspapers and their websites are primary sources, and the Internet is used for still other topics."

Specifically, the report found that Americans rely on local TV for information about popular local topics, including weather (89% use TV for this information), breaking news (80%), local politics (67%) and crime (66%). Americans use newspapers for breadth and depth of many more topics, particularly with respect to local government information. Newspapers supply "broccoli journalism" about the least popular topics, including zoning and development information (30%), local social services (35%), job openings (39%) and local government activities (42%). These are topics that other local news institutions don't often deliver.

The role of the Internet grows

In the latest confirmation of the Internet's growing role in modern life, we're increasingly going online to gather information about specific local services; to search for information about education, restaurants, and business news; and to log onto social media and mobile devices to find and share what we learn ourselves.

"The rise of search engines and specialty websites for different topics like weather, job postings, businesses, and even e-government have fractured and enriched the local news and information environment," said Lee Rainie, director of the Pew Internet Project and another report co-author, in a prepared statement.

Nearly half of adults (47%) now use mobile devices to get local news and information. The proliferation of smartphones, iPad apps and new platforms offers insight into a rapidly expanding mobile future.

"We don't yet know exactly how important mobile apps will be, but it's pretty easy to sketch out a scenario where they rise in importance, especially when it comes to breaking news, weather, traffic, local politics and some of the more popular local topics," said Rainie in an interview.

The Internet has become a key source for peer-generated information. In fact, the survey showed that among adults under age 40, the Internet rivals or exceeds other platforms in every topic area save one: breaking local news. According to the study, the Internet has now become American adults' key source for five broad areas of information:

  • Restaurants, clubs and bars.
  • Local businesses.
  • Local schools.
  • Local jobs.
  • Local housing and real estate.

The websites of local newspapers and TV stations aren't faring well, in terms of how the respondents rated their importance as a local news source. "Local TV news websites barely registered," reads the report, with less than 6% of those surveyed indicating that they depended on a legacy media organization's website for local news.  

One clear finding from this report is that social media currently plays a small role in providing local information that citizens say they rely upon, with 18% using Facebook and 2% turning to Twitter. "Social media look more like a supplemental source of information on these local topics than a primary, deeply-relied-upon source," said Rainie, in an interview. "That's not too surprising to me. Local information is just one of the many things that people discuss and share on SMS and Twitter." 

While the report showed that citizens don't rely on social media for local news, they are definitely discussing it there. "Participatory news" is a full-blown phenomenon: 41% of respondents can be considered "participators" who publish information online. That said, such information is frequently about restaurants and community events, versus harder news.

Web 2.0 Expo New York 2011, being held Oct. 10-13, showcases the latest Web 2.0 business models, development tools and design strategies for the builders of the next-generation web.

Save 20% on registration with code WEBNY11RAD

A digital generation gap

The question of what these trends mean for all levels of society is also critical to ask. "People under age of 40 are a lot more likely than those over 40 to use the Internet on a host of the topics we probed," said Rainie in our interview. "The gap is quite striking across a number of topics. As this younger cohort ages, it will probably expect legacy news organizations like newspapers, TV, and radio, to have an even more robust online presence. And they are likely to want to be able to contribute to news and easily share news with others via social media."

A generation gap could have profound implications for how informed citizens can be about their communities in the future, based upon their consumption habits and the availability of information.

"There is a disconnect in the public mind about newspapers, and that raises an important question about community information needs," observed Rainie. "As we said in the report, 'If your local newspaper no longer existed, would that have a major impact, a minor impact, or no impact on your ability to keep up with information and news about your local community?'. A large majority of Americans, 69%, believe the death of their local newspaper would have no impact (39%) or only a minor impact (30%) on their ability to get local information. Yet, newspapers are the leading source that people rely on to get information about most of the civic topics on our list. So, if a local newspaper did vanish, it is not entirely clear which parts of the ecosystem would address those needs. Newspapers are deeply enmeshed in the local information system in ways that are pretty important to democracy. That's why the economic struggles of newspapers matter."

Veterans of local news operations know that reality well. "This is something that I faced way back when I was at The Molokai Times in Hawaii," commented Kate Gardiner, responding to a question on Facebook. Gardiner is a new media strategist who works with Al Jazeera, Lauch and the Poynter Institute.

"We built a very robust online community to complement the hard copy and were experimenting with ways to make things even better for everyone — until the bottom dropped out and our major advertiser went bankrupt. The whole newspaper died. The community (about 5,000 people) was left with no alternative means of consuming news. Our competition sort of stepped up, but they weren't doing straight news. It's a problem on any number of levels — and there's no really obvious way to do a community-funded replacement, online or off.

Given that context, do these findings add additional urgency to funding and creating new models for information aggregation and distribution online?

"There aren't clear indications in our survey that speak to this question," replied Rainie. "People say now it's easier than in the past to get the local information they need, so we are not getting a signal in the questions about people thinking that data is hard to find."

An uncertain future for local government information

As print fades and a digital future for news becomes more equally distributed, establishing sustainable local online information hubs to meet the information needs of our democracy will grow in importance, along with the means to connect those news sources to communities on the other side of the digital or data divide. Simply put, there's an increasing need for local government news to be generated from civic media, libraries, schools, institutions and private industry. New platforms for social networking and sharing still need to be supplied with accurate information.

It's not clear if local governments, already stretched to provide essential services, will be able to become robust information providers. That said, new lightweight tools and platforms are enabling ambitious towns to go through "Gov 2.0 city makeovers."

For now, citizens are not relying on local government to be their primary information providers. According to the report:

... 3% of adults said that they rely on their local government (including both local government websites or visiting offices directly) as the main source of information for both taxes and for local social services, and even fewer cite their local government as a key source for other topics such as community events, zoning and development, and even local government activity.

The results of the survey leave us with significant questions and, unfortunately, few answers about the future of news in rural areas and towns. While local TV stations can focus on their profit centers — weather, breaking news, crime, traffic — it's going to be tough for local papers to monetize the less popular but important coverage of civic affairs.

"Newspapers are not struggling in the information-dissemination part of their business," said Rainie in our interview. "Indeed, other research shows many newspapers have a bigger audience than ever if you combine the print and web operations." (Research data on the state of the media in 2011 may not fully support that contention.)

"But if newspapers cut back on coverage of local government because it is expensive and doesn't pay for itself with lots of advertising, then local government information will be harder to come by," said Rainie.

What these news consumption trends mean for local governments, in terms of getting information to citizens when and where they need it, is more difficult to judge.

"The bigger issue that others have raised — notably by Steve Waldman at the FCC in his report [on the Information Needs of Communities"] — is who covers city hall and the school board and the zoning board to help make local institutions accountable? Our report raises that question without answering it," said Rainie. "If newspapers vanished, would TV stations or bloggers cover the bread-and-butter workings of local government, or do the kind of investigative pieces that newspapers have specialized in? We don't know and can't predict from these data. But it's an important question."

Prior to Waldman's report, Steve Coll published an article in the Columbia Journalism Review that provided a thoughtful series of recommendations to reboot the news.

Of the suggestions in the FCC report Rainie mentioned, perhaps the most important to the technology community was the recommendation to put more proceedings, documents and data online: "Governments at all levels should put far more data and information online, and do it in ways that are designed to be most useful," suggested the FCC in its report. "Entrepreneurs can create new businesses and jobs based on distributing, shaping or analyzing this data. It will enable reporters to unearth stories in a day or two that might have previously taken two months."

Notably, the Federal Trade Commission also has recommended publishing public data online to support the future of journalism.

There is no shortage of creative ideas for the digital future of journalism, as evidenced by the conversations and new projects generated by the dynamic community that came together last weekend in Boston at the Online News Association's annual conference. The challenge is that many of them supply information to digitally literate news consumers with smartphones and broadband connections, not the poor, undereducated or disconnected. If local newspapers go away and local government information all goes digital, with primary access through mobile devices, what will it mean for the 21% of Americans still offline? In addition, will being poor mean being uninformed and disconnected from local civic life?

"In our data, people who are less well off are less connected," said Rainie. "That makes it harder for them to use new tools for civic activism and to gather information easily and on-the-fly. "

Closing the civic gap

As citizens turn to the Internet for government information, government entities have to respond on some level. At the local level, however, resources are scarce. Local TV news is unlikely to fill the gap left by local newspapers. The economics and the medium don't support using limited time to cover topics that aren't popular, as the report discusses:

Past PEJ studies have found that local newspapers typically have 70 to 100 stories a day. The typical half-hour local TV newscast is closer to 15. So it is logical that newspapers would offer coverage of more topics in a community, while television might concentrate on a more limited number that attract the widest audience.

"Local government is one coverage area that will suffer immensely if daily newspapers go under," commented Owen Covington, a reporter for The Triad Business Journal in North Carolina, in response to ">my question on Google+. "It can be mundane, but is necessary, and time-consuming to produce. Daily newspapers cover local government as a matter of course, while much of the online coverage from other sources is sporadic, and often opinionated and lacking depth. I'm not saying there aren't alternatives that do as good or better a job than the daily print editions, but they are still rather rare and absent in most communities now served by dailies."

The Pew report found that citizen-produced information (e.g. newsletters or listservs), commercial websites and newspapers all outweighed local government as news sources that readers relied upon. In that context, the work of e-democracy.org and other civic media platforms will be critical.

There are a growing number of free or inexpensive web-based tools available to city managers, including a growing repository of open source civic software at Civic Commons. Another direction lies in the use of local wikis to connect communities. Libraries will be important hubs for rural communities and will be a core element of bridging the digital divide in under-connected communities. Listservs will play a role in connecting citizens using the Internet's original killer app, email. Platforms for participatory budgeting may be integrated into hubs in municipalities that have a tolerance for ceding more power of the purse directly to citizens.

"I would suggest that many of the citizen-powered information systems will not look like a newspaper website," commented Jeff Sonderman, a digital media fellow at The Poynter Institute for Media Studies, when asked for his opinion, fittingly, in a Facebook group on social journalism. It's "more likely to be message boards, Facebook groups or email listservs."

Many forward-thinking local governments will provide the means for citizens to obtain information by using the most common electronic device: a cellphone. Arkansas, for instance, has added question and answer functionality to mobile apps for citizens using text messaging.

Should small cities or towns invest in citizen engagement? The government-as-a-platform approach looks to nonprofits, civic coders, educators, media, concerned citizens and commercial interests to fill that gap, building upon the core web services and data governments can provide. An essay on newspapers and government 2.0 published earlier this year by Pete Peterson, a professor at Pepperdine University, explored the potential for media and local government to collaborate on citizen engagement:

The increasing use of these tools by local and state governments has created a niche within the burgeoning "Gov 2.0 field," which now covers enterprises from participatory policy making to 311-systems. Although newspapers have been slower to employ these online engagement platforms, several interesting initiatives launched by newspapers from the San Francisco Chronicle and its water shortage game to the Washington Post's city budget balancing tool indicate that news organizations are beginning to take the lead in online public participation. This can be seen as both good and bad.

On the positive side, these tools are interactive, allowing a new and participatory form of learning for participants. Matched with the popularity of online games in general, these online civic engagement platforms can create a real "win-win" for both news organizations and users alike — informing readers and driving precious online traffic to newspaper websites.

To date, however, that kind of cooperation doesn't appear to be gathering much momentum as a complement to the press looking for fraud, corruption or scandal. And, as Peterson noted, there are other challenges for the media:

The way to build the most effective online engagement platforms is for news organizations and local governments to collaborate from their strengths: newspapers bringing their informed readership and marketing skills, working with a municipality's budget and policy experts. Of course, these relationships demand both transparency and a lack of bias — qualities neither party is known for. But — and this may be hardest of all — these tools also need citizens who are both engaged on local issues and humble about the challenges of forming public policy.

The growth of a new digital news ecosystem populated by civic media, an evolving civic stack, and data journalism will generate some answers to these questions, but it won't address all of the outstanding issues.

Local news readers write in

When I asked for feedback from readers on Twitter, Google+ and Facebook, I received a wide range of responses, some expressing serious worries about the future of local journalism and a few that were hopeful about the potential for technology to help citizens inform one another. A number of people discussed new public and private ventures, including Patch.com, AOL's initiative to fill the gap in local news, and NPR's Project Argo, which is experimenting with regional news coverage through public radio.

"I expect people will come together in groups or neighborhoods, and things will be more fluid," commented David Johnson, a journalism professor at American University, in response to my question on Facebook. "I don't foresee commercially supported news and information on the local level until there is a valuable platform for advertising and exchange of various levels of services. Perhaps associations will fill the void."

The "loss of local newspapers, dailies or weeklies, is not a new concern, and a concern in metropolitan as well as more rural cities and towns," commented Robert Petersen, a software developer, in response to my question on Google+. Petersen continued:

Many years ago the early trend in smaller markets was loss of local ownership of both print and broadcast news sources, an event that leads to a focus on financial performance first, rather than the financial success that follows from producing a quality product. More recently the advertising dollars necessary to sustain local journalism have tended to flow away from local journalism outlets to the additional delivery mechanisms, including "shoppers" (those go way back), electronic (direct email, blogs, coupon or deal sites, shopping help sites with reviews and price comparisons, etc.), movement of advertising to regional radio and TV, not to mention the loss of local sales to online merchants.

This shift away from local journalism can also be seen in the journalism schools, where students are much more interested in journalism with perceived better financial prospects. That there can be substantial non-financial benefits to living in small cities and towns, i.e., quality of life, seems of less importance.

I fear losing the judgment, ethics, and dedication of small-town journalists will lead to a slow deterioration of the quality of local government, a reduction in the quality of life due to a lack of balanced reporting (as well as editorials) of local issues, and in too many places a return to the civic leaders in the "smoke-filled room" making decisions for the uninformed.

Jeanne Holm, Data.gov's open data evangelist, shared her community's hybrid news reality in a reply to my question on Google+:

In my small town in Southern California, we still support a local paper, but the frequency has changed. We supplement with online news from City Hall, and most importantly we use social media — a lot. We have fires and floods in our area, and everyone connects on Facebook and via our emergency website to get people organized, supplies where needed, and our firefighters the support they need. It works really well. The local reporters often lead those social media conversations. They are reporting, but just in multiple modalities and in ways that make sense to the situation.

If states are the laboratories for democracy, towns and cities may be the Petri dishes that stress test the vitality of different species of online hubs. The ones that will stick around will have met the information needs of citizens better than the alternatives — or they'll have found sustainable business models. In an ideal world, they'll have both.

Appropriately, the conversation around the Pew report continues on a variety of online forums. If you have any thoughts on what's next, please feel free to share them in the comments here or on Google+ and Facebook, and via the #localnews hashtag on Twitter.

Photo: 03.Newspapers.SW.WDC.22dec05 by ElvertBarnes, on Flickr


September 20 2011

BuzzData: Come for the data, stay for the community

As the data deluge created by the activities of global industries accelerates, the need for decision makers to find a signal in the noise will only grow more important. Therein lies the promise of data science, from data visualizations to dashboards to predictive algorithms that filter the exaflood and produce meaning for those who need it most. Data consumers and data producers, however, are both challenged by "dirty data" and limited access to the expertise and insight they need. To put it another way, as Alistair Croll has observed here at Radar, if you can't derive value, there's no such thing as big data.

BuzzData, based in Toronto, Canada, is one of several startups looking to help bridge that gap. BuzzData launched this spring with a combination of online community and social networking that is reminiscent of what GitHub provides for code. The thinking here is that every dataset will have a community of interest around the topic it describes, no matter how niche it might be. Once uploaded, each dataset has tabs for tracking versions, visualizations, related articles, attachments and comments. BuzzData users can "follow" datasets, just as they would a user on Twitter or a page on Facebook.
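
To make the GitHub comparison concrete, here is a purely illustrative Python sketch of that kind of dataset-centric model. The class and field names are invented for this example and are not BuzzData's actual schema or API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetVersion:
    number: int
    description: str              # e.g. "fixed column headers", "added 2011 figures"

@dataclass
class Dataset:
    topic: str
    owner: str
    versions: List[DatasetVersion] = field(default_factory=list)
    followers: List[str] = field(default_factory=list)    # like following a user on Twitter
    attachments: List[str] = field(default_factory=list)  # articles, visualizations, notes

    def follow(self, username: str) -> None:
        if username not in self.followers:
            self.followers.append(username)

    def fork(self, new_owner: str) -> "Dataset":
        # Copy the dataset so another user can evolve it independently.
        return Dataset(topic=self.topic, owner=new_owner, versions=list(self.versions))

# Example: a reporter follows a city budget dataset; a civic hacker forks it and adds a version.
budget = Dataset(topic="City budget 2011", owner="city_hall")
budget.follow("data_reporter")
fork = budget.fork("civic_hacker")
fork.versions.append(DatasetVersion(number=2, description="normalised department names"))

The point of the analogy is that the dataset itself, rather than any single article about it, becomes the thing people follow, fork and improve.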

"User experience is key to building a community around data, and that's what BuzzData seems to be set on doing," said Marshall Kirkpatrick, lead writer at ReadWriteWeb, in an interview. "Right now it's a little rough around the edges to use, but it's very pretty, and that's going to open a lot of doors. Hopefully a lot of creative minds will walk through those doors and do things with the data they find there that no single person would have thought of or been capable of doing on their own."

The value proposition that BuzzData offers will depend upon many more users showing up and engaging with one another and, most importantly, the data itself. For now, the site remains in limited beta with hundreds of users, including at least one government entity, the City of Vancouver.

"Right now, people email an Excel spreadsheet around or spend time clobbering a shared file on a network," said Mark Opauszky, the startup's CEO, in an interview late this summer. "Our behind-the-scenes energy is focused on interfaces so that you can talk through BuzzData instead. We're working to bring the same powerful tools that programmers have for source code into the world of data. Ultimately, you're not adding and removing lines of code — you're adding and removing columns of data."

Opauszky said that BuzzData is actively talking with data publishers about the potential of the platform: "What BuzzData will ultimately offer when we move beyond a minimum viable product is for organizations to have their own territory in that data. There is a 'brandability' to that option. We've found it very easy to make this case to corporations, as they're already spending dollars, usually on social networks, to try to understand this."

That corporate constituency may well be where BuzzData finds its business model, though the executive team was careful to caution that they're remaining flexible. It's "absolutely a freemium model," said Opauszky. "It's a fundamentally free system, but people can pay a nominal fee on an individual basis for some enhanced features — primarily the ability to privatize data projects, which by default are open. Once in a while, people will find that they're on to something and want a smaller context. They may want to share files, commercialize a data product, or want to designate where data is stored geographically."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30


Open data communities

"We're starting to see analysis happen, where people tell 'data stories' that are evolving in ways they didn't necessarily expect when they posted data on BuzzData," said Opauszky. "Once data is uploaded, we see people use it, fork it, and evolve data stories in all sorts of directions that the original data publishers didn't perceive."

For instance, a dataset of open data hubs worldwide has attracted a community that improved the original upload considerably. BuzzData featured the work of James McKinney, a civic hacker from Montreal, Canada, in making it so. A Google Map mashing up locations is embedded below:


The hope is that communities of developers, policy wonks, media, and designers will self-aggregate around datasets on the site and collectively improve them. Hints of that future are already present, as open government advocate David Eaves highlighted in his post on open source data journalism at BuzzData. As Eaves pointed out, it isn't just media companies that should be paying attention to the trends around open data journalism:

For years I argued that governments — and especially politicians — interested in open data have an unhealthy appetite for applications. They like the idea of sexy apps on smart phones enabling citizens to do cool things. To be clear, I think apps are cool, too. I hope in cities and jurisdictions with open data we see more of them. But open data isn't just about apps. It's about the analysis.

Imagine a city's budget up on BuzzData. Imagine the flow rates of the water or sewage system. Or the inventory of trees. Think of how a community of interested and engaged "followers" could supplement that data, analyze it, and visualize it. Maybe they would be able to explain it to others better, to find savings or potential problems, or develop new forms of risk assessment.

Open data journalism

"It's an interesting service that's cutting down barriers to open data crunching," said Craig Saila, director of digital products at the Globe and Mail, Canada's national newspaper, in an interview. He said that the Globe and Mail has started to open up the data that it's collecting, like forest fire data, at the Globe and Mail BuzzData account.

"We're a traditional paper with a strong digital component that will be a huge driver in the future," said Saila. "We're putting data out there and letting our audiences play with it. The licensing provides us with a neutral source that we can use to share data. We're working with data suppliers to release the data that we have or are collecting, exposing the Globe's journalism to more people. In a lot of ways, it's beneficial to the Globe to share census information, press releases and statistics."

The Globe and Mail is not, however, hosting any information there that's sensitive. "In terms of confidential information, I'm not sure if we're ready as a news organization to put that in the cloud," said Saila. "We're just starting to explore open data as a thing to share, following the Guardian model."

Saila said that he's found the private collaboration model useful. "We're working on a big data project where we need to combine all of the sources, and we're trying to munge them all together in a safe place," he said. "It's a great space for journalists to connect and normalize public data."

The BuzzData team emphasized that they're not trying to be another data marketplace, like Infochimps, or replace Excel. "We made an early decision not to reinvent the wheel," said Opauszky, "but instead to try to be a water cooler, in the same way that people go to Vimeo to share their work. People don't go to Flickr to edit photos or YouTube to edit videos. The value is to be the connective tissue of what's happening."

If that question about "what's happening?" sounds familiar to Twitter users, it's because that kind of stream is part of BuzzData's vision for the future of open data communities.

"One of the things that will become more apparent is that everything in the interface is real time," said Opauszky. "We think that topics will ultimately become one of the most popular features on the site. People will come from the Guardian or the Economist for the data and stay for the conversation. Those topics are hives for peers and collaborators. We think that BuzzData can provide an even 'closer to the feed' source of information for people's interests, similar to the way that journalists monitor feeds in Tweetdeck."


September 16 2011

Top Stories: September 12-16, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Building data science teams
A data science team needs people with the right skills and perspectives, and it also requires strong tools, processes, and interaction between the team and the rest of the company.

The evolution of data products
The real changes in our lives will come from products that have the richness of data without calling attention to the data.

The work of data journalism: Find, clean, analyze, create ... repeat
Simon Rogers discusses the grunt work and tools behind The Guardian's data stories.

Social data: A better way to track TV
PeopleBrowsr CEO Jodee Rich says social data offers a better way to see what TV audiences watch and what they care about.


When media rebooted, it brought marketing with it
In this TOC podcast, Twist Image president Mitch Joel talks about some of the common challenges facing the music, magazine and book publishing sectors.




Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. Save 30% on registration with the code ORM30.

September 15 2011

The work of data journalism: Find, clean, analyze, create ... repeat

Data journalism has rounded an important corner: The discussion is no longer if it should be done, but rather how journalists can find and extract stories from datasets.

Of course, a dedicated focus on the "how" doesn't guarantee execution. Stories don't magically float out of spreadsheets, and data rarely arrives in a pristine form. Data journalism — like all journalism — requires a lot of grunt work.

With that in mind, I got in touch with Simon Rogers, editor of The Guardian's Datablog and a speaker at next week's Strata Summit, to discuss the nuts and bolts of data journalism. The Guardian has been at the forefront of data-driven storytelling, so its process warrants attention — and perhaps even full-fledged duplication.

Our interview follows.

What's involved in creating a data-centric story?

Simon Rogers: It's really 90% perspiration. There's a whole process to making the data work and getting to a position where you can get stories out of it. It goes like this:

  • We locate the data or receive it from a variety of sources — from breaking news stories, government data, journalists' research and so on.
  • We then start looking at what we can do with the data. Do we need to mash it up with another dataset? How can we show changes over time?
  • Spreadsheets often have to be seriously tidied up — all those extraneous columns and weirdly merged cells really don't help. And that's assuming it's not a PDF, the worst format for data known to humankind. (A rough sketch of this tidy-then-calculate step follows the list.)
  • Now we're getting there. Next up we can actually start to perform the calculations that will tell us if there's a story or not.
  • At the end of that process is the output. Will it be a story or a graphic or a visualisation? What tools will we use?
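
As a rough sketch of the tidy-up and calculation steps above, here is what they might look like in Python with pandas. The file name, column names and years are hypothetical, not The Guardian's actual data:

import pandas as pd

# Hypothetical export from a spreadsheet; the column names are placeholders.
df = pd.read_csv("spending_by_region.csv")

# Tidy up: drop extraneous columns and repair the blanks left by merged cells.
df = df.drop(columns=["Notes", "Unnamed: 5"], errors="ignore")
df["region"] = df["region"].ffill()          # merged cells arrive as blanks below the first row
df = df.dropna(subset=["year", "spending"])  # discard rows with no usable figures

# Calculate: is there a story? Here, the change in spending per region
# (assumes the year column holds the integers 2010 and 2011).
summary = (
    df.pivot_table(index="region", columns="year", values="spending", aggfunc="sum")
      .assign(change=lambda t: (t[2011] - t[2010]) / t[2010] * 100)
      .sort_values("change")
)
print(summary.round(1))

The cleaning lines are rarely glamorous, but that is usually where most of the 90% perspiration goes.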

We've actually produced a graphic (of how we make graphics) that shows the process we go through:

Guardian data journalism process
Partial screenshot of "Data journalism broken down."

What is the most common mistake data journalists make?

Simon Rogers: There's a tendency to spend months fiddling around with things that are only mildly diverting. It's so easy to get sidetracked into statistical curiosities rather than telling stories that really matter. It's much more important to strive to create amazing work that will be remembered. You won't always succeed, but you will get closer.

Does data journalism require a team, or is it possible for one person to do all the work?

Simon Rogers: You can go solo. I set up the Datastore and ran it for more than a year on my own. But having a team you can call on is very useful. We have access to people who can scrape sites, people who can work with databases, and graphic designers who can make the results look beautiful. We also work with people out there in the world, bringing their expertise into what we do. With the web, you never have to operate on your own.

Strata Summit New York 2011, being held Sept. 20-21, is for executives, entrepreneurs, and decision-makers looking to harness data. Hear from the pioneers who are succeeding with data-driven strategies, and discover the data opportunities that lie ahead.

Save 30% on registration with the code ORM30

Are the data-driven stories you create updatable?

Simon Rogers: It's a constant issue. The clever thing is to try to make stuff either incredibly easy to update or something that happens without having to think too much about it. We aren't quite there yet, but we're working on it.

What data tools do you use?

Simon Rogers: It's a very personal thing, that. For us it includes: Excel, TextEdit (it's amazing how many times you just need to work on code or formulas without formatting), Google Fusion Tables, Google Spreadsheets, Timetric, Many Eyes, Adobe Illustrator, and Tableau.

This interview was edited and condensed.


August 26 2011

Social, mapping and mobile data tell the story of Hurricane Irene

As Hurricane Irene bears down on the East Coast, millions of people are bracing for the impact of what could be a multi-billion-dollar disaster.

We've been through hurricanes before. What's different about this one is the unprecedented level of connectivity that now exists up and down the East Coast. According to the most recent numbers from the Pew Internet & American Life Project, for the first time more than 50% of American adults use social networks, 35% of American adults have smartphones, and 78% of American adults are connected to the Internet. Combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. The growth of an Internet of things is an important evolution. What we're seeing this weekend is the importance of an Internet of people.

As citizens look for hurricane information online, government websites are under high demand. In this information ecosystem, media, government and citizens alike will play a critical role in sharing information about what's happening and providing help to one another. The federal government is providing information on Hurricane Irene at Hurricanes.gov and sharing news and advisories in real-time on the radio, television, mobile devices and online using social media channels like @fema. As the storm comes in, FEMA recommends ready.gov for desktops.

Over the next 72 hours, a networked public can share the storm's effects in real time, giving city, state and federal officials unprecedented insight into what's happening. Citizens will be acting as sensors in the midst of the storm, creating an ad hoc system of networked accountability through data. There are already efforts underway to organize and collect the crisis data that citizens are generating, along with efforts to put the open data that city and state governments have released to work.

Following are just a few examples of how data is playing a role in hurricane response and reporting.

Open data in the Big Apple

The city of New York is squarely in the path of Hurricane Irene and has initiated mandatory evacuations from low-lying areas. The NYC Mayor's Office has been providing frequent updates to New Yorkers as the hurricane approaches, including links to an evacuation map, embedded below:

NYC Hurricane Evacuation Map

The city provides public hurricane evacuation data on the NYC DataMine, where geographic data for the NYC Hurricane Evacuation Zones and Hurricane Evacuation Centers is publicly available. To find and use this open data, search for "Data by Agency" and select "Office of Emergency Management (OEM)." Developers can also download Google Earth KMZ files for the Hurricane Evacuation Zones. If you have any trouble accessing these files, civic technologist Philip Ashlock is mirroring NYC Irene data and links on Amazon Web Services (AWS).
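
As a small illustration of what developers can do with those files, the sketch below uses the Python standard library to read a KMZ archive (a zipped KML document) and list its placemarks. The file name is a placeholder, and the only structure assumed is ordinary KML 2.2, not anything specific to the DataMine export:

import zipfile
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"   # standard KML 2.2 namespace

def placemark_names(kmz_path):
    """List the named placemarks (e.g. evacuation centers) in a KMZ file."""
    with zipfile.ZipFile(kmz_path) as kmz:
        # A KMZ archive conventionally contains a .kml document at its root.
        kml_file = next(name for name in kmz.namelist() if name.lower().endswith(".kml"))
        root = ET.fromstring(kmz.read(kml_file))
    names = []
    for placemark in root.iter(KML_NS + "Placemark"):
        names.append(placemark.findtext(KML_NS + "name", default="(unnamed)"))
    return names

for name in placemark_names("hurricane_evacuation_zones.kmz"):   # placeholder file name
    print(name)

From there, the coordinates inside each placemark can be fed into whatever mapping library a newsroom already uses.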

"This data is already being used to power a range of hurricane evacuation zone maps completely independent of the City of New York, including at WNYC.org and the New York Times," said Rachel Sterne, chief digital officer of New York City. "As always, we support and encourage developers to develop civic applications using public data."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

Partnering with citizens in Maryland

"We're partnering with SeeClickFix to collect reports from citizens about the effects from Irene to help first responders," said Bryan Sivak, Maryland's chief innovation officer, in a phone interview. The state has invited its citizens to share and view hurricane data throughout the state.

"This is interesting from a state perspective because there are very few things that we are responsible for or have the ability to fix. Any tree branches or wires that go down will be fixed by a local town or a utility. The whole purpose is to give our first responders another channel. We're operating under the perspective that more information is better information. By having more eyes and ears out there reporting data, we can make better informed decisions from an emergency management perspective. We just want to stress that this is a channel for communication, as opposed to a way to get something fixed. If this channel is useful in terms of managing the situation, we'll work with local governments in the future to see if it can help them. "


SeeClickFix has been working on enabling government to use citizens as public sensors since its founding. We'll see if they can help Maryland with Hurricane Irene this weekend.

[Disclosure: O'Reilly AlphaTech Ventures is an investor in SeeClickFix.]

The best hurricane tracker ever?

In the face of the storm, the New York Times has given New Yorkers one of the best examples of data journalism I've seen to date, a hurricane tracker that puts open data from the National Weather Service to beautiful use.

If you want a virtuoso human curation of the storm, New York Times reporter Brian Stelter is down in the Carolinas and reporting live via Twitter.

Crisismapping the hurricane

A Hurricane Irene crisis map is already online, where volunteers have stood up an instance of Ushahidi.

Mashing up social and geospatial data

ESRI has also posted a mashup that combines video and tweets onto an interactive map.

The Florida Division of Emergency Management is maintaining FloridaDisaster.org, with support from DHS Science and Technology, mashing up curated Twitter accounts. You can download live shape files of tweeters and KML files to use if you wish.

Google adds data layers

There is also a wealth of GIS and weather data feeds powering Google.org's Hurricane Season mashup:

Screenshot of Google.org's Hurricane Season mashup

If you have more data stories or sources from Hurricane Irene, please let me know at alex@oreilly.com or on Twitter at @digiphile. If you're safe, dry and connected, you can also help Crisis Commons by contributing to the Hurricane Irene wiki.

Top Stories: August 22-26, 2011

Here's a look at the top stories published across O'Reilly sites this week.


Ruminations on the legacy of Steve Jobs
Apple, under Steve Jobs, has always had an unrelenting zeal to bring humanity to the center of the ring. Mark Sigal argues that it's this pursuit of humanity that may be Jobs' greatest innovation.
The nexus of data, art and science is where the interesting stuff happens
Jer Thorp, data artist in residence at the New York Times, discusses his work at the Times and how aesthetics shape our understanding of data.
Inside Google+: The virtuous circle of data and doing right by users
Data liberation and user experience emerged as core themes during a recent discussion between Tim O'Reilly and Google+ VP of Product Bradley Horowitz.
Five things Android needs to address on the enterprise side
Android has the foundation to support enterprise use, but there's a handful of missing pieces that need to be addressed if it's going to fully catch on in the corporate world.
The Daily Dot wants to tell the web's story with social data journalism
The newly launched Daily Dot is trying an experiment in community journalism, where the community is the Internet. To support their goal, they're applying the lens of data journalism to the social web.





Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. Save 30% on registration with the code STN11RAD.

August 25 2011

Strata Week: Green pigs and data

Here are a few of the data stories that caught my attention this week:

Predicting Angry Birds

Angry Birds maker Rovio will begin using predictive analytics technology from the Seattle-based company Medio to help improve game play for its popular pig-smashing game.

According to the press release announcing the partnership, Angry Birds has been downloaded more than 300 million times and is on course to reach 1 billion downloads. But it isn't merely downloaded a lot; it's played a lot, too. The game, which sees up to 1.4 billion minutes of game play per week, generates an incredible amount of data: user demographics, location, and device information are just a few of the data points.

Users' data has always been important in gaming, as game developers must refine their games to maximize the amount of time players spend as well as track their willingness to spend money on extras or to click on related ads. As casual gaming becomes a bigger and more competitive industry, game makers like Rovio will rely on analytics to keep their customers engaged.

As GigaOm's Derrick Harris notes, quoting Zynga's recent S-1 filing, this is already a crucial part of that gaming giant's business:

The extensive engagement of our players provides over 15 terabytes of game data per day that we use to enhance our games by designing, testing and releasing new features on an ongoing basis. We believe that combining data analytics with creative game design enables us to create a superior player experience.

By enlisting Medio's help with predictive analytics, Rovio is clearly adopting the same tactic to improve the Angry Birds experience.
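Neither Rovio nor Medio has published details of its pipeline, but the general shape of the work, comparing engagement across game variants from telemetry, can be sketched with pandas. The event log below is entirely made up.

```python
import pandas as pd

# Hypothetical event log: one row per play session, with the feature variant
# a player was assigned and how long the session lasted.
events = pd.DataFrame({
    "player_id":      [1, 1, 2, 3, 3, 4],
    "variant":        ["A", "A", "B", "A", "B", "B"],
    "minutes_played": [12.0, 8.5, 20.0, 5.0, 15.5, 9.0],
})

# Compare engagement across variants: average session length and sessions per player.
summary = events.groupby("variant").agg(
    avg_minutes=("minutes_played", "mean"),
    sessions=("minutes_played", "size"),
    players=("player_id", "nunique"),
)
summary["sessions_per_player"] = summary["sessions"] / summary["players"]
print(summary)
```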

Unstructured data and HP's next chapter

HP made a number of big announcements last week as it revealed plans for an overhaul. These plans include ending production of its tablet and smartphones, putting the development of WebOS on hold, and spending some $10 billion to acquire the British enterprise software company Autonomy.

The New York Times described the shift in HP as a move to "refocus the company on business products and services," and the acquisition of Autonomy could help drive that via its big data analytics. HP's president and CEO Léo Apotheker said in a statement: "Autonomy presents an opportunity to accelerate our strategic vision to decisively and profitably lead a large and growing space ... Together with Autonomy, we plan to reinvent how both unstructured and structured data is processed, analyzed, optimized, automated and protected."

As MIT Technology Review's Tom Simonite puts it, HP wants Autonomy for its "math skills" and the acquisition will position HP to take advantage of the big data trend.

Founded in 1996, Autonomy has a lengthy history of analyzing data, with an emphasis on unstructured data. Citing an earlier Technology Review interview, Simonite quotes Autonomy founder Mike Lynch's estimate that about 85% of the information inside a business is unstructured. "[W]e are human beings, and unstructured information is at the core of everything we do," Lynch said. "Most business is done using this kind of human-friendly information."

Simonite argues that by acquiring Autonomy, HP could "take a much more dominant position in the growing market for what Autonomy's Lynch dubs 'meaning-based computing.'"

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD


Using data to uncover stories for the Daily Dot

After several months of invitation-only testing, the web got its own official daily newspaper this week with the launch of The Daily Dot. CEO Nick White and founding editor Owen Thomas said the publication will focus on the news from various online communities and social networks.

GigaOm's Mathew Ingram gave The Daily Dot a mixed review, calling its focus on web communities "an interesting idea," but he questioned if the "home town newspaper" metaphor really makes sense. The number of kitten stories on the Daily Dot's front page aside, ReadWriteWeb's Marshall Kirkpatrick sees The Daily Dot as part of the larger trend toward data journalism, and he highlighted some of the technology that the publication is using to uncover the Web world's news, including Hadoop and assistance from Ravel Data.

"It's one thing to crawl, it's another to understand the community," Daily Dot CEO White told Kirkpatrick. "What we really offer is thinking about how the community ticks. The gestures and modalities on Reddit are very different from Youtube; it's sociological, not just math."

Got data news?

Feel free to email me.





The Daily Dot wants to tell the web's story with social data journalism

If the Internet is the public square of the 21st century, the Daily Dot wants to be its town crier. The newly launched online media startup is trying an experiment in community journalism, where the community is the web. It's an interesting vision, and one that looks to capitalize on the amount of time people are spending online.

The Daily Dot wants to tell stories through a mix of data journalism and old-fashioned reporting, where its journalists pick up the phone and chase down the who, what, when, where, how and why of a video, image or story that's burning up the social web. The site's beat writers, who are members of the communities they cover, watch what's happening on Twitter, Facebook, Reddit, YouTube, Tumblr and Etsy, and then cover the issues and people that matter to them.

Daily Dot screenshot

Even if the newspaper metaphor has some flaws, this focus on original reporting could help distinguish the Daily Dot in a media landscape where attention and quality are both fleeting. In the hurly burly of the tech and new media blogosphere, picking up the phone to chase down a story is too often neglected.

There's something significant about that approach. Former VentureBeat editor Owen Thomas (@OwenThomas), the founding editor of the Daily Dot, has emphasized this angle in interviews with AdWeek and Forbes. Instead of mocking what people do online, as many mainstream media outlets have been doing for decades, the Daily Dot will tell their stories in the same way that a local newspaper might cover a country fair or concert. While Thomas was a well-known master of snark and satire during his tenure at Valleywag, in this context he's changed his style.

Where's the social data?

Whether or not this approach gains traction within the communities the Daily Dot covers remains to be seen. The Daily Dot was co-founded by Nova Spivack, former newspaper executive Nicholas White, and PR consultant Josh Jones-Dilworth, with a reported investment of some $600,000 from friends and family. White has written that he gave up the newspaper to save newspapering. Simply put, the Daily Dot is experimenting with covering the Internet in a way that most newspapers have failed to do.

"I trust that if we keep following people into the places where they gather to trade gossip, argue the issues, seek inspiration, and share lives, then we will also find communities in need of quality journalism," wrote White. "We will be carrying the tradition of local community-based journalism into the digital world, a professional coverage, practice and ethics coupled with the kind of local interaction and engagement required of a relevant and meaningful news source. Yet local to us means the digital communities that are today every bit as vibrant as those geographically defined localities."

To do that, they'll be tapping into an area that Spivack, a long-time technology entrepreneur, has been investing in and writing about for years: data. Specifically, they'll apply data journalism to mining and analyzing the social data from two of the web's most vibrant platforms: Tumblr and Reddit.

White himself is unequivocal about the necessity of data journalism in the new digital landscape, whether at the Daily Dot or beyond:

The Daily Dot may be going in this direction now because of our unique coverage area, but if this industry is to flourish in the 21st century, programming journalists should not remain unique. Data, just like the views of experts, men on the street, polls and participants, is a perspective on the world. And in the age of ATMs, automatic doors and customer loyalty cards, it's become just as ubiquitous. But the media isn't so good with data, with actual mathematics. Our stock-in-trade is the anecdote. Despite a complete lack of solid evidence, we've been telling people their cell phones will give them cancer. Our society ping-pongs between eating and not eating carbs, drinking too much coffee and not enough water, getting more Omega-3s — all on the basis of epidemiological research that is far, far, far from definitive. Most reporters do not know how to evaluate research studies, and so they report the authors' conclusions without any critical evaluation — and studies need critical evaluation.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

Marshall Kirkpatrick, a proponent and practitioner of data journalism, dug deep into how data journalism happens at the Daily Dot. While he's similarly unsure of whether the publication will be interesting to a large enough audience to sustain an advertising venture, the way that the Daily Dot is going about hunting down digital stories is notable. Kirkpatrick shared the details over at ReadWriteWeb:

In order to capture and analyze that data from sites like Twitter, YouTube, Reddit, Etsy and more (the team says it's indexing a new community about every six weeks), the Dot has partnered with the mathematicians at Ravel Data. Ravel uses 80Legs for unblockable crawling, then Hadoop, its own open source framework called GoldenOrb and then an Eigenvector centrality algorithm (similar to Pagerank) to index, analyze, rank and discover connections between millions of users across these social networks.

There are a couple of aspects of data journalism to consider here. One is supplementing the traditional "nose for news" that Daily Dot writers apply to finding stories. "The data really begins to serve as our editorial prosthetics of sorts, telling us where to look, with whom to speak, and giving us the basic groundwork of the communities that we can continue to prod in interesting ways and ask questions of," explained Doug Freeman, an associate at Daily Dot investor Josh Jones-Dilworth's PR firm, in an interview. In other words, the editors of the Daily Dot analyze social data to identify the community's best sources for stories and share them on a "Leaderboard" that — in beta — shows a ranked list of members of Tumblr and Reddit.
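Ravel Data's production pipeline (80legs, Hadoop, GoldenOrb) isn't public beyond Kirkpatrick's description, but the eigenvector centrality idea at its core is easy to illustrate. Here is a minimal power-iteration sketch over a toy "who interacts with whom" graph; the adjacency matrix is invented for illustration, not drawn from Reddit or Tumblr.

```python
import numpy as np

def eigenvector_centrality(adj, iterations=100, tol=1e-9):
    """Score nodes of an undirected graph by eigenvector centrality (power iteration)."""
    n = adj.shape[0]
    scores = np.ones(n) / n
    for _ in range(iterations):
        new = adj @ scores
        new /= np.linalg.norm(new)
        if np.abs(new - scores).sum() < tol:
            scores = new
            break
        scores = new
    return scores

# Toy "who replies to whom" graph for five hypothetical community members.
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)

for user, score in sorted(enumerate(eigenvector_centrality(adj)), key=lambda p: -p[1]):
    print(f"user {user}: {score:.3f}")
```

Well-connected users bubble to the top, which is essentially what a leaderboard of influential community members is doing at much larger scale.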

Another open question is how social data could help with the startup's revenue down the road. "Our data business is a way of creating and funding new value in this regard; we instigated structured crawls of all of the communities we will cover and will continue to do so as we expand into new places," said Freeman. "We started with Reddit (for data and editorial both) because it is small and has a lot of complex properties — a good test balloon. We've now completed data work with Tumblr and YouTube and are continuing." For each community, data provides a view of members, behaviors, and influence dynamics.

That data also relates to how the Daily Dot approaches marketing, branding and advertising. "It's essentially a to-do list of people we need to get reading the Dot, and a list of their behaviors," said Freeman. "From a brand [point of view], it's market and audience intelligence that we can leverage, with services alongside it. From an advertiser [point of view], this data gives resolution and insight that few other outlets can provide. It will get even more exciting over time as we start to tie Leaderboard data to user accounts and instigate CPA-based campaigns with bonuses and bounties for highly influential clicks."

Taken as a whole, what the Daily Dot is doing with social data and digital journalism feels new, or at least like a new evolution. We've seen Facebook and Twitter integration into major media sites, but not Reddit and Tumblr. It could be that the communities of these sites acting as "curation layers" for the web will produce excellent results in terms of popular content, though relevance could still be at issue. Whether this venture in data journalism is successful or not will depend upon it retaining the interest and loyalty of the communities it covers. What is clear, for now, is that the experiment will be fun to watch — cute LOL cats and all.





August 11 2011

Strata Week: Twitter's coming Storm, data and maps from the London riots

Here are a few of the data stories that caught my attention this week:

Twitter's coming Storm


In a blog post late last week, Twitter announced that it plans to open source Storm, its distributed real-time data processing tool. Storm was developed by BackType, the social media analytics company that Twitter acquired last month. Several of BackType's other technologies, including ElephantDB, have already been open sourced, and Storm will join them this fall, according to Nathan Marz, formerly of BackType and now of Twitter.

Marz's post digs into how Storm works as well as how it can be applied. He notes that a Storm cluster is only "superficially similar" to a Hadoop cluster. Instead of running MapReduce "jobs," Storm runs "topologies." One of the key differences is that a MapReduce job eventually finishes, whereas a topology processes messages "forever (or until you kill it)." This makes Storm useful, among other things, for processing real-time streams of data, continuous computation, and distributed RPC.
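To make that distinction concrete without reproducing Storm's actual API, here is a purely conceptual Python sketch contrasting a batch job that terminates with a stream computation that runs until it is killed.

```python
import random
import time
from collections import Counter

def batch_word_count(documents):
    """A MapReduce-style job: processes a finite input and then finishes."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts  # the job ends here

def streaming_word_count(stream):
    """A topology-style computation: consumes messages until it is killed."""
    counts = Counter()
    for message in stream:          # the stream never ends on its own
        counts.update(message.split())
        yield dict(counts)          # emit a running result after every message

def fake_tweet_stream():
    words = ["irene", "storm", "data", "twitter"]
    while True:
        yield " ".join(random.choices(words, k=3))
        time.sleep(0.1)

if __name__ == "__main__":
    print(batch_word_count(["storm data", "twitter data"]))
    for i, snapshot in enumerate(streaming_word_count(fake_tweet_stream())):
        print(snapshot)
        if i == 4:  # in a real topology you'd kill the process instead
            break
```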

Touting the technology's ease of use, Marz lists the complexities handled "under the hood": guaranteed message processing, robust process management, fault detection and automatic reassignment, efficient message passing, and local mode and distributed mode. More details -- and more documentation -- will follow on September 19, when Storm is officially open sourced.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD


Mapping the London riots

Using real-time social streams and mapping tools in a crisis situation is hardly new. We've seen citizens, developers, journalists, and governments alike undertake these efforts following a multitude of natural disasters. But the violence that erupted in London over the weekend has proven yet again that these data tools are important both for safety and for analysis and understanding. Indeed, as journalist Kevin Anderson argued, "data journalists and social scientists should join forces" to understand the causes and motivations for the riots, rather than settling for the more traditional "hours of speculation on television and acres of newsprint positing theories."

NPR's Matt Stiles was just one of the data journalists who picked up the mantle. Using data from The Guardian, he created a map that highlighted riot locations, overlaid on a colored representation of indices of deprivation. It makes for a compelling visualization, showing that the areas with the most incidents of violence are also among the least well-off areas of London.
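The underlying analysis is straightforward to sketch: join incident counts to deprivation ranks and look at the relationship. The boroughs and numbers below are placeholders, not the Guardian's or NPR's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for the Guardian's incident list and the indices of
# deprivation (rank 1 = most deprived); none of these values are real.
incidents = pd.DataFrame({
    "borough": ["Hackney", "Croydon", "Ealing", "Hackney", "Enfield", "Croydon"],
})
deprivation = pd.DataFrame({
    "borough": ["Hackney", "Croydon", "Ealing", "Enfield", "Richmond"],
    "deprivation_rank": [2, 12, 20, 9, 31],
})

counts = incidents.groupby("borough").size().rename("incidents").reset_index()
merged = deprivation.merge(counts, on="borough", how="left").fillna({"incidents": 0})

# A negative correlation means more incidents in more deprived (lower-ranked) boroughs.
print(merged)
print("rank vs. incidents correlation:", merged["deprivation_rank"].corr(merged["incidents"]))
```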


In a reflective piece at PaidContent, James Cridland examined his experience trying to use social media to map the riots. He created a Google Map on which he marked "verified incident areas." As he describes it, however, verification became quite challenging. His "lessons learned" included realizations about what constitutes a reliable source.

"Twitter is not a reliable source: I lost count of the amount of times I was told that riots were occurring in Derby or Manchester. They weren’t, yet on Twitter they were being reported as fact, despite the Derbyshire Constabulary and Greater Manchester Police issuing denials on Twitter. I realised that, in order for this map to be useful, every entry needed to be verified, and verifiable for others too. For every report, I searched Google News, Twitter, and major news sites to try and establish some sort of verification. My criteria was that something had to be reported by an established news organisation (BBC, Sky, local newspapers) or by multiple people on Twitter in different ways.

Cridland points out that the traditional news media weren't reliable either: the BBC, for example, reported disturbances that never occurred or misreported their locations.

"Many people don't know what a reliable source is," he concludes. "I discovered it was surprisingly easy to check the veracity of claims being made on Twitter by using the Internet to check and cross-reference, rather than blindly retweet."

When data disappears

Following the riots in the U.K., there is now a trove of data -- from BlackBerry Messenger, from Twitter, from CCTV -- that the authorities can use to investigate "what happened." There are also probably plenty of people who wish that data would just disappear.

But what happens when data actually does disappear? How can we ensure that important digital information is preserved? Those were the questions asked in an op-ed in Sunday's New York Times. Kari Kraus, an assistant professor in the College of Information Studies and the English department at the University of Maryland, makes a strong case for why "digitization" isn't really the end of the road when it comes to preservation.

"For all its many promises, digital storage is perishable, perhaps even more so than paper. Disks corrode, bits "rot" and hardware becomes obsolete.

But that doesn't mean digital preservation is pointless: if we're going to save even a fraction of the trillions of bits of data churned out every year, we can't think of digital preservation in the same way we do paper preservation. We have to stop thinking about how to save data only after it's no longer needed, as when an author donates her papers to an archive. Instead, we must look for ways to continuously maintain and improve it. In other words, we must stop preserving digital material and start curating it.


She points to the efforts made to curate and preserve video games, work that highlights the struggle of saving not just the content -- the games -- but also the technology: NES cartridges, for example, as well as the consoles themselves. "It might seem silly to look to video-game fans for lessons on how to save our informational heritage, but in fact complex interactive games represent the outer limit of what we can do with digital preservation." By figuring out the complexities of preserving this sort of material -- a game, a console -- we can get a better sense of how to develop systems to preserve other things, whether it's our Twitter archives, digital maps of London, or genetic data.

Got data news?

Send me an email.


July 08 2011

Top stories: July 4-8, 2011

Here's a look at the top stories published across O'Reilly sites this week.


Seven reasons you should use Java again
To mark the launch of Java 7, here are seven reasons why Java is worth your time and worth another look.
What is Node.js?
Learning Node might take a little effort, but it's going to pay off. Why? Because you're afforded solutions to your web application problems that require only JavaScript to solve.
3 Android predictions: In your home, in your clothes, in your car
"Learning Android" author Marko Gargenta believes Android will soon be a fixture in our homes, in our clothes and in our vehicles. Here he explains why and how this will happen.
Into the wild and back again
Burnt out from years of school and tech work, Ryo Chijiiwa quit his job and moved off the grid. In this interview, Chijiiwa talks about how solitude and time in the wilderness has changed his perspective on work and life.
Data journalism, data tools, and the newsroom stack
The MIT Civic Media conference and 2011 Knight News Challenge winners made it clear that data journalism and data tools will play key roles in the future of media and open government.




OSCON Java 2011, being held July 25-27 in Portland, Ore., is focused on open source technologies that make up the Java ecosystem. Save 20% on registration with the code OS11RAD


July 05 2011

Data journalism, data tools, and the newsroom stack

MIT's recent Civic Media Conference and the latest batch of Knight News Challenge winners made one reality crystal clear: as a new era of technology-fueled transparency, innovation and open government dawns, it won't depend on any single CIO or federal program. It will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, whatever form it is delivered in.

The themes that unite this class of Knight News Challenge winners were data journalism and platforms for civic connections. Each theme draws from central realities of the information ecosystems of today. Newsrooms and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news, where news breaks first on social networks, is curated by a combination of professionals and amateurs, and then analyzed and synthesized into contextualized journalism.

Data journalism and data tools

In an age of information abundance, journalists and citizens alike need better tools, whether we're curating the samizdat of the 21st century in the Middle East, as Andy Carvin does, processing a late-night data dump, or looking for the best way to visualize water quality for a nation of consumers. As we grapple with the consumption challenges presented by this deluge of data, new publishing platforms are also empowering us to gather, refine, analyze and share data ourselves, turning it into information.

In this future of media, as Mathew Ingram wrote at GigaOm, big data meets journalism, in the same way that startups see data as an innovation engine, or civic developers see data as the fuel for applications. "The media industry is (hopefully) starting to understand that data can be useful for its purposes as well," Ingram wrote. He continued:

... data and the tools to manipulate it are the modern equivalent of the microfiche libraries and envelopes full of newspaper clippings that used to make up the research arm of most media outlets. They are just tools, but as some of the winners of the Knight News Challenge have already shown, these new tools can produce information that might never have been found before through traditional means.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

The Poynter Institute took note of the attention paid to data by the Knight Foundation as well. As Steve Myers reported, the Knight News Challenge gave $1.5 million to projects that filter and examine data. The winners that relate to data journalism include Overview and PANDA, both discussed below.

I talked more with the AP's Jonathan Stray about data journalism and Overview at the MIT Civic Media conference. For an even deeper dive into his thinking on what journalists need in the age of big data, read his thoughts on "the editorial search engine."

The newsroom stack

With these investments in the future of journalism, more seeds have been planted to add to a "newsroom stack," to borrow a technical term familiar to Radar readers, combining a series of technologies for use in a given enterprise.

"I like the thought of it," said Brian Boyer, the project manager for PANDA, in an interview at the MIT Media Lab. "The newsroom stack could add up to the kit of tools that you ought to be using in your day to day reporting."

Boyer described how the flow of data might move from a spreadsheet (as a .CSV file) to Google Refine (for tidying, clustering, adding columns) to PANDA and then on to Overview or Fusion Tables or Many Eyes, for visualization. This is about "small pieces, loosely joined," he said. "I would rather build one really good small piece than one big project that does everything."
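As a concrete, simplified example of the first link in that chain, here is what the tidying step might look like in pandas rather than Google Refine. That is a substitution on my part, since Boyer's stack uses Refine for interactive clustering, and the spreadsheet below is hypothetical.

```python
import pandas as pd

# Hypothetical raw spreadsheet export at the start of the pipeline Boyer describes.
raw = pd.DataFrame({
    "Agency ": ["Dept. of Streets", "dept of streets", "Parks & Rec"],
    "Amount": ["$1,200.00", "950", "$75.50"],
})

# Normalize column names, cluster agency spellings, and coerce amounts to numbers.
tidy = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
tidy["agency"] = tidy["agency"].str.strip().str.lower().str.replace(".", "", regex=False)
tidy["amount"] = tidy["amount"].str.replace(r"[$,]", "", regex=True).astype(float)

tidy.to_csv("tidy.csv", index=False)  # ready to load into PANDA or a visualization tool
print(tidy)
```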

PANDA and Overview are squarely oriented at bread-and-butter issues for newsrooms in the age of big data. "It's a pain to search across datasets, but we also have this general newsroom content management issue," said Boyer. "The data stuck on your hard drive is sad data. Knowledge management isn't a sexy problem to solve, but it's a real business problem. People could be doing better reporting if they knew what was available. Data should be visible internally."

Boyer thinks the trends toward big data in media are pretty clear, and that he and other hacker journalists can help their colleagues to not only understand it but to thrive. "There's a lot more of it, with government releasing its stuff more rapidly," he said. "The city of Chicago is dropping two datasets a week right now. We're going for increased efficiency, to help people work faster and write better stories. Every major news org in the country is hiring a news app developer right now. Or two. For smaller news organizations, it really works for them. Their data apps account for the majority of their traffic."

Bridging the data divide

There's some caution merited here. Big data is not a panacea, in media or otherwise. Greg Borenstein explored some of these issues in his post on big data and cybernetics earlier this month. Short version: humans still matter in building human relationships and making sense of what matters, however good our personalized relevance engines for news become. Proponents of open data also have to consider a complementary concern: digital literacy.

As Jesse Lichtenstein asserted, "open data alone isn't enough," following the thread of danah boyd's "transparency is not enough" talk at the 2010 Gov 2.0 Expo. Open data can empower the empowered.

To make open government data sing, infomediaries need time and resources. If we hope that citizens will draw their own conclusions from public data published in real time, we'll need to educate them to be critical thinkers. As Andy Carvin tweeted during the MIT Civic Media conference, "you need to be sure those people have high levels of digital literacy and media literacy." There's a data divide to be considered here, as Nick Clark Judd pointed out over at techPresident.

It looks like those concerns were at least partially factored into the judges' decisions on other Knight News Challenge winners. Spending Stories, from the Open Knowledge Foundation, is designed to add context to news stories based upon government data by connecting stories to the data used. Poderapedia will try to bring more transparency to Chile using data visualizations that draw upon a database of editorial and crowdsourced data. The State Decoded will try to make the law more user-friendly. The project has notable open government DNA: Waldo Jaquith's work on OpenVirginia was aimed at providing an API for the Commonwealth.

There were citizen science and transparency projects alongside all of those data plays too, including:

Given the recent story here at Radar on citizen science and crowdsourced radiation data, there's good reason to watch both of these projects evolve. And given research from the Pew Internet and Life Project on the role of the Internet as a platform for collective action, the effect of connecting like-minded citizens to one another through efforts like the Tiziano Project may prove far reaching.

Photo: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr





June 29 2011

Citizen science, civic media and radiation data hint at what's to come

Natural disasters and wars bring people together in unanticipated ways, as they use the tools and technologies easily at hand to help. From crisis response to situational awareness, free or low-cost online tools are empowering citizens to do more than donate money or blood: now they can donate time and expertise or, increasingly, act as sensors. In the United States, we saw a leading edge of this phenomenon in the Gulf of Mexico, where open source oil spill reporting provided a prototype for data collection via smartphone. In Japan, an analogous effort has grown and matured in the wake of the nuclear disaster that resulted from a massive earthquake and subsequent tsunami this spring.

The story of the RDTN project, which has grown into Safecast, a crowdsourced radiation detection network, isn't new, exactly, but it's important.

Radiation monitoring and grassroots mapping in Japan has been going on since April, as Emily Gertz reported at OneEarth.org. I recently heard more about the Safecast project from Joi Ito at this year's Civic Media conference at the MIT Media Lab, where Ito described his involvement. Ethan Zuckerman blogged Ito's presentation, capturing his thoughts on how the Internet helped cover the Japanese earthquake (Twitter "beat the pants" off the mainstream media on the first day) and the Safecast project's evolution from a Skype chat.

According to Gertz's reporting, Safecast now includes data from a variety of sources, including feeds from the U.S. Environmental Protection Agency, Greenpeace, a volunteer crowdsourcing network in Russia, and the Japanese Ministry of Education, Culture, Sports, Science and Technology. Radiation data that's put into Safecast is made available for others to use via Pachube, an open data platform for sharing and monitoring sensor feeds.

Ito said that a lot of radiation data that the Japanese government had indicated would be opened up has not been released, prompting the insight that crises, natural or otherwise, are an excellent opportunity to examine how effective an open government data implementation has been. Initially, the RDTN project entered an environment where there was nearly no radiation data available to the public.

"They were releasing data, it was just not very specific," said Sean Bonner, via Skype Interview. Bonner has served as the communications lead for Safecast since the project began. The Japanese government "would release data for some areas and not for others — or rather they didn't have it," he said. "I don't think they had data they weren't releasing. Our point is that the sensors to detect the data were not in place at all. So we decided to help with that."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

A Kickstarter campaign in April raised nearly $37,000 to purchase Geiger counters to gather radiation data. Normally, that would be sufficient to obtain dozens of devices, given costs that range from $100 to nearly $1,000 for a professional-grade unit. The trouble is that Geiger counters, not easy to get even before the Japanese nuclear meltdown, became nearly impossible to obtain afterwards.

The Safecast project has also hacked together iGeigie, an iPhone-connected Geiger counter that can detect beta and gamma radiation. "The iGeigie is just a concept product, it's not a focus or a main solution," cautioned Bonner. "So a lot of what we've been doing is trying to help cover more ground with single sensors."

Even if they were in broader circulation, Geiger counters are unlikely to detect radiation in food or water. That's where open source hardware and hackerspaces become more relevant, specifically the Arduino boards that Radar and Make readers know well.

"We have Arduinos in the static devices that we are building and connecting to the web," said Bonner. "We're putting those around and they report data back to us." In other words, the Internet of Things is growing.

The sensors Safecast is deploying will capture alpha, beta and gamma radiation. "It's very important to track all three," said Bonner. "The very sensitive devices we are using are commercially produced. [They are] Inspector Alerts, made by International Medcom. Those use the industry standard 2-inch pancake sensor, which we are using in our other devices as well. We are using the same sensors everywhere."

Citizen science and open data

Open source software and citizens acting as sensors have steadily been integrated into journalism over the past few years, most dramatically in the videos and pictures uploaded after the 2009 Iran election and during this year's Arab Spring. Citizen science looks like the new frontier. "I think the real value of citizen media will be collecting data," said Rich Jones, founder of OpenWatch, a counter-surveillance project that aims to "police the police." Apps like Open Watch can make "analyzing data a revolutionary act," said Justin Jacoby Smith. The development of Oil Reporter, grassroots mapping, Safecast, social networks, powerful connected smartphones and massive online processing power have put us into new territory. In the context of environmental or man-made disasters, collecting or sharing data can also be a civic act.

Crowdsourcing radiation data on Japan does raise legitimate questions about data quality and reporting, as Safecast's own project leads acknowledge.

"We make it very clear on the site that yes, there could most definitely be inaccuracies in crowd-sourced data," Safecast's Marcelino Alvarez told Public Radio International. "And yes, there could be contamination of a particular Geiger counter so the readings could be off," Alvarez said. "But our hope is that with more centers and more data being reported that those points that are outliers can be eliminated, and that trends can be discerned from the data."

The thinking here is that while some data may be inaccurate or some sensors misconfigured, over time the aggregate will skew toward accuracy. "More data is always better than less data," said Bonner. "Data from several sources is more reliable than from one source, by default. Without commenting on the reliability of any specific source, all the other sources help improve the overall data. Open data helps with that."
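One common way to let "more data" wash out bad sensors is a robust aggregate that discards outliers before averaging. This is a generic sketch using median absolute deviation, not Safecast's actual pipeline, and the readings are invented.

```python
import statistics

def robust_average(readings, k=3.5):
    """Average sensor readings after dropping outliers via median absolute deviation."""
    med = statistics.median(readings)
    mad = statistics.median(abs(r - med) for r in readings) or 1e-9
    kept = [r for r in readings if abs(r - med) / mad <= k]
    return sum(kept) / len(kept), kept

# Hypothetical counts-per-minute readings near one location; one miscalibrated
# counter reads far too high.
readings = [42, 45, 40, 44, 300, 43]
avg, kept = robust_average(readings)
print(f"kept {kept}, average {avg:.1f} CPM")
```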

Safecast is combining open data collected by citizen science with academic, NGO and open government data, where available, and then making it widely available. It's similar to other projects, where public data and experimental data are percolating.

Citizen science can create information orders of magnitude better than Google Maps, said Brian Boyer, news application developer at the Chicago Tribune, referencing the grassroots mapping work of Jeffrey Warren and others. "It's also fun," Boyer said. "You can get lots of people involved who wouldn't otherwise be involved doing a mapping project."

As news of these experiments spreads, the code and policies used to build them will also move with them. The spread of open source software is now being accompanied by open source hardware and maker culture. That will likely have unexpected effects.

When you can't meet demand for a device like a Geiger counter, people will start building their own, said Ito at the MIT Civic Media conference. He's seeing open hardware design spread globally. While there's an embargo on the export of many technologies, "we argue — and win — that open source software is free speech," said Ito. "Open source hardware is the same." If open source software now plays a fundamental role in new media, as evidenced by the 2011 winners of the Knight News Challenge, open source hardware may be supporting democracy in journalism too, says Ito.

Given Ito's success in anticipating (and funding) other technological changes, that's one prediction to watch.





June 23 2011

Strata Week: Data Without Borders

Here are some of the data stories that caught my attention this week:

Data without borders

Data is everywhere. That much we know. But the usage of and benefit from data is not evenly distributed, and this week, New York Times data scientist Jake Porway has issued a call to arms to address this. He's asking for developers and data scientists to help build a Data Without Borders-type effort to take data — particularly NGO and non-profits' data — and match it with people who know what to do with it.

As Porway observes:

There's a lot of effort in our discipline put toward what I feel are sort of "bourgeois" applications of data science, such as using complex machine learning algorithms and rich datasets not to enhance communication or improve the government, but instead to let people know that there's a 5% deal on an iPad within a 1 mile radius of where they are. In my opinion, these applications bring vanishingly small incremental improvements to lives that are arguably already pretty awesome.

Porway proposes building a program to help match data scientists with non-profits and the like who need data services. The idea is still under development, but drop Porway a line if you're interested.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Big data and the future of journalism

The Knight Foundation announced the winners of its Knight News Challenge this week, a competition to find and support the best new ideas in journalism. The Knight Foundation selected 16 projects to fund from among hundreds of applicants.

In announcing the winners, the Knight Foundation pointed out a couple of important trends, including "the rise of the hacker/data journalist." Indeed, several of the projects are data-related, including Swiftriver, a project that aims to make sense of crisis data; ScraperWiki, a tool for users to create their own custom scrapers; and Overview, a project that will create visualization tools to help journalists better understand large data sets.

IBM releases its first Netezza appliance

Last fall, IBM announced its acquisition of the big data analytics company Netezza. The acquisition was aimed at helping IBM build out its analytics offerings.

This week, IBM released its first new Netezza appliance since acquiring the company. The IBM Netezza High Capacity Appliance is designed to analyze up to 10 petabytes in just a few minutes. "With the new appliance, IBM is looking to make analysis of so-called big data sets more affordable," Steve Mills, senior vice president and group executive of software and systems at IBM, told ZDNet.

The new Netezza appliance is part of IBM's larger strategy of handling big data, of which its recent success with Watson on Jeopardy was just one small part.

The superhero social graph

Plenty of attention is paid to the social graph: the ways in which we are connected online through our various social networks. And while there's still lots of work to be done making sense of that data and of those relationships, a new dataset released this week by the data marketplace Infochimps points to other social (fictional) worlds that can be analyzed.

The world, in this case, is that of the Marvel Comics universe. The Marvel dataset was constructed by Cesc Rosselló, Ricardo Alberich, and Joe Miro from the University of the Balearic Islands. Much like a real social graph, the data shows the relationships between characters, and according to the researchers "is closer to a real social graph than one might expect."
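The same kind of analysis applies to fictional networks as to real ones: treat co-appearance in an issue as an edge and count connections. The rosters below are made up, and this is only a toy sketch of the approach, not the researchers' method.

```python
from collections import defaultdict
from itertools import combinations

# A few hypothetical issue rosters in the spirit of the Infochimps dataset;
# two characters appearing in the same issue implies an edge between them.
issues = [
    {"Spider-Man", "Captain America", "Iron Man"},
    {"Iron Man", "Captain America", "Thor"},
    {"Spider-Man", "Daredevil"},
]

co_appearances = defaultdict(int)
for roster in issues:
    for a, b in combinations(sorted(roster), 2):
        co_appearances[(a, b)] += 1

# Weighted degree: how connected each character is across the toy universe.
degree = defaultdict(int)
for (a, b), weight in co_appearances.items():
    degree[a] += weight
    degree[b] += weight

for character, d in sorted(degree.items(), key=lambda kv: -kv[1]):
    print(f"{character}: {d}")
```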

Got data news?

Feel free to email me.





April 12 2011

Maps aren't easy

There is an art to data journalism, and in many cases that art requires an involved and arduous process. In a recent interview, Simon Rogers, editor of the Guardian's Datablog and Datastore, discussed many of the issues his team faced when they assembled databases and reports from the WikiLeaks releases. More recently, journalists have been building scads of interactive maps to illustrate news from the disaster in Japan and the political situation in Libya.

A recent story at Poynter looking at the importance of such maps also briefly noted their return on investment:

"The data-driven interactives take a lot of time and teamwork to produce, but they have the greatest value and generate good traffic and time-spent on the site,” said Juan Thomassie, senior interactive developer at USA Today.

So, hard work yields strong engagement. Sounds good. But that same Poynter article included this eye-opening aside: the New York Times has four cartographers. At first blush, my editor cringed at the (seemingly) exceptional number of hours and resources the Times is dedicating to map production. Does a news org really need four cartographers? I turned to Pete Warden, founder of OpenHeatMap, for some informed answers.

Warden walked me through the labor-intensive process — one that may very well justify a full cartography team [Ed. duly noted]. He also discussed a few tools that can streamline data journalism production.

Our interview follows.


What are the steps involved in making an interactive map?

Pete Warden: Usually one of the hardest parts is gathering the data. A good example might be the map Alasdair Allan, Gemma Hobson, and I did for the Guardian (see the screen shot below; find the dataset here).

Alasdair spotted that the Japanese government had released some data on the radiation levels around the country. Unfortunately, it was only available in PDF form, so Gemma and I did a combination of cutting-and-pasting and manual typing to get all the readings and locations into a spreadsheet. Once they were in a spreadsheet, we then had to pick exactly what we wanted to display in the final map.

Alasdair took charge of that process and spent a lot of time trying out different scales and units — for example, showing the difference between the current values and the background levels at each location since some areas had naturally higher levels of radiation. That involved understanding what story we wanted to tell — similar to the way reporters put together quotes and other evidence to support the points of their articles. It also meant repeatedly uploading different versions and iterating until there was something that looked interesting and informative.

Screenshot: the radiation map from the Guardian's original story.
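The "difference from background" framing Warden describes is easy to sketch once the readings are in a spreadsheet. The values and color thresholds below are invented for illustration; they are not the Guardian's data.

```python
# Hypothetical readings (in microsieverts per hour) typed in from government PDFs,
# plus an assumed background level for each location. All numbers are made up.
readings = {
    "Fukushima": {"current": 1.90, "background": 0.05},
    "Ibaraki":   {"current": 0.14, "background": 0.05},
    "Tokyo":     {"current": 0.07, "background": 0.04},
}

def bucket(delta):
    """Coarse scale for map coloring; the thresholds are illustrative only."""
    if delta < 0.05:
        return "near background"
    if delta < 0.5:
        return "elevated"
    return "high"

for place, r in readings.items():
    delta = r["current"] - r["background"]
    print(f"{place}: +{delta:.2f} uSv/h above background ({bucket(delta)})")
```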

Are there tools that can make the data acquisition and mapping processes more efficient?

Pete Warden: I'm obviously a big fan of OpenHeatMap, but I've also been very impressed by both Google's Fusion Tables and the Tableau Public tool. This gives users a lot of choices. My design bias is toward simplicity, so OpenHeatMap's audience includes users unfamiliar with traditional GIS.

You recently released the Data Science Toolkit. How can the open source tools in that kit be applied to data journalism?

Pete Warden: The toolkit contains a lot of tools based on common requests from journalists. In particular, the command-line tools, like street2coordinates and coordinates2politics, can be very handy for taking large spreadsheets of addresses and calculating their positions, along with information like which congressional districts, neighborhoods, cities, states and countries they are in. You can then take that data and do further processing to break down your statistics by those categories.
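As a rough illustration of how that might look from Python: assuming the toolkit's REST endpoints follow the documented /street2coordinates/<address> and /coordinates2politics/<lat>,<lon> patterns (and with response field names that should be checked against the current documentation), something like this could geocode an address and list its political regions.

```python
from urllib.parse import quote

import requests

# Assumes a Data Science Toolkit server (the hosted instance or a local VM);
# the endpoint shapes and field names below are assumptions to verify.
DSTK = "http://www.datasciencetoolkit.org"
address = "1600 Pennsylvania Ave NW, Washington, DC"

geo = requests.get(f"{DSTK}/street2coordinates/{quote(address)}").json()
info = geo.get(address) or {}
lat, lon = info.get("latitude"), info.get("longitude")
print(address, "->", lat, lon)

if lat is not None:
    politics = requests.get(f"{DSTK}/coordinates2politics/{lat},{lon}").json()
    first = politics[0] if isinstance(politics, list) and politics else {}
    for region in first.get("politics") or []:
        print(region.get("friendly_type"), ":", region.get("name"))
```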







March 02 2011

Before you interrogate data, you must tame it

IBM, Wolfram|Alpha, Google, Bing, groups at universities, and others are trying to develop algorithms that parse useful information from unstructured data.

This limitation in search is a dull pain for many industries, but it was sharply felt by data journalists with the WikiLeaks releases. In a recent interview, Simon Rogers (@smfrogers), editor of the Guardian's Datablog and Datastore, talked about the considerable differences between the first batch of WikiLeaks releases — which arrived in a structured form — and the text-filled mass of unstructured cables that came later.

There were three WikiLeaks releases. One and two, Afghanistan and Iraq, were very structured. We got a CSV sheet, which was basically the "SIGACTS" — that stands for "significant actions" — database. It's an amazing data set, and in some ways it was really easy to work with. We could do incredibly interesting things, showing where things happened and events over time, and so on.

With the cables, it was a different kettle of fish. It was just a massive text file. We couldn't just look for one thing and think, "oh, that's the end of one entry and the beginning of the next." We had a few guys working on this for two or three months, just trying to get it into a state where we could have it in a database. Once it was in a database, internally we could give it to our reporters to start interrogating and getting stories out of it.
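The contrast Rogers describes can be sketched in a few lines: a structured release loads row by row, while an unstructured dump first has to be split into records by some guessed delimiter. The "REFERENCE ID" marker below is hypothetical; finding the real record boundaries is exactly the work that took the Guardian's developers months.

```python
import csv
import re

def load_sigacts(path):
    """Structured release: each significant action is already one CSV row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Unstructured release: one huge text blob, split on a guessed record marker.
CABLE_BREAK = re.compile(r"(?=^REFERENCE ID:)", re.MULTILINE)

def split_cables(blob):
    return [chunk.strip() for chunk in CABLE_BREAK.split(blob) if chunk.strip()]

demo = "REFERENCE ID: 01ABC1\nFirst cable text...\nREFERENCE ID: 01ABC2\nSecond cable text...\n"
print(len(split_cables(demo)))  # 2
```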

During the same interview, Rogers said that providing readers with the searchable data behind stories is a counter-balance to the public's cynicism toward the media.

When we launched the Datablog, we thought it was just going to be developers [using it]. What it turned out to be, actually, is real people out there in the world who want to know what's going on with a story. And I think part of that is the fact that people don't trust journalists any more, really. They don't trust us to be truthful and honest, so there's a hunger to see the stories behind the stories.

For more about how Rogers' group dealt with the WikiLeaks data and how data journalism works, check out the full video interview.






