September 29 2013

❝How does the current mobilization of databases shift the established ways of constructing scandals…

How does the current mobilization of databases shift the established ways of constructing scandals?

http://fr.slideshare.net/bodyspacesociety/pres-15jan13-machines-a-scandales
#datajournalism

June 14 2012

Stories over spreadsheets

I didn't realize how much I dislike spreadsheets until I was presented with a vision of the future where their dominance isn't guaranteed.

That eye-opening was offered by Narrative Science CTO Kris Hammond (@whisperspace) during a recent interview. Hammond's company turns data into stories: They provide sentences and paragraphs instead of rows and columns. To date, much of the attention Narrative Science has received has focused on the media applications. That's a natural starting point. Heck, I asked him about those very same things when I first met Hammond at Strata in New York last fall. But during our most recent chat, Hammond explored the other applications of narrative-driven data analysis.

"Companies, God bless them, had a great insight: They wanted to make decisions based upon the data that's out there and the evidence in front of them," Hammond said. "So they started gathering that data up. It quickly exploded. And they ended up with huge data repositories they had to manage. A lot of their effort ended up being focused on gathering that data, managing that data, doing analytics across that data, and then the question was: What do we do with it?"

Hammond sees an opportunity to extract and communicate the insights locked within company data. "We'll be the bridge between the data you have, the insights that are in there, or insights we can gather, and communicating that information to your clients, to your management, and to your different product teams. We'll turn it into something that's intelligible instead of a list of numbers, a spreadsheet, or a graph or two. You get a real narrative; a real story in that data."

My takeaway: The journalism applications of this are intriguing, but these other use cases are empowering.

Why? Because most people don't speak fluent "spreadsheet." They see all those neat rows and columns and charts, and they know something important is tucked in there, but what that something is and how to extract it aren't immediately clear. Spreadsheets require effort. That's doubly true if you don't know what you're looking for. And if data analysis is an adjacent part of a person's job, more effort means those spreadsheets will always be pushed to the side. "I'll get to those next week when I've got more time ..."

We all know how that plays out.

But what if the spreadsheet wasn't our default output anymore? What if we could take things most of us are hard-wired to understand — stories, sentences, clear guidance — and layer it over all that vital data? Hammond touched on that:

"For some people, a spreadsheet is a great device. For most people, not so much so. The story. The paragraph. The report. The prediction. The advisory. Those are much more powerful objects in our world, and they're what we're used to."

He's right. Spreadsheets push us (well, most of us) into a cognitive corner. Open a spreadsheet and you're forced to recalibrate your focus to see the data. Then you have to work even harder to extract meaning. This is the best we can do?

With that in mind, I asked Hammond if the spreadsheet's days are numbered.

"There will always be someone who uses a spreadsheet," Hammond said. "But, I think what we're finding is that the story is really going to be the endpoint. If you think about it, the spreadsheet is for somebody who really embraces the data. And usually what that person does is they reduce that data down to something that they're going to use to communicate with someone else."

A thought on dashboards

I used to view dashboards as the logical step beyond raw data and spreadsheets. I'm not so sure about that anymore, at least in terms of broad adoption. Dashboards are good tools, and I anticipate we'll have them from now until the end of time, but they're still weighed down by a complexity that makes them inaccessible.

It's not that people can't master the buttons and custom reports in dashboards; they simply don't have time. These people — and I include myself among them — need something faster and knob-free. Simplicity is the thing that will ultimately democratize data reporting and data insights. That's why the expansion of data analysis requires a refinement beyond our current dashboards. There's a next step that hasn't been addressed.

Does the answer lie in narrative? Will visualizations lead the way? Will a hybrid format take root? I don't know what the final outputs will look like, but the importance of data reporting means someone will eventually crack the problem.

Full interview

You can see the entire discussion with Hammond in the following video.

June 01 2012

Publishing News: HTML5 may be winning the war against apps

Here are a few stories that caught my attention in the publishing space this week:

The shortest link between content and revenue may be HTML5

A couple of weeks ago, MIT Technology Review's editor in chief and publisher Jason Pontin wrote a piece about killing their app and optimizing their website for all devices with HTML5. That same week, Lonely Planet's Jani Patokallio predicted that HTML5 would nudge out the various ebook formats. This week, Wired publisher Howard Mittman shot back in an interview with Jeff John Roberts at PaidContent, insisting that apps are the future, not HTML5.

Roberts reports that "[Mittman] believes that HTML5 will just be part of a 'larger app experience' in which an app is a storefront or gateway for readers to have deeper interactions with publishing brands." I'm not sure, however, that readers need yet another gateway (read: obstacle) to their content, and recent movements in the publishing industry suggest HTML5 may be the more likely way forward.

This week, Inkling founder and CEO Matt MacInnis announced the launch of Inkling for Web, an HTML5-based web client that brings Inkling's iPad app features to any device with a browser. The app and HTML5 technology in this case are intertwined — all content previously owned in the app can now also be accessed via the web, and activity will sync between the app and the web, so notes made on the web will appear in the iPad app and vice versa. MacInnis says in the announcement that the launch is a big part of the company's overall vision to provide service to anyone on any device they choose, one of the major benefits of choosing HTML5 technology.

Also this week, OverDrive announced plans to launch OverDrive Read, an open standard HTML5/EPUB browser-based ebook platform that will allow users to read ebooks online or offline, without having to install software or download an app. Dianna Dilworth at GalleyCat reports on additional benefits for publishers: "Using the platform, publishers can create a URL for each title. This link can include book previews and review copies, as well as browsing capabilities and sample chapters."

In the end, it will all come down to what it always comes down to: money. Roger McNamee's latest piece, "HTML 5: The Next Big Thing for Content," takes a very thorough look at HTML5 in general and specifically in relation to content publishing (this week's must-read). As to money, this excerpt stood out:

"The beauty of these new [HTML5] 'app' models is that each can [be] monetized, in most cases at rates better than the current web standard. Imagine you are reading David Pogue's technology product review column in the New York Times. Today, the advertising on that page is pretty random. In HTML 5, it will be possible for ads to search the page they are on for relevant content. This would allow the Times to auction the ad space to companies that sell consumer electronics, whose ads could then look at the page, identify the products and then offer them in the ad."

As it becomes more and more likely that ads will be incorporated as a revenue stream in ebooks, publishers will embrace whatever technology draws the shortest line and the most avenues between content and revenue, which at this point is looking more and more like HTML5.

MIT students present news reporting solutions

MIT Media Lab students were busy this week presenting final projects for their "News in the Age of Participatory Media" class. Andrew Phelps at Nieman Journalism Lab highlighted a few of the interesting projects, which were required to address a new tool, technique, or technology for reporting the news. One student proposed modernizing the hyperlink by attaching semantic meaning to it; another suggested a Wiki-like idea for correlations to put impossibly big numbers — the $15 trillion U.S. national debt, for instance — into context for readers.

The growing importance of data journalism makes another student's suite of tools called DBTruck particularly interesting. As Phelps explains, users can "[e]nter the URL of a CSV file, JSON data, or an HTML table and DBTruck will clean up the data and import it to a local database." The tools also let you compare arbitrary data to provoke deeper insights — in testing, the student discovered a correlation between low birth weights and New York state communities with high teen pregnancy rates, a connection that might not have been otherwise discovered.
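DBTruck's own code isn't shown here, but the underlying pattern it automates — fetching tabular data from a URL and loading it into a local database so it can be cleaned, joined and queried — can be sketched in a few lines of Python. The URL and table name below are placeholders, not real data sources.

```python
# Sketch of the fetch-a-CSV-and-load-it-locally pattern that tools
# like DBTruck automate (this is not DBTruck's code; the URL and
# table name are placeholders).
import csv
import io
import sqlite3
import urllib.request

url = "https://example.org/birth_weights.csv"  # hypothetical dataset
with urllib.request.urlopen(url) as resp:
    reader = csv.reader(io.TextIOWrapper(resp, encoding="utf-8"))
    header = next(reader)
    rows = [r for r in reader if len(r) == len(header)]  # drop ragged rows

conn = sqlite3.connect("local.db")
cols = ", ".join(f'"{c.strip()}" TEXT' for c in header)
conn.execute(f"CREATE TABLE IF NOT EXISTS birth_weights ({cols})")
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO birth_weights VALUES ({placeholders})", rows)
conn.commit()

# Once loaded, the table can be joined against any other local data.
for row in conn.execute("SELECT * FROM birth_weights LIMIT 5"):
    print(row)
```

The value of having everything in one local database is exactly the kind of cross-dataset comparison described above: once two tables sit side by side, a simple join can surface a correlation that neither spreadsheet shows on its own.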

Penguin and Macmillan deny participation in an illegal conspiracy

Publishers Penguin and Macmillan responded this week to the Department of Justice's (DOJ) antitrust lawsuit filed earlier this year against the two publishers and Apple (Apple responded to the lawsuit last week).

The New York Times reports that in Penguin's 74-page response (PDF), it "called Amazon 'predatory' and a 'monopolist' that treats books as 'widgets.' It asserted that Amazon, not Penguin, was the company engaging in anticompetitive behavior, to the detriment of the industry."

Laura Hazard Owen called Macmillan's 26-page response (PDF) "shorter and more fiery" than Penguin's. She reports:

"'Macmillan did not participate in any illegal conspiracy,' Macmillan's filing says, and 'the lack of direct evidence of conspiracy cited in the Government's Complaint is telling…[it is] necessarily based entirely on the little circumstantial evidence it was able to locate during its extensive investigation, on which it piles innuendo on top of innuendo, stretches facts and implies actions that did not occur and which Macmillan denies unequivocally.'"

May 24 2012

Knight Foundation grants $2 million for data journalism research

Every day, the public hears more about technology and media entrepreneurs, from their beginnings in garages and dorm rooms all the way to when they go public, get acquired or go spectacularly bust. The way the world mourned the passing of Steve Jobs last year and the way young people now look to Mark Zuckerberg as a model for what's possible offer some insight into that dynamic.

For those who want to follow in their footsteps, the most interesting elements of those stories will be the muddy details of who came up with the idea, who wrote the first lines of code, who funded them, how they were mentored and then how the startup executed upon their ideas.

Today, foundations and institutions alike are getting involved in the startup ecosystem, but with a different hook than the venture capitalists on Sand Hill Road in California or Y Combinator: They're looking for smart, ambitious social entrepreneurs who want to start civic startups and increase the social capital of the world. From the Code for America Civic Accelerator to the Omidyar Foundation to Google.org to the Knight Foundation's News Challenge, there's more access to seed capital than ever before.

There are many reasons to watch what the Knight Foundation is doing, in particular, as it shifts how it funds digital journalism projects. The foundation's grants are going toward supporting many elements of the broader open government movement, from civic media to government transparency projects to data journalism platforms.

Many of these projects — or elements and code from them — have a chance at becoming part of the plumbing of digital democracy in the 21st century, although we're still on the first steps of the long road of that development.

This model for catalyzing civic innovation in the public interest is, in the broader sweep of history, still relatively new. (Then again, so is the medium you're reading this post on.) One barrier that the Internet has helped lower is the process of discovering and selecting good ideas to fund and letting bad ideas fall by the wayside. Another is changing how ideas are capitalized, whether through microfunding approaches or by distributing opportunities to help products or services go to market through crowdfunding platforms like Kickstarter.

When the Pebble smartwatch received $10 million through Kickstarter this year, it offered a notable data point into how this model could work. We'll see how others follow.

These models could contribute to the development of small pieces of civic architecture around the world, loosely joining networks in civil society with mobile technology, lightweight programming languages and open data.

After years of watching how the winners of the Knight News Challenges have — or have not — contributed to this potential future, its architects are looking at big questions: How should resources be allocated in newsrooms? What should be measured? Are governments more transparent and accountable due to the use of public data by journalists? What data is available? What isn't? What's useful and relevant to the lives of citizens? How can data visualization, news applications and interactive maps inform and engage readers?

In the context of these questions, the fact that the next Knight News Challenge will focus on data will create important new opportunities to augment the practice of journalism and accelerate the pace of open government. John Bracken (@jsb), the Knight Foundation's program director for journalism and media innovation, offered an explanation for this focus on the foundation's blog:

"Knight News Challenge: Data is a call for making sense of this onslaught of information. 'As data sits teetering between opportunity and crisis, we need people who can shift the scales and transform data into real assets,' wrote Roger Ehrenberg earlier this year.

"Or, as danah boyd has put it, 'Data is cheap, but making sense of it is not.'

"The CIA, the NBA's Houston Rockets, startups like BrightTag and Personal ('every detail of your life is data') — they're all trying to make sense out of data. We hope that this News Challenge will uncover similar innovators discovering ways for applying data towards informing citizens and communities."

Regardless of what happens with this News Challenge, some of those big data questions stand a much better chance of being answered because of the Knight Foundation's $2 million grant to Columbia University to research and distribute best practices for digital reporting, data visualizations and measuring impact.

Earlier this spring, I spoke with Emily Bell, the director of the Tow Center for Digital Journalism, about how this data journalism research at Columbia will close the data science "skills gap" in newsrooms. Bell is now entrusted with creating the architecture for learning that will teach the next generation of data journalists at Columbia University.

In search of the reasoning behind the grant, I talked to Michael Maness (@MichaelManess), vice president of journalism and media innovations at the Knight Foundation. Our interview, lightly edited for content and clarity, follows.

The last time I checked, you're in charge of funding ideas that will make the world better through journalism and technology. Is that about right?

Michael Maness: That's the hope. What we're trying to do is make sure that we're accelerating innovation in the journalism and media space that continues to help inform and engage communities. We think that's vital for democracy. What I do is work on those issues and fund ideas around that to not only make it easier for journalists to do their work, but citizens to engage in that same practice.

The Knight News Challenge has changed a bit over the last couple of years. How has the new process been going?

Michael Maness: I've been in the job a little bit more than a year. I came in at the tail end of 2011 and the News Challenge of 2011. We had some great winners, but we noticed that the amount of time from when you applied to the News Challenge to when you were funded could be up to 10 months by the time everything was done, and certainly eight months in terms of the process. So we reduced that to about 10 weeks. It's intense for the judges to do that, but we wanted to move more quickly, recognizing the speed of disruption and the energy of innovation and how fast it's moving.

We've also switched to a thematic approach. We're going to do three [themes] this year. The point of it is to fund as fast as possible those ideas that we think are interesting and that we think will have a big impact.

This last round was around networks. The reason we focused on networks is the apparent rise of network power. The second reason is we get people, for example, that say, "This is the new Twitter for X" or "This is the new Facebook for journalists." Our point is actually, you should be using and leveraging existing things for that.

We found when we looked back at the last five years of the News Challenge that people who came in with networks or built networks in accordance with what they're doing had a higher and faster scaling rate. We want to start targeting areas to do that, too.

We hear a lot about entrepreneurs, young people and the technology itself, but schools and libraries seem really important to me. How will existing institutions be part of the future that you're funding and building?

Michael Maness: One of the things that we're doing is moving into more "prototyping" types of grants and then finding ways of scaling those out, helping get ideas into a proof-of-concept phase so users kick the tires and look for scaling afterward.

In terms of the institutions, one of the things that we've seen that's been a bit of a frustration point is making sure that when we have innovations, [we're] finding the best ways to parlay those into absorption in these kinds of institutions.

A really good standout for that, from a couple years ago as a News Challenge winner, is DocumentCloud, which has been adopted by a lot of the larger legacy media institutions. From a university standpoint, we know one of the things that is key is getting involvement with students as practitioners. They're trying these things out and they're doing the two kinds of modeling that we're talking about. They're using the newest tools in the curriculum.

That's one of the reasons we made the grant [to Columbia]. They have a good track record. The other reason is that you have a real practitioner there with Emily Bell, doing all of her digital work from The Guardian and really knowing how to implement understandings and new ways of reporting. She's been vital. We see her as someone who has lived in an actual newsroom, pulling in those digital projects and finding new ways for journalists to implement them.

The other aspect is that there are just a lot of unknowns in this space. As we move forward, using these new tools for data visualization, for database reporting, what are the things that work? What are the things that are hard to do? What are the ideas that make the most impact? What efficiencies can we find to help newsrooms do it? We didn't really have a great body of knowledge around that, and that's one of the things that's really exciting about the project at Columbia.

How will you make sure the results of the research go beyond Columbia's ivy-covered walls?

Michael Maness: That was a big thing that we talked about, too, because it's not in us to do a lot of white papers around something like this. It doesn't really disseminate. A lot of this grant is around making sure that there are convocations.

We talk a lot about the creation of content objects. If you're studying data visualization, we should be making sure that we're producing that as well. This will be something that's ongoing and emerging. Definitely, a part of it is that some of these resources will go to hold gatherings, to send people out from Columbia to disseminate [research] and also to produce findings in a way that can be moved very easily around a digital ecosystem.

We want to make sure that you're running into this work a lot. This is something that we've baked into the grant, and we're going to be experimenting with, I think, as it moves forward. But I hear you, that if we did all of this — and it got captured behind ivy walls — it's not beneficial to the industry.

May 22 2012

Data journalism research at Columbia aims to close data science skills gap

Successfully applying data science to the practice of journalism requires more than providing context and finding clarity in vast amounts of unstructured data: it will require media organizations to think differently about how they work and who they venerate. It will mean evolving towards a multidisciplinary approach to delivering stories, where reporters, videographers, news application developers, interactive designers, editors and community moderators collaborate on storytelling, instead of being segregated by departments or buildings.

The role models for this emerging practice of data journalism won't be found on broadcast television or on the lists of the top journalists over the past century. They're drawn from the increasing pool of people who are building new breeds of newsrooms and extending the practice of computational journalism. They see the reporting that provisions their journalism as data, a body of work that can itself be collected, analyzed, shared and used to create longitudinal insights about the ways that society, industry or government are changing. (Or not, as the case may be.)

In a recent interview, Emily Bell (@EmilyBell), director of the Tow Center for Digital Journalism at the Columbia University School of Journalism, offered her perspective about what's needed to train the data journalists of the future and the changes that still need to occur in media organizations to maximize their potential. In this context, while the roles of institutions and journalism education are themselves evolving, both will still fundamentally matter for "what's next," as practitioners adapt to changing newsonomics.

Our discussion took place in the context of a notable investment in the future of data journalism: a $2 million research grant to Columbia University from the Knight Foundation to research and distribute best practices for digital reportage, data visualizations and measuring impact. Bell explained how the research effort will help newsrooms determine what's next on the Knight Foundation's blog:

The knowledge gap that exists between the cutting edge of data science, how information spreads, its effects on people who consume information and the average newsroom is wide. We want to encourage those with the skills in these fields and an interest and knowledge in journalism to produce research projects and ideas that will both help explain this world and also provide guidance for journalism in the tricky area of ‘what next’. It is an aim to produce work which is widely accessible and immediately relevant to both those producing journalism and also those learning the skills of journalism.

We are focusing on funding research projects which relate to the transparency of public information and its intersection with journalism, research into what might broadly be termed data journalism, and the third area of ‘impact’ or, more simply put, what works and what doesn’t.

Our interview, lightly edited for content and clarity, follows.

What did you do before you became director of the Tow Center for Digital Journalism?

I spent ten years where I was editor-in-chief of The Guardian website. During the last four of those, I was also overall director of digital content for all The Guardian properties. That included things like mobile applications, et cetera, but from the editorial side.

Over the course of that decade, you saw one or two things change online, in terms of what journalists could do, the tools available to them and the news consumption habits of people. You also saw the media industry change, in terms of the business models and institutions that support journalism as we think of it. What are the biggest challenges and opportunities for the future of journalism?

For newspapers, there was an early warning system: newspaper circulation has not really consistently risen since the early 1980s. We had a long trajectory of increased production and, actually, an overall systemic decline which has been masked by a very, very healthy advertising market, which really went on an incredible bull run with more static pictures and just "widen the pipe," which I think fooled a lot of journalism outlets and publishers into thinking that that was the real disruption.

And, of course, it wasn’t.

The real disruption was the ability of anybody anywhere to upload multimedia content and share it with anybody else who was on a connected device. That was the thing that really hit hard, when you look at 2004 onwards.

What journalism has to do is reinvent its processes, its business models and its skillsets to function in a world where human capital does not scale well, in terms of sifting, presenting and explaining all of this information. That’s really the key to it.

The skills that journalists need to do that -- including identifying a story, knowing why something is important and putting it in context -- are incredibly important. But how you do that, and which particular elements you now use to tell that story, are changing.

Those now include the skills of understanding the platform that you’re operating on and the technologies which are shaping your audiences’ behaviors and the world of data.

By data, I don’t just mean large caches of numbers you might be given or might be released by institutions: I mean that the data thrown off by all of our activity, all the time, is simply transforming the speed and the scope of what can be explained and reported on and identified as stories at a really astonishing speed. If you don’t have the fundamental tools to understand why that change is important and you don’t have the tools to help you interpret and get those stories out to a wide public, then you’re going to struggle to be a sustainable journalist.

The challenge for sustainable journalism going forward is not so different from what exists in other industries: there's a skills gap. Data scientists and data journalists use almost the exact same tools. What are the tools and skills that are needed to make sense of all of this data that you talked about? What will you do to catalog and educate students about them?

It's interesting when you say that the skills of these two fields are very similar, which is absolutely right. First of all, you have a basic level of numeracy needed - and maybe not just a basic level, but a more sophisticated understanding of statistical analysis. That's not something which is routinely taught in journalism schools but that I think will increasingly have to be.

The second thing is having some coding skills or some computer science understanding to help with identifying the best, most efficient tools and the various ways that data is manipulated.

The third thing is that when you're talking about 'data scientists,' it's really a combination of those skills. Adding data doesn't mean you don't have to have the other journalism skills, which do not change: understanding context, understanding what the story might be, and knowing how to derive that from the data that you're given or the data that exists. If it's straightforward, how do you collect it? How do you analyze it? How do you interpret it and present it?

It's easy to say, but it's difficult to do. It's particularly difficult to reorient the skillsets of an industry which has very much resided around the idea of a written story and an ability with editing. Even in the places where I would say there's sophisticated use of data in journalism, it's still a minority sport.

I’ve talked to several heads of data in large news organizations and they’ve said, “We have this huge skills gap because we can find plenty of people who can do the math; we can find plenty of people who are data scientists; we can’t find enough people who have those skills but also have a passion or an interest in telling stories in a journalistic context and making those relatable.”

You need a mindset which is about putting this in the context of the story and spotting stories, as well as having creative and interesting ideas about how you can actually collect this material for your own stories. It's not a passive kind of processing function if you're a data journalist: it's an active seeking, inquiring and discovery process. I think that that's something which is actually available to all journalists.

Think about just local information and how local reporters go out and speak to people every day on the beat, collect information, et cetera. At the moment, most don't structure the information they get from those entities in a way that will help them find patterns and build new stories in the future.

This is not just about an amazing graphic that the New York Times does with census data over the past 150 years. This is about almost every story. Almost every story has some component of reusability or a component where you can collect the data in a way that helps your reporting in the future.

To do that requires a level of knowledge about the tools that you’re using, like coding, Google Refine or Fusion Tables. There are lots of freely available tools out there that are making this easier. But, if you don’t have the mindset that approaches, understands and knows why this is going to help you and make you a better reporter, then it’s sometimes hard to motivate journalists to see why they might want to grab on.

The other thing to say, which is really important, is there is currently a lack of both jobs and role models for people to point to and say, “I want to be that person.”

I think the final thing I would say to the industry is we're getting a lot of smart journalists now. We are one of the schools where all of our digital concentration students this year get a basic grounding in data journalism. Every single one of them. We have an advanced course taught by Susan McGregor in data visualization. But we're producing people from the school now who are being hired to do these jobs, and the people who are hiring them are saying, "Write your own job description because we know we want you to do something, we just don't quite know what it is. Can you tell us?"

You can't cookie-cutter these people out of schools and drop them into existing roles in newsrooms because those roles are still developing. What we're seeing are some very smart reporters with data-centric mindsets and also the ability to do these stories -- but they want to be out reporting. They don't want to be confined to a desk and a spreadsheet. Some editors find that very hard to understand: "Well, what does that job look like?"

I think that this is where working with the industry, we can start to figure some of these things out, produce some experimental work or stories, and do some of the thinking in the classroom that helps people figure out what this whole new world is going to look like.

What do journalism schools need to do to close this 'skills gap?' How do they need to respond to changing business models? What combination of education, training and hands-on experience must they provide?

One of the first things they need to do is identify the problem clearly and be honest about it. I like to think that we’ve done that at Columbia, although I’m not a data journalist. I don’t have a background in it. I’m a writer. I am, if you like, completely the old school.

But one of the things I did do at The Guardian was help people who early on said to me, "Some of this transformation means that we have to think about data as being a core part of what we do." Because of the political context and the position I was in, I was able to recognize that that was an important thing that they were saying and we could push through changes and adoption in those areas of the newsroom.

That's how The Guardian became interested in data. It's the same in journalism school. One of the early things that we talked about [at Columbia] was how we needed to shift some of what the school did on its axis and acknowledge that this was going to be a key part of what we do in the future. Once we acknowledged that that is something we had to work towards, [we hired] Susan McGregor from the Wall Street Journal's Interactive Team. She's an expert in data journalism and has an MA in technology in education.

If you say to me, “Well, what’s the ground vision here?” I would say the same thing I would say to anybody: over time, and hopefully not too long a course of time, we want to attract a type of student that is interested and capable in this approach. That means getting out and motivating and talking to people. It means producing attractive examples which high school children and undergraduate programs think about [in their studies]. It means talking to the CS [computer science] programs -- and, in fact, more about talking to those programs and math majors than you would be talking to the liberal arts professors or the historians or the lawyers or the people who have traditionally been involved.

I think that has an effect: it starts to show people who are oriented towards storytelling but have capabilities which align more with data science skill sets that there's a real task for them. We can't message that early enough as an industry. We can't message it early enough as an educator to get people into those tracks. We have to really make sure that the teaching is high quality and that we're not just carried away with the idea of the new thing; we need to think pretty deeply about how we get those skills.

What sort of basic statistical teaching do you need? What are the skills you need for data visualization? How do you need to introduce design as well as computer science skills into the classroom, in a way which makes sense for stories? How do you tier that understanding?

You're always going to produce superstars. Hopefully, we’ll be producing superstars in this arena soon as well.

We need to take the mission seriously. Then we need to build resources around it. And that’s difficult for educational organizations because it takes time to introduce new courses. It takes time to signal that this is something you think is important.

I think we’ve done a reasonable job of that so far at Columbia, but we’ve got a lot further to go. It's important that institutions like Columbia do take the lead and demonstrate that we think this is something that has to be a core curriculum component.

That’s hard, because journalism schools are known for producing writers. They’re known for different types of narratives. They are not necessarily lauded for producing math or computer science majors. That has to change.

May 15 2012

Profile of the Data Journalist: The Data News Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society. (You can learn more about this world and the emerging leaders of this discipline in the newly released "Data Journalism Handbook.")

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

John Keefe (@jkeefe) is a senior editor for data news and journalism technology at WNYC public radio, based in New York City, NY. He attracted widespread attention when an online map he built using available data beat the Associated Press with Iowa caucus results earlier this year. He's posted numerous tutorials and resources for budding data journalists, including how to map data onto county districts, use APIs, create news apps without a backend content management system and make election results maps. As you'll read below, Keefe is a great example of a journalist who picked up these skills from the data journalism community and the Hacks/Hackers group.

Our interview follows, lightly edited for content and clarity. (I've also added a Twitter list of data journalists from the New York Times' Jacob Harris.)

Where do you work now? What is a day in your life like?

I work in the middle of the WNYC newsroom -- quite literally. So throughout the day, I have dozens of impromptu conversations with reporters and editors about their ideas for maps and data projects, or answer questions about how to find or download data.

Our team works almost entirely on "news time," which means our creations hit the Web in hours and days more often than weeks and months. So I'm often at my laptop creating or tweaking maps and charts to go with online stories. That said, Wednesday mornings it's breakfast at a Chelsea cafe with collaborators at Balance Media to update each other on longer-range projects and tools we make for the newsroom and then open source, like Tabletop.js and our new vertical timeline.

Then there are key meetings, such as the newsroom's daily and weekly editorial discussions, where I look for ways to contribute and help. And because there's a lot of interest and support for data news at the station, I'm also invited to larger strategy and planning meetings.

How did you get started in data journalism? Did you get any special degrees or certificates?

I've been fascinated with the intersection of information, design and technology since I was a kid. In the last couple of years, I've marveled at what journalists at the New York Times, ProPublica and the Chicago Tribune were doing online. I thought the public radio audience, which includes a lot of educated, curious people, would appreciate such data projects at WNYC, where I was news director.

Then I saw that Aron Pilhofer of the New York Times would be teaching a programming workshop at the 2009 Online News Association annual meeting. I signed up. In preparation, I installed Django on my laptop and started following the beginner's tutorial on my subway commute. I made my first "Hello World!" web app on the A Train.

I also started hanging out at Hacks/Hackers meetups and hackathons, where I'd watch people code and ask questions along the way.

Some of my experimentation made it onto WNYC's website -- including our 2010 Census maps and the NYC Hurricane Evacuation map ahead of Hurricane Irene. Shortly thereafter, WNYC management asked me to focus on it full-time.

Did you have any mentors? Who? What were the most important resources they shared with you?

I could not have done so much so fast without kindness, encouragement and inspiration from Pilhofer at the Times; Scott Klein, Al Shaw, Jennifer LaFleur and Jeff Larson at ProPublica; Chris Groskopf, Joe Germuska and Brian Boyer at the Chicago Tribune; and Jenny 8. Lee of, well, everywhere.

Each has unstuck me at various key moments and all have demonstrated in their own work what amazing things were possible. And they have put a premium on sharing what they know -- something I try to carry forward.

The moment I may remember most was at an afternoon geek talk aimed mainly at programmers. After seeing a demo of a phone app called Twilio, I turned to Al Shaw, sitting next to me, and lamented that I had no idea how to play with such things.

"You absolutely can do this," he said.

He encouraged me to pick up Sinatra, a surprisingly easy way to use the Ruby programming language. And I was off.

What does your personal data journalism "stack" look like? What tools could you not live without?

Google Maps - Much of what I can turn around quickly is possible because of Google Maps. I'm also experimenting with MapBox and Geocommons for more data-intensive mapping projects, like our NYC diversity map.

Google Fusion Tables - Essential for my wrangling, merging and mapping of data sets on the fly.

Google Spreadsheets - These have become the "backend" to many of our data projects, giving reporters and editors direct access to the data driving an application, chart or map. We wire them to our apps using Tabletop.js, an open-source program we helped to develop.

TextMate - A programmer's text editor for Mac. There are several out there, and some are free. TextMate is my fave.

The JavaScript Tools Bundle for TextMate - It checks my JavaScript code every time I save, flagging near-invisible, infuriating errors such as a stray comma or a missing parenthesis. I'm certain this one piece of software has given me more days with my kids.

Firebug for Firefox - Lets you see what your code is doing in the browser. Essential for troubleshooting CSS and JavaScript, and great for learning how the heck other people make cool stuff.

Amazon S3 - Most of what we build are static pages of html and JavaScript, which we host in the Amazon cloud and embed into article pages on our CMS.

census.ire.org - A fabulous, easy-to-navigate presentation of US Census data made by a bunch of journo-programmers for Investigative Reporters and Editors. I send someone there probably once a week.
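A minimal Python sketch of the "Google Spreadsheets as backend" pattern Keefe describes above follows. Tabletop.js itself is a JavaScript library, so this is only an illustration of the underlying idea — fetch the published sheet's CSV export and treat it as read-only data — and the sheet URL is a placeholder.

```python
# Tabletop.js is JavaScript; this Python sketch only illustrates the
# pattern it implements: treat a published Google Spreadsheet as a
# read-only backend by fetching its CSV export. URL is a placeholder.
import csv
import io
import urllib.request

SHEET_CSV = ("https://docs.google.com/spreadsheets/d/PLACEHOLDER_ID/"
             "export?format=csv")

with urllib.request.urlopen(SHEET_CSV) as resp:
    rows = list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))

# Editors update the sheet; the app simply re-reads it and redraws
# the chart or map on the next load.
for row in rows[:5]:
    print(row)
```

The design appeal is the same one Keefe points to: reporters and editors get direct access to the data driving an app without touching the code that renders it.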

What data journalism project are you the most proud of working on or creating?

I'd have to say our GOP Iowa Caucuses feature. It has several qualities I like:

  • Mashed-up data -- It mixes live, county vote results with Patchwork Nation community types.
  • A new take -- We know other news sites would shade Iowa's counties by the winner; we shaded them by community type and showed who won which categories.
  • Complete sharability -- We made it super-easy for anyone to embed the map into their own site, which was possible because the results came license-free from the state GOP via Google.
  • Key code from another journalist -- The map-rollover coolness comes from code built by Albert Sun, then of the Wall Street Journal and now at the New York Times.
  • Rapid learning -- I taught myself a LOT of JavaScript quickly.
  • Reusability -- We reused it for each state until Santorum bowed out.


Bonus: I love that I made most of it sitting at my mom's kitchen table over winter break.

Where do you turn to keep your skills updated or learn new things?

WNYC's editors and reporters. They have the bug, and they keep coming up with new and interesting projects. And I find project-driven learning is the most effective way to discover new things. New York Public Radio -- which runs WNYC along with classical radio station WQXR, New Jersey Public Radio and a street-level performance space -- also has a growing stable of programmers and designers, who help me build things, teach me amazing tricks and spot my frequent mistakes.

The IRE/NICAR annual conference. It's a meetup of the best journo-programmers in the country, and it truly seems each person is committed to helping others learn. They're also excellent at celebrating the successes of others.

Twitter. I follow a bunch of folks who seem to tweet the best stuff, and try to keep a close eye on 'em.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Candidates, companies, municipalities, agencies and non-profit organizations all are using data. And a lot of that data is about you, me and the people we cover.

So first off, journalism needs an understanding of the data available and what it can do. It's just part of covering the story now. To skip that part of the world would shortchange our audience, and our democracy. Really.

And the better we can both present data to the general public and tell data-driven (or -supported) stories with impact, the better we can do great journalism.

May 07 2012

A brief history of data journalism

The following is an excerpt from "The Data Journalism Handbook," a collection of essays and resources covering the growing field of data journalism.


In August 2010 some colleagues and I organised what we believe was one of the first international "data journalism" conferences, which took place in Amsterdam. At this time there wasn't a great deal of discussion around this topic and there were only a couple of organizations that were widely known for their work in this area.

The way that media organizations like the Guardian and the New York Times handled the large amounts of data released by WikiLeaks is one of the major steps that brought the term into prominence. Around that time the term started to enter into more widespread usage, alongside "computer-assisted reporting," to describe how journalists were using data to improve their coverage and to augment in-depth investigations into a given topic.

Speaking to experienced data journalists and journalism scholars on Twitter it seems that one of the earliest formulations of what we now recognise as data journalism was in 2006 by Adrian Holovaty, founder of EveryBlock — an information service which enables users to find out what has been happening in their area, on their block. In his short essay "A fundamental way newspaper sites need to change," he argues that journalists should publish structured, machine-readable data, alongside the traditional "big blob of text":

"For example, say a newspaper has written a story about a local fire. Being able to read that story on a cell phone is fine and dandy. Hooray, technology! But what I really want to be able to do is explore the raw facts of that story, one by one, with layers of attribution, and an infrastructure for comparing the details of the fire — date, time, place, victims, fire station number, distance from fire department, names and years experience of firemen on the scene, time it took for firemen to arrive — with the details of previous fires. And subsequent fires, whenever they happen."

But what makes this distinctive from other forms of journalism which use databases or computers? How — and to what extent — is data journalism different from other forms of journalism from the past?

"Computer-Assisted Reporting" and "Precision Journalism"

Using data to improve reportage and delivering structured (if not machine readable) information to the public has a long history. Perhaps most immediately relevant to what we now call data journalism is "computer-assisted reporting" or "CAR," which was the first organised, systematic approach to using computers to collect and analyze data to improve the news.

CAR was first used in 1952 by CBS to predict the result of the presidential election. Since the 1960s, (mainly investigative, mainly U.S.-based) journalists have sought to independently monitor power by analyzing databases of public records with scientific methods. Also known as "public service journalism," advocates of these computer-assisted techniques have sought to reveal trends, debunk popular knowledge and reveal injustices perpetrated by public authorities and private corporations. For example, Philip Meyer tried to debunk received readings of the 1967 riots in Detroit — to show that it was not just less-educated Southerners who were participating. Bill Dedman's "The Color of Money" stories in the 1980s revealed systemic racial bias in the lending policies of major financial institutions. In his "What Went Wrong," Steve Doig sought to analyze the damage patterns from Hurricane Andrew in the early 1990s, to understand the effect of flawed urban development policies and practices. Data-driven reporting has provided valuable public service, and has won journalists famous prizes.

In the early 1970s the term "precision journalism" was coined to describe this type of news-gathering: "the application of social and behavioral science research methods to the practice of journalism." Precision journalism was envisioned to be practiced in mainstream media institutions by professionals trained in journalism and social sciences. It was born in response to "new journalism," a form of journalism in which fiction techniques were applied to reporting. Meyer suggests that scientific techniques of data collection and analysis rather than literary techniques are what is needed for journalism to accomplish its search for objectivity and truth.

Precision journalism can be understood as a reaction to some of journalism's commonly cited inadequacies and weaknesses: dependence on press releases (later described as "churnalism"), bias towards authoritative sources, and so on. These are seen by Meyer as stemming from a lack of application of information science techniques and scientific methods such as polls and public records. As practiced in the 1960s, precision journalism was used to represent marginal groups and their stories. According to Meyer:

"Precision journalism was a way to expand the tool kit of the reporter to make topics that were previously inaccessible, or only crudely accessible, subject to journalistic scrutiny. It was especially useful in giving a hearing to minority and dissident groups that were struggling for representation."

An influential article published in the 1980s about the relationship between journalism and social science echoes current discourse around data journalism. The authors, two U.S. journalism professors, suggest that in the 1970s and 1980s the public's understanding of what news is broadens from a narrower conception of "news events" to "situational reporting," or reporting on social trends. By using databases of — for example — census data or survey data, journalists are able to "move beyond the reporting of specific, isolated events to providing a context which gives them meaning."

As we might expect, the practise of using data to improve reportage goes back as far as "data" has been around. As Simon Rogers points out, the first example of data journalism at the Guardian dates from 1821. It is a leaked table of schools in Manchester listing the number of students who attended each school and its costs. According to Rogers, this helped to show for the first time the real number of students receiving free education, which was much higher than official figures showed.

Data Journalism in the Guardian in 1821 (The Guardian)

Another early example in Europe is Florence Nightingale and her key report, "Mortality of the British Army," published in 1858. In her report to the parliament she used graphics to advocate improvements in health services for the British army. The most famous is her "coxcomb," a spiral of sections, each representing deaths per month, which highlighted that the vast majority of deaths were from preventable diseases rather than bullets.

Mortality of the British Army by Florence Nightingale (Image from Wikipedia)

Data journalism and Computer-Assisted Reporting

At the moment there is a "continuity and change" debate going on around the label "data journalism" and its relationship with these previous journalistic practices which employ computational techniques to analyze datasets.

Some argue that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense data journalism pays as much — and sometimes more — attention to the data itself, rather than using data simply as a means to find or enhance stories. Hence we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.

Another difference is that in the past investigative reporters would suffer from a poverty of information relating to a question they were trying to answer or an issue that they were trying to address. While this is of course still the case, there is also an overwhelming abundance of information that journalists don't necessarily know what to do with. They don't know how to get value out of data. A recent example is the Combined Online Information System, the U.K.'s biggest database of spending information — which was long sought after by transparency advocates, but which baffled and stumped many journalists upon its release. As Philip Meyer recently wrote to me: "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important."

On the other hand, some argue that there is no meaningful difference between data journalism and computer-assisted reporting. It is by now common sense that even the most recent media practices have histories, as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition, but responding to new circumstances and conditions. Even if there might not be a difference in goals and techniques, the emergence of the label "data journalism" at the beginning of the century indicates a new phase wherein the sheer volume of data that is freely available online, combined with sophisticated user-centric tools and self-publishing and crowdsourcing platforms, enables more people to work with more data more easily than ever before.

Data journalism is about mass data literacy

Digital technologies and the web are fundamentally changing the way information is published. Data journalism is one part in the ecosystem of tools and practices that have sprung up around data sites and services. Quoting and sharing source materials is in the nature of the hyperlink structure of the web and the way we are accustomed to navigate information today. Going further back, the principle that sits at the foundation of the hyperlinked structure of the web is the citation principle used in academic works. Quoting and sharing the source materials and the data behind the story is one of the basic ways in which data journalism can improve journalism, what Wikileaks founder Julian Assange calls "scientific journalism."

By enabling anyone to drill down into data sources and find information that is relevant to them, as well as to verify assertions and challenge commonly received assumptions, data journalism effectively represents the mass democratisation of resources, tools, techniques and methodologies that were previously used by specialists — whether investigative reporters, social scientists, statisticians, analysts or other experts. While currently quoting and linking to data sources is particular to data journalism, we are moving towards a world in which data is seamlessly integrated into the fabric of media. Data journalists have an important role in helping to lower the barriers to understanding and interrogating data, and increasing the data literacy of their readers on a mass scale.

At the moment the nascent community of people who call themselves data journalists is largely distinct from the more mature CAR community. Hopefully in the future we will see stronger ties between these two communities, in much the same way that we see new NGOs and citizen media organizations like ProPublica and the Bureau of Investigative Journalism work hand in hand with traditional news media on investigations. While the data journalism community might have more innovative ways of delivering data and presenting stories, the deeply analytical and critical approach of the CAR community is something that data journalism could certainly learn from.

This excerpt was lightly edited. Links were added for EveryBlock, the Guardian Datablog, Texas Tribune datasets, the Combined Online Information System, and Julian Assange's reference to "scientific journalism."

The Data Journalism Handbook (Early Release) — This collaborative book aims to answer questions like: Where can I find data? What tools can I use? How can I find stories in data? (The digital Early Release edition includes raw and unedited content. You'll receive updates when significant changes are made, as well as the final ebook version.)


May 03 2012

Strata Week: Google offers big data analytics

Here are the data stories that caught my attention this week.

BigQuery for everyone

Google has released its big data analytics service BigQuery to the public. Initially made available to a small number of developers late last year, the service is now open to anyone who signs up. A free account lets you query up to 100 GB of data per month, with the option to pay for additional queries and/or storage.

"Google's aim may be to sell data storage in the cloud, as much as it is to sell analytic software," says The New York Times' Quentin Hardy. "A company using BigQuery has to have data stored in the cloud data system, which costs 12 cents a gigabyte a month, for up to two terabytes, or 2,000 gigabytes. Above that, prices are negotiated with Google. BigQuery analysis costs 3.5 cents a gigabyte of data processed."

The interface for BigQuery is meant to lower the bar for these sorts of analytics — there's a UI and a REST interface. In the Times article, Google project manager Ju-kay Kwek says Google is hoping developers build tools that encourage widespread use of the product by executives and other non-developers.

If folks are looking for something to cut their teeth on with BigQuery, GitHub's public timeline is now a publicly available dataset. The data is being synced regularly, so you can query things like popular languages and popular repos. To that end, GitHub is running a data visualization contest.

The Data Journalism Handbook

The Data Journalism Handbook had its release this week at the 2012 International Journalism Festival in Italy. The book, which is freely available and openly licensed, was a joint effort of the European Journalism Centre and the Open Knowledge Foundation. It's meant to serve as a reference for those interested in the field of data journalism.

In the introduction, Deutsche Welle's Mirko Lorenz writes:

"Today, news stories are flowing in as they happen, from multiple sources, eye-witnesses, blogs, and what has happened is filtered through a vast network of social connections, being ranked, commented and more often than not, ignored. This is why data journalism is so important. Gathering, filtering and visualizing what is happening beyond what the eye can see has a growing value."



Open data is a joke?

Tom Slee fired a shot across the bow of the open data movement with a post this week arguing that "the open data movement is a joke." Moreover, it's not a movement at all, he contends. Slee turns a critical eye to the Canadian government's open data efforts in particular, noting that: "The Harper government's actions around 'open government,' and the lack of any significant consequences for those actions, show just how empty the word 'open' has become."

Slee is also critical of open data efforts outside the government, calling the open data movement "a phrase dragged out by media-oriented personalities to cloak a private-sector initiative in the mantle of progressive politics."

Open data activist David Eaves responded strongly to Slee's post with one of his own, acknowledging his own frustrations with "one of the most — if not the most — closed and controlling [governments] in Canada's history." But Eaves takes exception to the ways in which Slee characterizes the open data movement. He contends that many of the corporations involved with the open data movement — something Slee charges has corrupted open data — are U.S. corporations (and points out that in Canada, "most companies don't even know what open data is"). Eaves adds, too, that many of these corporations are led by geeks.

Eaves writes:

"Just as an authoritarian regime can run on open-source software, so too might it engage in open data. Open data is not the solution for Open Government (I don't believe there is a single solution, or that Open Government is an achievable state of being — just a goal to pursue consistently), and I don't believe anyone has made the case that it is. I know I haven't. But I do believe open data can help. Like many others, I believe access to government information can lead to better informed public policy debates and hopefully some improved services for citizens (such as access to transit information). I'm not deluded into thinking that open data is going to provide a steady stream of obvious 'gotcha moments' where government malfeasance is discovered, but I am hopeful that government data can arm citizens with information that the government is using to inform its decisions so that they can better challenge, and ultimately help hold accountable, said government."

Got data news?

Feel free to email me.


March 17 2012

Profile of the Data Journalist: The Homicide Watch

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Chris Amico (@eyeseast) is a journalist and web developer based in Washington, DC, where he works on NPR's State Impact project, building a platform for local reporters covering issues in their states. Laura Norton Amico (@LauraNorton) is the editor of Homicide Watch (@HomicideWatch), an online community news platform in Washington, D.C. that aspires to cover every homicide in the District of Columbia. And yes, the similar names aren't a coincidence: the Amicos were married in 2010.

Since Homicide Watch launched in 2009, it's been earning praise and interest from around the digital world, including a profile by the Nieman Lab at Harvard University that asked whether a local blog "could fill the gaps of DC's homicide coverage." Notably, Homicide Watch has turned up a number of unreported murders.

In the process, the site has also highlighted an important emerging practice that other digital editors should consider: using inbound search engine analytics for reporting. As Steve Myers reported for the Poynter Institute, Homicide Watch used clues in site search queries to ID a homicide victim. We'll see if the Knight Foundation thinks this idea has legs: the husband-and-wife team have applied for a Knight News Challenge grant to build a toolkit for real-time investigative reporting from site analytics.

The Amicos' success with the site, which saw big growth in 2011, offers an important case study in why organizing beats may well hold importance similar to investigative projects. It will also be a case study in sustainability and business models for the "new news," as Homicide Watch looks to license its platform to news outlets across the country.

Below, I've embedded a presentation on Homicide Watch from the January 2012 meeting of the Online News Association. Our interview follows.

[Embedded video: Homicide Watch presentation, streamed by the Online News Association via livestream.com]

Where do you work now? What is a day in your life like?

Laura: I work full time right now for Homicide Watch, a database-driven beat publishing platform for covering homicides. Our flagship site is in DC, and I’m the editor and primary reporter on that site as well as running business operations for the brand.

My typical days start with reporting. First, news checks, and maybe posting some quick posts on anything that’s happened overnight. After that, it’s usually off to court to attend hearings and trials, get documents, reporting stuff. I usually have a to-do list for the day that includes business meetings, scheduling freelancers, mapping out long-term projects, doing interviews about the site, managing our accounting, dealing with awards applications, blogging about the start-up data journalism life on my personal blog and for ONA at journalists.org, guest teaching the occasional journalism class, and meeting deadlines for freelance stories. The work day never really ends; I’m online keeping an eye on things until I go to bed.

Chris: I work for NPR, on the State Impact project, where I build news apps and tools for journalists. With Homicide Watch, I work in short bursts, usually an hour before dinner and a few hours after. I’m a night owl, so if I let myself, I’ll work until 1 or 2 a.m., just hacking at small bugs on the site. I keep a long list of little things I can fix, so I can dip into the codebase, fix something and deploy it, then do something else. Big features, like tracking case outcomes, tend to come from weekend code sprints.

How did you get started in data journalism? Did you get any special degrees or certificates?

Laura: Homicide Watch DC was my first data project. I’ve learned everything I know now from conceiving of the site, managing it as Chris built it, and from working on it. Homicide Watch DC started as a spreadsheet. Our start-up kit for newsrooms starting Homicide Watch sites still includes filling out a spreadsheet. The best lesson I learned when I was starting out was to find out what all the pieces are and learn how to manage them in the simplest way possible.

Chris: My first job was covering local schools in southern California, and data kept creeping into my beat. I liked having firm answers to tough questions, so I made sure I knew, for example, how many graduates at a given high school met the minimum requirements for college. California just has this wealth of education data available, and when I started asking questions of the data, I got stories that were way more interesting.

I lived in Dalian, China for a while. I helped start a local news site with two other expats (Alex Bowman and Rick Martin). We put everything we knew about the city -- restaurant reviews, blog posts, photos from Flickr -- into one big database and mapped it all. It was this awakening moment when suddenly we had this resource where all the information we had was interlinked. When I came back to California, I sat down with a book on Python and Django and started teaching myself to code. I spent a year freelancing in the Bay Area, writing for newspapers by day, learning Python by night. Then the NewsHour hired me.

Did you have any mentors? Who? What were the most important resources they shared with you?

Laura: Chris really coached me through the complexities of data journalism when we were creating the site. He taught me that data questions are editorial questions. When I realized that data could be discussed as an editorial approach, it opened the crime beat up. I learned to ask questions of the information I was gathering in a new way.

Chris: My education has been really informal. I worked with a great reporter at my first job, Bob Wilson, who is a great interviewer of both people and spreadsheets. At NewsHour, I worked with Dante Chinni on Patchwork Nation, who taught me about reporting around a central organizing principle. Since I’ve started coding, I’ve ended up in this great little community of programmer-journalists where people bounce ideas around and help each other out.

What does your personal data journalism "stack" look like? What tools could you not live without?

Laura: The site itself and its database, which I report to and from, WordPress, WordPress analytics, Google Analytics, Google Calendar, Twitter, Facebook, Storify, DocumentCloud, VINElink, and DC Superior Court’s online case lookup.

Chris: Since I write more Python than prose these days, I spend most of my time in a text editor (usually TextMate) on a MacBook Pro. I try not to do anything without git.

What data journalism project are you the most proud of working on or creating?

Laura: Homicide Watch is the best thing I’ve ever done. It’s not just about the data, and it’s not just about the journalism, but it’s about meeting a community need in an innovative way. I started thinking about a Homicide Watch-type site when I was trying to follow a few local cases shortly after moving to DC. It was nearly impossible to find news sources for the information. I did find that family and friends of victims and suspects were posting newsy updates in unusual places -- online obituaries and Facebook memorial pages, for example. I thought a lot about how a news product could fit the expressed need for news, information, and a way for the community to stay in touch about cases.

The data part developed very naturally out of that. The earliest description of the site was “everything a reporter would have in their notebook or on their desk while covering a murder case from start to finish.” That’s still one of the guiding principles of the site, but it’s also meant that organizing that information is super important. What good is making court dates public if you’re not putting them on a calendar, for example?

We started, like I said, with a spreadsheet that listed everything we knew: victim name, age, race, gender, method of death, place of death, link to obituary, photo, suspect name, age, race, gender, case status, incarceration status, detective name, age, race, gender, phone number, judge assigned to case, attorneys connected to the case, co-defendants, connections to other murder cases.

And those are just the basics. Any reporter covering a murder case, crime to conviction, should have that information. What Homicide Watch does is organize it, make as much of it public as we can, and then report from it. It’s led to some pretty cool work, from developing a method to discover news tips in analytics, to simply building news packages that accomplish more than anyone else can.
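As an illustration of what that organizing step can look like, here is a hypothetical sketch in Python, not the actual Homicide Watch schema: each row of the original spreadsheet becomes a structured record that pages, maps and calendars can be generated from.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical record modeled on the spreadsheet fields Laura lists above.
    @dataclass
    class HomicideCase:
        victim_name: str
        victim_age: Optional[int] = None
        method_of_death: Optional[str] = None
        place_of_death: Optional[str] = None
        obituary_url: Optional[str] = None
        suspect_name: Optional[str] = None
        case_status: str = "open"
        detective_name: Optional[str] = None
        judge: Optional[str] = None
        court_dates: List[str] = field(default_factory=list)  # feeds a public calendar

    # One spreadsheet row becomes one record the site can report to and from.
    case = HomicideCase(victim_name="Jane Doe", method_of_death="gunshot",
                        place_of_death="Washington, D.C.")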

Chris: Homicide Watch is really the project I wanted to build for years. It’s data-driven beat reporting, where the platform and the editorial direction are tightly coupled. In a lot of ways, it’s what I had in mind when I was writing about frameworks for reporting.

The site is built to be a crime reporter’s toolkit. It’s built around the way Laura works, based on our conversations over the dinner table for the first six months of the site’s existence. Building it meant understanding the legal system, doing reporting and modeling reality in ways I hadn’t done before, and that was a challenge on both the technical and editorial side.

Where do you turn to keep your skills updated or learn new things?

Laura: Assigning myself new projects and tasks is the best way for me to learn; it forces me to find solutions for what I want to do. I’m not great at seeking out resources on my own, but I keep a close eye on Twitter for what others are doing, saying about it, and reading.

Chris: Part of my usual morning news reading is a run through a bunch of programming blogs. I try to get exposed to technologies that have no immediate use to me, just so it keeps me thinking about other ways to approach a problem and to see what other problems people are trying to solve.

I spend a lot of time trying to reverse-engineer other people’s projects, too. Whenever someone launches a new news app, I’ll try to find the data behind it, take a dive through the source code if it’s available and generally see if I can reconstruct how it came together.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Laura: Working on Homicide Watch has taught me that news is about so much more than “stories.” If you think about a typical crime brief, for example, there’s a lot of information in there, starting with the "who-what-where-when." Once that brief is filed and published, though, all of that information disappears.

Working with news apps gives us the ability to harness that information and reuse/repackage it. It’s about slicing our reporting in as many ways as possible in order to make the most of it. On Homicide Watch, that means maintaining a database and creating features like victims’ and suspects’ pages. Those features help regroup, refocus, and curate the reporting into evergreen resources that benefit both reporters and the community.

Chris: Spend some time with your site analytics. You’ll find that there’s no one thing your audience wants. There isn’t even really one audience. Lots of people want lots of different things at different times, or at least different views of the information you have.

One of our design goals with Homicide Watch is “never hit a dead end.” A user may come in looking for information about a certain case, then decide she’s curious about a related issue, then wonder which cases are closed. We want users to be able to explore what we’ve gathered and to be able to answer their own questions. Stories are part of that, but stories are data, too.

March 08 2012

Profile of the Data Journalist: The Storyteller and The Teacher

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Sarah Cohen (@sarahduke), the Knight professor of the practice of journalism and public policy at Duke University, and Anthony DeBarros (@AnthonyDB), the senior database editor at USA Today, were both important sources of historical perspective for my feature on how data journalism is evolving from "computer-assisted reporting" (CAR) to a powerful Web-enabled practice that uses cloud computing, machine learning and algorithms to make sense of unstructured data.

The latter halves of our interviews, which focused upon their personal and professional experience, follow.

What data journalism project are you the most proud of working on or creating?

DeBarros: "In 2006, my USA TODAY colleague Robert Davis and I built a database of 620 students killed on or near college campuses and mined it to show how freshmen were uniquely vulnerable. It was a heart-breaking but vitally important story to tell. We won the 2007 Missouri Lifestyle Journalism Awards for the piece, and followed it with an equally wrenching look at student deaths from fires."

Cohen: "I'd have to say the Pulitzer-winning series on child deaths in DC, in which we documented that children were dying in predictable circumstances after key mistakes by people who knew that their agencies had specific flaws that could let them fall through the cracks.

I liked working on the Post's POTUS Tracker and Head Count. Those were Web projects that were geared at accumulating lots of little bits about Obama's schedule and his appointees, respectively, that we could share with our readers while simultaneously building an important dataset for use down the road. Some of the Post's Solyndra and related stories, I have heard, came partly from studying the president's trips in POTUS Tracker.

There was one story, called "Misplaced Trust," on DC's guardianship system, that created immediate change in Superior Court, which was gratifying. "Harvesting Cash," our 18-month project on farm subsidies, also helped point out important problems in that system.

The last one, I'll note, is a piece of a project I worked on, in which the DC water authority refused to release the results of a massive lead testing effort, which in turn had shown widespread contamination. We got the survey from a source, but it was on paper.

After scanning, parsing, and geocoding, we sent out a team of reporters to neighborhoods to spot check the data, and also do some reporting on the neighborhoods. We ended up with a story about people who didn't know what was near them.

We also had an interesting experience: the water authority called our editor to complain that we were going to put all of the addresses online -- they felt that it was violating people's privacy, even though we weren't identifying the owners or the residents. It was more important to them that we keep people in the dark about their blocks. Our editor at the time, Len Downie, said, "You're right. We shouldn't just put it on the Web." He also ordered up a special section to put them all in print.

Where do you turn to keep your skills updated or learn new things?

Cohen: "It's actually a little harder now that I'm out of the newsroom, surprisingly. Before, I would just dive into learning something when I'd heard it was possible and I wanted to use it to get to a story. Now I'm less driven, and I have to force myself a little more. I'm hoping to start doing more reporting again soon, and that the Reporters' Lab will help there too.

Lately, I've been spending more time with people from other disciplines to understand better what's possible, like machine learning and speech recognition at Carnegie Mellon and MIT, or natural language processing at Stanford. I can't DO them, but getting a chance to understand what's out there is useful. NewsFoo, SparkCamp and NICAR are the three places that had the best bang this year. I wish I could have gone to Strata, even if I didn't understand it all."

DeBarros: For surveillance, I follow really smart people on Twitter and have several key Google Reader subscriptions.

To learn, I spend a lot of time training after work hours. I've really been pushing myself in the last couple of years to up my game and stay relevant, particularly by learning Python, Linux and web development. Then I bring it back to the office and use it for web scraping and app building.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Cohen: "I think anything that gets more leverage out of fewer people is important in this age, because fewer people are working full time holding government accountable. The news apps help get more eyes on what the government is doing by getting more of what we work with and let them see it. I also think it helps with credibility -- the 'show your work' ethos -- because it forces newsrooms to be more transparent with readers / viewers.

For instance, now, when I'm judging an investigative prize, I am quite suspicious of any project that doesn't let you see each item; i.e., when they say, "there were 300 cases that followed this pattern," I want to see all 300 cases, or all cases with the 300 marked, so I can see whether I agree.

DeBarros: "They're important because we're living in a data-driven culture. A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not.

As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

These interviews were edited and condensed for clarity.

Strata Week: Profiling data journalists

Here are a few of the data stories that caught my attention this week.

Profiling data journalists

Over the past week, O'Reilly's Alex Howard has profiled a number of practicing data journalists, following up on the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. Howard argues that data journalism has enormous importance, but "given the reality that those practicing data journalism remain a tiny percentage of the world's media, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact."

Howard's profiles include:


Surveying data marketplaces

Edd Dumbill takes a look at data marketplaces, the online platforms that host data from various publishers and offer it for sale to consumers. Dumbill compares four of the most mature data marketplaces — Infochimps, Factual, Windows Azure Data Marketplace, and DataMarket — and examines their different approaches and offerings.

Dumbill says marketplaces like these are useful in three ways:

"First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume."

Analyzing sports stats

The Atlantic's Dashiell Bennett examines the MIT Sloan Sports Analytics Conference, a "festival of sports statistics" that has grown over the past six years from 175 attendees to more than 2,200.

Bennett writes:

"For a sports conference, the event is noticeably athlete-free. While a couple of token pros do occasionally appear as panel guests, this is about the people behind the scenes — those who are trying to figure out how to pick those athletes for their team, how to use them on the field, and how much to pay them without looking like a fool. General managers and team owners are the stars of this show ... The difference between them and the CEOs of most companies is that the sports guys have better data about their employees ... and a lot of their customers have it memorized."

Got data news?

Feel free to email me.


March 06 2012

Profile of the Data Journalist: The Daily Visualizer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Matt Stiles (@stiles), a data journalist based in Washington, D.C., maintains a popular Daily Visualization blog. Our interview follows.

Where do you work now? What is a day in your life like?

I work at NPR, where I oversee data journalism on the State Impact project, a local-national partnership between us and member stations. My typical day always begins with a morning "scrum" meeting among the D.C. team as part of our agile development process. I spend time acquiring and analyzing data throughout each day, and I typically work directly with reporters, training them on software and data visualization techniques. I also spend time planning news apps and interactives, a process that requires close consultation with reporters, designers and developers.

How did you get started in data journalism? Did you get any special degrees or certificates?

No special training or certificates, though I did attend three NICAR boot camps (databases, mapping, statistics) over the years.

Did you have any mentors? Who? What were the most important resources they shared with you?

I have several mentors, both on the reporting side and the data side. For data, I wouldn't be where I am today without the help of two people: Chase Davis and Jennifer LaFleur. Jen got me interested early, and has helped me with formal and informal training over the years. Chase helped me with day-to-day questions when we worked together at the Houston Chronicle.

What does your personal data journalism "stack" look like? What tools could you not live without?

I have a MacBook that runs Windows 7. I have the basic CAR suite (Excel/Access, ArcGIS, SPSS, etc.) but also plenty of open-source tools, such as R for visualization or MySQL/Postgres for databases. I use Coda and TextMate for coding. I use BBEdit and Python for text manipulation. I also couldn't live without Photoshop and Illustrator for cleaning up graphics.

What data journalism project are you the most proud of working on or creating?

I'm most proud of the online data library I created (and others have since expanded) at The Texas Tribune, but we're building some sweet apps at NPR. That's only going to expand now that we've created a national news apps team, which I'm joining soon.

Where do you turn to keep your skills updated or learn new things?

I read blogs, subscribe to email lists and attend lots of conferences for inspiration. There's no silver bullet. If you love this stuff, you'll keep up.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

More and more information is coming at us every day. The deluge is so vast. Data journalism at its core is important because it's about facts, not anecdotes.

Apps are important because Americans are already savvy data consumers, even if they don't know it. We must get them thinking -- or, even better, not thinking -- about news consumption in the same way they think about syncing their iPads or booking flights on Priceline or purchasing items on eBay. These are all "apps" that are familiar to many people. Interactive news should be, too.

This interview has been edited and condensed for clarity.

March 05 2012

Profile of the Data Journalist: The API Architect

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Jacob Harris (@harrisj) is an interactive news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work in the Interactive Newsroom team at the New York Times. A day in my life is usually devoted to coding rather than meetings. Currently, I am almost exclusively devoted to the NYT elections coverage, where I switch between loading election results from the AP and building internal APIs that provide data to the various parts of elections.nytimes.com. I also sometimes help fix problems in our server stack when they arise or sometimes get involved in other projects if they need me.

How did you get started in data journalism? Did you get any special degrees or certificates?

I have a classical CS education, with a combined B.A./M.Eng from MIT. I have no journalism background or experience. I never even worked for my newspaper in college or anywhere. I do have a profound skepticism and contrarian nature that does help me fit in well with the journalists.

Did you have any mentors? Who? What were the most important resources they shared with you?

I don't have any specific mentors. But that doesn't mean I haven't been learning from anybody. We're in a very open team and we all usually learn things from each other. Currently, several of the frontend guys are tolerating my new forays into Javascript. Soon, the map guys will learn to bear my questions with patience.

What does your personal data journalism "stack" look like? What tools could you not live without?

Our actual web stack is built on top of EC2, with Phusion Passenger and Ruby on Rails serving our apps. We also use haproxy as a load balancer. Varnish is an amazing cache that everybody should use. On my own machine, I do my coding currently in Sublime Text 2. I use Pivotal Tracker to track my coding tasks. I could probably live with a different editor, but don't take my server stack away from me.

What data journalism project are you the most proud of working on or creating?

I have two projects I'm pretty proud of working on. Last year, I helped out with the Wikileaks War Logs reporting. We built an internal news app for the reporters to search the reports, see them on a map, and tag the most interesting ones. That was an interesting learning experience.

One of the unique things I figured out was how to extract MGRS coordinates from within the reports to geocode the locations inside of them. From this, I was able to distinguish the locations of various homicides within Baghdad more finely than the geocoding for the reports. I built a demo, pitched it to graphics, and we built an effective and sobering look at the devastation on Baghdad from the violence.
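Harris doesn't describe exactly how the extraction worked, so the following is only a minimal sketch of the idea in Python: the regular expression is an approximation of common MGRS strings, and converting matches to latitude/longitude for mapping would be a separate step with a coordinate-conversion library.

    import re

    # Approximate pattern for MGRS grid references such as "38SMB4484306095".
    MGRS_RE = re.compile(r"\b\d{1,2}[C-HJ-NP-X][A-HJ-NP-Z]{2}\d{2,10}\b")

    def extract_mgrs(report_text):
        """Return candidate MGRS coordinate strings found in a report."""
        return MGRS_RE.findall(report_text)

    sample = "At 1430 a patrol reported an incident near 38SMB4484306095."
    print(extract_mgrs(sample))  # ['38SMB4484306095']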

This year, I am working on my third election as part of Interactive News. Although we are proud of our team's work in 2008 and 2010, we've been trying some new ways of presenting our election coverage online and new ways of architecting all of our data sources so that it's easier to build new stuff. It's been gratifying to see how internal APIs combine with new storytelling formats and modern browser technologies this year.

Where do you turn to keep your skills updated or learn new things?

Usually, I just find out about things by following all the other news app developers on Twitter. We're a small ecosystem with lots of sharing. It's great how everybody learns from each other. I have created a Twitter list @harrisj/news-hackers to help keep tabs on all the cool stuff being done out there. (If you know someone who should be on it, let me know.)


Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

We live in a world of data. Our reporting should do a better job of presenting and investigating that data. I think it's been an incredible time for the world of news applications lately. A few years back, it was just an achievement to put data online in a browsable way.

These days, news applications are at a whole other level. Scott Klein of ProPublica put it best when he described all good data stories as including both the "near" (individual cases, examples) and the "far" (national trends, etc.).

In an article, the reporter would pick a few compelling "nears" for the story. As a reader, I also would want to know how my school is performing or how polluted my water supply is.

This is what news applications can do: tell the stories that are found in the data, but also allow the readers to investigate the stories in the data that are highly important to them.

This interview has been edited and condensed for clarity.

OpenCorporates opens up new database of corporate directors and officers

In an age of technology-fueled transparency, corporations are subject to the same powerful disruption as governments. In that context, data journalism has profound importance for society. If a researcher needs data for business journalism, OpenCorporates is a bona fide resource.

Today, OpenCorporates is making a new open database of corporate officers and directors available to the world.

"It's pretty cool, and useful for journalists, to be able to search not just all the companies with directors for a given name in a given state, but across multiple states," said Chris Taggart, founder of Open Corporates, in an email interview. "Not surprisingly, loads of people, from journalists to corruption investigators, are very interested in this."

OpenCorporates is the largest open database of companies and corporate data in the world. The service now contains public data from around the world, from health and safety violations in the United Kingdom to official public notices in Spain to a register of federal contractors. The database has been built by the open data community, under a bounty scheme in conjunction with ScraperWiki. The site also has a useful Google Refine reconciliation function that matches legal entities to company names. Taggart's presentation on OpenCorporates from the 2012 NICAR conference, which provides an overview, is embedded below:

The OpenCorporates open application programming interface can be used with or without a key, although an API key does increase usage limits. The open data site's business model comes with an interesting hook: while OpenCorporates makes its data both free and open under a Share-Alike Attribution Open Database License, users who wish to import the data into a proprietary database or use it without attribution must pay to do so.
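As a rough illustration of keyless use, here is a minimal Python sketch; the endpoint path and response fields are assumptions based on the public companies-search API rather than a definitive reference, and an API token parameter can be added to raise the usage limits.

    import json
    import urllib.request

    # Hypothetical keyless query against the companies-search endpoint.
    url = "https://api.opencorporates.com/companies/search?q=acme"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    for item in data.get("results", {}).get("companies", []):
        company = item.get("company", {})
        # The jurisdiction code and company number are the same identifiers
        # that appear in an OpenCorporates company URI.
        print(company.get("jurisdiction_code"),
              company.get("company_number"),
              company.get("name"))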

"The critical thing about our Directors import, and *all* the other data in OpenCorporates, is that we give the provenance, both where and when we got the information," said Taggart. "This is in contrast to the proprietary databases who never give this, because they don't want you to go straight to the source, which also means it's problematic in tracing the source of errors. We've had several instances of the data being wrong at the source, like U.K. health and safety violations."

Taggart offered more perspective on the source of OpenCorporates director data, corporate data availability and the landscape around a universal business ID in the rest of our interview:

Where does the officer and director data come from? How is it validated and cleaned?

It's all from the official company registers. Most are scraped (we've scraped millions of pages); a couple (e.g. Vermont) are from downloads that the registries provide. We just need to make sure we're scraping and importing properly. We do some cleaning up (e.g. removing some of the '**NO DIRECTOR**' entries), but to a degree this has to be done post-import, as you often don't know about these till they're imported (which is why there are still a few in there).

By the way, in case you were wondering, the reason there are so many more directors than in the filters to the right is that there are about 3 million and counting Florida directors.

Was this data available anywhere before? If no, why not?

As far as I'm aware, only in proprietary databases. Proprietary databases have dominated company data. The result is massive duplication of effort, databases with opaque errors in them because they don't have many eyes on them, and a lack of access for the public, small businesses and, as you will have heard at NICAR, journalists. I'm tempted to offer a bottle of champagne to the first journalist who finds a story in the directors data.

Who else is working on the universal business ID issue? I heard Beth Noveck propose something along these lines, for instance.

Several organizations have been working on this, mostly from a semi-proprietary point of view, or at least trying to generate a monopoly ID. In other words, it might be open, but in order to get anything on the company, you have to use their site as a lookup table.

OpenCorporates is different in that if you know the URI you know the jurisdiction and identity issued by the company register and vice versa. This means you don't need to ask OpenCorporates what the company ID is, as it's there in the ID. It also works with the EU/W3C's Business Vocabulary, which has just been published.

ISO has been working on one, but it's got exactly this problem. Also, their database won't contain the company number, meaning it doesn't link to the legal entity. Bloomberg have been working on one, as have Thomson Reuters, as they need an alternative to the DUNS number, but from the conversations I had in D.C., nobody's terribly interested in this.

I don't really know the status of Beth's project. They were intending to create a new ID too. From speaking to Jim Hendler, it didn't seem to be connected to the legal entity but instead to represent a search of the name (actually a hash of a SPARQL query). You can see a demo site at http://tw.rpi.edu/orgpedia/companies. I have severe doubts regarding this.

Finally, there's the Financial Stability Board's (part of the G20) work on a global legal entity identifier -- we're on the advisory board for this. This also would be a new number, and be voluntary, but on the other hand will be openly licensed.

I don't think it's a solution to the problem, as it won't be complete and for other reasons, but it may surface more information. We'd definitely provide an entity resolution service to it.

March 02 2012

Profile of the Data Journalist: The Visualizer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Michelle Minkoff (@MichelleMinkoff) is an investigative developer/journalist based in Washington, D.C. Our interview follows.

Where do you work now? What is a day in your life like?

I am an Interactive Producer at the Associated Press' Washington DC bureau, where I focus on news applications related to politics and the election, as well as general mapping for our interactives on the Web. While my days pretty much always involve sitting in front of a computer, the actual tasks themselves can vary wildly. I may be chatting with reporters and editors on the politics, environment, education, national security or myriad other beats about upcoming stories and how to use data to support reporting or create interactive stories. I might be gathering data, reformatting it or crafting Web applications. I spend a great deal of time creating interactive mapping systems, working a lot with geographic data, and collaborating with cartographers, editors and designers to decide how to best display it.

I split my time between working closely with my colleagues in the Washington bureau on the reporting/editing side, and my fellow interactive team members, only one of whom is also in DC. Our team is global, headquartered in New York, but with members spanning the globe from Phoenix to Bangkok.

It's a question of walking a balance between what needs to be done on daily deadlines for breaking news, longer-term stories which are often investigative, and creating frameworks that help The Associated Press to make the most of the Web's interactive nature in the long run.

How did you get started in data journalism? Did you get any special degrees or certificates?

I caught the bug when I took a computer-assisted reporting class from Derek Willis, a member of the New York Times' Interactive News Team, at Northwestern's journalism school where I was a grad student. I was fascinated by the role that technology could play in journalism for reporting and presentation, and very quickly got hooked. I also quickly discovered that I could lose track of hours playing with these tools, and that what came naturally to me was not as natural to others. I would spend days reporting for class, on and off Capitol Hill, and nights exchanging gchats with Derek and other data journalists he introduced me to. I started to understand SQL, advanced Excel, and fairly quickly thereafter, Python and Django.

I followed this up with an independent study in data visualization back at Medill's Chicago campus, under Rich Gordon. I practiced making Django apps, played with the Processing visualization language. I voraciously read through all the Tufte books. As a final project, I created a package about the persistence of Chicago art galleries that encompasses text, Flash visualization and a searchable database.

I have a concentration in Interactive Journalism with my Medill master's degree, but the courses mentioned above are but a partial component of that concentration.

Did you have any mentors? Who? What were the most important resources they shared with you?

The question here is in the wrong tense. I currently "do" have many mentors, and I don't know how I would do my job without what they've shared in the past, and in the present. Derek, mentioned above, was the first. He introduced me to his friend Matt [Waite], and then he told me there was a whole group of people doing this work at NICAR. Literally hundreds of people from that organization have helped me at various places on my journey, and I believe strongly in the mantra of "paying it forward" as they have -- no one can know it all, so we pass on what we've learned, so more people can do even better work.

Other key folks I've had the privilege to work with include all of the Los Angeles Times' Data Desk's members, which includes reporters, editors and Web developers. I worked most closely with Ben Welsh and Ken Schwencke, who answered many questions, and were extremely encouraging when I was at the very beginning of my journey.

At my current job at The Associated Press, I'm lucky to have teammates who mentor me in design, mapping and various Washington-based beats. Each is helpful in his or her own way.

Special attention deserves to be called to Jonathan Stray, who's my official boss, but also a fantastic mentor who enables me to do what I do. He's helping me to learn the appropriate technical skills to execute what I see in my head, as well as learn how to learn. He's not just teaching me the answers to the problems we encounter in our daily work, but also helping me learn how to better solve them, and work this whole "thing I do" into a sustainable career path. And all with more patience than I have for myself.

What does your personal data journalism "stack" look like? What tools could you not live without?

No matter how advanced our tools get, I always find myself coming back to Excel first to do simple work. It helps us get an overall handle on a data set. I also will often quickly bring data into SQLite Manager, a Firefox extension that allows a user to run SQL queries with no database setup. I'm more comfortable asking complicated questions of data that way. I also like to use Google's Chart Tools to create quick visualizations for myself to better understand a story.

When it comes to presentation, since I've been doing a lot with mapping recently, I don't know what I'd do without my favorite open source tools, TileMill and Leaflet. Building a map stack is hard work, but the work that others have done before has made it a lot easier.

If we consider programming languages to be tools (which I do), JavaScript is my new Swiss army knife. Prior to coming to the AP, I did a lot with Python and Django, but I've learned a lot about what I like to call "Really Hard JavaScript." It's not just about manipulating the colors of a background on a Web page, but parsing, analyzing and presenting data. When I need to do more complex work to manipulate data, I use a combination of Ruby and Python -- depending on which has better tools for the job. For XML parsing, I like Ruby more. For simplifying geo data, I prefer Python.
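Minkoff doesn't name her geo-simplification tools, so as an illustration only, here is a small Python sketch using the Shapely library (an assumption, not her stated stack) to thin out an overly detailed boundary before putting it on a web map.

    from shapely.geometry import Polygon

    # A hypothetical, overly detailed boundary.
    detailed = Polygon([(0, 0), (1, 0.01), (2, 0), (2, 1), (1.99, 2),
                        (2, 3), (1, 3.01), (0, 3)])

    # Douglas-Peucker simplification: drop vertices that barely change the shape.
    simplified = detailed.simplify(0.05, preserve_topology=True)

    print(len(detailed.exterior.coords), "->", len(simplified.exterior.coords))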

What data journalism project are you the most proud of working on or creating?

That would be " Road to 270", a project we did at the AP that allows users to test out hypothetical "what-if" scenarios for the national election, painting states to define to which candidate a state's delegates could go. It combines demographic and past election data with the ability for users to make a choice and deeply engage with the interactive. It's not just telling the user a story, but informing the user by allowing him or her to be part of the story. That, I believe, is when data journalism becomes its most compelling and informative.

It also uses some advanced technical mapping skills that were new to me. I greatly enjoyed the thrill of learning how to structure a complex application, and add new tools to my toolkit. Now, I don't just have those new tools, but a better understanding of how to add other new tools.

Where do you turn to keep your skills updated or learn new things?

I look at other projects, both within the journalism industry and in general visualization communities. The Web inspector is my best friend. I'm always looking to see how people did things. I read blogs voraciously, and have a fairly robust Google Reader set of people whose work I follow closely. I also use lynda.com frequently (I tend to learn best by video tutorial). Hanging out on listservs for free tools I use (such as Leaflet), programming languages I care about (Python), or projects whose mission our work is related to (Sunlight Foundation) helps me engage with a community that cares about similar issues.

Help sites like Stack Overflow, and pretty much anything I can find on Google, are my other best friends. The not-so-secret secret of data journalism: we're learning as we go. That's part of what makes it so fun.

Really, the learning is not about paper or electronic resources. Like so much of journalism, this is best conquered, I argue, with persistence and stick-to-it-ness. I approach the process of data journalism and Web development as a beat. We attend key meetings. Instead of city council, it's NICAR. We develop vast rolodexes. I know people who have myriad specialties and feel comfortable calling on them. In return, I help people all over the world with this sort of work whenever I can, because it's that important. While we may work for competing places, we're really working toward the same goal: improving the way we inform the public about what's going on in our world. That knowledge matters a great deal.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

More and more information is coming at us every day. The deluge is so vast that we need to not just say things are true, but prove those truths with verifiable facts. Data journalism allows for great specificity, and truths based in the scientific method. Using computers to commit data journalism allows us to process great amounts of information much more efficiently, and make the world more comprehensible to a user.

Also, while we are working with big data, often only a subset of that data is valuable to a specific user. Data journalism and Web development skills allow us to customize those subsets for our various users, such as by localizing a map. That helps us give a more relevant and useful experience to each individual we serve.

Perhaps most importantly, more and more information is digital, and is coming at us through the Internet. It simply makes sense to display that information in an environment similar to the one in which it arrives. Information is dispensed in a different way now than it was five years ago. It will be totally different in another five years. So, our explanations of that environment should match. We must make the most of the Internet to tell our stories differently now than we did before, and differently than we will in the future.

Knowing things are constantly changing, being at the forefront of that change, and enabling the public to understand and participate in that change, is a large part of what makes data journalism so exciting and fundamentally essential.

This interview has been edited and condensed for clarity.

Profile of the Data Journalist: The Human Algorithm

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Ben Welsh (@palewire) is a Web developer and journalist based in Los Angeles. Our interview follows.

Where do you work now? What is a day in your life like?

I work for the Los Angeles Times, a daily newspaper and 24-hour Web site based in Southern California. I'm a member of the Data Desk, a team of reporters and Web developers that specializes in maps, databases, analysis and visualization. We both build Web applications and conduct analysis for reporting projects.

I like to compare The Times to a factory, a factory that makes information. Metaphorically speaking, it has all sorts of different assembly lines. Just to list a few, one makes beautifully rendered narratives, another makes battleship-like investigative projects.

A typical day involves juggling work on different projects, mentally moving from one assembly line to the other. Today I patched an embryonic open-source release, discussed our next move on a pending public records request, guided the real-time publication of results from the GOP primaries in Michigan and Arizona, and did some preparation for how we'll present a larger dump of results on Super Tuesday.

How did you get started in data journalism? Did you get any special degrees or certificates?

I'm thrilled to see new-found interest in "data journalism" online. It's drawing young, bright people into the field and involving people from different domains. But it should be said that the idea isn't new.

I was initiated into the field as a graduate student at the Missouri School of Journalism. There I worked at the National Institute for Computer-Assisted Reporting, also known as NICAR. Decades before anyone called it "data journalism," a disparate group of misfit reporters discovered that the data analysis made possible by computers enabled them to do more powerful investigative reporting. In 1989, they founded NICAR, which has for decades been training journalists in data skills and nurturing a tribe of journalism geeks. In the time since, computerized data analysis has become a dominant force in investigative reporting, responsible for a large share of the field's best work.

To underscore my point, here's a 1986 Time magazine article about how "newsmen are enlisting the machine."

Did you have any mentors? Who? What were the most important resources they shared with you?

My first journalism job was in Chicago. I got a gig working for two great people there, Carol Marin and Don Moseley, who have spent most of their careers as television journalists. I worked as their assistant. Carol and Don are warm people who are good teachers, but they are also excellent at what they do. There was a moment when I realized, "Hey, I can do this!" It wasn't just something I heard about in class, but I could actually see myself doing.

At Missouri, I had a great classmate named Brian Hamman, who is now at the New York Times. I remember seeing how invested Brian was in the Web, totally committed to Web development as a career path. When an opportunity opened up to be a graduate assistant at NICAR, Brian encouraged me to pursue it. I learned enough SQL to help do farmed-out investigative work for TV stations. And, more importantly, I learned that if you had technical skills you could get the job to work on a cool story.

After that I got a job doing data analysis at the Center for Public Integrity in Washington DC. I had the opportunity to work on investigative projects, but also the chance to learn a lot of computer programming along the way. I had the guidance of my talented coworkers, Daniel Lathrop, Agustin Armendariz, John Perry, Richard Mullins and Helena Bengtsson. I learned that computer programming wasn't impossible. They taught me that if you have a manageable task, a few friends to help you out and a door you can close, you can figure out a lot.

What does your personal data journalism "stack" look like? What tools could you not live without?

I do my daily development in the gedit text editor, Byobu's slick implementation of the GNU Screen terminal multiplexer, and the Chromium browser. And, this part may be hard to believe, but I love Ubuntu Unity. I don't understand what everybody is complaining about.

I do almost all of my data management with the Python Web framework Django and the PostgreSQL database, even if the work is an exploratory reporting project that will never be published. I find that the structure of the framework can be useful for organizing just about any data-driven project.
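As a rough illustration of what that workflow can look like -- a hedged sketch, not Welsh's actual code, with a hypothetical campaign-finance CSV and model -- a Django model gives even a throwaway dataset a schema, and the ORM handles loading it into PostgreSQL and querying it:

```python
# Hypothetical models.py in a scratch Django project backed by PostgreSQL.
# The model, field names and CSV layout are invented for illustration.
import csv
from datetime import datetime
from decimal import Decimal

from django.db import models


class Contribution(models.Model):
    """One row of a hypothetical campaign-finance spreadsheet."""
    donor = models.CharField(max_length=200)
    recipient = models.CharField(max_length=200)
    amount = models.DecimalField(max_digits=12, decimal_places=2)
    date = models.DateField()

    @classmethod
    def load_csv(cls, path):
        """Bulk-load the raw file so it can be explored with the ORM."""
        with open(path) as f:
            rows = [
                cls(
                    donor=r["donor"],
                    recipient=r["recipient"],
                    amount=Decimal(r["amount"]),
                    date=datetime.strptime(r["date"], "%Y-%m-%d").date(),
                )
                for r in csv.DictReader(f)
            ]
        cls.objects.bulk_create(rows)

# In a Django shell, the exploratory questions then read like plain queries:
#   Contribution.load_csv("contributions.csv")
#   Contribution.objects.filter(amount__gt=10000).order_by("-amount")
```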

I use GitHub for both version-control and project management. Without it, I'd be lost.

What data journalism project are you the most proud of working on or creating?

As we all know, there's a lot of data out there. And, as anyone who works with it knows, most of it is crap. The projects I'm most proud of have taken large, ugly data sets and refined them into something worth knowing: a nut graf in an investigative story, or a data-driven app that gives the reader some new insight into the world around them. It's impossible to pick one. I like to think the best is still, as they say in the newspaper business, TK.

Where do you turn to keep your skills updated or learn new things?

Twitter is a great way to keep up with what is getting other programmers excited. I know a lot of people find social media overwhelming or distracting, but I feel plugged in and inspired by what I find there. I wouldn't want to live without it.

GitHub is another great source. I've learned so much just exploring other people's code. It's invaluable.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Computers offer us an opportunity to better master information, better understand each other and better watchdog those who would govern us. At last week's NICAR conference, in a talk called "Human-Assisted Reporting," I tried to talk about some of the ways that simply thinking about the process of journalism as an algorithm can point the way forward. In my opinion, we should aspire to write code that embodies the idealistic principles and investigative methods of the previous generation. There's all this data out there now, and journalistic algorithms, "robot reporters," can help us ask it tougher questions.

March 01 2012

In the age of big data, data journalism has profound importance for society

The promise of data journalism was a strong theme throughout the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. In 2012, making sense of big data through narrative and context, particularly unstructured data, will be a central goal for data scientists around the world, whether they work in newsrooms, on Wall Street or in Silicon Valley. Notably, that goal will be substantially enabled by a growing set of common tools, whether they're employed by government technologists opening up Chicago's data, healthcare technologists or newsroom developers.

At NICAR 2012, you could literally see the code underpinning the future of journalism written -- or at least projected -- on the walls.

"The energy level was incredible," said David Herzog, associate professor for print and digital news at the Missouri School of Journalism, in an email interview after NICAR. "I didn't see participants wringing their hands and worrying about the future of journalism. They're too busy building it."

Just as open civic software is increasingly baked into government, open source is playing a pivotal role in the new data journalism.

"Free and open-source tools dominated," said Herzog. "It's clear from the panels and hands-on classes that free and open source tools have eliminated the barrier to entry in terms of many software costs."

While many developers are agnostic with respect to which tools they use to get a job done, the people who are building and sharing tools for data journalism are often doing it with open source code. As Dan Sinker, the head of the Knight-Mozilla News Technology Partnership for Mozilla, wrote afterwards, journo-coders took NICAR 12 "to a whole new level."

While some of that open source development was definitely driven by the requirements of the Knight News Challenge, which funded the PANDA and Overview projects, there's also a collaborative spirit in evidence throughout this community.

This is a group of people who are fiercely committed to "showing your work" -- and for newsroom developers, that means sharing your code. To put it another way: code, don't tell. Sessions on Python, Django, mapping, Google Refine and Google Fusion Tables were packed at NICAR 12.

No, this is not your father's computer-assisted reporting.

"I thought this stacked up as the best NICAR conference since the first in 1993," said Herzog. "It's always been tough to choose from the menu of panels, demos and hands-on classes at NICAR conferences. But I thought there was an abundance of great, informative, sessions put on by the participants. Also, I think NICAR offered a good range of options for newbies and experts alike. For instance, attendees could learn how to map using Google Fusion tables on the beginner's end, or PostGIS and qGIS at the advanced level. Harvesting data through web scraping has become an ever bigger deal for data journalists. At the same time, it's getting easier for folks with no or little programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki. "

On the history of NICAR

According to IRE, NICAR was founded in 1989. Since its founding, the Institute has trained thousands of journalists in how to find, collect and publish electronic information.

Today, "the NICAR conference helps journalists, hackers, and developers figure out best practices, best methods,and best digital tools for doing journalism that involves data analysis and classic reporting in the field," said Brant Houston, former executive director of Investigative Reporters and Editors, in an email interview. "The NICAR conference also obviously includes investigative journalism and the standards for data integrity and credibility."

"I believe the first IRE-sponsored [conference] was in 1993 in Raleigh, when a few reporters were trying to acquire and learn to use spreadsheets, database managers, etc. on newly open electronic records," said Sarah Cohen, the Knight professor of the practice of journalism and public policy at Duke University, in an email interview. "Elliott Jaspin was going around the country teaching reporters how to get data off of 9-track tapes. There really was no public Internet. At the time, it was really, really hard to use the new PC's, and a few reporters were trying to find new stories. The famous ones had been Elliott's school bus drivers who had drunk driving records and the Atlanta Color of Money series on redlining."

"St. Louis was my 10th NICAR conference," said Anthony DeBarros, the senior database editor at USA Today, in an email interview. "My first was in 1999 in Boston. The conference is a place where news nerds can gather and remind themselves that they're not alone in their love of numbers, data analysis, writing code and finding great stories by poring over columns in a spreadsheet. It serves as an important training vehicle for journalists getting started with data in the newsroom, and it's always kept journalists apprised of technological developments that offer new ways of finding and telling stories. At the same time, its connection to IRE keeps it firmly rooted in the best aspects of investigative reporting -- digging up stories that serve the public good.

Baby, you can drive my CAR

Long before we started talking about "data journalism," the practice of computer-assisted reporting (CAR) was growing around the world.

"The practice of CAR has changed over time as the tools and environment in the digital world has changed," said Houston. "So it began in the time of mainframes in the late 60s and then moved onto PCs (which increased speed and flexibility of analysis and presentation) and then moved onto the Web, which accelerated the ability to gather, analyze and present data. The basic goals have remained the same. To sift through data and make sense of it, often with social science methods. CAR tends to be an "umbrella" term - one that includes precision journalism and data driven journalism and any methodology that makes sense of date such as visualization and effective presentations of data."

On one level, CAR is still around because the journalism world hasn't coined a good term to use instead.

"Computer-assisted reporting" is an antiquated term, but most people who practice it have recognized that for years," said DeBarros. "It sticks around because no one has yet to come up with a dynamite replacement. Phil Meyer, the godfather of the movement, wrote a seminal book called "Precision Journalism, and that term is a good one to describe that segment of CAR that deals with statistics and the use of social science methods in newsgathering. As an umbrella term, data journalism seems to be the best description at the moment, probably because it adequately covers most of the areas that CAR has become -- from traditional data-driven reporting to the newer category of news applications."

The most significant shift in CAR may well be when all of those computers being used for reporting were connected through the network of networks in the 1990s.

"It may seem obvious, but of course the Internet changed it all, and for a while it got smushed in with trying to learn how to navigate the Internet for stories, and how to download data," said Cohen. "Then there was a stage when everyone was building internal intranets to deliver public records inside newsrooms to help find people on deadline, etc. So for much of the time, it was focused on reporting, not publishing or presentation. Now the data journalism folks have emerged from the other direction: People who are using data obtained through APIs who often skip the reporting side, and use the same techniques to deliver unfiltered information to their readers in an easier format the the government is giving us. But I think it's starting to come back together -- the so-called data journalists are getting more interested in reporting, and the more traditional CAR reporters are interested in getting their stories on the web in more interesting ways.

Whatever you call it, the goals are still the same.

"CAR has always been about using data to find and tell stories," said DeBarros. "And it still is. What has changed in recent years is more emphasis toward online presentations (interactive maps and applications) and the coding skills required to produce them (JavaScript, HTML/CSS, Django, Ruby on Rails). Earlier NICAR conferences revolved much more around the best stories of the year and how to use data techniques to cover particular topics and beats. That's still in place. But more recently, the conference and the practice has widened to include much more coding and presentation topics. That reflects the state of media -- every newsroom is working overtime to make its content work well on the web, on mobile, and on apps, and data journalists tend to be forward thinkers so it's not surprising that the conference would expand to include those topics."

What stood out at NICAR 2012?

The tools and tactics on display at NICAR were enough to convince Tyler Dukes at Duke to write that "NICAR taught me I know nothing." Browse through the tools, slides and links from NICAR 2012 curated by Chrys Wu to get a sense of just how much is out there. The big theme, however, without a doubt, was data.

"Data really is the meat of the conference, and a quick scan of the schedule shows there were tons of sessions on all kinds of data topics, from the Census to healthcare to crime to education," said DeBarros.

What I saw everywhere at NICAR was interest not simply in what data was out there, however, but in how to get it and put it to use: from finding stories and sources, to providing empirical evidence to back up other reporting, to telling stories with maps and visualizations.

"A major theme was the analysis of data (using spreadsheets, data managers, GIS) that gives journalism more credibility by seeing patterns, trends and outliers," said Houston. "Other themes included collection and analysis of social media, visualization of data, planning and organizing stories based on data analysis, programming for web scraping (data collection from the Web) and mashing up various Web programs."

"Harvesting data through web scraping has become an ever bigger deal for data journalists," said Herzog. "At the same time, it's getting easier for folks with no or little programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki. That said, another message for me was how important programming has become. No, not all journalists or even data journalists need to learn programming. But as Rich Gordon at Medill has said, all journalists should have an appreciation and understanding of what it can do."

Cohen similarly pointed to data, specifically its form. "The theme that I saw this year was a focus on unstructured rather than structured data," she said. "For a long time, we've been hammering governments to give us 'data' in columns and rows. I think we're increasingly seeing that stories are just as likely (if not more likely) to come from the unstructured information that comes from documents, audio and video, tweets and other social media -- from government and non-government sources. The other theme is that there is a lot more collaboration, openness and sharing among competing news organizations. (Witness PANDA and census.ire.org and the New York Times campaign finance API.) But it only goes so far -- you don't see ProPublica sharing the 40+ states' medical licensure data that Dan scraped with everyone else. (I have to admit, though, I haven't asked him to share.) IRE has always been about sharing techniques and tools -- now we're actually sharing source material."

While data dominated NICAR 12, other trends mattered as well, from open mapping tools to macroeconomic trends in the media industry. "A lot of newsrooms are grappling with rapid change in mapping technology," said DeBarros. "Many of us for years did quite well with Flash, but the lack of support for Flash on iPad has fueled exploration into maps built on open source technologies that work across a range of online environments. Many newsrooms are grappling with this, and the number of mapping sessions at the conference reflected this."

There's also serious context to the interest in developing data journalism skills. More than 166 U.S. newspapers have stopped putting out a print edition or closed down altogether since 2008, resulting in more than 35,000 job losses or buyouts in the newspaper industry since 2007.

"The economic slump and the fundamental change in the print publishing business means that journalists are more aware of the business side than ever," said DeBarros, "and I think the conference reflected that more than in the past. There was a great session on turning your good work into money by Chase Davis and Matt Wynn, for example. I was on a panel talking about the business reasons for starting APIs. The general unease most journalists feel knowing that our industry still faces difficult economic times. Watching a new generation of journalists come into the fold has been exciting."

One notable aspect of that next generation of data journalists is that it does not appear likely to look or sound the same as the newsrooms of the 20th century.

"This was the most diverse conference that I can remember," said Herzog. "I saw more women and people of color than ever before. We had data journalists from many countries: Korea, the U.K., Serbia, Germany, Canada, Latin America, Denmark, Sweden and more. Also, the conference is much more diverse in terms of professional skills and interests. Web 2.0 entrepreneurs, programmers, open data advocates, data visualization specialists, educators, and app builders mixed with traditional CAR jockeys. I also saw a younger crowd, a new generation of data journalists who are moving into the fold. For many of the participants, this was their first conference."

What problems does data journalism face?

While the tools are improving, there are still immense challenges ahead, from the technology itself to education to resources in newsrooms. "A major unsolved challenge is making the analysis of unstructured data easier and faster to do. Those working on this include myself, Sarah Cohen, the DocumentCloud team, teams at AP and the Chicago Tribune and many others," said Houston.

There's also the matter of improving the level of fundamental numeracy in the media. "This is going to sound basic, but there are still far too many journalists around the world who cannot open an Excel spreadsheet, sort the values or write an equation to determine percentage change," said DeBarros, "and that includes a large number of the college interns I see year after year, which really scares me. Journalism programs need to step up and understand that we live in a data-rich society, and math skills and basic data analysis skills are highly relevant to journalism. The 400+ journalists at NICAR still represent something of an outlier in the industry, and that has to change if journalism is going to remain relevant in an information-based culture."

In that context, Cohen has high hopes for a new project, the Reporters Lab. "The big unsolved problem to me is that it's still just too hard to use 'data' writ large," she said. "You might have seen 4 or 5 panels on how to scrape data [at NICAR]. People have to write one-off computer programs using Python or Ruby or something to scrape a site, rather than use a tool like Kapow, because newsrooms can't (and never have) invest that kind of money into something that really isn't mission-critical. I think Kapow and its cousins cost $20,000-$40,000 a year. Our project aims to find those kinds of holes and create, commission or adapt free, open source tools for regular reporters to use, not the data journalist skilled in programming. We're building communities of people who want to work on these problems."
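To make the contrast concrete, here is a minimal sketch of the kind of one-off scraper Cohen describes, in Python; the URL and table layout are hypothetical, and every real site needs its own version of this, which is exactly the gap the Reporters Lab wants to fill with reusable tools:

```python
# Hypothetical one-off scraper: pull a records table from a web page into a CSV.
# The URL and the table structure are invented for illustration.
import csv

import requests
from bs4 import BeautifulSoup

URL = "http://example.gov/medical-licenses"  # hypothetical records page

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

with open("licenses.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```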

What role does data journalism play in open government?

On the third day of NICAR 2012, I presented on "open data journalism," which, to paraphrase Jonathan Stray, I'd define as obtaining, reporting upon, curating and publishing open data in the public interest. As someone who's been following the open government movement closely for a few years now, the parallels between what civic hackers are doing and what this community of data journalists is working on are inescapable. They're focused on putting data to work for the public good, whether it's in the public interest, for profit, in the service of civic utility or, in the biggest crossover, government accountability.

To do so will require that data journalists and civic coders alike apply the powerful emerging tools in the newsroom stack to the explosion of digital bits and bytes from government, business and our fellow citizens.

The need for data journalism, in the context of massive amounts of government data being released, could not be any more timely, particularly given persistent data quality issues.

"I can't find any downsides of more data rather than less," said Cohen, "but I worry about a few things."

First, emphasized Cohen, there's an issue of whether data is created open from the beginning -- and the consequences of 'sanitizing' it before release. "The demand for structured, nicely scrubbed data for the purpose of building apps can result in fake records rather than real records being released. USASpending.gov is a good example of that -- we don't get access to the actual spending records like invoices and purchase orders that agencies use, or the systems they use to actually do their business. Instead we have a side system whose only purpose is to make it public, so it's not a high priority inside agencies and there's no natural audit trail on it. It's not used to spend money, so mistakes aren't likely to be caught."

Second, there's the question of whether information relevant to an investigation has been scrubbed for release. "We get the lowest common denominator of information," she said. "There are a lot of records used for accountability that depend on our ability to see personally identifiable information (as opposed to private or personal information, which isn't the same thing). For instance, if you want to do stories on how farm subsidies are paid, you kind of have to know who gets them. If you want to do something on fraud in FEMA claims, you have to be able to find the people and businesses who get the aid. But when it gets pushed out as open government data, it often gets scrubbed of important details and then we have a harder time getting them under FOIA because the agencies say the records are already public."

To address those two issues, Cohen recommends getting more source documents, as a historian would. "I think what we can do is to push harder for actual records, and to not settle for what the White House wants to give us," she said. "We also have to get better at using records that aren't held in nice, neat forms -- they're not born that way, and we should get better at using records in whatever form they exist."

Why do data journalism and news apps matter?

Given the economic and technological context, it might seem like the case for data journalism should make itself. "CAR, data journalism, precision journalism, and news apps all are crucial to journalism -- and the future of journalism -- because they make sense of the tremendous amounts of data," said Houston, "so that people can understand the world and make sensible decisions and policies."

Given the reality that those practicing data journalism remain a tiny percentage of the world's media, however, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact.

"We're living in a data-driven culture," said DeBarros. "A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not. As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

New tools, same rules

While the platforms and toolkits for journalism are evolving and the sources of data are exploding, many things haven't changed. For one, the ethics that guide the choices of the profession remain central to the journalism of the 21st century, as NPR's new ethics guide makes clear.

Whether news developers are rendering data in real-time, validating data in the real world, or improving news coverage with data, good data journalism still must tell a story. And as Erika Owens reflected in her own blog after NICAR, looking back upon a group field trip to the marvelous City Museum in St. Louis, journalism is also joyous, whether one is "crafting the perfect lede or slaying an infuriating bug."

Whether the tool is a smartphone, a notebook or a dataset, it must also extend investigative reporting, as the Los Angeles Times' Doug Smith emphasized to me at the NICAR conference.

If text is the next frontier in data journalism, harnessing the power of big data will be in the service of telling stories more effectively. Digital journalism and the digital humanities are merging in the service of a more informed society.

Profiles of the data journalist

To learn more about the people who are redefining the practice of computer-assisted reporting and, in some cases, building the newsroom stack for the 21st century, Radar conducted a series of email interviews with data journalists during the 2012 NICAR conference. The first two in the series follow below:

Profile of the Data Journalist: The Elections Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Derek Willis (@derekwillis) is a news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work for The New York Times as a developer in the Interactive News Technologies group. A day in my work life usually includes building or improving web applications relating to politics, elections and Congress, although I also get the chance to branch out to do other things. Since elections are such an important subject, I try to think of ways to collect information we might want to display and of ways to get that data in front of readers in an intelligent and creative manner.

How did you get started in data journalism? Did you get any special degrees or certificates?

No, I started working with databases in graduate school at the University of Florida (I left for a job before finishing my master's degree). I had an assistantship at an environmental occupations training center and part of my responsibilities was to maintain the mailing list database. And I just took to it -- I really enjoyed working with data, and once I found Investigative Reporters & Editors, things just took off for me.

Did you have any mentors? Who? What were the most important resources they shared with you?

A ton of mentors, mostly met through IRE but also people at my first newspaper job at The Palm Beach Post. A researcher there, Michelle Quigley, taught me how to find information online and how sometimes you might need to take an indirect route to locating the stuff you want. Kinsey Wilson, now the chief content officer at NPR, hired me at Congressional Quarterly and constantly challenged me to think bigger about data and the news. And my current and former colleagues at The Times and The Washington Post are an incredible source of advice, counsel and inspiration.

What does your personal data journalism "stack" look like? What tools could you not live without?

It's pretty basic: spreadsheets, databases (MySQL, PostgreSQL, SQLite) and a programming language like Python or, these days, Ruby. I've been lucky to find excellent tools in the Ruby world, such as the Remote Table gem by Brighter Planet, and a host of others. I like PostGIS for mapping stuff.
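That spreadsheets-plus-databases workflow can be as simple as the following sketch (illustrative only, not Willis's code; the file and column names are hypothetical): pull a spreadsheet export into SQLite so it can be interrogated with SQL instead of scrolled through by hand.

```python
# Hypothetical example: load a results spreadsheet into SQLite and query it with SQL.
import csv
import sqlite3

conn = sqlite3.connect("elections.db")  # invented database and CSV names
conn.execute(
    "CREATE TABLE IF NOT EXISTS results (candidate TEXT, county TEXT, votes INTEGER)"
)

with open("results.csv") as f:
    rows = [(r["candidate"], r["county"], int(r["votes"])) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
conn.commit()

# Sort each candidate's counties by vote total -- the kind of question
# that is tedious in a spreadsheet and one line of SQL in a database.
for candidate, county, votes in conn.execute(
    "SELECT candidate, county, votes FROM results ORDER BY candidate, votes DESC"
):
    print(candidate, county, votes)
```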

What data journalism project are you the most proud of working on or creating?

I'm really proud of the elections work at The Times, but I can't take credit for how good it looks. A project called Toxic Waters was also incredibly challenging and rewarding to work on. But my favorite might be the first one: the Congressional Votes Database that Adrian Holovaty, Alyson Hurt and I created at The Post in late 2005. It was a milestone for me and for The Post, and helped set the bar for what news organizations could do with data on the web.

Where do you turn to keep your skills updated or learn new things?

My colleagues are my first source. When you work with Jeremy Ashkenas, the author of the Backbone and Underscore JavaScript libraries, you see and learn new things all the time. Our team is constantly bouncing new concepts around. I wish I had more time to learn new things; maybe after the elections!

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

A couple of reasons: one is that we live in an age where information is plentiful. Tools that can help distill and make sense of it are valuable. They save time and convey important insights. News organizations can't afford to cede that role. The second is that they really force you to think about how the reader/user is getting this information and why. I think news apps demand that you don't just build something because you like it; you build it so that others might find it useful.

This email interview has been edited and condensed for clarity.

Profile of the Data Journalist: The Long Form Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Dan Nguyen (@dancow) is an investigative developer/journalist based in Manhattan. Our interview follows.

Where do you work now? What is a day in your life like?

I'm a news app developer at ProPublica, where I've worked for about 3.5 years. It's hard to say what a typical day is like. Ideally, I either have a project or am writing code to collect the data to determine whether a project is worth doing (or just doing old-fashioned reading of articles and papers that may spark ideas for things to look at). We're a small operation, so we all have a hand in the daily news production as well, including helping the reporters put together online features for their more print-focused work.

How did you get started in data journalism? Did you get any special degrees or certificates?

I stumbled into data journalism because I had always been interested in being a journalist, but I double majored in journalism and computer engineering just in case the job market didn't work out. Out of college, I got a good job as a traditional print reporter at a regional newspaper but was eventually asked to help with the newsroom's online side. I got back into programming and started to realize there was a role for programming in important journalism.

Did you have any mentors? Who? What were the most important resources they shared with you?

The mix of programming and journalism is still relatively new, so I didn't have any formal mentors in it. I was of course lucky that my boss at ProPublica, Scott Klein, had a great vision about the role of news applications in our investigative journalism. We were also fortunate to have Brian Boyer (now the news applications editor at the Tribune Company) work with us as we started doing news apps with Ruby on Rails, as he had come into journalism from being a professional developer.

What does your personal data journalism "stack" look like? What tools could you not live without?

In terms of day-to-day tools, I use RVM (Ruby Version Manager) to run multiple versions of Ruby, which is my all-purpose tool for doing any kind of batch task work, text processing and parsing, number crunching, and of course Ruby on Rails development. Git, of course, is essential, and I combine that with Dropbox to keep versioned copies of personal projects and data work. On top of that, my most frequently used tool is Google Refine, which takes the tedium out of exploring new data sets, especially if I have to clean them.
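For a sense of the tedium a tool like Refine removes, here is a hedged sketch of the same kind of cleanup done by hand; it's written in Python rather than Nguyen's Ruby, and the column names and suffix rules are hypothetical.

```python
# Hypothetical cleanup pass: normalize messy company names so that
# "Pfizer Inc.", "PFIZER, INC" and "pfizer" group together in an analysis.
import csv
import re


def normalize(name):
    """Trim whitespace, collapse spaces, fold case and drop common suffixes."""
    name = re.sub(r"\s+", " ", name.strip()).upper()
    return re.sub(r",? (INC|LLC|CORP)\.?$", "", name)


with open("payments.csv") as f:  # invented input file and columns
    rows = list(csv.DictReader(f))

for row in rows:
    row["company_clean"] = normalize(row["company"])
```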

What data journalism project are you the most proud of working on or creating?

The project I'm most proud of is something I did before SOPA Opera, which was our Dollars for Docs project in 2010. It started off with just a blog post I wrote to teach other journalists how web scraping was useful. In this case, I scraped a website Pfizer used to disclose what it paid doctors to do promotional and consulting work. My colleagues noticed and said that we could do that for every company that had been disclosing payments. Because each company disclosed these payments in a variety of formats, including Flash containers and PDFs, few people had tried to analyze these disclosures in bulk, to see nationwide trends in these financial relationships.

A lot of the data work happened behind the scenes, including writing dozens of scrapers to cross-reference our database of payments with state medical board and med school listings. For the initial story, we teamed up with five other newsrooms, including NPR and the Boston Globe, which required building a system in which we could programmatically coordinate data and research. With all the data we had, and the number of reporters and editors working on this outside of our walls, this wasn't a project that would've succeeded by just sending Excel files back and forth.
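The cross-referencing step Nguyen mentions can be sketched roughly as below -- a hedged illustration in Python, not ProPublica's pipeline; the file names, columns and matching threshold are all hypothetical -- joining a payments list to a licensing list on names that rarely match exactly:

```python
# Hypothetical cross-reference of two data sets whose names don't match exactly.
import csv
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude 0-to-1 similarity score between two names."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


with open("payments.csv") as f:        # invented columns: doctor, company, amount
    payments = list(csv.DictReader(f))
with open("medical_board.csv") as f:   # invented columns: name, license_number
    licenses = list(csv.DictReader(f))

matches = []
for payment in payments:
    best = max(licenses, key=lambda lic: similarity(payment["doctor"], lic["name"]))
    if similarity(payment["doctor"], best["name"]) >= 0.9:  # arbitrary cutoff
        matches.append((payment["doctor"], best["license_number"], payment["amount"]))

print("Matched %d of %d payments to license records" % (len(matches), len(payments)))
```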

The website we built from that data is our most visited project yet, as millions of people used it to look up their doctors. Afterwards, we shared our data with any news outlet that asked, so hundreds of independently reported stories came from our data. Among the results were that the drug companies and the med schools revisited their screening and conflict of interest policies.

So, in terms of impact, Dollars for Docs is the project I'm proudest of. But it shares something in common with SOPA Opera (which was mostly a solo project that took a couple of weeks), in that both projects were based on already well-known and long-ago-publicized data. But with data journalism techniques, there are countless new angles to important issues, and countless new and interesting ways to tell their stories.

Where do you turn to keep your skills updated or learn new things?

I check Hacker News and the programming subreddit constantly to see what new hacks, projects, and plugins the community is putting out. I also have a huge backlog of programming books on my Kindle, some of them free ones that were posted on HN.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

I went into journalism because I wanted to be a longform writer in the tradition of the New Yorker. But I'm fortunate that I stumbled onto the path of using programming to do journalism; more and more, I'm seeing how important stories aren't being done even though the data and information are out in broad daylight (as they were in D4D and SOPA Opera) because we have relatively few journalists with the skills or mindset to process and understand that data. Of course, doing this work doesn't preclude me from presenting in a longform article; it just happens that programming also provides even more ways to present a story when narrative isn't the only (or the ideal) way to do so.

February 17 2012

Top stories: February 13-17, 2012

Here's a look at the top stories published across O'Reilly sites this week.

The stories behind a few O'Reilly "classics"
Tim O'Reilly: "It's amazing to me how books I first published more than 20 years ago are still creating value for readers."

How to create a visualization
Creating a visualization requires more than just data and imagery. Pete Warden outlines the process and actions that drove his new Facebook visualization project.

Let's remember why we got into the publishing business
At the 2012 Tools of Change for Publishing Conference this week in New York City, keynoter LeVar Burton reminded the audience why storytelling will always matter.

There's Plan A, and then there's the plan that will become your business
Drawing from the Lean Startup and other methods, "Running Lean" helps entrepreneurs transform flawed Plan A ideas into viable companies. "Running Lean" author Ash Maurya explains the basics in this interview.

The bond between data and journalism grows stronger
This interview with Liliana Bounegru, project coordinator of Data Driven Journalism at the European Journalism Centre, offers more insight into why the importance of data journalism continues to grow in the age of big data.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.
