
May 15 2012

Profile of the Data Journalist: The Data News Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society. (You can learn more about this world and the emerging leaders of this discipline in the newly released "Data Journalism Handbook.")

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

John Keefe (@jkeefe) is a senior editor for data news and journalism technology at WNYC public radio, based in New York City, NY. He attracted widespread attention when an online map he built using available data beat the Associated Press with Iowa caucus results earlier this year. He's posted numerous tutorials and resources for budding data journalists, including how to map data onto county districts, use APIs, create news apps without a backend content management system and make election results maps. As you'll read below, Keefe is a great example of a journalist who picked up these skills from the data journalism community and the Hacks/Hackers group.

Our interview follows, lightly edited for content and clarity. (I've also added a Twitter list of data journalists from the New York Times' Jacob Harris.)

Where do you work now? What is a day in your life like?

I work in the middle of the WNYC newsroom -- quite literally. So throughout the day, I have dozens of impromptu conversations with reporters and editors about their ideas for maps and data projects, and answer questions about how to find or download data.

Our team works almost entirely on "news time," which means our creations hit the Web in hours and days more often than weeks and months. So I'm often at my laptop creating or tweaking maps and charts to go with online stories. That said, Wednesday mornings it's breakfast at a Chelsea cafe with collaborators at Balance Media to update each other on longer-range projects and tools we make for the newsroom and then open source, like Tabletop.js and our new vertical timeline.

Then there are key meetings, such as the newsroom's daily and weekly editorial discussions, where I look for ways to contribute and help. And because there's a lot of interest and support for data news at the station, I'm also invited to larger strategy and planning meetings.

How did you get started in data journalism? Did you get any special degrees or certificates?

I've been fascinated with the intersection of information, design and technology since I was a kid. In the last couple of years, I've marveled at what journalists at the New York Times, ProPublica and the Chicago Tribune were doing online. I thought the public radio audience, which includes a lot of educated, curious people, would appreciate such data projects at WNYC, where I was news director.

Then I saw that Aron Pilhofer of the New York Times would be teaching a programming workshop at the 2009 Online News Association annual meeting. I signed up. In preparation, I installed Django on my laptop and started following the beginner's tutorial on my subway commute. I made my first "Hello World!" web app on the A Train.

I also started hanging out at Hacks/Hackers meetups and hackathons, where I'd watch people code and ask questions along the way.

Some of my experimentation made it onto WNYC's website -- including our 2010 Census maps and the NYC Hurricane Evacuation map ahead of Hurricane Irene. Shortly thereafter, WNYC management asked me to focus on data news full-time.

Did you have any mentors? Who? What were the most important resources they shared with you?

I could not have done so much so fast without kindness, encouragement and inspiration from Pilhofer at the Times; Scott Klein, Al Shaw, Jennifer LaFleur and Jeff Larson at ProPublica; Chris Groskopf, Joe Germuska and Brian Boyer at the Chicago Tribune; and Jenny 8. Lee of, well, everywhere.

Each has unstuck me at various key moments and all have demonstrated in their own work what amazing things were possible. And they have put a premium on sharing what they know -- something I try to carry forward.

The moment I may remember most was at an afternoon geek talk aimed mainly at programmers. After seeing a demo of Twilio, a web service for building phone apps, I turned to Al Shaw, sitting next to me, and lamented that I had no idea how to play with such things.

"You absolutely can do this," he said.

He encouraged me to pick up Sinatra, a surprisingly easy web framework for the Ruby programming language. And I was off.

What does your personal data journalism "stack" look like? What tools could you not live without?

Google Maps - Much of what I can turn around quickly is possible because of Google Maps. I'm also experimenting with MapBox and Geocommons for more data-intensive mapping projects, like our NYC diversity map.
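
A hedged sketch of the kind of quick-turnaround map described above, using the Google Maps JavaScript API; the element ID, coordinates and marker are placeholders, not anything from an actual WNYC project.

    // Minimal quick-turnaround map with the Google Maps JavaScript API.
    // Assumes the Maps API <script> tag is on the page and a <div id="map"> exists.
    function initMap() {
      var map = new google.maps.Map(document.getElementById('map'), {
        center: { lat: 40.7128, lng: -74.0060 }, // New York City
        zoom: 10
      });
      // One hypothetical data point; in practice these come from a data set.
      new google.maps.Marker({
        position: { lat: 40.7484, lng: -73.9857 },
        map: map,
        title: 'Example location'
      });
    }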

Google Fusion Tables - Essential for my wrangling, merging and mapping of data sets on the fly.

Google Spreadsheets - These have become the "backend" to many of our data projects, giving reporters and editors direct access to the data driving an application, chart or map. We wire them to our apps using Tabletop.js, an open-source JavaScript library we helped develop.
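
As a rough sketch of that wiring (the spreadsheet URL and column names below are hypothetical), Tabletop.js reads a published Google Spreadsheet and hands each row to the page as a plain JavaScript object:

    // Tabletop.js sketch: a published Google Spreadsheet serves as the backend.
    // The spreadsheet URL is a placeholder; the sheet must be published to the web.
    var publicSpreadsheetUrl = 'https://docs.google.com/spreadsheets/d/EXAMPLE_KEY/pubhtml';

    Tabletop.init({
      key: publicSpreadsheetUrl,
      simpleSheet: true, // treat the first worksheet as one array of row objects
      callback: function (rows) {
        // Each row is keyed by column header, e.g. { borough: 'Queens', count: '42' }
        rows.forEach(function (row) {
          console.log(row.borough, row.count);
        });
      }
    });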

TextMate - A programmer's text editor for Mac. There are several out there, and some are free. TextMate is my fave.

The JavaScript Tools Bundle for TextMate - It checks my JavaScript code every time I save, flagging near-invisible, infuriating errors such as a stray comma or a missing parenthesis. I'm certain this one piece of software has given me more days with my kids.

Firebug for Firefox - Lets you see what your code is doing in the browser. Essential for troubleshooting CSS and JavaScript, and great for learning how the heck other people make cool stuff.

Amazon S3 - Most of what we build are static pages of HTML and JavaScript, which we host in the Amazon cloud and embed into article pages on our CMS.

census.ire.org - A fabulous, easy-to-navigate presentation of US Census data made by a bunch of journo-programmers for Investigative Reporters and Editors. I send someone there probably once a week.

What data journalism project are you the most proud of working on or creating?

I'd have to say our GOP Iowa Caucuses feature. It has several qualities I like:

  • Mashed-up data -- It mixes live, county vote results with Patchwork Nation community types.
  • A new take -- We know other news sites would shade Iowa's counties by the winner; we shaded them by community type and showed who won which categories.
  • Complete sharability -- We made it super-easy for anyone to embed the map into their own site, which was possible because the results came license-free from the state GOP via Google.
  • Key code from another journalist -- The map-rollover coolness comes from code built by Albert Sun, then of the Wall Street Journal and now at the New York Times.
  • Rapid learning -- I taught myself a LOT of JavaScript quickly.
  • Reusability -- We reused it for each state's contest until Santorum bowed out.


Bonus: I love that I made most of it sitting at my mom's kitchen table over winter break.

Where do you turn to keep your skills updated or learn new things?

WNYC's editors and reporters. They have the bug, and they keep coming up with new and interesting projects. And I find project-driven learning is the most effective way to discover new things. New York Public Radio -- which runs WNYC along with classical radio station WQXR, New Jersey Public Radio and a street-level performance space -- also has a growing stable of programmers and designers, who help me build things, teach me amazing tricks and spot my frequent mistakes.

The IRE/NICAR annual conference. It's a meetup of the best journo-programmers in the country, and it truly seems each person is committed to helping others learn. They're also excellent at celebrating the successes of others.

Twitter. I follow a bunch of folks who seem to tweet the best stuff, and try to keep a close eye on 'em.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Candidates, companies, municipalities, agencies and non-profit organizations all are using data. And a lot of that data is about you, me and the people we cover.

So first off, journalism needs an understanding of the data available and what it can do. It's just part of covering the story now. To skip that part of the world would shortchange our audience, and our democracy. Really.

And the better we can both present data to the general public and tell data-driven (or -supported) stories with impact, the better we can do great journalism.

March 17 2012

Profile of the Data Journalist: The Homicide Watch

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Chris Amico (@eyeseast) is a journalist and web developer based in Washington, DC, where he works on NPR's State Impact project, building a platform for local reporters covering issues in their states. Laura Norton Amico (@LauraNorton) is the editor of Homicide Watch (@HomicideWatch), an online community news platform in Washington, D.C. that aspires to cover every homicide in the District of Columbia. And yes, the similar names aren't a coincidence: the Amicos were married in 2010.

Since Homicide Watch launched in 2009, it's been earning praise and interest from around the digital world, including a profile by the Nieman Lab at Harvard University that asked whether a local blog "could fill the gaps of DC's homicide coverage." Notably, Homicide Watch has turned up a number of unreported murders.

In the process, the site has also highlighted an important emerging source of reporting data that other digital editors should consider: inbound search engine analytics. As Steve Myers reported for the Poynter Institute, Homicide Watch used clues in site search queries to ID a homicide victim. We'll see if the Knight Foundation thinks this idea has legs: the husband and wife team have applied for a Knight News Challenge grant to build a toolkit for real-time investigative reporting from site analytics.
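
The mechanics behind that idea are simple enough to sketch. Below is a purely illustrative example (the log file, its one-query-per-line format, the known-names list and the threshold are all invented) of tallying internal search queries and flagging repeated ones that don't match cases already covered:

    // Hypothetical Node.js sketch: surface repeated site searches that don't
    // match any case already in the coverage database.
    var fs = require('fs');

    var knownNames = new Set(['john doe', 'jane roe']); // cases already covered

    var counts = {};
    fs.readFileSync('site-search.log', 'utf8') // assumed: one query per line
      .split('\n')
      .filter(Boolean)
      .forEach(function (line) {
        var query = line.trim().toLowerCase();
        counts[query] = (counts[query] || 0) + 1;
      });

    Object.keys(counts)
      .filter(function (q) { return !knownNames.has(q) && counts[q] >= 5; })
      .sort(function (a, b) { return counts[b] - counts[a]; })
      .forEach(function (q) {
        console.log(counts[q] + 'x\t' + q); // candidate tips for a reporter to check
      });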

The Amicos' success with the site -- which saw big growth in 2011 -- offers an important case study in why organized beats may well hold as much importance as investigative projects. It will also be a case study in sustainability and business models for the "new news," as Homicide Watch looks to license its platform to news outlets across the country.

Below, I've embedded a presentation on Homicide Watch from the January 2012 meeting of the Online News Association. Our interview follows.


Where do you work now? What is a day in your life like?

Laura: I work full time right now for Homicide Watch, a database-driven beat publishing platform for covering homicides. Our flagship site is in DC, and I’m the editor and primary reporter on that site as well as running business operations for the brand.

My typical days start with reporting. First come news checks, and maybe some quick posts on anything that’s happened overnight. After that, it’s usually off to court to attend hearings and trials, get documents, reporting stuff. I usually have a to-do list for the day that includes business meetings, scheduling freelancers, mapping out long-term projects, doing interviews about the site, managing our accounting, dealing with awards applications, blogging about the start-up data journalism life on my personal blog and for ONA at journalists.org, guest teaching the occasional journalism class, and meeting deadlines for freelance stories. The work day never really ends; I’m online keeping an eye on things until I go to bed.

Chris: I work for NPR, on the State Impact project, where I build news apps and tools for journalists. With Homicide Watch, I work in short bursts, usually an hour before dinner and a few hours after. I’m a night owl, so if I let myself, I’ll work until 1 or 2 a.m., just hacking at small bugs on the site. I keep a long list of little things I can fix, so I can dip into the codebase, fix something and deploy it, then do something else. Big features, like tracking case outcomes, tend to come from weekend code sprints.

How did you get started in data journalism? Did you get any special degrees or certificates?

Laura: Homicide Watch DC was my first data project. I’ve learned everything I know now from conceiving of the site, managing it as Chris built it, and from working on it. Homicide Watch DC started as a spreadsheet. Our start-up kit for newsrooms starting Homicide Watch sites still includes filling out a spreadsheet. The best lesson I learned when I was starting out was to find out what all the pieces are and learn how to manage them in the simplest way possible.

Chris: My first job was covering local schools in southern California, and data kept creeping into my beat. I liked having firm answers to tough questions, so I made sure I knew, for example, how many graduates at a given high school met the minimum requirements for college. California just has this wealth of education data available, and when I started asking questions of the data, I got stories that were way more interesting.

I lived in Dalian, China for a while. I helped start a local news site with two other expats (Alex Bowman and Rick Martin). We put everything we knew about the city -- restaurant reviews, blog posts, photos from Flickr -- into one big database and mapped it all. It was this awakening moment when suddenly we had this resource where all the information we had was interlinked. When I came back to California, I sat down with a book on Python and Django and started teaching myself to code. I spent a year freelancing in the Bay Area, writing for newspapers by day, learning Python by night. Then the NewsHour hired me.

Did you have any mentors? Who? What were the most important resources they shared with you?

Laura: Chris really coached me through the complexities of data journalism when we were creating the site. He taught me that data questions are editorial questions. When I realized that data could be discussed as an editorial approach, it opened the crime beat up. I learned to ask questions of the information I was gathering in a new way.

Chris: My education has been really informal. I worked with a great reporter at my first job, Bob Wilson, who is a great interviewer of both people and spreadsheets. At NewsHour, I worked on Patchwork Nation with Dante Chinni, who taught me about reporting around a central organizing principle. Since I’ve started coding, I’ve ended up in this great little community of programmer-journalists where people bounce ideas around and help each other out.

What does your personal data journalism "stack" look like? What tools could you not live without?

Laura: The site itself and its database, which I report to and from; WordPress and its analytics; Google Analytics; Google Calendar; Twitter; Facebook; Storify; DocumentCloud; VINELink; and DC Superior Court’s online case lookup.

Chris: Since I write more Python than prose these days, I spend most of my time in a text editor (usually TextMate) on a MacBook Pro. I try not to do anything without git.

What data journalism project are you the most proud of working on or creating?

Laura: Homicide Watch is the best thing I’ve ever done. It’s not just about the data, and it’s not just about the journalism; it’s about meeting a community need in an innovative way. I started thinking about a Homicide Watch-type site when I was trying to follow a few local cases shortly after moving to DC. It was nearly impossible to find news sources for the information. I did find that family and friends of victims and suspects were posting newsy updates in unusual places -- online obituaries and Facebook memorial pages, for example. I thought a lot about how a news product could fit the expressed need for news, information, and a way for the community to stay in touch about cases.

The data part developed very naturally out of that. The earliest description of the site was “everything a reporter would have in their notebook or on their desk while covering a murder case from start to finish.” That’s still one of the guiding principles of the site, but it’s also meant that organizing that information is super important. What good is making court dates public, for example, if you’re not putting them on a calendar?

We started, like I said, with a spreadsheet that listed everything we knew: victim name, age, race, gender, method of death, place of death, link to obituary, photo, suspect name, age, race, gender, case status, incarceration status, detective name, age, race, gender, phone number, judge assigned to case, attorneys connected to the case, co-defendants, connections to other murder cases.

And those are just the basics. Any reporter covering a murder case, crime to conviction, should have that information. What Homicide Watch does is organize it, make as much of it public as we can, and then report from it. It’s led to some pretty cool work, from developing a method to discover news tips in analytics, to simply building news packages that accomplish more than anyone else can.

Chris: Homicide Watch is really the project I wanted to build for years. It’s data-driven beat reporting, where the platform and the editorial direction are tightly coupled. In a lot of ways, it’s what I had in mind when I was writing about frameworks for reporting.

The site is built to be a crime reporter’s toolkit. It’s built around the way Laura works, based on our conversations over the dinner table for the first six months of the site’s existence. Building it meant understanding the legal system, doing reporting and modeling reality in ways I hadn’t done before, and that was a challenge on both the technical and editorial side.

Where do you turn to keep your skills updated or learn new things?

Laura: Assigning myself new projects and tasks is the best way for me to learn; it forces me to find solutions for what I want to do. I’m not great at seeking out resources on my own, but I keep a close eye on Twitter for what others are doing, saying about it, and reading.

Chris: Part of my usual morning news reading is a run through a bunch of programming blogs. I try to get exposed to technologies that have no immediate use to me, just so it keeps me thinking about other ways to approach a problem and to see what other problems people are trying to solve.

I spend a lot of time trying to reverse-engineer other people’s projects, too. Whenever someone launches a new news app, I’ll try to find the data behind it, take a dive through the source code if it’s available and generally see if I can reconstruct how it came together.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Laura: Working on Homicide Watch has taught me that news is about so much more than “stories.” If you think about a typical crime brief, for example, there’s a lot of information in there, starting with the "who-what-where-when." Once that brief is filed and published, though, all of that information disappears.

Working with news apps gives us the ability to harness that information and reuse/repackage it. It’s about slicing our reporting in as many ways as possible in order to make the most of it. On Homicide Watch, that means maintaining a database and creating features like victims’ and suspects’ pages. Those features help regroup, refocus, and curate the reporting into evergreen resources that benefit both reporters and the community.

Chris: Spend some time with your site analytics. You’ll find that there’s no one thing your audience wants. There isn’t even really one audience. Lots of people want lots of different things at different times, or at least different views of the information you have.

One of our design goals with Homicide Watch is “never hit a dead end.” A user may come in looking for information about a certain case, then decide she’s curious about a related issue, then wonder which cases are closed. We want users to be able to explore what we’ve gathered and to be able to answer their own questions. Stories are part of that, but stories are data, too.

March 08 2012

Profile of the Data Journalist: The Storyteller and The Teacher

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Sarah Cohen (@sarahduke), the Knight professor of the practice of journalism and public policy at Duke University, and Anthony DeBarros (@AnthonyDB), the senior database editor at USA Today, were both important sources of historical perspective for my feature on how data journalism is evolving from "computer-assisted reporting" (CAR) to a powerful Web-enabled practice that uses cloud computing, machine learning and algorithms to make sense of unstructured data.

The latter halves of our interviews, which focused upon their personal and professional experience, follow.

What data journalism project are you the most proud of working on or creating?

DeBarros: "In 2006, my USA TODAY colleague Robert Davis and I built a database of 620 students killed on or near college campuses and mined it to show how freshmen were uniquely vulnerable. It was a heart-breaking but vitally important story to tell. We won the 2007 Missouri Lifestyle Journalism Awards for the piece, and followed it with an equally wrenching look at student deaths from fires."

Cohen: "I'd have to say the Pulitzer-winning series on child deaths in DC, in which we documented that children were dying in predictable circumstances after key mistakes by people who knew that their agencies had specific flaws that could let them fall through the cracks.

I liked working on the Post's POTUS Tracker and Head Count. Those were Web projects geared toward accumulating lots of little bits about Obama's schedule and his appointees, respectively, which we could share with our readers while simultaneously building an important dataset for use down the road. Some of the Post's Solyndra and related stories, I have heard, came partly from studying the president's trips in POTUS Tracker.

There was one story, called "Misplaced Trust," on DC's guardianship system, that created immediate change in Superior Court, which was gratifying. "Harvesting Cash," our 18-month project on farm subsidies, also helped point out important problems in that system.

The last one, I'll note, is a piece of a project I worked on, in which the DC water authority refused to release the results of a massive lead testing effort, which in turn had shown widespread contamination. We got the survey from a source, but it was on paper.

After scanning, parsing, and geocoding, we sent out a team of reporters to neighborhoods to spot check the data, and also do some reporting on the neighborhoods. We ended up with a story about people who didn't know what was near them.

We also had an interesting experience: the water authority called our editor to complain that we were going to put all of the addresses online -- they felt that it was violating people's privacy, even though we weren't identifying the owners or the residents. It was more important to them that we keep people in the dark about their blocks. Our editor at the time, Len Downie, said, "You're right. We shouldn't just put it on the Web." He also ordered up a special section to put them all in print.

Where do you turn to keep your skills updated or learn new things?

Cohen: "It's actually a little harder now that I'm out of the newsroom, surprisingly. Before, I would just dive into learning something when I'd heard it was possible and I wanted to use it to get to a story. Now I'm less driven, and I have to force myself a little more. I'm hoping to start doing more reporting again soon, and that the Reporters' Lab will help there too.

Lately, I've been spending more time with people from other disciplines to understand better what's possible, like machine learning and speech recognition at Carnegie Mellon and MIT, or natural language processing at Stanford. I can't DO them, but getting a chance to understand what's out there is useful. NewsFoo, SparkCamp and NICAR are the three places that had the best bang this year. I wish I could have gone to Strata, even if I didn't understand it all."

DeBarros: For surveillance, I follow really smart people on Twitter and have several key Google Reader subscriptions.

To learn, I spend a lot of time training after work hours. I've really been pushing myself in the last couple of years to up my game and stay relevant, particularly by learning Python, Linux and web development. Then I bring it back to the office and use it for web scraping and app building.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Cohen: "I think anything that gets more leverage out of fewer people is important in this age, because fewer people are working full time holding government accountable. The news apps help get more eyes on what the government is doing by getting more of what we work with and let them see it. I also think it helps with credibility -- the 'show your work' ethos -- because it forces newsrooms to be more transparent with readers / viewers.

For instance, now, when I'm judging an investigative prize, I am quite suspicious of any project that doesn't let you see each item. That is, when they say, "there were 300 cases that followed this pattern," I want to see all 300 cases, or all cases with the 300 marked, so I can see whether I agree."

DeBarros: "They're important because we're living in a data-driven culture. A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not.

As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

These interviews were edited and condensed for clarity.

Profile of the Data Journalist: The Hacks Hacker

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference. This interview followed the conference and featured a remote participant who diligently used social media and the World Wide Web to document and share the best of NICAR:

Chrys Wu (@MacDiva) is a data journalist and user engagement strategist based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work with clients through my company, Matchstrike, which specializes in user engagement strategy. It's a combination of user experience research, design and program planning. Businesses turn to me to figure out how to keep people's attention, create community and tie that back to return on investment.

I also launch Hacks/Hackers chapters around the world and co-organize the group in New York with Al Shaw of ProPublica and Jacqui Cox of The New York Times.

Both things involve seeking out people and ideas, asking questions, reading, wireframing and understanding what motivates people as individuals and as groups.

How did you get started in data journalism? Did you get any special degrees or certificates?

I had a stats class in high school with a really terrific instructor who also happened to be the varsity basketball coach. He was kind of like our John Wooden. Realizing the importance of statistics -- being able to organize and interpret data, and learning how to be skeptical of claims (e.g., where "4 out of 5 dentists agree" comes from) -- has always stayed with me.

Other than that class and studying journalism at university, what I know has come from exploring (finding what's out there), doing (making something) and working (making something for money). I think that's pretty similar to most journalists and journalist-developers currently in the field.

Though I've spent several years in newsrooms (most notably with the Los Angeles Times and CBS Digital Media Group), most of my journalism and communications career has been as a freelancer. One of my earliest clients specialized in fundraising for Skid Row shelters. I quantified the need cases for her proposals. That involved working closely with the city health and child welfare departments and digging through a lot of data.

Once I figured that out, it was important to balance the data with narrative. Numbers and charts have a much more profound impact on people if they're framed by an idea to latch onto and a compelling story to share.

Did you have any mentors? Who? What were the most important resources they shared with you?

I don't have individual mentors, but there's an active community with a huge body of work out there to learn from. It's one of the reasons why I've been collecting things on Delicious and Pinboard, and it's why I try my best to put everything that's taught at NICAR on my blog.

I always try to look beyond journalism to see what people are thinking about and doing in other fields. Great ideas can come from everywhere. There are lots of very smart people willing to share what they know.

What does your personal data journalism "stack" look like? What tools could you not live without?

I use Coda and TextMate most often. For wireframing, I'm a big fan of OmniGraffle. I code in Ruby, and a little bit in Python. I'm starting to learn how to use R for dataset manipulation and for its maps library.

For keeping tabs on new but not urgent-to-read material, I use my friend Samuel Clay's RSS reader, NewsBlur.

What data journalism project are you the most proud of working on or creating?

I'm most proud of working with the Hacks/Hackers community. Since 2009, we've grown to more than 40 groups worldwide, with each locality bringing journalists, designers and developers together to push what's possible for news.

As I say, talking is good; making is better — and the individual Hacks/Hackers chapters have all done some version of that: presentations, demos, classes and hack days. They're all opportunities to share knowledge, make friends and create new things that help people better understand what's happening around them.

Where do you turn to keep your skills updated or learn new things?

MIT's open courses have been great. There are also blogs, mailing lists, meetups, lectures and conferences. And then there's talking with friends and people they know.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

I like Amanda Cox's view of the importance of reporting through data. She's a New York Times graphics editor who comes from a statistics background. To paraphrase: Presenting a pile of facts and numbers without directing people toward any avenue of understanding is not useful.

Journalism is fundamentally about fact-finding and opening eyes. One of the best ways to do that, especially when lots of people are affected by something, is to interweave narrative with quantifiable information.

Data journalism and news apps create the lens that shows people the big picture they couldn't see but maybe had a hunch about otherwise. That's important for a greater understanding of the things that matter to us as individuals and as a society.

This interview has been edited and condensed for clarity.

March 06 2012

Profile of the Data Journalist: The Data Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Meghan Hoyer (@MeghanHoyer) is a data editor based in Virginia. Our interview follows.

Where do you work now? What is a day in your life like?

I work in an office within The Virginian-Pilot’s newsroom. I’m a one-person team, so there’s no such thing as typical.

What I might do: Help a reporter pull Census data, work with IT on improving our online crime report app, create a DataTable of city property assessment changes, and plan training for a group of co-workers who’d like to grow their online skills. At least, that’s what I’m doing today.

Tomorrow, it’ll be helping with our online election report, planning a strategy to clean a dirty database, and working with a reporter to crunch data for a crime trend story.

How did you get started in data journalism? Did you get any special degrees or certificates?

I have a journalism degree from Northwestern, but I got started the same way most reporters probably got started - I had questions about my community and I wanted quantifiable answers. How had the voting population in a booming suburb changed? Who was the region’s worst landlord? Were our localities going after delinquent taxpayers? Anecdotes are nice, but it’s an amazingly powerful thing to be able to get the true measure of a situation. Numbers and analysis help provide a better focus - and sometimes, they entirely upend your initial theory.

Did you have any mentors? Who? What were the most important resources they shared with you?

I haven’t collected a singular mentor as much as a group of people whose work I keep tabs on for inspiration and follow-up. The news community is pretty small. A lot of people have offered suggestions, guidance, cheat sheets and help over the years. Data journalism -- from analysis to building apps -- is definitely not something you can or need to learn in a bubble all on your own.

What does your personal data journalism "stack" look like? What tools could you not live without?

In terms of daily tools, I keep it basic: Google docs, Fusion Tables and Refine, QGIS, SQLite and Excel are all in use pretty much every day.

I’ve learned some Python and JavaScript for specific projects and to automate some of the newsroom’s daily tasks, but I definitely don’t have the programming or technical background that a lot of people in this field have. That’s left me trying to learn as much as I can as quickly as I can.

In terms of a data stack, we keep information such as public employee salaries, land assessment databases and court record databases (among others) updated in a shared drive in our newsroom. It’s amazing how often reporters use them, even if it’s just to find out which properties a candidate owns or how long a police officer caught at a DUI checkpoint has been on the force.

What data journalism project are you the most proud of working on or creating?

A few years ago, I combined property ownership records, code enforcement citations, real estate tax records and rental inspection information from all our local cities and found a company with hundreds of derelict properties.

Their properties seemed to change hands often, so a partner and I then hand-built a database from thousands of land deeds that proved the company was flipping houses among investors in a $26 million mortgage fraud scheme. None of the cities in our region had any idea this was going on because they were dealing with each parcel as a separate entity.

That’s what combining sets of data can get you - a better overall view of what’s really happening. While government agencies are great at collecting piles of data, it’s that kind of larger analysis that’s missing.

Where do you turn to keep your skills updated or learn new things?

To be honest - Twitter. I get a lot of ideas and updates on new tools there. And the NICAR conference and listserv. Usually when you hit up against a problem - whether it’s dealing with a dirty dataset or figuring out how to best visualize your data -- it’s something that someone else has already faced.

I also learn a lot from the people within our newsroom. We have a talented group of web producers who all are eager to try new things and learn.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Data is everywhere, but in most cases it’s just stockpiled and warehoused without a second thought to analysis or using it to solve larger problems.

Journalists are in a unique position to make sense of it, to find the stories in it, to make sure that governments and organizations are considering the larger picture.

I think, too, that people in our field need to truly push for open government -- not in the sense of government building interfaces for data, but of simply releasing raw data streams. Government is still far too stuck in the “Here’s a PDF of a spreadsheet” mentality. That doesn’t create informed citizens, and it doesn’t lead to innovative ways of thinking about government.

I’ve been involved recently in a community effort to create an API and then apps out of the regional transit authority’s live bus GPS stream. It has been a really fun project - and something that I hope other local governments in our area take note of.

Profile of the Data Journalist: The Daily Visualizer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Matt Stiles (@stiles), a data journalist based in Washington, D.C., maintains a popular Daily Visualization blog. Our interview follows.

Where do you work now? What is a day in your life like?

I work at NPR, where I oversee data journalism on the State Impact project, a local-national partnership between us and member stations. My typical day always begins with a morning "scrum" meeting among the D.C. team as part of our agile development process. I spend time acquiring and analyzing data throughout each day, and I typically work directly with reporters, training them on software and data visualization techniques. I also spend time planning news apps and interactives, a process that requires close consultation with reporters, designers and developers.

How did you get started in data journalism? Did you get any special degrees or certificates?

No special training or certificates, though I did attend three NICAR boot camps (databases, mapping, statistics) over the years.

Did you have any mentors? Who? What were the most important resources they shared with you?

I have several mentors, both on the reporting side and the data side. For data, I wouldn't be where I am today without the help of two people: Chase Davis and Jennifer LaFleur. Jen got me interested early, and has helped me with formal and informal training over the years. Chase helped me with day-to-day questions when we worked together at the Houston Chronicle.

What does your personal data journalism "stack" look like? What tools could you not live without?

I have a MacBook that runs Windows 7. I have the basic CAR suite (Excel/Access, ArcGIS, SPSS, etc.) but also plenty of open-source tools, such as R for visualization or MySQL/Postgres for databases. I use Coda and TextMate for coding. I use BBEdit and Python for text manipulation. I also couldn't live without Photoshop and Illustrator for cleaning up graphics.

What data journalism project are you the most proud of working on or creating?

I'm most proud of the online data library I created (and others have since expanded) at The Texas Tribune, but we're building some sweet apps at NPR. That's only going to expand now that we've created a national news apps team, which I'm joining soon.

Where do you turn to keep your skills updated or learn new things?

I read blogs, subscribe to email lists and attend lots of conferences for inspiration. There's no silver bullet. If you love this stuff, you'll keep up.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

More and more information is coming at us every day. The deluge is so vast. Data journalism at its core is important because it's about facts, not anecdotes.

Apps are important because Americans are already savvy data consumers, even if they don't know it. We must get them thinking -- or, even better, not thinking -- about news consumption in the same way they think about syncing their iPads or booking flights on Priceline or purchasing items on eBay. These are all "apps" that are familiar to many people. Interactive news should be, too.

This interview has been edited and condensed for clarity.

March 05 2012

Profile of the Data Journalist: The API Architect

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Jacob Harris (@harrisj) is an interactive news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work on the Interactive Newsroom team at the New York Times. A day in my life is usually devoted to coding rather than meetings. Currently, I am almost exclusively devoted to the NYT elections coverage, where I switch between loading election results from the AP and building internal APIs that provide data to the various parts of elections.nytimes.com. I also sometimes help fix problems in our server stack when they arise, or get involved in other projects if they need me.

How did you get started in data journalism? Did you get any special degrees or certificates?

I have a classical CS education, with a combined B.A./M.Eng from MIT. I have no journalism background or experience; I never even worked for a newspaper in college or anywhere else. I do have a profound skepticism and a contrarian nature, which help me fit in well with the journalists.

Did you have any mentors? Who? What were the most important resources they shared with you?

I don't have any specific mentors, but that doesn't mean I haven't been learning from anybody. We're a very open team, and we all usually learn things from each other. Currently, several of the frontend guys are tolerating my new forays into JavaScript. Soon, the map guys will learn to bear my questions with patience.

What does your personal data journalism "stack" look like? What tools could you not live without?

Our actual web stack is built on top of EC2, with Phusion Passenger and Ruby on Rails serving our apps. We also use haproxy as a load balancer. Varnish is an amazing cache that everybody should use. On my own machine, I do my coding currently in Sublime Text 2. I use Pivotal Tracker to track my coding tasks. I could probably live with a different editor, but don't take my server stack away from me.

What data journalism project are you the most proud of working on or creating?

I have two projects I'm pretty proud of working on. Last year, I helped out with the Wikileaks War Logs reporting. We built an internal news app for the reporters to search the reports, see them on a map, and tag the most interesting ones. That was an interesting learning experience.

One of the unique things I figured out was how to extract MGRS coordinates from within the reports to geocode the locations inside of them. From this, I was able to pinpoint the locations of various homicides within Baghdad more finely than the geocoding attached to the reports. I built a demo, pitched it to graphics, and we built an effective and sobering look at the devastation the violence inflicted on Baghdad.
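
Harris doesn't share the extraction code here, but MGRS references are regular enough that the general technique can be sketched with a regular expression; the pattern below is a simplification and the report text is invented:

    // Simplified MGRS matcher: 1-2 digit grid zone, latitude band letter
    // (I and O are never used), two-letter 100 km square, then one to five
    // pairs of easting/northing digits.
    var MGRS_PATTERN = /\b\d{1,2}[C-HJ-NP-X][A-HJ-NP-Z]{2}(?:\d{2}){1,5}\b/g;

    function extractMgrs(reportText) {
      return reportText.match(MGRS_PATTERN) || [];
    }

    // Invented report snippet for illustration:
    console.log(extractMgrs('IED reported vic 38SMB4484 at 0600, moved to 38SMB4584.'));
    // -> ['38SMB4484', '38SMB4584']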

This year, I am working on my third election as part of Interactive News. Although we are proud of our team's work in 2008 and 2010, we've been trying some new ways of presenting our election coverage online and new ways of architecting all of our data sources so that it's easier to build new stuff. It's been gratifying to see how internal APIs combine with novel storytelling formats and modern browser technologies this year.

Where do you turn to keep your skills updated or learn new things?

Usually, I just find out about things by following all the other news app developers on Twitter. We're a small ecosystem with lots of sharing. It's great how everybody learns from each other. I have created a Twitter list @harrisj/news-hackers to help keep tabs on all the cool stuff being done out there. (If you know someone who should be on it, let me know.)


Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

We live in a world of data. Our reporting should do a better job of presenting and investigating that data. I think it's been an incredible time for the world of news applications lately. A few years back, it was just an achievement to put data online in a browsable way.

These days, news applications are at a whole other level. Scott Klein of ProPublica put it best when he described all good data stories as including both the "near" (individual cases, examples) and the "far" (national trends, etc.).

In an article, the reporter would pick a few compelling "nears" for the story. As a reader, I would also want to know how my school is performing or how polluted my water supply is.

This is what news applications can do: tell the stories that are found in the data, but also allow the readers to investigate the stories in the data that are highly important to them.

This interview has been edited and condensed for clarity.

OpenCorporates opens up new database of corporate directors and officers

In an age of technology-fueled transparency, corporations are subject to the same powerful disruption as governments. In that context, data journalism has profound importance for society. If a researcher needs data for business journalism, OpenCorporates is a bona fide resource.

Today, OpenCorporates is making a new open database of corporate officers and directors available to the world.

"It's pretty cool, and useful for journalists, to be able to search not just all the companies with directors for a given name in a given state, but across multiple states," said Chris Taggart, founder of Open Corporates, in an email interview. "Not surprisingly, loads of people, from journalists to corruption investigators, are very interested in this."

OpenCorporates is the largest open database of companies and corporate data in the world. The service now contains public data from around the world, from health and safety violations in the United Kingdom to official public notices in Spain to a register of federal contractors. The database has been built by the open data community, under a bounty scheme in conjunction with ScraperWiki. The site also has a useful Google Refine reconciliation function that matches legal entities to company names. Taggart's presentation on OpenCorporates from the 2012 NICAR conference, which provides an overview, is embedded below:

The OpenCorporates open application programming interface can be used with or without a key, although an API key does increase usage limits. The open data site's business model comes with an interesting hook: while OpenCorporates makes its data both free and open under a Share-Alike Attribution Open Database License, users who wish to import the data into a proprietary database or use it without attribution must pay to do so.
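
As a hedged illustration of what a call to the API looks like (the search path and the shape of the JSON response follow OpenCorporates' documentation as I understand it, so check the current docs before relying on them):

    // Sketch of an OpenCorporates company search from the browser.
    // The api_token parameter is optional; supplying one raises the rate limit.
    var url = 'https://api.opencorporates.com/companies/search' +
              '?q=' + encodeURIComponent('barclays bank') +
              '&api_token=YOUR_KEY'; // placeholder key

    fetch(url)
      .then(function (res) { return res.json(); })
      .then(function (data) {
        // Assumed response shape: results.companies is a list of { company: {...} }
        data.results.companies.forEach(function (c) {
          console.log(c.company.name, c.company.jurisdiction_code,
                      c.company.company_number);
        });
      });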

"The critical thing about our Directors import, and *all* the other data in OpenCorporates, is that we give the provenance, both where and when we got the information," said Taggart. "This is in contrast to the proprietary databases who never give this, because they don't want you to go straight to the source, which also means it's problematic in tracing the source of errors. We've had several instances of the data being wrong at the source, like U.K. health and safety violations."

Taggart offered more perspective on the source of OpenCorporates director data, corporate data availability and the landscape around a universal business ID in the rest of our interview:

Where does the officer and director data come from? How is it validated and cleaned?

It's all from the official company registers. Most are scraped (we've scraped millions of pages); a couple (e.g. Vermont) are from downloads that the registries provide. We just need to make sure we're scraping and importing properly. We do some cleaning up (e.g. removing some of the '**NO DIRECTOR**' entries), but to a degree this has to be done post-import, as you often don't know about these till they're imported (which is why there are still a few in there).
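
Taggart doesn't detail the scrapers, but the general register-scraping pattern is easy to sketch (Node.js with the cheerio library; the URL and selectors are invented, since every register's HTML differs):

    // Generic register-scraper sketch: fetch an officers page and pull
    // name/role pairs out of an HTML table.
    var cheerio = require('cheerio'); // npm install cheerio

    fetch('https://example-register.gov/company/12345/officers') // invented URL
      .then(function (res) { return res.text(); })
      .then(function (html) {
        var $ = cheerio.load(html);
        $('table.officers tr').each(function (i, row) {
          var cells = $(row).find('td');
          if (cells.length >= 2) {
            console.log({
              name: $(cells[0]).text().trim(),
              role: $(cells[1]).text().trim()
            });
          }
        });
      });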

By the way, in case you were wondering, the reason there are so many more directors than show up in the filters to the right is that there are about 3 million (and counting) Florida directors.

Was this data available anywhere before? If no, why not?

As far as I'm aware, only in proprietary databases. Proprietary databases have dominated company data. The result is massive duplication of effort; databases with opaque errors, because they don't have many eyes on them; and a lack of access for the public, small businesses and, as you will have heard at NICAR, journalists. I'm tempted to offer a bottle of champagne to the first journalist who finds a story in the directors data.

Who else is working on the universal business ID issue? I heard Beth Noveck propose something along these lines, for instance.

Several organizations have been working on this, mostly from a semi-proprietary point of view, or at least trying to generate a monopoly ID. In other words, it might be open, but in order to get anything on the company, you have to use their site as a lookup table.

OpenCorporates is different: if you know the URI, you know the jurisdiction and the identifier issued by the company register, and vice versa. This means you don't need to ask OpenCorporates what the company ID is, as it's right there in the URI. It also works with the EU/W3C's Business Vocabulary, which has just been published.
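
A concrete way to see what he means (the company number below is made up): an OpenCorporates company URI carries the jurisdiction code and the register's own company number in its path, so each can be recovered from the URI without any lookup.

    // An OpenCorporates company URI is self-describing:
    //   https://opencorporates.com/companies/<jurisdiction>/<company_number>
    var uri = 'https://opencorporates.com/companies/gb/01234567'; // number invented

    var parts = uri.split('/');
    var companyNumber = parts[parts.length - 1]; // '01234567' -- the register's own ID
    var jurisdiction = parts[parts.length - 2];  // 'gb' -- UK Companies House

    console.log(jurisdiction, companyNumber);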

ISO has been working on one, but it's got exactly this problem. Also, their database won't contain the company number, meaning it doesn't link to the legal entity. Bloomberg has been working on one, as has Thomson Reuters, since they need an alternative to the DUNS number, but from the conversations I had in D.C., nobody's terribly interested in this.

I don't really know the status of Beth's project. They were intending to create a new ID too. From speaking to Jim Hendler, it didn't seem to be connected to the legal entity but instead to represent a search of the name (actually a hash of a SPARQL query). You can see a demo site at http://tw.rpi.edu/orgpedia/companies. I have severe doubts regarding this.

Finally, there's the Financial Stability Board's work (the FSB is part of the G20) on a global legal entity identifier -- we're on the advisory board for this. It also would be a new number and be voluntary, but on the other hand it will be openly licensed.

I don't think it's a solution to the problem, as it won't be complete and for other reasons, but it may surface more information. We'd definitely provide an entity resolution service to it.

March 02 2012

Profile of the Data Journalist: The Visualizer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Michelle Minkoff (@MichelleMinkoff ) is an investigative developer/journalist based in Washington, D.C. Our interview follows.

Where do you work now? What is a day in your life like?

I am an Interactive Producer at the Associated Press' Washington, D.C. bureau, where I focus on news applications related to politics and the election, as well as general mapping for our interactives on the Web. While my days pretty much always involve sitting in front of a computer, the actual tasks can vary wildly. I may be chatting with reporters and editors on the politics, environment, education, national security or myriad other beats about upcoming stories and how to use data to support reporting or create interactive stories. I might be gathering data, reformatting it or crafting Web applications. I spend a great deal of time creating interactive mapping systems, working with geographic data, and collaborating with cartographers, editors and designers to decide how best to display it.

I split my time between working closely with my colleagues in the Washington bureau on the reporting/editing side and my fellow interactive team members, only one of whom is also in DC. Our team is headquartered in New York, but its members span the globe from Phoenix to Bangkok.

It's a question of striking a balance between what needs to be done on daily deadlines for breaking news, longer-term stories that are often investigative, and creating frameworks that help The Associated Press make the most of the Web's interactive nature in the long run.

How did you get started in data journalism? Did you get any special degrees or certificates?

I caught the bug when I took a computer-assisted reporting class from Derek Willis, a member of the New York Times' Interactive News Team, at Northwestern's journalism school where I was a grad student. I was fascinated by the role that technology could play in journalism for reporting and presentation, and very quickly got hooked. I also quickly discovered that I could lose track of hours playing with these tools, and that what came naturally to me was not as natural to others. I would spend days reporting for class, on and off Capitol Hill, and nights exchanging gchats with Derek and other data journalists he introduced me to. I started to understand SQL, advanced Excel, and fairly quickly thereafter, Python and Django.

I followed this up with an independent study in data visualization back at Medill's Chicago campus, under Rich Gordon. I practiced making Django apps and played with the Processing visualization language. I voraciously read through all the Tufte books. As a final project, I created a package about the persistence of Chicago art galleries that encompassed text, Flash visualization and a searchable database.

My Medill master's degree includes a concentration in Interactive Journalism, but the courses mentioned above are only a partial component of that concentration.

Did you have any mentors? Who? What were the most important resources they shared with you?

The question here is in the wrong tense. I currently "do" have many mentors, and I don't know how I would do my job without what they've shared in the past, and in the present. Derek, mentioned above, was the first. He introduced me to his friend Matt [Waite], and then he told me there was a whole group of people doing this work at NICAR. Literally hundreds of people from that organization have helped me at various places on my journey, and I believe strongly in the mantra of "paying it forward" as they have -- no one can know it all, so we pass on what we've learned, so more people can do even better work.

Other key folks I've had the privilege to work with include all of the Los Angeles Times' Data Desk's members, which includes reporters, editors and Web developers. I worked most closely with Ben Welsh and Ken Schwencke, who answered many questions, and were extremely encouraging when I was at the very beginning of my journey.

At my current job at The Associated Press, I'm lucky to have teammates who mentor me in design, mapping and various Washington-based beats. Each is helpful in his or her own way.

Special mention goes to Jonathan Stray, who's my official boss, but also a fantastic mentor who enables me to do what I do. He's helping me to learn the appropriate technical skills to execute what I see in my head, as well as learn how to learn. He's not just teaching me the answers to the problems we encounter in our daily work, but also helping me learn how to better solve them, and work this whole "thing I do" into a sustainable career path. And all with more patience than I have for myself.

What does your personal data journalism "stack" look like? What tools could you not live without?

No matter how advanced our tools get, I always find myself coming back to Excel first to do simple work. It helps us get an overall handle on a data set. I also will often quickly bring data into SQLite via the SQLite Manager Firefox extension, which allows a user to run SQL queries with no database setup. I'm more comfortable asking complicated questions of data that way. I also like to use Google's Chart Tools to create quick visualizations for myself to better understand a story.
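That zero-setup query workflow is easy to reproduce in code, too. Here's a minimal sketch using Python's built-in sqlite3 module -- an illustration rather than Minkoff's actual setup (she describes the Firefox extension), and the CSV file and its column names are hypothetical:

    import csv
    import sqlite3

    # Load a CSV into an in-memory SQLite database -- no server setup required.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE donations (donor TEXT, state TEXT, amount REAL)")

    with open("campaign_donations.csv", newline="") as f:  # hypothetical file
        conn.executemany(
            "INSERT INTO donations VALUES (:donor, :state, :amount)",
            csv.DictReader(f),
        )

    # Ask a question that's awkward in a plain spreadsheet: totals by state.
    for state, total in conn.execute(
        "SELECT state, SUM(amount) FROM donations "
        "GROUP BY state ORDER BY SUM(amount) DESC LIMIT 10"
    ):
        print(state, total)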

When it comes to presentation, since I've been doing a lot with mapping recently, I don't know what I'd do without my favorite open source tools, TileMill and Leaflet. Building a map stack is hard work, but the work that others have done before has made it a lot easier.

If we consider programming languages to be tools (which I do), JavaScript is my new Swiss Army knife. Prior to coming to the AP, I did a lot with Python and Django, but I've learned a lot about what I like to call "Really Hard JavaScript." It's not just about manipulating the colors of a background on a Web page, but parsing, analyzing and presenting data. When I need to do more complex work to manipulate data, I use a combination of Ruby and Python -- depending on which has better tools for the job. For XML parsing, I like Ruby more. For simplifying geo data, I prefer Python.
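As one concrete example of the geo-simplification task Minkoff mentions, here's a minimal Python sketch using the Shapely library -- an assumed choice for illustration, since the interview doesn't name her specific packages:

    from shapely.geometry import Polygon

    # A toy "county boundary" with more vertices than a web map needs.
    boundary = Polygon([
        (0, 0), (1, 0.01), (2, 0), (2.01, 1),
        (2, 2), (1, 1.99), (0, 2), (-0.01, 1),
    ])

    # Douglas-Peucker simplification: drop vertices that deviate from the
    # outline by less than the tolerance, keeping the polygon valid.
    simplified = boundary.simplify(0.05, preserve_topology=True)

    print(len(boundary.exterior.coords), "->", len(simplified.exterior.coords))

Shaving vertices this way is what makes an over-detailed boundary file light enough to serve on an interactive map.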

What data journalism project are you the most proud of working on or creating?

That would be " Road to 270", a project we did at the AP that allows users to test out hypothetical "what-if" scenarios for the national election, painting states to define to which candidate a state's delegates could go. It combines demographic and past election data with the ability for users to make a choice and deeply engage with the interactive. It's not just telling the user a story, but informing the user by allowing him or her to be part of the story. That, I believe, is when data journalism becomes its most compelling and informative.

It also uses some advanced technical mapping skills that were new to me. I greatly enjoyed the thrill of learning how to structure a complex application, and add new tools to my toolkit. Now, I don't just have those new tools, but a better understanding of how to add other new tools.

Where do you turn to keep your skills updated or learn new things?

I look at other projects, both within the journalism industry and in general visualization communities. The Web inspector is my best friend. I'm always looking to see how people did things. I read blogs voraciously, and have a fairly robust Google Reader set of people whose work I follow closely. I also use lynda.com frequently (I tend to learn best by video tutorials). Hanging out on listservs for free tools I use (such as Leaflet), programming languages I care about (Python), or projects whose mission our work is related to (Sunlight Foundation) helps me engage with a community that cares about similar issues.

Help sites like Stack Overflow, and pretty much anything I can find on Google, are my other best friends. The not-so-secret secret of data journalism: we're learning as we go. That's part of what makes it so fun.

Really, the learning is not about paper or electronic resources. Like so much of journalism, this is best conquered, I argue, with persistence and stick-to-it-ness. I approach the process of data journalism and Web development as a beat. We attend key meetings. Instead of city council, it's NICAR. We develop vast rolodexes. I know people who have myriad specialties and feel comfortable calling on them. In return, I help people all over the world with this sort of work whenever I can, because it's that important. While we may work for competing places, we're really working toward the same goal: improving the way we inform the public about what's going on in our world. That knowledge matters a great deal.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

More and more information is coming at us every day. The deluge is so vast that we need to not just say things are true, but prove those truths with verifiable facts. Data journalism allows for great specificity, and truths based in the scientific method. Using computers to commit data journalism allows us to process great amounts of information much more efficiently, and make the world more comprehensible to a user.

Also, while we are working with big data, often only a subset of that data is valuable to a specific user. Data journalism and Web development skills allow us to customize those subsets for our various users, such as by localizing a map. That helps us give a more relevant and useful experience to each individual we serve.

Perhaps most importantly, more and more information is digital, and is coming at us through the Internet. It simply makes sense to display that information in an environment similar to the one in which it's provided. Information is dispensed in a different way now than it was five years ago. It will be totally different in another five years. So our explanations of that environment should match. We must make the most of the Internet to tell our stories differently now than we did before, and differently than we will in the future.

Knowing things are constantly changing, being at the forefront of that change, and enabling the public to understand and participate in that change, is a large part of what makes data journalism so exciting and fundamentally essential.

This interview has been edited and condensed for clarity.

Profile of the Data Journalist: The Human Algorithm

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Ben Welsh (@palewire) is a Web developer and journalist based in Los Angeles. Our interview follows.

Where do you work now? What is a day in your life like?

I work for the Los Angeles Times, a daily newspaper and 24-hour Web site based in Southern California. I'm a member of the Data Desk, a team of reporters and Web developers that specializes in maps, databases, analysis and visualization. We both build Web applications and conduct analysis for reporting projects.

I like to compare The Times to a factory, a factory that makes information. Metaphorically speaking, it has all sorts of different assembly lines. Just to list a few, one makes beautifully rendered narratives, another makes battleship-like investigative projects.

A typical day involves juggling work on different projects, mentally moving from one assembly line to the other. Today I patched an embryonic open-source release, discussed our next move on a pending public records request, guided the real-time publication of results from the GOP primaries in Michigan and Arizona, and did some preparation for how we'll present a larger dump of results on Super Tuesday.

How did you get started in data journalism? Did you get any special degrees or certificates?

I'm thrilled to see new-found interest in "data journalism" online. It's drawing young, bright people into the field and involving people from different domains. But it should be said that the idea isn't new.

I was initiated into the field as a graduate student at the Missouri School of Journalism. There I worked at the National Institute for Computer-Assisted Reporting, also known as NICAR. Decades before anyone called it "data journalism," a disparate group of misfit reporters discovered that the data analysis made possible by computers enabled them to do more powerful investigative reporting. In 1989, they founded NICAR, which has, for decades, been teaching data skills to journalists and nurturing a tribe of journalism geeks. In the time since, computerized data analysis has become a dominant force in investigative reporting, responsible for a large share of the field's best work.

To underscore my point, here's a 1986 Time magazine article about how "newsmen are enlisting the machine."

Did you have any mentors? Who? What were the most important resources they shared with you?

My first journalism job was in Chicago. I got a gig working for two great people there, Carol Marin and Don Moseley, who have spent most of their careers as television journalists. I worked as their assistant. Carol and Don are warm people who are good teachers, but they are also excellent at what they do. There was a moment when I realized, "Hey, I can do this!" It wasn't just something I heard about in class, but I could actually see myself doing.

At Missouri, I had a great classmate named Brian Hamman, who is now at the New York Times. I remember seeing how invested Brian was in the Web, totally committed to Web development as a career path. When an opportunity opened up to be a graduate assistant at NICAR, Brian encouraged me to pursue it. I learned enough SQL to help do farmed-out investigative work for TV stations. And, more importantly, I learned that if you had technical skills you could get the job to work on a cool story.

After that I got a job doing data analysis at the Center for Public Integrity in Washington DC. I had the opportunity to work on investigative projects, but also the chance to learn a lot of computer programming along the way. I had the guidance of my talented coworkers, Daniel Lathrop, Agustin Armendariz, John Perry, Richard Mullins and Helena Bengtsson. I learned that computer programming wasn't impossible. They taught me that if you have a manageable task, a few friends to help you out and a door you can close, you can figure out a lot.

What does your personal data journalism "stack" look like? What tools could you not live without?

I do my daily development in the gedit text editor, Byobu's slick implementation of the screen terminal and the Chromium browser. And, this part may be hard to believe, but I love Ubuntu Unity. I don't understand what everybody is complaining about.

I do almost all of my data management in the Python Web framework Django and the PostgreSQL database, even if the work is an exploratory reporting project that will never be published. I find that the structure of the framework can be useful for organizing just about any data-driven project.
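As an illustration of that pattern, the skeleton of such a throwaway project might look like the sketch below -- a guess at the general shape, not the Data Desk's actual code, with hypothetical field names:

    # models.py -- a minimal Django model for an exploratory data set.
    # Defining even unpublished data this way gives you loading scripts,
    # ORM queries and the admin interface for free.
    from django.db import models

    class CampaignContribution(models.Model):
        donor_name = models.CharField(max_length=200)
        recipient = models.CharField(max_length=200)
        amount = models.DecimalField(max_digits=12, decimal_places=2)
        date = models.DateField()

        class Meta:
            ordering = ["-amount"]

Once the model exists, a question like "show me every contribution over $10,000" becomes one ORM call -- CampaignContribution.objects.filter(amount__gte=10000) -- instead of another round of spreadsheet sorting.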

I use GitHub for both version-control and project management. Without it, I'd be lost.

What data journalism project are you the most proud of working on or creating?

As we all know, there's a lot of data out there. And, as anyone who works with it knows, most of it is crap. The projects I'm most proud of have taken large, ugly data sets and refined them into something worth knowing: a nut graf in an investigative story, or a data-driven app that gives the reader some new insight into the world around them. It's impossible to pick one. I like to think the best is still, as they say in the newspaper business, TK.

Where do you turn to keep your skills updated or learn new things?

Twitter is a great way to keep up with what is getting other programmers excited. I know a lot of people find social media overwhelming or distracting, but I feel plugged in and inspired by what I find there. I wouldn't want to live without it.

GitHub is another great source. I've learned so much just exploring other people's code. It's invaluable.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Computers offer us an opportunity to better master information, better understand each other and better watchdog those who would govern us. At last week's NICAR conference, in a talk called "Human-Assisted Reporting," I tried to talk about some of the ways that simply thinking about the process of journalism as an algorithm can point the way. In my opinion, we should aspire to write code that embodies the idealistic principles and investigative methods of the previous generation. There's all this data out there now, and journalistic algorithms, "robot reporters," can help us ask it tougher questions.


March 01 2012

In the age of big data, data journalism has profound importance for society

The promise of data journalism was a strong theme throughout the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. In 2012, making sense of big data through narrative and context, particularly unstructured data, will be a central goal for data scientists around the world, whether they work in newsrooms, on Wall Street or in Silicon Valley. Notably, that goal will be substantially enabled by a growing set of common tools, whether they're employed by government technologists opening up data in Chicago, healthcare technologists or newsroom developers.

At NICAR 2012, you could literally see the code underpinning the future of journalism written -- or at least projected -- on the walls.

"The energy level was incredible," said David Herzog, associate professor for print and digital news at the Missouri School of Journalism, in an email interview after NICAR. "I didn't see participants wringing their hands and worrying about the future of journalism. They're too busy building it."

Just as open civic software is increasingly baked into government, open source is playing a pivotal role in the new data journalism.

"Free and open-source tools dominated," said Herzog. "It's clear from the panels and hands-on classes that free and open source tools have eliminated the barrier to entry in terms of many software costs."

While many developers are agnostic with respect to which tools they use to get a job done, the people who are building and sharing tools for data journalism are often doing it with open source code. As Dan Sinker, the head of the Knight-Mozilla News Technology Partnership for Mozilla, wrote afterwards, journo-coders took NICAR 12 "to a whole new level."

While some of that open source development was definitely driven by the requirements of the Knight News Challenge, which funded the PANDA and Overview projects, there's also a collaborative spirit in evidence throughout this community.

This is a group of people who are fiercely committed to "showing your work" -- and for newsroom developers, that means sharing your code. To put it another way, code, don't tell. Sessions on Python, Django, mapping, Google Refine and Google Fusion tables were packed at NICAR 12.

No, this is not your father's computer-assisted reporting.

"I thought this stacked up as the best NICAR conference since the first in 1993," said Herzog. "It's always been tough to choose from the menu of panels, demos and hands-on classes at NICAR conferences. But I thought there was an abundance of great, informative, sessions put on by the participants. Also, I think NICAR offered a good range of options for newbies and experts alike. For instance, attendees could learn how to map using Google Fusion tables on the beginner's end, or PostGIS and qGIS at the advanced level. Harvesting data through web scraping has become an ever bigger deal for data journalists. At the same time, it's getting easier for folks with no or little programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki. "

On the history of NICAR

According to IRE, NICAR was founded in 1989. Since its founding, the institute has trained thousands of journalists how to find, collect and publish electronic information.

Today, "the NICAR conference helps journalists, hackers, and developers figure out best practices, best methods,and best digital tools for doing journalism that involves data analysis and classic reporting in the field," said Brant Houston, former executive director of Investigative Reporters and Editors, in an email interview. "The NICAR conference also obviously includes investigative journalism and the standards for data integrity and credibility."

"I believe the first IRE-sponsored [conference] was in 1993 in Raleigh, when a few reporters were trying to acquire and learn to use spreadsheets, database managers, etc. on newly open electronic records," said Sarah Cohen, the Knight professor of the practice of journalism and public policy at Duke University, in an email interview. "Elliott Jaspin was going around the country teaching reporters how to get data off of 9-track tapes. There really was no public Internet. At the time, it was really, really hard to use the new PC's, and a few reporters were trying to find new stories. The famous ones had been Elliott's school bus drivers who had drunk driving records and the Atlanta Color of Money series on redlining."

"St. Louis was my 10th NICAR conference," said Anthony DeBarros, the senior database editor at USA Today, in an email interview. "My first was in 1999 in Boston. The conference is a place where news nerds can gather and remind themselves that they're not alone in their love of numbers, data analysis, writing code and finding great stories by poring over columns in a spreadsheet. It serves as an important training vehicle for journalists getting started with data in the newsroom, and it's always kept journalists apprised of technological developments that offer new ways of finding and telling stories. At the same time, its connection to IRE keeps it firmly rooted in the best aspects of investigative reporting -- digging up stories that serve the public good.

Baby, you can drive my CAR

Long before we started talking about "data journalism," the practice of computer-assisted reporting (CAR) was growing around the world.

"The practice of CAR has changed over time as the tools and environment in the digital world has changed," said Houston. "So it began in the time of mainframes in the late 60s and then moved onto PCs (which increased speed and flexibility of analysis and presentation) and then moved onto the Web, which accelerated the ability to gather, analyze and present data. The basic goals have remained the same. To sift through data and make sense of it, often with social science methods. CAR tends to be an "umbrella" term - one that includes precision journalism and data driven journalism and any methodology that makes sense of date such as visualization and effective presentations of data."

On one level, CAR is still around because the journalism world hasn't coined a good term to use instead.

"Computer-assisted reporting" is an antiquated term, but most people who practice it have recognized that for years," said DeBarros. "It sticks around because no one has yet to come up with a dynamite replacement. Phil Meyer, the godfather of the movement, wrote a seminal book called "Precision Journalism, and that term is a good one to describe that segment of CAR that deals with statistics and the use of social science methods in newsgathering. As an umbrella term, data journalism seems to be the best description at the moment, probably because it adequately covers most of the areas that CAR has become -- from traditional data-driven reporting to the newer category of news applications."

The most significant shift in CAR may well be when all of those computers being used for reporting were connected through the network of networks in the 1990s.

"It may seem obvious, but of course the Internet changed it all, and for a while it got smushed in with trying to learn how to navigate the Internet for stories, and how to download data," said Cohen. "Then there was a stage when everyone was building internal intranets to deliver public records inside newsrooms to help find people on deadline, etc. So for much of the time, it was focused on reporting, not publishing or presentation. Now the data journalism folks have emerged from the other direction: People who are using data obtained through APIs who often skip the reporting side, and use the same techniques to deliver unfiltered information to their readers in an easier format the the government is giving us. But I think it's starting to come back together -- the so-called data journalists are getting more interested in reporting, and the more traditional CAR reporters are interested in getting their stories on the web in more interesting ways.

Whatever you call it, the goals are still the same.

"CAR has always been about using data to find and tell stories," said DeBarros. "And it still is. What has changed in recent years is more emphasis toward online presentations (interactive maps and applications) and the coding skills required to produce them (JavaScript, HTML/CSS, Django, Ruby on Rails). Earlier NICAR conferences revolved much more around the best stories of the year and how to use data techniques to cover particular topics and beats. That's still in place. But more recently, the conference and the practice has widened to include much more coding and presentation topics. That reflects the state of media -- every newsroom is working overtime to make its content work well on the web, on mobile, and on apps, and data journalists tend to be forward thinkers so it's not surprising that the conference would expand to include those topics."

What stood out at NICAR 2012?

The tools and tactics on display at NICAR were enough to convince Tyler Dukes at Duke to write that "NICAR taught me I know nothing." Browse through the tools, slides and links from NICAR 2012 curated by Chrys Wu to get a sense of just how much is out there. The big theme, however, without a doubt, was data.

"Data really is the meat of the conference, and a quick scan of the schedule shows there were tons of sessions on all kinds of data topics, from the Census to healthcare to crime to education," said DeBarros.

What I saw everywhere at NICAR was interest not simply in what data was out there, but in how to get it and put it to use -- from finding stories and sources, to providing empirical evidence to back up other reporting, to telling stories with maps and visualizations.

"A major theme was the analysis of data (using spreadsheets, data managers, GIS) that gives journalism more credibility by seeing patterns, trends and outliers," said Houston. "Other themes included collection and analysis of social media, visualization of data, planning and organizing stories based on data analysis, programming for web scraping (data collection from the Web) and mashing up various Web programs."

"Harvesting data through web scraping has become an ever bigger deal for data journalists," said Herzog. "At the same time, it's getting easier for folks with no or little programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki. That said, another message for me was how important programming has become. No, not all journalists or even data journalists need to learn programming. But as Rich Gordon at Medill has said, all journalists should have an appreciation and understanding of what it can do."

Cohen similarly pointed to data, specifically its form. "The theme that I saw this year was a focus on unstructured rather than structured data," she said. "For a long time, we've been hammering governments to give us 'data' in columns and rows. I think we're increasingly seeing that stories are just as likely (if not more likely) to come from the unstructured information that comes from documents, audio and video, tweets, other social media -- from government and non-government sources. The other theme is that there is a lot more collaboration, openness and sharing among competing news organizations. (Witness PANDA and census.ire.org and the New York Times campaign finance API.) But it only goes so far -- you don't see ProPublica sharing the 40+ states' medical licensure data that Dan scraped with everyone else. (I have to admit, though, I haven't asked him to share.) IRE has always been about sharing techniques and tools -- now we're actually sharing source material."

While data dominated NICAR 12, other trends mattered as well, from open mapping tools to macroeconomic trends in the media industry. "A lot of newsrooms are grappling with rapid change in mapping technology," said DeBarros. "Many of us for years did quite well with Flash, but the lack of support for Flash on iPad has fueled exploration into maps built on open source technologies that work across a range of online environments. Many newsrooms are grappling with this, and the number of mapping sessions at the conference reflected this."

There's also serious context to the interest in developing data journalism skills. More than 166 U.S. newspapers have stopped putting out a print edition or closed down altogether since 2008, resulting in more than 35,000 job losses or buyouts in the newspaper industry since 2007.

"The economic slump and the fundamental change in the print publishing business means that journalists are more aware of the business side than ever," said DeBarros, "and I think the conference reflected that more than in the past. There was a great session on turning your good work into money by Chase Davis and Matt Wynn, for example. I was on a panel talking about the business reasons for starting APIs. The general unease most journalists feel knowing that our industry still faces difficult economic times. Watching a new generation of journalists come into the fold has been exciting."

One notable aspect of that next generation of data journalists is that it does not appear likely to look or sound the same as the newsrooms of the 20th century.

"This was the most diverse conference that I can remember," said Herzog. "I saw more women and people of color than ever before. We had data journalists from many countries: Korea, the U.K., Serbia, Germany, Canada, Latin America, Denmark, Sweden and more. Also, the conference is much more diverse in terms of professional skills and interests. Web 2.0 entrepreneurs, programmers, open data advocates, data visualization specialists, educators, and app builders mixed with traditional CAR jockeys. I also saw a younger crowd, a new generation of data journalists who are moving into the fold. For many of the participants, this was their first conference."

What problems does data journalism face?

While the tools are improving, there are still immense challenges ahead, from the technology itself to education to resources in newsrooms. "A major unsolved challenge is making the analysis of unstructured data easier and faster to do. Those working on this include myself, Sarah Cohen, the DocumentCloud team, teams at AP and Chicago Tribune and many others," said Houston.

There's also the matter of improving the level of fundamental numeracy in the media. "This is going to sound basic, but there are still far too many journalists around the world who cannot open an Excel spreadsheet, sort the values or write an equation to determine percentage change," said DeBarros, "and that includes a large number of the college interns I see year after year, which really scares me. Journalism programs need to step up and understand that we live in a data-rich society, and math skills and basic data analysis skills are highly relevant to journalism. The 400+ journalists at NICAR still represent something of an outlier in the industry, and that has to change if journalism is going to remain relevant in an information-based culture."

In that context, Cohen has high hopes for a new project, the Reporters Lab. "The big unsolved problem to me is that it's still just too hard to use 'data' writ large," she said. "You might have seen 4 or 5 panels on how to scrape data [at NICAR]. People have to write one-off computer programs using Python or Ruby or something to scrape a site, rather than use a tool like Kapow, because newsrooms can't (and never have) invest that kind of money into something that really isn't mission-critical. I think Kapow and its cousins cost $20,000-$40,000 a year. Our project aims to find those kinds of holes and create, commission or adapt free, open source tools for regular reporters to use, not just the data journalist skilled in programming. We're building communities of people who want to work on these problems."
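For readers who haven't written one, the kind of one-off scraper Cohen describes is often just a couple dozen lines. Here's a minimal sketch in Python using the widely used requests and BeautifulSoup libraries; the URL and the table structure are hypothetical stand-ins, not a real agency site:

    import csv

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse it (hypothetical URL).
    resp = requests.get("http://example.gov/inspections.html")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Walk the results table, skipping the header row, and save to CSV.
    with open("inspections.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["facility", "date", "result"])
        for row in soup.select("table#results tr")[1:]:
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:
                writer.writerow(cells[:3])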

What role does data journalism play in open government?

On the third day of NICAR 2012, I presented on "open data journalism," which, to paraphrase Jonathan Stray, I'd define as obtaining, reporting upon, curating and publishing open data in the public interest. As someone who's been following the open government movement closely for a few years now, the parallels between what civic hackers are doing and what this community of data journalists is working on are inescapable. They're focused on putting data to work for the public good, whether it's in the public interest, for profit, in the service of civic utility or, in the biggest crossover, government accountability.

To do so will require that data journalists and civic coders alike apply the powerful emerging tools in the newsroom stack to the explosion of digital bits and bytes from government, business and our fellow citizens.

The need for data journalism, in the context of massive amounts of government data being released, could not be any more timely, particularly given persistent quality issues.

"I can't find any downsides of more data rather than less," said Cohen, "but I worry about a few things."

First, emphasized Cohen, there's an issue of whether data is created open from the beginning -- and the consequences of 'sanitizing' it before release. "The demand for structured, nicely scrubbed data for the purpose of building apps can result in fake records rather than real records being released. USASpending.gov is a good example of that -- we don't get access to the actual spending records like invoices and purchase orders that agencies use, or the systems they use to actually do their business. Instead we have a side system whose only purpose is to make it public, so it's not a high priority inside agencies and there's no natural audit trail on it. It's not used to spend money, so mistakes aren't likely to be caught."

Second, there's the question of whether information relevant to an investigation has been scrubbed for release. "We get the lowest common denominator of information," she said. "There are a lot of records used for accountability that depend on our ability to see personally identifiable information (as opposed to private or personal information, which isn't the same thing). For instance, if you want to do stories on how farm subsidies are paid, you kind of have to know who gets them. If you want to do something on fraud in FEMA claims, you have to be able to find the people and businesses who get the aid. But when it gets pushed out as open government data, it often gets scrubbed of important details and then we have a harder time getting them under FOIA because the agencies say the records are already public."

To address those two issues, Cohen recommends getting more source documents, as a historian would. "I think what we can do is to push harder for actual records, and to not settle for what the White House wants to give us," she said. "We also have to get better at using records that aren't held in nice, neat forms -- they're not born that way, and we should get better at using records in whatever form they exist."

Why do data journalism and news apps matter?

Given the economic and technological context, it might seem like the case for data journalism should make itself. "CAR, data journalism, precision journalism, and news apps all are crucial to journalism -- and the future of journalism -- because they make sense of the tremendous amounts of data," said Houston, "so that people can understand the world and make sensible decisions and policies."

Given the reality that those practicing data journalism remain a tiny percentage of the world's media, however, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact.

"We're living in a data-driven culture," said DeBarros. "A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not. As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

New tools, same rules

While the platforms and toolkits for journalism are evolving and the sources of data are exploding, many things haven't changed. For one, the ethics that guide the choices of the profession remain central to the journalism of the 21st century, as NPR's new ethics guide makes clear.

Whether news developers are rendering data in real-time, validating data in the real world, or improving news coverage with data, good data journalism still must tell a story. And as Erika Owens reflected in her own blog after NICAR, looking back upon a group field trip to the marvelous City Museum in St. Louis, journalism is also joyous, whether one is "crafting the perfect lede or slaying an infuriating bug."

Whether the tool is a smartphone, notebook or dataset, these tools must also extend investigative reporting, as the Los Angeles Times' Doug Smith emphasized to me at the NICAR conference.

If text is the next frontier in data journalism, harnessing the power of big data, it will be in the service of telling stories more effectively. Digital journalism and digital humanities are merging in the service of a more informed society.

Profiles of the data journalist

To learn more about the people who are redefining the practice of computer-assisted reporting and, in some cases, building the newsroom stack for the 21st century, Radar conducted a series of email interviews with data journalists during the 2012 NICAR Conference. The first two in the series follow below:

Profile of the Data Journalist: The Elections Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Derek Willis (@derekwillis) is a news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work for The New York Times as a developer in the Interactive News Technologies group. A day in my work life usually includes building or improving web applications relating to politics, elections and Congress, although I also get the chance to branch out to do other things. Since elections are such an important subject, I try to think of ways to collect information we might want to display and of ways to get that data in front of readers in an intelligent and creative manner.

How did you get started in data journalism? Did you get any special degrees or certificates?

No, I started working with databases in graduate school at the University of Florida (I left for a job before finishing my master's degree). I had an assistantship at an environmental occupations training center and part of my responsibilities was to maintain the mailing list database. And I just took to it - I really enjoyed working with data, and once I found Investigative Reporters & Editors, things just took off for me.

Did you have any mentors? Who? What were the most important resources they shared with you?

A ton of mentors, mostly met through IRE but also people at my first newspaper job at The Palm Beach Post. A researcher there, Michelle Quigley, taught me how to find information online and how sometimes you might need to take an indirect route to locating the stuff you want. Kinsey Wilson, now the chief content officer at NPR, hired me at Congressional Quarterly and constantly challenged me to think bigger about data and the news. And my current and former colleagues at The Times and The Washington Post are an incredible source of advice, counsel and inspiration.

What does your personal data journalism "stack" look like? What tools could you not live without?

It's pretty basic: spreadsheets, databases (MySQL, PostgreSQL, SQLite) and a programming language like Python or, these days, Ruby. I've been lucky to find excellent tools in the Ruby world, such as the Remote Table gem by Brighter Planet, and a host of others. I like PostGIS for mapping stuff.
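The Remote Table gem Willis mentions is a Ruby tool; to keep the examples in this piece in one language, here's a rough Python analogue of the same idea -- point at a remote file and get a queryable table back, using pandas. The URL and column names are hypothetical:

    import pandas as pd

    # pandas will fetch a CSV straight from a URL (hypothetical here).
    results = pd.read_csv("http://example.com/election_results.csv")

    # Spreadsheet-style questions, but scriptable and repeatable:
    totals = results.groupby("candidate")["votes"].sum()
    print(totals.sort_values(ascending=False).head(10))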

What data journalism project are you the most proud of working on or creating?

I'm really proud of the elections work at The Times, but can't take credit for how good it looks. A project called Toxic Waters was incredibly challenging and rewarding to work on, too. But my favorite might be the first one: the Congressional Votes Database that Adrian Holovaty, Alyson Hurt and I created at The Post in late 2005. It was a milestone for me and for The Post, and helped set the bar for what news organizations could do with data on the web.

Where do you turn to keep your skills updated or learn new things?

My colleagues are my first source. When you work with Jeremy Ashkenas, the author of the Backbone and Underscore JavaScript libraries, you see and learn new things all the time. Our team is constantly bouncing new concepts around. I wish I had more time to learn new things; maybe after the elections!

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

A couple of reasons: one is that we live in an age where information is plentiful. Tools that can help distill and make sense of it are valuable. They save time and convey important insights. News organizations can't afford to cede that role. The second is that they really force you to think about how the reader/user is getting this information and why. I think news apps demand that you don't just build something because you like it; you build it so that others might find it useful.

This email interview has been edited and condensed for clarity.

Profile of the Data Journalist: The Long Form Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Dan Nguyen (@dancow) is an investigative developer/journalist based in Manhattan. Our interview follows.

Where do you work now? What is a day in your life like?

I'm a news app developer at ProPublica, where I've worked for about 3.5 years. It's hard to say what a typical day is like. Ideally, I either have a project or am writing code to collect the data to determine whether a project is worth doing (or just doing old-fashioned reading of articles and papers that may spark ideas for things to look at). We're a small operation, so we all have our hands in the daily news production as well, including helping the reporters put together online features for their more print-focused work.

How did you get started in data journalism? Did you get any special degrees or certificates?

I stumbled into data journalism because I had always been interested in being a journalist but double majored in journalism and computer engineering just in case the job market didn't work out. Out of college, I got a good job as a traditional print reporter at a regional newspaper but was eventually asked to help with the newsroom's online side. I got back into programming and started to realize there was a role for programming in important journalism.

Did you have any mentors? Who? What were the most important resources they shared with you?

The mix of programming and journalism is still relatively new, so I didn't have any formal mentors in it. I was of course lucky that my boss at ProPublica, Scott Klein, had a great vision about the role of news applications in our investigative journalism. We were also fortunate to have Brian Boyer (now the news applications editor at the Tribune company) work with us as we started doing news apps with Ruby on Rails, as he had come into journalism from being a professional developer.

What does your personal data journalism "stack" look like? What tools could you not live without?

In terms of day-to-day tools, I use RVM (Ruby Version Manager) to run multiple versions of Ruby, which is my all-purpose tool for doing any kind of batch task work, text processing/parsing, number crunching, and of course Ruby on Rails development. Git, of course, is essential, and I combine that with Dropbox to keep versioned copies of personal projects and data work. On top of that, my most frequently used tool is Google Refine, which takes the tedium out of exploring new data sets, especially if I have to clean them.

What data journalism project are you the most proud of working on or creating?

The project I'm most proud of is something I did before SOPA Opera, which was our Dollars for Docs project in 2010. It started off with just a blog post I wrote to teach other journalists how web scraping was useful. In this case, I scraped a website Pfizer used to disclose what it paid doctors to do promotional and consulting work. My colleagues noticed and said that we could do that for every company that had been disclosing payments. Because each company disclosed these payments in a variety of formats, including Flash containers and PDFs, few people had tried to analyze these disclosures in bulk, to see nationwide trends in these financial relationships.

A lot of the data work happened behind the scenes, including writing dozens of scrapers to cross-reference our database of payments with state medical board and med school listings. For the initial story, we teamed up with five other newsrooms, including NPR and the Boston Globe, which required building a system in which we could coordinate data and research. With all the data we had, and the number of reporters and editors working on this outside of our walls, this wasn't a project that would've succeeded by just sending Excel files back and forth.
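To give a flavor of that cross-referencing step, here's a deliberately simplified Python sketch: matching payment records against a medical-board roster on a normalized name key. The file and column names are hypothetical, and real matching at Dollars for Docs scale would need far fuzzier logic than this:

    import csv

    def name_key(name):
        # Normalize "SMITH, JOHN" and "John Smith" toward the same key.
        parts = name.upper().replace(",", " ").replace(".", " ").split()
        return " ".join(sorted(parts))

    # Index the medical-board roster by normalized name (hypothetical file).
    with open("medical_board.csv", newline="") as f:
        board = {name_key(row["name"]): row for row in csv.DictReader(f)}

    # Look each payment's doctor up in the roster.
    with open("payments.csv", newline="") as f:
        for row in csv.DictReader(f):
            match = board.get(name_key(row["doctor_name"]))
            if match:
                print(row["doctor_name"], "->", match["license_number"])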

The website we built from that data is our most visited project yet, as millions of people used it to look up their doctors. Afterwards, we shared our data with any news outlet that asked, so hundreds of independently reported stories came from our data. Among the results: the drug companies and the med schools revisited their screening and conflict-of-interest policies.

So, in terms of impact, Dollars for Docs is the project I'm proudest of. But it shares something in common with SOPA Opera (which was mostly a solo project that took a couple weeks), in that both projects were based on already well-known and long-ago-publicized data. But with data journalism techniques, there are countless new angles to important issues, and countless new and interesting ways to tell their stories.

Where do you turn to keep your skills updated or learn new things?

I check Hacker News and the programming subreddit constantly to see what new hacks, projects and plugins the community is putting out. I also have a huge backlog of programming books on my Kindle, some of them free ones that were posted on HN.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

I went into journalism because I wanted to be a longform writer in the tradition of the New Yorker. But I'm fortunate that I stumbled onto the path of using programming to do journalism; more and more, I'm seeing how important stories aren't being done even though the data and information are out in broad daylight (as they were in D4D and SOPA Opera) because we have relatively few journalists with the skills or mindset to process and understand that data. Of course, doing this work doesn't preclude me from presenting in a longform article; it just happens that programming also provides even more ways to present a story when narrative isn't the only (or the ideal) way to do so.
