May 31 2012

Strata Week: MIT and Massachusetts bet on big data

Here are a few of the big data stories that caught my attention this week.

MIT makes a big data push

MIT unveiled its big data research plans this week with a new initiative: bigdata@csail. CSAIL is the university's Computer Science and Artificial Intelligence Laboratory. According to the initiative's website, the project will "identify and develop the technologies needed to solve the next generation data challenges which require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide."

The research will be funded in part by Intel, which will contribute $2.5 million per year for up to five years. As part of the announcement, Massachusetts Governor Deval Patrick added that his state was forming a Massachusetts Big Data initiative that would provide matching grants for big data research, something he hopes will make the state "well-known for big data research."

Cisco's predictions for the Internet

Cisco released its annual Internet traffic forecast. Not surprisingly, the company projects massive growth in network traffic, with annual global IP traffic reaching 1.3 zettabytes by 2016. "The projected increase of global IP traffic between 2015 and 2016 alone is more than 330 exabytes," according to the company's press release, "which is almost equal to the total amount of global IP traffic generated in 2011 (369 exabytes)."
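To keep the units straight, here is a quick back-of-the-envelope check of how those figures relate. The only assumption added here is the decimal conversion of 1 zettabyte = 1,000 exabytes.

```python
# Back-of-the-envelope check of Cisco's forecast figures (decimal units).
ZB_TO_EB = 1_000  # 1 zettabyte = 1,000 exabytes

annual_traffic_2016_zb = 1.3      # projected annual global IP traffic in 2016
increase_2015_to_2016_eb = 330    # projected year-over-year increase, in exabytes
annual_traffic_2011_eb = 369      # total global IP traffic generated in 2011

annual_traffic_2016_eb = annual_traffic_2016_zb * ZB_TO_EB
print(f"2016 annual traffic: {annual_traffic_2016_eb:.0f} EB")
print(f"2015-to-2016 increase as a share of 2011's total: "
      f"{increase_2015_to_2016_eb / annual_traffic_2011_eb:.0%}")
```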

Cisco points to a number of factors contributing to the explosion, including more Internet-connected devices, more users, faster Internet speeds, and more video.

Open data startup Junar raises funding

The Chilean data startup Junar announced this week that it had raised a seed round of funding. The startup is an open data platform with the goal of making it easy for anyone to collect, analyze, and publish data. GigaOm's Barb Darrow writes:

"Junar's Open Data Platform promises to make it easier for users to find the right data (regardless of its underlying format); enhance it with analytics; publish it; enable interaction with comments and annotation; and generate reports. Throughout the process it also lets user manage the workflow and track who has accessed and downloaded what, determine which data sets are getting the most traction etc."

Junar joins a number of open data startups and marketplaces that offer similar or related services, including Socrata and DataMarket.

Have data news to share?

Feel free to email me.


May 24 2012

Knight Foundation grants $2 million for data journalism research

Every day, the public hears more about technology and media entrepreneurs, from their beginnings in garages and dorm rooms all the way to when they go public, get acquired or go spectacularly bust. The way the world mourned the passing of Steve Jobs last year and the way young people now look to Mark Zuckerberg as a model for what's possible offer some insight into that dynamic.

For those who want to follow in their footsteps, the most interesting elements of those stories will be the muddy details of who came up with the idea, who wrote the first lines of code, who funded them, how they were mentored and how the startup executed on those ideas.

Today, foundations and institutions alike are getting involved in the startup ecosystem, but with a different hook than the venture capitalists on Sand Hill Road in California or Y Combinator: They're looking for smart, ambitious social entrepreneurs who want to start civic startups and increase the social capital of the world. From the Code for America Civic Accelerator to the Omidyar Foundation to Google.org to the Knight Foundation's News Challenge, there's more access to seed capital than ever before.

There are many reasons to watch what the Knight Foundation is doing, in particular, as it shifts how it funds digital journalism projects. The foundation's grants are going toward supporting many elements of the broader open government movement, from civic media to government transparency projects to data journalism platforms.

Many of these projects — or elements and code from them — have a chance at becoming part of the plumbing of digital democracy in the 21st century, although we're still on the first steps of the long road of that development.

This model for catalyzing civic innovation in the public interest is, in the broader sweep of history, still relatively new. (Then again, so is the medium you're reading this post on.) One barrier the Internet has helped lower is the process of discovering and selecting good ideas to fund while letting bad ideas fall by the wayside. Another is how ideas are capitalized: microfunding approaches and crowdfunding platforms like Kickstarter now distribute opportunities to participate in bringing products or services to market.

When the Pebble smartwatch raised $10 million through Kickstarter this year, it offered a notable data point on how this model could work. We'll see how others follow.

These models could contribute to the development of small pieces of civic architecture around the world, loosely joining networks in civil society with mobile technology, lightweight programming languages and open data.

After years of watching how the winners of the Knight News Challenges have — or have not — contributed to this potential future, its architects are looking at big questions: How should resources be allocated in newsrooms? What should be measured? Are governments more transparent and accountable due to the use of public data by journalists? What data is available? What isn't? What's useful and relevant to the lives of citizens? How can data visualization, news applications and interactive maps inform and engage readers?

In the context of these questions, the fact that the next Knight News Challenge will focus on data will create important new opportunities to augment the practice of journalism and accelerate the pace of open government. John Bracken (@jsb), the Knight Foundation's program director for journalism and media innovation, offered an explanation for this focus on the foundation's blog:

"Knight News Challenge: Data is a call for making sense of this onslaught of information. 'As data sits teetering between opportunity and crisis, we need people who can shift the scales and transform data into real assets,' wrote Roger Ehrenberg earlier this year.

"Or, as danah boyd has put it, 'Data is cheap, but making sense of it is not.'

"The CIA, the NBA's Houston Rockets, startups like BrightTag and Personal ('every detail of your life is data') — they're all trying to make sense out of data. We hope that this News Challenge will uncover similar innovators discovering ways for applying data towards informing citizens and communities."

Regardless of what happens with this News Challenge, some of those big data questions stand a much better chance of being answered because of the Knight Foundation's $2 million grant to Columbia University to research and distribute best practices for digital reporting, data visualizations and measuring impact.

Earlier this spring, I spoke with Emily Bell, the director of the Tow Center for Digital Journalism, about how this data journalism research at Columbia will close the data science "skills gap" in newsrooms. Bell is now entrusted with creating the architecture for learning that will teach the next generation of data journalists at Columbia University.

In search of the reasoning behind the grant, I talked to Michael Maness (@MichaelManess), vice president of journalism and media innovations at the Knight Foundation. Our interview, lightly edited for content and clarity, follows.

The last time I checked, you're in charge of funding ideas that will make the world better through journalism and technology. Is that about right?

Michael Maness: That's the hope. What we're trying to do is make sure that we're accelerating innovation in the journalism and media space that continues to help inform and engage communities. We think that's vital for democracy. What I do is work on those issues and fund ideas around that to not only make it easier for journalists to do their work, but citizens to engage in that same practice.

The Knight News Challenge has changed a bit over the last couple of years. How has the new process been going?

Michael Maness: I've been in the job a little bit more than a year. I came in at the tail end of 2011 and the News Challenge of 2011. We had some great winners, but we noticed that the amount of time from when you applied to the News Challenge to when you were funded could be up to 10 months by the time everything was done, and certainly eight months in terms of the process. So we reduced that to about 10 weeks. It's intense for the judges to do that, but we wanted to move more quickly, recognizing the speed of disruption and the energy of innovation and how fast it's moving.

We've also switched to a thematic approach. We're going to do three [themes] this year. The point of it is to fund as fast as possible those ideas that we think are interesting and that we think will have a big impact.

This last round was around networks. The reason we focused on networks is the apparent rise of network power. The second reason is that we get people who say, for example, "This is the new Twitter for X" or "This is the new Facebook for journalists." Our point is that, actually, you should be using and leveraging existing things for that.

We found when we looked back at the last five years of the News Challenge that people who came in with networks or built networks in accordance with what they're doing had a higher and faster scaling rate. We want to start targeting areas to do that, too.

We hear a lot about entrepreneurs, young people and the technology itself, but schools and libraries seem really important to me. How will existing institutions be part of the future that you're funding and building?

Michael Maness: One of the things that we're doing is moving into more "prototyping" types of grants and then finding ways of scaling those out, helping get ideas into a proof-of-concept phase so users kick the tires and look for scaling afterward.

In terms of the institutions, one of the things that we've seen that's been a bit of a frustration point is making sure that when we have innovations, [we're] finding the best ways to parlay those into absorption in these kinds of institutions.

A really good standout for that, from a couple years ago as a News Challenge winner, is DocumentCloud, which has been adopted by a lot of the larger legacy media institutions. From a university standpoint, we know one of the things that is key is getting involvement with students as practitioners. They're trying these things out and they're doing the two kinds of modeling that we're talking about. They're using the newest tools in the curriculum.

That's one of the reasons we made the grant [to Columbia]. They have a good track record. The other reason is that you have a real practitioner there with Emily Bell, doing all of her digital work from The Guardian and really knowing how to implement understandings and new ways of reporting. She's been vital. We see her as someone who has lived in an actual newsroom, pulling in those digital projects and finding new ways for journalists to implement them.

The other aspect is that there are just a lot of unknowns in this space. As we move forward, using these new tools for data visualization, for database reporting, what are the things that work? What are the things that are hard to do? What are the ideas that make the most impact? What efficiencies can we find to help newsrooms do it? We didn't really have a great body of knowledge around that, and that's one of the things that's really exciting about the project at Columbia.

How will you make sure the results of the research go beyond Columbia's ivy-covered walls?

Michael Maness: That was a big thing that we talked about, too, because it's not in us to do a lot of white papers around something like this. It doesn't really disseminate. A lot of this grant is around making sure that there are convocations.

We talk a lot about the creation of content objects. If you're studying data visualization, we should be making sure that we're producing that as well. This will be something that's ongoing and emerging. Definitely, a part of it is that some of these resources will go to hold gatherings, to send people out from Columbia to disseminate [research] and also to produce findings in a way that can be moved very easily around a digital ecosystem.

We want to make sure that you're running into this work a lot. This is something that we've baked into the grant, and we're going to be experimenting with, I think, as it moves forward. But I hear you, that if we did all of this — and it got captured behind ivy walls — it's not beneficial to the industry.


May 22 2012

Data journalism research at Columbia aims to close data science skills gap

Successfully applying data science to the practice of journalism requires more than providing context and finding clarity in vast amounts of unstructured data: it will require media organizations to think differently about how they work and who they venerate. It will mean evolving toward a multidisciplinary approach to delivering stories, where reporters, videographers, news application developers, interactive designers, editors and community moderators collaborate on storytelling, instead of being segregated by departments or buildings.

The role models for this emerging practice of data journalism won't be found on broadcast television or on the lists of the top journalists over the past century. They're drawn from the increasing pool of people who are building new breeds of newsrooms and extending the practice of computational journalism. They see the reporting that provisions their journalism as data, a body of work that can itself be collected, analyzed, shared and used to create longitudinal insights about the ways that society, industry or government are changing. (Or not, as the case may be.)

In a recent interview, Emily Bell (@EmilyBell), director of the Tow Center for Digital Journalism at the Columbia University School of Journalism, offered her perspective about what's needed to train the data journalists of the future and the changes that still need to occur in media organizations to maximize their potential. In this context, while the roles of institutions and journalism education are themselves evolving, both will still fundamentally matter for "what's next," as practitioners adapt to changing newsonomics.

Our discussion took place in the context of a notable investment in the future of data journalism: a $2 million research grant to Columbia University from the Knight Foundation to research and distribute best practices for digital reportage, data visualizations and measuring impact. Bell explained on the Knight Foundation's blog how the research effort will help newsrooms determine what's next:

The knowledge gap that exists between the cutting edge of data science, how information spreads, its effects on people who consume information and the average newsroom is wide. We want to encourage those with the skills in these fields and an interest and knowledge in journalism to produce research projects and ideas that will both help explain this world and also provide guidance for journalism in the tricky area of ‘what next’. It is an aim to produce work which is widely accessible and immediately relevant to both those producing journalism and also those learning the skills of journalism.

We are focusing on funding research projects which relate to the transparency of public information and its intersection with journalism, research into what might broadly be termed data journalism, and the third area of ‘impact’ or, more simply put, what works and what doesn’t.

Our interview, lightly edited for content and clarity, follows.

What did you do before you became director of the Tow Center for Digital Journalism?

I spent ten years as editor-in-chief of The Guardian website. During the last four of those, I was also overall director of digital content for all The Guardian properties. That included things like mobile applications, et cetera, but from the editorial side.

Over the course of that decade, you saw one or two things change online, in terms of what journalists could do, the tools available to them and the news consumption habits of people. You also saw the media industry change, in terms of the business models and institutions that support journalism as we think of it. What are the biggest challenges and opportunities for the future of journalism?

For newspapers, there was an early warning system: newspaper circulation has not really risen consistently since the early 1980s. We had a long trajectory of increased production and, actually, an overall systemic decline that was masked by a very, very healthy advertising market, which went on an incredible bull run, and by a "widen the pipe" mentality that I think fooled a lot of journalism outlets and publishers into thinking that that was the real disruption.

And, of course, it wasn’t.

The real disruption was the ability of anybody anywhere to upload multimedia content and share it with anybody else who was on a connected device. That was the thing that really hit hard, when you look at 2004 onwards.

What journalism has to do is reinvent its processes, its business models and its skillsets to function in a world where human capital does not scale well, in terms of sifting, presenting and explaining all of this information. That’s really the key to it.

The skills that journalists need to do that -- including identifying a story, knowing why something is important and putting it in context -- are incredibly important. But how you do that, and which particular elements you now use to tell that story, are changing.

Those now include the skills of understanding the platform that you’re operating on and the technologies which are shaping your audiences’ behaviors and the world of data.

By data, I don’t just mean large caches of numbers you might be given or might be released by institutions: I mean that the data thrown off by all of our activity, all the time, is simply transforming the speed and the scope of what can be explained and reported on and identified as stories at a really astonishing speed. If you don’t have the fundamental tools to understand why that change is important and you don’t have the tools to help you interpret and get those stories out to a wide public, then you’re going to struggle to be a sustainable journalist.

The challenge for sustainable journalism going forward is not so different from what exists in other industries: there's a skills gap. Data scientists and data journalists use almost the exact same tools. What are the tools and skills that are needed to make sense of all of this data that you talked about? What will you do to catalog and educate students about them?

It's interesting when you say that the skills of those two disciplines are very similar, which is absolutely right. First of all, you need a basic level of numeracy -- and maybe not just a basic level, but a more sophisticated understanding of statistical analysis. That's not something which is routinely taught in journalism schools, but I think it increasingly will have to be.

The second thing is having some coding skills or some computer science understanding to help with identifying the best, most efficient tools and the various ways that data is manipulated.

The third thing is that when you're talking about 'data scientists,' it's really a combination of those skills. Adding data doesn't mean you no longer need the other journalism skills, which do not change: understanding context, understanding what the story might be, and knowing how to derive that from the data that you're given or the data that exists. If it's straightforward, how do you collect it? How do you analyze it? How do you interpret and present it?

It’s easy to say, but it’s difficult to do. It’s particularly difficult to reorient the skillsets of an industry which have very much resided around the idea of a written story and an ability with editing. Even in the places where I would say there’s sophisticated use of data in journalism, it’s still a minority sport.

I’ve talked to several heads of data in large news organizations and they’ve said, “We have this huge skills gap because we can find plenty of people who can do the math; we can find plenty of people who are data scientists; we can’t find enough people who have those skills but also have a passion or an interest in telling stories in a journalistic context and making those relatable.”

You need a mindset which is about putting this in the context of the story and spotting stories, as well as having creative and interesting ideas about how you can actually collect this material for your own stories. It's not a passive kind of processing function if you're a data journalist: it's an active seeking, inquiring and discovery process. I think that that's something which is actually available to all journalists.

Think about just local information and how local reporters go out and speak to people every day on the beat, collect information, et cetera. At the moment, most don't structure the information they get from those entities in a way that will help them find patterns and build new stories in the future.

This is not just about an amazing graphic that the New York Times does with census data over the past 150 years. This is about almost every story. Almost every story has some component of reusability or a component where you can collect the data in a way that helps your reporting in the future.

To do that requires a level of knowledge about the tools that you’re using, like coding, Google Refine or Fusion Tables. There are lots of freely available tools out there that are making this easier. But, if you don’t have the mindset that approaches, understands and knows why this is going to help you and make you a better reporter, then it’s sometimes hard to motivate journalists to see why they might want to grab on.
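To make "collect the data in a way that helps your reporting in the future" concrete, here is a minimal, hypothetical sketch of that kind of structured record-keeping using Python's standard csv module. The field names and the example entry are invented for illustration; the point is the consistent structure, which is what later lets a reporter filter the log and spot patterns.

```python
import csv
import os
from datetime import date

# Hypothetical field names; the point is consistent structure, not these columns.
FIELDS = ["date", "neighborhood", "source_type", "topic", "notes"]

def log_entry(path, neighborhood, source_type, topic, notes):
    """Append one structured record to a reporter's running beat log."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header only for a brand-new file
        writer.writerow({
            "date": date.today().isoformat(),
            "neighborhood": neighborhood,
            "source_type": source_type,
            "topic": topic,
            "notes": notes,
        })

# Invented example entry:
log_entry("beat_log.csv", "Red Hook", "resident", "flooding",
          "Basement floods after every heavy rain; started around 2010.")
```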

The other thing to say, which is really important, is there is currently a lack of both jobs and role models for people to point to and say, “I want to be that person.”

I think the final thing I would say to the industry is we're getting a lot of smart journalists now. We are one of the schools where all of the digital concentrations for students this year include a basic grounding in data journalism. Every single one of them. We have an advanced course taught by Susan McGregor in data visualization. But we're producing people from the school now, who are being hired to do these jobs, and the people who are hiring them are saying, "Write your own job description because we know we want you to do something, we just don't quite know what it is. Can you tell us?"

You can’t cookie-cutter these people out of schools and drop them into existing roles in news trends because those are still developing. What we’re seeing are some very smart reporters with data-centric mindsets and also the ability to do these stories -- but they want to be out reporting. They don’t want to be confined to a desk and a spreadsheet. Some editors usually find that very hard to understand, “Well, what does that job look like?”

I think that this is where working with the industry, we can start to figure some of these things out, produce some experimental work or stories, and do some of the thinking in the classroom that helps people figure out what this whole new world is going to look like.

What do journalism schools need to do to close this 'skills gap?' How do they need to respond to changing business models? What combination of education, training and hands-on experience must they provide?

One of the first things they need to do is identify the problem clearly and be honest about it. I like to think that we’ve done that at Columbia, although I’m not a data journalist. I don’t have a background in it. I’m a writer. I am, if you like, completely the old school.

But one of the things I did do at The Guardian was help people who, early on, said to me, "Some of this transformation means that we have to think about data as being a core part of what we do." Because of the political context and the position I was in, I was able to recognize that that was an important thing that they were saying and we could push through changes and adoption in those areas of the newsroom.

That’s how The Guardian became interested in data. It’s the same in journalism school. One of the early things that we talked about [at Columbia] was how we needed to shift some of what the school did on its axis and acknowledge that this was going to be key part of what we do in the future. Once we acknowledged that that is something we had to work towards, [we hired] Susan McGregor from the Wall Street Journal’s Interactive Team. She’s an expert in data journalism and has an MA in technology in education.

If you say to me, "Well, what's the grand vision here?" I would say the same thing I would say to anybody: over time, and hopefully not too long a course of time, we want to attract a type of student that is interested and capable in this approach. That means getting out and motivating and talking to people. It means producing attractive examples which high school children and undergraduate programs think about [in their studies]. It means talking to the CS [computer science] programs -- and, in fact, talking more to those programs and math majors than to the liberal arts professors or the historians or the lawyers or the people who have traditionally been involved.

I think that has an effect: it starts to show people who are oriented towards storytelling, but have capabilities which align more with data science skill sets, that there's a real task for them. We can't message that early enough as an industry. We can't message it early enough as educators to get people into those tracks. We have to really make sure that the teaching is high quality and that we're not just carried away with the idea of the new thing; we need to think pretty deeply about how we get those skills.

What sort of basic statistical teaching do you need? What are the skills you need for data visualization? How do you need to introduce design as well as computer science skills into the classroom, in a way which makes sense for stories? How do you tier that understanding?

You're always going to produce superstars. Hopefully, we’ll be producing superstars in this arena soon as well.

We need to take the mission seriously. Then we need to build resources around it. And that’s difficult for educational organizations because it takes time to introduce new courses. It takes time to signal that this is something you think is important.

I think we’ve done a reasonable job of that so far at Columbia, but we’ve got a lot further to go. It's important that institutions like Columbia do take the lead and demonstrate that we think this is something that has to be a core curriculum component.

That’s hard, because journalism schools are known for producing writers. They’re known for different types of narratives. They are not necessarily lauded for producing math or computer science majors. That has to change.


May 15 2012

Profile of the Data Journalist: The Data News Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context and clarity and, perhaps most important, to find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society. (You can learn more about this world and the emerging leaders of this discipline in the newly released "Data Journalism Handbook.")

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

John Keefe (@jkeefe) is a senior editor for data news and journalism technology at WNYC public radio, based in New York City, NY. He attracted widespread attention when an online map he built using available data beat the Associated Press with Iowa caucus results earlier this year. He's posted numerous tutorials and resources for budding data journalists, including how to map data onto county districts, use APIs, create news apps without a backend content management system and make election results maps. As you'll read below, Keefe is a great example of a journalist who picked up these skills from the data journalism community and the Hacks/Hackers group.

Our interview follows, lightly edited for content and clarity. (I've also added a Twitter list of data journalists from the New York Times' Jacob Harris.)

Where do you work now? What is a day in your life like?

I work in the middle of the WNYC newsroom -- quite literally. So throughout the day, I have dozens of impromptu conversations with reporters and editors about their ideas for maps and data projects, or answering questions about how to find or download data.

Our team works almost entirely on "news time," which means our creations hit the Web in hours and days more often than weeks and months. So I'm often at my laptop creating or tweaking maps and charts to go with online stories. That said, Wednesday mornings it's breakfast at a Chelsea cafe with collaborators at Balance Media to update each other on longer-range projects and tools we make for the newsroom and then open source, like Tabletop.js and our new vertical timeline.

Then there are key meetings, such as the newsroom's daily and weekly editorial discussions, where I look for ways to contribute and help. And because there's a lot of interest and support for data news at the station, I'm also invited to larger strategy and planning meetings.

How did you get started in data journalism? Did you get any special degrees or certificates?

I've been fascinated with the intersection of information, design and technology since I was a kid. In the last couple of years, I've marveled at what journalists at the New York Times, ProPublica and the Chicago Tribune were doing online. I thought the public radio audience, which includes a lot of educated, curious people, would appreciate such data projects at WNYC, where I was news director.

Then I saw that Aron Pilhofer of the New York Times would be teaching a programming workshop at the 2009 Online News Association annual meeting. I signed up. In preparation, I installed Django on my laptop and started following the beginner's tutorial on my subway commute. I made my first "Hello World!" web app on the A Train.
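For readers curious what that first step looks like, a "Hello World!" web app in Django really is only a few lines. This is a generic sketch of the pattern using current Django conventions (the 2009-era tutorial Keefe followed used older URL syntax), not his actual app.

```python
# views.py -- a single Django view that returns a plain-text response
from django.http import HttpResponse

def hello(request):
    return HttpResponse("Hello World!")


# urls.py -- route the site root to that view (in a real project this file
# would import the view, e.g. `from .views import hello`)
from django.urls import path

urlpatterns = [
    path("", hello),
]
```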

I also started hanging out at Hacks/Hackers meetups and hackathons, where I'd watch people code and ask questions along the way.

Some of my experimentation made it onto WNYC's website -- including our 2010 Census maps and the NYC Hurricane Evacuation map ahead of Hurricane Irene. Shortly thereafter, WNYC management asked me to focus on it full-time.

Did you have any mentors? Who? What were the most important resources they shared with you?

I could not have done so much so fast without kindness, encouragement and inspiration from Pilhofer at the Times; Scott Klein, Al Shaw, Jennifer LaFleur and Jeff Larson at ProPublica; Chris Groskopf, Joe Germuska and Brian Boyer at the Chicago Tribune; and Jenny 8. Lee of, well, everywhere.

Each has unstuck me at various key moments and all have demonstrated in their own work what amazing things were possible. And they have put a premium on sharing what they know -- something I try to carry forward.

The moment I may remember most was at an afternoon geek talk aimed mainly at programmers. After seeing a demo of Twilio, a web API for phone calls and text messages, I turned to Al Shaw, sitting next to me, and lamented that I had no idea how to play with such things.

"You absolutely can do this," he said.

He encouraged me to pick up Sinatra, a surprisingly easy way to use the Ruby programming language. And I was off.

What does your personal data journalism "stack" look like? What tools could you not live without?

Google Maps - Much of what I can turn around quickly is possible because of Google Maps. I'm also experimenting with MapBox and Geocommons for more data-intensive mapping projects, like our NYC diversity map.

Google Fusion Tables - Essential for my wrangling, merging and mapping of data sets on the fly.

Google Spreadsheets - These have become the "backend" to many of our data projects, giving reporters and editors direct access to the data driving an application, chart or map. We wire them to our apps using Tabletop.js, an open-source JavaScript library we helped to develop. (A sketch of this spreadsheet-as-backend pattern follows this list.)

TextMate - A programmer's text editor for Mac. There are several out there, and some are free. TextMate is my fave.

The JavaScript Tools Bundle for TextMate - It checks my JavaScript code every time I save, flagging near-invisible, infuriating errors such as a stray comma or a missing parenthesis. I'm certain this one piece of software has given me more days with my kids.

Firebug for Firefox - Lets you see what your code is doing in the browser. Essential for troubleshooting CSS and JavaScript, and great for learning how the heck other people make cool stuff.

Amazon S3 - Most of what we build are static pages of HTML and JavaScript, which we host in the Amazon cloud and embed into article pages on our CMS.

census.ire.org - A fabulous, easy-to-navigate presentation of US Census data made by a bunch of journo-programmers for Investigative Reporters and Editors. I send someone there probably once a week.
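The "Google Spreadsheets as backend" pattern above is worth spelling out. Tabletop.js does the work in the browser; the sketch below shows the same idea in Python, assuming a sheet that has been published to the web with CSV output enabled. The URL and column handling here are illustrative assumptions, not WNYC's setup.

```python
import csv
import io
import urllib.request

# Hypothetical published-sheet URL; a real one comes from Google Sheets'
# "Publish to the web" option with CSV output selected.
SHEET_CSV_URL = "https://docs.google.com/spreadsheets/d/EXAMPLE_KEY/pub?output=csv"

def fetch_rows(url=SHEET_CSV_URL):
    """Download the published spreadsheet and return a list of dicts, one per
    row, keyed by the header row -- roughly what Tabletop.js hands to its
    callback on the JavaScript side."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

if __name__ == "__main__":
    for row in fetch_rows():
        print(row)
```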

What data journalism project are you the most proud of working on or creating?

I'd have to say our GOP Iowa Caucuses feature. It has several qualities I like:

  • Mashed-up data -- It mixes live county vote results with Patchwork Nation community types.
  • A new take -- We knew other news sites would shade Iowa's counties by the winner; we shaded them by community type and showed who won which categories.
  • Complete shareability -- We made it super-easy for anyone to embed the map into their own site, which was possible because the results came license-free from the state GOP via Google.
  • Key code from another journalist -- The map-rollover coolness comes from code built by Albert Sun, then of the Wall Street Journal and now at the New York Times.
  • Rapid learning -- I taught myself a LOT of JavaScript quickly.
  • Reusability -- We reused the map for each state's results until Santorum bowed out.


Bonus: I love that I made most of it sitting at my mom's kitchen table over winter break.

Where do you turn to keep your skills updated or learn new things?

WNYC's editors and reporters. They have the bug, and they keep coming up with new and interesting projects. And I find project-driven learning is the most effective way to discover new things. New York Public Radio -- which runs WNYC along with classical radio station WQXR, New Jersey Public Radio and a street-level performance space -- also has a growing stable of programmers and designers, who help me build things, teach me amazing tricks and spot my frequent mistakes.

The IRE/NICAR annual conference. It's a meetup of the best journo-programmers in the country, and it truly seems each person is committed to helping others learn. They're also excellent at celebrating the successes of others.

Twitter. I follow a bunch of folks who seem to tweet the best stuff, and try to keep a close eye on 'em.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Candidates, companies, municipalities, agencies and non-profit organizations all are using data. And a lot of that data is about you, me and the people we cover.

So first off, journalism needs an understanding of the data available and what it can do. It's just part of covering the story now. To skip that part of the world would shortchange our audience, and our democracy. Really.

And the better we can both present data to the general public and tell data-driven (or -supported) stories with impact, the better we can do great journalism.

May 11 2012

Four short links: 11 May 2012

  1. Stanford Med School Contemplates Flipped Classroom -- the real challenge isn't sending kids home with videos to watch, it's using tools like OceanBrowser to keep on top of what they're doing. Few profs at universities have cared whether students learned or not.
  2. Inclusive Tech Companies Win The Talent War (Gina Trapani) -- she speaks the truth, and gently. The original CNN story flushed out an incredible number of vitriolic commenters apparently lacking the gene for irony.
  3. Buyers and Sellers Guide to Web Design and Development Firms (Lance Wiggs) -- great idea, particularly "how to be a good client". There are plenty of dodgy web shops, but more projects fail because of the clients than many would like to admit.
  4. What Does It Mean to Say That Something Causes 16% of Cancers? (Discover Magazine) -- hey, all you infographic jockeys with your aspirations to add Data Scientist to your business card: read this and realize how hard it is to make sense of a lot of numbers and then communicate that sense. Data Science isn't about Hadoop any more than Accounting is about columns. Both try to tell a story (the original meaning of your company's "accounts") and what counts is the informed, disciplined, honest effort of knowing that your story is honest.

April 23 2012

Big data in Europe

The worldwide Big Data Week kicks off today with gatherings in the U.K., U.S., Germany, Finland and Australia. As part of their global focus, Big Data Week founder/organizer Stewart Townsend (@stewarttownsend) of DataSift, and Carlos Somohano (@ds_ldn), founder of the Data Science London community and the Data Science Hackathon, have been tracking big data in Europe. This is an area that we're exploring here at Radar and through October's Strata Conference in London, so I asked Townsend and Somohano to share their perspectives on the European data scene. They combined their thoughts in the following Q&A.

Are the U.S. and Europe at similar stages in big data growth, adoption and application?

Townsend and Somohano: Based on our experience across industry verticals and markets in Europe, we believe the U.S. is leading the way in adopting so-called "big data." The U.K. is perhaps the European leader, where the level of adoption is picking up quite quickly, although still lagging behind the U.S.

In Germany, surprisingly, many organizations are not adopting big data as quickly as in the U.S. and U.K. markets. In some southern European markets, big data is still quite a new concept. Practically speaking, it's not on the radar there yet.

Part of our organizational mission at Big Data Week is to promote big data across the whole European ecosystem and act as messengers of the benefits of adopting big data early.

What are the key differences between big data in the U.S. and Europe?

Townsend and Somohano: Many large organizations in Europe are still in the early stages of the adoption cycle. This is perhaps due to the level of confusion around aspects like terminology (i.e. what do we mean by "big data"?) and the acute lack of skills around "new" big data technologies.

Today, most of the new developments and technologies around big data come from the U.S., and this presents somewhat of a challenge for some European organizations as they try to keep up with the rapid level of change. Perhaps the speed of both the decision-making cycle and the organizational change in most European companies is a bit slower than in their U.S. counterparts, and this may be reflected in the adoption of big data. We also see that the concept of data science and the role of the data scientist are well adopted in many U.S. companies, and not so much in Europe.

Where is Europe excelling with big data?

Townsend and Somohano: The financial services sector — particularly investment banking and trading in London — is one of the early adopters of big data. In this sector, both the experience and expertise in big data is on par with big data leaders in the U.S. Equally, the level of investment in big data in this sector — despite the economic downturn — is healthy, with a positive outlook.

Technology startups are also becoming European leaders in big data, primarily around London, and to a lesser degree in Berlin. There is a relatively large number of startups that are not only adopting but developing new business models and value from big data in social media and business-to-consumer services.

Finally, some verticals like oil and gas, utilities, and manufacturing are increasingly adopting big data in areas like sensor data, telemetry, and operational data streams. Our research indicates that retail is perhaps a late adopter.

The U.K. government has a number of robust open data initiatives. How are those shaping the big data space?

Townsend and Somohano: Quite notably, the U.K. government is becoming one of the leaders in open data, although perhaps not so in big data. This is perhaps due to the fact that the key drivers of open data initiatives are mainly information transparency, information privacy, information accessibility for citizens, and European regulatory changes. It's also worth mentioning that the U.K. government has been involved in very large-scale IT projects in the past (e.g. NHS IT), which could qualify as big data program initiatives. For diverse reasons, these projects were not successful and experienced massive budget overruns and delays. We believe this could be a factor in the U.K. government not focusing its main effort on big data for now. However, the open data initiative is driving the open release of massive public datasets, which eventually will require a strategic big data approach from the U.K. government.

What's the state of data science in Europe?

Townsend and Somohano: Similarly to the status of big data, Europe lags behind in the adoption levels of data science. Even in the U.K. — an early adopter in many ways — data science is still viewed in some sectors with a certain dose of skepticism. That's because data science is still understood as a new name for practices like business intelligence or analytics.

Additionally, in many large organizations in Europe the role of the data scientist is still not associated with a clear job description from the business and HR perspectives — and even in IT in some cases. Contrary to the corporate environment, where the data scientist role is still not fully recognized, in the European startup scene there is a healthy and vibrant data science community. This is perhaps best exemplified by our organization Data Science London. In its mission to promote data science and the data scientist role, our community is dedicated to the free, open dissemination of data science concepts and ideas.

Big Data Week is being held April 23-28. What are you hoping the event yields?

Townsend and Somohano: The main goal of Big Data Week is to bring together the big data communities around the world by hosting a series of events and activities to promote big data. We aim to spread the knowledge and understanding of big data challenges and opportunities from the technology, business, and commercial perspectives.


Big Data Week will feature the Data Science Hackathon, a 24-hour in-person/online event organized by Data Science London. Full Hackathon details are available here.


This interview was edited and condensed.


April 10 2012

Open source is interoperable with smarter government at the CFPB

When you look at the government IT landscape of 2012, federal CIOs are being asked to address a lot of needs. They have to accomplish their missions. They need to be able to scale initiatives to tens of thousands of agency workers. They're under pressure to address not just network security but web security and mobile device security. They also need to be innovative, because all of this is supported by the same or less funding. These are common requirements in every agency.

As the first federal "start-up agency" in a generation, some of those needs at the Consumer Financial Protection Bureau (CFPB) are even more pressing. On the other hand, the opportunity for the agency to be smarter, leaner and "open from the beginning" is also immense.

Progress establishing the agency's infrastructure and culture over the first 16 months has been promising, save for the larger context of getting a director at the helm. Enabling open government by design isn't just a catchphrase at the CFPB. There has been a bold vision behind the CFPB from the outset: a 21st century regulator that would leverage new technologies to find problems in the economy before the next great financial crisis escalates.

In the private sector, there's great interest right now in finding actionable insight in large volumes of data. Making sense of big data is increasingly being viewed as a strategic imperative in the public sector as well. Recently, the White House put its stamp on that reality with a $200 million big data research and development initiative, including a focus on improving the available tools. There's now an entire ecosystem of software around Hadoop, which is itself open source code. The problem that now exists in many organizations, across the public and private sectors, is not so much that the technology to manipulate big data isn't available: it's that the expertise to apply it doesn't exist in-house. The data science talent shortage is real.

People who work and play in the open source community understand the importance of sharing code, especially when that action leads to improving the code base. That's not necessarily an ethic or a perspective that has been pervasive across the federal government. That does seem to be slowly changing, with leadership from the top: the White House used Drupal for its site and has since contributed modules back into the open source community, including one that helps with 508 compliance.

In an in-person interview last week, CFPB CIO Chris Willey (@ChrisWilleyDC) and acting deputy CIO Matthew Burton (@MatthewBurton) sat down to talk about the agency's new open source policy, government IT, security, programming in-house, the myths around code-sharing, and big data.

The fact that this government IT leadership team is strongly supportive of sharing code back to the open source community is probably the most interesting part of this policy, as Scott Merrill picked up in his post on the CFPB and Github.

Our interview follows.

In addition to being the leader of the CFPB's development team over the past year and a half, Burton was just named acting deputy chief information officer. What will that mean?

Willey: He hasn't been leading the software development team the whole time. In fact, we only really had an org chart as of October. In the time that he's been here, Matt has led his team to some amazing things. We're going to talk about one of them today, but we've also got a great intranet. We've got some great internal apps that are being built and that we've built. We've unleashed one version of the supervision system that helps bank examiners do their work in the field. We've got a lot of faith he's going to do great things.

What it actually means is that he's going to be backing me up as CIO. Even though we're a fairly small organization, we have an awful lot going on. We have 76 active IT projects, for example. We're just building a team. We're actually doubling in size this fiscal year, from about 35 staff to 70, as well as adding lots of contractors. We're just growing the whole pie. We've got 800 people on board now. We're going to have 1,100 on board in the whole bureau by the end of the fiscal year. There's a lot happening, and I recognize we need to have some additional hands and brain cells helping me out.

With respect to building an internal IT team, what's the thinking behind having technical talent inside of an agency like this one? What does that change, in terms of your relationship with technology and your capacity to work?

Burton: I think it's all about experimentation. Having technical people on staff allows an organization to do new things. I think the way most agencies work is that when they have a technical need, they don't have the technical people on staff to make it happen so instead, that need becomes larger and larger until it justifies the contract. And by then, the problem is very difficult to solve.

By having developers and designers in-house, we can constantly be addressing things as they come up. In some cases, before the businesses even know it's a problem. By doing that, we're constantly staying ahead of the curve instead of always reacting to problems that we're facing.

How do you use open source technology to accomplish your mission? What are the tools you're using now?

Willey: We're actually trying to use open source in every aspect of what we do. It's not just in software development, although that's been a big focus for us. We're trying to do it on the infrastructure side as well.

As we look at network and system monitoring, we look at the tools that help us manage the infrastructure. As I've mentioned in the past, we are 100% in the cloud today. Open source has been a big help for us in giving us the ability to manipulate those infrastructures that we have out there.

At the end of the day, we want to bring in the tools that make the most sense for the business needs. It's not about only selecting open source or having necessarily a preference for open source.

What we've seen is that over time, the open source marketplace has matured. A lot of tools that might not have been ready for prime time a year ago or two years ago are today. By bringing them into the fold, we potentially save money. We potentially have systems that we can extend. We could more easily integrate with the other things that we have inside the shop that maybe we built or maybe things that we've acquired through other means. Open source gives us a lot of flexibility because there's a lot of opportunities to do things that we might not be able to do with some proprietary software.

Can you share a couple of specific examples of open source tools that you're using and what you actually use them for within mission?

Willey: On network monitoring, for example, we're using ZFS, which is an open source monitoring tool. We've been working with Nagios as well. Nagios, we actually inherited from Treasury — and while Treasury's not necessarily known for its use of open source technologies, it uses that internally for network monitoring. Splunk is another one that we have been using for web analysis. [After the interview, Burton and Willey also shared that they built the CFPB's intranet on MediaWiki, the software that drives Wikipedia.]

Burton: On the development side, we've invested a lot in Django and WordPress. Our site is a hybrid of them. It's WordPress at its core, with Django on top of that.

In November of 2010, it was actually a few weeks before I started here, Merici [Vinton] called me and said, "Matt, what should we use for our website?"

And I said, "Well, what's it going to do?"

And she said, "At first, it's going to be a blog with a few pages."

And this website needed to be up and running by February. And there was no hosting; there was nothing. There were no developers.

So I said, "Use WordPress."

And by early February, we had our website up. I'm not sure that would have been possible if we had to go through a lengthy procurement process for something not open source.

We use a lot of jQuery. We use Linux servers. For development ops, we use Selenium and Jenkins and Git to manage our releases and source code. We actually have GitHub Enterprise, which, although not open source, is very sharing-focused. It encourages sharing internally. And we're using GitHub on the public side to share our code. It's great to have the same interface internally as we're using externally.

Developers and citizens alike can go to github.com/cfpb and see code that you've released back to the public and for other federal agencies. What projects are there?

Burton: These came up as we put basic building blocks in place. They range from code that may not strike an outside developer as that interesting but that's really useful for the government, all the way to things that we created from scratch that are very developer-focused and are going to be very useful for any developer.

On the first side of that spectrum, there's an app that we made for transit subsidy enrollment. Treasury used to manage our transit subsidy balances. That involved going to a webpage that you would print out, write on with a pen and then fax to someone.

Willey: Or scan and email it.

Burton: Right. And then, once you'd had your supervisor sign it and you'd faxed it over to someone, eventually, several weeks later, you would get your benefits. We started to take over that process, and the human resources office came to us and asked, "How can we do this better?"

Obviously, that should just be a web form that you type into, one that will auto-fill any detail it knows about you. You press submit and it goes into a database that goes directly to the DOT [Department of Transportation]. So that's what we made. We demoed that for DOT and they really liked it. USAID is also into it. It's encouraging to see that something really simple could prove really useful for other agencies.

On the other side of the spectrum, we use a lot of Django tools. As an example, we have a tool we just released through our website called "Ask CFPB." It's a Django-based question-and-answer tool.

Now, the content is managed in Django. All of the content is managed from our staging server behind the firewall. When we need to get that content, we need to get the update from staging over to production.

Before, what we had to do was pick up the entire database, copy it and then move it over to production, which was kind of a nightmare. And there was no Django tool for selectively moving data modifications.

So we sat there and we thought, "Oh, we really need something to do that, because we're going to be doing a lot of it. We can't be copying the database over every time we need to correct a copy." So two of our developers built a Django app called "Nudge." Basically, if you've ever seen a Django admin, you just go into it and see, "Hey, here's everything that's changed. What do you want to move over?"

You can pick and choose what you want to move over and, with the click of a button, it goes to production. I think that's something that every Django developer will have a use for if they have a staging server.

In a way, we were sort of surprised it didn't exist. So, we needed it. We built it. Now we're giving it back and anybody in the world can use it.
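To illustrate the underlying mechanics of that kind of selective publish (not Nudge's actual implementation, which the bureau shares on its GitHub account), here is a minimal sketch built on Django's serialization framework. The model name and the "changed since the last release" filter in the usage comment are hypothetical.

```python
# A minimal sketch of selective content promotion between environments:
# serialize only the records an editor picked on staging, then load that
# batch on production. Not Nudge's actual design -- just the Django idea.
from django.core import serializers

def export_selected(queryset, out_path="batch.json"):
    """On staging: dump only the chosen objects to a JSON fixture."""
    with open(out_path, "w") as f:
        serializers.serialize("json", queryset, stream=f)

def import_batch(in_path="batch.json"):
    """On production: load the fixture, creating or updating each object."""
    with open(in_path) as f:
        for obj in serializers.deserialize("json", f):
            obj.save()

# Hypothetical usage: push only answers edited since the last release.
# from askcfpb.models import Answer
# export_selected(Answer.objects.filter(updated__gte=last_release))
```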

You mentioned the cloud. I know that CFPB is very associated with Treasury. Are you using Treasury's FISMA moderate cloud?

Willey: We have a mix of what I would say are private and public clouds. On the public side, we're using our own cloud environments that we have established. On the private side, we are using Treasury for some of our apps. We're slowly migrating off of Treasury systems onto our own cloud infrastructure.

In the case of email, for example, we're looking at email as a service. So we'll be looking at Google, Microsoft and others just to see what's out there and what we might be able to use.

Why is it important for the CFPB to share code back to the public? And who else in the federal government has done something like this, aside from the folks at the White House?

Burton: We see it the same way that we believe the rest of the open source community sees it: The only way this stuff is going to get better and become more viable is if people share. Without that, then it'll only be hobbyists. It'll only be people who build their own little personal thing. Maybe it's great. Maybe it's not. Open source gets better by the community actually contributing to it. So it's self-interest in a lot of ways. If the tools get better, then what we have available to us gets better. We can actually do our mission better.

Using the transit subsidy enrollment application example, it's also an opportunity for government to help itself, for one agency to help another. We've created this thing. Every federal agency has a transit subsidy program. They all need to allow people to enroll in it. Therefore, it's immediately useful to any other agency in the federal government. That's just a matter of government improving its own processes.

If one group does it, why should another group have to figure it out or have to pay lots of money to have it figured out? Why not just share it internally and then everybody benefits?

Why do you think it's taken until 2012 to have that insight actually be made into reality in terms of a policy?

Burton: I think to some degree, the tools have changed. The ability to actually do this easily is a lot better now than it was even a year or two ago. Government also traditionally lags behind the private sector in a lot of ways. I think that's changing, too. With this administration in particular, I think what we've seen is that government has come a little closer to parity with the private sector, including some of the thinking around how to use technology to improve business processes. That's really exciting. And I think as a result, there are a lot of great people coming in as developers and designers who want to work in the federal government because they see that change.

Willey: It's also because we're new. There are two things behind that. First, we're able to sort of craft a technology philosophy with a modern perspective. So we can, from our founding, ask "What is the right way to do this?" Other agencies, if they want to do this, have to turn around decades of culture. We don't have that burden. I think that's a big reason why we're able to do this.

The second thing is a lot of agencies don't have the intense need that we do. We have 76 projects to do. We have to use every means available to us.

We can't say, "We're not going to use a large share of the software that's available to us." That's just not an option. We have to say, "Yes, we will consider this as a commercial good, just like any other piece of proprietary software."

In terms of the broader context for technology and policy, how does open source relate to open government?

Willey: When I was working for the District, Apps for Democracy was a big contest that we did around opening data and then asking developers to write applications using that data that could then be used by anybody. We said that the next logical step was to sort of create more participatory government. And in my mind, open sourcing the projects that we do is a way of asking the citizenry to participate in the active government.

So by putting something in the public space, somebody could pick that up. Maybe not the transit subsidy enrollment project — but maybe some other project that we've put out there that's useful outside of government as well as inside of government. Somebody can pick that code up, contribute to it and then we benefit. In that way, the public is helping us make government better.

When you have conversations around open source in government, what do you say about what it means to put your code online and to have people look at it or work on it? Can you take changes that people make to the code base to improve it and then use it yourself?

Willey: Everything that we put out there will be reviewed by our security team. The goal is that, by the time it's out there, it won't have any security vulnerabilities. If someone does discover a security vulnerability, however, we'll be sharing that code in a way that makes it much more likely that someone will point it out to us, and maybe even provide a fix, than exploit it. They wouldn't be exploiting our instance of the code; they would be working with the code on GitHub.com.

I've seen people in government with a misperception of what open source means. They hear that it's code that anyone can contribute to. I think that they don't understand that you're controlling your own instance of it. They think that anyone can come along and just write anything into your code that they like. And, of course, it's not like that.

I think as we talk more and more about this to other agencies, we might run into that, but I think it'll be good to have strong advocates in government, especially on the security side, who can say, "No, that's not the case; it doesn't work that way."

Burton: We have a firewall between our public and private Git instances as well. So even if somebody contributes code, that's also reviewed on the way in. We wouldn't implement it unless we made sure that, from a security perspective, the code was not malicious. We're taking those precautions as well.

I can't point to one specifically, but I know that there have been articles and studies done on the relative security of open source. I think the consensus in the industry is that the peer review process of open source actually helps from a security perspective. It's not that you have a chaos of people contributing code whenever they want to. It improves the process. It's like the thinking behind academic papers. You do peer review because it enhances the quality of the work. I think that's true for open source as well.

We actually want to create a community of peer reviewers of code within the federal government. As we talk to agencies, we want people to actually use the stuff we build. We want them to contribute to it. We actually want them to be a community. As each agency contributes things, the other agencies can actually review that code and help each other from that perspective as well.

It's actually fairly hard. As we build more projects, it's going to put a little bit of a strain on our IT security team, doing an extra level of scrutiny to make sure that the code going out is safe. But the only way to get there is to grow that pie. And I think by talking with other agencies, we'll be able to do that.

A classic open source koan is that "with many eyes, all bugs become shallow." In IT security, is it that with many eyes, all worms become shallow?

Burton: What the Department of Defense said was that if someone has malicious intent and the code isn't available, they'll find some way of getting the code. But if it is available and everyone has access to it, then any vulnerabilities that are there are much more likely to be corrected before they're exploited.

How do you see open source contributing to your ability to get insights from large amounts of data? If you're recruiting developers, can they actually make a difference in helping their fellow citizens?

Burton: It's all about recruiting. As we go out and we bring on data people and software developers, we're looking for that kind of expertise. We're looking for people who have worked with PostgreSQL. We're looking for people who have worked with Solr. We're looking for people who have worked with Hadoop, because then we can start to build that expertise in-house. Those tools are out there.

R is an interesting example. What we're finding is that as more people come out of academia into the professional world, they're used to using R from school. And then they have to come out and learn a different tool once they're actually working in the marketplace.

It's similar with the Mac versus the PC. You get people using the Mac in college — and suddenly they have to go to a Windows interface. Why impose that on them? If they're going to be extremely productive with a tool like R, why not allow that to be used?

We're starting to see, in some pockets of the bureau, push from the business side to actually use some of these tools, which is great. That's another change I think that's happened in the last couple of years.

Before, there would've been big resistance on that kind of thing. Now that we're getting pushed a little bit, we have to respond to that. We also think it's worth it that we do.

Related:

April 03 2012

Data's next steps

Steve O'Grady (@sogrady), a developer-focused analyst from RedMonk, views large-scale data collection and aggregation as a problem that has largely been solved. The tools and techniques required for the Googles and Facebooks of the world to handle what he calls "datasets of extraordinary sizes" have matured. In O'Grady's analysis, what hasn't matured are methods for teasing meaning out of this data that are accessible to "ordinary users."

Among the other highlights from our interview:

  • O'Grady on the challenge of big data: "Kevin Weil (@kevinweil) from Twitter put it pretty well, saying that it's hard to ask the right question. One of the implications of that statement is that even if we had perfect access to perfect data, it's very difficult to determine what you would want to ask, how you would want to ask it. More importantly, once you get that answer, what are the questions that derive from that?"
  • O'Grady on the scarcity of data scientists: "The difficulty for basically every business on the planet is that there just aren't many of these people. This is, at present anyhow, a relatively rare skill set and therefore one that the market tends to place a pretty hefty premium on."
  • O'Grady on the reasons for using NoSQL: "If you are going down the NoSQL route for the sake of going down the NoSQL route, that's the wrong way to do things. You're likely to end up with a solution that may not even improve things. It may actively harm your production process moving forward because you didn't implement it for the right reasons in the first place."

The full interview is embedded below and available here. For the entire interview transcript, click here.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Related:

March 30 2012

Automated science, deep data and the paradox of information

A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, "What's different here? What's special about these outliers and what do they tell us about our models and assumptions?”

The reason that big data proponents are so excited about the burgeoning data revolution isn't just because of the math. Don't get me wrong, the math is fun, but we're excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.

That's big data.

Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It's not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.

And therein lies the rub.

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?


(Semi)Automated science

In 2010, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in "Science" titled, "Distilling Free-Form Natural Laws from Experimental Data". The premise was simple, and it essentially boiled down to the question, "can we algorithmically extract models to fit our data?"

So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.

Their results were astounding.

In a matter of minutes the algorithm converged on Newton's second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
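To make the idea concrete, here is a toy illustration, not Schmidt and Lipson's actual algorithm (which searches over free-form symbolic expressions): even an ordinary least-squares fit can "rediscover" f = ma from noisy synthetic measurements.

```python
# Toy illustration only: a plain least-squares fit recovers f = m * a from
# noisy synthetic data. Schmidt and Lipson's method is far more general,
# searching over free-form symbolic expressions rather than a fixed model.
import numpy as np

rng = np.random.default_rng(0)
m = rng.uniform(0.5, 5.0, size=1000)           # masses
a = rng.uniform(-10.0, 10.0, size=1000)        # accelerations
f = m * a + rng.normal(scale=0.1, size=1000)   # noisy force measurements

# Fit f ~ c * (m * a); the estimated coefficient should come out near 1.
c, *_ = np.linalg.lstsq(np.column_stack([m * a]), f, rcond=None)
print(f"estimated coefficient on m*a: {c[0]:.3f}")
```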

In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in "Nature Methods" titled "Large-scale automated synthesis of human functional neuroimaging data". In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.

To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.

In other words, you type in a word such as "learning" on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.

But that's not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, "given the data that I'm observing, what is the most probable behavioral state that this brain is in?"
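In spirit, that reverse inference is an application of Bayes' rule. The sketch below uses made-up numbers rather than NeuroSynth's actual data or model.

```python
# Toy Bayesian reverse inference in the spirit of NeuroSynth, with made-up
# numbers rather than their actual data or model. Given an observed activation,
# which behavioral term is most probable?
p_activation_given_term = {"learning": 0.30, "reward": 0.55, "pain": 0.10}
p_term = {"learning": 0.40, "reward": 0.35, "pain": 0.25}  # literature base rates

evidence = sum(p_activation_given_term[t] * p_term[t] for t in p_term)
posterior = {t: p_activation_given_term[t] * p_term[t] / evidence for t in p_term}

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))  # the most probable behavioral state
```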

Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.

How many undergrads would I need to hire to read through that many papers? Any volunteers?

Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that's around 40 million person-hours dedicated to but one branch of the sciences.

Annually.

This means that in the 10 years I've been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.

So my wife and I said to ourselves, "there has to be a better way".

Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.

For example, if 10,000 papers mention "Alzheimer's disease" that also mention "dementia," then Alzheimer's disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer's and dementia, whereas there are only 14 papers that mention Alzheimer's and, for example, creativity.

From this, we built what we're calling the "cognome", a mapping between brain structure, function, and disease.
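The underlying calculation is simple co-occurrence counting. Here is a toy sketch of that scoring idea with invented paper titles; it is not the brainSCANr codebase.

```python
# Toy co-occurrence scoring with invented paper titles; not the brainSCANr code.
def association(papers, term_a, term_b):
    """Fraction of papers mentioning term_a that also mention term_b."""
    with_a = [p for p in papers if term_a in p]
    if not with_a:
        return 0.0
    return sum(term_b in p for p in with_a) / len(with_a)

papers = [
    "alzheimer's disease and dementia: a longitudinal study",
    "dementia risk factors in alzheimer's disease",
    "creativity and divergent thinking in healthy adults",
]
print(association(papers, "alzheimer's", "dementia"))    # strong association
print(association(papers, "alzheimer's", "creativity"))  # none in this toy set
```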

Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books ("culturomics"), identifying seasonality of mood from tweets, and so on.

But so what?

Deep data

What those three studies show us is that it's possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.

My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we're calling "semi-automated hypothesis generation," which is predicated on a basic "the friend of a friend should be a friend" concept.

In the example below, the neurotransmitter "serotonin" has thousands of shared publications with "migraine," as well as with the brain region "striatum." However, migraine and striatum only share 16 publications.

That's very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?

Perhaps there's a missing connection?
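A rough sketch of that kind of "friend of a friend" screen, using invented counts rather than the real publication data, might look like this:

```python
# "Friend of a friend" screen with invented counts, not the real publication
# data: if A-B and A-C are both strong but B-C is weak, flag B-C as a gap.
shared_papers = {
    ("serotonin", "migraine"): 5000,
    ("serotonin", "striatum"): 3000,
    ("migraine", "striatum"): 16,
}

def strength(a, b):
    return shared_papers.get((a, b), shared_papers.get((b, a), 0))

def missing_links(terms, strong=1000, weak=100):
    for hub in terms:
        others = [t for t in terms if t != hub]
        for i, b in enumerate(others):
            for c in others[i + 1:]:
                if (strength(hub, b) >= strong and strength(hub, c) >= strong
                        and strength(b, c) <= weak):
                    yield b, c, hub

for b, c, hub in missing_links(["serotonin", "migraine", "striatum"]):
    print(f"possible gap: {b} <-> {c} (both strongly tied to {hub})")
```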

Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren't the only stories that our data can tell us.

For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.

At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.

No big deal.

But what was cool was seeing where the outliers were. When I looked at the model's residuals, that's where I found the far more interesting story. While it's good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:

What's happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
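A sketch of that kind of residual check, with made-up counts standing in for the real trip data, might look like this:

```python
# Residual check with made-up counts standing in for the real trip data.
import numpy as np

neighborhoods = ["Marina", "SoMa", "Mission", "Sunset", "Richmond"]
men = np.array([1200, 2600, 1800, 600, 700])
women = np.array([1900, 1500, 1700, 580, 690])

slope, intercept = np.polyfit(men, women, 1)   # simple linear regression
residuals = women - (slope * men + intercept)

# The neighborhoods the model misses by the most are where the story is.
for name, r in sorted(zip(neighborhoods, residuals), key=lambda x: -abs(x[1])):
    print(f"{name:10s} residual: {r:+.0f}")
```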

The paradox of information

The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that's where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.

While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don't fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.

In 2008, psychologists David McCabe and Alan Castel published a paper in the journal "Cognition," titled, "Seeing is believing: The effect of brain images on judgments of scientific reasoning". In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.

This should cause any data scientist serious concern. In fact, I've formulated three laws of statistical analyses:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

The first law is closely related to the "bike shed effect" (also known as Parkinson's Law of Triviality) which states that, "the time spent on any item of the agenda will be in inverse proportion to the sum involved."

In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can't understand it — people will defer to expert opinion.

Such is the case with statistics.

If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, "correlation does not equal causation."

We'll go ahead and call that truism Voytek's fourth law.

But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.

But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?

The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson "automated science" research in an episode titled "Limits of Science". It turns out, a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on. They wanted to figure out the principles governing the dynamics of a single-celled bacterium. Their result?

Well sometimes the stories we tell with data ... they just don't make sense to us.

They found, "two equations that describe the data."

But they didn't know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, "the more we turn to computers with these big questions, the more they'll give us answers that we just don't understand."

So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those "things" are.

Because at some point, we'll have so much data that we'll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.

Recently, Stephen Wolfram released the results of a 20-year long experiment in personal data collection, including every keystroke he's typed and every email he's sent. In response, Robert Krulwich, the other co-host of Radiolab, concludes by saying "I'm looking at your data [Dr. Wolfram], and you know what's amazing to me? How much of you is missing."

Personally, I disagree; I believe that there's a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:

"It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works — that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it."

So go forth and create beautiful stories, my statistical friends. See you after peer-review.

Where Conference 2012 — Bradley Voytek will examine the connection between geodata and user experience through a number of sessions at O'Reilly's Where Conference, being held April 2-4 in San Francisco.

Save 20% on registration with the code RADAR20

Related:

March 28 2012

Designing great data products

By Jeremy Howard, Margit Zwemer and Mike Loukides

In the past few years, we've seen many data products based on predictive modeling. These products range from weather forecasting to recommendation engines to services that predict airline flight times more accurately than the airline itself. But these products are still just making predictions, rather than asking what action they want someone to take as a result of a prediction. Prediction technology can be interesting and mathematically elegant, but we need to take the next step. The technology exists to build data products that can revolutionize entire industries. So, why aren't we building them?

To jump-start this process, we suggest a four-step approach that has already transformed the insurance industry. We call it the Drivetrain Approach, inspired by the emerging field of self-driving vehicles. Engineers start by defining a clear objective: They want a car to drive safely from point A to point B without human intervention. Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing. Someone using Google's self-driving car is completely unaware of the hundreds (if not thousands) of models and the petabytes of data that make it work. But as data scientists build increasingly sophisticated products, they need a systematic design approach. We don't claim that the Drivetrain Approach is the best or only method; our goal is to start a dialog within the data science and business communities to advance our collective vision.

Objective-based data products

We are entering the era of data as drivetrain, where we use data not just to generate more data (in the form of predictions), but use data to produce actionable outcomes. That is the goal of the Drivetrain Approach. The best way to illustrate this process is with a familiar data product: search engines. Back in 1997, AltaVista was king of the algorithmic search world. While their models were good at finding relevant websites, the answer the user was most interested in was often buried on page 100 of the search results. Then, Google came along and transformed online search by beginning with a simple question: What is the user's main objective in typing in a search query?

The four steps in the Drivetrain Approach.

Google realized that the objective was to show the most relevant search result; for other companies, it might be increasing profit, improving the customer experience, finding the best path for a robot, or balancing the load in a data center. Once we have specified the goal, the second step is to specify what inputs of the system we can control, the levers we can pull to influence the final outcome. In Google's case, they could control the ranking of the search results. The third step was to consider what new data they would need to produce such a ranking; they realized that the implicit information regarding which pages linked to which other pages could be used for this purpose. Only after these first three steps do we begin thinking about building the predictive models. Our objective and available levers, what data we already have and what additional data we will need to collect, determine the models we can build. The models will take both the levers and any uncontrollable variables as their inputs; the outputs from the models can be combined to predict the final state for our objective.

Step 4 of the Drivetrain Approach for Google is now part of tech history: Larry Page and Sergey Brin invented the graph traversal algorithm PageRank and built an engine on top of it that revolutionized search. But you don't have to invent the next PageRank to build a great data product. We will show a systematic approach to step 4 that doesn't require a PhD in computer science.
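As a toy illustration of the sort of model that came out of step 4 (not Google's production system, which involves far more than this), here is a minimal power-iteration PageRank over a four-page link graph:

```python
# Minimal power-iteration PageRank on a toy link graph. Illustrative only; the
# production algorithm and its many refinements are far more involved.
import numpy as np

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(links)
n, damping = len(pages), 0.85

# Column-stochastic matrix: M[j, i] = 1 / outdegree(i) if page i links to j.
M = np.zeros((n, n))
for i, page in enumerate(pages):
    for target in links[page]:
        M[pages.index(target), i] = 1.0 / len(links[page])

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * M @ rank

print(dict(zip(pages, rank.round(3))))
```

Pages with more inbound links from well-linked pages end up with higher scores, which is the lever Google pulled to rank results.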

The Model Assembly Line: A case study of Optimal Decisions Group

Optimizing for an actionable outcome over the right predictive models can be a company's most important strategic decision. For an insurance company, policy price is the product, so an optimal pricing model is to them what the assembly line is to automobile manufacturing. Insurers have centuries of experience in prediction, but as recently as 10 years ago, the insurance companies often failed to make optimal business decisions about what price to charge each new customer. Their actuaries could build models to predict a customer's likelihood of being in an accident and the expected value of claims. But those models did not solve the pricing problem, so the insurance companies would set a price based on a combination of guesswork and market studies.

This situation changed in 1999 with a company called Optimal Decisions Group (ODG). ODG approached this problem with an early use of the Drivetrain Approach and a practical take on step 4 that can be applied to a wide range of problems. They began by defining the objective that the insurance company was trying to achieve: setting a price that maximizes the net-present value of the profit from a new customer over a multi-year time horizon, subject to certain constraints such as maintaining market share. From there, they developed an optimized pricing process that added hundreds of millions of dollars to the insurers' bottom lines. [Note: Co-author Jeremy Howard founded ODG.]

ODG identified which levers the insurance company could control: what price to charge each customer, what types of accidents to cover, how much to spend on marketing and customer service, and how to react to their competitors' pricing decisions. They also considered inputs outside of their control, like competitors' strategies, macroeconomic conditions, natural disasters, and customer "stickiness." They considered what additional data they would need to predict a customer's reaction to changes in price. It was necessary to build this dataset by randomly changing the prices of hundreds of thousands of policies over many months. While the insurers were reluctant to conduct these experiments on real customers, as they'd certainly lose some customers as a result, they were swayed by the huge gains that optimized policy pricing might deliver. Finally, ODG started to design the models that could be used to optimize the insurer's profit.

Drivetrain Step 4: The Model Assembly Line. Picture a Model Assembly Line for data products that transforms the raw data into an actionable outcome. The Modeler takes the raw data and converts it into slightly more refined predicted data.

The first component of ODG's Modeler was a model of price elasticity (the probability that a customer will accept a given price) for new policies and for renewals. The price elasticity model is a curve of price versus the probability of the customer accepting the policy conditional on that price. This curve moves from almost certain acceptance at very low prices to almost never at high prices.

The second component of ODG's Modeler related price to the insurance company's profit, conditional on the customer accepting this price. The profit for a very low price will be in the red by the value of expected claims in the first year, plus any overhead for acquiring and servicing the new customer. Multiplying these two curves creates a final curve that shows price versus expected profit (see Expected Profit figure, below). The final curve has a clearly identifiable local maximum that represents the best price to charge a customer for the first year.

Expected profit.
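A toy version of those two curves and their product (illustrative numbers, not ODG's models) shows how an interior maximum emerges:

```python
# Toy version of the elasticity and profit curves; illustrative numbers only,
# not ODG's models. Their product has an interior maximum, the suggested price.
import numpy as np

prices = np.linspace(200, 2000, 200)
p_accept = 1.0 / (1.0 + np.exp((prices - 900) / 150))  # acceptance probability
expected_claims, overhead = 600.0, 100.0
profit_if_accepted = prices - expected_claims - overhead

expected_profit = p_accept * profit_if_accepted
best_price = prices[np.argmax(expected_profit)]
print(f"price that maximizes expected first-year profit: ${best_price:.0f}")
```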

ODG also built models for customer retention. These models predicted whether customers would renew their policies in one year, allowing for changes in price and willingness to jump to a competitor. These additional models allow the annual models to be combined to predict profit from a new customer over the next five years.

This new suite of models is not a final answer because it only identifies the outcome for a given set of inputs. The next machine on the assembly line is a Simulator, which lets ODG ask the "what if" questions to see how the levers affect the distribution of the final outcome. The expected profit curve is just a slice of the surface of possible outcomes. To build that entire surface, the Simulator runs the models over a wide range of inputs. The operator can adjust the input levers to answer specific questions like, "What will happen if our company offers the customer a low teaser price in year one but then raises the premiums in year two?" They can also explore how the distribution of profit is shaped by the inputs outside of the insurer's control: "What if the economy crashes and the customer loses his job? What if a 100-year flood hits his home? If a new competitor enters the market and our company does not react, what will be the impact on our bottom line?" Because the simulation is at a per-policy level, the insurer can view the impact of a given set of price changes on revenue, market share, and other metrics over time.

The Simulator's result is fed to an Optimizer, which takes the surface of possible outcomes and identifies the highest point. The Optimizer not only finds the best outcomes, it can also identify catastrophic outcomes and show how to avoid them. There are many different optimization techniques to choose from (see sidebar, below), but it is a well-understood field with robust and accessible solutions. ODG's competitors use different techniques to find an optimal price, but they are shipping the same overall data product. What matters is that using a Drivetrain Approach combined with a Model Assembly Line bridges the gap between predictive models and actionable outcomes. Irfan Ahmed of CloudPhysics provides a good taxonomy of predictive modeling that describes this entire assembly line process:

"When dealing with hundreds or thousands of individual components models to understand the behavior of the full-system, a 'search' has to be done. I think of this as a complicated machine (full-system) where the curtain is withdrawn and you get to model each significant part of the machine under controlled experiments and then simulate the interactions. Note here the different levels: models of individual components, tied together in a simulation given a set of inputs, iterated through over different input sets in a search optimizer."



Sidebar: Optimization in the real world

Optimization is a classic problem that has been studied by Newton and Gauss all the way up to mathematicians and engineers in the present day. Many optimization procedures are iterative; they can be thought of as taking a small step, checking our elevation and then taking another small uphill step until we reach a point from which there is no direction in which we can climb any higher. The danger in this hill-climbing approach is that if the steps are too small, we may get stuck at one of the many local maxima in the foothills, which will not tell us the best set of controllable inputs. There are many techniques to avoid this problem, some based on statistics and spreading our bets widely, and others based on systems seen in nature, like biological evolution or the cooling of atoms in glass.
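A minimal sketch of hill climbing with random restarts, one simple way of spreading our bets to avoid getting stuck on a local maximum, might look like this:

```python
# Hill climbing with random restarts on a bumpy 1-D "landscape". Restarting
# from many random points is one simple way to avoid a local maximum.
import math
import random

def elevation(x):
    return math.sin(3 * x) + 0.5 * math.sin(7 * x) - 0.01 * x ** 2

def hill_climb(x, step=0.01, iters=5000):
    for _ in range(iters):
        candidate = x + random.choice([-step, step])
        if elevation(candidate) > elevation(x):
            x = candidate
    return x

random.seed(0)
starts = [random.uniform(-5, 5) for _ in range(20)]
best = max((hill_climb(x) for x in starts), key=elevation)
print(round(best, 3), round(elevation(best), 3))
```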

Optimization

Optimization is a process we are all familiar with in our daily lives, even if we have never used algorithms like gradient descent or simulated annealing. A great image for optimization in the real world comes up in a recent TechZing podcast with the co-founders of data-mining competition platform Kaggle. One of the authors of this paper was explaining an iterative optimization technique, and the host says, "So, in a sense Jeremy, your approach was like that of doing a startup, which is just get something out there and iterate and iterate and iterate." The takeaway, whether you are a tiny startup or a giant insurance company, is that we unconsciously use optimization whenever we decide how to get to where we want to go.

Drivetrain Approach to recommender systems

Let's look at how we could apply this process to another industry: marketing. We begin by applying the Drivetrain Approach to a familiar example, recommendation engines, and then building this up into an entire optimized marketing strategy.

Recommendation engines are a familiar example of a data product based on well-built predictive models that do not achieve an optimal objective. The current algorithms predict what products a customer will like, based on purchase history and the histories of similar customers. A company like Amazon represents every purchase that has ever been made as a giant sparse matrix, with customers as the rows and products as the columns. Once they have the data in this format, data scientists apply some form of collaborative filtering to "fill in the matrix." For example, if customer A buys products 1 and 10, and customer B buys products 1, 2, 4, and 10, the engine will recommend that A buy 2 and 4. These models are good at predicting whether a customer will like a given product, but they often suggest products that the customer already knows about or has already decided not to buy. Amazon's recommendation engine is probably the best one out there, but it's easy to get it to show its warts. Here is a screenshot of the "Customers Who Bought This Item Also Bought" feed on Amazon from a search for the latest book in Terry Pratchett's "Discworld" series:

Terry Pratchett books at Amazon

All of the recommendations are for other books in the same series, but it's a good assumption that a customer who searched for "Terry Pratchett" is already aware of these books. There may be some unexpected recommendations on pages 2 through 14 of the feed, but how many customers are going to bother clicking through?
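For concreteness, here is a toy version of the "fill in the matrix" step described above, using the same A-and-B example; real collaborative filtering systems are far more sophisticated than this sketch.

```python
# Toy "fill in the matrix" step using the A-and-B example above. Real
# collaborative filtering systems are far more sophisticated than this.
purchases = {
    "A": {1, 10},
    "B": {1, 2, 4, 10},
    "C": {3, 5},
}

def jaccard(x, y):
    return len(x & y) / len(x | y) if x | y else 0.0

def recommend(customer, k=2):
    mine = purchases[customer]
    scores = {}
    for other, theirs in purchases.items():
        if other == customer:
            continue
        similarity = jaccard(mine, theirs)
        for item in theirs - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("A"))  # -> [2, 4], as in the example above
```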

Instead, let's design an improved recommendation engine using the Drivetrain Approach, starting by reconsidering our objective. The objective of a recommendation engine is to drive additional sales by surprising and delighting the customer with books he or she would not have purchased without the recommendation. What we would really like to do is emulate the experience of Mark Johnson, CEO of Zite, who gave a perfect example of what a customer's recommendation experience should be like in a recent TOC talk. He went into Strand bookstore in New York City and asked for a book similar to Toni Morrison's "Beloved." The girl behind the counter recommended William Faulkner's "Absalom, Absalom!" On Amazon, the top results for a similar query lead to another book by Toni Morrison and several books by well-known female authors of color. The Strand bookseller made a brilliant but far-fetched recommendation probably based more on the character of Morrison's writing than superficial similarities between Morrison and other authors. She cut through the chaff of the obvious to make a recommendation that will send the customer home with a new book and keep him coming back to Strand again and again in the future.

This is not to say that Amazon's recommendation engine could not have made the same connection; the problem is that this helpful recommendation will be buried far down in the recommendation feed, beneath books that have more obvious similarities to "Beloved." The objective is to escape a recommendation filter bubble, a term which was originally coined by Eli Pariser to describe the tendency of personalized news feeds to only display articles that are blandly popular or further confirm the readers' existing biases.

As with the AltaVista-Google example, the lever a bookseller can control is the ranking of the recommendations. New data must also be collected to generate recommendations that will cause new sales. This will require conducting many randomized experiments in order to collect data about a wide range of recommendations for a wide range of customers.

The final step in the drivetrain process is to build the Model Assembly Line. One way to escape the recommendation bubble would be to build a Modeler containing two models for purchase probabilities, conditional on seeing or not seeing a recommendation. The difference between these two probabilities is a utility function for a given recommendation to a customer (see Recommendation Engine figure, below). It will be low in cases where the algorithm recommends a familiar book that the customer has already rejected (both components are small) or a book that he or she would have bought even without the recommendation (both components are large and cancel each other out). We can build a Simulator to test the utility of each of the many possible books we have in stock, or perhaps just over all the outputs of a collaborative filtering model of similar customer purchases, and then build a simple Optimizer that ranks and displays the recommended books based on their simulated utility. In general, when choosing an objective function to optimize, we need less emphasis on the "function" and more on the "objective." What is the objective of the person using our data product? What choice are we actually helping him or her make?

Recommendation Engine.
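A sketch of that utility calculation, with made-up probabilities standing in for the two purchase models, might look like this:

```python
# Sketch of the utility function above: the value of showing a recommendation
# is the lift it creates, P(buy | shown) - P(buy | not shown). Probabilities
# here are made up for illustration.
candidates = {
    "next book in the same series": {"p_if_shown": 0.60, "p_anyway": 0.58},
    "book the customer rejected":   {"p_if_shown": 0.05, "p_anyway": 0.04},
    "surprising but fitting book":  {"p_if_shown": 0.25, "p_anyway": 0.02},
}

def utility(c):
    return c["p_if_shown"] - c["p_anyway"]

ranked = sorted(candidates, key=lambda name: utility(candidates[name]), reverse=True)
for name in ranked:
    print(f"{name}: utility {utility(candidates[name]):+.2f}")
```

The surprising but fitting book wins, even though its raw purchase probability is lower, because the recommendation is what creates the sale.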

Optimizing lifetime customer value

This same systematic approach can be used to optimize the entire marketing strategy. This encompasses all the interactions that a retailer has with its customers outside of the actual buy-sell transaction, whether making a product recommendation, encouraging the customer to check out a new feature of the online store, or sending sales promotions. Making the wrong choices comes at a cost to the retailer in the form of reduced margins (discounts that do not drive extra sales), opportunity costs for the scarce real-estate on their homepage (taking up space in the recommendation feed with products the customer doesn't like or would have bought without a recommendation) or the customer tuning out (sending so many unhelpful email promotions that the customer filters all future communications as spam). We will show how to go about building an optimized marketing strategy that mitigates these effects.

As in each of the previous examples, we begin by asking: "What objective is the marketing strategy trying to achieve?" Simple: we want to optimize the lifetime value from each customer. Second question: "What levers do we have at our disposal to achieve this objective?" Quite a few. For example:

  1. We can make product recommendations that surprise and delight (using the optimized recommendation outlined in the previous section).
  2. We could offer tailored discounts or special offers on products the customer was not quite ready to buy or would have bought elsewhere.
  3. We can even make customer-care calls just to see how the user is enjoying our site and make them feel that their feedback is valued.

What new data do we need to collect? This can vary case by case, but a few online retailers are taking creative approaches to this step. Online fashion retailer Zafu shows how to encourage the customer to participate in this collection process. Plenty of websites sell designer denim, but for many women, high-end jeans are the one item of clothing they never buy online because it's hard to find the right pair without trying them on. Zafu's approach is not to send their customers directly to the clothes, but to begin by asking a series of simple questions about the customers' body type, how well their other jeans fit, and their fashion preferences. Only then does the customer get to browse a recommended selection of Zafu's inventory. The data collection and recommendation steps are not an add-on; they are Zafu's entire business model — women's jeans are now a data product. Zafu can tailor their recommendations to fit as well as their jeans because their system is asking the right questions.

Zafu screenshots

Starting with the objective forces data scientists to consider what additional models they need to build for the Modeler. We can keep the "like" model that we have already built as well as the causality model for purchases with and without recommendations, and then take a staged approach to adding additional models that we think will improve the marketing effectiveness. We could add a price elasticity model to test how offering a discount might change the probability that the customer will buy the item. We could construct a patience model for the customers' tolerance for poorly targeted communications: When do they tune them out and filter our messages straight to spam? ("If Hulu shows me that same dog food ad one more time, I'm gonna stop watching!") A purchase sequence causality model can be used to identify key "entry products." For example, a pair of jeans that is often paired with a particular top, or the first part of a series of novels that often leads to a sale of the whole set.

Once we have these models, we construct a Simulator and an Optimizer and run them over the combined models to find out what recommendations will achieve our objectives: driving sales and improving the customer experience.

A look inside the Modeler.

Best practices from physical data products

It is easy to stumble into the trap of thinking that since data exists somewhere abstract, on a spreadsheet or in the cloud, that data products are just abstract algorithms. So, we would like to conclude by showing you how objective-based data products are already a part of the tangible world. What is most important about these examples is that the engineers who designed these data products didn't start by building a neato robot and then looking for something to do with it. They started with an objective like, "I want my car to drive me places," and then designed a covert data product to accomplish that task. Engineers are often quietly on the leading edge of algorithmic applications because they have long been thinking about their own modeling challenges in an objective-based way. Industrial engineers were among the first to begin using neural networks, applying them to problems like the optimal design of assembly lines and quality control. Brian Ripley's seminal book on pattern recognition gives credit for many ideas and techniques to largely forgotten engineering papers from the 1970s.

When designing a product or manufacturing process, a drivetrain-like process followed by model integration, simulation and optimization is a familiar part of the toolkit of systems engineers. In engineering, it is often necessary to link many component models together so that they can be simulated and optimized in tandem. These firms have plenty of experience building models of each of the components and systems in their final product, whether they're building a server farm or a fighter jet. There may be one detailed model for mechanical systems, a separate model for thermal systems, and yet another for electrical systems, etc. All of these systems have critical interactions. For example, resistance in the electrical system produces heat, which needs to be included as an input for the thermal diffusion and cooling model. That excess heat could cause mechanical components to warp, producing stresses that should be inputs to the mechanical models.

The screenshot below is taken from a model integration tool designed by Phoenix Integration. Although it's from a completely different engineering discipline, this diagram is very similar to the Drivetrain Approach we've recommended for data products. The objective is clearly defined: build an airplane wing. The wing box includes the design levers like span, taper ratio and sweep. The data is in the wing materials' physical properties; costs are listed in another tab of the application. There is a Modeler for aerodynamics and mechanical structure that can then be fed to a Simulator to produce the Key Wing Outputs of cost, weight, lift coefficient and induced drag. These outcomes can be fed to an Optimizer to build a functioning and cost-effective airplane wing.

Screenshot from a model integration tool designed by Phoenix Integration.

As predictive modeling and optimization become more vital to a wide variety of activities, look out for the engineers to disrupt industries that wouldn't immediately appear to be in the data business. The inspiration for the phrase "Drivetrain Approach," for example, is already on the streets of Mountain View. Instead of being data driven, we can now let the data drive us.

Suppose we wanted to get from San Francisco to the Strata 2012 Conference in Santa Clara. We could just build a simple model of distance / speed-limit to predict arrival time with little more than a ruler and a road map. If we want a more sophisticated system, we can build another model for traffic congestion and yet another model to forecast weather conditions and their effect on the safest maximum speed. There are plenty of cool challenges in building these models, but by themselves, they do not take us to our destination. These days, it is trivial to use some type of heuristic search algorithm to predict the drive times along various routes (a Simulator) and then pick the shortest one (an Optimizer) subject to constraints like avoiding bridge tolls or maximizing gas mileage. But why not think bigger? Instead of the femme-bot voice of the GPS unit telling us which route to take and where to turn, what would it take to build a car that would make those decisions by itself? Why not bundle simulation and optimization engines with a physical engine, all inside the black box of a car?
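A toy version of that simulate-then-optimize loop for route selection (invented routes and numbers, not a real routing engine) might look like this:

```python
# Toy simulate-then-optimize loop for route selection; routes and numbers are
# invented for illustration.
routes = [
    {"name": "US-101", "miles": 46, "avg_mph": 52, "bridge_toll": False},
    {"name": "I-280", "miles": 49, "avg_mph": 58, "bridge_toll": False},
    {"name": "I-880", "miles": 53, "avg_mph": 55, "bridge_toll": True},
]

def simulate_minutes(route):
    """Simulator: predict a drive time for one candidate route."""
    return 60.0 * route["miles"] / route["avg_mph"]

# Optimizer: the fastest route subject to a "no bridge tolls" constraint.
allowed = [r for r in routes if not r["bridge_toll"]]
best = min(allowed, key=simulate_minutes)
print(best["name"], round(simulate_minutes(best)), "minutes")
```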

Let's consider how this is an application of the Drivetrain Approach. We have already defined our objective: building a car that drives itself. The levers are the vehicle controls we are all familiar with: steering wheel, accelerator, brakes, etc. Next, we consider what data the car needs to collect; it needs sensors that gather data about the road as well as cameras that can detect road signs, red or green lights, and unexpected obstacles (including pedestrians). We need to define the models we will need, such as physics models to predict the effects of steering, braking and acceleration, and pattern recognition algorithms to interpret data from the road signs.

As one engineer on the Google self-driving car project put it in a recent Wired article, "We're analyzing and predicting the world 20 times a second." What gets lost in the quote is what happens as a result of that prediction. The vehicle needs to use a simulator to examine the results of the possible actions it could take. If it turns left now, will it hit that pedestrian? If it makes a right turn at 55 mph in these weather conditions, will it skid off the road? Merely predicting what will happen isn't good enough. The self-driving car needs to take the next step: after simulating all the possibilities, it must optimize the results of the simulation to pick the best combination of acceleration and braking, steering and signaling, to get us safely to Santa Clara. Prediction only tells us that there is going to be an accident. An optimizer tells us how to avoid accidents.

Improving the data collection and predictive models is very important, but we want to emphasize the importance of beginning by defining a clear objective with levers that produce actionable outcomes. Data science is beginning to pervade even the most bricks-and-mortar elements of our lives. As scientists and engineers become more adept at applying prediction and optimization to everyday problems, they are expanding the art of the possible, optimizing everything from our personal health to the houses and cities we live in. Models developed to simulate fluid dynamics and turbulence have been applied to improving traffic and pedestrian flows by using the placement of exits and crowd control barriers as levers. This has improved emergency evacuation procedures for subway stations and reduced the danger of crowd stampedes and trampling during sporting events. Nest is designing smart thermostats that learn the home-owner's temperature preferences and then optimize their energy consumption. For motor vehicle traffic, IBM performed a project with the city of Stockholm to optimize traffic flows that reduced congestion by nearly a quarter and improved the air quality in the inner city by 25%. What is particularly interesting is that there was no need to build an elaborate new data collection system. Any city with metered stoplights already has all the necessary information; they just haven't found a way to suck the meaning out of it.

In another area where objective-based data products have the power to change lives, the CMU extension in Silicon Valley has an active project for building data products to help first responders after natural or man-made disasters. Jeannie Stamberger of Carnegie Mellon University Silicon Valley explained to us many of the possible applications of predictive algorithms to disaster response, from text-mining and sentiment analysis of tweets to determine the extent of the damage, to swarms of autonomous robots for reconnaissance and rescue, to logistic optimization tools that help multiple jurisdictions coordinate their responses. These disaster applications are a particularly good example of why data products need simple, well-designed interfaces that produce concrete recommendations. In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes.

The future for data products

We introduced the Drivetrain Approach to provide a framework for designing the next generation of great data products and described how it relies at its heart on optimization. In the future, we hope to see optimization taught in business schools as well as in statistics departments. We hope to see data scientists ship products that are designed to produce desirable business outcomes. This is still the dawn of data science. We don't know what design approaches will be developed in the future, but right now, there is a need for the data science community to coalesce around a shared vocabulary and product design process that can be used to educate others on how to derive value from their predictive models. If we do not do this, we will find that our models only use data to create more data, rather than using data to create actions, disrupt industries and transform lives.


Do we want products that deliver data, or do we want products that deliver results based on data? Jeremy Howard examined these questions in his Strata CA 12 session, "From Predictive Modelling to Optimization: The Next Frontier." Full video from that session is embedded below:

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

Related:

March 17 2012

Profile of the Data Journalist: The Homicide Watch

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Chris Amico (@eyeseast) is a journalist and web developer based in Washington, DC, where he works on NPR's State Impact project, building a platform for local reporters covering issues in their states. Laura Norton Amico (@LauraNorton) is the editor of Homicide Watch (@HomicideWatch), an online community news platform in Washington, D.C. that aspires to cover every homicide in the District of Columbia. And yes, the similar names aren't a coincidence: the Amicos were married in 2010.

Since Homicide Watch launched in 2009, it's been earning praise and interest from around the digital world, including a profile by the Nieman Lab at Harvard University that asked whether a local blog "could fill the gaps of DC's homicide coverage." Notably, Homicide Watch has turned up a number of unreported murders.

In the process, the site has also highlighted an important emerging set of data that other digital editors should consider: using inbound search engine analytics for reporting. As Steve Myers reported for the Poynter Institute, Homicide Watch used clues in site search queries to ID a homicide victim. We'll see if the Knight Foundation thinks this idea has legs: the husband-and-wife team has applied for a Knight News Challenge grant to build a toolkit for real-time investigative reporting from site analytics.

The Amicos' success with the site, which saw big growth in 2011, offers an important case study in why organizing beats may well hold similar importance to investigative projects. It will also be a case study with respect to sustainability and business models for the "new news," as Homicide Watch looks to license its platform to news outlets across the country.

Below, I've embedded a presentation on Homicide Watch from the January 2012 meeting of the Online News Association. Our interview follows.

Watch live streaming video from onlinenewsassociation at livestream.com

Where do you work now? What is a day in your life like?

Laura: I work full time right now for Homicide Watch, a database-driven beat publishing platform for covering homicides. Our flagship site is in DC, and I’m the editor and primary reporter on that site as well as running business operations for the brand.

My typical days start with reporting. First, news checks, and maybe posting some quick posts on anything that’s happened overnight. After that, it’s usually off to court to attend hearings and trials, get documents, reporting stuff. I usually have a to-do list for the day that includes business meetings, scheduling freelancers, mapping out long-term projects, doing interviews about the site, managing our accounting, dealing with awards applications, blogging about the start-up data journalism life on my personal blog and for ONA at journalists.org, guest teaching the occasional journalism class, and meeting deadlines for freelance stories. The work day never really ends; I’m online keeping an eye on things until I go to bed.

Chris: I work for NPR, on the State Impact project, where I build news apps and tools for journalists. With Homicide Watch, I work in short bursts, usually an hour before dinner and a few hours after. I’m a night owl, so if I let myself, I’ll work until 1 or 2 a.m., just hacking at small bugs on the site. I keep a long list of little things I can fix, so I can dip into the codebase, fix something and deploy it, then do something else. Big features, like tracking case outcomes, tend to come from weekend code sprints.

How did you get started in data journalism? Did you get any special degrees or certificates?

Laura: Homicide Watch DC was my first data project. I’ve learned everything I know now from conceiving of the site, managing it as Chris built it, and from working on it. Homicide Watch DC started as a spreadsheet. Our start-up kit for newsrooms starting Homicide Watch sites still includes filling out a spreadsheet. The best lesson I learned when I was starting out was to find out what all the pieces are and learn how to manage them in the simplest way possible.

Chris: My first job was covering local schools in southern California, and data kept creeping into my beat. I liked having firm answers to tough questions, so I made sure I knew, for example, how many graduates at a given high school met the minimum requirements for college. California just has this wealth of education data available, and when I started asking questions of the data, I got stories that were way more interesting.

I lived in Dalian, China for a while. I helped start a local news site with two other expats (Alex Bowman and Rick Martin). We put everything we knew about the city -- restaurant reviews, blog posts, photos from Flickr -- into one big database and mapped it all. It was this awakening moment when suddenly we had this resource where all the information we had was interlinked. When I came back to California, I sat down with a book on Python and Django and started teaching myself to code. I spent a year freelancing in the Bay Area, writing for newspapers by day, learning Python by night. Then the NewsHour hired me.

Did you have any mentors? Who? What were the most important resources they shared with you?

Laura: Chris really coached me through the complexities of data journalism when we were creating the site. He taught me that data questions are editorial questions. When I realized that data could be discussed as an editorial approach, it opened the crime beat up. I learned to ask questions of the information I was gathering in a new way.

Chris: My education has been really informal. I worked with a great reporter at my first job, Bob Wilson, who is a great interviewer of both people and spreadsheets. At NewsHour, I worked with Dante Chinni on Patchwork Nation, who taught me about reporting around a central organizing principle. Since I’ve started coding, I’ve ended up in this great little community of programmer-journalists where people bounce ideas around and help each other out.

What does your personal data journalism "stack" look like? What tools could you not live without?

Laura: The site itself and its database which I report to and from, WordPress, WordPress analytics, Google Analytics, Google Calendar, Twitter, Facebook, Storify, Document Cloud, VINElink, and DC Superior Court’s online case lookup.

Chris: Since I write more Python than prose these days, I spend most of my time in a text editor (usually TextMate) on a MacBook Pro. I try not to do anything without git.

What data journalism project are you the most proud of working on or creating?

Laura: Homicide Watch is the best thing I’ve ever done. It’s not just about the data, and it’s not just about the journalism, but it’s about meeting a community need in an innovative way. I started thinking about a Homicide Watch-type site when I was trying to follow a few local cases shortly after moving to DC. It was nearly impossible to find news sources for the information. I did find that family and friends of victims and suspects were posting newsy updates in unusual places -- online obituaries and Facebook memorial pages, for example. I thought a lot about how a news product could fit the expressed need for news, information, and a way for the community to stay in touch about cases.

The data part developed very naturally out of that. The earliest description of the site was “everything a reporter would have in their notebook or on their desk while covering a murder case from start to finish.” That’s still one of the guiding principles of the site, but it’s also meant that organizing that information is super important. What good is making court dates public if you’re not putting them on a calendar, for example?

We started, like I said, with a spreadsheet that listed everything we knew: victim name, age, race, gender, method of death, place of death, link to obituary, photo, suspect name, age, race, gender, case status, incarceration status, detective name, age, race, gender, phone number, judge assigned to case, attorneys connected to the case, co-defendants, connections to other murder cases.

And those are just the basics. Any reporter covering a murder case, crime to conviction, should have that information. What Homicide Watch does is organize it, make as much of it public as we can, and then report from it. It’s led to some pretty cool work, from developing a method to discover news tips in analytics, to simply building news packages that accomplish more than anyone else can.
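The analytics-as-tip-sheet idea is concrete enough to sketch in code. The example below is hypothetical: the file names and column names ("search_term", "name") are invented, and it is not the actual Homicide Watch tooling. It simply flags site-search queries that look like names and don't match anyone already in the case database.

    # Hypothetical sketch, not the actual Homicide Watch tooling. File names
    # and column names ("search_term", "name") are invented for illustration.
    import csv

    def load_known_names(path):
        """Names already in the case database (victims, suspects)."""
        with open(path, newline="") as f:
            return {row["name"].lower() for row in csv.DictReader(f)}

    def possible_tips(analytics_csv, known_names):
        """Yield site-search queries that look like names we haven't covered."""
        with open(analytics_csv, newline="") as f:
            for row in csv.DictReader(f):
                query = row["search_term"].strip()
                words = query.split()
                # Two capitalized words is a crude proxy for "looks like a name."
                if len(words) == 2 and all(w.istitle() for w in words):
                    if query.lower() not in known_names:
                        yield query

    known = load_known_names("victims_and_suspects.csv")
    for tip in possible_tips("site_search_queries.csv", known):
        print("Worth checking:", tip)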

Chris: Homicide Watch is really the project I wanted to build for years. It’s data-driven beat reporting, where the platform and the editorial direction are tightly coupled. In a lot of ways, it’s what I had in mind when I was writing about frameworks for reporting.

The site is built to be a crime reporter’s toolkit. It’s built around the way Laura works, based on our conversations over the dinner table for the first six months of the site’s existence. Building it meant understanding the legal system, doing reporting and modeling reality in ways I hadn’t done before, and that was a challenge on both the technical and editorial side.

Where do you turn to keep your skills updated or learn new things?

Laura: Assigning myself new projects and tasks is the best way for me to learn; it forces me to find solutions for what I want to do. I’m not great at seeking out resources on my own, but I keep a close eye on Twitter for what others are doing, saying about it, and reading.

Chris: Part of my usual morning news reading is a run through a bunch of programming blogs. I try to get exposed to technologies that have no immediate use to me, just so it keeps me thinking about other ways to approach a problem and to see what other problems people are trying to solve.

I spend a lot of time trying to reverse-engineer other people’s projects, too. Whenever someone launches a new news app, I’ll try to find the data behind it, take a dive through the source code if it’s available and generally see if I can reconstruct how it came together.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Laura: Working on Homicide Watch has taught me that news is about so much more than “stories.” If you think about a typical crime brief, for example, there’s a lot of information in there, starting with the "who-what-where-when." Once that brief is filed and published, though, all of that information disappears.

Working with news apps gives us the ability to harness that information and reuse/repackage it. It’s about slicing our reporting in as many ways as possible in order to make the most of it. On Homicide Watch, that means maintaining a database and creating features like victims’ and suspects’ pages. Those features help regroup, refocus, and curate the reporting into evergreen resources that benefit both reporters and the community.

Chris: Spend some time with your site analytics. You’ll find that there’s no one thing your audience wants. There isn’t even really one audience. Lots of people want lots of different things at different times, or at least different views of the information you have.

One of our design goals with Homicide Watch is “never hit a dead end.” A user may come in looking for information about a certain case, then decide she’s curious about a related issue, then wonder which cases are closed. We want users to be able to explore what we’ve gathered and to be able to answer their own questions. Stories are part of that, but stories are data, too.

March 14 2012

Now available: "Planning for Big Data"

Earlier this month, more than 2,500 people came together for the O'Reilly Strata Conference in Santa Clara, Calif. Though representing diverse fields, from insurance to media and high-tech to healthcare, attendees buzzed with a new-found common identity: they are data scientists. Entrepreneurial and resourceful, combining programming skills with math, data scientists have emerged as a new profession leading the march toward data-driven business.

This new profession rides on the wave of big data. Our businesses are creating ever more data, and as consumers we are sources of massive streams of information, thanks to social networks and smartphones. In this raw material lies much of value: insight about businesses and markets, and the scope to create new kinds of hyper-personalized products and services.

Five years ago, only big business could afford to profit from big data: Walmart and Google, specialized financial traders. Today, thanks to an open source project called Hadoop, commodity Linux hardware and cloud computing, this power is in reach for everyone. A data revolution is sweeping business, government and science, with consequences as far reaching and long lasting as the web itself.

Where to start?

Every revolution has to start somewhere, and the question for many is "how can data science and big data help my organization?" After years of data processing choices being straightforward, there's now a diverse landscape to negotiate. What's more, to become data driven, you must grapple with changes that are cultural as well as technological.

Our aim with Strata is to help you understand what big data is, why it matters, and where to get started. In the wake of the recent conference, we're delighted to announce the publication of our "Planning for Big Data" book. Available as a free download, the book contains the best insights from O'Reilly Radar authors over the past three months, including myself, Alistair Croll, Julie Steele and Mike Loukides.

"Planning for Big Data" is for anybody looking to get a concise overview of the opportunity and technologies associated with big data. If you're already working with big data, hand this book to your colleagues or executives to help them better appreciate the issues and possibilities.



March 08 2012

Profile of the Data Journalist: The Storyteller and The Teacher

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Sarah Cohen (@sarahduke), the Knight professor of the practice of journalism and public policy at Duke University, and Anthony DeBarros (@AnthonyDB), the senior database editor at USA Today, were both important sources of historical perspective for my feature on how data journalism is evolving from "computer-assisted reporting" (CAR) to a powerful Web-enabled practice that uses cloud computing, machine learning and algorithms to make sense of unstructured data.

The latter halves of our interviews, which focused upon their personal and professional experience, follow.

What data journalism project are you the most proud of working on or creating?

DeBarros: "In 2006, my USA TODAY colleague Robert Davis and I built a database of 620 students killed on or near college campuses and mined it to show how freshmen were uniquely vulnerable. It was a heart-breaking but vitally important story to tell. We won the 2007 Missouri Lifestyle Journalism Awards for the piece, and followed it with an equally wrenching look at student deaths from fires."

Cohen: "I'd have to say the Pulitzer-winning series on child deaths in DC, in which we documented that children were dying in predictable circumstances after key mistakes by people who knew that their agencies had specific flaws that could let them fall through the cracks.

I liked working on the Post's POTUS Tracker and Head Count. Those were Web projects that were geared at accumulating lots of little bits about Obama's schedule and his appointees, respectively, that we could share with our readers while simultaneously building an important dataset for use down the road. Some of the Post's Solyndra and related stories, I have heard, came partly from studying the president's trips in POTUS Tracker.

There was one story, called "Misplaced Trust," on DC's guardianship system, that created immediate change in Superior Court, which was gratifying. "Harvesting Cash," our 18-month project on farm subsidies, also helped point out important problems in that system.

The last one, I'll note, is a piece of a project I worked on, in which the DC water authority refused to release the results of a massive lead testing effort, which in turn had shown widespread contamination. We got the survey from a source, but it was on paper.

After scanning, parsing, and geocoding, we sent out a team of reporters to neighborhoods to spot check the data, and also do some reporting on the neighborhoods. We ended up with a story about people who didn't know what was near them.

We also had an interesting experience: the water authority called our editor to complain that we were going to put all of the addresses online -- they felt that it was violating people's privacy, even though we weren't identifying the owners or the residents. It was more important to them that we keep people in the dark about their blocks. Our editor at the time, Len Downie, said, "You're right. We shouldn't just put it on the Web." He also ordered up a special section to put them all in print.

Where do you turn to keep your skills updated or learn new things?

Cohen: "It's actually a little harder now that I'm out of the newsroom, surprisingly. Before, I would just dive into learning something when I'd heard it was possible and I wanted to use it to get to a story. Now I'm less driven, and I have to force myself a little more. I'm hoping to start doing more reporting again soon, and that the Reporters' Lab will help there too.

Lately, I've been spending more time with people from other disciplines to understand better what's possible, like machine learning and speech recognition at Carnegie Mellon and MIT, or natural language processing at Stanford. I can't DO them, but getting a chance to understand what's out there is useful. NewsFoo, SparkCamp and NICAR are the three places that had the best bang this year. I wish I could have gone to Strata, even if I didn't understand it all."

DeBarros: For surveillance, I follow really smart people on Twitter and have several key Google Reader subscriptions.

To learn, I spend a lot of time training after work hours. I've really been pushing myself in the last couple of years to up my game and stay relevant, particularly by learning Python, Linux and web development. Then I bring it back to the office and use it for web scraping and app building.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Cohen: "I think anything that gets more leverage out of fewer people is important in this age, because fewer people are working full time holding government accountable. The news apps help get more eyes on what the government is doing by getting more of what we work with and let them see it. I also think it helps with credibility -- the 'show your work' ethos -- because it forces newsrooms to be more transparent with readers / viewers.

For instance, now, when I'm judging an investigative prize, I am quite suspicious of any project that doesn't let you see each item. For example, when they say, "there were 300 cases that followed this pattern," I want to see all 300 cases, or all cases with the 300 marked, so I can see whether I agree.

DeBarros: "They're important because we're living in a data-driven culture. A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not.

As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

These interviews were edited and condensed for clarity.

Profile of the Data Journalist: The Hacks Hacker

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference. This interview followed the conference and featured a remote participant who diligently used social media and the World Wide Web to document and share the best of NICAR:

Chrys Wu (@MacDiva) is a data journalist and user engagement strategist based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work with clients through my company, Matchstrike, which specializes in user engagement strategy. It's a combination of user experience research, design and program planning. Businesses turn to me to figure out how to keep people's attention, create community and tie that back to return on investment.

I also launch Hacks/Hackers chapters around the world and co-organize the group in New York with Al Shaw of ProPublica and Jacqui Cox of The New York Times.

Both things involve seeking out people and ideas, asking questions, reading, wireframing and understanding what motivates people as individuals and as groups.

How did you get started in data journalism? Did you get any special degrees or certificates?

I had a stats class in high school with a really terrific instructor who also happened to be the varsity basketball coach. He was kind of like our John Wooden. Realizing the importance of statistics, being able to organize and interpret data — and learning how to be skeptical of claims (e.g., where "4 out of 5 dentists agree" comes from)— has always stayed with me.

Other than that class and studying journalism at university, what I know has come from exploring (finding what's out there), doing (making something) and working (making something for money). I think that's pretty similar to most journalists and journalist-developers currently in the field.

Though I've spent several years in newsrooms (most notably with the Los Angeles Times and CBS Digital Media Group), most of my journalism and communications career has been as a freelancer. One of my earliest clients specialized in fundraising for Skid Row shelters. I quantified the need cases for her proposals. That involved working closely with the city health and child welfare departments and digging through a lot of data.

Once I figured that out, it was important to balance the data with narrative. Numbers and charts have a much more profound impact on people if they're framed by an idea to latch onto and a compelling story to share.

Did you have any mentors? Who? What were the most important resources they shared with you?

I don't have individual mentors, but there's an active community with a huge body of work out there to learn from. It's one of the reasons why I've been collecting things on Delicious and Pinboard, and it's why I try my best to put everything that's taught at NICAR on my blog.

I always try to look beyond journalism to see what people are thinking about and doing in other fields. Great ideas can come from everywhere. There are lots of very smart people willing to share what they know.

What does your personal data journalism "stack" look like? What tools could you not live without?

I use Coda and TextMate most often. For wireframing, I'm a big fan of OmniGraffle. I code in Ruby, and a little bit in Python. I'm starting to learn how to use R for dataset manipulation and for its maps library.

For keeping tabs on new but not urgent-to-read material, I use my friend Samuel Clay's RSS reader, Newsblur.

What data journalism project are you the most proud of working on or creating?

I'm most proud of working with the Hacks/Hackers community. Since 2009, we've grown to more than 40 groups worldwide, with each locality bringing journalists, designers and developers together to push what's possible for news.

As I say, talking is good; making is better — and the individual Hacks/Hackers chapters have all done some version of that: presentations, demos, classes and hack days. They're all opportunities to share knowledge, make friends and create new things that help people better understand what's happening around them.

Where do you turn to keep your skills updated or learn new things?

MIT's open courses have been great. There's also blogs, mailing lists, meetups, lectures and conferences. And then there's talking with friends and people they know.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

I like Amanda Cox's view of the importance of reporting through data. She's a New York Times graphics editor who comes from a statistics background. To paraphrase: Presenting a pile of facts and numbers without directing people toward any avenue of understanding is not useful.

Journalism is fundamentally about fact-finding and opening eyes. One of the best ways to do that, especially when lots of people are affected by something, is to interweave narrative with quantifiable information.

Data journalism and news apps create the lens that shows people the big picture they couldn't see but maybe had a hunch about otherwise. That's important for a greater understanding of the things that matter to us as individuals and as a society.

This interview has been edited and condensed for clarity.

March 06 2012

Profile of the Data Journalist: The Data Editor

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Meghan Hoyer (@MeghanHoyer) is a data editor based in Virginia. Our interview follows.

Where do you work now? What is a day in your life like?

I work in an office within The Virginian-Pilot’s newsroom. I’m a one-person team, so there’s no such thing as typical.

What I might do: Help a reporter pull Census data, work with IT on improving our online crime report app, create a DataTable of city property assessment changes, and plan training for a group of co-workers who’d like to grow their online skills. At least, that’s what I’m doing today.

Tomorrow, it’ll be helping with our online election report, planning a strategy to clean a dirty database, and working with a reporter to crunch data for a crime trend story.

How did you get started in data journalism? Did you get any special degrees or certificates?

I have a journalism degree from Northwestern, but I got started the same way most reporters probably got started - I had questions about my community and I wanted quantifiable answers. How had the voting population in a booming suburb changed? Who was the region’s worst landlord? Were our localities going after delinquent taxpayers? Anecdotes are nice, but it’s an amazingly powerful thing to be able to get the true measure of a situation. Numbers and analysis help provide a better focus - and sometimes, they upend entirely your initial theory.

Did you have any mentors? Who? What were the most important resources they shared with you?

I haven’t collected a singular mentor as much as a group of people whose work I keep tabs on, for inspiration and follow-up. The news community is pretty small. A lot of people have offered suggestions, guidance, cheat sheets and help over the years. Data journalism -- from analysis to building apps -- is definitely not something you can or need to learn in a bubble all on your own.

What does your personal data journalism "stack" look like? What tools could you not live without?

In terms of daily tools, I keep it basic: Google docs, Fusion Tables and Refine, QGIS, SQLite and Excel are all in use pretty much every day.

I’ve learned some Python and JavaScript for specific projects and to automate some of the newsroom’s daily tasks, but I definitely don’t have the programming or technical background that a lot of people in this field have. That’s left me trying to learn as much as I can as quick as I can.

In terms of a data stack, we keep information such as public employee salaries, land assessment databases and court record databases (among others) updated in a shared drive in our newsroom. It’s amazing how often reporters use them, even if it’s just to find out which properties a candidate owns or how long a police officer caught at a DUI checkpoint has been on the force.

What data journalism project are you the most proud of working on or creating?

A few years ago, I combined property ownership records, code enforcement citations, real estate tax records and rental inspection information from all our local cities and found a company with hundreds of derelict properties.

Their properties seemed to change hands often, so a partner and I then hand-built a database from thousands of land deeds that proved the company was flipping houses among investors in a $26 million mortgage fraud scheme. None of the cities in our region had any idea this was going on because they were dealing with each parcel as a separate entity.

That’s what combining sets of data can get you - a better overall view of what’s really happening. While government agencies are great at collecting piles of data, it’s that kind of larger analysis that’s missing.

Where do you turn to keep your skills updated or learn new things?

To be honest - Twitter. I get a lot of ideas and updates on new tools there. And the NICAR conference and listserv. Usually when you hit up against a problem - whether it’s dealing with a dirty dataset or figuring out how to best visualize your data -- it’s something that someone else has already faced.

I also learn a lot from the people within our newsroom. We have a talented group of web producers who all are eager to try new things and learn.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Data is everywhere, but in most cases it’s just stockpiled and warehoused without a second thought to analysis or using it to solve larger problems.

Journalists are in a unique position to make sense of it, to find the stories in it, to make sure that governments and organizations are considering the larger picture.

I think, too, that people in our field need to truly push for open government in the sense not of government building interfaces for data, but for just releasing raw data streams. Government is still far too stuck in the “Here’s a PDF of a spreadsheet” mentality. That doesn’t create informed citizens, and it doesn’t lead to innovative ways of thinking about government.

I’ve been involved recently in a community effort to create an API and then apps out of the regional transit authority’s live bus GPS stream. It has been a really fun project - and something that I hope other local governments in our area take note of.

Profile of the Data Journalist: The Daily Visualizer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Matt Stiles (@stiles), a data journalist based in Washington, D.C., maintains a popular Daily Visualization blog. Our interview follows.

Where do you work now? What is a day in your life like?

I work at NPR, where I oversee data journalism on the State Impact project, a local-national partnership between us and member stations. My typical day always begins with a morning "scrum" meeting among the D.C. team as part of our agile development process. I spend time acquiring and analyzing data throughout each day, and I typically work directly with reporters, training them on software and data visualization techniques. I also spend time planning news apps and interactives, a process that requires close consultation with reporters, designers and developers.

How did you get started in data journalism? Did you get any special degrees or certificates?

No special training or certificates, though I did attend three NICAR boot camps (databases, mapping, statistics) over the years.

Did you have any mentors? Who? What were the most important resources they shared with you?

I have several mentors, both on the reporting side and the data side. For data, I wouldn't be where I am today without the help of two people: Chase Davis and Jennifer LaFleur. Jen got me interested early, and has helped me with formal and informal training over the years. Chase helped me with day-to-day questions when we worked together at the Houston Chronicle.

What does your personal data journalism "stack" look like? What tools could you not live without?

I have a MacBook that runs Windows 7. I have the basic CAR suite (Excel/Access, ArcGIS, SPSS, etc.) but also plenty of open-source tools, such as R for visualization or MySQL/Postgres for databases. I use Coda and TextMate for coding. I use BBEdit and Python for text manipulation. I also couldn't live without Photoshop and Illustrator for cleaning up graphics.

What data journalism project are you the most proud of working on or creating?

I'm most proud of the online data library I created (and others have since expanded) at The Texas Tribune, but we're building some sweet apps at NPR. That's only going to expand now that we've created a national news apps team, which I'm joining soon.

Where do you turn to keep your skills updated or learn new things?

I read blogs, subscribe to email lists and attend lots of conferences for inspiration. There's no silver bullet. If you love this stuff, you'll keep up.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

More and more information is coming at us every day. The deluge is so vast. Data journalism at its core is important because it's about facts, not anecdotes.

Apps are important because Americans are already savvy data consumers, even if they don't know it. We must get them thinking -- or, even better, not thinking -- about news consumption in the same way they think about syncing their iPads or booking flights on Priceline or purchasing items on eBay. These are all "apps" that are familiar to many people. Interactive news should be, too.

This interview has been edited and condensed for clarity.

March 05 2012

Profile of the Data Journalist: The API Architect

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Jacob Harris (@harrisj) is an interactive news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work in the Interactive Newsroom team at the New York Times. A day in my life is usually devoted to coding rather than meetings. Currently, I am almost exclusively devoted to the NYT elections coverage, where I switch between loading election results from the AP and building internal APIs that provide data to the various parts of elections.nytimes.com. I also sometimes help fix problems in our server stack when they arise or sometimes get involved in other projects if they need me.

How did you get started in data journalism? Did you get any special degrees or certificates?

I have a classical CS education, with a combined B.A./M.Eng from MIT. I have no journalism background or experience. I never even worked for my newspaper in college or anywhere. I do have a profound skepticism and contrarian nature that does help me fit in well with the journalists.

Did you have any mentors? Who? What were the most important resources they shared with you?

I don't have any specific mentors. But that doesn't mean I haven't been learning from anybody. We're in a very open team and we all usually learn things from each other. Currently, several of the frontend guys are tolerating my new forays into Javascript. Soon, the map guys will learn to bear my questions with patience.

What does your personal data journalism "stack" look like? What tools could you not live without?

Our actual web stack is built on top of EC2, with Phusion Passenger and Ruby on Rails serving our apps. We also use haproxy as a load balancer. Varnish is an amazing cache that everybody should use. On my own machine, I do my coding currently in Sublime Text 2. I use Pivotal Tracker to track my coding tasks. I could probably live with a different editor, but don't take my server stack away from me.

What data journalism project are you the most proud of working on or creating?

I have two projects I'm pretty proud of working on. Last year, I helped out with the Wikileaks War Logs reporting. We built an internal news app for the reporters to search the reports, see them on a map, and tag the most interesting ones. That was an interesting learning experience.

One of the unique things I figured out was how to extract MGRS coordinates from within the reports to geocode the locations inside of them. From this, I was able to distinguish the locations of various homicides within Baghdad more finely than the geocoding for the reports. I built a demo, pitched it to graphics, and we built an effective and sobering look at the devastation on Baghdad from the violence.
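A minimal sketch of that first step, extracting MGRS-style grid references from free text with a regular expression, might look like the following. This is not the Times' actual code; the sample report is invented, and converting the extracted strings to latitude and longitude would require a separate library (the Python mgrs package is one assumption).

    # Hedged sketch, not the Times' actual code. The sample report is invented.
    # Converting matches to lat/lon would need a separate library (for example,
    # the `mgrs` package), which is only assumed here, not used.
    import re

    # MGRS references look like "38SMB4484" or "38SMB4595148477": a grid zone,
    # two 100 km square letters, then an even number of digits.
    MGRS_PATTERN = re.compile(r"\b\d{1,2}[C-HJ-NP-X][A-HJ-NP-Z]{2}(?:\d{2}){1,5}\b")

    def extract_mgrs(report_text):
        """Return every MGRS-like coordinate string found in a report."""
        return MGRS_PATTERN.findall(report_text)

    sample = "At 1430 hrs an IED detonated near grid 38SMB4595148477 in Baghdad."
    print(extract_mgrs(sample))  # ['38SMB4595148477']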

This year, I am working on my third election as part of Interactive News. Although we are proud of our team's work in 2008 and 2010, we've been trying some new ways of presenting our election coverage online and new ways of architecting all of our data sources so that it's easier to build new stuff. It's been gratifying to see how internal APIs combine with novel new storytelling formats and modern browser technologies this year.

Where do you turn to keep your skills updated or learn new things?

Usually, I just find out about things by following all the other news app developers on Twitter. We're a small ecosystem with lots of sharing. It's great how everybody learns from each other. I have created a Twitter list @harrisj/news-hackers to help keep tabs on all the cool stuff being done out there. (If you know someone who should be on it, let me know.)


Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

We live in a world of data. Our reporting should do a better job of presenting and investigating that data. I think it's been an incredible time for the world of news applications lately. A few years back, it was just an achievement to put data online in a browsable way.

These days, news applications are at a whole other level. Scott Klein of ProPublica put it best when he described all good data stories as including both the "near" (individual cases, examples) and the "far" (national trends, etc.).

In an article, the reporter would pick a few compelling "nears" for the story. As a reader, I also would want to know how my school is performing or how polluted my water supply is.

This is what news applications can do: tell the stories that are found in the data, but also allow the readers to investigate the stories in the data that are highly important to them.

This interview has been edited and condensed for clarity.

March 02 2012

Profile of the Data Journalist: The Visualizer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Michelle Minkoff (@MichelleMinkoff) is an investigative developer/journalist based in Washington, D.C. Our interview follows.

Where do you work now? What is a day in your life like?

I am an Interactive Producer at the Associated Press' Washington DC bureau, where I focus on news applications related to politics and the election, as well as general mapping for our interactives on the Web. While my days pretty much always involve sitting in front of a computer, the actual tasks themselves can vary wildly. I may be chatting with reporters and editors on the politics, environment, education, national security or myriad other beats about upcoming stories and how to use data to support reporting or create interactive stories. I might be gathering data, reformatting it or crafting Web applications. I spend a great deal of time creating interactive mapping systems, working a lot with geographic data, and collaborating with cartographers, editors and designers to decide how to best display it.

I split my time between working closely with my colleagues in the Washington bureau on the reporting/editing side, and my fellow interactive team members, only one of whom is also in DC. Our team is global, headquartered in New York, but with members spanning the globe from Phoenix to Bangkok.

It's a question of striking a balance between what needs to be done on daily deadlines for breaking news, longer-term stories which are often investigative, and creating frameworks that help The Associated Press to make the most of the Web's interactive nature in the long run.

How did you get started in data journalism? Did you get any special degrees or certificates?

I caught the bug when I took a computer-assisted reporting class from Derek Willis, a member of the New York Times' Interactive News Team, at Northwestern's journalism school where I was a grad student. I was fascinated by the role that technology could play in journalism for reporting and presentation, and very quickly got hooked. I also quickly discovered that I could lose track of hours playing with these tools, and that what came naturally to me was not as natural to others. I would spend days reporting for class, on and off Capitol Hill, and nights exchanging gchats with Derek and other data journalists he introduced me to. I started to understand SQL, advanced Excel, and fairly quickly thereafter, Python and Django.

I followed this up with an independent study in data visualization back at Medill's Chicago campus, under Rich Gordon. I practiced making Django apps, played with the Processing visualization language. I voraciously read through all the Tufte books. As a final project, I created a package about the persistence of Chicago art galleries that encompasses text, Flash visualization and a searchable database.

I have a concentration in Interactive Journalism with my Medill master's degree, but the courses mentioned above are but a partial component of that concentration.

Did you have any mentors? Who? What were the most important resources they shared with you?

The question here is in the wrong tense. I currently "do" have many mentors, and I don't know how I would do my job without what they've shared in the past, and in the present. Derek, mentioned above, was the first. He introduced me to his friend Matt [Waite], and then he told me there was a whole group of people doing this work at NICAR. Literally hundreds of people from that organization have helped me at various places on my journey, and I believe strongly in the mantra of "paying it forward" as they have -- no one can know it all, so we pass on what we've learned, so more people can do even better work.

Other key folks I've had the privilege to work with include all of the Los Angeles Times' Data Desk's members, which includes reporters, editors and Web developers. I worked most closely with Ben Welsh and Ken Schwencke, who answered many questions, and were extremely encouraging when I was at the very beginning of my journey.

At my current job at The Associated Press, I'm lucky to have teammates who mentor me in design, mapping and various Washington-based beats. Each is helpful in his or her own way.

Special attention deserves to be called to Jonathan Stray, who's my official boss, but also a fantastic mentor who enables me to do what I do. He's helping me to learn the appropriate technical skills to execute what I see in my head, as well as learn how to learn. He's not just teaching me the answers to the problems we encounter in our daily work, but also helping me learn how to better solve them, and work this whole "thing I do" into a sustainable career path. And all with more patience than I have for myself.

What does your personal data journalism "stack" look like? What tools could you not live without?

No matter how advanced our tools get, I always find myself coming back to Excel first to do simple work. It helps us get an overall handle on a data set. I also will often quickly bring data into SQLite Manager, a Firefox extension that allows a user to run SQL queries with no database setup. I'm more comfortable asking complicated questions of data that way. I also like to use Google's Chart Tools to create quick visualizations for myself to better understand a story.
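For readers who want the same workflow without a browser extension, here is a minimal sketch of the underlying idea in plain Python: load a spreadsheet into an in-memory SQLite database and ask questions of it with SQL, no server setup required. The CSV file and its column names are hypothetical.

    # Minimal sketch of the same idea in plain Python. The CSV file and its
    # column names are hypothetical.
    import csv
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE salaries (name TEXT, department TEXT, pay REAL)")

    with open("salaries.csv", newline="") as f:
        rows = [(r["name"], r["department"], float(r["pay"]))
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", rows)

    # The kind of question a reporter might ask of the data:
    query = ("SELECT department, AVG(pay), SUM(pay) FROM salaries "
             "GROUP BY department ORDER BY SUM(pay) DESC")
    for department, avg_pay, total_pay in conn.execute(query):
        print(department, round(avg_pay), round(total_pay))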

When it comes to presentation, since I've been doing a lot with mapping recently, I don't know what I'd do without my favorite open source tools, TileMill and Leaflet. Building a map stack is hard work, but the work that others have done before has made it a lot easier.

If we consider programming languages tools (which I do), JavaScript is my new Swiss army knife. Prior to coming to the AP, I did a lot with Python and Django, but I've learned a lot about what I like to call "Really Hard JavaScript." It's not just about manipulating the colors of a background on a Web page, but parsing, analyzing and presenting data. When I need to do more complex work to manipulate data, I use a combination of Ruby and Python -- depending on which has better tools for the job. For XML parsing, I like Ruby more. For simplifying geo data, I prefer Python.
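As an illustration of the geo-simplification case, here is a short, hedged example in Python. It assumes the Shapely library and a hypothetical GeoJSON file of boundaries; the tolerance value is arbitrary.

    # Hedged example: assumes the Shapely library and a hypothetical GeoJSON
    # FeatureCollection of boundaries; the tolerance value is arbitrary.
    import json
    from shapely.geometry import shape, mapping

    with open("states.geojson") as f:
        collection = json.load(f)

    for feature in collection["features"]:
        geom = shape(feature["geometry"])
        # Douglas-Peucker simplification: a smaller tolerance keeps more detail.
        simplified = geom.simplify(0.01, preserve_topology=True)
        feature["geometry"] = mapping(simplified)

    with open("states_simplified.geojson", "w") as f:
        json.dump(collection, f)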

What data journalism project are you the most proud of working on or creating?

That would be "Road to 270," a project we did at the AP that allows users to test out hypothetical "what-if" scenarios for the national election, painting states to define to which candidate a state's delegates could go. It combines demographic and past election data with the ability for users to make a choice and deeply engage with the interactive. It's not just telling the user a story, but informing the user by allowing him or her to be part of the story. That, I believe, is when data journalism becomes its most compelling and informative.

It also uses some advanced technical mapping skills that were new to me. I greatly enjoyed the thrill of learning how to structure a complex application, and add new tools to my toolkit. Now, I don't just have those new tools, but a better understanding of how to add other new tools.

Where do you turn to keep your skills updated or learn new things?

I look at other projects, both within the journalism industry and in general visualization communities. The Web inspector is my best friend. I'm always looking to see how people did things. I read blogs voraciously, and have a fairly robust Google Reader set of people whose work I follow closely. I also use lynda.com frequently (I tend to learn best by video tutorials.) Hanging out on listservs for free tools I use (such as Leaflet), programming languages I care about (Python), or projects whose mission our work is related to (Sunlight Foundation) help me engage with a community that cares about similar issues.

Help sites like Stack Overflow, and pretty much anything I can find on Google, are my other best friends. The not-so-secret secret of data journalism: we're learning as we go. That's part of what makes it so fun.

Really, the learning is not about paper or electronic resources. Like so much of journalism, this is best conquered, I argue, with persistence and stick-to-it-ness. I approach the process of data journalism and Web development as a beat. We attend key meetings. Instead of city council, it's NICAR. We develop vast rolodexes. I know people who have myriad specialties and feel comfortable calling on them. In return, I help people all over the world with this sort of work whenever I can, because it's that important. While we may work for competing places, we're really working toward the same goal: improving the way we inform the public about what's going on in our world. That knowledge matters a great deal.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

More and more information is coming at us every day. The deluge is so vast that we need to not just say things are true, but prove those truths with verifiable facts. Data journalism allows for great specificity, and truths based in the scientific method. Using computers to commit data journalism allows us to process great amounts of information much more efficiently, and make the world more comprehensible to a user.

Also, while we are working with big data, often only a subset of that data is valuable to a specific user. Data journalism and Web development skills allow us to customize those subsets for our various users, such as by localizing a map. That helps us give a more relevant and useful experience to each individual we serve.

Perhaps most importantly, more and more information is digital, and is coming at us through the Internet. It simply makes sense to display that information in an environment similar to the one in which it's provided. Information is dispensed in a different way now than it was five years ago. It will be totally different in another five years. So, our explanations of that environment should match. We must make the most of the Internet to tell our stories differently now than we did before, and differently than we will in the future.

Knowing things are constantly changing, being at the forefront of that change, and enabling the public to understand and participate in that change, is a large part of what makes data journalism so exciting and fundamentally essential.

This interview has been edited and condensed for clarity.

Profile of the Data Journalist: The Human Algorithm

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Ben Welsh (@palewire) is a Web developer and journalist based in Los Angeles. Our interview follows.

Where do you work now? What is a day in your life like?

I work for the Los Angeles Times, a daily newspaper and 24-hour Web site based in Southern California. I'm a member of the Data Desk, a team of reporters and Web developers that specializes in maps, databases, analysis and visualization. We both build Web applications and conduct analysis for reporting projects.

I like to compare The Times to a factory, a factory that makes information. Metaphorically speaking, it has all sorts of different assembly lines. Just to list a few, one makes beautifully rendered narratives, another makes battleship-like investigative projects.

A typical day involves juggling work on different projects, mentally moving from one assembly line to the other. Today I patched an embryonic open-source release, discussed our next move on a pending public records request, guided the real-time publication of results from the GOP primaries in Michigan and Arizona, and did some preparation for how we'll present a larger dump of results on Super Tuesday.

How did you get started in data journalism? Did you get any special degrees or certificates?

I'm thrilled to see new-found interest in "data journalism" online. It's drawing young, bright people into the field and involving people from different domains. But it should be said that the idea isn't new.

I was initiated into the field as a graduate student at the Missouri School of Journalism. There I worked at the National Institute for Computer-Assisted Reporting, also known as NICAR. Decades before anyone called it "data journalism," a disparate group of misfit reporters discovered that the data analysis made possible by computers enabled them to do more powerful investigative reporting. In 1989, they founded NICAR, which has, for decades, been training journalists in data skills and nurturing a tribe of journalism geeks. In the time since, computerized data analysis has become a dominant force in investigative reporting, responsible for a large share of the field's best work.

To underscore my point, here's a 1986 Time magazine article about how "newsmen are enlisting the machine."

Did you have any mentors? Who? What were the most important resources they shared with you?

My first journalism job was in Chicago. I got a gig working for two great people there, Carol Marin and Don Moseley, who have spent most of their careers as television journalists. I worked as their assistant. Carol and Don are warm people who are good teachers, but they are also excellent at what they do. There was a moment when I realized, "Hey, I can do this!" It wasn't just something I heard about in class, but I could actually see myself doing.

At Missouri, I had a great classmate named Brian Hamman, who is now at the New York Times. I remember seeing how invested Brian was in the Web, totally committed to Web development as a career path. When an opportunity opened up to be a graduate assistant at NICAR, Brian encouraged me to pursue it. I learned enough SQL to help do farmed-out investigative work for TV stations. And, more importantly, I learned that if you had technical skills you could get the job to work on a cool story.

After that I got a job doing data analysis at the Center for Public Integrity in Washington DC. I had the opportunity to work on investigative projects, but also the chance to learn a lot of computer programming along the way. I had the guidance of my talented coworkers, Daniel Lathrop, Agustin Armendariz, John Perry, Richard Mullins and Helena Bengtsson. I learned that computer programming wasn't impossible. They taught me that if you have a manageable task, a few friends to help you out and a door you can close, you can figure out a lot.

What does your personal data journalism "stack" look like? What tools could you not live without?

I do my daily development in gedit text editor, Byobu's slick implementation of the screen terminal and the Chromium browser. And, this part may be hard to believe, but I love Ubuntu Unity. I don't understand what everybody is complaining about.

I do almost all of my data management in the Python Web development framework Django and PostgreSQL's database, even if the work is an exploratory reporting project that will never be published. I find that the structure of the framework can be useful for organizing just about any data-driven project.
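As a rough illustration of that approach, here is a minimal, hypothetical Django model (not the Data Desk's code) showing how even a throwaway reporting dataset can get the structure, typing and ordering that the framework provides. The model name and fields are invented; in practice the project's settings would point at a PostgreSQL database.

    # Hypothetical model, not the Data Desk's code: giving an exploratory
    # dataset structure inside Django, even if it will never be published.
    # In practice, settings.py would point DATABASES at PostgreSQL.
    from django.db import models

    class CampaignContribution(models.Model):
        donor_name = models.CharField(max_length=255)
        recipient = models.CharField(max_length=255)
        amount = models.DecimalField(max_digits=12, decimal_places=2)
        date = models.DateField()

        class Meta:
            ordering = ["-date"]

        def __str__(self):
            return f"{self.donor_name} -> {self.recipient}: {self.amount}"

Once a model like this exists, Django's usual tools (the admin, the ORM, the manage.py shell) come along for free, which is much of the appeal for exploratory reporting work.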

I use GitHub for both version-control and project management. Without it, I'd be lost.

What data journalism project are you the most proud of working on or creating?

As we all know, there's a lot of data out there. And, as anyone who works with it knows, most of it is crap. The projects I'm most proud of have taken large, ugly data sets and refined them into something worth knowing: a nut graf in an investigative story, or a data-driven app that gives the reader some new insight into the world around them. It's impossible to pick one. I like to think the best is still, as they say in the newspaper business, TK.

Where do you turn to keep your skills updated or learn new things?

Twitter is a great way to keep up with what is getting other programmers excited. I know a lot of people find social media overwhelming or distracting, but I feel plugged in and inspired by what I find there. I wouldn't want to live without it.

GitHub is another great source. I've learned so much just exploring other people's code. It's invaluable.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Computers offer us an opportunity to better master information, better understand each other and better watchdog those who would govern us. At last week's NICAR conference, in a talk called "Human-Assisted Reporting," I tried to talk about some of the ways that simply thinking about the process of journalism as an algorithm can point the way forward. In my opinion, we should aspire to write code that embodies the idealistic principles and investigative methods of the previous generation. There's all this data out there now, and journalistic algorithms, "robot reporters," can help us ask tougher questions of it.

March 01 2012

In the age of big data, data journalism has profound importance for society

The promise of data journalism was a strong theme throughout the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. In 2012, making sense of big data through narrative and context, particularly unstructured data, will be a central goal for data scientists around the world, whether they work in newsrooms, on Wall Street or in Silicon Valley. Notably, that goal will be substantially enabled by a growing set of common tools, whether they're employed by government technologists opening up Chicago's data, healthcare technologists or newsroom developers.

At NICAR 2012, you could literally see the code underpinning the future of journalism written -- or at least projected -- on the walls.

"The energy level was incredible," said David Herzog, associate professor for print and digital news at the Missouri School of Journalism, in an email interview after NICAR. "I didn't see participants wringing their hands and worrying about the future of journalism. They're too busy building it."

Just as open civic software is increasingly baked into government, open source is playing a pivotal role in the new data journalism.

"Free and open-source tools dominated," said Herzog. "It's clear from the panels and hands-on classes that free and open source tools have eliminated the barrier to entry in terms of many software costs."

While many developers are agnostic with respect to which tools they use to get a job done, the people who are building and sharing tools for data journalism are often doing it with open source code. As Dan Sinker, the head of the Knight-Mozilla News Technology Partnership for Mozilla, wrote afterwards, journo-coders took NICAR 12 "to a whole new level."

While some of that open source development was definitely driven by the requirements of the Knight News Challenge, which funded the PANDA and Overview projects, there's also a collaborative spirit in evidence throughout this community.

This is a group of people who are fiercely committed to "showing your work" -- and for newsroom developers, that means sharing your code. To put it another way: code, don't tell. Sessions on Python, Django, mapping, Google Refine and Google Fusion Tables were packed at NICAR 12.

No, this is not your father's computer-assisted reporting.

"I thought this stacked up as the best NICAR conference since the first in 1993," said Herzog. "It's always been tough to choose from the menu of panels, demos and hands-on classes at NICAR conferences. But I thought there was an abundance of great, informative, sessions put on by the participants. Also, I think NICAR offered a good range of options for newbies and experts alike. For instance, attendees could learn how to map using Google Fusion tables on the beginner's end, or PostGIS and qGIS at the advanced level. Harvesting data through web scraping has become an ever bigger deal for data journalists. At the same time, it's getting easier for folks with no or little programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki. "

On the history of NICAR

According to IRE, NICAR was founded in 1989. Since its founding, the Institute has trained thousands of journalists to find, collect and publish electronic information.

Today, "the NICAR conference helps journalists, hackers, and developers figure out best practices, best methods,and best digital tools for doing journalism that involves data analysis and classic reporting in the field," said Brant Houston, former executive director of Investigative Reporters and Editors, in an email interview. "The NICAR conference also obviously includes investigative journalism and the standards for data integrity and credibility."

"I believe the first IRE-sponsored [conference] was in 1993 in Raleigh, when a few reporters were trying to acquire and learn to use spreadsheets, database managers, etc. on newly open electronic records," said Sarah Cohen, the Knight professor of the practice of journalism and public policy at Duke University, in an email interview. "Elliott Jaspin was going around the country teaching reporters how to get data off of 9-track tapes. There really was no public Internet. At the time, it was really, really hard to use the new PC's, and a few reporters were trying to find new stories. The famous ones had been Elliott's school bus drivers who had drunk driving records and the Atlanta Color of Money series on redlining."

"St. Louis was my 10th NICAR conference," said Anthony DeBarros, the senior database editor at USA Today, in an email interview. "My first was in 1999 in Boston. The conference is a place where news nerds can gather and remind themselves that they're not alone in their love of numbers, data analysis, writing code and finding great stories by poring over columns in a spreadsheet. It serves as an important training vehicle for journalists getting started with data in the newsroom, and it's always kept journalists apprised of technological developments that offer new ways of finding and telling stories. At the same time, its connection to IRE keeps it firmly rooted in the best aspects of investigative reporting -- digging up stories that serve the public good.

Baby, you can drive my CAR

Long before we started talking about "data journalism," the practice of computer-assisted reporting (CAR) was growing around the world.

"The practice of CAR has changed over time as the tools and environment in the digital world has changed," said Houston. "So it began in the time of mainframes in the late 60s and then moved onto PCs (which increased speed and flexibility of analysis and presentation) and then moved onto the Web, which accelerated the ability to gather, analyze and present data. The basic goals have remained the same. To sift through data and make sense of it, often with social science methods. CAR tends to be an "umbrella" term - one that includes precision journalism and data driven journalism and any methodology that makes sense of date such as visualization and effective presentations of data."

On one level, CAR is still around because the journalism world hasn't coined a good term to use instead.

"Computer-assisted reporting" is an antiquated term, but most people who practice it have recognized that for years," said DeBarros. "It sticks around because no one has yet to come up with a dynamite replacement. Phil Meyer, the godfather of the movement, wrote a seminal book called "Precision Journalism, and that term is a good one to describe that segment of CAR that deals with statistics and the use of social science methods in newsgathering. As an umbrella term, data journalism seems to be the best description at the moment, probably because it adequately covers most of the areas that CAR has become -- from traditional data-driven reporting to the newer category of news applications."

The most significant shift in CAR may well have come when all of those computers used for reporting were connected through the network of networks in the 1990s.

"It may seem obvious, but of course the Internet changed it all, and for a while it got smushed in with trying to learn how to navigate the Internet for stories, and how to download data," said Cohen. "Then there was a stage when everyone was building internal intranets to deliver public records inside newsrooms to help find people on deadline, etc. So for much of the time, it was focused on reporting, not publishing or presentation. Now the data journalism folks have emerged from the other direction: People who are using data obtained through APIs who often skip the reporting side, and use the same techniques to deliver unfiltered information to their readers in an easier format the the government is giving us. But I think it's starting to come back together -- the so-called data journalists are getting more interested in reporting, and the more traditional CAR reporters are interested in getting their stories on the web in more interesting ways.

Whatever you call it, the goals are still the same.

"CAR has always been about using data to find and tell stories," said DeBarros. "And it still is. What has changed in recent years is more emphasis toward online presentations (interactive maps and applications) and the coding skills required to produce them (JavaScript, HTML/CSS, Django, Ruby on Rails). Earlier NICAR conferences revolved much more around the best stories of the year and how to use data techniques to cover particular topics and beats. That's still in place. But more recently, the conference and the practice has widened to include much more coding and presentation topics. That reflects the state of media -- every newsroom is working overtime to make its content work well on the web, on mobile, and on apps, and data journalists tend to be forward thinkers so it's not surprising that the conference would expand to include those topics."

What stood out at NICAR 2012?

The tools and tactics on display at NICAR were enough to convince Tyler Dukes at Duke to write that "NICAR taught me I know nothing." Browse through the tools, slides and links from NICAR 2012 curated by Chrys Wu to get a sense of just how much is out there. The big theme, without a doubt, was data.

"Data really is the meat of the conference, and a quick scan of the schedule shows there were tons of sessions on all kinds of data topics, from the Census to healthcare to crime to education," said DeBarros.

What I saw everywhere at NICAR, however, was interest not simply in what data was out there, but in how to get it and put it to use: from finding stories and sources, to providing empirical evidence to back up other reporting, to telling stories with maps and visualizations.

"A major theme was the analysis of data (using spreadsheets, data managers, GIS) that gives journalism more credibility by seeing patterns, trends and outliers," said Houston. "Other themes included collection and analysis of social media, visualization of data, planning and organizing stories based on data analysis, programming for web scraping (data collection from the Web) and mashing up various Web programs."

"Harvesting data through web scraping has become an ever bigger deal for data journalists," said Herzog. "At the same time, it's getting easier for folks with no or little programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki. That said, another message for me was how important programming has become. No, not all journalists or even data journalists need to learn programming. But as Rich Gordon at Medill has said, all journalists should have an appreciation and understanding of what it can do."

Cohen similarly pointed to data, specifically its form. "The theme that I saw this year was a focus on unstructured rather than structured data," she said. "For a long time, we've been hammering governments to give us 'data' in columns and rows. I think we're increasingly seeing that stories are just as likely (if not more likely) to come from the unstructured information that comes from documents, audio and video, tweets, other social media -- from government and non-government sources. The other theme is that there is a lot more collaboration, openness and sharing among competing news organizations. (Witness PANDA and census.ire.org and the New York Times campaign finance API.) But it only goes so far -- you don't see ProPublica sharing the 40+ states' medical licensure data that Dan scraped with everyone else. (I have to admit, though, I haven't asked him to share.) IRE has always been about sharing techniques and tools -- now we're actually sharing source material."

While data dominated NICAR 12, other topics mattered as well, from open mapping tools to macroeconomic trends in the media industry. "A lot of newsrooms are grappling with rapid change in mapping technology," said DeBarros. "Many of us for years did quite well with Flash, but the lack of support for Flash on the iPad has fueled exploration into maps built on open source technologies that work across a range of online environments. Many newsrooms are grappling with this, and the number of mapping sessions at the conference reflected it."

There's also serious context to the interest in developing data journalism skills. More than 166 U.S. newspapers have stopped putting out a print edition or closed down altogether since 2008, resulting in more than 35,000 job losses or buyouts in the newspaper industry since 2007.

"The economic slump and the fundamental change in the print publishing business means that journalists are more aware of the business side than ever," said DeBarros, "and I think the conference reflected that more than in the past. There was a great session on turning your good work into money by Chase Davis and Matt Wynn, for example. I was on a panel talking about the business reasons for starting APIs. The general unease most journalists feel knowing that our industry still faces difficult economic times. Watching a new generation of journalists come into the fold has been exciting."

One notable aspect of that next generation of data journalists is that it does not appear likely to look or sound the same as the newsrooms of the 20th century.

"This was the most diverse conference that I can remember," said Herzog. "I saw more women and people of color than ever before. We had data journalists from many countries: Korea, the U.K., Serbia, Germany, Canada, Latin America, Denmark, Sweden and more. Also, the conference is much more diverse in terms of professional skills and interests. Web 2.0 entrepreneurs, programmers, open data advocates, data visualization specialists, educators, and app builders mixed with traditional CAR jockeys. I also saw a younger crowd, a new generation of data journalists who are moving into the fold. For many of the participants, this was their first conference."

What problems does data journalism face?

While the tools are improving, there are still immense challenges ahead, from the technology itself to education to newsroom resources. "A major unsolved challenge is making the analysis of unstructured data easier and faster to do. Those working on this include myself, Sarah Cohen, the DocumentCloud team, teams at AP and Chicago Tribune and many others," said Houston.

There's also the matter of improving the level of fundamental numeracy in the media. "This is going to sound basic, but there are still far too many journalists around the world who cannot open an Excel spreadsheet, sort the values or write an equation to determine percentage change," said DeBarros, "and that includes a large number of the college interns I see year after year, which really scares me. Journalism programs need to step up and understand that we live in a data-rich society, and math skills and basic data analysis skills are highly relevant to journalism. The 400+ journalists at NICAR still represent something of an outlier in the industry, and that has to change if journalism is going to remain relevant in an information-based culture."
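The percentage-change calculation DeBarros mentions really is that basic; here is a sketch in Python, with made-up figures:

def percent_change(old, new):
    """Return the percentage change from old to new."""
    return (new - old) / old * 100.0

# A hypothetical budget that grows from $4,000,000 to $4,600,000 is up 15 percent.
print(percent_change(4000000, 4600000))  # 15.0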

In that context, Cohen has high hopes for a new project, the Reporters Lab. "The big unsolved problem to me is that it's still just too hard to use 'data' writ large," she said. "You might have seen 4 or 5 panels on how to scrape data [at NICAR]. People have to write one-off computer programs using Python or Ruby or something to scrape a site, rather than use a tool like Kapow, because newsrooms can't (and never have been able to) invest that kind of money into something that really isn't mission-critical. I think Kapow and its cousins cost $20,000-$40,000 a year. Our project aims to find those kinds of holes and create, commission or adapt free, open source tools for regular reporters to use, not just the data journalist skilled in programming. We're building communities of people who want to work on these problems."
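As an illustration of the kind of one-off scraper Cohen is talking about, here is a short Python sketch; the URL and table layout are hypothetical, and a real scraper would be adapted to whatever page the story requires:

import csv

import requests
from bs4 import BeautifulSoup

URL = "http://example.gov/salaries.html"  # hypothetical public records page

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")  # the first table on the page
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # header rows built with <th> produce no <td>, so they are skipped
        rows.append(cells)

# Dump the result to a CSV for analysis in a spreadsheet or database.
with open("salaries.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

Hosted services like ScraperWiki wrapped exactly this sort of script in a shared environment, which is part of why the barrier to entry kept dropping.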

What role does data journalism play in open government?

On the third day of NICAR 2012, I presented on "open data journalism," which, to paraphrase Jonathan Stray, I'd define as obtaining, reporting upon, curating and publishing open data in the public interest. As someone who's been following the open government movement closely for a few years now, I find the parallels between what civic hackers are doing and what this community of data journalists is working on inescapable. Both groups are focused on putting data to work for the public good, whether it's in the public interest, for profit, in the service of civic utility or, in the biggest crossover, government accountability.

To do so will require that data journalists and civic coders alike apply the powerful emerging tools in the newsroom stack to the explosion of digital bits and bytes from government, business and our fellow citizens.

The need for data journalism, in the context of massive amounts of government data being released, could not be any more timely, particularly given persistent quality issues.

"I can't find any downsides of more data rather than less," said Cohen, "but I worry about a few things."

First, emphasized Cohen, there's an issue of whether data is created open from the beginning -- and the consequences of 'sanitizing' it before release. "The demand for structured, nicely scrubbed data for the purpose of building apps can result in fake records rather than real records being released. USASpending.gov is a good example of that -- we don't get access to the actual spending records like invoices and purchase orders that agencies use, or the systems they use to actually do their business. Instead we have a side system whose only purpose is to make it public, so it's not a high priority inside agencies and there's no natural audit trail on it. It's not used to spend money, so mistakes aren't likely to be caught."

Second, there's the question of whether information relevant to an investigation has been scrubbed for release. "We get the lowest common denominator of information," she said. "There are a lot of records used for accountability that depend on our ability to see personally identifiable information (as opposed to private or personal information, which isn't the same thing). For instance, if you want to do stories on how farm subsidies are paid, you kind of have to know who gets them. If you want to do something on fraud in FEMA claims, you have to be able to find the people and businesses who get the aid. But when it gets pushed out as open government data, it often gets scrubbed of important details and then we have a harder time getting them under FOIA because the agencies say the records are already public."

To address those two issues, Cohen recommends getting more source documents, as a historian would. "I think what we can do is to push harder for actual records, and to not settle for what the White House wants to give us," she said. "We also have to get better at using records that aren't held in nice, neat forms -- they're not born that way, and we should get better at using records in whatever form they exist."

Why do data journalism and news apps matter?

Given the economic and technological context, it might seem like the case for data journalism should make itself. "CAR, data journalism, precision journalism, and news apps all are crucial to journalism -- and the future of journalism -- because they make sense of the tremendous amounts of data," said Houston, "so that people can understand the world and make sensible decisions and policies."

Given the reality that those practicing data journalism remain a tiny percentage of the world's media, however, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact.

"We're living in a data-driven culture," said DeBarros. "A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not. As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."

New tools, same rules

While the platforms and toolkits for journalism are evolving and the sources of data are exploding, many things haven't changed. For one, the ethics that guide the choices of the profession remain central to the journalism of the 21st century, as NPR's new ethics guide makes clear.

Whether news developers are rendering data in real-time, validating data in the real world, or improving news coverage with data, good data journalism still must tell a story. And as Erika Owens reflected in her own blog after NICAR, looking back upon a group field trip to the marvelous City Museum in St. Louis, journalism is also joyous, whether one is "crafting the perfect lede or slaying an infuriating bug."

Whether the tool is a smartphone, a notebook or a dataset, it must also extend investigative reporting, as the Los Angeles Times' Doug Smith emphasized to me at the NICAR conference.

If text is the next frontier in data journalism, harnessing the power of big data will be in the service of telling stories more effectively. Digital journalism and the digital humanities are merging in the service of a more informed society.

Profiles of the data journalist

To learn more about the people who are redefining the practice of computer-assisted reporting and, in some cases, building the newsroom stack for the 21st century, Radar conducted a series of email interviews with data journalists during the 2012 NICAR conference. The first two in the series are linked below.
