April 27 2012

Top Stories: April 23-27, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Design your website for a graceful fail
A failure in secondary content doesn't need to take down an entire website. Here, Etsy's Mike Brittain explains how to build resilience into UIs and allow for graceful failures.

Big data in Europe
European application of big data is ramping up, but its spread is different from the patterns seen in the U.S. In this interview, Big Data Week organizers Stewart Townsend and Carlos Somohano share the key distinctions and opportunities associated with Europe's data scene.

The rewards of simple code
Simple code is born from planning, discipline and grinding work. But as author Max Kanat-Alexander notes in this interview, the benefits of simple code are worth the considerable effort it requires.


Fitness for geeks
Programmers who spend 14 hours a day in front of a computer know how hard it is to step away from the cubicle. But as "Fitness for Geeks" author Bruce Perry notes in this podcast, getting fit doesn't need to be daunting.


Joshua Bixby on the business of performance
Strangeloop's Joshua Bixby discusses the business of speed and why web performance optimization is an institutional need.


Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference, May 29 - 31 in San Francisco. Save 20% on registration with the code RADAR20.

April 23 2012

Big data in Europe

The worldwide Big Data Week kicks off today with gatherings in the U.K., U.S., Germany, Finland and Australia. As part of their global focus, Big Data Week founder/organizer Stewart Townsend (@stewarttownsend) of DataSift, and Carlos Somohano (@ds_ldn), founder of the Data Science London community and the Data Science Hackathon, have been tracking big data in Europe. This is an area that we're exploring here at Radar and through October's Strata Conference in London, so I asked Townsend and Somohano to share their perspectives on the European data scene. They combined their thoughts in the following Q&A.

Are the U.S. and Europe at similar stages in big data growth, adoption and application?

Townsend and Somohano: Based on our experience across industry verticals and markets in Europe, we believe the U.S. is leading the way in adopting so-called "big data." The U.K. is perhaps the European leader, where the level of adoption is picking up quite quickly, although still lagging behind the U.S.

In Germany, surprisingly, many organizations are not adopting big data as quickly as in the U.S. and U.K. markets. In some southern European markets, big data is still quite a new concept. Practically speaking, it's not on the radar there yet.

Part of our organizational mission at Big Data Week is to promote big data across the whole European ecosystem and act as messengers of the benefits of adopting big data early.

What are the key differences between big data in the U.S. and Europe?

Townsend and Somohano: Many large organizations in Europe are still in the early stages of the adoption cycle. This is perhaps due to the level of confusion around aspects like terminology (i.e. what do we mean by "big data"?) and the acute lack of skills around "new" big data technologies.

Today, most of the new developments and technologies around big data come from the U.S., and this presents somewhat of a challenge for some European organizations as they try to keep up with the rapid level of change. Perhaps the speed of both the decision-making cycle and the organizational change in most European companies is a bit slower than in their U.S. counterparts, and this may be reflected in the adoption of big data. We also see that the concept of data science and the role of the data scientist are well adopted in many U.S. companies, but not so much in Europe.

Where is Europe excelling with big data?

Townsend and Somohano: The financial services sector — particularly investment banking and trading in London — is one of the early adopters of big data. In this sector, both the experience and expertise in big data are on par with big data leaders in the U.S. Equally, the level of investment in big data in this sector — despite the economic downturn — is healthy, with a positive outlook.

Technology startups are also becoming European leaders in big data, primarily around London, and to a lesser degree in Berlin. A relatively large number of startups are not only adopting big data but also developing new business models and value from it in social media and business-to-consumer services.

Finally, some verticals like oil and gas, utilities, and manufacturing are increasingly adopting big data in areas like sensor data, telemetry, and operational data streams. Our research indicates that retail is perhaps a late adopter.

The U.K. government has a number of robust open data initiatives. How are those shaping the big data space?

Townsend and Somohano: Quite notably, the U.K. government is becoming one of the leaders in open data, although perhaps not so in big data. This is perhaps due to the fact that the key drivers of open data initiatives are mainly information transparency, information privacy, information accessibility for citizens, and European regulatory changes. It's also worth mentioning that the U.K. government has been involved in very large-scale IT projects in the past (e.g. NHS IT), which could qualify as big data program initiatives. For diverse reasons, these projects were not successful and experienced massive budget overruns and delays. We believe this could be a factor in the U.K. government not focusing its main effort on big data for now. However, the open data initiative is driving the open release of massive public datasets, which eventually will require a strategic big data approach from the U.K. government.

What's the state of data science in Europe?

Townsend and Somohano: Similarly to the status of big data, Europe lags behind in the adoption levels of data science. Even in the U.K. — an early adopter in many ways — data science is still viewed in some sectors with a certain dose of skepticism. That's because data science is still understood as a new name for practices like business intelligence or analytics.

Additionally, in many large organizations in Europe the role of the data scientist is still not associated with a clear job description from the business and HR perspectives — and even in IT in some cases. Contrary to the corporate environment, where the data scientist role is still not fully recognized, in the European startup scene there is a healthy and vibrant data science community. This is perhaps best exemplified by our organization Data Science London. In its mission to promote data science and the data scientist role, our community is dedicated to the free, open dissemination of data science concepts and ideas.

Big Data Week is being held April 23-28. What are you hoping the event yields?

Townsend and Somohano: The main goal of Big Data Week is to bring together the big data communities around the world by hosting a series of events and activities to promote big data. We aim to spread the knowledge and understanding of big data challenges and opportunities from the technology, business, and commercial perspectives.


Big Data Week will feature the Data Science Hackathon, a 24-hour in-person/online event organized by Data Science London. Full Hackathon details are available here.


This interview was edited and condensed.


April 18 2012

Four short links: 18 April 2012

  1. CartoDB (GitHub) -- open source geospatial database, API, map tiler, and UI. For feature comparison, see Comparing Open Source CartoDB to Fusion Tables (via Nelson Minar).
  2. Future Telescope Array Drives Exabyte Processing (Ars Technica) -- Astronomical data is massive, and requires intense computation to analyze. If it works as planned, Square Kilometer Array will produce over one exabyte (2^60 bytes, or approximately 1 billion gigabytes) every day. This is roughly twice the global daily traffic of the entire Internet, and will require storage capacity at least 10 times that needed for the Large Hadron Collider. (via Greg Linden)
  3. Faster Touch Screens More Usable (Toms Hardware) -- check out that video! (via Greg Linden)
  4. Why Microsoft's New Open Source Division (Simon Phipps) -- The new "Microsoft Open Technologies, Inc." provides an ideal firewall to protect Microsoft from the risks it has been alleging exist in open source and open standards. As such, it will make it "easier and faster" for them to respond to the inevitability of open source in their market without constant push-back from cautious and reactionary corporate process.

April 10 2012

Open source is interoperable with smarter government at the CFPB

When you look at the government IT landscape of 2012, federal CIOs are being asked to address a lot of needs. They have to accomplish their mission. They need to be able to scale initiatives to tens of thousands of agency workers. They're under pressure to address not just network security but web security and mobile device security. They also need to be innovative, because all of this is supported by the same or less funding. These are common requirements in every agency.

As the first federal "start-up agency" in a generation, the Consumer Financial Protection Bureau (CFPB) feels some of those needs even more acutely. On the other hand, the opportunity for the agency to be smarter, leaner and "open from the beginning" is also immense.

Progress establishing the agency's infrastructure and culture over the first 16 months has been promising, save for the larger context of getting a director at the helm. Enabling open government by design isn't just a catchphrase at the CFPB. There has been a bold vision behind the CFPB from the outset, where a 21st century regulator would leverage new technologies to find problems in the economy before the next great financial crisis escalates.

In the private sector, there's great interest right now in finding actionable insight in large volumes of data. Making sense of big data is increasingly being viewed as a strategic imperative in the public sector as well. Recently, the White House put its stamp on that reality with a $200 million big data research and development initiative, including a focus on improving the available tools. There's now an entire ecosystem of software around Hadoop, which is itself open source code. The problem that now exists in many organizations, across the public and private sector, is not so much that the technology to manipulate big data isn't available: it's that the expertise to apply big data doesn't exist in-house. The data science talent shortage is real.

People who work and play in the open source community understand the importance of sharing code, especially when that action leads to improving the code base. That's not necessarily an ethic or a perspective that has been pervasive across the federal government. That does seem to be slowly changing, with leadership from the top: the White House used Drupal for its site and has since contributed modules back into the open source community, including one that helps with 508 compliance.

In an in-person interview last week, CFPB CIO Chris Willey (@ChrisWilleyDC) and acting deputy CIO Matthew Burton (@MatthewBurton) sat down to talk about the agency's new open source policy, government IT, security, programming in-house, the myths around code-sharing, and big data.

The fact that this government IT leadership team is strongly supportive of sharing code back to the open source community is probably the most interesting part of this policy, as Scott Merrill picked up in his post on the CFPB and GitHub.

Our interview follows.

In addition to being the leader of the CFPB's development team over the past year and a half, Burton was just named acting deputy chief information officer. What will that mean?

Willey: He hasn't been leading the software development team the whole time. In fact, we only really had an org chart as of October. In the time that he's been here, Matt has led his team to some amazing things. We're going to talk about one of them today, but we've also got a great intranet. We've got some great internal apps that are being built and that we've built. We've unleashed one version of the supervision system that helps bank examiners do their work in the field. We've got a lot of faith he's going to do great things.

What it actually means is that he's going to be backing me up as CIO. Even though we're a fairly small organization, we have an awful lot going on. We have 76 active IT projects, for example. We're just building a team. We're actually doubling in size this fiscal year, from about 35 staff to 70, as well as adding lots of contractors. We're just growing the whole pie. We've got 800 people on board now. We're going to have 1,100 on board in the whole bureau by the end of the fiscal year. There's a lot happening, and I recognize we need to have some additional hands and brain cells helping me out.

With respect to building an internal IT team, what's the thinking behind having technical talent inside of an agency like this one? What does that change, in terms of your relationship with technology and your capacity to work?

Burton: I think it's all about experimentation. Having technical people on staff allows an organization to do new things. I think the way most agencies work is that when they have a technical need, they don't have the technical people on staff to make it happen, so instead that need becomes larger and larger until it justifies a contract. And by then, the problem is very difficult to solve.

By having developers and designers in-house, we can constantly be addressing things as they come up. In some cases, before the businesses even know it's a problem. By doing that, we're constantly staying ahead of the curve instead of always reacting to problems that we're facing.

How do you use open source technology to accomplish your mission? What are the tools you're using now?

Willey: We're actually trying to use open source in every aspect of what we do. It's not just in software development, although that's been a big focus for us. We're trying to do it on the infrastructure side as well.

As we look at network and system monitoring, we look at the tools that help us manage the infrastructure. As I've mentioned in the past, we are 100% in the cloud today. Open source has been a big help for us in giving us the ability to manipulate those infrastructures that we have out there.

At the end of the day, we want to bring in the tools that make the most sense for the business needs. It's not about only selecting open source or having necessarily a preference for open source.

What we've seen is that over time, the open source marketplace has matured. A lot of tools that might not have been ready for prime time a year ago or two years ago are today. By bringing them into the fold, we potentially save money. We potentially have systems that we can extend. We could more easily integrate with the other things that we have inside the shop that maybe we built or maybe things that we've acquired through other means. Open source gives us a lot of flexibility because there are a lot of opportunities to do things that we might not be able to do with some proprietary software.

Can you share a couple of specific examples of open source tools that you're using and what you actually use them for within your mission?

Willey: On network monitoring, for example, we're using ZFS, which is an open source monitoring tool. We've been working with Nagios as well. Nagios, we actually inherited from Treasury — and while Treasury's not necessarily known for its use of open source technologies, it uses that internally for network monitoring. Splunk is another one that we have been using for web analysis. [After the interview, Burton and Willey also shared that they built the CFPB's intranet on MediaWiki, the software that drives Wikipedia.]

Burton: On the development side, we've invested a lot in Django and WordPress. Our site is a hybrid of them. It's WordPress at its core, with Django on top of that.

In November of 2010, actually a few weeks before I started here, Merici [Vinton] called me and said, "Matt, what should we use for our website?"

And I said, "Well, what's it going to do?"

And she said, "At first, it's going to be a blog with a few pages."

And this website needed to be up and running by February. And there was no hosting; there was nothing. There were no developers.

So I said, "Use WordPress."

And by early February, we had our website up. I'm not sure that would have been possible if we had to go through a lengthy procurement process for something not open source.

We use a lot of jQuery. We use Linux servers. For development ops, we use Selenium and Jenkins and Git to manage our releases and source code. We actually have GitHub Enterprise, which, although not open source, is very sharing-focused. It encourages sharing internally. And we're using GitHub on the public side to share our code. It's great to have the same interface internally as we're using externally.

Developers and citizens alike can go to github.com/cfpb and see code that you've released back to the public and to other federal agencies. What projects are there?

Burton: These are ones that came up as we built basic building blocks. They range from code that may not strike an outside developer as that interesting but that's really useful for the government, all the way to things that we created from scratch that are very developer-focused and are going to be very useful for any developer.

On the first side of that spectrum, there's an app that we made for transit subsidy enrollment. Treasury used to manage our transit subsidy balances. That involved going to a webpage that you would print out, fill in with a pen and then fax to someone.

Willey: Or scan and email it.

Burton: Right. And then once you'd had your supervisor sign it and faxed it over to someone, eventually, several weeks later, you would get your benefits. As we started to take over that process, the human resources office came to us and asked, "How can we do this better?"

Obviously, that should just be a web form that you type into, that will auto fill any detail it knows about you. You press submit and it goes into the database, which goes directly to the DOT [Department of Transportation]. So that's what we made. We demoed that for DOT and they really like it. USAID is also into it. It's encouraging to see that something really simple could prove really useful for other agencies.
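Purely as an illustration of the kind of pre-filled web form Burton describes, here is a minimal Django sketch. The model, field names and URL name are hypothetical and not taken from the CFPB's actual transit subsidy application.

    # Hypothetical sketch of a pre-filled enrollment form. The TransitSubsidyRequest
    # model, its fields and the "enroll_done" URL name are invented for illustration.
    from django import forms
    from django.shortcuts import redirect, render

    from myapp.models import TransitSubsidyRequest  # hypothetical model


    class TransitSubsidyForm(forms.ModelForm):
        class Meta:
            model = TransitSubsidyRequest
            fields = ["name", "email", "home_station", "monthly_amount"]


    def enroll(request):
        # Pre-fill whatever the system already knows about the logged-in employee.
        initial = {"name": request.user.get_full_name(), "email": request.user.email}
        form = TransitSubsidyForm(request.POST or None, initial=initial)
        if form.is_valid():
            form.save()  # stored in the database for later hand-off to DOT
            return redirect("enroll_done")
        return render(request, "enroll.html", {"form": form})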

On the other side of the spectrum, we use a lot of Django tools. As an example, we have a tool we just released through our website called "Ask CFPB." It's a Django-based question and answer tool, with a series of questions and answers.

Now, the content is managed in Django. All of the content is managed from our staging server behind the firewall. When we need to get that content, we need to get the update from staging over to production.

Before, what we had to do was pick up the entire database, copy it and then move it over to production, which was kind of a nightmare. And there was no Django tool for selectively moving data modifications.

So we sat there and we thought, "Oh, we really need something to do that because we're going to be doing a lot of it. We can't be copying the database over every time we need to correct a copy." So two of our developers developed a Django app called "Nudge." Basically, if you've ever seen a Django admin, you just go into it and it shows you, "Hey, here's everything that's changed. What do you want to move over?"

You can pick and choose what you want to move over and, with the click of a button, it goes to production. I think that's something that every Django developer will have a use for if they have a staging server.

In a way, we were sort of surprised it didn't exist. So, we needed it. We built it. Now we're giving it back and anybody in the world can use it.
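The sketch below is only meant to illustrate the general pattern Burton describes (serialize the selected changed records on staging, then push them to production), using stock Django APIs rather than Nudge's actual code. The Answer model and the PRODUCTION_IMPORT_URL setting are hypothetical.

    # Illustrative only: a bare-bones "push selected changes to production" admin
    # action built from stock Django APIs. This is NOT Nudge's implementation; the
    # Answer model and the PRODUCTION_IMPORT_URL setting are hypothetical.
    import requests
    from django.conf import settings
    from django.contrib import admin, messages
    from django.core import serializers

    from myapp.models import Answer  # hypothetical Q&A content model


    def push_to_production(modeladmin, request, queryset):
        """Serialize the records selected in the admin and POST them to production."""
        payload = serializers.serialize("json", queryset)
        resp = requests.post(
            settings.PRODUCTION_IMPORT_URL,  # hypothetical authenticated import endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()
        messages.success(request, "Pushed %d record(s) to production." % queryset.count())


    push_to_production.short_description = "Push selected items to production"


    class AnswerAdmin(admin.ModelAdmin):
        actions = [push_to_production]


    admin.site.register(Answer, AnswerAdmin)

On the production side, an import endpoint would deserialize the payload with serializers.deserialize("json", ...) and save each object, giving the same selective, record-by-record promotion that copying the whole database could not.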

You mentioned the cloud. I know that the CFPB is closely associated with Treasury. Are you using Treasury's FISMA moderate cloud?

Willey: We have a mix of what I would say are private and public clouds. On the public side, we're using our own cloud environments that we have established. On the private side, we are using Treasury for some of our apps. We're slowly migrating off of Treasury systems onto our own cloud infrastructure.

In the case of email, for example, we're looking at email as a service. So we'll be looking at Google, Microsoft and others just to see what's out there and what we might be able to use.

Why is it important for the CFPB to share code back to the public? And who else in the federal government has done something like this, aside from the folks at the White House?

Burton: We see it the same way that we believe the rest of the open source community sees it: The only way this stuff is going to get better and become more viable is if people share. Without that, then it'll only be hobbyists. It'll only be people who build their own little personal thing. Maybe it's great. Maybe it's not. Open source gets better by the community actually contributing to it. So it's self-interest in a lot of ways. If the tools get better, then what we have available to us gets better. We can actually do our mission better.

Using the transit subsidy enrollment application example, it's also an opportunity for government to help itself, for one agency to help another. We've created this thing. Every federal agency has a transit subsidy program. They all need to allow people to enroll in it. Therefore, it's immediately useful to any other agency in the federal government. That's just a matter of government improving its own processes.

If one group does it, why should another group have to figure it out or have to pay lots of money to have it figured out? Why not just share it internally and then everybody benefits?

Why do you think it's taken until 2012 to have that insight actually be made into reality in terms of a policy?

Burton: I think to some degree, the tools have changed. The ability to actually do this easily is a lot better now than it was even a year or two ago. Government also traditionally lags behind the private sector in a lot of ways. I think that's changing, too. With this administration in particular, I think what we've seen is that government has started to come a little bit closer to parity with the private sector, including some of the thinking around how to use technology to improve business processes. That's really exciting. And I think as a result, there are a lot of great people coming in as developers and designers who want to work in the federal government because they see that change.

Willey: It's also because we're new. There are two things behind that. First, we're able to sort of craft a technology philosophy with a modern perspective. So we can, from our founding, ask "What is the right way to do this?" Other agencies, if they want to do this, have to turn around decades of culture. We don't have that burden. I think that's a big reason why we're able to do this.

The second thing is a lot of agencies don't have the intense need that we do. We have 76 projects to do. We have to use every means available to us.

We can't say, "We're not going to use a large share of the software that's available to us." That's just not an option. We have to say, "Yes, we will consider this as a commercial good, just like any other piece of proprietary software."

In terms of the broader context for technology and policy, how does open source relate to open government?

Willey: When I was working for the District, Apps for Democracy was a big contest that we did around opening data and then asking developers to write applications using that data that could then be used by anybody. We said that the next logical step was to sort of create more participatory government. And in my mind, open sourcing the projects that we do is a way of asking the citizenry to participate in the active government.

So by putting something in the public space, somebody could pick that up. Maybe not the transit subsidy enrollment project — but maybe some other project that we've put out there that's useful outside of government as well as inside of government. Somebody can pick that code up, contribute to it and then we benefit. In that way, the public is helping us make government better.

When you have conversations around open source in government, what do you say about what it means to put your code online and to have people look at it or work on it? Can you take changes that people make to the code base to improve it and then use it yourself?

Willey: Everything that we put out there will be reviewed by our security team. The goal is that, by the time it's out there, it doesn't have any security vulnerabilities. If someone does discover a security vulnerability, however, we'll be sharing that code in a way that makes it much more likely that someone will point it out to us, and maybe even provide a fix, than exploit it. They wouldn't be exploiting our instance of the code; they would be working with the code on GitHub.com.

I've seen people in government with a misperception of what open source means. They hear that it's code that anyone can contribute to. I think that they don't understand that you're controlling your own instance of it. They think that anyone can come along and just write anything into your code that they like. And, of course, it's not like that.

I think as we talk more and more about this to other agencies, we might run into that, but I think it'll be good to have strong advocates in government, especially on the security side, who can say, "No, that's not the case; it doesn't work that way."

Burton: We have a firewall between our public and private instances of GitHub as well. So even if somebody contributes code, that's also reviewed on the way in. We wouldn't implement it unless we made sure that, from a security perspective, the code was not malicious. We're taking those precautions as well.

I can't point to one specifically, but I know that there have been articles and studies done on the relative security of open source. I think the consensus in the industry is that the peer review process of open source actually helps from a security perspective. It's not that you have a chaos of people contributing code whenever they want to. It improves the process. It's like the thinking behind academic papers. You do peer review because it enhances the quality of the work. I think that's true for open source as well.

We actually want to create a community of peer reviewers of code within the federal government. As we talk to agencies, we want people to actually use the stuff we build. We want them to contribute to it. We actually want them to be a community. As each agency contributes things, the other agencies can actually review that code and help each other from that perspective as well.

It's actually fairly hard. As we build more projects, it's going to put a little bit of a strain on our IT security team, doing an extra level of scrutiny to make sure that the code going out is safe. But the only way to get there is to grow that pie. And I think by talking with other agencies, we'll be able to do that.

A classic open source koan is that "with many eyes, all bugs become shallow." In IT security, is it that with many eyes, all worms become shallow?

Burton: What the Department of Defense said was if someone has malicious intent and the code isn't available, they'll have some way of getting the code. But if it is available and everyone has access to it, then any vulnerabilities that are there are much more likely to be corrected than before they're exploited.

How do you see open source contributing to your ability to get insights from large amounts of data? If you're recruiting developers, can they actually make a difference in helping their fellow citizens?

Burton: It's all about recruiting. As we go out and we bring on data people and software developers, we're looking for that kind of expertise. We're looking for people that have worked with PostgreSQL. We're looking for people that have worked with Solr. We're looking for people that have worked with Hadoop, because then we can start to build that expertise in-house. Those tools are out there.

R is an interesting example. What we're finding is that as more people are coming out of academia into the professional world, they're actually used to using R in school. And then they have to come out and learn a different tool when they're actually working in the marketplace.

It's similar with the Mac versus the PC. You get people using the Mac in college — and suddenly they have to go to a Windows interface. Why impose that on them? If they're going to be extremely productive with a tool like R, why not allow that to be used?

We're starting to see, in some pockets of the bureau, push from the business side to actually use some of these tools, which is great. That's another change I think that's happened in the last couple of years.

Before, there would've been big resistance on that kind of thing. Now that we're getting pushed a little bit, we have to respond to that. We also think it's worth it that we do.


March 13 2012

Four short links: 13 March 2012

  1. Microsoft Universal Voice Translator -- the promise is that it converts your voice into another language, but the effect is more that it converts your voice into that of Darth You in another language. Still, that's like complaining that the first Wright Brothers flight didn't serve peanuts. (via Hacker News)
  2. Geography of the Basketball Court -- fascinating analytics of where NBA shooters make their shots from. Pretty pictures and sweet summaries even if you don't follow basketball. (via Flowing Data)
  3. Spark Research -- a programming model ("resilient distributed datasets") for applications that reuse an intermediate result in multiple parallel operations. (via Ben Lorica)
  4. Opening Up -- earlier I covered the problems that the University of Washington's 3D printing lab had with the university's new IP policy, which prevented them from being as open as they had been. They've been granted the ability to distribute their work under a Creative Commons license and are taking their place again as a hub of the emerging 3D printing world. (via BoingBoing)

March 12 2012

O'Reilly Radar Show 3/12/12: Best data interviews from Strata California 2012

Below you'll find the script and associated links from the March 12, 2012 episode of O'Reilly Radar. An archive of past shows is available through O'Reilly Media's YouTube channel and you can subscribe to episodes of O'Reilly Radar via iTunes.



In this special edition of the Radar Show we're bringing you three of our best interviews from the 2012 Strata Conference in California.

First up is Hadoop creator Doug Cutting discussing the similarities between Linux and the big data world. [Interview begins 16 seconds in.]

In our second interview from Strata California, Max Gadney from After the Flood explains the benefits of video data graphics. [Begins at 7:04.]

In our final Strata CA interview, Kaggle's Jeremy Howard looks at the difference between big data and analytics. [Begins at 13:46.]

Closing

Just a reminder that you can always catch episodes of O'Reilly Radar at youtube.com/oreillymedia and subscribe to episodes through iTunes.

All of the links and resources mentioned during this episode are posted at radar.oreilly.com/show.

That's all we have for this episode. Thanks for joining us and we'll see you again soon.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Four short links: 12 March 2012

  1. Web-Scale User Modeling for Targeting (Yahoo! Research, PDF) -- research paper that shows how online advertisers build profiles of us and what matters (e.g., ads we buy from are more important than those we simply click on). Our recent surfing patterns are more relevant than historical ones, which is another indication that the value of data analytics increases the closer to real-time it happens. (via Greg Linden)
  2. Information Technology and Economic Change -- research showing that cities which adopted the printing press had no prior growth advantage, but subsequently grew far faster than similar cities without printing presses. [...] The second factor behind the localisation of spillovers is intriguing given contemporary questions about the impact of information technology. The printing press made it cheaper to transmit ideas over distance, but it also fostered important face-to-face interactions. The printer’s workshop brought scholars, merchants, craftsmen, and mechanics together for the first time in a commercial environment, eroding a pre-existing “town and gown” divide.
  3. They Just Don't Get It (Cameron Neylon) -- curating access to a digital collection does not scale.
  4. Should Libraries Get Out of the Ebook Business? -- provocative thought: the ebook industry is nascent, a small number of patrons have ereaders, the technical pain of DRM and incompatible formats makes for disproportionate support costs, and there are already plenty of worthy things libraries should be doing. I only wonder how quickly the dynamics change: a minority may have dedicated ereaders but a large number have smartphones and are reading on them already.

March 01 2012

Profile of the Data Journalist: The Elections Developer

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context and clarity and, perhaps most important, to find truth in the expanding amount of digital content in the world.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Derek Willis (@derekwillis) is a news developer based in New York City. Our interview follows.

Where do you work now? What is a day in your life like?

I work for The New York Times as a developer in the Interactive News Technologies group. A day in my work life usually includes building or improving web applications relating to politics, elections and Congress, although I also get the chance to branch out to do other things. Since elections are such an important subject, I try to think of ways to collect information we might want to display and of ways to get that data in front of readers in an intelligent and creative manner.

How did you get started in data journalism? Did you get any special degrees or certificates?

No, I started working with databases in graduate school at the University of Florida (I left for a job before finishing my master's degree). I had an assistantship at an environmental occupations training center and part of my responsibilities was to maintain the mailing list database. And I just took to it - I really enjoyed working with data, and once I found Investigative Reporters & Editors, things just took off for me.

Did you have any mentors? Who? What were the most important resources they shared with you?

A ton of mentors, mostly met through IRE but also people at my first newspaper job at The Palm Beach Post. A researcher there, Michelle Quigley, taught me how to find information online and how sometimes you might need to take an indirect route to locating the stuff you want. Kinsey Wilson, now the chief content officer at NPR, hired me at Congressional Quarterly and constantly challenged me to think bigger about data and the news. And my current and former colleagues at The Times and The Washington Post are an incredible source of advice, counsel and inspiration.

What does your personal data journalism "stack" look like? What tools could you not live without?

It's pretty basic: spreadsheets, databases (MySQL, PostgreSQL, SQLite) and a programming language like Python or, these days, Ruby. I've been lucky to find excellent tools in the Ruby world, such as the Remote Table gem by Brighter Planet, and a host of others. I like PostGIS for mapping stuff.

What data journalism project are you the most proud of working on or creating?

I'm really proud of the elections work at The Times, but can't take credit for how good it looks. A project called Toxic Waters was also incredibly challenging and rewarding to work on. But my favorite might be the first one: the Congressional Votes Database that Adrian Holovaty, Alyson Hurt and I created at The Post in late 2005. It was a milestone for me and for The Post, and helped set the bar for what news organizations could do with data on the web.

Where do you turn to keep your skills updated or learn new things?

My colleagues are my first source. When you work with Jeremy Ashkenas, the author of the Backbone and Underscore JavaScript libraries, you see and learn new things all the time. Our team is constantly bouncing new concepts around. I wish I had more time to learn new things; maybe after the elections!

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

A couple of reasons: one is that we live in an age where information is plentiful. Tools that can help distill and make sense of it are valuable. They save time and convey important insights. News organizations can't afford to cede that role. The second is that they really force you to think about how the reader/user is getting this information and why. I think news apps demand that you don't just build something because you like it; you build it so that others might find it useful.

This email interview has been edited and condensed for clarity.

February 27 2012

Big data is the next big thing in health IT

During the 2012 HIMSS conference in Las Vegas I was invited by Dell Healthcare, along with a group of health IT experts, to discuss issues in health information technology. The session sparked some passionate discourse about the challenges and opportunities that are important to the health IT industry.

Moderator Dan Briody started the event with a question about things we had seen at HIMSS that had changed our thinking about health IT. Never being shy, I jumped right in and spoke about the issues of payment reform and how the private market is beginning to show signs of disruptive innovation. After a great deal of back and forth among the panelists it seemed we slipped into listing many of the barriers — technological, political and cultural — that health IT faces. I was hoping we would get back to sharing possible solutions, so I made the proposal that big data is the next big thing in health IT (see video below).

When I talk about "big data" I am referring to a dataset that is too large for a typical database software tool to store, manage, and analyze. Obviously, as technology changes and improves, the size of a dataset that would qualify as "big data" will change as well. There is also a big data difference between healthcare and other industry sectors, since there are different tools available and the required datasets have varying sizes. Since health data is very personal and sensitive, it also has special security and privacy protections. This makes sharing, aggregating, sorting and analyzing the data sometimes challenging.

Another difficulty in making the most of big data in healthcare is that those who control different pools of data have different financial incentives. There is a lack of transparency in performance, cost and quality. The system is currently structured so that payers would gain from decreasing revenue to providers, yet the providers control the clinical data that must be analyzed in order to pay for value. The payers control another pool, which includes claims data, and this is not very useful for advanced analysis that will provide real insight. But enabling transparency of the data will help to identify and analyze sources of variability as well as find waste and inefficiencies. Publishing quality and performance data will also help patients make informed health decisions.

The proliferation of digital health information, including both clinical and claims information, is creating some very large datasets. This also creates some significant opportunity. For instance, analyzing and synthesizing clinical records and claims data can help identify patients appropriate for inclusion in a particular clinical trial. These new datasets can also help to provide insight into improved clinical decision making. One great example of this is when an analysis of a database of 1.4 million Kaiser Permanente members helped determine that Vioxx, a popular pain reliever that was widely used by arthritis patients, was dangerous. Vioxx was a big moneymaker for Merck, generating about $2.5 billion in yearly sales, and there was quite a battle to get the drug off the market. Only by having the huge dataset available from years of electronic health records, and tools to properly analyze the data, was this possible.

The big data portion of the Dell think tank discussion is embedded below. You can find video from the full session here.


February 24 2012

Four short links: 24 February 2012

  1. Excel Cloud Data Analytics (Microsoft Research) -- clever: a cloud analytics backend with Excel as the frontend. Almost every business and finance person I've known has been way more comfortable with Excel than any other tool. (via Dr Data)
  2. HTTP Client -- Mac OS X app for inspecting and automating a lot of HTTP. cf the lovely Charles proxy for debugging. (via Nelson Minar)
  3. The Creative Destruction of Medicine -- using big data, gadgets, and sweet tech in general to personalize and improve healthcare. (via New York Times)
  4. EFF Wins Protection of Time Zone Database (EFF) -- I posted about the silliness before (maintainers of the only comprehensive database of time zones was being threatened by astrologers). The EFF stepped in, beat back the buffoons, and now we're back to being responsible when we screw up timezones for phone calls.

February 23 2012

Everyone has a big data problem

Jonathan Gosier (@jongos), designer, developer and the co-founder of metaLayer.com, says the big data deluge presents problems for everyone, not just corporations and governments.

Gosier will be speaking at next week's Strata conference on "The Democratization of Data Platforms." In the following interview, he discusses the challenges and opportunities data democratization creates.

Your keynote is going to be about "everyone's" big data problems. If everyone really does have their own big data problem, how are we going to democratize big data tools and processes? It seems that our various problems would require many different solutions.

Jonathan Gosier: It's a problem for everyone because data problems can manifest in a multitude of ways: too much email, too many passwords to remember, a deluge of legal documents related to a mortgage, or simply knowing where to look online for the answers to simple questions.

You're absolutely correct in noting that each of these problems requires different solutions. However, many of these solutions tend not to be accessible to the average person, whether this is because of prices or a level of expertise required to use the tools available.

There is a lot of talk about a "digital divide," but there's a growing "data divide" as well. It's no longer about having basic computer literacy skills. Being able to understand what data is available, how it can be manipulated, and how it can be used to actually improve one's life is a skill that not everyone possesses.

There's an opportunity here for growth as well. If you look at the market, there are tools for visualizing personal finance (think Mint.com or HelloWallet), personal health (23andMe), personal productivity (Basecamp), etc. But the overarching trend is that there is a growing need for products that simplify the wealth of information around people. The simplest way to do this is often through visuals.

Why are visualizations so important to a better understanding of data?

Jonathan Gosier: Visualizations are only "better" in that they can relate complex ideas to a general audience. Visualization is by no means a replacement for expertise and research. It simply represents a method for communicating across barriers of knowledge.

But beyond that, the problem with a lot of the data visuals on the web is that they are static, pre-constructed, and vague about their data sources. This means the general public either has to take what's presented at face value and agree or disagree, or they have to conduct their own research.

There's a need for "living infographics" — visualizations that are inviting and easy to understand, but are shared with the underlying data used to create them. This allows the casual consumer to simply admire the visual while the more discerning audience can actually analyze the underlying data to see if the message being presented is consistent with their findings.

It's far more transparent and credible to reveal, versus conceal, one's sources.

One of the pushbacks to data democratization efforts is that people might not know how to use these tools correctly and/or they might use them to further their own agendas. How do you respond to that?

Jonathan Gosier: The question illustrates the point, actually. It wasn't so long ago that the same could be said about the printing press. It was an innovation, but initially it was so expensive that it was only available to the elite and wealthy. Now it's common (at least in the Western world) for any given middle-class household to contain an inexpensive printing device. The web radicalized things even more, essentially turning anyone with access into a publisher.

So the question becomes, was it good or bad that publishing became something that anyone could do versus a select few? I'd argue that, ultimately, the pros have outweighed the cons by orders of magnitude.

Right now data can be thought of as an asset of the elite and privileged. Those with wealth pay a lot for it, and those who are highly skilled can charge a great deal for their services around it. But the reality is, there is a huge portion of the market that has a legitimate need for data solutions that aren't currently available to them.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


February 21 2012

Building the health information infrastructure for the modern epatient

To learn more about what levers the government is pulling to catalyze innovation in the healthcare system, I turned to Dr. Farzad Mostashari (@Farzad_ONC). As the National Coordinator for Health IT, Mostashari is one of the most important public officials entrusted with improving the nation's healthcare system through smarter use of technology.

Mostashari, a public-health informatics specialist, was named ONC chief in April 2011, replacing Dr. David Blumenthal. Mostashari's full biography, available at HHS.gov, notes that he "was one of the lead investigators in the outbreaks of West Nile Virus and anthrax in New York City, and was among the first developers of real-time electronic disease surveillance systems nationwide."

I talked to Mostashari on the same day that he published a look back over 2011, which he hailed as a year of momentous progress in health information technology. Our interview follows.

What excites you about your work? What trends matter here?

Farzad Mostashari‏: Well, it's a really fun job. It feels like this is the ideal time for this health IT revolution to tie into other massive megatrends that are happening around consumer and patient empowerment, payment and delivery reform, as I talked about in my TED Med Talk with Aneesh Chopra.

These three streams [how patients are cared for, how care is paid for, and how people take care of their own health] coming together feels great. And it really feels like we're making amazing progress.

How does what's happening today grow out of the passage of the Health Information Technology for Economic and Clinical Health Act (HITECH) Act in 2009?

Farzad Mostashari‏: HITECH was a key part of ARRA, the American Recovery and Reinvestment Act. This is the reinvestment part. People think of roadways and runways and railways. This is the information infrastructure for healthcare.

In the past two years, we made as much progress on adoption as we had made in the past 20 years before that. We doubled the adoption of electronic health records in physician offices between the time the stimulus passed and now. What that says is that a large number of barriers have been addressed, including the financial barriers that are addressed by the health IT incentive payments.

It also, I think, points to the innovation that's happening in the health IT marketplace, with more products that people want to buy and want to use, and an explosion in the number of options people have.

The programs we put in place, like the Regional Health IT Extension Centers modeled after the Agriculture Extension program, give a helping hand. There are local nonprofits throughout the country that are working with one-third of all primary care providers in this country to help them adopt electronic health records, particularly smaller practices and maybe health centers, critical access hospitals and so forth.

This is obviously a big lift and a big change for medicine. It moves at what Jay Walker called "med speed," not tech speed. The pace of transformation in medicine that's happening right now may be unparalleled. It's a good thing.

Healthcare providers have a number of options as they adopt electronic health records. How do you think about the choice between open source versus proprietary options?

Farzad Mostashari‏: We're pretty agnostic in terms of the technology and the business model. What matters are the outcomes. We've really left the decisions about what technology to use to the people who have to live with it, like the doctors and hospitals who make the purchases.

There are definitely some very successful models, not only on the EHR side, but also on the health information exchange side.

(Note: For more on this subject, read Brian Ahier's Radar post on the Health Internet.)

What role do open standards play in the future of healthcare?

Farzad Mostashari‏: We are passionate believers in open standards. We think that everybody should be using them. We've gotten really great participation by vendors of open source and proprietary software, in terms of participating in an open standards development process.

I think what we've enabled, through things like modular certification, is a lot more innovation. Different pieces of the entire ecosystem could be done through reducing the barrier to entry, enabling a variety of different innovative startups to come to the field. What we're seeing is, a lot of the time, this is migrating from installed software to web services.

If we're setting up a reference implementation of the standards, like the Connect software or popHealth, we do it through a process where the result is open source. I think the government as a platform approach at the Veterans Affairs department, DoD, and so forth is tremendously important.

How is the mobile revolution changing healthcare?

Farzad Mostashari: We had Jay Walker talking about big change [at a recent ONC Grantee Meeting]. I just have this indelible image of him waving in his left hand a clay cone with cuneiform on it that is from 2,000 B.C. — 4,000 years ago — and in his right hand he held his iPhone.

He was saying both of them represented the cutting edge of technology that evolved to meet consumer need. His strong assertion was that this is absolutely going to revolutionize what happens in medicine at tech speed. Again, not "med speed."

I had the experience of being at my clinic, where I get care, and the pharmacist in the starched white coat sitting behind the counter telling me that I should take this medicine at night.

And I said, "Well, it's easier for me to take it in the morning." And he said, "Well, it works better at night."

And I asked, acting as an empowered patient, "Well, what's the half life?" And he answered, "Okay. Let me look it up."

He started clacking away at his pharmacy information system; clickity clack, clickity clack. I can't see what he's doing. And then he says, "Ah hell," and he pulls out his smartphone and Googles it.

There's now a democratization of information and information tools, where we're pushing the analytics to the cloud. Being able to put that in the hand of not just every doctor or every healthcare provider but every patient is absolutely going to be that third strand of the DNA, putting us on the right path for getting healthcare that results in health.

We're making sure that people know they have a right to get their own data, making sure that the policies are aligned with that. We're making sure that we make it easy for doctors to give patients their own information through things like the Direct Project, the Blue Button, meaningful use requirements, or the Consumer E-Health Pledge.

We have more than 250 organizations that collectively hold data for 100 million Americans and that have pledged to make it easy for people to get electronic copies of their own data.

Do you think people will take ownership of their personal health data and engage in what Susannah Fox has described as "peer-to-peer healthcare"?

Farzad Mostashari‏: I think that it will be not just possible, not even just okay, but actually encouraged for patients to be engaged in their care as partners. Let the epatient help. I think we're going to see that emerging as there's more access and more tools for people to do stuff with their data once they get it through things like the health data initiative. We're also beginning to work with stakeholder groups, like Consumer's Union, the American Nurses Association and some of the disease groups, to change attitudes around it being okay to ask for your own records.

This interview was edited and condensed. Photo from The Office of the National Coordinator for Health Information Technology.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


February 15 2012

Book marketing is broken. Big data can fix it

Peter Collingridge (@gunzalis), cofounder of Enhanced Editions, says digital books require a new style of data-driven marketing and promotion that publishers aren't yet implementing. He also says that book marketing is broken and big data is the solution.

In the following interview, Collingridge talks about how real-time data and analytics can help publishers, and he shares insights from the beta period of Bookseer, a market intelligence service for books that his company is developing.

What are some key findings from the Bookseer beta?

Peter Collingridge: I think despite the increasing awareness of data as a critical tool for publishers to compete, it's genuinely hard for people to look at data as a natural addition to the work they are doing, whether that's in PR, marketing, acquisition, or pricing.

Publishing has operated in a well-defined way for a long time, where experience and intuition have dominated decision making and change is hard. What has been really exciting is that when people have the data in front of them, clearly showing the immediate impact of something they did — a link between cause and effect that they couldn't see before — they get really excited. We've had people talking about being "obsessed" and "addicted" to the data.

Some of the most surprising findings: that for some titles, big price changes matter less to sales volume than everyone thinks; that glowing reviews of literary fiction from big names don't have nearly enough impact on sales to merit the effort; and that social media buzz almost never translates into sales.

For me, the key observations so far are around marketing. First, big-budget media spending and ostentatious banner ads might impress authors and bookshops, but they deliver very poor return on investment (ROI) for sales. Second, the super-smart publishers are behaving like startups: they run tiny pieces of very focused, cheap marketing, watch the results like hawks, and iterate in direct response to the data. Bookseer is designed to expose the former and to aid the latter, and that is probably our biggest finding: it works!

Find out more about Bookseer in the following video from the If Book Then conference earlier this year in Milan.

What kinds of data are most important for publishers to track?

Peter Collingridge: Before we built Bookseer, we spoke with 25 people across the industry, including authors big, small and unpublished; editors and publishers; managing directors; digital directors; sales, marketing and PR directors; and literary agents. We asked exactly that question.

For most people, the data they had was pretty basic: Nielsen (which obviously only goes to the granularity of one week) plus the F5 button to manically refresh an Amazon web page for changes in sales rank. Neither of these is particularly helpful in determining the impact of an activity.

Of course, there are loads of data points, but we began with the lowest-hanging fruit: aggregated sales (print and digital) across multiple sources; Amazon sales rank; price; best-seller charts; social media mentions; buzz; review coverage in mainstream and new media, and on social reading sites; and other factors such as promotion (advertising and other) and merchandising.

We think the most important thing to do is aggregate activity and data points across as many sources as possible, building a picture of what's going on for one title or across a whole retailer, and allowing publishers to draw their own conclusions.
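
To make that aggregation concrete, here's a minimal sketch of merging several signals for a single title into one date-keyed timeline. The source names and fetch_* helpers are hypothetical placeholders for real feeds (sales data, Amazon rank, media mentions); this is an illustration of the shape of the problem, not Bookseer's actual pipeline.

    from collections import defaultdict
    from datetime import date

    # Hypothetical per-source fetchers; each returns {date: value} for one ISBN.
    # In a real system these would call retailer feeds, APIs or scrapers.
    def fetch_print_sales(isbn):
        return {date(2012, 2, 1): 420, date(2012, 2, 2): 385}

    def fetch_ebook_sales(isbn):
        return {date(2012, 2, 1): 130, date(2012, 2, 2): 160}

    def fetch_amazon_rank(isbn):
        return {date(2012, 2, 1): 812, date(2012, 2, 2): 655}

    def fetch_media_mentions(isbn):
        return {date(2012, 2, 2): 3}

    SOURCES = {
        "print_sales": fetch_print_sales,
        "ebook_sales": fetch_ebook_sales,
        "amazon_rank": fetch_amazon_rank,
        "media_mentions": fetch_media_mentions,
    }

    def build_timeline(isbn):
        """Merge every available signal for one title into a single picture."""
        timeline = defaultdict(dict)
        for name, fetch in SOURCES.items():
            for day, value in fetch(isbn).items():
                timeline[day][name] = value
        return dict(sorted(timeline.items()))

    if __name__ == "__main__":
        for day, signals in build_timeline("9780000000000").items():
            print(day, signals)

Once every signal sits on the same timeline, spotting that a rank jump followed a review, or that a price drop didn't move volume, becomes a matter of reading across one row rather than juggling separate reports.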

What does real-time data let publishers do?

Peter Collingridge: Publishing has been B2B, about supplying books into bookshops and working with the media to support that, for as long as anyone can remember. And for that world, weekly aggregated retail sales work, I guess. But when you're in a much faster-paced world, with the industry moving toward being consumer- rather than trade-facing, and with a fragmented retail and media landscape, you need to make decisions based on fact: What is the ROI on a £50,000 marketing campaign? Where do my banner ads have the best CTR? Who are the key influencers here — are they bloggers, mainstream media, or somewhere else? How many of our Twitter followers actually engage? When should we publish, in what format, and at what price?

Data should absolutely inform the answers to these questions. Furthermore, with a disciplined approach to promotion, where activities are separated from each other by a day or a few hours, real-time measurement can identify what works and what doesn't. We can identify the difference between Al Gore tweeting about a book and Tim O'Reilly doing the same; the difference between a Time review and a piece on CNN; the impact of a price drop against an email sent to 200,000 subscribers; and measure the exact ROI on a £300 campaign against a £30,000 one.

Over time, you build up a picture of which tactics work best and which don't. And immediate feedback allows you to hone your activities in real-time to what works best (particularly if you are A/B testing different approaches), or from a more strategic perspective, to plan out campaigns that have historically worked best for comparable titles.
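
Under the disciplined approach Collingridge describes, where promotions are separated in time, the measurement itself can stay very simple. The sketch below compares average daily sales before and after an activity and turns the incremental units into an ROI figure; the numbers, margin and window size are invented for illustration and are not drawn from Bookseer.

    from statistics import mean

    def uplift(daily_sales, event_day, window=3):
        """Average daily sales after a promotion minus the average before it.

        daily_sales: list of (day_index, units) tuples in chronological order.
        event_day: the day on which the promotion ran.
        """
        before = [u for d, u in daily_sales if event_day - window <= d < event_day]
        after = [u for d, u in daily_sales if event_day < d <= event_day + window]
        return mean(after) - mean(before)

    def roi(extra_units, margin_per_unit, campaign_cost):
        """Return on investment for the incremental sales a campaign produced."""
        return (extra_units * margin_per_unit - campaign_cost) / campaign_cost

    # Hypothetical daily unit sales for one title, with a promotion on day 10.
    sales = [(day, 100 + 5 * day) for day in range(1, 30)]
    print(uplift(sales, event_day=10))   # incremental daily units after the promotion
    print(roi(extra_units=400, margin_per_unit=2.5, campaign_cost=300))

Run the same comparison on a £300 campaign and a £30,000 one and you have exactly the like-for-like ROI question raised above.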

How would you describe the relationship between sales and social media?

Peter Collingridge: Right now, sales drives social — not the other way round. However, I believe there will come a point when that's not the case, and we will be able to identify that.

This interview was edited and condensed.

February 14 2012

The bond between data and journalism grows stronger

While reporters and editors have been the traditional vectors for information gathering and dissemination, the flattened information environment of 2012 now has news breaking first online, not on the newsdesk.

That doesn't mean that the integrated media organizations of today don't play a crucial role. Far from it. In the information age, journalists are needed more than ever to curate, verify, analyze and synthesize the wash of data.

To learn more about the shifting world of data journalism, I interviewed Liliana Bounegru (@bb_liliana), project coordinator of SYNC3 and Data Driven Journalism at the European Journalism Centre.

What's the difference between the data journalism of today and the computer-assisted reporting (CAR) of the past?

Liliana Bounegru: There is a "continuity and change" debate going on around the label "data journalism" and its relationship with previous journalistic practices that employ computational techniques to analyze datasets.

Some argue [PDF] that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense, data journalism pays equal attention to finding stories and to the data itself. Hence, we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.

Another difference is that in the past, investigative reporters would suffer from a poverty of information relating to a question they were trying to answer or an issue that they were trying to address. While this is, of course, still the case, there is also an overwhelming abundance of information that journalists don't necessarily know what to do with. They don't know how to get value out of data. As Philip Meyer recently wrote to me: "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important."

On the other hand, some argue that there is no difference between data journalism and computer-assisted reporting. It is by now common sense that even the most recent media practices have histories as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition but responding to new circumstances and conditions. Even if there might not be a difference in goals and techniques, the emergence of the label "data journalism" at the beginning of the century indicates a new phase wherein the sheer volume of data that is freely available online combined with sophisticated user-centric tools enables more people to work with more data more easily than ever before. Data journalism is about mass data literacy.

What does data journalism mean for the future of journalism? Are there new business models here?

Liliana Bounegru: There are all kinds of interesting new business models emerging around data journalism. Media companies are becoming increasingly innovative in the way they produce revenue, moving away from subscription-based models and advertising toward offering consultancy services, as in the case of the award-winning German OpenDataCity.

Digital technologies and the web are fundamentally changing the way we do journalism. Data journalism is one part of the ecosystem of tools and practices that have sprung up around data sites and services. Quoting and sharing source materials (structured data) is natural to the hyperlink structure of the web and to the way we are accustomed to navigating information today. By enabling anyone to drill down into data sources, find information that is relevant to them as individuals or to their community, and do their own fact checking, data journalism provides a much-needed service from a trustworthy source. Quoting and linking to data sources is specific to data journalism at the moment, but the seamless integration of data into the fabric of media is increasingly the direction journalism is heading. As Tim Berners-Lee says, "data-driven journalism is the future".

What data-driven journalism initiatives have caught your attention?

Liliana Bounegru: The data journalism project FarmSubsidy.org is one of my favorites. It addresses a real problem: The European Union (EU) is spending 48% of its budget on agriculture subsidies, yet the money doesn't reach those who need it.

Tracking payments and recipients of agriculture subsidies from the European Union to all member states is a difficult task. The data is scattered in different places in different formats, with some missing and some scanned in from paper records. It is hard to piece it together to form a comprehensive picture of how funds are distributed. The project not only made the data available to anyone in an easy to understand way, but it also advocated for policy changes and better transparency laws.

LRA Crisis Tracker

Another of my favorite examples is the LRA Crisis Tracker, a real-time crisis mapping platform and data collection system. The tracker makes information about the attacks and movements of the Lord's Resistance Army (LRA) in Africa publicly available. It helps to inform local communities, as well as the organizations that support the affected communities, about the activities of the LRA through an early-warning radio network in order to reduce their response time to incidents.

I am also a big fan of much of the work done by the Guardian Datablog. You can find lots of other examples featured on datadrivenjournalism.net, along with interviews, case studies and tutorials.

I've talked to people like Chicago Tribune news app developer Brian Boyer about the emerging "newsroom stack." What do you feel are the key tools of the data journalist?

Liliana Bounegru: Experienced data journalists list spreadsheets as a top data journalism tool. Open source tools and web-based applications for data cleaning, analysis and visualization play very important roles in finding and presenting data stories. I have been involved in organizing several workshops on ScraperWiki and Google Refine for data collection and analysis, and we found that participants were quickly able to ask and answer new kinds of questions with these tools.
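
The cleanup those tools handle is often mundane: trimming whitespace, unifying names, coercing amounts into numbers. Here is a minimal Python sketch of that kind of normalisation on an assumed CSV of payments; the file and column names are hypothetical, and tools like Google Refine or ScraperWiki would do the same job interactively.

    import csv

    def clean_rows(path):
        """Yield tidied rows from a messy CSV of payments.

        Assumes (for illustration) columns named "recipient" and "amount_eur".
        """
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                recipient = row["recipient"].strip().title()
                raw_amount = row["amount_eur"].replace(",", "").strip()
                try:
                    amount = float(raw_amount)
                except ValueError:
                    continue  # skip rows whose amount can't be parsed
                yield {"recipient": recipient, "amount_eur": amount}

    # Example usage, assuming a local file called payments.csv exists:
    # total = sum(r["amount_eur"] for r in clean_rows("payments.csv"))
    # print("Total traceable payments: {:,.2f} EUR".format(total))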

How does data journalism relate to open data and open government?

Liliana Bounegru: Open government data means that more people can access and reuse official information published by government bodies. This in itself is not enough. It is increasingly important that journalists can keep up and are equipped with skills and resources to understand open government data. Journalists need to know what official data means, what it says and what it leaves out. They need to know what kind of picture is being presented of an issue.

Public bodies are very experienced in presenting data to the public in support of official policies and practices. Journalists, however, will often not have this level of literacy. Only by equipping journalists with the skills to use data more effectively can we break the current asymmetry, where our understanding of the information that matters is mediated by governments, companies and other experts. In a nutshell, open data advocates push for more data, and data journalists help the public to use, explore and evaluate it.

This interview has been edited and condensed for clarity.

Photo on associated home and category pages: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr.

February 13 2012

Four short links: 13 February 2012

  1. Rise of the Independents (Bryce Roberts) -- companies that don't take VC money and instead choose to grow organically: indies. +1 for having a word for this.
  2. The Performance Golden Rule (Steve Souders) -- 80-90% of the end-user response time is spent on the frontend. Check out his graphs showing where load times come from for various popular sites. The backend responds quickly, but loading all the JavaScript, images, CSS, embedded autoplaying videos and all that kerfuffle takes much, much longer.
  3. Starry Night Comes to Life -- wow, beautiful, must-see.
  4. MapReduce Patterns, Algorithms, and Use Cases -- In this article I digest a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. A minimal word-count sketch in that style follows below.
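
For readers who haven't seen that model, here is the canonical word-count example written as a Hadoop Streaming job in Python. It is a generic sketch of mappers and reducers, not code taken from the linked article.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming word count.
    # Local dry run: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    import sys
    from itertools import groupby

    def mapper(lines):
        # Emit "word<TAB>1" for every word; Hadoop shuffles and sorts by key.
        for line in lines:
            for word in line.split():
                print("{}\t1".format(word.lower()))

    def reducer(lines):
        # Identical keys arrive together, so group them and sum their counts.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print("{}\t{}".format(word, sum(int(count) for _, count in group)))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)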

February 10 2012

O'Reilly Radar Show 2/10/12: The 5 trends that will shape the data world

Below you'll find the script and associated links from the February 10, 2012 episode of O'Reilly Radar. An archive of past shows is available through O'Reilly Media's YouTube channel and you can subscribe to episodes of O'Reilly Radar via iTunes.


Introduction

There are five major trends that will shape the data world in the months to come. Strata Conference chair Edd Dumbill reveals them in this episode of O'Reilly Radar. [Starts 12 seconds in.]

Also in this episode: We revisit a conversation with Wired's Kevin Kelly in which he discusses freemium models and why digital rights management will likely persist in some form or another. [Interview begins at 11:04.]

Radar posts of note

[This segment begins at the 10:06 mark.]

For now, legislators have backed off of the Stop Online Piracy Act and the Protect IP Act, but the friction over online piracy persists. In his piece "SOPA and PIPA are bad industrial policy," Tim O'Reilly explains why these efforts — and those sure to emerge down the road — hold back innovative business models that grow the overall market.

It's the hot trend in software right now, but what does big data mean, and how can you exploit it? In "What is big data?," Strata chair Edd Dumbill presents an introduction and orientation to the big data landscape.

Finally, books, publishing processes and readers have all made the jump to digital, and that's creating considerable opportunities for publishing startups. Justo Hidalgo explores the digital shift in his piece, "Three reasons why we're in a golden age of publishing entrepreneurship."

As always, links to these stories and other resources mentioned during this episode are available at radar.oreilly.com/show.

Radar video spotlight

At the 2011 Tools of Change for Publishing conference I had a chance to interview Wired's Kevin Kelly about two topics that continue to play big roles in the content world: the freemium model and digital rights management.

As you'll see in the following video, Kelly has a unique, long-view perspective on both of these issues.

[Interview begins at 11:04.]

Closing

Just a reminder that you can always catch episodes of O'Reilly Radar at youtube.com/oreillymedia and subscribe to episodes through iTunes.

All of the links and resources mentioned during this episode are posted at radar.oreilly.com/show.

That's all we have for this episode. Thanks for joining us and we'll see you again soon.

Four short links: 10 February 2012

  1. Monki Gras 2012 (Stephen Walli) -- nice roundup of highlights from the RedMonk conference in London. Sample talk: Why Most UX is Shite.
  2. Frozen -- flow-based programming; the intent is to build a toolbox of small pieces loosely joined by ZeroMQ for big data programming.
  3. Arctext.js -- jQuery plugin for curving text on web pages. (via Javascript Weekly)
  4. Hi, My Name is Diane Feinstein (BuyTheVote) -- presents the SOPA position and the entertainment industry's campaign contributions together with a little narrative. Clever and powerful. (via BoingBoing)

February 08 2012

Four short links: 8 February 2012

  1. Mavuno -- an open source, modular, scalable text mining toolkit built upon Hadoop. (Apache-licensed)
  2. Cow Clicker -- Wired profile of Cow Clicker creator Ian Bogost. I was impressed by how Cow Clicker players [...] have turned what was intended to be a vapid experience into a source of camaraderie and creativity. People create communities around social activities, even when they are antisocial. (via BoingBoing)
  3. Unicode Has a Pile of Poo Character (BoingBoing) -- this is perfect.
  4. The Research Works Act and the Breakdown of Mutual Incomprehension (Cameron Neylon) -- an excellent summary of how researchers and publishers view each other and their place in the world.
