
June 26 2012

Health records support genetics research at Children's Hospital of Philadelphia

Michael Italia leads a team of programmers and scientists at the Center for Biomedical Informatics at Children's Hospital of Philadelphia (CHOP), where they develop applications, data repositories, and web interfaces to support CHOP's leading roles in both treatment and research. Recently we recorded an interview discussing the collection of data at CHOP and its use to improve both care and long-term research.

Italia, who will speak on this topic at OSCON, describes how the informatics staff derived structured data from electronic health record (EHR) forms developed by audiologists to support both research and clinical care. He describes the custom web interface that makes data available to researchers and discusses the exciting potential of genomic sequencing to improve care. He also lists tools used to collect and display data, many of which are open source.

Particular topics in this video include:

  • The relationship between clinical care and research at Children's. [Discussed at the 00:22 mark]
  • The value of research using clinical data. [Discussed at the 02:30 mark]
  • The challenge of getting good data from health records. [Discussed at the 03:30 mark]
  • Tools for capturing, exporting, and displaying data. [Discussed at the 05:41 mark]
  • Making data useful to clinicians through a simple, modular web interface; tools used. [Discussed at the 12:07 mark]
  • Size of the database and user cohort. [Discussed at the 17:19 mark]
  • The ethical and technical issues of genome sequencing in medical treatment; benefits of sequencing. [Discussed at the 18:23 mark]
  • "Pick out the signal from the noise": integrating genetic information into the electronic health record and "actionable information". [Discussed at the 24:27 mark]

You can view the entire conversation in the following video:

OSCON 2012 Health Care Track — The conjunction of open source and open data with health technology promises to improve creaking infrastructure and give greater control and engagement to patients. Learn more at OSCON 2012, being held July 16-20 in Portland, Oregon.

Save 20% on registration with the code RADAR


February 23 2012

Everyone has a big data problem

Jonathan Gosier (@jongos), designer, developer, and startup co-founder, says the big data deluge presents problems for everyone, not just corporations and governments.

Gosier will be speaking at next week's Strata conference on "The Democratization of Data Platforms." In the following interview, he discusses the challenges and opportunities data democratization creates.

Your keynote is going to be about "everyone's" big data problems. If everyone really does have their own big data problem, how are we going to democratize big data tools and processes? It seems that our various problems would require many different solutions.

Jonathan Gosier: It's a problem for everyone because data problems can manifest in a multitude of ways: too much email, too many passwords to remember, a deluge of legal documents related to a mortgage, or simply knowing where to look online for the answers to simple questions.

You're absolutely correct in noting that each of these problems requires different solutions. However, many of these solutions tend not to be accessible to the average person, whether this is because of prices or a level of expertise required to use the tools available.

There is a lot of talk about a "digital divide," but there's a growing "data divide" as well. It's no longer about having basic computer literacy skills. Being able to understand what data is available, how it can be manipulated, and how it can be used to actually improve one's life is a skill that not everyone possesses.

There's an opportunity here for growth as well. If you look at the market, there are tools for visualizing personal finance (think or HelloWallet), personal health (23andMe), personal productivity (Basecamp), etc. But the overarching trend is that there is a growing need for products that simplify the wealth of information around people. The simplest way to do this is often through visuals.

Why are visualizations so important to a better understanding of data?

Jonathan Gosier: Visualizations are only "better" in that they can relate complex ideas to a general audience. Visualization is by no means a replacement for expertise and research. It simply represents a method for communicating across barriers of knowledge.

But beyond that, the problem with a lot of the data visuals on the web is that they are static, pre-constructed, and vague about their data sources. This means the general public either has to take what's presented at face value and agree or disagree, or they have to conduct their own research.

There's a need for "living infographics" — visualizations that are inviting and easy to understand, but are shared with the underlying data used to create them. This allows the casual consumer to simply admire the visual while the more discerning audience can actually analyze the underlying data to see if the message being presented is consistent with their findings.

It's far more transparent and credible to reveal, versus conceal, one's sources.

One of the pushbacks to data democratization efforts is that people might not know how to use these tools correctly and/or they might use them to further their own agendas. How do you respond to that?

Jonathan Gosier: The question illustrates the point, actually. It wasn't so long ago that the same could be said about the printing press. It was an innovation, but initially, it was so expensive that it was a technology that was only available to the elite and wealthy. Now it's common (at least in the Western world) for any given middle-class household to contain an inexpensive printing device. The web radicalized things even more, essentially turning anyone with access into a publisher.

So the question becomes, was it good or bad that publishing became something that anyone could do versus a select few? I'd argue that, ultimately, the pros have outweighed the cons by magnitudes.

Right now data can be thought of as an asset of the elite and privileged. Those with wealth pay a lot for it, and those who are highly skilled can charge a great deal for their services around it. But the reality is, there is a huge portion of the market that has a legitimate need for data solutions that aren't currently available to them.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


February 15 2012

Why data visualization matters

Let's say you need to understand thousands or even millions of rows of data, and you have a short time to do it in. The data may come from your team, in which case perhaps you're already familiar with what it's measuring and what the results are likely to be. Or it may come from another team, or maybe several teams at once, and be completely unfamiliar. Either way, the reason you're looking at it is that you have a decision to make, and you want to be informed by the data before making it. Something probably hangs in the balance: a customer, a product, or a profit.

How are you going to make sense of all that information efficiently so you can make a good decision? Data visualization is an important answer to that question.

However, not all visualizations are actually that helpful. You may be all too familiar with lifeless bar graphs, or line graphs made with software defaults and couched in a slideshow presentation or lengthy document. They can be at best confusing, and at worst misleading. But the good ones are an absolute revelation.

The best data visualizations are ones that expose something new about the underlying patterns and relationships contained within the data. Understanding those relationships — and being able to observe them — is key to good decision making. The Periodic Table is a classic testament to the potential of visualization to reveal hidden relationships in even small datasets. One look at the table, and chemists and middle school students alike grasp the way atoms arrange themselves in groups: alkali metals, noble gases, halogens.

If visualization done right can reveal so much in even a small dataset like this, imagine what it can reveal within terabytes or petabytes of information.

Types of visualization

It's important to point out that not all data visualization is created equal. Just as we have paints and pencils and chalk and film to help us capture the world in different ways, with different emphases and for different purposes, there are multiple ways in which to depict the same dataset.

Or, to put it another way, think of visualization as a new set of languages you can use to communicate. Just as French and Russian and Japanese are all ways of encoding ideas so that those ideas can be transported from one person's mind to another, and decoded again — and just as certain languages are more conducive to certain ideas — so the various kinds of data visualization are a kind of bidirectional encoding that lets ideas and information be transported from the database into your brain.

Explaining and exploring

An important distinction lies between visualization for exploring and visualization for explaining. A third category, visual art, comprises images that encode data but cannot easily be decoded back to the original meaning by a viewer. This kind of visualization can be beautiful, but it is not helpful in making decisions.

Visualization for exploring can be imprecise. It's useful when you're not exactly sure what the data has to tell you and you're trying to get a sense of the relationships and patterns contained within it for the first time. It may take a while to figure out how to approach or clean the data, and which dimensions to include. Therefore, visualization for exploring is best done in such a way that it can be iterated quickly and experimented upon, so that you can find the signal within the noise. Software and automation are your friends here.
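Since exploration rewards fast iteration, even a throwaway text histogram can surface a distribution's shape before any charting tool is opened. Here is a minimal Python sketch of that spirit; the response-time values and bin width are invented for illustration:

```python
from collections import Counter

def ascii_histogram(values, bin_width=10):
    """Bucket values into fixed-width bins and return {bin_start: bar} strings."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return {start: "#" * count for start, count in sorted(bins.items())}

# Hypothetical response times (ms) pulled from a log you are seeing for the first time.
response_times_ms = [12, 14, 18, 21, 22, 25, 47, 48, 51, 90]
for start, bar in ascii_histogram(response_times_ms).items():
    print(f"{start:3d}-{start + 9:3d} | {bar}")
```

A sketch like this takes seconds to tweak (change the bin width, filter a dimension, rerun), which is exactly the kind of quick iteration exploratory work calls for.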

Visualization for explaining is best when it is cleanest. Here, the ability to pare down the information to its simplest form — to strip away the noise entirely — will increase the efficiency with which a decision maker can understand it. This is the approach to take once you understand what the data is telling you, and you want to communicate that to someone else. This is the kind of visualization you should be finding in those presentations and sales reports.

Visualization for explaining also includes infographics and other categories of hand-drawn or custom-made images. Automated tools can be used, but one size does not fit all.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms.

Your customers make decisions, too

While data visualization is a powerful tool for helping you and others within your organization make better decisions, it's important to remember that, in the meantime, your customers are trying to decide between you and your competitors. Many kinds of data visualization, from complex interactive or animated graphs to brightly-colored infographics, can help your customers explore and your customer service folks explain.

That's why all kinds of companies and organizations, from GE to Trulia to NASA, are beginning to invest significant resources in providing interactive visualizations to their customers and the public. This allows viewers to better understand the company's business, and interact in a self-directed manner with the company's expertise.

As big data becomes bigger, and more companies deal with complex datasets with dozens of variables, data visualization will become even more important. So far, the tide of popularity has risen more quickly than the tide of visual literacy, and mediocre efforts abound in presentations and on the web.

But as visual literacy rises, thanks in no small part to impressive efforts in major media such as The New York Times and The Guardian, data visualization will increasingly become a language your customers and collaborators expect you to speak — and speak well.

Do yourself a favor and hire a designer

It's well worth investing in a talented in-house designer, or a team of designers. Visualization for explaining works best when someone who understands not only the data itself, but also the principles of design and visual communication, tailors the graph or chart to the message.

Whether it's text or visuals, important translations require more than basic tools.

To go back to the language analogy: Google Translate is a powerful and useful tool for giving you the general idea of what a foreign text says. But it's not perfect, and it often lacks nuance. For getting the overall gist of things, it's great. But I wouldn't use it to send a letter to a foreign ambassador. For something so sensitive, and where precision counts, it's worth hiring an experienced human translator.

Since data visualization is like a foreign language, in the same way, hire an experienced designer for important jobs where precision matters. If you're making the kinds of decisions in which your customer, product, or profit hangs in the balance, you can't afford to base those decisions on incomplete or misleading representations of the knowledge your company holds.

Your designer is your translator, and one of the most important links you and your customers have to your data.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


January 31 2012

Embracing the chaos of data

A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he'll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.

What do you mean asking data scientists to embrace the chaos of data?

Pete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.

Are we wasting time trying to make unstructured data structured?

Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can't get it. Most structured data is the result of years of effort, so it is only available with a lot of strings, either financial or through usage restrictions.

The first advantage of unstructured data is that it's widely available because the producers don't see much value in it. The second advantage is that because there's no "structuring" work required, there's usually a lot more of it, so you get much broader coverage.

A good comparison is Yahoo's highly-structured web directory versus Google's search index built on unstructured HTML soup. If you were looking for something that was covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google's broad coverage made it more useful. For example, I hear that 30% of search queries are "once in history" events — unique combinations of terms that never occur again.

Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it's harder to get started, but the potential rewards are much greater.

How do you see data tools developing over the next few years? Will they become more accessible to more people?

Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.
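As a toy illustration of the fuzzy-parsing problem Warden describes (not any particular open-source geocoder), a sketch might normalize abbreviations and punctuation before looking an address up; the abbreviation table and the coordinates below are made-up assumptions:

```python
# Common street-address abbreviations; a real geocoder's table is far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "n": "north"}

def normalize(address):
    """Lowercase, strip punctuation, and expand common abbreviations."""
    words = address.lower().replace(",", "").replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# Toy gazetteer keyed by normalized address (coordinates are invented).
GAZETTEER = {"123 north main street": (45.52, -122.68)}

def geocode(address):
    """Return (lat, lon) for an address, or None if it is not in the gazetteer."""
    return GAZETTEER.get(normalize(address))
```

The point of the sketch is that "123 N. Main St." and "123 North Main Street" collapse to the same key, so both variants resolve to one structured record.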

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Associated photo on home and category pages: "mess with graphviz" by Toms Bauģis, on Flickr.


September 15 2011

Global Adaptation Index enables better data-driven decisions

The launch of the Global Adaptation Index (GaIn) puts a powerful open data browser into the hands of anyone with a connected mobile device. The index rates a given country's vulnerability to environmental shifts precipitated by climate change, its readiness to adapt to such changes, and its ability to utilize investment capital that would address the state of those vulnerabilities.

Global Adaptation Index

The Global Adaptation Index combines development indicators from 161 countries into a map that provides quick access to thousands of open data records. All of the data visualizations at are powered by indicators that are openly available and downloadable under a Creative Commons license.
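One common way to combine many indicators into a single country score is to min-max normalize each indicator and then average them. The sketch below shows that general approach only; the indicator names, values, and equal weighting are invented assumptions, not GaIn's actual methodology:

```python
def min_max_normalize(values):
    """Rescale a {key: value} mapping onto the [0, 1] range."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # guard against a constant indicator
    return {k: (v - lo) / span for k, v in values.items()}

def composite_index(indicators):
    """indicators: {indicator_name: {country: raw_value}} -> {country: score}."""
    normalized = [min_max_normalize(vals) for vals in indicators.values()]
    countries = normalized[0].keys()
    return {c: sum(n[c] for n in normalized) / len(normalized) for c in countries}

# Hypothetical development indicators for three unnamed countries.
indicators = {
    "water_access_pct": {"A": 40.0, "B": 90.0, "C": 65.0},
    "road_density": {"A": 0.2, "B": 0.8, "C": 0.5},
}
scores = composite_index(indicators)
```

Because every indicator is rescaled before averaging, one large-magnitude indicator cannot dominate the composite score.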

"All of the technology that we're using is a way to bring this information close to society," said Bruno Sanchez-Andrade Nuño, the director of science and technology at the Global Adaptation Institute (GAI), the organization that launched the index.

Open data, open methodology

The project was helped by the World Bank's move to open data, including the release of its full development database. "All data is from sources that are already open," said Ian Noble, chief scientist at GAI. "We would not use any data that had restrictions. We can point people through to the data source and encourage them to download the data."

Being open in this manner is "the most effective way of testing and improving the index," said Noble. "We have to be certain that data is from a quality, authoritative source and be able to give you an immediate source for it, like the FAO, WHO or disaster database."

"It's not only the data that's open, but also our methodology," said Nuño. " is a really good base, with something like 70% of our data going through that portal. With some of the rest of the data, we see lots of gaps. We're trying to make all values consistent."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

Node.js powers the data browser

"This initiative is a big deal in the open data space as it shows a maturing from doing open data hacking competitions to powering a portal that will help channel billions of investment dollars over the next several years," said Development Seed associate Bonnie Bugle in a prepared statement. Development Seed built the site with open source tools, including Node.js and CouchDB.

The choice of Node is a useful indicator, in terms of where the cutting edge of open source technology is moving. "The most important breakthrough is moving beyond PHP and Drupal — our initial thought — to Node.js," said Nuño. "Drupal and PHP are robust and well known, but this seems like the next big thing. We really wanted to push the limits of what's possible. Node.js is faster and allows for more connections. If you navigate countries using the data browser, you're just two clicks away from the source data. It doesn't feel like a web page. It feels native."

Speed of access and interoperability were important considerations, said Nuño. "It works on an iOS device or on a slow connection, like GPRS." Noble said he had even accessed it from rural Australia using an iPad.

Highlights from the GAI press conference are available in the following video:

Global Adaptation Index Press Conference: Data Browser Launched from Development Seed on Vimeo.


August 25 2011

The Daily Dot wants to tell the web's story with social data journalism

If the Internet is the public square of the 21st century, the Daily Dot wants to be its town crier. The newly launched online media startup is trying an experiment in community journalism, where the community is the web. It's an interesting vision, and one that looks to capitalize on the amount of time people are spending online.

The Daily Dot wants to tell stories through a mix of data journalism and old-fashioned reporting, where its journalists pick up the phone and chase down the who, what, when, where, how and why of a video, image or story that's burning up the social web. The site's beat writers, who are members of the communities they cover, watch what's happening on Twitter, Facebook, Reddit, YouTube, Tumblr and Etsy, and then cover the issues and people that matter to them.

Daily Dot screenshot

Even if the newspaper metaphor has some flaws, this focus on original reporting could help distinguish the Daily Dot in a media landscape where attention and quality are both fleeting. In the hurly burly of the tech and new media blogosphere, picking up the phone to chase down a story is too often neglected.

There's something significant about that approach. Former VentureBeat editor Owen Thomas (@OwenThomas), the founding editor of the Daily Dot, has emphasized this angle in interviews with AdWeek and Forbes. Instead of mocking what people do online, as many mainstream media outlets have been doing for decades, the Daily Dot will tell their stories in the same way that a local newspaper might cover a country fair or concert. While Thomas was a well-known master of snark and satire during his tenure at Valleywag, in this context he's changed his style.

Where's the social data?

Whether or not this approach gains traction within the communities the Daily Dot covers remains to be seen. The Daily Dot was co-founded by Nova Spivack, former newspaper executive Nicholas White, and PR consultant Josh Jones-Dilworth, with a reported investment of some $600,000 from friends and family. White has written that he gave up the newspaper to save newspapering. Simply put, the Daily Dot is experimenting with covering the Internet in a way that most newspapers have failed to do.

"I trust that if we keep following people into the places where they gather to trade gossip, argue the issues, seek inspiration, and share lives, then we will also find communities in need of quality journalism," wrote White. "We will be carrying the tradition of local community-based journalism into the digital world, a professional coverage, practice and ethics coupled with the kind of local interaction and engagement required of a relevant and meaningful news source. Yet local to us means the digital communities that are today every bit as vibrant as those geographically defined localities."

To do that, they'll be tapping into an area that Spivack, a long-time technology entrepreneur, has been investing and writing about for years: data. Specifically, applying data journalism to mining and analyzing the social data from two of the web's most vibrant platforms: Tumblr and Reddit.

White himself is unequivocal about the necessity of data journalism in the new digital landscape, whether at the Daily Dot or beyond:

The Daily Dot may be going in this direction now because of our unique coverage area, but if this industry is to flourish in the 21st century, programming journalists should not remain unique. Data, just like the views of experts, men on the street, polls and participants, is a perspective on the world. And in the age of ATMs, automatic doors and customer loyalty cards, it's become just as ubiquitous. But the media isn't so good with data, with actual mathematics. Our stock-in-trade is the anecdote. Despite a complete lack of solid evidence, we've been telling people their cell phones will give them cancer. Our society ping-pongs between eating and not eating carbs, drinking too much coffee and not enough water, getting more Omega-3s — all on the basis of epidemiological research that is far, far, far from definitive. Most reporters do not know how to evaluate research studies, and so they report the authors' conclusions without any critical evaluation — and studies need critical evaluation.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

Marshall Kirkpatrick, a proponent and practitioner of data journalism, dug deep into how data journalism happens at the Daily Dot. While he's similarly unsure of whether the publication will be interesting to a large enough audience to sustain an advertising venture, the way that the Daily Dot is going about hunting down digital stories is notable. Kirkpatrick shared the details over at ReadWriteWeb:

In order to capture and analyze that data from sites like Twitter, YouTube, Reddit, Etsy and more (the team says it's indexing a new community about every six weeks), the Dot has partnered with the mathematicians at Ravel Data. Ravel uses 80Legs for unblockable crawling, then Hadoop, its own open source framework called GoldenOrb and then an Eigenvector centrality algorithm (similar to Pagerank) to index, analyze, rank and discover connections between millions of users across these social networks.
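To make the last step of that pipeline concrete, here is a minimal power-iteration sketch of eigenvector centrality on a toy follower graph. The graph, the (A + I) shift used to stabilize the iteration, and the iteration count are illustrative assumptions, not Ravel Data's implementation:

```python
def eigenvector_centrality(edges, nodes, iterations=50):
    """edges: (follower, followed) pairs; score flows toward whoever is followed."""
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Multiply by (A + I): each node keeps its own score and adds its followers'.
        new = dict(scores)
        for follower, followed in edges:
            new[followed] += scores[follower]
        total = sum(new.values())
        scores = {n: v / total for n, v in new.items()}
    return scores

# Hypothetical users: alice is followed by bob and carol; bob is followed by alice.
nodes = ["alice", "bob", "carol"]
edges = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
scores = eigenvector_centrality(edges, nodes)
```

Unlike a raw follower count, the score is recursive: being followed by well-connected users counts for more, which is the same intuition behind PageRank.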

There are a couple of aspects of data journalism to consider here. One is supplementing the traditional "nose for news" that Daily Dot writers apply to finding stories. "The data really begins to serve as our editorial prosthetics of sorts, telling us where to look, with whom to speak, and giving us the basic groundwork of the communities that we can continue to prod in interesting ways and ask questions of," explained Doug Freeman, an associate at Daily Dot investor Josh Jones-Dilworth's PR firm, in an interview. In other words, the editors of the Daily Dot analyze social data to identify the community's best sources for stories and share them on a "Leaderboard" that — in beta — shows a ranked list of members of Tumblr and Reddit.

Another open question is how social data could help with the startup's revenue down the road. "Our data business is a way of creating and funding new value in this regard; we instigated structured crawls of all of the communities we will cover and will continue to do so as we expand into new places," said Freeman. "We started with Reddit (for data and editorial both) because it is small and has a lot of complex properties — a good test balloon. We've now completed data work with Tumblr and YouTube and are continuing." For each community, data provides a view of members, behaviors, and influence dynamics.

That data also relates to how the Daily Dot approaches marketing, branding and advertising. "It's essentially a to-do list of people we need to get reading the Dot, and a list of their behaviors," said Freeman. "From a brand [point of view], it's market and audience intelligence that we can leverage, with services alongside it. From an advertiser [point of view], this data gives resolution and insight that few other outlets can provide. It will get even more exciting over time as we start to tie Leaderboard data to user accounts and instigate CPA-based campaigns with bonuses and bounties for highly influential clicks."

Taken as a whole, what the Daily Dot is doing with social data and digital journalism feels new, or at least like a new evolution. We've seen Facebook and Twitter integration into major media sites, but not Reddit and Tumblr. It could be that the communities of these sites acting as "curation layers" for the web will produce excellent results in terms of popular content, though relevance could still be at issue. Whether this venture in data journalism is successful or not will depend upon it retaining the interest and loyalty of the communities it covers. What is clear, for now, is that the experiment will be fun to watch — cute LOL cats and all.


August 11 2011

Strata Week: Twitter's coming Storm, data and maps from the London riots

Here are a few of the data stories that caught my attention this week:

Twitter's coming Storm


In a blog post late last week, Twitter announced that it plans to open source Storm, its Hadoop-like data processing tool. Storm was developed by BackType, the social media analytics company that Twitter acquired last month. Several of BackType's other technologies, including ElephantDB, have already been open sourced, and Storm will join them this fall, according to Nathan Marz, formerly of BackType and now of Twitter.

Marz's post digs into how Storm works as well as how it can be applied. He notes that a Storm cluster is only "superficially similar" to a Hadoop cluster. Instead of running MapReduce "jobs," Storm runs "topologies." One of the key differences is that a MapReduce job eventually finishes, whereas a topology processes messages "forever (or until you kill it)." This makes Storm useful, among other things, for processing real-time streams of data, continuous computation, and distributed RPC.
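The job-versus-topology distinction can be sketched in plain Python (an illustration of the idea only, not Storm's actual topology API): a batch job consumes its whole input and terminates, while a stream processor updates state message by message and would run until killed.

```python
from collections import Counter

def batch_word_count(documents):
    """MapReduce-style: process everything, emit one result, terminate."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts

def streaming_word_count(message_stream):
    """Storm-style: yield an updated count after every message, indefinitely."""
    counts = Counter()
    for message in message_stream:  # in production this stream never ends
        counts.update(message.split())
        yield dict(counts)

# The stream is finite here only so the demo terminates.
stream = iter(["storm storm", "hadoop"])
states = list(streaming_word_count(stream))
```

The generator keeps running as long as messages arrive, which is why a topology suits real-time streams and continuous computation where a finishing job cannot.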

Touting the technology's ease of use, Marz lists the following complexities handled "under the hood": guaranteed message processing, robust process management, fault detection and automatic reassignment, efficient message passing, and local mode and distributed mode. More details -- and more documentation -- will follow on September 19, when Storm is officially open sourced.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Mapping the London riots

Using real-time social streams and mapping tools in a crisis situation is hardly new. We've seen citizens, developers, journalists, and governments alike undertake these efforts following a multitude of natural disasters. But the violence that erupted in London over the weekend has proven yet again that these data tools are important both for safety and for analysis and understanding. Indeed, as journalist Kevin Anderson argued, "data journalists and social scientists should join forces" to understand the causes and motivations for the riots, rather than the more traditional "hours of speculation on television and acres of newsprint positing theories."

NPR's Matt Stiles was just one of the data journalists who picked up the mantle. Using data from The Guardian, he created a map that highlighted riot locations, overlaid on a colored representation of "indices of deprivation." This makes for a pretty compelling visualization, demonstrating that the areas with the most incidents of violence are also the least well-off areas of London.


In a reflective piece in PaidContent, James Cridland examined his experiences trying to use social media to map the riots. He created a Google Map where he was marking "verified incident areas." As he describes it, however, that verifiability became quite challenging. His "lessons learned" included realizations about what constitutes a reliable source.

"Twitter is not a reliable source: I lost count of the amount of times I was told that riots were occurring in Derby or Manchester. They weren’t, yet on Twitter they were being reported as fact, despite the Derbyshire Constabulary and Greater Manchester Police issuing denials on Twitter. I realised that, in order for this map to be useful, every entry needed to be verified, and verifiable for others too. For every report, I searched Google News, Twitter, and major news sites to try and establish some sort of verification. My criteria was that something had to be reported by an established news organisation (BBC, Sky, local newspapers) or by multiple people on Twitter in different ways."

Cridland points out that the traditional news media wasn't reliable either, as the BBC for example reported disturbances that never occurred or misreported their location.

"Many people don't know what a reliable source is," he concludes. "I discovered it was surprisingly easy to check the veracity of claims being made on Twitter by using the Internet to check and cross-reference, rather than blindly retweet."

When data disappears

Following the riots in the U.K., there is now a trove of data -- from Blackberry Messenger, from Twitter, from CCTV -- that the authorities can utilize to investigate "what happened." There are also probably plenty of people who wish that data would just disappear.

What happens when that actually happens? How can we ensure that important digital information is preserved? Those were the questions asked in an op-ed in Sunday's New York Times. Kari Kraus, an assistant professor in the College of Information Studies and the English department at the University of Maryland, makes a strong case for why "digitization" isn't really the end of the road when it comes to preservation.

"For all its many promises, digital storage is perishable, perhaps even more so than paper. Disks corrode, bits 'rot' and hardware becomes obsolete.

"But that doesn't mean digital preservation is pointless: if we're going to save even a fraction of the trillions of bits of data churned out every year, we can't think of digital preservation in the same way we do paper preservation. We have to stop thinking about how to save data only after it's no longer needed, as when an author donates her papers to an archive. Instead, we must look for ways to continuously maintain and improve it. In other words, we must stop preserving digital material and start curating it."


She points to the efforts made to curate and preserve video games, something that highlights the struggles of not just saving the content -- the games -- but the technology -- NES cartridges, for example, as well as the gaming systems themselves. "It might seem silly to look to video-game fans for lessons on how to save our informational heritage, but in fact complex interactive games represent the outer limit of what we can do with digital preservation." By figuring out the complexities around preserving this sort of material -- a game, a console, for example -- we can get a better sense of how to develop systems to preserve other things, whether it's our Twitter archives, digital maps of London, or genetic data.

Got data news?

Send me an email.


July 21 2011

Strata Week: When does data access become data theft?

Here are a few of the data stories that caught my eye this week.

Aaron Swartz and the politics of violating a database's TOS

Aaron Swartz, best known as an early Reddit-er and the founder of the progressive non-profit Demand Progress, was charged on Tuesday with multiple felony counts for the illegal download of some 4 million academic journal articles from JSTOR via the MIT network.

The indictment against Swartz (a full copy is here) details the steps he took to procure a laptop and register it on the MIT network, all in the name of securing access to JSTOR. JSTOR is an online database of academic journals that provides full-text search and access to library patrons at subscribing universities and other institutions.

Swartz accessed the JSTOR database via MIT and proceeded to devise a mechanism to download a massive number of documents. It isn't clear what his intentions were for these — Swartz has been involved previously with open data efforts. Was he planning to liberate the JSTOR database? Or, as others have suggested, was he in the middle of an academic project that required a massive dataset?

The government has made it clear this is "stealing." JSTOR, the library, and the university are less willing to comment or condemn.

Kevin Webb asks an important question in a post reprinted by Reuters. What's the difference between what Swartz did and what Google does?

What's missing from the news articles about Swartz's arrest is a realization that the methods of collection and analysis he's used are exactly what makes companies like Google valuable to its shareholders and its users. The difference is that Google can throw the weight of its name behind its scrapers ...

Although Swartz did allegedly download data from JSTOR in such quantities that it violates a Terms of Service agreement, many questions remain: Why does this constitute stealing? How much data does one need to take to be at risk of accusations of theft and fraud? For data scientists, not just for activists, these are very real questions.

Update: GigaOm's Janko Roettgers reports that a torrent with 18,592 scientific publications — all of them apparently from JSTOR — was uploaded to The Pirate Bay.


Microsoft releases its big data toolkit for scientists

Although we're all creating massive amounts of data, for scholars and scientists that data creation and analysis can quickly run afoul of the limitations of university computing centers. To that end, Microsoft Research this week unveiled Daytona, a tool designed to help scientists with big data computation.

Created by the eXtreme Computing Group, the tool lets scholars and scientists use Microsoft's Azure platform to work with large datasets. According to Roger Barga, an architect in the eXtreme Computing Group:

Daytona has a very simple, easy-to-use programming interface for developers to write machine-learning and data-analytics algorithms. They don't have to know too much about distributed computing or how they're going to spread the computation out, and they don't need to know the specifics of Windows Azure.

Daytona is meant to be an alternative to Hadoop or MapReduce (although it does utilize the latter), but with an emphasis on ease-of-use. Daytona comes with code samples and programming guides to get people up and running.

The eXtreme Computing Group has also built Excel Datascope, which as the name suggests is a tool that offers data analytics from Excel.

While making it easier for academics to perform big data analysis is an honorable goal, I can't help but ask (as a recovering academic myself): when will the academy realize that the skills needed to work with these datasets warrant formal attention? Scholars need to be trained to manage this information. That way, it isn't just a matter of making analysis "easier," but of making these tools better.

The state of open data in Canada

Code for America program director David Eaves has taken a look at the state of open data licenses in Canada in order to assess what works, what doesn't work, and where to go from here.

Eaves examines how the Canadian government (provincial and otherwise) has made strides toward opening up data to its citizens, developers, and others. But as Eaves makes clear in his post, it isn't as simple as just "opening" data as a gesture, but rather making sure data is readily accessible and usable.

"Licenses matter because they determine how you are able to use government data — a public asset," he writes. "As I outlined in the three laws of open data, data is only open if it can be found, be played with and be shared." Eaves contends that licensing is particularly important, as this can limit what sorts of restrictions are put on the sharing of data and, in turn, on the sorts of apps one can build using it.

What do we want then? Eaves lists these attributes:

  • Open: there should be maximum freedom for reuse
  • Secure: it offers governments appropriate protections for privacy and security
  • Simple: it keeps legal costs down and is easier for everyone to understand
  • Standardized: so my work is accessible across jurisdictions
  • Stable: so I know that the government won't change the rules on me

When it comes to the "where do we go from here" aspect, Eaves isn't optimistic. He notes that while some municipalities may have opened their datasets, the federal government — in Canada and elsewhere — seems unprepared to fully engage with the developer and open data communities.

Got data news?

Feel free to email me.


July 08 2011

Top stories: July 4-8, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Seven reasons you should use Java again
To mark the launch of Java 7, here are seven reasons why Java is worth your time and worth another look.
What is Node.js?
Learning Node might take a little effort, but it's going to pay off. Why? Because you're afforded solutions to your web application problems that require only JavaScript to solve.
3 Android predictions: In your home, in your clothes, in your car
"Learning Android" author Marko Gargenta believes Android will soon be a fixture in our homes, in our clothes and in our vehicles. Here he explains why and how this will happen.
Into the wild and back again
Burnt out from years of school and tech work, Ryo Chijiiwa quit his job and moved off the grid. In this interview, Chijiiwa talks about how solitude and time in the wilderness has changed his perspective on work and life.
Data journalism, data tools, and the newsroom stack
The MIT Civic Media conference and 2011 Knight News Challenge winners made it clear that data journalism and data tools will play key roles in the future of media and open government.

OSCON Java 2011, being held July 25-27 in Portland, Ore., is focused on open source technologies that make up the Java ecosystem. Save 20% on registration with the code OS11RAD

June 27 2011

Get started with Hadoop: From evaluation to your first production cluster

Hadoop is growing up. Apache Software Foundation (ASF) Hadoop and its related projects and sub-projects are maturing as an integrated, loosely coupled stack to store, process and analyze huge volumes of varied semi-structured, unstructured and raw data.

Hadoop has come a long way in a relatively short time. Google papers on the Google File System (GFS) and MapReduce inspired work on co-locating data storage and computational processing in individual nodes spread across a cluster. Then, just over five years ago, in early 2006, Doug Cutting joined Yahoo and set up a 300-node research cluster there, adapting the distributed computing platform that was formerly a part of the Apache Nutch search engine project. What began as a technique to index and catalog web content has extended to a variety of analytic and data science applications, from ecommerce customer segmentation and A/B testing to fraud detection, machine learning and medical research. Now, the largest production clusters are 4,000 nodes with about 15 petabytes of storage in each cluster.

In just the nine months since I wrote an introduction to this emerging stack, it has become easier to install, configure and write programs to use Hadoop. Not surprisingly with an emerging technology, there is still work to do. As Tom White notes in his Hadoop: The Definitive Guide, Second Edition:

To gain even wider adoption, we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs.

This piece provides tips, cautions and best practices for an organization that would like to evaluate Hadoop and deploy an initial cluster. It focuses on the Hadoop Distributed File System (HDFS) and MapReduce. If you are looking for details on Hive, Pig or related projects and tools, you will be disappointed in this specific article, but I do provide links for where you can find more information. You can also refer to the live or archived presentations at the Yahoo Developer Network Hadoop Summit 2011 on June 29, 2011 in Santa Clara, Calif., and Hadoop World 2011, sponsored by Cloudera, in New York City on November 8-9, 2011.

Start with a free evaluation in stand-alone or pseudo-distributed mode

HadoopIf you have not done so already, you can begin evaluating Hadoop by downloading and installing one of the free Hadoop distributions. The Apache Hadoop website offers a Quick Start guide.

You can start an initial evaluation by running Hadoop in either local stand-alone or pseudo-distributed mode on a single machine. You can pick the flavor of Linux you prefer, or use Solaris. In stand-alone mode, no daemons run; everything runs in a single Java virtual machine (JVM) with storage using your machine's standard file system. In pseudo-distributed mode, each daemon runs in its own JVM, but they all still run on a single machine, with storage using HDFS by default. For example, I'm running a Hadoop virtual machine in pseudo-distributed mode on my Intel-processor MacBook Pro, using VMware Fusion, Ubuntu Linux, and Cloudera's Distribution including Apache Hadoop (CDH) version CDH3.

If it's not already pre-set for you, remember to change the HDFS replication value to one, versus the default factor of three, so you don't see continual error messages due to HDFS' inability to replicate blocks to alternate data nodes. Configuration files reside in the directory named "conf" and are written in XML. You'll find the replication parameter at dfs.replication.
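As a sketch, and assuming a stock conf directory layout, the change amounts to one property in conf/hdfs-site.xml:

```xml
<!-- conf/hdfs-site.xml: single-machine evaluation only.
     Replicating each block once avoids the under-replication
     warnings a pseudo-distributed setup would otherwise log. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Restart the HDFS daemons after editing so the new value takes effect for newly created files.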

Even with a basic evaluation in pseudo-distributed mode, you can start to use the web interfaces that come with the Hadoop daemons such as those that run on ports 50030 and 50070. With these web interfaces, you can view the NameNode and JobTracker status. The example screen shot below shows the NameNode web interface. For more advanced reporting, Hadoop includes built-in connections to Ganglia, and you can use Nagios to schedule alerts.

Name Node Status
Hadoop NameNode web interface profile of the Hadoop distributed file system, nodes and capacity for a test cluster running in pseudo-distributed mode.

Pick a distribution

As you progress to testing a multi-node cluster using a hosted offering or on-premise hardware, you'll want to pick a Hadoop distribution. Apache Hadoop has Common, HDFS, and MapReduce. Hadoop Common is a set of utilities that support the Hadoop sub-projects. These include FileSystem, remote procedure call (RPC), and serialization libraries. Additional Apache projects and sub-projects are available separately.

One benefit of picking a commercial distribution, in addition to the availability of commercial support services, is that the vendor tests version compatibility among all of the various moving parts within the related Apache projects and sub-projects. It's similar to the choice of commercially supported Red Hat Linux or Canonical Ubuntu Linux, but arguably of even greater importance given Hadoop's relative youth and the large number of loosely coupled projects and sub-projects.

To date, Cloudera's Distribution including Apache Hadoop is the most complete distribution. It includes Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache Zookeeper, Apache Whirr (a library for running Hadoop in a cloud), Flume, Oozie, and Sqoop. CDH3 supports Amazon EC2, Rackspace and Softlayer clouds. With Cloudera Enterprise, Cloudera adds the Cloudera Management Suite and Production Support in addition to the components in CDH3.

On June 14, 2011, Tom White and Patrick Hunt at Cloudera proposed a new project for the Apache Incubator named Bigtop. To quote from their Bigtop proposal: "Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem. The goal is to do testing at various levels (packaging, platform, runtime, upgrade etc...) developed by a community with a focus on the system as a whole, rather than individual projects." Currently, project code resides in Cloudera's Github account, and is based on work Cloudera has put into packaging multiple Apache projects together in CDH. Following approval of the incubator proposal, the code would eventually find its way into the ASF.


The IBM Distribution of Apache Hadoop (IDAH) contains Apache Hadoop, a 32-bit Linux version of the IBM SDK for Java 6 SR 8, and an installer and configuration tool for Hadoop. IBM has indicated that it sees Hadoop as a cornerstone of its big data strategy, in which IBM is building software packages that run on top of Hadoop. IBM InfoSphere BigInsights supports unstructured text analytics and indexing, along with features for data governance, security, developer tools, and enterprise integration. IBM offers a free downloadable BigInsights Basic Edition. IBM clients can extend BigInsights to analyze streaming data from IBM InfoSphere Streams. In the Jeopardy game show competition, IBM Watson used Hadoop to distribute the workload for processing information, including support for understanding natural language.

Amazon Elastic MapReduce (EMR) builds proprietary versions of Apache Hadoop, Hive, and Pig optimized for running on Amazon Web Services. Amazon EMR provides a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (EC2) or Simple Storage Service (S3).

For programmers and infrastructure architects, whether or not you use Amazon EC2 or S3 for production applications, there are benefits to understanding the Amazon Web Services APIs. One of my takeaways from attending the 2011 Open Source Business Conference (OSBC) in San Francisco is that the Amazon Web Services EC2 API so far remains a de facto standard for cloud interoperability, despite the best efforts of OpenStack promoters Rackspace and NASA and other organizations such as the IEEE P2301 cloud portability working group.

According to EMC, the EMC Greenplum HD Community Edition will include HDFS, MapReduce, Zookeeper, Hive and HBase. EMC also announced an OEM agreement with MapR.

MapR recently emerged from stealth mode. MapR replaces HDFS with proprietary software. MapR has a beta program but no generally available (GA) software yet, and no announced customers other than EMC. According to MapR, its distribution includes HBase, Pig, Hive, Cascading, the Apache Mahout machine learning library, Nagios integration and Ganglia integration. In addition to replacing HDFS, MapR has worked to speed MapReduce operations, and has added high availability (HA) options for the NameNode and JobTracker. MapR supports Network File System (NFS) and NFS log collection.

HStreaming’s distribution is a proprietary version of Hadoop with support for stream processing and real-time analytics as well as standard MapReduce batch processing.

Consider Hadoop training

Education courses can be helpful to get started with Hadoop and train new staff. I had the opportunity to take a two-day Hadoop system administrator class taught by senior instructor Glynn Durham at Cloudera, and would recommend that course for system administrators and enterprise IT architects. Chris Wensel, founder and CTO of Concurrent, developed the initial course materials.

To find other organizations offering Hadoop training and other support services, visit the Apache Hadoop wiki support page. For example, Bixo Labs and Scale Unlimited offer a 2-day Hadoop Boot Camp, designed to enable Java programmers to learn Hadoop development. I do not have personal experience with that course, but the instructor, Ken Krugler, has a good background. And Datameer CEO and co-founder Stefan Groschupf previously held the same roles with Scale Unlimited, with Chris Wensel as a co-founder.

In addition to the companies shown on the Hadoop wiki support page, the Wall Street Journal reported in an April 2011 article that Yahoo is considering spinning out its Hadoop engineering unit into a startup to offer commercial Hadoop support and tools. Yahoo has already sold an exclusive license for its columnar storage technology to nPario. For an update on what Yahoo envisions as "The Next Generation of Apache Hadoop and MapReduce," see Arun Murthy's blog post on the Yahoo Developer Network.

For the Cloudera-taught Hadoop system administrator course, prior knowledge of Hadoop is not required, but it is important to have at least a basic understanding of writing Linux commands. Some understanding of Java is also beneficial, as each Hadoop daemon runs in a Java process. Following completion of the course, you can take a one-hour test to become a Cloudera Certified Hadoop Administrator. The system administrator class and test cover Hadoop cluster operations, planning and management; job scheduling; and monitoring and logging.

For those who are already proficient with Hadoop and would like to become certified through Cloudera, on June 28, the day before the Yahoo-sponsored Hadoop Summit 2011, Cloudera is offering the Certified Developer and Administrator for Apache Hadoop exams. The exams are open book. The developer exam takes 90 minutes, while the administrator exam requires 60 minutes. For more information, visit the Cloudera Developer Center blog.

Plan your Hadoop architecture

For an on-premise cluster, once you start to have more than several dozen nodes, you will probably want to invest in three separate enterprise-grade servers, one for each of the following central daemons:

  • NameNode
  • SecondaryNameNode (Checkpoint Node)
  • JobTracker

These three enterprise-grade servers should be heavy on RAM but do not require much of their own disk storage capacity — that's the job of the DataNodes. For the NameNode memory, consider a baseline of enough RAM to represent 250 bytes per file plus 250 bytes per block. Note that for the blocks, most organizations will want to change the default block size from 64 MB to 128 MB. You can change that at dfs.block.size.
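Those rules of thumb translate into a quick back-of-the-envelope calculation. The sketch below is illustrative only; the 250-byte figures come from the baseline above, and the example workload is an assumption, not a benchmark:

```python
# Rough NameNode heap sizing, using the rule of thumb from the text:
# budget ~250 bytes of metadata per file plus ~250 bytes per block.

BYTES_PER_FILE = 250
BYTES_PER_BLOCK = 250
BLOCK_SIZE = 128 * 1024 ** 2  # 128 MB, the commonly raised default


def namenode_heap_bytes(num_files, total_data_bytes):
    """Estimate the RAM the NameNode needs to hold file system metadata."""
    # Every file occupies at least one block, even if smaller than BLOCK_SIZE.
    num_blocks = max(num_files, total_data_bytes // BLOCK_SIZE)
    return num_files * BYTES_PER_FILE + num_blocks * BYTES_PER_BLOCK


# Hypothetical workload: 50 million files totaling 5 PB of data
heap = namenode_heap_bytes(50_000_000, 5 * 1024 ** 5)
print(f"{heap / 1024 ** 3:.1f} GiB")
```

For a workload of that hypothetical size, the estimate lands in the low tens of gigabytes, which squares with the 32 GB guideline for large clusters.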

For large clusters, 32 GB memory for the NameNode should be plenty. Much more than 50 GB of memory may be counter-productive, as the Java virtual machine running the NameNode may spend inordinately long, disruptive periods on garbage collection.

Garbage collection refers to the process in which the JVM reclaims unused RAM. It can pop up at any time, for almost any length of time, with few options for system administrators to control it. As noted by Eric Bruno in a Dr. Dobb's Report, the Real-Time Specification for Java (RTSJ) may solve the garbage collection problem, along with providing other benefits to enable JVMs to better support real-time applications. While the Oracle Java Real-Time System, IBM WebSphere Real-Time VM, and the Timesys RTSJ reference implementation support RTSJ, as a standard it remains in the pilot stage, particularly for Linux implementations.

If the NameNode is lost and there is not a backup, HDFS is lost. HDFS is based on the Google File System (GFS), in which there is no data caching. From the NameNode, you should plan to store at least one or two synchronous copies as well as a Network File System (NFS) mount disk-based image. At Yahoo and Facebook, they use a couple of NetApp filers to hold the actual data that the NameNode writes. In this architecture, the two NetApp filers are run in HA mode with non-volatile RAM (NVRAM) copying.
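One common way to get those synchronous copies is to list several directories for the NameNode's metadata; the NameNode writes its image and edit log to every directory named. A sketch, with hypothetical paths (dfs.name.dir is the parameter name in Hadoop releases of this vintage):

```xml
<!-- Illustrative hdfs-site.xml fragment; paths are hypothetical.
     The first directory is a local disk, the second an NFS mount,
     so a failed NameNode disk doesn't take the metadata with it. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/filer/dfs/nn</value>
</property>
```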

"Hadooplers" (Hadoop Open Storage Solutions) are the first of several new storage appliances that NetApp plans to ship for Hadoop. They are based on NetApp's new E-Series array family. They are designed to offload Hadoop storage I/O processing and disk failure recovery from the DataNodes, and provide HA for the NameNode. The head of NetApp's strategic planning team, Val Bercovici, has a blog post with more details.

The poorly named SecondaryNameNode (which is being renamed as the Checkpoint Node) is not a hot fail-over, but instead a server that keeps the NameNode's log from becoming ever larger, which would result in ever-longer restart times and would eventually cause the NameNode to run out of available memory. On startup, the NameNode loads its baseline image file and then applies recent changes recorded in its log file. The SecondaryNameNode performs periodic checkpoints for the NameNode, in which it applies the contents of the log file to the image file, producing a new image. Even with this checkpoint, cluster reboots may take 90 minutes or more on large clusters, as the NameNode requires storage reports from each of the DataNodes before it can bring the file system online.

At present, Hadoop does not support IPv6. For organizations with IPv6, investigate using machine names with a DNS server instead of IP addresses to label each node. While organizations in quickly growing markets such as China have been early adopters of IPv6 due to historical allocations of IPv4 addresses, IPv6 is starting to get more attention in North America and other regions too. On the recent "World IPv6 Day," Google, Facebook, Yahoo and around 300 other Internet websites offered content using IPv6.

For the DataNodes, expect failure. With HDFS, when a DataNode crashes, the cluster does not crash, but performance degrades in proportion to the amount of lost storage and processing capacity as the still-functioning DataNodes pick up the slack. In general, don't use RAID for the DataNodes, and avoid using Linux Logical Volume Manager (LVM) with Hadoop. HDFS already offers built-in redundancy by replicating blocks across multiple nodes.

NetApp is one organization that advocates use of RAID for the DataNodes. According to an email exchange with NetApp head of strategic planning Val Bercovici on June 26, 2011: "NetApp's E-Series Hadooplers use highly engineered HDFS RAID configurations for DataNodes which separate data protection from job and query completion... we are witnessing improved Extract and Load performance for most Hadoop ETL tasks, plus the enablement of Map Reduce pipelining for Transformations not previously possible. Long-tail HDFS data can also be safely migrated to RAID-protected Data Nodes using a rep count of 1 for much greater long-term HDFS storage efficiency."

For each DataNode, plan between 1 GB and 2 GB of memory per Map or Reduce task slot. In addition to file storage, DataNodes require about 30% of disk capacity to be set aside as free disk space for ephemeral files generated during MapReduce processing. For total storage per DataNode, 12 terabytes is common.
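Put together, those figures give a rough sense of how much user data a cluster can actually hold. This is an illustrative sketch: the 30% ephemeral reserve and the 12 TB per node come from the guidance above, and the replication factor of 3 is the HDFS default.

```python
# Back-of-the-envelope usable HDFS capacity: reserve ~30% of raw disk
# for ephemeral MapReduce files, then divide by block replication.

EPHEMERAL_RESERVE = 0.30  # scratch space for intermediate MapReduce output
REPLICATION = 3           # HDFS default block replication


def usable_capacity_tb(num_datanodes, raw_tb_per_node):
    """Effective user-data capacity of the cluster, in terabytes."""
    after_reserve = num_datanodes * raw_tb_per_node * (1 - EPHEMERAL_RESERVE)
    return after_reserve / REPLICATION


# Hypothetical example: 20 DataNodes with the common 12 TB of raw disk each
print(round(usable_capacity_tb(20, 12), 1))  # roughly 56 TB of user data
```

Compression (discussed next) raises the effective figure further, since blocks store compressed bytes.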

Hadoop gives you the option to compress data using the codec you specify. For example, at Facebook, they rely on the gzip codec with a compression factor of between six and seven for the majority of their data sets. (SIGMOD'10 presentation, June 2010, "Data Warehousing and Analytics Infrastructure at Facebook").

Try to standardize on a hardware configuration for the DataNodes. You can run multiple families of hardware, but that does complicate provisioning and operations. When you have nodes that use different hardware, your architecture begins to be more of a grid than a cluster.

As well, don't expect to use virtualization — there is a major capacity hit. Hadoop works best when a DataNode can access all its disks. Hadoop is a highly scalable technology, not a high-performance technology. Given the nature of HDFS, you are much better off with data blocks and co-located processing spread across dozens, hundreds or thousands of independent nodes. If you have a requirement to use virtualization, consider assigning one VM to each DataNode hardware device; this would give you the benefit of virtualization such as for automated provisioning or automated software updates but keep the performance hit as low as possible.

True high availability (HA) protection is not available at present with Hadoop, unless you have an unlimited budget and can afford a duplicate cluster. That said, there are positive developments to support HA. The AvatarNode project, for example, implements a one-minute managed failover.

There is the option from Hadoop version 0.21 onward to run a Backup NameNode. However, as explained by MapR Co-Founder and CTO M.C. Srivas in a blog Q&A, even with a Backup NameNode, if the NameNode fails, the cluster will need a complete restart that may take several hours. You may be better off resurrecting the original NameNode from a synchronous or NFS mount copy versus using a Backup NameNode.

If you need to support a fully distributed application, consider Apache ZooKeeper.

From Facebook, Application Operations Engineer Andrew Ryan shared lessons learned for Hadoop cluster administrators at the Bay Area Hadoop User Group (HUG) meetup in February 2011 at the Yahoo Sunnyvale campus. A few of the suggestions from his excellent talk:

  • Maintain a central registry of clusters, nodes, and each node’s role in the cluster, integrated with your service/asset management platform.
  • Failed/failing hardware is your biggest enemy -- the ‘excludes’ file is your friend.
  • Never fsck ext3 data drives unless Hadoop says you have to.
  • Segregate different classes of users on different clusters, with appropriate service levels and capacities.

While it may seem obvious with a cluster, it's still worth a reminder that Hadoop is not well suited for running a cluster that spans more than one data center. Even at Yahoo, which runs the largest private-sector Hadoop production clusters, no Hadoop file systems or MapReduce jobs are divided across multiple data centers. Organizations such as Facebook are considering ways to federate Hadoop clusters across multiple data centers, but it does introduce challenges with time boundaries and overhead for common dimension data.

Import data into HDFS

Regardless of the source of the data to store in the cluster, input is through the HDFS API. For example, you can collect log data files in Apache Chukwa, Cloudera-developed Flume, or Facebook-developed Scribe and feed those files through the HDFS API into the cluster, to be divided up into HDFS' block storage. One approach for streaming data such as log files is to use a staging server to collect the data as it comes in and then feed it into HDFS using batch loads.

Sqoop is designed to import data from relational databases. Sqoop imports a database table, runs a MapReduce job to extract rows from the table, and writes the records to HDFS. You can download Sqoop source code at GitHub or as part of the Cloudera CDH.

Informatica's newly announced PowerCenter version 9.1 includes connectivity for HDFS, to load data into Hadoop or extract data from Hadoop, as explained in a recent call I had with Informatica solution evangelist Julianna DeLua. Customers can use Informatica data quality and other transformation tools either pre- or post-writing the data into HDFS. Future Informatica support for Hadoop is targeted to include a graphical integrated development environment (IDE) for Hadoop, including codeless and metadata-driven development; tools to prepare and integrate data in Hadoop; and tracking of metadata lineage.

IBM and Cloudera announced the Apache Hadoop Connector for IBM Netezza. It's available as a free download and connects with HDFS for two-way exchange of data. Cloudera provides support for the connector through a Cloudera Enterprise subscription. This collaboration opens up multiple use cases, such as using Hadoop to collect and explore "data bags" of anecdotal data or disorderly name-value pairs, with results sent to IBM Netezza; and using Hadoop clusters as "analytic virtual data marts" for ad hoc projects offloaded from IBM Netezza, such as multivariate A/B testing. In examples like these, IBM Netezza serves as the persistent data store of record, with fast processing of queries that benefit from a set schema, while Hadoop stores and processes hodgepodge "data bags" of terabyte- or petabyte-scale volumes of raw, unstructured, or semi-structured data from disparate sources.

Both Teradata Aster Data and EMC Greenplum provide a two-way, parallelized data connector between their PostgreSQL-derived data stores and HDFS. For Oracle customers, Quest Data Connector for Hadoop allows for data transfer between Oracle databases and Hadoop using a freeware plug-in to Sqoop.

One benefit of the EMC Greenplum, IBM Netezza, Teradata Aster Data, and Quest for Oracle connectors is that they support faster data transfer than what is typically possible using standard ODBC or JDBC drivers.

For more on the coexistence between Hadoop and more traditional data warehouses, a good resource is Ralph Kimball's recent white paper with Informatica, "The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics" (free download but registration required).

Manage jobs and answer queries

You have a choice of several job schedulers for MapReduce. Fair Scheduler, developed by Facebook, provides faster response times for small jobs and quality of service for production jobs. Jobs are grouped into pools, and you can assign each pool a minimum number of map slots and reduce slots, as well as a limit on the number of running jobs. There is no option to cap the maximum share of map or reduce slots (for example, if you are worried about a poorly written job taking up too much cluster capacity), but one workaround is to give each pool a guaranteed minimum share large enough to hold off the avalanche of an out-of-control job.
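As an illustration, pools and their guarantees are declared in the Fair Scheduler's allocation file; the pool names and numbers below are hypothetical:

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <minMaps>20</minMaps>        <!-- guaranteed minimum map slots -->
    <minReduces>10</minReduces>  <!-- guaranteed minimum reduce slots -->
    <maxRunningJobs>25</maxRunningJobs>
  </pool>
  <pool name="adhoc">
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
</allocations>
```

A generous `minMaps`/`minReduces` guarantee on each pool is the workaround described above for the missing maximum-share cap.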

With Capacity Scheduler, developed by Yahoo, you can assign priority levels to jobs, which are submitted to queues. Within a queue, when resources become available, they are assigned to the highest priority job. Note that there is no preemption once a job is running to take back capacity that has already been allocated.

Yahoo developed Oozie for workflow management. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop, including HDFS, Pig and MapReduce.

Azkaban provides a batch scheduler for constructing and running Hadoop jobs or other offline processes. LinkedIn uses a combination of Hadoop to process massive batch workloads, Project Voldemort as a NoSQL key/value storage engine, and the open-source Azkaban workflow system to power large-scale data computations of more than 100 billion relationships a day alongside low-latency site serving. LinkedIn supports Azkaban as an open source project.

If your organization uses Eclipse as its integrated development environment (IDE), you can set up a Hadoop development environment within it. With the Hadoop Eclipse Plug-in, you can create Mapper, Reducer, and Driver classes; submit MapReduce jobs; and monitor job execution.

Karmasphere Studio provides a graphical environment to develop, debug, deploy and optimize MapReduce jobs. Karmasphere supports a free community edition and license-based professional edition.

You can query large data sets using Apache Pig, or with R using the R and Hadoop Integrated Processing Environment (RHIPE). With Apache Hive, you can enable SQL-like queries on large data sets as well as a columnar storage layout using RCFile.

The MicroStrategy 9 Platform allows application developers and data analysts to submit queries using HiveQL and view Hadoop data in MicroStrategy dashboards, for on-premise Hadoop clusters or cloud offerings such as Amazon Elastic MapReduce. Groupon adopted MicroStrategy to analyze Groupon's daily deals, obtain a deeper understanding of customer behavior, and evaluate advertising effectiveness. Groupon staff can use MicroStrategy-based reports and dashboards for analysis of their data in Hadoop and HP Vertica.

Teradata Aster Data supports a SAS/ACCESS connector to enable SAS users to execute MapReduce jobs.

To enable applications outside of the cluster to access the file system, the Java API works well for Java programs. For example, you can use a FileSystem object to open an input stream. For programs written in C++, Perl, PHP, Python, Ruby, or other programming languages, you can use the Thrift API.

Strengthen security

There are some ease-of-use advantages to running password-less Secure Shell (SSH) — for example, you can start and stop all of the cluster nodes with simple commands — but your organization may have security policies that prevent the use of password-less SSH. In general, those cluster-wide start and stop commands are useful for running a Hadoop evaluation or test cluster, but you probably do not need or want to use them for a production cluster.

By itself, HDFS does not provide robust security for user authentication. A user with a correct password can access the cluster, but beyond the password there is no authentication to verify that users are who they claim to be. To enable user authentication for HDFS, you can use the Kerberos network authentication protocol. This provides a Simple Authentication and Security Layer (SASL) via the Generic Security Services Application Program Interface (GSS-API). The setup uses a Remote Procedure Call (RPC) digest scheme with Delegation, Job, and Block Access tokens. Each of these tokens is similar in structure and based on HMAC-SHA1. Yahoo offers an explanation of how to set up the Oozie workflow manager for Hadoop Kerberos authentication.
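For a feel of the HMAC-SHA1 primitive underlying these tokens, here is a quick command-line sketch; the key and message are arbitrary stand-ins, not actual Hadoop token fields:

```shell
# Compute an HMAC-SHA1 keyed digest with OpenSSL; only the holder of the key
# can produce (or verify) this digest over the message.
echo -n "The quick brown fox jumps over the lazy dog" \
  | openssl dgst -sha1 -hmac "key" | awk '{print $NF}'
# -> de7c9b85b8b78aa6bc8a7a36f70a90701c9db4d9
```

Hadoop's tokens pair a payload of identifying fields with an HMAC like this one, so a node can check that a token was issued by a party holding the shared key.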

Kerberos authentication is a welcome addition, but by itself does not enable Hadoop to reach enterprise-grade security. As noted by Andrew Becherer in a presentation and white paper for the BlackHat USA 2010 security conference, "Hadoop Security Design: Just add Kerberos? Really?," remaining security weak points include:

  • Symmetric cryptographic keys are widely distributed.
  • Web tools for Job Tracker, Task Tracker, nodes and Oozie rely on pluggable web user interface (UI) authentication for security.
  • Some implementations of HDFS use proxy IP addresses and a database of roles to authorize access for bulk data transfer.

Given the limited security within Hadoop itself, even if your Hadoop cluster will be operating in a local- or wide-area network behind an enterprise firewall, you may want to consider a cluster-specific firewall to more fully protect non-public data that may reside in the cluster. In this deployment model, think of the Hadoop cluster as an island within your IT infrastructure — for every bridge to that island you should consider an edge node for security.

In addition to a network firewall around your cluster, you could consider a database firewall or broader database activity monitoring product. These include AppSec DbProtect, Imperva SecureSphere and Oracle Database Firewall. A database firewall enables rules-based determination of whether to pass, log, alert, block or substitute access to the database. Database firewalls are a subset of the broader software category of database monitoring products. Note that most database firewall or activity monitoring products are not yet set up for out-of-the-box Hadoop support, so you may require help from the database firewall vendor or a systems integrator.

Other security steps may also be necessary to secure applications outside of the cluster that are authorized and authenticated to access the cluster. For example, if you choose to use the Oozie workflow manager, Oozie becomes an approved "super user" that can perform actions on behalf of any Hadoop user. Accordingly, if you decide to adopt Oozie, you should consider an additional authentication mechanism to plug into Oozie.

You may be able to address some of the above security concerns with paid software, e.g., Cloudera Enterprise or Karmasphere Studio, such as by using their management applications in place of the web user interfaces that come with the Hadoop daemons.

Add to your Hadoop toolbox

As noted by Mike Loukides in "What is Data Science?," "If anything can be called a one-stop information platform, Hadoop is it." The broad Hadoop ecosystem provides a variety of choices for tools and capabilities among Apache projects and sub-projects, other open source tools, and proprietary software offerings. These include the following:

  • For serialization, Apache Avro is designed to enable native HDFS clients to be written in languages other than Java. Other possible options for serialization include Google Protocol Buffers (Protobuf) and Binary JSON (BSON).
  • Cascading is an open-source, data-processing API that sits atop MapReduce, with commercial support from Concurrent. Cascading supports job and workflow management. According to Concurrent founder and CTO Chris Wensel, in a single library you receive Pig/Hive/Oozie functionality, without all the XML and text syntax. Nathan Marz wrote db-migrate, a Cascading-based JDBC tool for import/export onto HDFS. At BackType and previously Rapleaf, Nathan also authored Cascalog, a Hadoop/Cascading-based query language hosted in Clojure. Multitool allows you to "grep", "sed", or join large datasets on HDFS or Amazon S3 from a command line.
  • Hadoop User Experience (HUE) provides "desktop-like" access to Hadoop via a browser. With HUE, you can browse the file system, create and manage user accounts, monitor cluster health, create MapReduce jobs, and enable a front end for Hive called Beeswax. Beeswax provides wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format. Cloudera contributed HUE as an open source project.
  • Pervasive TurboRush for Hive allows Hive to generate an execution plan using dataflow graphs as an alternative to MapReduce. It then executes these graphs using Pervasive DataRush distributed across the machines of the cluster.
  • Pentaho offers a visual design environment for data integration, extract transform load (ETL), report design, analytics and dashboards that integrates with Hadoop.
  • Karmasphere Analyst provides Visual SQL access to data in Hadoop, which can support visualizations in Tableau or Microsoft Excel.

  • Datameer offers a spreadsheet-like interface for analysts to work with data in Hadoop as one part of their Datameer Analytics Solution.

  • Modeled after Google BigTable, Apache HBase is a distributed column-oriented database built on top of HDFS. With HBase, you get real-time random read and write access to large data sets. You can use HBase as a back-end for materialized views; according to Facebook, this can support real-time analytics.

  • The Hadoop Online Prototype (HOP) is a modified version of Hadoop MapReduce that allows data to be pipelined between tasks and between jobs. This can enable better cluster utilization and increased parallelism, and allows new functionality: online aggregation (approximate answers as a job runs), and stream processing (MapReduce jobs that run continuously, processing new data as it arrives). Note that HOP is a development prototype that is not production-ready at this stage.

One of the benefits of Hadoop's loosely coupled architecture is the ability to replace components such as HDFS or MapReduce. As a data-centric architecture with loose coupling, Hadoop supports modular components. As noted by Rajive Joshi in a March 2011 InformationWeek article, "The key to data-centric design is to separate data from behavior. The data and data-transfer constructs then become the primary organizing constructs." That's not to say that there is complete plug and play with no integration work needed among Hadoop components (the proposed Apache Bigtop project may help in this area), but there is no requirement for every organization to use the same monolithic stack.

Several startups have emerged recently with offerings that replace elements of the Hadoop stack with different approaches.

Given Hadoop's increasing market penetration, it's not surprising to see more alternatives or competitors to Hadoop emerge. These include Spark and LexisNexis HPCC.

Spark was developed in the UC Berkeley AMP Lab for iterative machine learning algorithms and interactive data mining, and is used by several research groups at Berkeley for spam filtering, natural language processing, and road traffic prediction. Online video analytic service provider Conviva also uses Spark, which is open source under a BSD license. Spark runs on the Mesos cluster manager, which can also run Hadoop applications. Mesos adopters include Conviva, Twitter, and UC Berkeley; Mesos joined the Apache Incubator in January 2011.

In June 2011, LexisNexis announced a high performance computing cluster (HPCC) alternative to Hadoop, combining open source and proprietary elements. HPCC Systems announced plans for a free community version along with an enterprise edition that comes with support, training and consulting. Sandia National Laboratories uses HPCC's precursor technology, the data analytics supercomputer (DAS) platform, to sort through petabytes of data to find correlations and generate hypotheses. According to LexisNexis, HPCC configurations require fewer nodes to provide the same processing performance as a Hadoop cluster, and are faster in some benchmark tests.

However, it's unclear what types of semi-structured, unstructured, or raw data analysis will perform faster and/or on fewer nodes with HPCC. HPCC may struggle to create as vibrant an ecosystem of companies and contributors as the Apache Hadoop community has achieved. And HPCC uses the ECL programming language, which has not been widely adopted outside of LexisNexis, aside from occasional use in government or academia.

To wrap up

Hadoop is maturing as a platform to store, process and analyze huge volumes of varied semi-structured, unstructured and raw data from disparate sources. To get started:

  • Evaluate one of the free distributions in stand-alone or pseudo-distributed mode.
  • Refer to Tom White's "Hadoop: The Definitive Guide, Second Edition" and consider taking one or more of the Hadoop courses from Cloudera, Scale Unlimited, or another of the service providers listed on the Apache Hadoop wiki support page.
  • For an on-premise cluster, invest in separate enterprise-grade servers for the NameNode, SecondaryNameNode and JobTracker.
  • Remember operation tips for rack awareness, file system checks and load balancing.
  • Beef up security with user authentication, an edge node security firewall and other security measures.

What deployment tips, cautions or best practices would you like to add or comment on based on your own experience with Hadoop?

My thanks for comments on the draft article to Val Bercovici, Julianna DeLua, Glynn Durham, Jeff Hammerbacher, Sarah Sproehnle, M.C. Srivas, and Chris K. Wensel.


June 23 2011

Strata Week: Data Without Borders

Here are some of the data stories that caught my attention this week:

Data without borders

Data is everywhere. That much we know. But the usage of and benefit from data is not evenly distributed, and this week, New York Times data scientist Jake Porway has issued a call to arms to address this. He's asking for developers and data scientists to help build a Data Without Borders-type effort to take data — particularly NGO and non-profits' data — and match it with people who know what to do with it.

As Porway observes:

There's a lot of effort in our discipline put toward what I feel are sort of "bourgeois" applications of data science, such as using complex machine learning algorithms and rich datasets not to enhance communication or improve the government, but instead to let people know that there's a 5% deal on an iPad within a 1 mile radius of where they are. In my opinion, these applications bring vanishingly small incremental improvements to lives that are arguably already pretty awesome.

Porway proposes building a program to help match data scientists with non-profits and the like who need data services. The idea is still under development, but drop Porway a line if you're interested.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Big data and the future of journalism

The Knight Foundation announced the winners of its Knight News Challenge this week, a competition to find and support the best new ideas in journalism. The Knight Foundation selected 16 projects to fund from among hundreds of applicants.

In announcing the winners, the Knight Foundation pointed out a couple of important trends, including "the rise of the hacker/data journalist." Indeed, several of the projects are data-related, including Swiftriver, a project that aims to make sense of crisis data; ScraperWiki, a tool for users to create their own custom scrapers; and Overview, a project that will create visualization tools to help journalists better understand large data sets.

IBM releases its first Netezza appliance

Last fall, IBM announced its acquisition of the big data analytics company Netezza. The acquisition was aimed at helping IBM build out its analytics offerings.

This week, IBM released its first new Netezza appliance since acquiring the company. The IBM Netezza High Capacity Appliance is designed to analyze up to 10 petabytes in just a few minutes. "With the new appliance, IBM is looking to make analysis of so-called big data sets more affordable," Steve Mills, senior vice president and group executive of software and systems at IBM, told ZDNet.

The new Netezza appliance is part of IBM's larger strategy of handling big data, of which its recent success with Watson on Jeopardy was just one small part.

The superhero social graph

Plenty of attention is paid to the social graph: the ways in which we are connected online through our various social networks. And while there's still lots of work to be done making sense of that data and of those relationships, a new dataset released this week by the data marketplace Infochimps points to other social (fictional) worlds that can be analyzed.

The world, in this case, is that of the Marvel Comics universe. The Marvel dataset was constructed by Cesc Rosselló, Ricardo Alberich, and Joe Miro from the University of the Balearic Islands. Much like a real social graph, the data shows the relationships between characters, and according to the researchers "is closer to a real social graph than one might expect."

Got data news?

Feel free to email me.


June 02 2011

Strata Week: Hadoop competition heats up

Here are a few of the data stories that caught my eye this week.

Hadoop competition heats up

As the number of Hadoop vendors increases, companies are looking for ways to differentiate themselves. A couple of announcements this past week point to the angles vendors are taking.

Infrastructure company Rainstor announced that its latest data retention technology can be deployed using Cloudera's Hadoop distribution. Rainstor says it will improve the Hadoop Distributed File System with better compression and de-duplication, and it promises a physical footprint that is at least 97% smaller.

In other Hadoop news, MapR revealed that it will serve as the storage component for EMC's recently announced Greenplum HD Enterprise Edition Hadoop distribution. EMC's Hadoop distribution is not based on the official Apache Software Foundation version of the code, but is instead based on Facebook's optimized version.

In an interesting twist, MapR also became an official contributor to the Apache Hadoop project this week. As GigaOm's Derrick Harris observes:

More contributors [to Hadoop] means more (presumably) great ideas to choose from and, ideally, more voices deciding what changes to adopt and which ones to leave alone. For individual companies, getting officially involved with Apache means that perhaps Hadoop will evolve in ways that actually benefit their products that are based upon or seeking to improve Hadoop.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

Visualizing Facebook's PHP codebase

Facebook's Greg Schechter offered an explanation this week of how and why Facebook built a visualization project in order to better grasp some of the interdependencies among the more than 10,000 modules that comprise Facebook's front-end code.

Facebook has been normalizing its PHP usage, particularly as it relates to managing modules' dependencies. With its new system, when a module is written or modified, the other modules that directly depend on it are fully determinable, which ensures that circular dependencies are avoided.

But graphing this with a classic "arc-and-node" graph visualization won't work at Facebook's scale, so at a recent hackathon, the company came up with a better visualization method.

Screen from Facebook PHP codebase visualization
Screen from Facebook PHP codebase visualization. See more here.

This method divides the information into layers, where each row represents a layer and a layer's modules are dependent only on modules in the rows below it, and are depended upon only by modules in the rows above it. The visualization also colors modules more darkly if they have more dependencies.

A few screens showing the visualization are available here. Unfortunately, the full tool is only available internally for the Facebook engineering team.

Visualizing Shaquille O'Neal's data

In honor of the end of Shaquille O'Neal's 19-year NBA career (an announcement he tweeted yesterday), data journalist Matt Stiles has created an interactive visualization of the star's stats.

The visualization was created using data from and the Many Eyes data visualization tool. The Atlantic's Alexis Madrigal used the tool to take a look at Shaq's shoddy free-throw record.

While Shaq's career — and now his retirement — provide ample data for off-hand curiosity, the merging of sports stats and visualizations also opens the door to broader opportunities and new kinds of data products.

Shaq three pointer graph
Because few things are funnier than a center lofting three pointers, this graph matches Shaquille O'Neal's age against his three-point attempts. He hit a high-water mark (the big dot) at 23 when he attempted two three pointers and hit the only three of his career.

Got data news?

Feel free to email me.


May 12 2011

Strata Week: Data tools, data weapons, data stories

Here are a few of the data stories that caught my eye this week.

Another acquisition for AVOS

Last month, news broke that Yahoo had sold the popular bookmarking service Delicious to YouTube founders Chad Hurley and Steve Chen. Hurley and Chen have founded a new company, AVOS, and while details are still fuzzy about exactly what Delicious will look like under new ownership, the picture became a little clearer this week when AVOS announced another acquisition, this time purchasing the social media analytics startup Tap11.

"Our vision is to create the world's best platform for users to save, share, and discover new content," said Hurley in a statement on the AVOS blog. "With the acquisition of Tap11, we will be able to provide consumer and enterprise users with powerful tools to publish and analyze their links' impact in real-time."

With the Delicious and Tap11 tools in its toolbox, AVOS could build a sophisticated system for recommending content and monitoring sentiment.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

Data and mathematical intimidation

The idea of mathematical analysis being misused and misconstrued is hardly new, but as data-driven decision-making moves into new sectors, there are likely to be new controversies over the ways in which math and data are wielded. That's certainly the case with a string of recent stories written by The LA Times in which the newspaper has constructed a mathematical model to rate teachers' impact on their students. The model uses students' test scores to devise a teacher's "value-add." The analysis compares a student's performance on tests with their prior performance, and that difference — for better or worse or the same — is attributed to the teacher.

Teachers have balked at the method in part because The LA Times has published teachers' names and scores. But now some mathematicians are pushing back as well. An editorial written by John Ewing, president of Math for America, traces the history of the value-added systems and looks at several of the problems with these models (and with sweeping judgments made on standardized test scores).

Ewing writes:

Value-added modeling is promoted because it has the right pedigree — because it is based on "sophisticated mathematics." As a consequence, mathematics that ought to be used to illuminate ends up being used to intimidate. When that happens, mathematicians have a responsibility to speak out.

Data and scientific storytelling

Scientific papers are stories that persuade with data. That was the topic of a very interesting presentation by Anita de Waard, Disruptive Technologies Director at Elsevier Labs, at the recent Harvard Digital Scholarship Summit. Her talk provided a story analysis of scientific text, looking at the ways in which citations create facts and how linked data can help better support this knowledge (or story) creation process.

The research paper isn't going away any time soon, de Waard argues, but she does point to several other ways in which the citation, review and publication processes of scientific papers can be improved. Ideas include a science data app store and executable scientific papers.

The slides from her talk are embedded below.

Got data news?

Feel free to email me.


April 26 2011

Open source tools look to make mapping easier

The rapid evolution of tools for mapping open data is an important trend for the intersection of data, new media, citizens and society. Whether it's mapping issues, mapping broadband access or mapping crisis data, geospatial technology is giving citizens and policy makers alike new insight into the world we inhabit. Below, earthquake data is mapped in Japan.

Earlier today, Washington-based Development Seed launched cloud-based hosting for files created with its map design suite, MapBox.

"We are trying to radically lower the barrier of entry to map making for organizations and activists," said Eric Gundersen, the founder of Development Seed.

Media organizations and nonprofits have been making good use of Development Seed's tools. MapBox was used to tell stories with World Bank data. The Department of Education broadband maps were designed with Development Seed's open source TileMill tool and are hosted in the cloud. The Chicago Tribune also used TileMill to map population change using open data from the United States census.

Maps from the MapBox suite can be customized as interactive embeds, enabling media, nonprofits and government entities to share a given story far beyond a single static web page. For instance, the map below was made using open data in Baltimore that was released by the city earlier this year:

"This isn't about picking one person's API," said Gundersen. "This is working with anyone's API. It's your data. It's your tiles. If we do this right, we're about to have a lot of good GIS folks who will be able to make better web maps. There's a lot of locked up data that could be shared."

Making maps faster with Node.js

After making its mark in open source development with Drupal, Development Seed is now focusing on Node.js.

Why? Speed matters. "Data projects really are custom and they need more of a framework that focuses on speed," said Gundersen. "That's what Node.js delivers."

Node.js is a relatively recent addition to the development world that has seen high-profile adoption at Google and Yahoo. The framework was created by Ryan Dahl (@ryah). Jolie O'Dell covered what's hot about Node.js this March, focusing on its utility for real-time web apps.

(If you're interested in getting up and running with Node.js, O'Reilly has a preview of an upcoming book on the framework.)


April 07 2011

Data hand tools

The flowering of data science has both driven, and been driven by, an explosion of powerful tools. R provides a great platform for doing statistical analysis, Hadoop provides a framework for orchestrating large clusters to solve problems in parallel, and many NoSQL databases exist for storing huge amounts of unstructured data. The heavy machinery for serious number crunching includes perennials such as Mathematica, Matlab, and Octave, most of which have been extended for use with large clusters and other big iron.

But these tools haven't negated the value of much simpler tools; in fact, they're an essential part of a data scientist's toolkit. Hilary Mason and Chris Wiggins wrote that "Sed, awk, grep are enough for most small tasks," and there's a layer of tools below sed, awk, and grep that are equally useful. Hilary has pointed out the value of exploring data sets with simple tools before proceeding to a more in-depth analysis. The advent of cloud computing, Amazon's EC2 in particular, also places a premium on fluency with simple command-line tools. In conversation, Mike Driscoll of Metamarkets pointed out the value of basic tools like grep to filter your data before processing it or moving it somewhere else. Tools like grep were designed to do one thing and do it well. Because they're so simple, they're also extremely flexible, and can easily be used to build up powerful processing pipelines using nothing but the command line. So while we have an extraordinary wealth of power tools at our disposal, we'll be the poorer if we forget the basics.

With that in mind, here's a very simple, and not contrived, task that I needed to accomplish. I'm a ham radio operator. I spent time recently in a contest that involved making contacts with lots of stations all over the world, but particularly in Russia. Russian stations all sent their two-letter oblast abbreviation (equivalent to a US state). I needed to figure out how many oblasts I contacted, along with counting oblasts on particular ham bands. Yes, I have software to do that; and no, it wasn't working (bad data file, since fixed). So let's look at how to do this with the simplest of tools.

(Note: Some of the spacing in the associated data was edited to fit on the page. If you copy and paste the data, a few commands that rely on counting spaces won't work.)

Log entries look like this:

QSO: 14000 CW 2011-03-19 1229 W1JQ       599 0001  UV5U       599 0041
QSO: 14000 CW 2011-03-19 1232 W1JQ       599 0002  SO2O       599 0043
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR  
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO  

Most of the fields are arcane stuff that we won't need for these exercises. The Russian entries have a two-letter oblast abbreviation at the end; rows that end with a number are contacts with stations outside of Russia. We'll also use the second field, which identifies a ham radio band (21000 kHz, 14000 kHz, 7000 kHz, 3500 kHz, etc.). So first, let's strip everything but the Russians with grep and a regular expression:

$ grep '599 [A-Z][A-Z]' rudx-log.txt | head -2
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO

grep may be the most useful tool in the Unix toolchest. Here, I'm just searching for lines that have 599 (which occurs everywhere) followed by a space, followed by two uppercase letters. To deal with mixed case (not necessary here), use grep -i. You can use character classes like [:upper:] rather than specifying the range A-Z, but why bother? Regular expressions can become very complex, but simple ones will often do the job, and be less error-prone.

If you're familiar with grep, you may be asking why I didn't use $ to match the end of line, and forget about the 599 noise. Good question. There is some whitespace at the end of the line; we'd have to match that, too. Because this file was created on a Windows machine, instead of just a newline at the end of each line, it has a return and a newline. The $ that grep uses to match the end-of-line only matches a Unix newline. So I did the easiest thing that would work reliably.
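If you do want the $ anchor to work, one sketch (my addition, not part of the original workflow) is to strip the carriage returns with tr before grep ever sees the file. The sample lines below are fabricated to stand in for the real log:

```shell
# Fabricated CRLF-terminated log lines, standing in for the Windows-made file
printf 'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR\r\n' > crlf-sample.txt
printf 'QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041\r\n' >> crlf-sample.txt

# With the \r characters deleted, grep's $ anchor behaves as expected:
tr -d '\r' < crlf-sample.txt | grep '[A-Z][A-Z]$'
```

This prints only the RG3K line, since it's the one ending in two uppercase letters.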

The simple head utility is a jewel. If you leave head off of the previous command, you'll get a long listing scrolling down your screen. That's rarely useful, especially when you're building a chain of commands. head gives you the first few lines of output: 10 lines by default, but you can specify the number of lines you want. -2 says "just two lines," which is enough for us to see that this script is doing what we want.

Next, we need to cut out the junk we don't want. The easy way to do this is to use colrm (remove columns). That takes two arguments: the first and last column to remove. Column numbering starts with one, so in this case we can use colrm 1 72.

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | colrm 1 72 | head -2

How did I know we wanted column 72? Just a little experimentation; command lines are cheap, especially with command history editing. I should actually use 73, but that additional space won't hurt, nor will the additional whitespace at the end of each line. Yes, there are better ways to select columns; we'll see them shortly. Next, we need to sort and find the unique abbreviations. I'm going to use two commands here: sort (which does what you'd expect), and uniq (to remove duplicates).

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | colrm 1 72 | sort |\
   uniq | head -2

Sort has a -u option that suppresses duplicates, but for some reason I prefer to keep sort and uniq separate. sort can also be made case-insensitive (-f), can select particular fields (meaning we could eliminate the colrm command, too), can do numeric sorts in addition to lexical sorts, and lots of other things. Personally, I prefer building up long Unix pipes one command at a time to hunting for the right options.
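For the record, here's a tiny sketch showing that sort -u and the sort | uniq pipeline produce the same result (the abbreviations are made up):

```shell
# A few duplicate oblast abbreviations, fabricated for illustration
printf 'MO\nVR\nMO\nAD\nVR\n' > oblasts-sample.txt

sort -u oblasts-sample.txt          # duplicates suppressed in one step
sort oblasts-sample.txt | uniq      # same output, built from two tools
```

Both print AD, MO, VR, one per line.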

Finally, I said I wanted to count the number of oblasts. One of the most useful Unix utilities is a little program called wc: "word count." That's what it does. Its output is three numbers: the number of lines, the number of words, and the number of characters it has seen. For many small data projects, that's really all you need.

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | colrm 1 72 | sort | uniq | wc
      38      38     342

So, 38 unique oblasts. You can say wc -l if you only want to count the lines; sometimes that's useful. Notice that we no longer need to end the pipeline with head; we want wc to see all the data.

But I said I also wanted to know the number of oblasts on each ham band. That's the first number (like 21000) in each log entry. So we're throwing out too much data. We could fix that by adjusting colrm, but I promised a better way to pull out individual columns of data. We'll use awk in a very simple way:

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | awk '{print $2 " " $11}' |\
     sort | uniq 
14000 AD
14000 AL
14000 AN

awk is a very powerful tool; it's a complete programming language that can do almost any kind of text manipulation. We could do everything we've seen so far as an awk program. But rather than use it as a power tool, I'm just using it to pull out the second and eleventh fields from my input. The single quotes are needed around the awk program, to prevent the Unix shell from getting confused. Within awk's print command, we need to explicitly include the space, otherwise it will run the fields together.

The cut utility is another alternative to colrm and awk. It's designed to select portions of each line of a file. cut isn't a full programming language, but it can make more complex selections than simply deleting a range of columns. However, although it's a simple tool at heart, it can get tricky; I usually find that, when colrm runs out of steam, it's best to jump all the way to awk.
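As a quick illustration (the log line here is fabricated), cut can select by character position with -c or by delimited field with -d and -f:

```shell
# One fabricated log fragment to demonstrate cut's two selection modes
echo 'QSO: 14000 CW 2011-03-19 1229' > cut-sample.txt

cut -c 6-10 cut-sample.txt        # characters 6 through 10: the band, 14000
cut -d ' ' -f 2 cut-sample.txt    # second space-delimited field: also 14000
```

Note that -d takes a single-character delimiter, so runs of spaces count as multiple fields; that's one of the ways cut gets tricky on hand-aligned files like this log.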

We're still a little short of our goal: how do we count the number of oblasts on each band? At this point, I use a really cheesy solution: another grep, followed by wc:

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | awk '{print $2 " " $11}' |\
     sort | uniq | grep 21000 | wc
      20      40     180
$ grep '599 [A-Z][A-Z]' rudx-log.txt  | awk '{print $2 " " $11}' |\
     sort | uniq | grep 14000 | wc
      26      52     234

OK, 20 oblasts on the 21 MHz band, 26 on the 14 MHz band. And at this point, there are two questions you really should be asking. First, why not put grep 21000 first, and save the awk invocation? That's just how the script developed. You could put the grep first, though you'd still need to strip extra gunk from the file. Second: What if there are gigabytes of data? You have to run this command for each band, and for some other project, you might need to run it dozens or hundreds of times. That's a valid objection. To solve this problem, you need a more complex awk script (which has associative arrays in which you can save data), or you need a programming language such as Perl, Python, or Ruby. At the same time, we've gotten fairly far with our data exploration, using only the simplest of tools.
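To make the associative-array idea concrete, here's a hedged sketch that counts unique oblasts for every band in a single pass (the sample log lines are fabricated; the pipeline above remains the approach I actually used):

```shell
# Fabricated log excerpt: two bands, with one duplicate oblast on 21000
cat > awk-sample.txt <<'EOF'
QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041
QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR
QSO: 21000 CW 2011-03-19 1236 W1JQ 599 0004 UD3D 599 MO
QSO: 21000 CW 2011-03-19 1237 W1JQ 599 0005 RA3Y 599 MO
QSO: 14000 CW 2011-03-19 1240 W1JQ 599 0006 RK3E 599 VR
EOF

# seen[] remembers band+oblast pairs already counted;
# count[] tallies the number of unique oblasts per band
awk '/599 [A-Z][A-Z]/ {
    key = $2 " " $11
    if (!(key in seen)) { seen[key] = 1; count[$2]++ }
}
END { for (band in count) print band, count[band] }' awk-sample.txt | sort
```

On this sample it prints "14000 1" and "21000 2": the duplicate MO on 21000 is only counted once, and no per-band grep is needed.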

Now let's up the ante. Let's say that there are a number of directories with lots of files in them, including these rudx-log.txt files. Let's say that these directories are organized by year (2001, 2002, etc.). And let's say we want to count oblasts across all the years for which we have records. How do we do that?

Here's where we need find. My first approach is to take the filename (rudx-log.txt) out of the grep command, and replace it with a find command that looks for every file named rudx-log.txt in subdirectories of the current directory:

$ grep '599 [A-Z][A-Z]' `find . -name rudx-log.txt -print`  |\
   awk '{print $2 " " $11}' | sort | uniq | grep 14000 | wc
      48      96     432

OK, so 48 oblasts on the 14 MHz band, lifetime. I thought I had done better than that. What's happening, though? That find command is simply saying "look at the current directory and its subdirectories, find files with the given name, and print their names." The backquotes tell the Unix shell to use the output of find as arguments to grep. So we're just giving grep a long list of files, instead of just one. Note the -print option: if it's not there, find happily does nothing.

We're almost done, but there are a couple of bits of hair you should worry about. First, if you invoke grep with more than one file on the command line, each line of output begins with the name of the file in which it found a match:

./2008/rudx-log.txt:QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 \\
UA6YW         599 AD    
./2009/rudx-log.txt:QSO: 14000 CW 2009-03-21 1225 W1JQ      599 0015 \\
RG3K          599 VR    

We're lucky. grep just sticks the filename at the beginning of the line without adding spaces, and we're using awk to print selected whitespace-separated fields, so the field numbers didn't change. If we were using colrm, we'd have to fiddle with things to find the right columns. If the filenames had different lengths (reasonably likely in general, though not possible here), we couldn't use colrm at all. Fortunately, you can suppress the filename by using grep -h.

The second piece of hair is less common, but potentially more troublesome. If you look at the last command, what we're doing is giving the grep command a really long list of filenames. How long is long? Can that list get too long? The answers are "we don't know," and "maybe." In the nasty old days, things broke when the command line got longer than a few thousand characters. These days, who knows what's too long ... But we're doing "big data," so it's easy to imagine the expanded file list running to hundreds of thousands, even millions of characters. More than that, our single Unix pipeline doesn't parallelize very well; and if we really have big data, we want to parallelize it.

The answer to this problem is another old Unix utility, xargs. Xargs dates back to the time when it was fairly easy to come up with file lists that were too long. Its job is to break up command line arguments into groups and spawn as many separate commands as needed, running in parallel if possible (-P). We'd use it like this:

$ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]'  |\ 
  awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc
      48      96     432

This command is actually a nice little map-reduce implementation: the xargs command maps grep across all the cores on your machine, and the output is reduced (combined) by the awk/sort/uniq chain. xargs has lots of command line options, so if you want to be confused, read the man page.

Another approach is to use find's -exec option to invoke arbitrary commands. It's somewhat more flexible than xargs, though in my opinion, find -exec has the sort of overly flexible but confusing syntax that's surprisingly likely to lead to disaster. (It's worth noting that the examples for -exec almost always involve automating bulk file deletion. Excuse me, but that's a recipe for heartache. Take this from the guy who once deleted the business plan, then found that the backups hadn't been done for about 6 months.) There's an excellent tutorial for both xargs and find -exec at Softpanorama. I particularly like this tutorial because it emphasizes testing to make sure that your command won't run amok and do bad things (like deleting the business plan).
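For completeness, here's a sketch of a harmless, non-deleting find -exec; the `+` terminator batches filenames the way xargs does, while `\;` would run grep once per file. The directory layout is fabricated to mirror the per-year structure described above:

```shell
# Fabricated per-year directories with one log line each
mkdir -p exec-demo/2008 exec-demo/2009
echo 'QSO: 14000 CW 2008-03-15 1526 W1JQ 599 0054 UA6YW 599 AD' > exec-demo/2008/rudx-log.txt
echo 'QSO: 14000 CW 2009-03-21 1225 W1JQ 599 0015 RG3K 599 VR' > exec-demo/2009/rudx-log.txt

# -h suppresses the filename prefix, just as in the xargs version
find exec-demo -name rudx-log.txt -exec grep -h '599 [A-Z][A-Z]' {} + | wc -l
```

This prints 2: one matching contact per year. Test a find -exec on something harmless like this before pointing it at anything it could modify.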

That's not all. Back in the dark ages, I wrote a shell script that did a recursive grep through all the subdirectories of the current directory. That's a good shell programming exercise which I'll leave to the reader. More to the point, I've noticed that there's now a -R option to grep that makes it recursive. Clever little buggers ...
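A sketch of the recursive option (both GNU and BSD grep support -R, though it isn't POSIX; the files here are fabricated):

```shell
# Two fabricated per-year logs
mkdir -p rgrep-demo/2010 rgrep-demo/2011
echo 'QSO: 7000 CW 2010-03-20 0100 W1JQ 599 0002 UA1A 599 SP' > rgrep-demo/2010/rudx-log.txt
echo 'QSO: 7000 CW 2011-03-19 0100 W1JQ 599 0002 RZ3Z 599 MO' > rgrep-demo/2011/rudx-log.txt

# -R descends into subdirectories; -h drops the filename prefix
grep -Rh '599 [A-Z][A-Z]' rgrep-demo | wc -l
```

Unlike the find version, this searches every file under the directory, so it's best on trees that contain only the files you care about.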

Before closing, I'd like to touch on a couple of tools that are a bit more exotic, but which should be in your arsenal in case things go wrong. od -c gives a raw dump of every character in your file. (-c says to dump characters, rather than octal or hexadecimal). It's useful if you think your data is corrupted (it happens), or if it has something in it that you didn't expect (it happens a LOT). od will show you what's happening; once you know what the problem is, you can fix it. To fix it, you may want to use sed. sed is a cranky old thing: more than a hand tool, but not quite a power tool; sort of an antique treadle-operated drill press. It's great for editing files on the fly, and doing batch edits. For example, you might use it if NUL characters were scattered through the data.
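As a sketch (with the corruption simulated by printf), od -c makes stray NUL bytes visible, and tr — an even simpler tool than sed for this particular job — deletes them:

```shell
# Fabricate a log fragment with embedded NUL bytes
printf 'QSO: 14000\000 CW\000\n' > corrupt-sample.txt

od -c corrupt-sample.txt           # the \0 entries reveal the NULs
tr -d '\000' < corrupt-sample.txt  # prints the cleaned line: QSO: 14000 CW
```

sed earns its keep when the fix is a substitution rather than a deletion; for simply dropping or translating single characters, tr is hard to beat.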

Finally, a tool I just learned about (thanks, @dataspora): the pipe viewer, pv. It isn't a standard Unix utility. It comes with some versions of Linux, but the chances are that you'll have to install it yourself. If you're a Mac user, it's in MacPorts. pv tells you what's happening inside the pipes as the command progresses. Just insert it into a pipe like this:

$ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]'  |\ 
  awk '{print $2 " " $11}' | pv | grep 14000 | sort | uniq | wc
3.41kB 0:00:00 [  20kB/s] [<=>  
      48      96     432

The pipeline runs normally, but you'll get some additional output that shows the command's progress. If something's malfunctioning or performing too slowly, you'll find out. pv is particularly good when you have huge amounts of data, and you can't tell whether something has ground to a halt, or you just need to go out for coffee while the command runs to completion.

Whenever you need to work with data, don't overlook the Unix "hand tools." Sure, everything I've done here could be done with Excel or some other fancy tool like R or Mathematica. Those tools are all great, but if your data is living in the cloud, using these tools is possible, but painful. Yes, we have remote desktops, but remote desktops across the Internet, even with modern high-speed networking, are far from comfortable. Your problem may be too large to use the hand tools for final analysis, but they're great for initial explorations. Once you get used to working on the Unix command line, you'll find that it's often faster than the alternatives. And the more you use these tools, the more fluent you'll become.

Oh yeah, that broken data file that would have made this exercise superfluous? Someone emailed it to me after I wrote these scripts. The scripting took less than 10 minutes, start to finish. And, frankly, it was more fun.


February 16 2011

Google Public Data Explorer goes public

The explosion of data has created important new roles for mapping tools, data journalism and data science. Today, Google made it possible for the public to upload and visualize their own datasets in the Google Public Data Explorer.

Uploading a dataset is straightforward. Once the data sets have been uploaded, users can easily link to them or embed them. For instance, embedded below is a data visualization of unemployment rates in the continental United States. Click play to watch it change over time, with the expected alarming growth over the past three years.

As Cliff Kuang writes at Fast Company's design blog, Google's infographic tools went online after the company bought Gapminder's Trendalyzer, the data visualization technology developed by Dr. Hans Rosling.

Google Public Data Explorer isn't the first big data visualization app to go online, as Mike Melanson pointed out over at ReadWriteWeb. Sites like Factual, CKAN, InfoChimps and Amazon's Public Data Sets are also making it easier for people to work with big data.


Of note to government agencies: Google is looking for partnerships with "official providers" of public data, which can request to have their datasets appear in the Public Data Explorer directory.

In a post on Google's official blog, Omar Benjelloun, technical lead of Google's public data team, wrote more about Public Data Explorer and the different ways that the search giant has been working with public data:

Together with our data provider partners, we've curated 27 datasets including more than 300 data metrics. You can now use the Public Data Explorer to visualize everything from labor productivity (OECD) to Internet speed (Ookla) to gender balance in parliaments (UNECE) to government debt levels (IMF) to population density by municipality (Statistics Catalonia), with more data being added every week.

Google also introduced a new metadata format, the Dataset Publishing Language (DSPL). DSPL is an XML-based format that Google says will support rich, interactive visualizations like those in the Public Data Explorer.

For those interested, as is Google's way, they have created a helpful embeddable document that explains how to use Public Data Explorer:

And for those interested in what democratized data visualization means to journalism, check out Megan Garber's thoughtful article at the Nieman Journalism Lab.

February 08 2011

"Copy, paste, map"

Data, data everywhere, and all too many spreadsheets to think.

Citizens have a new tool to visualize data and map it onto their own communities. Geospatial startup FortiusOne and the Federal Communications Commission (FCC) have teamed up to launch IssueMap, a tool squarely aimed at addressing one of the biggest challenges that government agencies, municipalities and other public entities face in 2011: converting open data into information that people can distill into knowledge and insight.

IssueMap must, like the data it visualizes, be put in context. The world is experiencing an unprecedented data deluge, a reality that my colleague Edd Dumbill described as another "industrial revolution" at last week's Strata Conference. The release of more data under the Open Government Directive issued by the Obama Administration has resulted in even more data becoming available. The challenge is that for most citizens, the hundreds of thousands of data sets available in federal, state, and city data catalogs don't lead to added insight or utility in their everyday lives. This partnership between FortiusOne and the FCC is an attempt to give citizens a mapping tool to make FCC data meaningful.


There are powerful data visualization tools available to developers who wish to mash up data with maps, but IssueMap may have a key differentiator: simplicity. As Michael Byrne, the first FCC geospatial information officer, put it this morning, the idea is to make it as easy as "copy, paste, map."

Byrne blogged about the history of IssueMap:

Maps are a data visualization tool that can fix a rotten spreadsheet by making the data real and rich with context. By showing how data — and the decisions that produce data — affect people where they live, a map can make the difference between a blank stare and a knowing nod. Maps are also a crucial part of a decision-maker's toolkit, clearly plotting the relationship between policies and geographies in easy-to-understand ways.

Working with FCC deputy GIO Eric Spry, Byrne created the video embedded below:

IssueMap was created using FortiusOne's GeoIQ data visualization and analysis platform. "We built GeoIQ to enable non-technical users to easily make sense of data," said Sean Gorman, president and founder of FortiusOne. "IssueMap capitalizes on those core capabilities, enabling citizens to bring greater awareness of important issues and prompt action."

Gorman explained how to use IssueMap at the FortiusOne blog:

Once you've found some data you can either upload the spreadsheet (.csv, .xls, .xlsx, .odf) or just cut and paste into the IssueMap text box. Many tables you find online can also be cut and pasted to create a map. The data just needs to be clean with the first row containing your attributes and the data beneath having the values and geographies you would like to map. Even if you muck it up a bit IssueMap will give you helpful errors to let you know where you went wrong.

Once you've loaded your data just select the boundary you would like to join to and the value you would like to map. Click “Create Map” and magic presto you have a thematic map. Share your map via Twitter, Facebook or email. Now anyone can grab your map as an embed, download an image, grab the map as KML or get the raw data as a .csv. Your map is now viral and it can be repurposed in a variety of useful ways.

One of the most powerful ways humanity has developed to communicate information over time is through maps. If you can take data in an open form (and CSV files are one of the most standard formats available) then there's an opportunity to tell stories in a way that's relevant to a region and personalized to an individual. That's a meaningful opportunity.


January 06 2011

4 free data tools for journalists (and snoops)

Note: The following is an excerpt from Pete Warden's free ebook "Where are the bodies buried on the web? Big data for journalists."

There's been a revolution in data over the last few years, driven by an astonishing drop in the price of gathering and analyzing massive amounts of information. It only cost me $120 to gather, analyze and visualize 220 million public Facebook profiles, and you can use 80legs to download a million web pages for just $2.20. Those are just two examples.

The technology is also getting easier to use. Companies like Extractiv and Needlebase are creating point-and-click tools for gathering data from almost any site on the web, and every other stage of the analysis process is getting radically simpler too.

What does this mean for journalists? You no longer have to be a technical specialist to find exciting, convincing and surprising data for your stories. For example, the following four services all easily reveal underlying data about web pages and domains.


Many of you will already be familiar with WHOIS, but it's so useful for research that it's still worth pointing out. If you go to a WHOIS lookup site (or just type "whois" in a terminal on a Mac) you can get the basic registration information for any website. In recent years, some owners have chosen "private" registration, which hides their details from view, but in many cases you'll see a name, address, email and phone number for the person who registered the site.

You can also enter numerical IP addresses here and get data on the organization or individual that owns that server. This is especially handy when you're trying to track down more information on an abusive or malicious user of a service, since most websites record an IP address for everyone who accesses them.



Blekko is the newest search engine in town, and one of its selling points is the richness of the data it offers. If you type in a domain name followed by /seo, you'll receive a page of statistics on that URL.


Blekko statistics page

The first tab shows other sites that are linking to the current domain, in popularity order. This can be extremely useful when you're trying to understand what coverage a site is receiving, and if you want to understand why it's ranking highly in Google's search results, since those rankings are based on inbound links. This information would have been an interesting addition to the recent DecorMyEyes story, for example.

The other handy tab is "Crawl stats," especially the "Cohosted with" section:

Cohosted with section on Blekko

This tells you which other websites are running from the same machine. It's common for scammers and spammers to astroturf their way toward legitimacy by building multiple sites that review and link to each other. They look like independent domains, and may even have different registration details, but often they'll actually live on the same server because that's a lot cheaper. These statistics give you an insight into the hidden business structure of shady operators.

I always turn to bit.ly when I want to know how people are sharing a particular link. To use it, enter the URL you're interested in:

Bitly link shortening box

Then click on the 'Info Page+' link:


That takes you to the full statistics page (though you may need to choose "aggregate link" first if you're signed in to the service).


This will give you an idea of how popular the page is, including activity on Facebook and Twitter. Below that you'll see public conversations about the link:

Facebook and Twitter activity on Bitly

I find this combination of traffic data and conversations very helpful when I'm trying to understand why a site or page is popular, and who exactly its fans are. For example, it provided me with strong evidence that the prevailing narrative about grassroots sharing and Sarah Palin was wrong.

[Disclosure: O'Reilly AlphaTech Ventures is an investor in bit.ly.]


By surveying a cross-section of American consumers, Compete builds up detailed usage statistics for most websites, and they make some basic details freely available.

Choose the "Site Profile" tab and enter a domain:

Compete site profile box

You'll then see a graph of the site's traffic over the last year, together with figures for how many people visited, and how often.

Compete Traffic

Since they're based on surveys, Compete's numbers are only approximate. Nonetheless, I've found them reasonably accurate when I've been able to compare them against internal analytics.

Compete's stats are a good source when comparing two sites. While the absolute numbers may be off for both sites, Compete still offers a decent representation of the sites' relative difference in popularity.

One caveat: Compete only surveys U.S. consumers, so the data will be poor for predominantly international sites.

Additional data resources and tools are discussed in Pete's free ebook.

