
May 18 2012

Visualization of the Week: Urban metabolism

This week's visualization comes from PhD candidates David Quinn and Daniel Wiesmann, who've built an interactive web-mapping tool that lets you explore the "urban metabolism" of major U.S. cities. The map includes data about cities' and neighborhoods' energy usage (kilowatt-hours per person) and material intensity (kilograms per person) patterns. You can also view population density.

Click to see the full interactive version.

Quinn writes that "one of the objectives of this work is to share the results of our analysis. We would like to help provide better urban data to researchers." The map allows users to analyze information on the screen, draw out an area to analyze, compare multiple areas, and generate a report (downloadable as a PDF) with more details, including information about the specific data sources.

Quinn is a graduate student at MIT; Wiesmann is a PhD candidate at the Instituto Superior Técnico in Lisbon, Portugal.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20


April 05 2012

Data as seeds of content

Despite the attention big data has received in the media and among the technology community, it is surprising that we are still shortchanging the full capabilities of what data can do for us. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: to derive meaningful insights that can help us make informed decisions and take action to improve our businesses and our lives.

I recently spoke on the topic of automating content at the O'Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion's share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I'll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.

To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that — visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is the quality of visualizations in the real world varies dramatically. Even for the visualizations that do make intuitive sense, they often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won't be able to gain anything useful from it because they don't know how to interpret the information.


To be clear, I'm a big fan of visualizations, but they are not the end-all in data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds for content, whereby data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can sprout as large a content tree as your imagination will allow.

Below, I describe each limb of the content tree. The examples I cite are sports related because that's what we've primarily focused on at my company, Automated Insights. But we've done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.

Long-form content

By long-form, I mean three or more paragraphs — although it could be several pages or even book length — that use human-readable language to reveal key trends, records and deltas in data. This is the hardest form of content to automate, but technology in this space is rapidly improving. For example, here is a recap of an NFL game generated out of box score and play-by-play data.

A long-form sports recap driven by data. See the full story.

Short-form content

These are bullets, headlines, and tweets of insights that can boil a huge dataset into very actionable bits of language. For example, here is a game notes article that was created automatically out of an NCAA basketball box score and historical stats.

Mobile and social content

We've done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.


Metrics

By metrics, I'm referring to the process of creating a single number that's representative of a larger dataset. Metrics are shortcuts to boil data into something easier to understand. For instance, we've created metrics for various sports, such as a quarterback ranking system that's based on player performance.
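
To make the idea of a single representative number concrete, here is a minimal sketch that uses the NFL's long-standing passer rating formula. It is a stand-in chosen for illustration only, not the quarterback ranking system described above, whose details are not given here. The function and parameter names are assumptions.

```python
def _clamp(value, low=0.0, high=2.375):
    """Each component of the rating is limited to the range 0 to 2.375."""
    return max(low, min(high, value))


def passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """Collapse five box-score totals into one number (NFL passer rating)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    a = _clamp((completions / attempts - 0.3) * 5)      # completion percentage
    b = _clamp((yards / attempts - 3) * 0.25)           # yards per attempt
    c = _clamp(touchdowns / attempts * 20)              # touchdown rate
    d = _clamp(2.375 - interceptions / attempts * 25)   # interception rate
    return (a + b + c + d) / 6 * 100


# Made-up stat line that maxes out every component.
rating = passer_rating(attempts=20, completions=20, yards=300,
                       touchdowns=5, interceptions=0)
```

The clamping step is what makes the number comparable across games: no single component can dominate the total.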

Real-time updates

Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.

Content applications

This is one few people consider, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, which is an app that allows sports fans to discover 10-20+ statistically based "slams" that enable them to talk trash about any team.

A variation on visualizations

Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted text-based insights with the graphical visualization to allow them to work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series. It shows the ebb and flow of the game.

Play-by-play win probability from game seven of the 2011 World Series.

What now?

As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.



February 03 2012

Top stories: January 30-February 3, 2012

Here's a look at the top stories published across O'Reilly sites this week.

What is Apache Hadoop?
Apache Hadoop has been the driving force behind the growth of the big data industry. But what does it do, and why do you need all its strangely-named friends? (Related: Hadoop creator Doug Cutting on why Hadoop caught on.)

Embracing the chaos of data
Data scientists, it's time to welcome errors and uncertainty into your data projects. In this interview, Jetpac CTO Pete Warden discusses the advantages of unstructured data.

Moneyball for software engineering, part 2
A look at the "Moneyball"-style metrics and techniques managers can employ to get the most out of their software teams.

With GOV.UK, British government redefines the online government platform
A new beta .gov website in Britain is open source, mobile friendly, platform agnostic, and open for feedback.

When will Apple mainstream mobile payments?
David Sims parses the latest iPhone / near-field-communication rumors and considers the impact of Apple's (theoretical) entrance into the mobile payment space.

Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

February 02 2012

What is Apache Hadoop?

Apache Hadoop has been
the driving force behind the growth of the big data industry. You'll
hear it mentioned often, along with associated technologies such as
Hive and Pig. But what does it do, and why do you need all its
strangely-named friends, such as Oozie, Zookeeper and Flume?

Hadoop brings the ability to cheaply process large amounts of
data, regardless of its structure. By large, we mean from 10-100
gigabytes and above. How is this different from what went before?

Existing enterprise data warehouses and relational databases excel
at processing structured data and can store massive amounts of
data, though at a cost: This requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes
data warehouses unsuited for agile exploration of massive
heterogeneous data. The amount of effort required to warehouse data
often means that valuable data sources in organizations are never
mined. This is where Hadoop can make a big difference.

This article examines the components of the Hadoop ecosystem and
explains the functions of each.

The core of Hadoop: MapReduce

Created at Google
in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most of
today's big data processing. In addition to Hadoop, you'll find
MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.

The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple
nodes. Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive
computing arrays.
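
The divide-and-run-in-parallel idea can be sketched in a few lines of Python. This is a toy word count, not Hadoop's Java API: the function names are hypothetical, and threads on one machine stand in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: emit (word, 1) pairs for one chunk of the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, workers=2):
    # Divide the dataset into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Run the map step over the chunks in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


counts = word_count(["big data big", "data at web scale", "big cluster"])
```

Because each chunk is mapped independently, adding workers (or, in a real cluster, nodes) does not change the result, only how the work is spread out.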

At its core, Hadoop is an open source MapReduce
implementation. Funded by Yahoo, it emerged in 2006 and, according to its
creator Doug Cutting, reached "web scale" capability in early

As the Hadoop project matured, it acquired further components to enhance
its usability and functionality. The name "Hadoop" has
come to represent this entire ecosystem. There are parallels
with the emergence of Linux: The name refers strictly to the Linux
kernel, but it has gained acceptance as referring to a complete
operating system.

Hadoop's lower levels: HDFS and MapReduce

Above, we discussed the ability of MapReduce to distribute
computation over multiple servers. For that computation to take
place, each server must have access to the data. This is the role of
HDFS, the Hadoop Distributed File System.

HDFS and MapReduce are robust. Servers in a Hadoop cluster can
fail and not abort the computation process. HDFS ensures data is
replicated with redundancy across the cluster. On completion of a
calculation, a node will write its results back into HDFS.

There are no restrictions on the data that HDFS stores. Data may
be unstructured and schemaless. By contrast, relational databases
require that data be structured and schemas be defined before storing
the data. With HDFS, making sense of the data is the responsibility
of the developer's code.
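
As a rough illustration of what "the responsibility of the developer's code" means in practice, the sketch below applies a schema only when the data is read. The log format, field names, and sample lines are all invented for the example, and a plain Python list stands in for files stored in HDFS.

```python
# Raw lines are stored exactly as they arrived; nothing was validated on write.
RAW_LINES = [
    "2012-02-02 10:15:01 GET /index.html 200",
    "not a log line at all",
    "2012-02-02 10:15:07 GET /missing 404",
]

# The schema lives in the reading code, not in the storage layer.
FIELDS = ("date", "time", "method", "path", "status")


def parse(line):
    """Apply the schema at read time; return None if the line does not fit."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    try:
        record["status"] = int(record["status"])
    except ValueError:
        return None
    return record


records = [r for r in map(parse, RAW_LINES) if r is not None]
error_paths = [r["path"] for r in records if r["status"] >= 400]
```

A relational database would have rejected the malformed line at load time. Here it is stored anyway, and the reading code decides to skip it.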

Programming Hadoop at the MapReduce level is a case of working with the
Java APIs, and manually loading data files into HDFS.


Improving programmability: Pig and Hive

Working directly with Java APIs can be tedious and error prone.
It also restricts usage of Hadoop to Java programmers. Hadoop offers
two solutions for making Hadoop programming easier.

  • Pig is a programming
    language that simplifies the common tasks of working with Hadoop:
    loading data, expressing transformations on the data, and storing
    the final results. Pig's built-in operations can make sense of
    semi-structured data, such as log files, and the language is
    extensible using Java to add support for custom data types and

  • Hive enables Hadoop
    to operate as a data warehouse. It superimposes structure on data in HDFS
    and then permits queries over the data using a familiar SQL-like
    syntax. As with Pig, Hive's core capabilities are extensible.

Choosing between Hive and Pig can be confusing. Hive
is more suitable for data warehousing tasks, with predominantly
static structure and the need for frequent analysis. Hive's closeness
to SQL makes it an ideal point of integration between Hadoop and
other business intelligence tools.

Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming
data flows for incorporation into larger applications. Pig is a
thinner layer over Hadoop than Hive, and its main advantage is to
drastically cut the amount of code needed compared to direct
use of Hadoop's Java APIs. As such, Pig's intended audience remains
primarily the software developer.

Improving data access: HBase, Sqoop and Flume

At its heart, Hadoop is a batch-oriented system. Data are loaded
into HDFS, processed, and then retrieved. This is somewhat of a
computing throwback, and often, interactive and random access to data
is required.

Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google's
BigTable, the project's goal is to host billions of rows of data for rapid access.
MapReduce can use HBase as both a source and a destination for its
computations, and Hive and Pig can be used in combination with
HBase.
In order to grant random access to the data, HBase does impose a
few restrictions: Hive performance with HBase is 4-5 times slower than with plain
HDFS, and the maximum amount of data you can store is approximately
a petabyte, versus HDFS' limit of over 30PB.

HBase is ill-suited to ad-hoc analytics and more appropriate for
integrating big data as part of a larger application. Use cases
include logging, counting and storing time-series data.

The Hadoop Bestiary

Ambari      Deployment, configuration and monitoring
Flume       Collection and import of log and event data
HBase       Column-oriented database scaling to billions of rows
HCatalog    Schema and data type sharing over Pig, Hive and MapReduce
HDFS        Distributed redundant file system for Hadoop
Hive        Data warehouse with SQL-like access
Mahout      Library of machine learning and data mining algorithms
MapReduce   Parallel computation on server clusters
Pig         High-level programming language for Hadoop computations
Oozie       Orchestration and workflow management
Sqoop       Imports data from relational databases
Whirr       Cloud-agnostic deployment of clusters
Zookeeper   Configuration management and coordination

Getting data in and out

Improved interoperability with the rest of the data world is
provided by Sqoop and Flume. Sqoop is a tool designed to import data from
relational databases into Hadoop, either directly into HDFS or into
Hive. Flume is designed to import streaming flows of log data
directly into HDFS.

Hive's SQL friendliness means that it can be used as a point of
integration with the vast universe of database tools capable of making
connections via JDBC or ODBC database drivers.

Coordination and workflow: Zookeeper and Oozie

With a growing family of services running as part of a Hadoop
cluster, there's a need for coordination and naming services. As
computing nodes can come and go, members of the cluster need
to synchronize with each other, know where to access services, and
know how they should be configured. This is the purpose of Zookeeper.

Production systems utilizing Hadoop can often contain complex
pipelines of transformations, each with dependencies on each
other. For example, the arrival of a new batch of data will trigger
an import, which must then trigger recalculations in dependent
datasets. The Oozie
component provides features to manage the workflow and dependencies,
removing the need for developers to code custom solutions.
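
The dependency problem described here can be sketched with Python's standard-library `graphlib` module. This is not Oozie's interface, and the step names are hypothetical. It only shows how declaring which step depends on which is enough to derive a valid run order.

```python
from graphlib import TopologicalSorter

# Each key is a pipeline step; its value is the set of steps it depends on.
pipeline = {
    "import_batch": set(),
    "clean_records": {"import_batch"},
    "daily_totals": {"clean_records"},
    "user_segments": {"clean_records"},
    "weekly_report": {"daily_totals"},
}

# static_order() yields every step after all of its dependencies.
run_order = list(TopologicalSorter(pipeline).static_order())
```

A workflow manager adds scheduling, retries, and triggers on top of this ordering, which is the custom code developers would otherwise have to write themselves.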

Management and deployment: Ambari and Whirr

One of the commonly added features incorporated into Hadoop by
distributors such as IBM and Microsoft is monitoring and
administration. Though in an early stage, Ambari aims
to add these features to the core Hadoop project. Ambari is intended to help system
administrators deploy and configure Hadoop, upgrade clusters, and
monitor services. Through an API, it may be integrated with other
system management tools.

Though not strictly part of Hadoop, Whirr is a highly complementary
component. It offers a way of running services, including Hadoop, on
cloud platforms. Whirr is cloud neutral and
currently supports the Amazon EC2 and Rackspace services.

Machine learning: Mahout

Every organization's data are diverse and particular
to their needs. However, there is much less diversity in the kinds of
analyses performed on that data. The Mahout project is a library of
Hadoop implementations of common analytical computations. Use cases
include user collaborative filtering, user recommendations,
clustering and classification.

Using Hadoop

Normally, you will use Hadoop in
the form of a distribution. Much as with Linux before it,
vendors integrate and test the components of the Apache Hadoop
ecosystem and add in tools and administrative features of their
own.

Though not per se a distribution, a managed cloud installation
of Hadoop's MapReduce is also available through Amazon's Elastic
MapReduce service.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


January 19 2012

Strata Week: A home for negative and null results

Here are a few of the data stories that caught my attention this week:

Figshare sees the upside of negative results

Science data-sharing site Figshare relaunched its website this week, adding several new features. Figshare lets researchers publish all of their data online, including negative and null results.

Using the site, researchers can now upload and publish all file formats, including videos and datasets that are often deemed "supplemental materials" or excluded from current publishing models. This is part of a larger "open science" effort. According to Figshare:

"... by opening up the peer review process, researchers can easily publish null results, avoiding the file drawer effect and helping to make scientific research more efficient. Figshare uses creative commons licensing to allow frictionless sharing of research data whilst allowing users to maintain their ownership."

As the startup argues: "Unless we as scientists publish all of our data, we will never achieve access to the sum of all scientific knowledge."


Accel's $100 million data fund makes its first ($52.5 million) investment

Late last year, the investment firm Accel Partners announced a new $100 Million Big Data Fund, with a promise to invest in big data startups. This year, the first investment from that fund was revealed, with a whopping $52.5 million going to Code 42.

Founded in 2001, Code 42 is the creator of the backup software CrashPlan, and the company describes itself as building "high-performance hardware and easy-to-use software solutions that protect the world's data."

Describing the investment, GigaOm's Stacey Higginbotham writes:

"With the growth in mobile devices and the data stored on corporate and consumer networks that is moving not only from device to server, but device to device, [CEO Matthew] Dornquast realized Code 42's software could become more than just a backup and sharing service, but a way for corporations to understand what data and how data was moving between employees and the devices they use."

Higginbotham also cites Accel Partners' Ping Li, who notes that further investments from its Big Data Fund are unlikely to be so sizable.

LinkedIn open sources DataFu

LinkedIn has been a heavy user of Apache Pig for performing analysis with Hadoop on projects such as its People You May Know tool, among other things. For more advanced tasks like these, Pig supports User Defined Functions (UDFs), which allow the integration of custom code into scripts.

This week, LinkedIn announced the release of DataFu, the consolidation of its UDFs into a single, general-purpose library. DataFu enables users to "run PageRank on a large number of independent graphs, perform set operations such as intersect and union, compute the haversine distance between two points on the globe," and more.
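
For readers unfamiliar with the haversine distance mentioned in that list, here is a minimal Python sketch of the standard formula. It is not DataFu's implementation (DataFu is a library of Pig UDFs), and the function name and the choice of a 6,371 km mean Earth radius are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
```

The formula treats the Earth as a perfect sphere, so results are approximate for real-world coordinates.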

LinkedIn is making DataFu available on GitHub under the Apache 2.0 license.

Got data news?

Feel free to email me.


December 20 2011

There's a map for that

On November 6, 2012, millions of citizens in the United States will elect or re-elect representatives in Congress. Long before those citizens reach the polls, however, their elected representatives and their political allies in the state legislatures will have selected their voters.

Given powerful new data analysis tools, the practice of "gerrymandering," or creating partisan, incumbent-protected electoral districts through the manipulation of maps, has reached new heights in the 21st century. The drawing of these maps has been one of the least transparent processes in governance. Public participation has been limited or even blocked by the authorities in charge of redistricting.

While gerrymandering has been part of American civic life since the birth of the republic, one of the best policy innovations of 2011 may offer hope for improving the redistricting process. DistrictBuilder, an open-source tool created by the Public Mapping Project, allows anyone to easily create legal districts.

Michael P. McDonald, associate professor at George Mason University and director of the U.S. Elections Project, and Micah Altman, senior research scientist at Harvard University Institute for Quantitative Social Science, collaborated on the creation of DistrictBuilder with Azavea.

"During the last year, thousands of members of the public have participated in online redistricting and have created hundreds of valid public plans," said Altman, via an email. "In substantial part, this is due to the project's effort and software. This year represents a huge increase in participation compared to previous rounds of redistricting — for example, the number of plans produced and shared by members of the public this year is roughly 100 times the number of plans submitted by the public in the last round of redistricting 10 years ago. Furthermore, the extensive news coverage has helped make a whole new set of people aware of the issue and has reframed it as a problem that citizens can actively participate in to solve, rather than simply complain about."

For more on the potential and the challenges present here, watch the C-SPAN video of the Brookings Institution discussion on Congressional redistricting and gerrymandering, including what's happening in states such as California and Maryland. Participants include Norm Ornstein of the American Enterprise Institute and David Wasserman of the Cook Political Report. 

The technology of district building

DistrictBuilder lets users analyze whether a given map complies with federal and advocacy-oriented standards. That means maps created with DistrictBuilder are legal and may be submitted to a given state's authority. The software pulls data from several sources, including the 2010 US Census (race, age, population and ethnicity); election data; and map data, including how the current districts are drawn. Districts can also be divided by county lines, overall competitiveness between parties, and voting age. Each district must have the same total population number, though they are not required to have the same number of eligible voters.
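
The equal-population requirement lends itself to a simple check. The sketch below is not DistrictBuilder's code. It computes each district's relative deviation from the ideal (average) population, using made-up district names and numbers. How much deviation is legally acceptable is not covered here.

```python
def population_deviation(district_populations):
    """Return each district's relative deviation from the ideal population."""
    ideal = sum(district_populations.values()) / len(district_populations)
    return {name: (population - ideal) / ideal
            for name, population in district_populations.items()}


# Hypothetical plan; these figures are invented for illustration.
plan = {"District 1": 710_000, "District 2": 705_000, "District 3": 715_000}

deviations = population_deviation(plan)
worst = max(abs(d) for d in deviations.values())
```

A plan-checking tool would compare `worst` against whatever threshold applies and flag the districts that exceed it.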

On the tech side, DistrictBuilder is a combination of Django, GeoServer, Celery, jQuery, PostgreSQL, and PostGIS. For more developer-related posts about DistrictBuilder, visit the Azavea website. A webinar that explains how to use DistrictBuilder is available here.

DistrictBuilder is not the first attempt to make software that lets citizens try their hands at redistricting. ESRI launched a web-based application for Los Angeles this year.

"The online app makes redistricting accessible to a wide audience, increasing the transparency of the process and encouraging citizen engagement," said Mark Greninger, geographic information officer for the County of Los Angeles, in a prepared statement. "Citizens feel more confident because they are able to build their own plans online from wherever they are most comfortable. The tool is flexible enough to accommodate a lot of information and does not require specialized technical capabilities."

DistrictBuilder does, however, look like an upgrade to existing options available online. "There are a handful of tools" that enable citizens to participate, said Justin Massa in an email. Massa was the director of project and grant development at the Metro Chicago Information Center (MCIC) and is currently the founder and CEO of Food Genius. "An ESRI plugin and Borderline jump to mind although I know there are more, but all of them are proprietary and quite expensive. There's a few web-based versions, but none of them were usable in my testing."

Redistricting competitions

DistrictBuilder is being used in several state competitions to stimulate more public participation in the redistricting process and improve the maps themselves. "While gerrymandering is unlikely to be the driving force in the trend toward polarization in U.S. politics, it would result in a significant number of seats changing hands, and this could have a substantial effect on what laws get passed," said Altman. "We don't necessarily expect that software alone will change this, or that the legislatures will adopt public plans (even where they are clearly better) but making software and data available, holding competitions, and hosting sites where the public can easily evaluate and create plans that pass legal muster, has increased participation and awareness dramatically."

The New York Redistricting Project (NYRP) is hosting an open competition to redistrict New York congressional and state legislative districts. NYRP is collaborating with the Center for Electoral Politics and Democracy at Fordham University in an effort to see if college students can outclass Albany. The deadline for entering the New York student competition is Jan. 5, and the contest is open to all NY students.

In Philadelphia, a redistricting competition included cash prizes when it kicked off in August of this year. By the end of September, citizen-sourced redistricting efforts reached the finish line, though it's unclear how much impact they had. In Virginia, a similar competition is taking aim at the "rigged redistricting process."

"This [DistrictBuilder] redistricting software is available not only to students, but to the public at large," said Costas Panagopoulos in a phone interview. At Fordham University, Panagopoulos is an assistant professor of political science, the director of the Center for Electoral Politics and Democracy, and the director of the graduate program in Elections and Campaign Management. "It's open source, user friendly and has no costs associated with it. It's a great opportunity for people to get involved and have the tools they need to design maps as alternatives for legislatures to consider."

Panagopoulos says maps created in DistrictBuilder can matter when redistricting disputes end up in the courts. "We have seen evidence from other states where competitions have been held," he said. "Official government entities have looked to maps that have been drawn by students for guidance. In Virginia, students submitted maps that enhanced minority representation. There are elements in the plan that will be officially adopted."

While it might seem unlikely that a map created by a team of students will be adopted, elements created by students in New York could make their way into discussions in Albany, posited Panagopoulos. "Our sense is that the criteria students will use to design maps will be somewhat different than what lawmakers will choose to pursue," he said. "Lawmakers may take concerns about protecting incumbents or partisan interests more to heart than citizens will. At the end of the day, if lawmakers think that a plan is ultimately worse off for both parties, they may adopt something that's more benign. That's what happened in the last round of redistricting. Legislators pushed through a different map rather than the one imposed by a judge."

For a concrete example of how the politics play out in one state, look at Texas. Ross Ramsey, the executive editor of The Texas Tribune, wrote about redistricting in the Texas legislature and courts:

The 2010 elections put overwhelming Republican majorities in both houses of the Legislature just as the time came to draw new political maps for state legislators, the Congressional delegation and members of the State Board of Education. Those Republicans drew maps to give each district an even number of people and to maximize the number of Republican districts that could be created, they thought, under the Voting Rights Act and the federal and state constitutions.

Or look at Illinois, where a Democratic redistricting plan would maximize the number of Democratic districts in that state. Or Pennsylvania, where a new map is drawing condemnation for being "rife with gerrymandering," according to Eric Boehm of the PA Independent.

While redistricting has historically not been the most accessible governance issue to the voting public, historic levels of dissatisfaction with the United States Congress could be channeled into more civic engagement. "The bottom line is that the public never had an opportunity to be as involved in redistricting as they are now," said Panagopoulos. "It's important that the public get involved."


Better redistricting software requires better data

Redistricting is "an emerging open-government issue that, for whatever reason, hasn't gotten a ton of attention yet from our part of the world," wrote Massa. "This scene is filled with proprietary datasets, intentionally confusing legislative proposals, antiquated laws that don't compel the publication of shape files, and election results data that is unbelievably messy."

As is the case with other open-government platforms, DistrictBuilder will only work with the right base data. "About a year ago, MCIC worked on a voting data project just for seven counties around Chicago," said Massa. "We found that none of the data we obtained from county election boards matched what the Census published as part of the '08 boundary files." In other words, a hoary software adage applies: "garbage in, garbage out."
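A mismatch like the one Massa describes can be surfaced with a simple identifier comparison. The following is a minimal sketch, not MCIC's actual process: it assumes each source exposes a precinct identifier, and the function name `compare_ids` and the sample IDs are hypothetical.

```python
def normalize(precinct_id):
    """Trim whitespace and upper-case an ID so trivial formatting differences don't count."""
    return precinct_id.strip().upper()


def compare_ids(county_ids, census_ids):
    """Report which precinct IDs appear in only one source and which appear in both."""
    county = {normalize(i) for i in county_ids}
    census = {normalize(i) for i in census_ids}
    return {
        "only_in_county": sorted(county - census),
        "only_in_census": sorted(census - county),
        "matched": sorted(county & census),
    }


# Hypothetical IDs, invented for illustration (not real election or Census data).
county_board = ["p-001 ", "P-002", "P-009"]
census_file = ["P-001", "P-002", "P-003"]

report = compare_ids(county_board, census_file)
print(report)
```

Anything left in the "only in" lists would need manual review before the data could feed a tool like DistrictBuilder.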

That's where MCIC has played a role. "MCIC has been working with the Midwest Democracy Network to implement DistrictBuilder for six states in the Midwest," wrote Massa. According to Massa, Illinois, Indiana, Wisconsin, Michigan, and Ohio didn't have anything available at a state level. Among these states, according to Massa, only Minnesota publishes clean data. Earlier this year, MCIC launched DistrictBuilder for Minnesota.

"The unfortunate part is that the data to power a truly democratic process exists," said Massa. "We all know that no one is hand-drawing maps and then typing out the lengthy legislative proposals that describe, in text, the boundaries of a district. The fact that the political parties use tech and data to craft their proposals and then, in most cases, refuse to publish the data they used to make their decisions, or electronic versions of the proposals themselves, is particularly infuriating. This is a prime example of data 'empowering the empowered'."

Image Credit: Elkanah Tisdale's illustration of gerrymandering, via Wikipedia.


September 20 2011

BuzzData: Come for the data, stay for the community

As the data deluge created by the activities of global industries accelerates, the need for decision makers to find a signal in the noise will only grow more important. Therein lies the promise of data science, from data visualizations to dashboards to predictive algorithms that filter the exaflood and produce meaning for those who need it most. Data consumers and data producers, however, are both challenged by "dirty data" and limited access to the expertise and insight they need. To put it another way, if you can't derive value, as Alistair Croll has observed here at Radar, there's no such thing as big data.

BuzzData, based in Toronto, Canada, is one of several startups looking to help bridge that gap. BuzzData launched this spring with a combination of online community and social networking that is reminiscent of what GitHub provides for code. The thinking here is that every dataset will have a community of interest around the topic it describes, no matter how niche it might be. Once uploaded, each dataset has tabs for tracking versions, visualizations, related articles, attachments and comments. BuzzData users can "follow" datasets, just as they would a user on Twitter or a page on Facebook.
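To make the idea of tracking dataset versions concrete, here is a minimal sketch that compares the column headers of two versions of a CSV file. It is an illustration only, not BuzzData's implementation; the function name `column_diff` and the sample data are hypothetical.

```python
import csv
import io


def column_diff(old_csv, new_csv):
    """Return (added, removed) column names between two CSV versions, using header rows only."""
    old_cols = next(csv.reader(io.StringIO(old_csv)))
    new_cols = next(csv.reader(io.StringIO(new_csv)))
    added = [c for c in new_cols if c not in old_cols]
    removed = [c for c in old_cols if c not in new_cols]
    return added, removed


# Hypothetical dataset versions, invented for illustration.
version_1 = "ward,tree_count,surveyed_on\nA,120,2011-05-01\n"
version_2 = "ward,tree_count,species\nA,120,maple\n"

added, removed = column_diff(version_1, version_2)
print("added:", added)
print("removed:", removed)
```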

"User experience is key to building a community around data, and that's what BuzzData seems to be set on doing," said Marshall Kirkpatrick, lead writer at ReadWriteWeb, in an interview. "Right now it's a little rough around the edges to use, but it's very pretty, and that's going to open a lot of doors. Hopefully a lot of creative minds will walk through those doors and do things with the data they find there that no single person would have thought of or been capable of doing on their own."

The value proposition that BuzzData offers will depend upon many more users showing up and engaging with one another and, most importantly, the data itself. For now, the site remains in limited beta with hundreds of users, including at least one government entity, the City of Vancouver.

"Right now, people email an Excel spreadsheet around or spend time clobbering a shared file on a network," said Mark Opauszky, the startup's CEO, in an interview late this summer. "Our behind-the-scenes energy is focused on interfaces so that you can talk through BuzzData instead. We're working to bring the same powerful tools that programmers have for source code into the world of data. Ultimately, you're not adding and removing lines of code — you're adding and removing columns of data."

Opauszky said that BuzzData is actively talking with data publishers about the potential of the platform: "What BuzzData will ultimately offer when we move beyond a minimum viable product is for organizations to have their own territory in that data. There is a 'brandability' to that option. We've found it very easy to make this case to corporations, as they're already spending dollars, usually on social networks, to try to understand this."

That corporate constituency may well be where BuzzData finds its business model, though the executive team was careful to caution that they're remaining flexible. It's "absolutely a freemium model," said Opauszky. "It's a fundamentally free system, but people can pay a nominal fee on an individual basis for some enhanced features — primarily the ability to privatize data projects, which by default are open. Once in a while, people will find that they're on to something and want a smaller context. They may want to share files, commercialize a data product, or want to designate where data is stored geographically."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

Open data communities

"We're starting to see analysis happen, where people tell 'data stories' that are evolving in ways they didn't necessarily expect when they posted data on BuzzData," said Opauszky. "Once data is uploaded, we see people use it, fork it, and evolve data stories in all sorts of directions that the original data publishers didn't perceive."

For instance, a dataset of open data hubs worldwide has attracted a community that improved the original upload considerably. BuzzData featured the work of James McKinney, a civic hacker from Montreal, Canada, in making it so. A Google Map mashing up locations is embedded below:

The hope is that communities of developers, policy wonks, media, and designers will self-aggregate around datasets on the site and collectively improve them. Hints of that future are already present, as open government advocate David Eaves highlighted in his post on open source data journalism at BuzzData. As Eaves pointed out, it isn't just media companies that should be paying attention to the trends around open data journalism:

For years I argued that governments — and especially politicians — interested in open data have an unhealthy appetite for applications. They like the idea of sexy apps on smart phones enabling citizens to do cool things. To be clear, I think apps are cool, too. I hope in cities and jurisdictions with open data we see more of them. But open data isn't just about apps. It's about the analysis.

Imagine a city's budget up on BuzzData. Imagine the flow rates of the water or sewage system. Or the inventory of trees. Think of how a community of interested and engaged "followers" could supplement that data, analyze it, and visualize it. Maybe they would be able to explain it to others better, to find savings or potential problems, or develop new forms of risk assessment.

Open data journalism

"It's an interesting service that's cutting down barriers to open data crunching," said Craig Saila, director of digital products at the Globe and Mail, Canada's national newspaper, in an interview. He said that the Globe and Mail has started to open up the data that it's collecting, like forest fire data, at the Globe and Mail BuzzData account.

"We're a traditional paper with a strong digital component that will be a huge driver in the future," said Saila. "We're putting data out there and letting our audiences play with it. The licensing provides us with a neutral source that we can use to share data. We're working with data suppliers to release the data that we have or are collecting, exposing the Globe's journalism to more people. In a lot of ways, it's beneficial to the Globe to share census information, press releases and statistics."

The Globe and Mail is not, however, hosting any information there that's sensitive. "In terms of confidential information, I'm not sure if we're ready as a news organization to put that in the cloud," said Saila. "We're just starting to explore open data as a thing to share, following the Guardian model."

Saila said that he's found the private collaboration model useful. "We're working on a big data project where we need to combine all of the sources, and we're trying to munge them all together in a safe place," he said. "It's a great space for journalists to connect and normalize public data."

The BuzzData team emphasized that they're not trying to be another data marketplace, like Infochimps, or replace Excel. "We made an early decision not to reinvent the wheel," said Opauszky, "but instead to try to be a water cooler, in the same way that people go to Vimeo to share their work. People don't go to Flickr to edit photos or YouTube to edit videos. The value is to be the connective tissue of what's happening."

If that question about "what's happening?" sounds familiar to Twitter users, it's because that kind of stream is part of BuzzData's vision for the future of open data communities.

"One of the things that will become more apparent is that everything in the interface is real time," said Opauszky. "We think that topics will ultimately become one of the most popular features on the site. People will come from the Guardian or the Economist for the data and stay for the conversation. Those topics are hives for peers and collaborators. We think that BuzzData can provide an even 'closer to the feed' source of information for people's interests, similar to the way that journalists monitor feeds in Tweetdeck."


August 19 2011

Visualizing hunger in the Horn of Africa

Drought, conflict and rising food prices have put the lives of millions of people in the Horn of Africa at risk. Today, on World Humanitarian Day, citizens and governments alike are looking for ways to help victims of the East Africa drought. According to the State Department, more people than the combined populations of New York City and Houston need urgent assistance in the Horn of Africa. To understand the scope of the unfolding humanitarian disaster, explore the embedded map below.

The map was built by Development Seed using open source tools and open data. It includes estimates from the Famine Early Warning Systems Network (FEWS NET) and the Food Security and Nutrition Analysis Unit - Somalia (FSNAU), coupled with data from the UN Office for the Coordination of Humanitarian Affairs (UN OCHA). The map mashes up operational data from the World Food Programme with situational data to show how resources are being allocated.

"This is about more than just creating a new map," writes Nate Smith, a data lead at Development Seed:

This map makes information actionable and makes it easy to see both the extent of the crisis and the response to it. It allows people to quickly find information about how to easily contribute much needed donations to support aid efforts on the ground, and see where those donations are actually going. In the Horn of Africa, the World Food Programme can feed one person for one day with just $0.50. Using this map it is possible to see what is needed budget-wise to feed those in need, and how close the World Food Programme is to achieving this. Going forward, new location and shipment data will be posted in near real-time, keeping the data as accurate as possible.

Development Seed has also applied a fundamental platform principle by making it easy to spread both the data and message through social tools and embeddable code.

If you'd like to donate to organizations that are working to help people directly affected in the crisis, has posted a list of charities. If you'd prefer to donate directly to the World Food Program, you can also text AID to 27722 using your mobile phone to give $10 to help those affected by the Horn of Africa crisis.


June 24 2011

Big data and open source unlock genetic secrets

The world is experiencing an unprecedented data deluge, a reality that my colleague Edd Dumbill described as another "industrial revolution" at February's Strata Conference. Many sectors of the global economy are waking up to the need to use data as a strategic resource, whether in media, medicine, or moving trucks. Open data has been a major focus of Gov 2.0, as federal and state governments move forward with creating new online platforms for open government data.

The explosion of data requires new tools and management strategies. These new approaches include more than technical evolution, as a recent conversation with Charlie Quinn, director of data integration technologies at the Benaroya Research Institute, revealed: they involve cultural changes that create greater value by sharing data between institutions. In Quinn's field, genomics, big data is far from a buzzword, with scanned sequences now measured on the terabyte scale.

In the interview below, Quinn shares insights about applying open source to data management and combining public data with experimental data. You can hear more about open data and open source in advancing personalized medicine from Quinn at the upcoming OSCON Conference.

How did you become involved in data science?

Charlie Quinn: I got into the field through a friend of mine. I had been doing data mining for fraud on credit cards and the principal investigator, who I work with now, was going to work in Texas. We had a novel idea that to build the tools for researchers, we should hire software people. What had happened in the past was you had bioinformaticians writing scripts. They found the programs that they needed did about 80% of what they wanted, and they had a hard time gaining the last 20%. So we had had a talk way back when saying, "if you really want proper software tools, you ought to hire software people to build them for you." He called my boss to come on down and take a look. I did, and the rest is history.

You've said that there's a "data explosion" in genomics research. What do you mean? What does this mean for your field?

Charlie Quinn: It's like the difference between analog and digital technology. The amount of data you'd have with analog is still substantial, but as we move toward digital, it grows exponentially. If we're looking at technology in gene expression values, which is what we've been focusing on in genomics, it's about a gigabyte per scan. As we move into doing targeted RNA sequencing, or even high frequency sequencing, if you take the raw output from the sequence, you're looking at terabytes per scan. It's orders of magnitude more data.

What that means from a practical perspective is there's more data being generated than just for your request. There's more data being generated than a single researcher could possibly ever hope to get their head wrapped around. Where the data explosion becomes interesting is how we engage researchers to take data they're generating and share it with others, so that we can reuse data, and other people might be able to find something interesting in it.

Health IT at OSCON 2011 — The conjunction of open source and open data with health technology promises to improve creaking infrastructure and give greater control and engagement for patients. These topics will be explored in the healthcare track at OSCON (July 25-29 in Portland, Ore.)

Save 20% on registration with the code OS11RAD

What are the tools you're using to organize and make sense of all that data?

Charlie Quinn: A lot of it's been homegrown so far, which is a bit of an issue as you start to integrate with other organizations because everybody seems to have their own homegrown system. There's an open source group in Seattle called LabKey, which a lot of people have started to use. We're taking another look at them to see if we might be able to use some of their technology to help us move forward in organizing the backend. A lot of this is so new. It's hard to keep up with where we're at and quite often, we're outpacing it. It's a question of homegrown and integrating with other applications as we can.

How does open source relate to that work?

Charlie Quinn: We try and use open source as much as we can. We try and contribute back where we can. We haven't been contributing back anywhere near as much as we'd like to, but we're going to try and get into that more.

We're huge proponents not only of open source, but of open data. What we've been doing is going around and trying to convince people that we understand they have to keep data private up to a certain point, but let's try and release as much data as we can as early as we can.

When we go back to talking about the explosion of data, if we're looking at Gene X and we happen to see something that might be interesting on Y or Z, we can post a quick discovery note or a short blurb. In that way, you're trying to push ideas out and take the data behind those ideas and make it public. That's where I think we're going to get traction: trying to share data earlier rather than later.

At OSCON, you'll talk about how experimental data combines with public data. When did you start folding the two together?

Charlie Quinn: We've been playing with it for a while. What we're hoping to do is make more of it public, now that we're getting the institutional support for it. Years ago, we went and indexed all of the abstracts at PubMed by gene so that when people went to a text engine, you could type in your query and you would get a list of genes, as opposed to a list of articles. That helped researchers find what they were looking for, and that's just leveraging openly available data. Now, with NIH's mandate for more people to publish their results back into repositories, we're downloading that data and combining it with the data we have internally. Now, as we go across a project or across a disease trying to find how a gene is acting or how a protein is acting, it's just giving us a bigger dataset to work with.
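The gene-first search Quinn describes can be pictured as an inverted index from gene symbols to abstracts. Quinn does not describe the implementation, so the following is only a minimal sketch under stated assumptions: gene symbols are matched as whole tokens, the query is a plain substring match, and the abstracts, function names, and gene list are hypothetical examples.

```python
import re
from collections import defaultdict


def index_by_gene(abstracts, gene_symbols):
    """Map each gene symbol to the set of abstract IDs whose text mentions it."""
    index = defaultdict(set)
    for abstract_id, text in abstracts.items():
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
        for gene in gene_symbols:
            if gene.upper() in tokens:
                index[gene].add(abstract_id)
    return index


def genes_for_query(abstracts, index, query):
    """Return genes mentioned in abstracts that contain the query text."""
    q = query.lower()
    hits = {aid for aid, text in abstracts.items() if q in text.lower()}
    return sorted(gene for gene, ids in index.items() if ids & hits)


# Hypothetical abstracts, invented for illustration.
abstracts = {
    "a1": "BRCA1 mutations and breast cancer risk.",
    "a2": "TP53 and BRCA1 interplay in tumor suppression.",
    "a3": "IL6 signaling in autoimmune inflammation.",
}
index = index_by_gene(abstracts, ["BRCA1", "TP53", "IL6"])

print(genes_for_query(abstracts, index, "tumor"))
print(genes_for_query(abstracts, index, "inflammation"))
```

The query returns genes rather than articles, which is the inversion Quinn says helped researchers find what they were looking for.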

What are some of the challenges you've encountered in your work?

Charlie Quinn: The issues we've had are with the quality of the datasets in the public repositories. You need to hire a curator to validate if the data is going to be usable or not, to make sure it's comparable to the data that we want to use it with.

What's the future of open data in research and personalized medicine?

Charlie Quinn: We're going to be seeing multiple tiers of data sharing. In the long run, you're going to have very well curated public repositories of data. We're a fair ways away from there in reality because there's still a lot of inertia against doing that within the research community. The half-step to get there will be large project consortiums where we start sharing data inter-institutionally. As people get more comfortable with that, we'll be able to open it up to a wider audience.

This interview was edited and condensed.

Photo: Replicating Nanomachines by jurvetson, on Flickr


June 06 2011

Google Correlate: Your data, Google's computing power

Google Correlate is awesome. As I noted in Search Notes last week, Google Correlate is a new tool in Google Labs that lets you upload state- or time-based data to see what search trends most correlate with that information.

Correlation doesn't necessarily imply causation, and as you use Google Correlate, you'll find that the relationship (if any) between terms varies widely based on the topic, time, and space.

For instance, there's a strong state-based correlation between searches for me and searches for Vulcan Capital. But the two searches have nothing to do with each other. As you see below, the correlation is that the two searches have similar state-based interest.

For both searches, the most volume is in Washington state (where we're both located). And both show high activity in New York.

State-based data

For a recent talk I gave in Germany, I downloaded state-by-state income data from the U.S. Census Bureau and ran it through Google Correlate. I found that high income was highly correlated with searches for [lohan breasts] and low income was highly correlated with searches for [police shootouts]. I leave the interpretation up to you.

By default, the closest correlations are with the highest numbers, so to get correlations with low income, I multiplied all of the numbers by negative one.
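The sign flip works because negating one series negates its correlation coefficient, so the strongest negative matches become the strongest positive ones. Here is a minimal sketch using the standard Pearson coefficient; the source does not say which measure Google Correlate uses, so treat that choice as an assumption, and note that the numbers are invented, not real Census or search data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five states (illustration only).
income = [40.0, 45.0, 52.0, 60.0, 71.0]
search_volume = [9.0, 8.0, 6.5, 5.0, 2.0]  # falls as income rises

r = pearson(income, search_volume)
flipped_income = [-x for x in income]  # multiply every value by negative one
r_flipped = pearson(flipped_income, search_volume)

print(r, r_flipped)
```

With the flipped series, a query that tracks low income scores near +1 instead of near -1, so it rises to the top of a list sorted by highest correlation.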

Clay Johnson looked at correlations based on state obesity rates from the CDC. By looking at negative correlations (in other words, what search queries are most closely correlated with states with the lowest obesity rates), we see that the most closely related search is [yoga mat bags]. (Another highly correlated term is [nutrition school].)

Maybe there's something to that "working out helps you lose weight" idea I've heard people mention. Then again, another highly correlated term is [itunes movie rentals], so maybe I should try the "sitting on my couch, watching movies work out plan" just to explore all of my options.

To look at this data more seriously, we can see with search data alone that the wealthy seem to be healthier (at least based on obesity data) than the poor. In states with low obesity rates, searches are for optional material goods, such as Bose headphones, digital cameras, and red wine and for travel to places like Africa, Jordan, and China. In states with high obesity rates, searches are for jobs and free items.

With this hypothesis, we can look at other data (access to nutritious food, time and space to exercise, health education) to determine further links.

Time-based data

Time-based data works in a similar way. Google Correlate looks for matching patterns in trends over time. Again, that the trends are similar doesn't mean they're related. But this data can be an interesting starting point for additional investigation.
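One way to picture this pattern matching is to score every candidate query's time series against the uploaded series and sort by the score. This is a minimal sketch, not Google's actual method: it assumes Pearson correlation as the similarity measure, and the series names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(target, candidates):
    """Return (name, score) pairs sorted from most to least correlated with target."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical monthly values, invented for illustration.
uploaded = [1.0, 2.0, 4.0, 7.0, 5.0, 3.0]
candidates = {
    "query_a": [2.0, 4.0, 8.0, 14.0, 10.0, 6.0],  # same shape, twice the scale
    "query_b": [7.0, 5.0, 4.0, 1.0, 2.0, 6.0],    # roughly the opposite shape
    "query_c": [3.0, 3.0, 4.0, 3.0, 4.0, 3.0],    # mostly flat
}

ranked = rank_by_correlation(uploaded, candidates)
for name, score in ranked:
    print(name, round(score, 3))
```

A high score only says the shapes match over time; as the article notes, it says nothing about whether the two trends are related.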

One of the economic indicators from the U.S. Census Bureau is housing inventory. I looked at the number of months' supply of homes at the current sales rate between 2003 and today. I have no idea how to interpret data like this (the general idea is that you, as an expert in some field, would upload data that you understand). But my non-expert conclusion here is that as housing inventory increases (which implies no one's buying), we are looking to spiff up our existing homes with cheap stuff, so we turn to Craigslist.

Of course, it could also be the case that the height of popularity of Craigslist just happened to coincide with the months when the most homes were on the market, and both are coincidentally declining at the same rate.

Search-based data

You can also simply enter a search term, and Google will analyze the state or time-based patterns of that term and chart other queries that most closely match those patterns. Google describes this as a kind of Google Trends in reverse.

Google Insights for Search already shows you state distribution and volume trends for terms, and Correlate takes this one step further by listing all of the other terms with a similar regional distribution or volume trend.

For instance, regional distribution for [vegan restaurants] searches is strongly correlated to the regional distribution for searches for [mac store locations].

What does the time-trend of search volume for [vegan restaurants] correlate with? Flights from LAX.

Time-based data related to a search term can be a fascinating look at how trends spark interest in particular topics. For instance, as the Atkins Diet lost popularity, so too did interest in the carbohydrate content of food.

Interest in maple syrup seems to follow interest in the cleanse diet (of which maple syrup is a key component).

Drawing-based data

Don't have any interesting data to upload? Aren't sure what topic you're most interested in? Then just draw a graph!

Maybe you want to know what had no search volume at all in 2004, spiked in 2005, and then disappeared again. Easy. Just draw it on a graph.

Apparently the popular movies of the time were "Phantom of the Opera," "Darkness," and "Meet the Fockers." And we all were worried about our Celebrex prescriptions.

(Note: the accuracy of this data likely is dependent on the quality of your drawing skills.)

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD


June 01 2011

Search Notes: Connecting Google's dots

Here's what recently caught my attention in the search space.

Google Wallet

Last week, Google unveiled Google Wallet, which on the one hand, might be the future of payments, but on the other hand, seems like it's just using your phone instead of your credit card to pay for things. And phones so far are bulkier to carry around than credit cards. But Google says:

... because Google Wallet is a mobile app, it will do more than a regular wallet ever could. You'll be able to store your credit cards, offers, loyalty cards and gift cards, but without the bulk.

Wallet will be integrated with Google Offers (Google's answer to Groupon) and one can imagine the possible future integrations. For instance, Google could manage travel from start to finish by integrating elements of its ITA acquisition for booking, Hotpot and Places for reviews and maps, and Wallet for paying on the go.

Google Wallet will be available this summer, initially on the Nexus S.

After the unveiling of Wallet, PayPal sued. They said that Google had been nearing the end of negotiations with PayPal to make it a payment option in the Android marketplace, but instead of signing, Google hired away the PayPal executive they'd been negotiating with and built their own version.

Of course, this isn't the first time Google has been sued for hiring talent away from a competitor. And since they had the two key ex-PayPal employees introduce Google Wallet publicly, they weren't exactly keeping things on the down low to avoid this lawsuit.

Android Open, being held October 9-11 in San Francisco, is a big-tent meeting ground for app and game developers, carriers, chip manufacturers, content creators, OEMs, researchers, entrepreneurs, VCs, and business leaders.

Save 20% on registration with the code AN11RAD

Google Correlate: Mine search trends using uploaded state-based or time-based data

Google Correlate, new in Google Labs, takes the idea behind Flu Trends and makes it available to anyone, for any data. You can enter data by state or by time and find out what searches are most closely correlated. You can also simply enter a search term and see what other queries are most closely correlated (by state or by time).

This is all U.S. data for now. Google Correlate was launched in Labs, so hopefully when it graduates from there it will be launched worldwide.

Google's comic book about the product stresses that correlation does not imply causation. This data simply shows similar search patterns. But data patterns can provide insight. Flu Trends, for instance, predicts when and where flu is spreading based on how much people are searching for flu-related information. "We found aggregated flu-related queries which produced a seasonal curve that suggested actual flu activity," Google notes. They have corroborated these trends historically with government data about flu activity.

Google's worldwide market share

This column is "Search Notes," not "Google Notes," so why so much Google coverage? The fact is Google is the dominant search engine worldwide, more so even outside the U.S. Along those lines, as I was finalizing slides for a conference session in Germany, I double checked Google's search share there. I found that Google's share was relatively unchanged year over year, at more than 90% for Germany, France, the UK, and Spain. This week, comScore noted that Google is at more than 90% share in Latin America as well.

Removing content from Google

Last fall, I wrote two fairly detailed articles about removing content from Google search results:

Now, Google has made it easier for content owners to remove content. Just verify ownership of your site in Webmaster Tools, and then you can specify what pages from your site you want Google to remove from its results.

