Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

September 07 2011

Look at Cook sets a high bar for open government data visualizations

Every month, more open government data is available online. Open government data is being used in mobile apps, baked into search engines or incorporated into powerful data visualizations. An important part of that trend is that local governments are becoming data suppliers.

For local, state and federal governments, however, releasing data is not enough. Someone has to put it to work, pulling the data together to create cohesive stories so citizens and other stakeholders can gain more knowledge. Sometimes this work is performed by public servants, though data visualization and user experience design has historically not been the strong suit of government employees. In the hands of skilled developers and designers, however, open data can be used to tell powerful stories.

One of the best recent efforts at visualizing local open government data can be found at Look at Cook, which tracks government budgets and expenditures from 1993-2011 in Cook County, Illinois.


The site was designed and developed by Derek Eder and Nick Rougeux, in collaboration with Cook County Commissioner John Fritchey. Below, Eder explains how they built the site, the civic stack tools they applied, and the problems Look at Cook aims to solve.

Why did you build Look at Cook?

Derek Eder: After being installed as a Cook County Commissioner, John Fritchey, along with the rest of the Board of Commissioners, had to tackle a very difficult budget season. He realized that even though the budget books were presented in the best accounting format possible and were also posted online in PDF format, this information was still not friendly to the public. After some internal discussion, one of his staff members, Seth Lavin approached me and Nick Rougeux and asked that we develop a visualization that would let the public easily explore and understand the budget in greater detail. Seth and I had previously connected through some of Chicago's open government social functions, and we were looking for an opportunity for the county and the open government community to collaborate.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

What problems does Look at Cook solve for government?

Derek Eder: Look at Cook shines a light on what's working in the system and what's not. Cook County, along with many other municipalities, has its fair share of problems, but before you can even try to fix any of them, you need to understand what they are. This visualization does exactly that. You can look at the Jail Diversion department in the Public Safety Fund and compare it to the Corrections and Juvenile Detention departments. They have an inverse relationship, and you can actually see one affecting the other between 2005 and 2007. There are probably dozens of other stories like these hidden within the budget data. All that was needed was an easy way to find and correlate them — which anyone can now do with our tool.

Look at Cook visualization example
Is there a relationship between the lower funding for Cook County's Jail Diversion and Crime Prevention division and the higher funding levels for the Department of Corrections and the Juvenile Temporary Detention Center divisions? (Click to enlarge.)

What problems does Look at Cook solve for citizens?

Derek Eder: Working on and now using Look at Cook opened my eyes to what Cook County government does. In Chicago especially, there is a big disconnect between where the county begins and where the city ends. Now I can see that the county runs specific hospitals and jails, maintains highways, and manages dozens of other civic institutions. Additionally, I know how much money it is spending on each, and I can begin to understand just how $3.5 billion dollars are spent every year. If I'm interested, I can take it a step further and start asking questions about why the county spends money on what it does and how it has been distributed over the last 18 years. Examples include:

  • Why did the Clerk of the Circuit Court get a 480% increase in its budget between 2007 and 2008? See the 2008 public safety fund.
  • How is the Cook County Board President going to deal with a 74% decrease in appropriations for 2011? See the 2011 president data.
  • What happened in 2008 when the Secretary of the Board of Commissioners got its funding reallocated to the individual District Commissioners? See the 2008 corporate fund.

As a citizen, I now have a powerful tool for asking these questions and being more involved in my local government.

What data did you use?

Derek Eder: We were given budget data in a fairly raw format as a basic spreadsheet broken down into appropriations and expenditures by department and year. That data went back to 1993. Collectively, we and Commissioner Fritchey's office agreed that clear descriptions of everything were crucial to the success of the site, so his office diligently spent the time to write and collect them. They also made connections between all the data points so we could see what control officer was in charge of what department, and they hunted down the official websites for each department.

What tools did you use to build Look at Cook?

Derek Eder: Our research began with basic charts in Excel to get an initial idea of what the data looked like. Considering the nature of the data, we knew we wanted to show trends over time and let people compare departments, funds, and control officers. This made line and bar charts a natural choice. From there, we created a couple iterations of wireframes and storyboards to get an idea of the visual layout and style. Given our prior technical experience building websites at Webitects, we decided to use free tools like jQuery for front-end functionality and Google Fusion Tables to house the data. We're also big fans of Google Analytics, so we're using it to track how people are using the site.

Specifically, we used:

What design principles did you apply?

Derek Eder: Our guiding principles were clarity and transparency. We were already familiar with other popular visualizations, like the New York Times' federal budget and the Death and Taxes poster from WallStats. While they were intriguing, they seemed to lack some of these traits. We wanted to illustrate the budget in a way that anyone could explore without being an expert in county government. From a visual standpoint, the goal was to present the information professionally and essentially let the visuals get out of the way so the data could be the focus.

We feel that designing with data means that the data should do most of the talking. Effective design encourages people to explore information without making them feel overwhelmed. A good example of this is how we progressively expose more information as people drill down into departments and control officers. Effective design should also create some level of emotional connection with people so they understand what they're seeing. For example, someone may know one of the control officers or have had an experience with one of the departments. This small connection draws their attention to those areas and gets them to ask questions about why things are the way they are.

This interview was edited and condensed.


August 09 2011

Building data startups: Fast, big, and focused

This is a written follow-up to a talk presented at a recent Strata online event.

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Setting the stage: The attack of the exponentials

The question of why this style of startup is arising today, versus a decade ago, owes to a confluence of forces that I call the Attack of the Exponentials. In short, over the past five decades, the cost of storage, CPU, and bandwidth has been exponentially dropping, while network access has exponentially increased. In 1980, a terabyte of disk storage cost $14 million dollars. Today, it's at $30 and dropping. Classes of data that were previously economically unviable to store and mine, such as machine-generated log files, now represent prospects for profit.

Attack of the exponentials

At the same time, these technological forces are not symmetric: CPU and storage costs have fallen faster than that of network and disk IO. Thus data is heavy; it gravitates toward centers of storage and compute power in proportion to its mass. Migration to the cloud is the manifest destiny for big data, and the cloud is the launching pad for data startups.

Leveraging the big data stack

As the foundational layer in the big data stack, the cloud provides
the scalable persistence and compute power needed to manufacture data

At the middle layer of the big data stack is analytics, where features are extracted from data, and fed into classification and prediction algorithms.

Finally, at the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.

Let's take each of layers and discuss the competitive axes at each.

The emerging big data stack
The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed; offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lies the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

Fast data

At the base of the big data stack — where data is stored, processed, and queried — the dominant axis of competition was once scale. But as cheaper commodity disks and Hadoop have effectively addressed scalable persistence and processing, the focus of competition has shifted toward speed. The demand for faster disks has led to an explosion in interest in solid-state disk firms, such as Fusion-IO, which went public recently. And several startups, most notably MapR, are promising faster versions of Hadoop.

FusionIO and MapR represent another trend at the data layer: commercial technologies that challenge open source or commodity offerings on an efficiency basis, namely watts or CPU cycles consumed. With energy costs driving between one-third and one-half of data center operating costs, these efficiencies have a direct financial impact.

Finally, just as many large-scale, NoSQL data stores are moving from disk to SSD, others have observed that many traditional, relational databases will soon be entirely in memory. This is particularly true for applications that require repeated, fast access to a full set of data, such as building models from customer-product matrices. This brings us to the second tier of the big data stack, analytics.

Big analytics

At the second tier of the big data stack, analytics is the brains to cloud computing's brawn. Here, however, the speed is less of a challenge; given an addressable data set in memory, most statistical algorithms can yield results in seconds. The challenge is scaling these out to address large datasets, and rewriting algorithms to operate in an online, distributed manner across many machines.

Because data is heavy, and algorithms are light, one key strategy is to push code deeper to where the data lives, to minimize network IO. This often requires a tight coupling between the data storage layer and the analytics, and algorithms often need to be re-written as user-defined functions (UDFs) in a language compatible with the data layer. Greenplum, leveraging its Postgres roots, supports UDFs written in both Java and R. Following Google's BigTable, HBase is introducing coprocessors in its 0.92 release, which allows Java code to be associated with data tablets, and minimize data transfer over the network. Netezza pushes even further into hardware, embedding an array of functions into FPGAs that are physically co-located with the disks of its storage appliances.

The field of what's alternatively called business or predictive analytics is nascent, and while a range of enabling tools and platforms exist (such as R, SPSS, and SAS), most of the algorithms developed are proprietary and vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling analytical services — such as recommendation engines — that interoperate across data platforms. But in the near-term, consultancies like Accenture and McKinsey, are positioning themselves to provide big analytics via billable hours.

Outside of consulting, firms with analytical strengths push upward, surfacing focused products or services to achieve success.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Focused services

The top of the big data stack is where data products and services directly touch consumers and businesses. For data startups, these offerings more frequently take the form of a service, offered as an API rather than a bundle of bits.

BillGuard is a great example of a startup offering a focused data service. It monitors customers' credit card statements for dubious charges, and even leverages the collective behavior of users to improve its fraud predictions.

Several startups are working on algorithms that can crack the content relevance nut, including Flipboard and Klout delivers a pure data service that uses social media activity to measure online influence. My company, Metamarkets, crunches server logs to provide pricing analytics for publishers.

For data startups, data processes and algorithms define their competitive advantage. Poor predictions — whether of fraud, relevance, influence, or price — will sink a data startup, no matter how well-designed their web UI or mobile application.

Focused data services aren't limited to startups: LinkedIn's People You May Know and FourSquare's Explore feature enhance engagement of their companies' core products, but only when they correctly suggest people and places.

Democratizing big data

The axes of strategy in the big data stack show analytics to be squarely at the center. Data platform providers are pushing upwards into analytics to differentiate themselves, touting support for fast, distributed code execution close to the data. Traditional analytics players, such as SAS and SAP, are expanding their storage footprints and challenging the need for alternative data platforms as staging areas. Finally, data startups and many established firms are creating services whose success hinges directly on proprietary analytics algorithms.

The emergence of data startups highlights the democratizing consequences of a maturing big data stack. For the first time, companies can successfully build offerings without deep infrastructure know-how and focus at a higher level, developing analytics and services. By all indications, this is a democratic force that promises to unleash a wave of innovation in the coming decade.


Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.
No Soup for you

Don't be the product, buy the product!

YES, I want to SOUP ●UP for ...