January 15 2014

Four short links: 15 January 2014

  1. Hackers Gain ‘Full Control’ of Critical SCADA Systems (IT News) — The vulnerabilities were discovered by Russian researchers who over the last year probed popular and high-end ICS and supervisory control and data acquisition (SCADA) systems used to control everything from home solar panel installations to critical national infrastructure. More on the Botnet of Things.
  2. mcl -- Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs.
  3. Facebook to Launch Flipboard-like Reader (Recode) — what I’d actually like to see is Facebook join the open web by producing and consuming RSS/Atom/anything feeds, but that’s a long shot. I fear it’ll either limit you to whatever circle-jerk-of-prosperity paywall-penetrating content-for-advertising-eyeballs trades the Facebook execs have made, or else it’ll be a leech on the scrotum of the open web by consuming RSS without producing it. I’m all out of respect for empire-builders who think you’re a fool if you value the open web. AOL might have died, but its vision of content kings running the network is alive and well in the hands of Facebook and Google. I’ll gladly post about the actual product launch if it is neither partnership eyeball-abuse nor parasitism.
  4. Map Projections Illustrated with a Face (Flowing Data) — really neat, wish I’d had these when I was getting my head around map projections.
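
The mcl entry above describes clustering by simulating stochastic flow in a graph. A minimal pure-Python sketch of that idea follows. It is not the mcl tool's own code: the function names, the inflation value of 2.0, the tolerance, and the way clusters are read off the final matrix are all assumptions chosen for illustration. The loop alternates expansion (squaring a column-stochastic matrix, i.e. taking two random-walk steps) with inflation (raising entries to a power and renormalizing, which strengthens strong flows and starves weak ones).

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion: square the matrix (two steps of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(edges, n, inflation=2.0, max_iter=50, tol=1e-9):
    """Cluster an undirected graph with n nodes given as (a, b) edge pairs."""
    # Adjacency matrix with self-loops, a common MCL preprocessing step.
    m = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0
    for i in range(n):
        m[i][i] = 1.0
    m = normalize_columns(m)

    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break

    # Each row that still carries flow names one cluster: the columns it reaches.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two triangles with no edge between them form two clusters.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
print(mcl(two_triangles, 6))
```

Dense nested lists keep the sketch dependency-free, but they cost O(n^3) per expansion step, so this only suits toy graphs.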

December 27 2013

Four short links: 27 December 2013

  1. Intel XDK -- If you can write code in HTML5, CSS3 and JavaScript, you can use the Intel® XDK to build an HTML5 web app or a hybrid app for all of the major app stores. It’s a .exe. What more do I need to say? FFS.
  2. Behind the Scenes of a Dashboard Design — the design decisions that go into displaying complex info.
  3. Superconductor -- a web framework for creating data visualizations that scale to real-time interactions with up to 1,000,000 data points. It compiles to WebCL, WebGL, and web workers. (via Ben Lorica)
  4. BIDMach: Large-scale Learning with Zero Memory Allocation (PDF) — GPU-accelerated machine learning. In this paper we describe a caching approach that allows code with complex matrix (graph) expressions at massive scale, i.e. multi-terabyte data, with zero memory allocation after the initial setup. (via Siah)

January 10 2013

Four short links: 10 January 2013

  1. How To Make That One Thing Go Viral (Slideshare) — excellent points about headline writing (takes 25 to find the one that works), shareability (your audience has to click and share, then it’s whether THEIR audience clicks on it), and A/B testing (they talk about what they learned doing it ruthlessly).
  2. A More Complete Picture of the iTunes Economy — $12B/yr gross revenue through it, costs about $3.5B/yr to operate, revenue has grown at a ~35% compounded rate over last four years, non-app media 2/3 sales but growing slower than app sales. Lots of graphs!
  3. Visualizing the iOS App Store — interactive exploration of app store sales data.
  4. BORPH -- an Operating System designed for FPGA-based reconfigurable computers. It is an extended version of the Linux kernel that handles FPGAs as if they were CPUs. BORPH introduces the concept of a ‘hardware process’, which is a hardware design that runs on an FPGA but behaves just like a normal user program. The BORPH kernel provides standard system services, such as file system access to hardware processes, allowing them to communicate with the rest of the system easily and systematically. The name is an acronym for “Berkeley Operating system for ReProgrammable Hardware”.

September 09 2011

Top Stories: September 5-9, 2011

Here's a look at the top stories published across O'Reilly sites this week.

The new guy wants to hack the city's data
Instead of quietly settling in like most new residents, Tyler, Texas, transplant Christopher Groskopf is on a mission to find and unlock his new city's datasets.

RIP Michael S. Hart
Michael Hart was the founder of Project Gutenberg, an incredible visionary for online books, and someone who played an important role in Nat Torkington's life.

Look at Cook sets a high bar for open government data visualizations
One of the best recent efforts at visualizing open government data is Look at Cook, which tracks government budgets and expenditures from 1993-2011 in Cook County, Illinois.

Master a new skill? Here's your badge
The Mozilla Foundation's Erin Knight talks about how the badges and open framework of the Open Badge Project could change what "counts" as learning.

The boffins and the luvvies
Whether we're discussing ancients versus moderns, scientists versus poets, or the latest variant — computer science versus humanities, the debate between science and art is persistent and quite old.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. Save 30% on registration with the code ORM30.

August 23 2011

The nexus of data, art and science is where the interesting stuff happens

Jer Thorp (@blprnt), data artist in residence at The New York Times, was tasked a few years ago with designing an algorithm for the placement of the names on the 9/11 memorial. If an algorithm sounds unnecessarily complex for what seems like a basic bit of organization, consider this: Designer Michael Arad envisioned names being arranged according to "meaningful adjacencies," rather than by age or alphabetical order.

The project, says Thorp, is a reminder that data is connected to people, to real lives, and to the real world. I recently spoke with Thorp about the challenges that come with this type of work and the relationship between data, art and science. Thorp will expand on many of these ideas in his session at next month's Strata Conference in New York City.

Our interview follows.

How do aesthetics change our understanding of data?

Jer Thorp: I'm certainly interested in the aesthetic of data, but I rarely think when I start a project "let's make something beautiful." What we see as beauty in a data visualization is typically pattern and symmetry, something that often emerges when you find the "right" way, or one of the right ways, to represent a particular dataset. I don't really set out for beauty, but if the result is beautiful, I've probably done something right.

My work ranges from practical to conceptual. In the utilitarian projects I try not to add aesthetic elements unless they are necessary for communication. In the more conceptual projects, I'll often push the acceptable limits of complexity and disorder to make the piece more effective. Of course, often these more abstract pieces get mistaken for infographics, and I've had my fair share of Internet comment bashing as a result. Which I kind of like, in some sort of masochistic way.

What's it like working as a data artist at the New York Times? What are the biggest challenges you face?

Jer Thorp: I work in the R&D Group at the New York Times, which is tasked to think about what media production and consumption will look like in the next three years or so. So we're kind of a near-futurist department. I've spent the last year working on Project Cascade, which is a really novel system for visualizing large-scale sharing systems in real time. We're using it to analyze how New York Times content gets shared through Twitter, but it could be used to look at any sharing system — meme dispersal, STD spread, etc. The system runs live on a five-screen video wall outside the lab, and it gives us a dynamic, exploratory look at the vast conversation that is occurring at any time around New York Times articles, blog posts, etc.

It's frankly amazing to be able to work in a group where we're encouraged to take the novel path. Too many "R&D" departments, particularly in advertising agencies, are really production departments that happen to do work with augmented reality, or big data, or whatever else is trendy at the moment. There's an "R" in R&D for a reason, and I'm lucky to be in a place where we're given a lot of room to roam. Most of the credit for this goes to Michael Zimbalist, who is a great thinker and has an uncanny sense of the future. Add to that a soundly brilliant design and development team and you get a perfect creative storm.

I try to straddle the border between design, art and science, and one of my biggest challenges is to not get pulled too far in one direction. I'm always conscious when I'm starting new projects to try to face in a different direction from where I was headed last. This keeps me at that boundary where I think the most interesting things are happening. Right now I'm working on two projects that concern memory and history, which is relatively uncharted territory for me and is getting me into a mix of neurobiology and psychology research alongside a lot of art and design history. So far, it's been tremendously satisfying.

In addition to your position at the Times, you're also a visiting professor at New York University. I'm curious how you see data visualization changing the way art and technology are taught and learned.

Jer Thorp: The class I'm currently teaching is called "Data Representation." Although it does include a fair amount of visualization, we talk a lot about how data can be used in a creative practice in different ways — sculpture, performance, participatory practice, etc. I'm really excited about artists who are representing information in novel media, such as Adrien Segal and Nathalie Miebach, and I try to encourage my students to push into areas that haven't been well explored. It's an exciting time for students because there are a million new niches just waiting to be found.

This interview was edited and condensed.


February 02 2011

Four Short Links: 2 February 2011

  1. Seven Foundational Visualization Papers -- seven classics in the field that are cited and useful again and again.
  2. Git Immersion -- a "walking tour" of Git inspired by the premise that to know a thing is to do it. Cf Learn Python the Hard Way or even NASA's Planet Makeover. We'll see more and more tutorials that require participation because you don't get muscle memory by reading. (NASA link via BoingBoing)
  3. Readability -- strips out ads and sends money to the publishers you like. I'd never thought of a business model as something that's imposed from the outside quite like this, but there you go.
  4. Quora's Technology Examined (Phil Whelan) -- In this blog post I will delve into the snippets of information available on Quora and look at Quora from a technical perspective. What technical decisions have they made? What does their architecture look like? What languages and frameworks do they use? How do they make that search bar respond so quickly? Lots of Python. (via Joshua Schachter on Delicious)

January 27 2011

Four short links: 27 January 2011

  1. Mozilla Home Dash -- love this experiment in rethinking the browser from Mozilla. They call it a "browse-based browser" as opposed to "search-based browser" (hello, Chrome). Made me realize that, with Chrome, Google's achieved a 0-click interface to search--you search without meaning to as you type in URLs, you see advertising results without ever having visited a web site.
  2. Periodic Table of Google APIs -- cute graphic, part of a large push from Google to hire more outreach engineers to do evangelism, etc. The first visible signs of Google's hiring binge.
  3. NFC in the Real World (Dan Hill) -- smooth airline checkin with fobs mailed to frequent fliers.
  4. XSS Prevention Cheat Sheet (OWASP) -- HTML entity encoding doesn't work if you're putting untrusted data inside a <script> tag.
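
The cheat sheet's point is that the right encoding depends on where untrusted data lands. A small Python sketch of why HTML entity encoding alone is not enough inside a script block follows. The two helper names are hypothetical, and the script-context encoder is a simplified illustration rather than OWASP's reference implementation; a real application should lean on a vetted templating or encoding library.

```python
import html
import json


def encode_for_html_body(untrusted):
    """Entity-encode data destined for ordinary HTML body text."""
    return html.escape(untrusted, quote=True)


def encode_for_script_string(untrusted):
    """Encode data as a JavaScript string literal for use inside <script>.

    json.dumps quotes and escapes the value; the extra replacements keep
    sequences such as </script> from ending the script block early.
    """
    encoded = json.dumps(untrusted)
    return (encoded.replace("<", "\\u003c")
                   .replace(">", "\\u003e")
                   .replace("&", "\\u0026"))


payload = "alert(document.cookie)"

# The payload has no HTML-special characters, so entity encoding leaves it
# untouched. Dropped into "<script>var x = ...;</script>" it would still run.
print(encode_for_html_body(payload))

# Encoded for the script context, the payload is just an inert string literal.
print(encode_for_script_string(payload))
print(encode_for_script_string("</script><script>alert(1)</script>"))
```

The same reasoning applies to event-handler attributes, CSS, and URLs: each context needs its own encoder.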

January 25 2011

3 skills a data scientist needs

To prepare for next week's Strata Conference, we're continuing our series of conversations with big data innovators. Today, we talk with LinkedIn senior research scientist Pete Skomoroch about the core skills of data scientists.

The first skill, as you might expect, is a base in statistics, algorithms, machine learning, and mathematics. "You need to have a solid grounding in those principles to actually extract signals from this data and build things with it," Skomoroch said.

Second, a good data scientist is handy with a collection of open-source tools — Hadoop, Java, Python, among others. Knowing when to use those tools, and how to code, are prerequisites.

The third set of skills focuses on making products real and making data available to users. "That might mean data visualization, building web prototypes, using external APIs, and integrating with other services," Skomoroch said. In other words, this one's a combination of coding skills, an ability to see where data can add value, and collaborating with teams to make these products a reality.

Skomoroch's position gives him insight into the job market, what jobs are being posted, and who is hiring for which roles. He said he's glad to see new startups adding a data scientist or engineer to the founding staff. "That's a good sign."

Skomoroch discusses data science skill sets and related topics in the following video:

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD


January 21 2011

Visualization deconstructed: Mapping Facebook's friendships

In the first post in Radar's new "visualization deconstructed" series, I talked about how data visualization originated from cartography (which some now just call "mapping"). Cartography initially focused on mapping physical spaces, but at the end of the 20th century we created and discovered new spaces that were made possible by the Internet. By abstracting away the constraints of the physical space, social networks such as Facebook emerged and opened up new territories, where topology is primarily defined by the social fabric rather than physical space. But is this fabric completely de-correlated from the physical space?

Mapping Facebook's friendships

Last December, Paul Butler, an intern on Facebook's data infrastructure engineering team, posted a visualization that examined a subset of the relations between Facebook users. Users were positioned in their respective cities and arcs denoted friendships.

Paul extracted the data and started playing with it. As he put it:

Visualizing data is like photography. Instead of starting with a blank canvas, you manipulate the lens used to present the data from a certain angle.

There is definitely discovery involved in the process of creating a visualization, where by giving visual attributes to otherwise invisible data, you create a form for data to embody.

The most striking discovery that Paul made while creating his visualization was the unraveling of a very detailed map of the world, including the shapes of the continents (remember that only lines representing relationships are drawn).

If you compare the Facebook visualization with NASA's world at night pictures, you can see how close the two maps are, except for Russia and parts of China. It seems that Facebook has a big growth opportunity in these regions!

So let's have a look at Paul's visualization:

  • A complex network of arcs and lines does a great job communicating the notions of human activity and organic social fabric.
  • The choice of color palette works very well, as it immediately makes us think about night shots of earth, where the light of the city makes human activity visible. The color contrast is well balanced, so that we don't see too much blurring or bleeding of colors.
  • Choosing to draw only lines and arcs makes the visualization very interesting, as at first sight, we would think that the outlines of continents and the cities have been pre-drawn. Instead, they emerge from the drawing of arcs representing friendships between people in different cities, and we can make the interesting discovery of a possible correlation between physical location and social friendships on the Internet.

Overall, this is a great visualization that had a lot of success last December, being mentioned in numerous blogs and liked by more than 2,000 people on Facebook. However, I can see a couple ways to improve it and open up new possibilities:

  • Play with the color scale -- By using a less linear gradient as a color scale, or by using more than two colors, some other patterns may emerge. For instance, by using a clearer cut-off in the gradient, we could better see relations with a weight above a specific threshold. Also, using more than one color in the gradient might reveal the predominance of one color over another in specific regions. Again, it's something to try, and we'll probably lose some of the graphic appeal in favor of (perhaps) more insights into the data.
  • Play with the drawing of the lines -- Because the lines are spread all over the map, it's a little difficult to identify "streams" of lines that all flow in the same direction. It would be interesting to draw the lines in three parts, where the middle part would be shared by many lines, creating "pipelines" of relationships from one region to another. Of course, this would require a lot of experimentation and it might not even be possible with the tools used to draw the visualization.
  • Use a different reference to position cities -- Cities in the visualization are positioned using their geographical position, but there are other ways they could be placed. For instance, we could position them on a grid, ordered by their population, or GDP. What kind of patterns and trends would emerge by changing this perspective?
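
To make the first suggestion concrete, here is a small Python sketch comparing a plain linear color scale with one that has a hard cut-off. It is not Paul Butler's code. The colors, the 0.5 threshold, and the assumption that edge weights are already normalized to the range 0 to 1 are all illustrative choices.

```python
def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB tuples; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


DARK = (0, 0, 40)          # background-like color for weak relations
MID = (80, 120, 255)       # first clearly visible color
BRIGHT = (255, 255, 255)   # strongest relations


def linear_scale(weight):
    """Smooth two-color gradient: every weight gets a slightly different shade."""
    return lerp_color(DARK, BRIGHT, weight)


def cutoff_scale(weight, threshold=0.5):
    """Gradient with a hard cut-off: weights below the threshold fade into
    the background, and weights at or above it jump to a visible color."""
    if weight < threshold:
        return DARK
    t = (weight - threshold) / (1.0 - threshold)
    return lerp_color(MID, BRIGHT, t)


for w in (0.1, 0.4, 0.5, 0.8, 1.0):
    print(w, linear_scale(w), cutoff_scale(w))
```

With the cut-off, everything below the threshold collapses to one shade, so only the strong relations stand out. The price is the smooth glow that gives the original image much of its appeal.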

Static requires storytelling

In last week's post, I looked at an interactive visualization, where users can explore the data and its different representations. With the Facebook data, we have a static visualization where we can only look, not touch — it's like gazing at the stars.

Although a static visualization has the potential to evolve into an interactive visualization, I think creating a static image involves a little bit more care. Interactive visualizations can be used as exploration tools, but static visualizations need to present the insight the data explorer had when creating the visualization. They have to tell a story to be interesting.


January 07 2011

Visualization deconstructed: New York Times "Mapping America"

Data visualization is an emerging domain that is deeply rooted in the tradition of cartography, having evolved to match the quantity and diversity of data we find in today's technological environment.

In this first post in an ongoing data visualization series, I'll take a closer look at the New York Times' Mapping America interactive map of the American census data. This subject also gives me an opportunity to talk briefly about the relationship between cartography and visualization.

As the ancestor of data visualization, cartography was initially used to navigate the land and the sea, to give us an overview of our physical space and help us explore the world more safely. Maps were catalysts for the development of human societies. Cartography quickly evolved to display not only the shapes of the land but also location-specific data, such as temperature, population or tax income. By overlaying location-specific data on top of topographic maps, cartography provided a tool for everyone to take a step back, and discover their environment in a way they couldn't imagine before.

Data visualization is all about extending the concept of cartography to mapping any kind of data, whether numerical, spatial, textual or social. As David McCandless said, "by visualizing information ... we turn it into a landscape," a virtual landscape that we can then explore to discover hidden trends and patterns that will help us better understand the world we live in.

The New York Times' "Mapping America" visualization is a good illustration of this strong heritage between cartography and data visualization. It consists of an interactive map of data extracted from the American Community Survey Census, based on samples from 2005 to 2009 and including indicators such as ethnic groups, income, housing, families and education.

Mapping America

From a purely graphical standpoint, the "Mapping America" visualization is a very good example of clean, simple, careful design:

  • The topographic base is a custom-styled Google Maps overlay. By using subtle shades of blue-gray to denote borders and geological elements, the map blends into the background and lets the viewer focus on the mapped indicators.
  • The use of colored dots for ethnicity and income, and colored areas for the other indicators, takes into account population density (with dots) or leaves it out of the equation (with areas) when it is not relevant.
  • Details are shown when you hover the mouse, letting you get the numbers for the indicator and current area. (See this in action on the Times' website.)
  • The interaction is very smooth. Tooltip and areas appear and disappear gracefully.

The presence of controls for interacting with the representation of the data in this visualization is important: you can enter a zip code, city or address to go to a specific location, or change the indicator being displayed without resetting the map. These features might sound simple, as we're now used to services like Google Maps, but interaction is one of the key elements of data visualization. The user is not only a viewer, he or she becomes an explorer who can use the visualization as a tool to understand what is going on.

The choice of colors in this visualization is also worth mentioning. The palette is based on a playful set of pastel colors (green, blue, yellow, red) which are then adapted for every indicator. Some indicators use only one color in different shades (education); others use a gradient between two main colors (when the indicator displays a change). Not only is the type of palette linked to the type of indicator displayed (same color by default, two colors when the indicator denotes a change) but sometimes specific colors are picked for their connotation. For example, the map of households earning under $30K uses red, while the other earning maps use green.

Finally, if you find something interesting using the visualization, you can share a specific URL. This type of targeted sharing encourages discussion and new insights.

Data visualization to empower people

By creating a tool that is easy to use, and that is true to the original data, the New York Times opened up a new range of possibilities. The census data is a cornerstone of social statistics and studies, but without proper tools it is difficult for most people to comprehend. This example taps into the power of visualization: it makes complex information simpler to understand.

December 31 2010

Four short links: 31 December 2010

  1. The Joy of Stats -- Hans Rosling's BBC documentary on statistics, available to watch online.
  2. Best Tech Writing of 2010 -- I need a mass "add these to Instapaper" button. (via Hacker News)
  3. Google Shared Spaces: Why We Made It (Pamela Fox) -- came out of what people were trying to do with Google Wave.
  4. The Great Delicious Exodus -- traffic graph as experienced by pinboard.

December 29 2010

Four short links: 29 December 2010

  1. datastore -- implementation of Google App Engine Datastore in Java, running on hbase and hadoop. (via Hacker News)
  2. Mining of Massive Datasets -- 340 page book from Stanford with the best copyright cautionary coverletter: we expect that you will acknowledge our authorship if you republish parts or all of it. We are sorry to have to mention this point, but we have evidence that other items we have published on the Web have been appropriated and republished under other names. It is easy to detect such misuse, by the way, as you will learn in Chapter 3. (via Delicious)
  3. Wordcram -- generate word clouds in Processing. (via jandot on Twitter)
  4. URL Design -- the why and how of designing your URLs. Must-read. (via kneath on Twitter)
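
The cover letter's joke lands because Chapter 3 of the book covers finding similar items, which is the technique you would use to spot a republished copy. A minimal Python sketch of the basic idea (character shingles compared with Jaccard similarity) follows. The function names and the shingle length of 5 are assumptions for illustration, and the sketch leaves out the minhashing and locality-sensitive hashing the book uses to scale the comparison to large collections.

```python
def shingles(text, k=5):
    """Return the set of all k-character substrings of the normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "We expect that you will acknowledge our authorship if you republish parts of it."
copied = "We expect you will acknowledge our authorship if you republish parts of it."
unrelated = "Generate word clouds in Processing with a few lines of code."

print(jaccard(shingles(original), shingles(copied)))     # high overlap
print(jaccard(shingles(original), shingles(unrelated)))  # low overlap
```

A lightly edited copy shares most of its shingles with the original, so its score stays close to 1 while unrelated text scores near 0.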

December 27 2010

Four short links: 27 December 2010

  1. emscripten -- LLVM to JavaScript compiler. Any code that compiles to LLVM can run in the browser (Python, Lua, C++). LLVM is an open source virtual machine that Apple bought into (literally, they hired the developer).
  2. 30 Lessons Learned in Computing Over The Last 10 Years -- Backup every day at the minimum, and test restores every week. I don't think I've worked at an organisation that didn't discover at one point that they couldn't restore from their backups. Many other words of wisdom, and this one rang particularly true: all code turns into shit given enough time and hands. (via Hacker News)
  3. What Your Computer Does While You Wait -- top-to-bottom understanding of your system makes you a better programmer.
  4. How to Visualize the Competition -- elegant graphing of strategy. (via Dave Moskovitz on Twitter)

December 16 2010

Open data study shows progress, but a long road ahead for open government

A new report on the attitudes, quality and use of open government data shows strong support for the release of open data among citizens and government employees. While the success of New York State Governor-Elect Cuomo or Rhode Island Governor-Elect Lincoln Chafee in the 2010 election didn't provide sufficient data points in and of themselves, this report showed that, by a 3 to 1 margin, the citizens surveyed are more likely to vote for politicians who champion open government. The full results of the open data benchmark study are available online.

"The findings of this study support what Sunlight has been seeing from our open government stakeholders in the public sector, the tech community and citizen advocates," said Ellen Miller, co-founder and executive director of the Sunlight Foundation in a prepared statement. "The current commitment among all of those working to advance open government shows that we are at a good starting point, but more hard work is still ahead of us in order to create the promise of a truly open government."

Supporters of the Sunlight Foundation's transparency work were no doubt pleased to hear that 67.9 percent of surveyed citizens and 92.6 percent of surveyed government employees indicated that if open government data is made public, it should be publicly available online.

"The transformative impact of Open Data will become self-perpetuating, but is not there yet," said Kevin Merritt, founder and CEO of Socrata in a statement. "The flywheel effect requires two things: significantly more high-value data that is universally accessible; and more active engagement between governments, citizens and developers."

Socrata, a three-year-old Seattle-based startup that provides social data discovery services for opening government data, delivered the report in a partnership with the Sunlight Foundation, Personal Democracy Forum, GovLoop, Code for America and David Eaves. The report is based on data from three surveys conducted between August and October 2010.

The results of the open data benchmark survey were grim with respect to how developers rated the availability of government data. Safouen Raba, vice president of marketing at Socrata, said that of those surveyed, only 30 percent said that government data was available, and of that, 50 percent was unusable. "There's a lot of munging necessary on the data side, along with screen scraping," Raba said at the Open Government Data Summit at GOSCON earlier this year. Developers surveyed indicated issues with data timeliness, accuracy, usable formats, metadata schemas, consistency, and incomplete data sets.

While the government stakeholders surveyed indicated that 21.5 percent of government organizations are "actively engaging with developers to build applications," 40.9 percent stated they had no current plans to engage developers.

[Chart: Developers Rate the Current State of Gov Data Accessibility]

The results of the study also suggest that there's a long way to go for the release of open government data. Less than a quarter of government organizations surveyed reported the launch of an open data site. Large majorities of citizens surveyed said they'd never heard of open data initiatives.

Open data proponents within government face major obstacles in government, with some 27 percent of respondents citing lack of political will or leadership, along with lack of funding (19 percent) and privacy and security concerns (16.5 percent).

That said, 55.6 percent of government organizations surveyed reported that they do have a mandate to share public data with the public, with some 48.1 percent already publishing data in some form.

Another key takeaway: 63 percent of citizens surveyed indicated that they prefer to explore and interact with data online, as opposed to downloading data to examine in a spreadsheet. Given that downloadable files are currently the prevalent mode of publishing government data electronically, there are clear takeaways for policy makers.

When it comes to motivation for open government data initiatives, it's not hard to see the effect of the Open Government Directive at the federal level. By way of contrast, the open data survey results also showed that motivators aren't strongly reported at the state, municipal and county level.

Transparency Motivators - By Type of Government

Powered by Socrata

"The single best thing we could do in open government is to get the American people engaged in the question of what high value data is," said Aneesh Chopra, the first United States chief technology officer, speaking at Politico's "What's Next in Tech" forum in Union Station this fall. Now there's more insight on that question. The five most important data categories identified in the survey are public safety, revenue and expenditure, accountability (like campaign finance or voting records), education, and information about where and how government services can be accessed.

This survey provides valuable insight to anyone interested in the progress, attitudes and prospects for open government data. Best of all, it's publicly available online, so share it, embed it, and build visualizations.

Strata Week: Shop 'til you drop

Need a break from the holiday madness? You're not alone. Check out these items of interest from the land of data and see why even the big consumers face tough choices.

Does this place accept returns?

On Monday, Stack Overflow announced that they have moved the Stack Exchange Data Explorer (SEDE) off of the Windows Azure platform and onto in-house hardware.


SEDE is an open source, web-based tool for querying the monthly data dump of Creative Commons data from Stack Exchange's four main Q&A sites (Stack Overflow, Server Fault, Super User, and Meta) as well as other sites in the Stack Exchange family. The primary reason given (in a polite write-up by Jeff Atwood and SEDE lead Sam Saffron) was the desire to have fine-tuned control over the platform.

When you are using a [Platform-as-a-Service] you are giving up a lot of control to the service provider. The service provider chooses which applications you can run and imposes a series of restrictions. ... It was disorienting moving to a platform where we had no idea what kind of hardware was running our app. Giving up control of basic tools and processes we use to tune our environment was extremely painful.

While the support that comes with Platform-as-a-Service was acknowledged, it seems that the ability to better automate, adjust, and perpetuate processes and systems with more fine-grained control won out as a bigger convenience.

Where did you get that lovely platform?

Of course, one company's headache is another's dream. Netflix, a company known for playing with big data and crowdsourcing solutions "before it was cool," posted on Tuesday the four reasons they've chosen to use Amazon Web Services (AWS) as their platform and have moved onto it over the last year.

Laudably, the company states that it viewed its tremendous recent growth (in terms of both members and streaming devices) as a license to question everything in the necessary process of re-architecting. Instead of building out their own data centers, etc., they decided to answer that set of questions by paying someone else to worry about it.

Also to their credit, Netflix has enough self-awareness to know what they are and aren't good at. Building top-notch recommendation systems and providing entertainment? You betcha. Predicting customer growth and device engagement? Not so much.

How many subscribers would you guess used our Wii application the week it launched? How many would you guess will use it next month? We have to ask ourselves these questions for each device we launch because our software systems need to scale to the size of the business, every time.

Self-awareness is in fact the primary lesson in both Netflix's and Stack Exchange's platform decisions. If you feel your attention is better spent elsewhere, write a check. If you've got the time and expertise to hone your hardware, roll your own.

[Of course, Netflix doesn't go for the pre-packaged solutions every time. They also posted recently about why they love open source software, and listed among the projects they make use of and contribute back to: Hadoop, Hive, HBase, Honu, Ant, Tomcat, Hudson, Ivy, Cassandra, etc.]

With what shall we shop?

The New York Times this week released a cool group of interactive maps based on data collected in the Census Bureau's American Community Survey (ACS) from 2005 to 2009. Data is compared against the 2000 census to uncover rates of change.

[While similar to the census, the ACS is conducted every year instead of every 10 years. The ACS includes only a sampling of addresses instead of a comprehensive inventory. It covers much of the same ground on population (age, race, disability status, family relationships), but it also asks for information that is used to help make funding distribution decisions about community services and institutions.]

The Times maps explore education levels; rent, mortgage rates, and home values; household income; and racial distribution. Viewers can select among 22 maps in these four categories, and then pan and zoom to view national, state, or local trends down to the level of individual census tracts.

Above is the national view of the map that looks at change in median household income. The ACS website itself provides some maps displaying the survey numbers from the 2000 census and the 2005-2009 survey, as well as a listing of data tables.

The Times map shows the uneven way in which these numbers have gone up or down in various parts of the country, with some surprising results that are worth exploring. Note that the blue regions are places where income has dropped, and the yellow regions are places where it has increased. (No wonder a lot of us are getting creative with holiday shopping.)

If this kind of research floats your boat, check out Social Explorer, the mapping tool used to create the New York Times maps.

Even markets like to buy things

The emerging landscape of custom data markets is already shifting as Infochimps recently announced the acquisition of Data Marketplace, a start-up incubated at Y Combinator.

While Stewart Brand may be right in thinking information wants to be free, there's also enormous value to be added by aggregating, structuring, and packaging data, as well as in matching up buyers with sellers. That's the main service Data Marketplace aims to provide, particularly in the field of financial data.

At Infochimps, information is offered a la carte, and many of the site's datasets are offered for free. These include sets as diverse as "Word List - 100,000+ official crossword words (Excel readable)", "Measuring Worth: Interest Rates - US & UK 1790-2000", and "Retrosheet: Game Logs (play-by-play) for Major League Baseball Games." Data Marketplace is a bit different, in that it allows users to enter requests for data (with a deadline and budget, if desired) and then matches up would-be buyers with data providers.

Infochimps has said that Data Marketplace, which is less than a year old, will continue to operate as a standalone site, although its founders Steve DeWald and Matt Hodan will depart for new projects.

If you're interested in the burgeoning business of aggregated datasets, be sure to check out the Data Marketplaces panel I'll be moderating at Strata in February.

Not yet signed up for Strata? Register now and save 30% with the code STR11RAD.

December 14 2010

Four short links: 14 December 2010

  1. The Million Follower Fallacy (PDF) -- We found that indegree represents a user’s popularity, but is not related to other important notions of influence such as engaging audience, i.e., retweets and mentions. Retweets are driven by the content value of a tweet, while mentions are driven by the name value of the user. Such subtle differences lead to dissimilar groups of the top Twitter users; users who have high indegree do not necessarily spawn many retweets or mentions. This finding suggests that indegree alone reveals very little about the influence of a user. Research confirms what we all knew, that idiots who chase follower numbers have the influence they deserve. (via Steve O'Grady on Twitter, indirectly)
  2. Geocoding Github: Visualizing Distributed Open-Source Development -- work for the Stanford visualization class, plotting open source commits on maps over time. See this page for the interactive explorer. (via Michael Driscoll on Twitter)
  3. ArduPilotMega 1.0 Launched -- autopilot built on the Arduino platform. (via Chris Anderson on Twitter)
  4. Lessons of the Gawker Security Mess (Forbes blog) -- nice deconstruction of what happened. In the chat, Gawker’s Hamilton Nolan, after hearing that it is just Gawker users who have been compromised, remarks “oh, well. unimportant”. Gawker’s Richard Lawson wants to know if the breach is limited to “just the peasants?” Don't trash talk about your users in company channels. The business that forgets it lives and dies on its customers is a business that will eventually be hated by its customers. (via Nahum Wild on Twitter)

December 09 2010

Strata Gems: Make beautiful graphs of your Twitter network

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Explore and visualize graphs with Gephi.

Where better to start analyzing social networks than with your own? Using the graphing tool Gephi and a little bit of Python script, you can analyze your own Twitter network, revealing the inherent structure among those you follow. It's also a fun way to learn more about network analysis.

Inspired by the LinkedIn Gephi graphs, I analyzed my Twitter friend network. I took everybody that I followed on Twitter, and found out who among them followed each other. I've shared the Python code I used to do this on

To use the script, you need to create a Twitter application and use command-line OAuth authentication to get the tokens to plug into the script. Writing about that is a bit gnarly for this post, but the easiest way I've found to authenticate a script with OAuth is by using the oauth command-line tool that ships with the Ruby OAuth gem.

The output of my Twitter-reading tool is a graph, in GraphML, suitable for import into Gephi. The graph has a node for each person, and an edge for each "follows" relationship. On initial load into Gephi, the graph looks a bit like a pile of spider webs, not showing much information.
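As a rough illustration of that output format (not the script referred to above), here is a minimal sketch that turns "who follows whom" data into GraphML using only the Python standard library. The inputs `friends` and `following`, and both function names, are hypothetical; it assumes the follow lists have already been fetched from Twitter by some other means.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def follows_among(friends, following):
    """Return (follower, followed) pairs where both people are in `friends`.

    `friends` is the set of accounts I follow; `following` maps each account
    to the set of accounts it follows. Both are hypothetical inputs that a
    Twitter-reading script would have collected beforehand.
    """
    edges = []
    for person in sorted(friends):
        for target in sorted(following.get(person, set()) & friends):
            if target != person:
                edges.append((person, target))
    return edges


def to_graphml(friends, edges):
    """Serialize a directed graph as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    # One node per person.
    for person in sorted(friends):
        ET.SubElement(graph, "node", id=person)
    # One edge per "follows" relationship.
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")
```

Writing the returned string to a file with a .graphml extension should give something Gephi can open, though a real export would probably also want node labels and other attributes.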

I wanted to show a couple of things in the graph: cluster closely related people, and highlight the well-connected people. To find related groups of people, you can use Gephi to analyze the modularity of the network, and then color nodes according to the discovered communities. To find the well-connected people, run the "Degree Power Law" statistic in Gephi, which will calculate the betweenness centrality for each person: essentially, a measure of how much of a hub they are.
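For a feel of what betweenness centrality measures, here is a small standalone sketch of Brandes' algorithm for a directed, unweighted graph. It is an illustration only and makes no claim about how Gephi computes the figure internally; the function name and the shape of its inputs are assumptions, and the edge list is assumed to contain no duplicates.

```python
from collections import deque


def betweenness_centrality(nodes, edges):
    """Unnormalized betweenness centrality for a directed, unweighted graph.

    Uses Brandes' algorithm: one breadth-first search per source node,
    followed by a back-propagation of shortest-path dependencies.
    """
    neighbors = {node: [] for node in nodes}
    for source, target in edges:
        neighbors[source].append(target)

    score = {node: 0.0 for node in nodes}
    for start in nodes:
        # Breadth-first search, counting shortest paths from `start`.
        order = []
        predecessors = {node: [] for node in nodes}
        path_count = {node: 0 for node in nodes}
        distance = {node: -1 for node in nodes}
        path_count[start] = 1
        distance[start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            order.append(current)
            for nxt in neighbors[current]:
                if distance[nxt] < 0:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
                if distance[nxt] == distance[current] + 1:
                    path_count[nxt] += path_count[current]
                    predecessors[nxt].append(current)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in nodes}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != start:
                score[node] += dependency[node]
    return score
```

A person scores highly when many shortest paths between other people pass through them, which is why the hubs of a follow graph stand out on this measure.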

These steps are neatly laid out in a great slide deck from Sociomantic Labs on analyzing Facebook social networks. Follow the tips there and you'll end up with a beautiful graph of your network that you can export to PDF from Gephi.

Social graph
Overview of my social graph: click to view the full PDF version

The final result for my network is shown above. If you download the full PDF, you'll notice there are several communities, which I'll explain for interest. The mass of pink is predominantly my O'Reilly contacts, dark green shows the Strata and data community, the lime green the Mono and GNOME worlds, mustard shows the XML and open source communities. The balance of purple is assorted technologist friends.

Finally my sporting interests are revealed: the light blue are cricket fans and commentators, the red Formula 1 motor racing. Unsurprisingly, Tim O'Reilly, Stephen Fry and Miguel de Icaza are big hubs in my network. Your own graphs will reveal similar clusters of people and interests.

If this has whetted your appetite, you can discover more about mining social networks at Matthew Russell's Strata session, Unleashing Twitter Data For Fun And Insight.

December 08 2010

Strata Gems: Explore and visualize graphs with Gephi

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Five data blogs you should read.

If you need to explore your data as a graph, then Gephi is a great place to start. An open source project, Gephi is the ideal tool for exploring data and analyzing networks.

Gephi is available for Windows, Linux and OS X. You can get started by downloading and installing Gephi, and playing with one of the example data sets.

Gephi is a sophisticated tool. A "Photoshop for data", it offers a rich palette of features, including those specialized for social network analysis.

Gephi screenshot

Graphs can be loaded and created using many common graph file formats, and explored interactively. Hierarchical graphs such as social networks can be clustered in order to extract meaning. Gephi's layout algorithms automatically give shape to a graph to help exploration, and you can tinker with the colors and layout parameters to improve communication and appearance.

Following the Photoshop metaphor, one of the most powerful aspects of Gephi is that it is extensible through plugins. Though the plugin ecosystem is just getting started, existing plugins let you export a graph for publication on the web and experiment with additional layouts. The AlchemyAPI plugin uses natural language processing to identify real world entities from graph data, and shows the promise of connecting Gephi to web services.

Earlier this year, DJ Patil from LinkedIn brought Gephi-generated graphs of LinkedIn social networks to O'Reilly's Foo Camp. Aside from importing the data, very little manipulation was needed inside Gephi. In this video he explains the social networks of several participants.

December 07 2010

Strata Gems: Five data blogs you should read

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: The timeless utility of sed and awk.

Whether your interest in data is professional or casual, commercial or political, there's a blog out there for you. Feel free to add your own suggestions to the comments at the bottom.

Measuring Measures @bradfordcross

Eclectic, thoughtful and forthright, Bradford Cross spends his time making research work in practice: from hedge funds to data-driven startup FlightCaster. His blog covers topics ranging from venture capital and startups to coding in Clojure.

Dataists @vsbuffalo @hmason

Unapologetically geeky, and subtitled Fresher than seeing your model doesn't have heteroscedastic errors, Dataists is a group blog featuring contributions from writers in the New York data scene, such as Hilary Mason and Drew Conway. Dataists includes an insightful mix of instruction and opinion.

Flowing Data @flowingdata

Consistently excellent, Nathan Yau's Flowing Data blog is a frequently updated stream of articles on visualization, statistics and data. Always pretty to look at, the blog often includes commentary and coverage of topical data stories.

Flowing Data Blog

Guardian Data Blog @datastore

Subtitled Facts are sacred, and part of the UK Guardian's pioneering approach to online content, this blog uncovers the stories behind public data. Edited by Strata keynoter Simon Rogers.

Pete Warden @petewarden

Founder of OpenHeatMap, Pete Warden keeps a personal blog with a strong component of data and visualization topics, as well as commentary on the emerging data industry: most recently, Data is snake oil.

And plenty more...

While these are some of my favorites, the question and answer web site Quora has a more exhaustive list of data blogs.
