Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

January 18 2012

Four short links: 18 January 2012

  1. Many Core Processors -- not the first time I've heard nondeterministic computing discussed as a solution to some of our parallel-programming travails. Can't imagine what a pleasure it is to debug.
  2. Pinterest Cloned -- it's not the pilfering of the idea that offends my sensibilities, it's the blatant clone of every aspect of the UI. I never thought much of the old Apple look'n'feel lawsuit but this really rubs me the wrong way.
  3. What You May Not Know About jQuery -- far more than DOM and AJAX calls. (via Javascript Weekly)
  4. Spark -- Scala-implemented alternative framework to the model of parallelism in MapReduce. (via Pete Warden)

December 06 2011

Four short links: 6 December 2011

  1. How to Dispel Your Illusions (NY Review of Books) -- Freeman Dyson writing about Daniel Kahneman's latest book. Only by understanding our cognitive illusions can we hope to transcend them.
  2. Appify-UI (github) -- Create the simplest possible Mac OS X apps. Uses HTML5 for the UI. Supports scripting with anything and everything. (via Hacker News)
  3. Translation Memory (Etsy) -- using Lucene/SOLR to help automate the translation of their UI. (via Twitter)
  4. Automatically Tagging Entities with Descriptive Phrases (PDF) -- Microsoft Research paper on automated tagging. Under the hood it uses Map/Reduce and the Microsoft Dryad framework. (via Ben Lorica)

October 13 2011

Strata Week: Simplifying MapReduce through Java

Here are a few of the data stories that caught my attention this week:

Crunch looks to make MapReduce easier

Despite the growing popularity of MapReduce and other data technologies, there's still a steep learning curve associated with these tools. Some have even wondered if they're worth introducing to programming students.

All of this makes the introduction of Crunch particularly good news. Crunch is a new Java library from Cloudera that's aimed at simplifying the writing, testing, and running of MapReduce pipelines. In other words, developers won't need to write a lot of custom code or libraries, which as Cloudera data scientist Josh Willis points out, "is a serious drain on developer productivity."

He adds that:

Crunch shares a core philosophical belief with Google's FlumeJava: novelty is the enemy of adoption. For developers, learning a Java library requires much less up-front investment than learning a new programming language. Crunch provides full access to the power of Java for writing functions, managing pipeline execution, and dynamically constructing new pipelines, obviating the need to switch back and forth between a data flow language and a real programming language.

The Crunch library has been released under the Apache license, and the code can be downloaded here.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Querying the web with Datafiniti

DatafinitiDatafiniti launched this week into public beta, calling itself the "first search engine for data." That might just sound like a nifty startup slogan, but when you look at what Datafiniti queries and how it works, the engine begins to look profoundly ambitious and important.

Datafiniti enables its users to enter a search query (or make an API call) against the web. Or, that's the goal at least. As it stands, Datafiniti lets users make calls about location, products, news, real estate, and social identity. But that's a substantial number of datasets, using information that's publicly available on the web.

Although Datafiniti demands you enter SQL parameters, it tries to make the process of doing so fairly easy, with a guide that pops up beneath the search box to help you phrase things properly. That interface is just one of the indications that Datafiniti is making a move to help democratize big data search.

The company grew out of a previous startup named 80Legs. As Shion Deysarker, founder of Datafiniti told me, it was clear that the web-crawling services provided by 80Legs were really just being utilized to ask specific queries. Things like, what's the average listing price for a home in Houston? How many times has a brand name been mentioned on Twitter or Facebook over the last few months? And so on.

Deysarker frames Datafiniti in terms of data access, arguing that until now a few providers have controlled the data. The startup wants to help developers and companies overcome both access and expense issues associated with gathering, processing, curating and accessing datasets. It plans to offer both subscription-based and unit-based pricing.

Keep tabs on the Large Hadron Collider from your smartphone

LHSee screenshotNew apps don't often make it into my data news roundup, but it's hard to ignore this one: LHSee is an Android app from the University of Oxford that delivers data directly from the ATLAS experiment at CERN. The app lets you see data from collisions at the Large Hadron Collider.

The ATLAS experiment describes itself as an effort to learn about "the basic forces that have shaped our Universe since the beginning of time and that will determine its fate. Among the possible unknowns are the origin of mass, extra dimensions of space, unification of fundamental forces, and evidence for dark matter candidates in the Universe."

The LHSee app provides detailed information into how CERN and the Large Hadron Collider work. It also offers a "Hunt the Higgs Boson" game as well as opportunities to watch 3-D collisions streamed live from CERN. The app is available for free through the Android Market.

Got data news?

Feel free to email me.


September 16 2011

Four short links: 16 September 2011

  1. A Quick Buck by Copy and Paste -- scorching review of O'Reilly's Gamification by Design title. tl;dr: reviewer, he does not love. Tim responded on Google Plus. Also on the gamification wtfront, Mozilla Open Badges. It talks about establishing a part of online identity, but to me it feels a little like a Mozilla Open Gradients project would: cargocult-confusing the surface for the substance.
  2. Google + API Launched -- first piece of a Google + API is released. It provides read-only programmatic access to people, posts, checkins, and shares. Activities are retrieved as triples of (subject, verb, object), which is semweb cute and ticks the social object box, but is unlikely in present form to reverse Declining numbers of users.
  3. Cube -- open source time-series visualization software from Square, built on MongoDB, Node, and Redis. As Artur Bergman noted, the bigger news might be that Square is using MongoDB (known meh).
  4. Tenzing -- an SQL implementation on top of Map/Reduce. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries. (via Raphaël Valyi)

September 08 2011

Strata Week: MapReduce gets its arms around a million songs

Here are some of the data stories that caught my attention this week.

A millions songs and MapReduce

Million Song DatasetEarlier this year, Echo Nest and LabROSA at Columbia University released the Million Song Dataset, a freely available collection of audio and metadata for a million contemporary popular music tracks. The purpose of the dataset, among other things, was to help encourage research on music algorithms. But as Paul Lamere, director of Echo Nest's Developer Platform, makes clear, getting started with the dataset can be daunting.

In a post on his Music Machinery blog, Lamere explains how to use Amazon's Elastic MapReduce to process the data. In fact, Echo Nest has loaded the entire Million Song Dataset onto a single S3 bucket, available at The bucket contains approximately 300 files, each with data on about 3,000 tracks. Lamere also points to a small subset of the data — just 20 tracks — available in a file on GitHub, and he also created to parse track data and return a dictionary containing all of it.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

GPS steps in where memory fails

Garmin 305 GPS deviceAfter decades of cycling without incident, The New York Times science writer John Markoff experienced what every cyclist dreads: a major crash, one that resulted in a broken nose, a deep gash on his knee, and road rash aplenty. He was knocked unconscious by the crash, unable to remember what had happened to cause it. In a recent piece in the NYT, he chronicled the steps he took to reconstruct the accident.

He did so by turning to the GPS data tracked by the Garmin 305 on his bicycle. Typically, devices like this are utilized to track the distance and location of rides as well as a cyclist's pedaling and heart rates. But as Markoff investigated his own crash, he found that the data stored in these types of devices can be use to ascertain what happens in cycling accidents.

In investigating his own memory-less crash, Markoff was able to piece together data about his trip:

My Garmin was unharmed, and when I uploaded the data I could see that in the roughly eight seconds before I crashed, my speed went from 30 to 10 miles per hour — and then 0 — while my heart rate stayed a constant 126. By entering the GPS data into Google Maps, I could see just where I crashed. I realized I did have several disconnected memories. One was of my hands being thrown off the handlebars violently, but I had no sense of where I was when it happened. With a friend, Bill Duvall, who many years ago also raced for the local bike club Pedali Alpini, I went back to the spot. La Honda Road cuts a steep and curving path through the redwoods. Just above where the GPS data said I crashed, we could see a long, thin, deep pothole. (It was even visible in Google's street view.) If my tire hit that, it could easily have taken me down. I also had a fleeting recollection of my mangled dark glasses, and on the side of the road, I stooped and picked up one of the lenses, which was deeply scratched. From the swift deceleration, I deduced that when my hands were thrown from the handlebars, I must have managed to reach my brakes again in time to slow down before I fell. My right hand was pinned under the brake lever when I hit the ground, causing the nasty road rash.

It's one thing for a rider to reconstruct his own accident, but Markoff says insurance companies are also starting to pay attention to this sort of data. As one lawyer notes in the Times article, "Frankly, it's probably going to be a booming new industry for experts."

Crowdsourcing and crisis mapping from WWI

The explosion of mobile, mapping, and web technologies has facilitated the rise of crowdsourcing during crisis situations, giving citizens and NGOs — among others — the ability to contribute to and coordinate emergency responses. But as Patrick Meier, director of crisis mapping and partnerships at Ushahidi has found, there are examples of crisis mapping that pre-date our Internet age.

Meier highlights maps he discovered from World War I at the National Air and Space Museum, pointing to the government's request for citizens to help with the mapping process:

In the event of a hostile aircraft being seen in country districts, the nearest Naval, Military or Police Authorities should, if possible, be advised immediately by Telephone of the time of appearance, the direction of flight, and whether the aircraft is an Airship or an Aeroplane.

And he asks a number of very interesting questions: How often were these maps updated? What sources were used? And "would public opinion at the time have differed had live crowdsourced crisis maps existed?"

Got data news?

Feel free to email me.


July 25 2011

Four short links: 25 July 2011

  1. Anonymity in Bitcoin -- TL;DR: Bitcoin is not inherently anonymous. It may be possible to conduct transactions is such a way so as to obscure your identity, but, in many cases, users and their transactions can be identified. We have performed an analysis of anonymity in the Bitcoin system and published our results in a preprint on arXiv. (via Hacker News)
  2. 3D Printing + Algorithmic Generation -- clever designers use algorithms based on leaf vein generation to create patterns for lamps, which are then 3d-printed. (via Imran Ali)
  3. Manimal: Relational Optimization for Data-Intensive Programs (PDF) -- static code analysis to detect MapReduce program semantics and thereby enable wholly-automatic optimization of MapReduce programs. (via BigData)
  4. Screenfly -- preview your site in different devices' screen sizes and resolutions. (via Smashing Magazine)

June 23 2011

Four short links: 23 June 2011

  1. The Wisdom of Communities -- Luke Wroblewski's notes from Derek Powazek's talk at Event Apart. Wisdom of Crowds theory shows that, in aggregate, crowds are smarter than any single individual in the crowd. See this online in most emailed features, bit torrent, etc. Wise crowds are built on a few key characteristics: diversity (of opinion), independence (of other ideas), decentralization, and aggregation.
  2. How to Fit an Elephant (John D. Cook) -- for the stats geeks out there. Someone took von Neumann's famous line "with four parameters I can fit an elephant, and with five I can make him wiggle his trunk", and found the four complex parameters that do, indeed, fit an elephant.
  3. How to Run a News Site and Newspaper Using Wordpress and Google Docs -- clever workflow that's digital first but integrated with print. (via Sacha Judd)
  4. All Watched Over: On Foo, Cybernetics, and Big Data -- I'm glad someone preserved Matt Jones's marvelous line, "the map-reduce is not the territory". (via Tom Armitage)

January 12 2011

Hadoop: What it is, how it works, and what it can do

HadoopHadoop gets a lot of buzz these days in database and content management circles, but many people in the industry still don't really know what it is and or how it can be best applied.

Cloudera CEO and Strata speaker Mike Olson, whose company offers an enterprise distribution of Hadoop and contributes to the project, discusses Hadoop's background and its applications in the following interview.

Where did Hadoop come from?

Mike OlsonMike Olson: The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.

What problems can Hadoop solve?

Mike Olson: The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD

How is Hadoop architected?

Mike Olson: Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copy stores, data stored on a server that goes offline or dies can be automatically replicated from a known good copy.

In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.

Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.

At this point, do companies need to develop their own Hadoop applications?

Mike Olson: It's fair to say that a current Hadoop adopter must be more sophisticated than a relational database adopter. There are not that many "shrink wrapped" applications today that you can get right out of the box and run on your Hadoop processor. It's similar to the early '80s when Ingres and IBM were selling their database engines and people often had to write applications locally to operate on the data.

That said, you can develop applications in a lot of different languages that run on the Hadoop framework. The developer tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools so that they're able to talk to data stored in a Hadoop cluster using Hadoop APIs. There are specialist vendors that are up and coming, and there are also a couple of general process query tools: a version of SQL that lets you interact with data stored on a Hadoop cluster, and Pig, a language developed by Yahoo that allows for data flow and data transformation operations on a Hadoop cluster.

Hadoop's deployment is a bit tricky at this stage, but the vendors are moving quickly to create applications that solve these problems. I expect to see more of the shrink-wrapped apps appearing over the next couple of years.

Where do you stand in the SQL vs NoSQL debate?

Mike Olson: I'm a deep believer in relational databases and in SQL. I think the language is awesome and the products are incredible.

I hate the term "NoSQL." It was invented to create cachet around a bunch of different projects, each of which has different properties and behaves in different ways. The real question is, what problems are you solving? That's what matters to users.


January 06 2011

Big data faster: A conversation with Bradford Stephens

Strata Conference 2011 To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with big data and analytics. Today, we hear from Bradford Stephens, founder of Drawn to Scale.

Drawn to Scale is a database platform that works with large data sets. Stephens describes its focus as slightly different from that of other big data tools: "Other tools out there concentrate on doing complex things with your data in seconds to minutes. We really concentrate on doing simple things with your data in milliseconds."

Stephens calls such speed "user time" and he credits Drawn to Scale's performance to its indexing system working in parallel with backend batch tools. Like other big data tools, Drawn to Scale uses MapReduce and Hadoop for batch processing on the back end. But on the front end, a series of secondary indices on top of the storage layer speed up retrieval. "We find that when you index data in the manner in which you wish to use it, it's basically one single call to the disk to access it," Stephens says. "So it can be extremely fast."

Big data tools and applications will be examined at the Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Drawn to Scale's customers include organizations working with analytics, in social media, in mobile ad targeting and delivery, and also organizations with large arrays of sensor networks. While he expects to see some consolidation on the commercial side ("I see a lot of vendors out there doing similar things"), on the open source side he expects to see a proliferation of tools available in areas such as geo data and managing time series. "People have some very specific requirements that they're going to cook up in open source."

You'll find the full interview in the following video:

September 23 2010

Strata Week: Grabbing a slice

Reminder: This is the last week to submit proposals for O'Reilly Strata; the call ends on Sept. 28. We're eager to hear your stories about the business and practice of data, analytics and visualization, so submit a proposal now.

A bigger slice of pi

A team led by Nicholas Sze of Yahoo! has calculated the 2,000,000,000,000,000th digit of pi, more than doubling the previous record. Extreme computational power helped -- Sze employed a cluster of 1,000 Hadoop machines to do in 23 days what would have taken 500 years on a standard PC -- but so did a non-traditional approach. Instead of working linearly from the decimal point, Sze used MapReduce to calculate isolated portions of the mathematical constant and express them in binary.

Aside from earning serious nerd-cred and bragging rights, the exercise proves handy in practical terms, too.

"This kind of calculation is useful in benchmarking and testing," Sze told the BBC. "We have used it to compare the [processor] performance among our clusters."

The iPad's slice of light

Dentsu London and BERG have developed an impressive technique that uses iPads to extrude light paintings by dragging the device across a long exposure shot as it flashes consecutive single cross-sections of a 3D model.

Making Future Magic: iPad light painting from Dentsu London.

Stephen Von Worley at Data Pointed asks whether a similar but reverse technique might be used by doctors to explore digital body scans by moving an iPad through the space of a fixed block of data in order to explore it.

To my knowledge, no one is yet doing that, but it seems just the kind of technique that could be considered by the team at Linköping University in Sweden. That group is working with digital body scans to create cut-free autopsies.

A different slice of life

Historypin, a photo project of We Are What We Do in collaboration with Google, has recently released The Blitz Collection. The idea behind Historypin is to build a catalog of public-space photos that can be searched by place, subject matter, or time. Users can upload their own photos and "pin" them to a map, adding contextual information about time or topic that other users can then access. The best part, though, is being able to view historical photos "pinned" to modern-day Google Street View images of the same place.

The Blitz Collection is a special set of these images meant to commemorate the 70th anniversary of the bombing of London and other cities across Britain during WWII.

An image of St. Stephen's Street in Norwich, dated April 28, 1942.

Historypin's next group project will be to collect American Civil Rights photos in celebration of Dr. Martin Luther King Jr.'s "I Have a Dream" speech, so start searching your attics and family photo albums.

Send us news

Email us news, tips, and interesting tidbits at

September 22 2010

The SMAQ stack for big data

"Big data" is data that becomes large enough that it cannot be processed using conventional methods. Creators of web search engines were among the first to confront this problem. Today, social networks, mobile phones, sensors and science contribute to petabytes of data created daily.

To meet the challenge of processing such large data sets, Google created MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce implementation has spawned an ecosystem of big data processing tools.

As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware.

SMAQ Stack

In the same way the commodity LAMP stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin a new era of innovative data-driven products and services, in the same way that LAMP was a critical enabler for Web 2.0.

Though dominated by Hadoop-based architectures, SMAQ encompasses a variety of systems, including leading NoSQL databases. This paper describes the SMAQ stack and where today's big data tools fit into the picture.


Created at Google in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. This distribution solves the issue of data too large to fit onto a single machine.

SMAQ Stack - MapReduce

To understand how MapReduce works, look at the two phases suggested by its name. In the map phase, input data is processed, item by item, and transformed into an intermediate data set. In the reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result.

MapReduce example

A simple example of MapReduce is the task of counting the number of unique words in a document. In the map phase, each word is identified and given the count of 1. In the reduce phase, the counts are added together for each word.

If that seems like an obscure way of doing a simple task, that's because it is. In order for MapReduce to do its job, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating queries into one or more MapReduce steps is not an intuitive process. Higher-level abstractions have been developed to ease this, discussed under Query below.

An important way in which MapReduce-based systems differ from conventional databases is that they process data in a batch-oriented fashion. Work must be queued for execution, and may take minutes or hours to process.

Using MapReduce to solve problems entails three distinct operations:

  • Loading the data -- This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it.
  • MapReduce -- This phase will retrieve data from storage, process it, and return the results to the storage.
  • Extracting the result -- Once processing is complete, for the result to be useful to humans, it must be retrieved from the storage and presented.

Many SMAQ systems have features designed to simplify the operation of each of these stages.

Hadoop MapReduce

Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached “web scale” capability in early 2008.

The Hadoop project is now hosted by Apache. It has grown into a large endeavor, with multiple subprojects that together comprise a full SMAQ stack.

Since it is implemented in Java, Hadoop's MapReduce implementation is accessible from the Java programming language. Creating MapReduce jobs involves writing functions to encapsulate the map and reduce stages of the computation. The data to be processed must be loaded into the Hadoop Distributed Filesystem.

Taking the word-count example from above, a suitable map function might look like the following (taken from the Hadoop MapReduce documentation, the key operations shown in bold).

public static class Map
extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
context.write(word, one);


The corresponding reduce function sums the counts for each word.

public static class Reduce
		extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {

int sum = 0;
for (IntWritable val : values) {
sum += val.get();
context.write(key, new IntWritable(sum));


The process of running a MapReduce job with Hadoop involves the following steps:

  • Defining the MapReduce stages in a Java program
  • Loading the data into the filesystem
  • Submitting the job for execution
  • Retrieving the results from the filesystem

Run via the standalone Java API, Hadoop MapReduce jobs can be complex to create, and necessitate programmer involvement. A broad ecosystem has grown up around Hadoop to make the task of loading and processing data more straightforward.

Other implementations

MapReduce has been implemented in a variety of other programming languages and systems, a list of which may be found in Wikipedia's entry for MapReduce. Notably, several NoSQL database systems have integrated MapReduce, and are described later in this paper.


MapReduce requires storage from which to fetch data and in which to store the results of the computation. The data expected by MapReduce is not relational data, as used by conventional databases. Instead, data is consumed in chunks, which are then divided among nodes and fed to the map phase as key-value pairs. This data does not require a schema, and may be unstructured. However, the data must be available in a distributed fashion, to serve each processing node.

SMAQ Stack - Storage

The design and features of the storage layer are important not just because of the interface with MapReduce, but also because they affect the ease with which data can be loaded and the results of computation extracted and searched.

Hadoop Distributed File System

The standard storage mechanism used by Hadoop is the Hadoop Distributed File System, HDFS. A core part of Hadoop, HDFS has the following features, as detailed in the HDFS design document.

  • Fault tolerance -- Assuming that failure will happen allows HDFS to run on commodity hardware.
  • Streaming data access -- HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.
  • Extreme scalability -- HDFS will scale to petabytes; such an installation is in production use at Facebook.
  • Portability -- HDFS is portable across operating systems.
  • Write once -- By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
  • Locality of computation -- Due to data volume, it is often much faster to move the program near to the data, and HDFS has features to facilitate this.

HDFS provides an interface similar to that of regular filesystems. Unlike a database, HDFS can only store and retrieve data, not index it. Simple random access to data is not possible. However, higher-level layers have been created to provide finer-grained functionality to Hadoop deployments, such as HBase.

HBase, the Hadoop Database

One approach to making HDFS more usable is HBase. Modeled after Google's BigTable database, HBase is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL universe of databases, and is similar to Cassandra and Hypertable.

HBase and MapReduce

HBase uses HDFS as a storage system, and thus is capable of storing a large volume of data through fault-tolerant, distributed nodes. Like similar column-store databases, HBase provides REST and Thrift based API access.

Because it creates indexes, HBase offers fast, random access to its contents, though with simple queries. For complex operations, HBase acts as both a source and a sink (destination for computed data) for Hadoop MapReduce. HBase thus allows systems to interface with Hadoop as a database, rather than the lower level of HDFS.


Data warehousing, or storing data in such a way as to make reporting and analysis easier, is an important application area for SMAQ systems. Developed originally at Facebook, Hive is a data warehouse framework built on top of Hadoop. Similar to HBase, Hive provides a table-based abstraction over HDFS and makes it easy to load structured data. In contrast to HBase, Hive can only run MapReduce jobs and is suited for batch data analysis. Hive provides a SQL-like query language to execute MapReduce jobs, described in the Query section below.

Cassandra and Hypertable

Cassandra and Hypertable are both scalable column-store databases that follow the pattern of BigTable, similar to HBase.

An Apache project, Cassandra originated at Facebook and is now in production in many large-scale websites, including Twitter, Facebook, Reddit and Digg. Hypertable was created at Zvents and spun out as an open source project.

Cassandra and MapReduce

Both databases offer interfaces to the Hadoop API that allow them to act as a source and a sink for MapReduce. At a higher level, Cassandra offers integration with the Pig query language (see the Query section below), and Hypertable has been integrated with Hive.

NoSQL database implementations of MapReduce

The storage solutions examined so far have all depended on Hadoop for MapReduce. Other NoSQL databases have built-in MapReduce features that allow computation to be parallelized over their data stores. In contrast with the multi-component SMAQ architectures of Hadoop-based systems, they offer a self-contained system comprising storage, MapReduce and query all in one.

Whereas Hadoop-based systems are most often used for batch-oriented analytical purposes, the usual function of NoSQL stores is to back live applications. The MapReduce functionality in these databases tends to be a secondary feature, augmenting other primary query mechanisms. Riak, for example, has a default timeout of 60 seconds on a MapReduce job, in contrast to the expectation of Hadoop that such a process may run for minutes or hours.

These prominent NoSQL databases contain MapReduce functionality:

  • CouchDB is a distributed database, offering semi-structured document-based storage. Its key features include strong replication support and the ability to make distributed updates. Queries in CouchDB are implemented using JavaScript to define the map and reduce phases of a MapReduce process.
  • MongoDB is very similar to CouchDB in nature, but with a stronger emphasis on performance, and less suitability for distributed updates, replication, and versioning. MongoDB MapReduce operations are specified using JavaScript.
  • Riak is another database similar to CouchDB and MongoDB, but places its emphasis on high availability. MapReduce operations in Riak may be specified with JavaScript or Erlang.

Integration with SQL databases

In many applications, the primary source of data is in a relational database using platforms such as MySQL or Oracle. MapReduce is typically used with this data in two ways:

  • Using relational data as a source (for example, a list of your friends in a social network).
  • Re-injecting the results of a MapReduce operation into the database (for example, a list of product recommendations based on friends' interests).

It is therefore important to understand how MapReduce can interface with relational database systems. At the most basic level, delimited text files serve as an import and export format between relational databases and Hadoop systems, using a combination of SQL export commands and HDFS operations. More sophisticated tools do, however, exist.

The Sqoop tool is designed to import data from relational databases into Hadoop. It was developed by Cloudera, an enterprise-focused distributor of Hadoop platforms. Sqoop is database-agnostic, as it uses the Java JDBC database API. Tables can be imported either wholesale, or using queries to restrict the data import.

Sqoop also offers the ability to re-inject the results of MapReduce from HDFS back into a relational database. As HDFS is a filesystem, Sqoop expects delimited text files and transforms them into the SQL commands required to insert data into the database.

For Hadoop systems that utilize the Cascading API (see the Query section below) the cascading.jdbc and cascading-dbmigrate tools offer similar source and sink functionality.

Integration with streaming data sources

In addition to relational data sources, streaming data sources, such as web server log files or sensor output, constitute the most common source of input to big data systems. The Cloudera Flume project aims at providing convenient integration between Hadoop and streaming data sources. Flume aggregates data from both network and file sources, spread over a cluster of machines, and continuously pipes these into HDFS. The Scribe server, developed at Facebook, also offers similar functionality.

Commercial SMAQ solutions

Several massively parallel processing (MPP) database products have MapReduce functionality built in. MPP databases have a distributed architecture with independent nodes that run in parallel. Their primary application is in data warehousing and analytics, and they are commonly accessed using SQL.

  • The Greenplum database is based on the open source PostreSQL DBMS, and runs on clusters of distributed hardware. The addition of MapReduce to the regular SQL interface enables fast, large-scale analytics over Greenplum databases, reducing query times by several orders of magnitude. Greenplum MapReduce permits the mixing of external data sources with the database storage. MapReduce operations can be expressed as functions in Perl or Python.
  • Aster Data's nCluster data warehouse system also offers MapReduce functionality. MapReduce operations are invoked using Aster Data's SQL-MapReduce technology. SQL-MapReduce enables the intermingling of SQL queries with MapReduce jobs defined using code, which may be written in languages including C#, C++, Java, R or Python.

Other data warehousing solutions have opted to provide connectors with Hadoop, rather than integrating their own MapReduce functionality.

  • Vertica, famously used by Farmville creator Zynga, is an MPP column-oriented database that offers a connector for Hadoop.
  • Netezza is an established manufacturer of hardware data warehousing and analytical appliances. Recently acquired by IBM, Netezza is working with Hadoop distributor Cloudera to enhance the interoperation between their appliances and Hadoop. While it solves similar problems, Netezza falls outside of our SMAQ definition, lacking both the open source and commodity hardware aspects.

Although creating a Hadoop-based system can be done entirely with open source, it requires some effort to integrate such a system. Cloudera aims to make Hadoop enterprise-ready, and has created a unified Hadoop distribution in its Cloudera Distribution for Hadoop (CDH). CDH for Hadoop parallels the work of Red Hat or Ubuntu in creating Linux distributions. CDH comes in both a free edition and an Enterprise edition with additional proprietary components and support. CDH is an integrated and polished SMAQ environment, complete with user interfaces for operation and query. Cloudera's work has resulted in some significant contributions to the Hadoop open source ecosystem.


Specifying MapReduce jobs in terms of defining distinct map and reduce functions in a programming language is unintuitive and inconvenient, as is evident from the Java code listings shown above. To mitigate this, SMAQ systems incorporate a higher-level query layer to simplify both the specification of the MapReduce operations and the retrieval of the result.

SMAQ Stack - Query

Many organizations using Hadoop will have already written in-house layers on top of the MapReduce API to make its operation more convenient. Several of these have emerged either as open source projects or commercial products.

Query layers typically offer features that handle not only the specification of the computation, but the loading and saving of data and the orchestration of the processing on the MapReduce cluster. Search technology is often used to implement the final step in presenting the computed result back to the user.


Developed by Yahoo, Pig provides a new high-level language, Pig Latin, for describing and running Hadoop MapReduce jobs. It is intended to make Hadoop accessible for developers familiar with data manipulation using SQL, and provides an interactive interface as well as a Java API. Pig integration is available for the Cassandra and HBase databases.

Below is shown the word-count example in Pig, including both the data loading and storing phases (the notation $0 refers to the first field in a record).

input = LOAD 'input/sentences.txt' USING TextLoader();
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();

While Pig is very expressive, it is possible for developers to write custom steps in User Defined Functions (UDFs), in the same way that many SQL databases support the addition of custom functions. These UDFs are written in Java against the Pig API.

Though much simpler to understand and use than the MapReduce API, Pig suffers from the drawback of being yet another language to learn. It is SQL-like in some ways, but it is sufficiently different from SQL that it is difficult for users familiar with SQL to reuse their knowledge.


As introduced above, Hive is an open source data warehousing solution built on top of Hadoop. Created by Facebook, it offers a query language very similar to SQL, as well as a web interface that offers simple query-building functionality. As such, it is suited for non-developer users, who may have some familiarity with SQL.

Hive's particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading. Hive is a natural starting point for more full-featured business intelligence systems, which offer a user-friendly interface for non-technical users.

The Cloudera Distribution for Hadoop integrates Hive, and provides a higher-level user interface through the HUE project, enabling users to submit queries and monitor the execution of Hadoop jobs.

Cascading, the API Approach

The Cascading project provides a wrapper around Hadoop's MapReduce API to make it more convenient to use from Java applications. It is an intentionally thin layer that makes the integration of MapReduce into a larger system more convenient. Cascading's features include:

  • A data processing API that aids the simple definition of MapReduce jobs.
  • An API that controls the execution of MapReduce jobs on a Hadoop cluster.
  • Access via JVM-based scripting languages such as Jython, Groovy, or JRuby.
  • Integration with data sources other than HDFS, including Amazon S3 and web servers.
  • Validation mechanisms to enable the testing of MapReduce processes.

Cascading's key feature is that it lets developers assemble MapReduce operations as a flow, joining together a selection of “pipes”. It is well suited for integrating Hadoop into a larger system within an organization.

While Cascading itself doesn't provide a higher-level query language, a derivative open source project called Cascalog does just that. Using the Clojure JVM language, Cascalog implements a query language similar to that of Datalog. Though powerful and expressive, Cascalog is likely to remain a niche query language, as it offers neither the ready familiarity of Hive's SQL-like approach nor Pig's procedural expression. The listing below shows the word-count example in Cascalog: it is significantly terser, if less transparent.

	(defmapcatop split [sentence]
		(seq (.split sentence "\\s+")))

(?<- (stdout) [?word ?count]
(sentence ?s) (split ?s :> ?word)
(c/count ?count))

Search with Solr

An important component of large-scale data deployments is retrieving and summarizing data. The addition of database layers such as HBase provides easier access to data, but does not provide sophisticated search capabilities.

To solve the search problem, the open source search and indexing platform Solr is often used alongside NoSQL database systems. Solr uses Lucene search technology to provide a self-contained search server product.

For example, consider a social network database where MapReduce is used to compute the influencing power of each person, according to some suitable metric. This ranking would then be reinjected to the database. Using Solr indexing allows operations on the social network, such as finding the most influential people whose interest profiles mention mobile phones, for instance.

Originally developed at CNET and now an Apache project, Solr has evolved from being just a text search engine to supporting faceted navigation and results clustering. Additionally, Solr can manage large data volumes over distributed servers. This makes it an ideal solution for result retrieval over big data sets, and a useful component for constructing business intelligence dashboards.


MapReduce, and Hadoop in particular, offers a powerful means of distributing computation among commodity servers. Combined with distributed storage and increasingly user-friendly query mechanisms, the resulting SMAQ architecture brings big data processing within reach for even small- and solo-development teams.

It is now economic to conduct extensive investigation into data, or create data products that rely on complex computations. The resulting explosion in capability has forever altered the landscape of analytics and data warehousing systems, lowering the bar to entry and fostering a new generation of products, services and organizational attitudes - a trend explored more broadly in Mike Loukides' "What is Data Science?" report.

The emergence of Linux gave power to the innovative developer with merely a small Linux server at their desk: SMAQ has the same potential to streamline data centers, foster innovation at the edges of an organization, and enable new startups to cheaply create data-driven businesses.


September 16 2010

Strata Week: The challenge of real-time analytics

The call for proposals for O'Reilly Strata ends on Sept. 28. We're keen to hear your stories about the business and practice of data, analytics and visualization. Submit a proposal now.

When MapReduce is too slow

This week the Register reported on Google's move away from a MapReduce architecture for compiling their search index. Pioneered by Google, MapReduce is a way to distribute calculations among many processors. MapReduce led the field in big data analytics frameworks, and is now popularly used in the form of the open-source Hadoop project, spearheaded by Yahoo!

Does Google think MapReduce is dead? Not quite. The problem is that MapReduce is a batch processing architecture. Google was recomputing their entire search index and replacing it wholesale every few hours. By contrast, content is being updated on the web in real-time. With a MapReduce-centric architecture, Google could never be truly up-to-date.

Caffeine, Google's new indexing system, supports incremental indexing and avoids the refresh rate problem of MapReduce. Carrie Grimes of Google explained the benefits, writing in a Google blog post:

With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index.

Google isn't the only company that wants real-time big data processing. Facebook, deeply invested in Hadoop, are working to get their latencies down to matters of seconds rather than minutes. Real-time analytics is a priority for many of companies I've spoken to in researching the Strata program. Whether it's MapReduce-based or not, we will see the emergence of more real-time big data technologies over the next 12 months.

  • Want to get a taste of using MapReduce without deploying any infrastructure?
    Check out, a simple self-contained Python MapReduce implementation.

Feeling blue

The folks at COLOURlovers noticed that a lot of people favored the color blue for their Twitter theme. Was this just because Twitter itself was blue, or does blue have a stronger hold on our preferences? To investigate, COLOURlovers decided to research the top 100 online brands.

Blue, the color of Twitter, Facebook, Paypal, and AT&T, does indeed dominate online brands. But it's not alone. There's a strong showing for red from companies such as CNN, ESPN, Comcast, CNET, BBC and YouTube. Red seems to be a strong indicator for media organizations.

Excerpt from COLOURlovers visualization.

Is there any hope for variety, or we doomed to a red-blue future? COLOURlovers suspect that once category leaders establish a certain color, newcomers are likely to repeat it.

Once a rocketship of a web startup takes flight, there are a number of Jr. Internet astronauts hoping to emulate their success ... and are inspired by their brands. And so blue and red will probably continue to dominate, but we can have hope for the GoWallas, DailyBooths and other more adventurous brands out there.

Personal email analytics

Email is one of the richest, most useful and most infuriating sources of data in our lives. For years we've been wanting tools to help make sense of the flow of people and information that it brings. In 1991, for example, Jaimie Zawinski created the Insidious Big Brother Database (BBDB), with the aim of making both email and people more manageable.

BBDB can automatically keep track of what other topics the sender has corresponded with you about; when you last read a message from the sender; what mailing lists their messages came through; and any other details or automatic annotations you care to add. It also does a good job of noting when someone's email address has changed.

More recently it seems that innovation has been slower to come to email clients. However, the opening up of GMail's API has brought some interesting new tools, based on machine learning and analytics.

For basic exploration of your email flows, try Graph Your Inbox. This is a Chrome browser extension that will chart queries over your GMail data, essentially a Google Trends for your email. Below is a graph comparing the volumes of email I sent and received.

Graph Your Inbox results for outbound vs incoming email

With tools such as Graph Your Inbox we can retrospectively mine our own email and discover the ebb and flow of people and projects in our lives. Can machine learning help us in a proactive way? Whether you are conscious of it or not, machine learning techniques help us daily in the fight against spam. But what about separating the signal from the noise among our non-spam communications?

Google recently introduced Priority Inbox, in an attempt to help users decide which emails are important. Small voting buttons and dividers in the interface enable you to train Priority Inbox. But some of this seems a bit redundant -- we already passively prioritize by how quickly we read and reply to messages from different people. Why can't the computer learn?

SaneBox is a web application that takes a more low-key approach. It will label mail according to whether it needs immediate attention, can be postponed for later, or whether it's a bulk mailing. I've been using it for some months, and it admittedly takes a little time to learn to trust. The results however are impressive. Simply removing non-urgent mail from view lowers stress levels considerably.

Send us news

Email us news, tips and interesting tidbits at

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!