
June 07 2013

Four short links: 7 June 2013

  1. Accumulo — NSA’s BigTable implementation, released as an Apache project.
  2. How the Robots Lost (Business Week) — the decline of high-frequency trading profits (basically, markets worked and imbalances in speed and knowledge have been corrected). Notable for the regulators getting access to the technology that the traders had: Last fall the SEC said it would pay Tradeworx, a high-frequency trading firm, $2.5 million to use its data collection system as the basic platform for a new surveillance operation. Code-named Midas (Market Information Data Analytics System), it scours the market for data from all 13 public exchanges. Midas went live in February. The SEC can now detect anomalous situations in the market, such as a trader spamming an exchange with thousands of fake orders, before they show up on blogs like Nanex and ZeroHedge. If Midas sees something odd, Berman’s team can look at trading data on a deeper level, millisecond by millisecond.
  3. PRISM: Surprised? (Danny O’Brien) — I really don’t agree with the people who think “We don’t have the collective will”, as though there’s some magical way things got done in the past when everyone was in accord and surprised all the time. It’s always hard work to change the world. Endless, dull hard work. Ten years later, when you’ve freed the slaves or beat the Nazis everyone is like “WHY CAN’T IT BE AS EASY TO CHANGE THIS AS THAT WAS, BACK IN THE GOOD OLD DAYS. I GUESS WE’RE ALL JUST SHEEPLE THESE DAYS.”
  4. What We Don’t Know About Spying on Citizens is Scarier Than What We Do Know (Bruce Schneier) — The U.S. government is on a secrecy binge. It overclassifies more information than ever. And we learn, again and again, that our government regularly classifies things not because they need to be secret, but because their release would be embarrassing. Open source BigTable implementation: free. Data gathering operation around it: $20M/year. Irony in having the extent of authoritarian Big Brother government secrecy questioned just as a whistleblower’s military trial is held “off the record”: priceless.

April 25 2012

Four short links: 25 April 2012

  1. World History Since 1300 (Coursera) -- Coursera expands offerings to include humanities. This content is in books and already in online lectures in many formats. What do you get from these? Online quizzes and the online forum with similar people considering similar things. So it's a book club for a university course?
  2. mod_spdy -- Apache module for the SPDY protocol, Google's "faster than HTTP" HTTP.
  3. The Top 10 Dying Industries in the United States (Washington Post) -- between the Internet and China, yesterday's cash cows are today's casseroles.
  4. Notes from JSConf2012 -- excellent conference report: covers what happens, why it was interesting or not, and even summarizes relevant and interesting hallway conversations. AA++ would attend by proxy again. (via an old Javascript Weekly)


February 01 2012

Why Hadoop caught on

Doug Cutting (@cutting) is a founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. When Cutting expresses surprise at Hadoop's growth — as he does below — that carries a lot of weight.

In the following interview, Cutting explains why he's surprised at Hadoop's ascendance, and he looks at the factors that helped Hadoop catch on. He'll expand on some of these points during his Hadoop session at the upcoming Strata Conference.

Why do you think Hadoop has caught on?

Doug Cutting: Hadoop is a technology whose time had come. As computer use has spread, institutions are generating vastly more data. While commodity hardware offers affordable raw storage and compute horsepower, before Hadoop, there was no commodity software to harness it. Without tools, useful data was simply discarded.

Open source is a methodology for commoditizing software. Google published its technological solutions, and the Hadoop community at Apache brought these to the rest of the world. Commodity hardware combined with the latent demand for data analysis formed the fuel that Hadoop ignited.

Are you surprised at its growth?

Doug Cutting: Yes. I didn't expect Hadoop to become such a central component of data processing. I recognized that Google's techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.

What role do you see Hadoop playing in the near-term future of data science and big data?

Doug Cutting: Hadoop is a central technology of big data and data science. HDFS is where folks store most of their data, and MapReduce is how they execute most of their analysis. There are some storage alternatives — for example, Cassandra and CouchDB, and useful computing alternatives, like S4, Giraph, etc. — but I don't see any of these replacing HDFS or MapReduce soon as the primary tools for big data.

Long term, we'll see. The ecosystem at Apache is a loosely-coupled set of separate projects. New components are regularly added to augment or replace incumbents. Such an ecosystem can survive the obsolescence of even its most central components.

In your Strata session description, you note that "Apache Hadoop forms the kernel of an operating system for big data." What else is in that operating system? How is that OS being put to use?

Doug Cutting: Operating systems permit folks to share resources, managing permissions and allocations. The two primary resources are storage and computation. Hadoop provides scalable storage through HDFS and scalable computation through MapReduce. It supports authorization, authentication, permissions, quotas and other operating system features. So, narrowly speaking, Hadoop alone is an operating system.

But no one uses Hadoop alone. Rather, folks also use HBase, Hive, Pig, Flume, Sqoop and many other ecosystem components. So, just as folks refer to more than the Linux kernel when they say "Linux," folks often refer to the entire Hadoop ecosystem when they say "Hadoop." Apache BigTop combines many of these ecosystem projects together into a distribution, much like RHL and Ubuntu do for Linux.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

July 15 2011

Developer Week in Review: Christmas in July for Apache

Only a few weeks left until OSCON. Alas, I won't be making it this year. I'm taking a few weeks with the family in August to drive down the California coast, and with a major software delivery coming up at work, I just don't have the time for another trip out west. So raise a glass for me; it looks like it'll be a blast.

Meanwhile ...

IBM hands off Lotus Symphony

It seems like everyone is in a gifting mood these days, and the various open source foundations are the beneficiaries. Oracle gave OpenOffice to Apache, Hudson went to Eclipse, and now IBM has left a big bundle of love on Apache's doorstep containing Lotus Symphony, with a note asking Apache to give it a good home.

It makes sense for IBM to make the gift, since Symphony was built on top of OpenOffice. Speculation abounds that Apache will merge the Symphony codebase into the mainline OpenOffice source, creating the One Office Suite to Rule Them All (but remember, one does not simply send a purchase order to Mordor...)

OSCON 2011 — Join today's open source innovators, builders, and pioneers July 25-29 as they gather at the Oregon Convention Center in Portland, Ore.

Save 20% on registration with the code OS11RAD

Your travel / patent guide to East Texas

If you work for a high-tech company, there's an increasing likelihood that you're going to be taking a trip out to East Texas sometime in your career. That's because there were 236 patent cases filed there in 2006 (up from 14 in 2003), and it's just kept growing since then.

Since you may have to travel out there because you're on the receiving end of a patent troll's suit, we provide the following handy guide to the area. We'll start with Marshall, Texas, a town of around 25,000. If you're there in the winter, stuck in Judge T. John Ward's courtroom, be sure to check out the Wonderland of Lights, and look for pottery deals year-round.

If your legal troubles bring you to Tyler, home of Judge Leonard Davis, make sure to bring your lightweight suits in the summer, as the average temperatures are in the mid-90s, and nothing spoils your testimony on user interface prior art more than embarrassing sweat stains.

Finally, if you draw Judge David Folsom, you'll be off to a city that actually straddles two states, Texarkana. Texarkana's main employer is the Red River Army Depot. Other than that, it has a depressingly sparse Wikipedia page, so maybe you should hope you end up in one of the other venues.

Got news?

Please send tips and leads here.


The Java parade: What about IBM and Apache?

Before I finished Who Leads the Java parade, Stephen Chin made the comment "what about IBM?" Since publication, I've had several Twitter discussions on the same question, and a smaller number over Apache. It feels very strange to say that IBM isn't a contender for the leadership of the Java community, particularly given all that they've contributed (most notably, Eclipse and Apache Harmony). But I just don't see either IBM or Apache as potential leaders for the Java community.

IBM has always seemed rather insular to me. They solve their own problems, and those are admittedly big, hard problems. IBM is one of the few organizations that could come up with a "Smarter Planet" initiative and not be laughed out of the room (Google may be the only other). So they're definitely a very positive force to be taken seriously; they're doing amazing work putting big data to practical use. But at the same time, they don't tend to engage, at least in my experience. They work on their own, and with their partners. Somewhat like Google, they're all the Java community they need.

"They don't engage? What about Harmony?" That's a good point. But it's a point that cuts both ways. Harmony was an important project that could have had a huge role in opening up the Java world. But when Oracle acquired Sun, IBM fairly quickly backed off from Harmony. Don't misunderstand me; I don't have any real problem with IBM's decision here. But if IBM wanted a role in leading the Java community, this was the time to stand up. Dropping support for Harmony is essentially saying that they're following Oracle's lead. That is IBM's prerogative, but it's also opting out as a potential leader.

There are other ways in which IBM doesn't engage. I was on the program committee for OSCON Java, and reviewed all of the proposals that were submitted. (As Arlo Guthrie said, "I'm not proud ... or tired"). I don't recall any proposals submitted from IBM. That doesn't mean there weren't any, but there certainly weren't many. (There are a couple of IBMers speaking at OSCON "Classic.") They are neither a sponsor nor an exhibitor. Again, I'm not complaining, but engagement is engagement, and disengagement is just that.

OSCON Java 2011, being held July 25-27 in Portland, Ore., is focused on open source technologies that make up the Java ecosystem. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

Is it possible that IBM has decided that their best strategy for Java is to unite with Oracle in pushing OpenJDK forward? Yes, certainly. Is it likely that they felt maintaining an alternative to OpenJDK was just a distraction from real work? Very possibly. And it was unfair of me to characterize IBM as insular. But good as IBM's decisions may be, they're not the decisions of a company that wants to exercise Java leadership.

In the long run, does this mean that IBM is any different from VMware? Both companies have large stakes in Java, lots of expertise, lots of tools at their disposal, and lots of competing business interests. Is it possible that they just want to skip the politics and get down to work? Maybe, but I think it's something different. If the question is getting in front of the parade and leading, IBM has its own parade: using data to solve difficult problems about living on this planet. When you're almost four times the size of Oracle, 16 times the size of Google, and 50 times the size of VMware, you have to think big. Large as the Java community is, IBM is aiming at a larger value of "big."

Now, for the Apache Software Foundation (ASF): when writing, I thought seriously about the possibility that the ASF might contend for leadership of the Java community. But they're just not in the race. That's not their function. They provide resources and frameworks for open source collaboration, but on the whole, they don't provide community leadership, technical or otherwise. Hadoop, together with its subprojects and former subprojects, is probably the most important project in the Apache galaxy. But would you call the ASF a leader of the Hadoop community? Clearly not. That role is shared by Cloudera and Yahoo!, and possibly Yahoo!'s new spinoff, HortonWorks. Apache provides resources and licenses, but they aren't leading the community in any meaningful sense.

Apache walked away from a leadership role when it left the JCP. That was a completely understandable decision, and a decision that I agree was necessary, but a decision with consequences. It's possible that Apache was hoping to spark a revolt against Oracle's leadership. I think Apache meant what they said, that the JCP was no longer a process in which they could participate with integrity, and they had no choice but to leave. Any chance of Apache retaining a significant role in the Java community ended when IBM walked away from Apache Harmony. Harmony remains interesting, but it's very difficult to imagine Harmony thriving without IBM's support. And with Harmony marginalized, it's not clear how Apache could exert much influence over the course of Java.

So, why did I ignore IBM and Apache? They've both opted out. They had good reasons for doing so, but nevertheless, they're not in the running. IBM and Apache might be considered dark horses in the race for Java leadership. And given that neither VMware nor Google seems to want leadership, and Oracle hasn't demonstrated the "social skills" to exercise leadership, I have to grant that a dark horse may be in as good a position as anyone else.


June 02 2011

Developer Week in Review: The other shoe drops on iOS developers

Bags packed? Check! Ticket printed? Check! "I (Heart) Steve" T-shirt worn? Check! Yes, it's that time of year when, like the swallows returning to Capistrano, the developers return to San Francisco for WWDC. I'll be there Sunday to Saturday, so keep an eye out for me and maybe we can get a beer or something.

But even as we await the release of Lion, iOS 5 and iCloud, the world continues to turn.

Well, so much for Apple's big umbrella

Last week, iOS developers everywhere breathed a sigh of relief as Apple stepped up to the plate, and said that they considered their developer community to be covered under Apple's existing licensing agreement with patent holding company Lodsys. Lodsys, evidently, had a difference of opinion on the subject. This leaves the lucky seven developers who got hit with the first round of lawsuits with an interesting choice. Do they settle with Lodsys, perhaps paying out many times what they have brought in as income from their apps, or do they fight and face expensive legal fees and a lawsuit that could drag on for years?

Android developers shouldn't gloat too much at the misfortune of their iPhone counterparts, since Lodsys is asserting that two of their patents cover Android apps as well. Apple and Google are going to have to take things up another notch, and offer free legal services to their developers, or things could get quite messy, quite fast.


OpenOffice finds a home at Apache

Oracle, as part of their ongoing shedding of all of their Sun acquisitions, had promised earlier in the year that OpenOffice would be given to some third party at some point. Well, that third party is Apache. Oracle will be donating the source code to Apache, where it will become an incubator project. For developers who have been interested in poking around with the guts of OpenOffice (or extending the functionality), but were leery of Oracle holding the strings, this announcement should eliminate any doubts. Statements from The Document Foundation (which split off a fork of OpenOffice) were guarded, but it seems like there's hope of reuniting the code streams, and avoiding yet another case of parallel development of the same "product."

Java rant of the week: Interface madness

As I am wont to do from time to time, I'd like to take a moment today to rant about a coding abuse that I see more and more frequently. That abuse would be the indiscriminate use of interfaces in front of implementing classes, usually with a factory. There are certainly places where the interface/factory pattern makes sense, such as when you genuinely do have multiple implementations of something that you want to be able to swap out easily.

However, far too often, I see factories and interfaces used between classes simply because "we might" want to someday put something else in there. I recently saw an implementation of a servlet that called for authentication of the request. There's only one implemented version of the authentication code, and no real plans to make another. But still, there were Foo and FooImpl files sitting right there (there was probably a FooFactory somewhere; I didn't go looking ...)
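In skeletal form, the pattern in question looks something like this. This is a hypothetical sketch (the Foo/FooImpl/FooFactory names and the hard-coded credentials are illustrative, not from any real codebase):

```java
// One interface, one implementation, one factory: three files where a
// single plain class would do. Hypothetical names for illustration.
interface Foo {
    boolean authenticate(String user, String password);
}

class FooImpl implements Foo {
    public boolean authenticate(String user, String password) {
        // The one and only real implementation; no second one is planned.
        // Placeholder credential check, purely for illustration.
        return "admin".equals(user) && "secret".equals(password);
    }
}

class FooFactory {
    // Callers see only the Foo interface, so finding the actual code
    // means first reading the factory to learn which class it returns.
    static Foo newFoo() {
        return new FooImpl();
    }
}
```

A plain `class Foo` with the same method would behave identically and leave jump-to-definition working, which is the point of the rant.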

Unneeded interfaces are not only wasted code, they make reading and debugging the code much more difficult, because they break the link between the call and the implementation. The only way to find the implementing code is to look for the factory, and see what class is being provisioned to implement the interface. If you're really lucky, the factory gets the class name from a property file, so you have to look another level down.

There's no excuse for doing this. It's anti-agile, and the refactoring cost once you genuinely do have a second version, and need an interface, is relatively low. End of rant.

Got news?

Please send tips and leads here.


March 31 2011

Four short links: 31 March 2011

  1. Debt: The First 5,000 Years -- Throughout its 5000 year history, debt has always involved institutions - whether Mesopotamian sacred kingship, Mosaic jubilees, Sharia or Canon Law - that place controls on debt's potentially catastrophic social consequences. It is only in the current era, writes anthropologist David Graeber, that we have begun to see the creation of the first effective planetary administrative system largely in order to protect the interests of creditors. (via Tim O'Reilly)
  2. Know Your History -- where Google's +1 came from (answer: Apache project).
  3. MIT Autonomous Quadcopter -- MIT drone makes a map of a room in real time using an X Box Kinect and is able to navigate through it. All calculations performed on board the multicopter. Wow. (via Slashdot and Sara Winge)
  4. How Great Entrepreneurs Think -- leaving aside the sloppy open-mouth kisses to startups that "great entrepreneurs" implies, an interesting article comparing the mindsets of corporate execs with entrepreneurs. I'd love to read the full interviews and research paper. Sarasvathy explains that entrepreneurs' aversion to market research is symptomatic of a larger lesson they have learned: They do not believe in prediction of any kind. "If you give them data that has to do with the future, they just dismiss it," she says. "They don't believe the future is predictable...or they don't want to be in a space that is very predictable." [...] the careful forecast is the enemy of the fortuitous surprise. (via Sacha Judd)

January 19 2011

Developer Week in Review

Having somehow found myself in the bizarro world where the Patriots lose to the Jets, I scanned the InterWebs to see what other strange things might have occurred in the past week. My report follows.

The new cat in town

Despite all the hype that mobile platforms and dynamic languages such as Python garner, a lot of the web world still runs on Java. Thus, the news that the first stable build of Apache Tomcat version 7 was released is significant.

For those who have not had the pleasure, Tomcat is a fully functional servlet container (implementing the Servlet and JSP pieces of the Java EE stack), but without the big (or any) price tag. Version 7 brings along with it support for the Servlet 3.0 API and JSP version 2.2, upgrades sure to bring a sparkle of delight to your local Java web guru.

Amazon knows best

Thinking about selling your Android apps through the new Amazon App store? Amazon thinks they have a better idea what that app is worth than you do. Evidently, apps submitted for sale can have a suggested retail price, but Amazon will determine the actual selling price. The developer gets 70 percent of the selling price, or 20 percent of the retail price, whichever is greater.

So, if you put your masterpiece — "Enraged Avians" — into the store with an MSRP of $20, Amazon could decide to put it on sale for as little as $5.70, netting you $4 (20% of $20 is $4.00, which just edges out 70% of $5.70, or $3.99). What this means is that if you want to make a certain amount off each sale, you'll need to set an MSRP five times that amount to make sure you actually get it. Hilarious hijinks are sure to ensue as Amazon and the developer community dance with pricing levels.
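The payout rule as described can be sketched in a few lines. This is a hypothetical helper (not Amazon's code); amounts are in cents to keep the arithmetic exact:

```java
// Sketch of the Amazon Appstore payout rule described above: the
// developer receives the greater of 70% of the actual selling price
// or 20% of the suggested retail (MSRP) price.
class AppstorePayout {
    static long payoutCents(long saleCents, long msrpCents) {
        long fromSale = saleCents * 70 / 100;  // 70% of selling price
        long fromList = msrpCents * 20 / 100;  // 20% of MSRP
        return Math.max(fromSale, fromList);
    }

    public static void main(String[] args) {
        // "Enraged Avians": MSRP $20.00, discounted to $5.70
        System.out.println(payoutCents(570, 2000)); // prints 400, i.e. $4.00
    }
}
```

Since the 20%-of-MSRP floor always applies, a desired per-sale income of X is guaranteed by setting the MSRP to 5X, which is where the "five times" rule of thumb comes from.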

HTML5, now with 100 percent more logos

There may be a battle brewing over which video codec to use with HTML5, and people are still debating if HTML5 is a good replacement for technologies like Flash, but the biggest issue facing the new standard has clearly been settled: HTML5 now has a logo. (The logo appears to either be co-branded with the Hunter Safety program or Tropicana orange juice.)

According to the W3C, the new logo is:

... an all-purpose banner for HTML5, CSS, SVG, WOFF, and other technologies that constitute an open web platform. The logo does not have a specific meaning; it is not meant to imply conformance or validity, for example. The logo represents "the Web platform" in a very general sense.

We can all rest safe in our beds tonight, knowing that we now have a logo to represent the "Web platform, in a very general sense."

What new logos await us in the coming week, and can they possibly use a more garish color scheme? We'll find out next week, in this same space. Suggestions are always welcome, so please send tips or news here.


November 10 2010

Developer Week in Review

Here's your weekly helping of developer info:

There's an App (Store) for that!

It seems like all the cool kids these days are doing it. Creating app stores, that is. Intel just unveiled their AppUp store, designed to let developers sell directly to netbook owners, using an App Store model.

Unfortunately, to use AppUp on your netbook you have to run Windows. All those Linux netbook app developers aren't going to find much of a welcome there, at least at the moment.

As the Java brews

In the continued soap opera that is Java these days, the Apache folks have decided to strike back at Oracle for what Apache claims is bad-faith action regarding the open-sourceness of Java.

Of course, Apache being Apache, the dramatic action isn't a lawsuit, but instead a strongly worded letter (that'll show 'em!) urging the members of the JCP to reject the next version of Java, unless Oracle mends their ways. If that doesn't work, they may even organize a bake sale or write letters to the editor.

Oracle's announcement this week that they will be splitting the JVM into a premium and free edition couldn't have helped things. If you're old enough to remember the kerfuffle that Sun raised when Microsoft tried to create their own version of Java, claiming that non-uniform Javas would defeat the value of the language, this recent move by Java's new daddy has particular irony.

Microsoft open sources a language! Oh wait, it's F#...

Raise your hand if you've ever heard of F#. No, it's not FORTRAN++; it's Microsoft's functional language, from Microsoft Research. This week, Microsoft dropped the F# compiler sources in a nice neat bundle on Apache's doorstep, with a note saying that they hoped that Apache could find a good home for it.

Two observations here: Does anyone but me remember that not too long ago, Mr. Ballmer referred to open source as akin to cancer? Microsoft seems to have embraced it recently, but I don't recall ever hearing an "oops, my mistake" from Steve B.

Secondly, why doesn't Microsoft open source something of real value to the community, but well past its prime? Windows 98 comes to mind, or maybe Word 2003. Either one would allow all sorts of interesting mashups and compatibility enhancements with other open source projects, and it's not like either codebase is particularly relevant anymore from a competitive standpoint.

Oh yeah, those guys in Cupertino ...

This Wednesday is supposed to be the day that Snow Leopard 10.6.5 and iTunes 10.1 release, followed soon after by iOS 4.2. I recently read that Netflix now consumes something like 20 percent of the total Internet traffic. I wonder if, on days when Apple drops one of their massive OS upgrades, Netflix doesn't take a back seat to Apple for a day or two, as all those loyal MacHeads run their updates?

That's it for this week. Suggestions are always welcome, so please send tips or news here.

September 22 2010

The SMAQ stack for big data

"Big data" is data that becomes large enough that it cannot be processed using conventional methods. Creators of web search engines were among the first to confront this problem. Today, social networks, mobile phones, sensors and science contribute to petabytes of data created daily.

To meet the challenge of processing such large data sets, Google created MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce implementation have spawned an ecosystem of big data processing tools.

As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware.

SMAQ Stack

In the same way the commodity LAMP stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin a new era of innovative data-driven products and services, in the same way that LAMP was a critical enabler for Web 2.0.

Though dominated by Hadoop-based architectures, SMAQ encompasses a variety of systems, including leading NoSQL databases. This paper describes the SMAQ stack and where today's big data tools fit into the picture.


MapReduce

Created at Google in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. This distribution solves the issue of data too large to fit onto a single machine.

SMAQ Stack - MapReduce

To understand how MapReduce works, look at the two phases suggested by its name. In the map phase, input data is processed, item by item, and transformed into an intermediate data set. In the reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result.

MapReduce example

A simple example of MapReduce is the task of counting the occurrences of each word in a document. In the map phase, each word is identified and given the count of 1. In the reduce phase, the counts are added together for each word.

If that seems like an obscure way of doing a simple task, that's because it is. In order for MapReduce to do its job, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating queries into one or more MapReduce steps is not an intuitive process. Higher-level abstractions have been developed to ease this, discussed under Query below.
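To make the two phases concrete, here is a framework-free sketch of the word-count example in plain Java. The `map`/`reduce` method names are illustrative, not Hadoop's API; the Hadoop version appears later in this paper:

```java
import java.util.*;

class WordCountSketch {
    // Map phase: each input record (here, a line of text) is processed
    // independently and emits intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: all pairs sharing a key are summed into a final count.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : new String[] {"to be or not", "to be"}) {
            intermediate.addAll(map(line));  // each map call is independent
        }
        System.out.println(reduce(intermediate));
    }
}
```

The constraints mentioned above are visible here: each `map` call touches only one record, and `reduce` needs only the pairs grouped by key, so both phases can be spread across many nodes without coordination.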

An important way in which MapReduce-based systems differ from conventional databases is that they process data in a batch-oriented fashion. Work must be queued for execution, and may take minutes or hours to process.

Using MapReduce to solve problems entails three distinct operations:

  • Loading the data -- This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it.
  • MapReduce -- This phase will retrieve data from storage, process it, and return the results to the storage.
  • Extracting the result -- Once processing is complete, for the result to be useful to humans, it must be retrieved from the storage and presented.

Many SMAQ systems have features designed to simplify the operation of each of these stages.

Hadoop MapReduce

Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached “web scale” capability in early 2008.

The Hadoop project is now hosted by Apache. It has grown into a large endeavor, with multiple subprojects that together comprise a full SMAQ stack.

Since it is implemented in Java, Hadoop's MapReduce implementation is accessible from the Java programming language. Creating MapReduce jobs involves writing functions to encapsulate the map and reduce stages of the computation. The data to be processed must be loaded into the Hadoop Distributed Filesystem.

Taking the word-count example from above, a suitable map function might look like the following (adapted from the Hadoop MapReduce documentation).

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}


The corresponding reduce function sums the counts for each word.

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}


The process of running a MapReduce job with Hadoop involves the following steps:

  • Defining the MapReduce stages in a Java program
  • Loading the data into the filesystem
  • Submitting the job for execution
  • Retrieving the results from the filesystem

Run via the standalone Java API, Hadoop MapReduce jobs can be complex to create, and necessitate programmer involvement. A broad ecosystem has grown up around Hadoop to make the task of loading and processing data more straightforward.

Other implementations

MapReduce has been implemented in a variety of other programming languages and systems, a list of which may be found in Wikipedia's entry for MapReduce. Notably, several NoSQL database systems have integrated MapReduce, and are described later in this paper.


MapReduce requires storage from which to fetch data and in which to store the results of the computation. The data expected by MapReduce is not relational data, as used by conventional databases. Instead, data is consumed in chunks, which are then divided among nodes and fed to the map phase as key-value pairs. This data does not require a schema, and may be unstructured. However, the data must be available in a distributed fashion, to serve each processing node.
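As an illustration of how schema-less text becomes the key-value pairs fed to the map phase, the sketch below mirrors the byte-offset keys of Hadoop's default text input handling; the class and method names are illustrative and no Hadoop classes are used.

```java
import java.util.*;

// Sketch of how an input file reaches the map phase: the file is read as
// schema-less text and presented as key-value pairs, where the key is the
// byte offset of each line and the value is the line itself (this mirrors
// the behavior of Hadoop's default text input format).
public class LinesToKeyValues {

    static List<Map.Entry<Long, String>> toPairs(String contents) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n", -1)) {
            pairs.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.length() + 1;  // +1 for the newline separator
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(toPairs("first line\nsecond line"));
    }
}
```

In a distributed setting, each chunk of the file is handed to a different node, which applies the map function to the pairs in its chunk.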

SMAQ Stack - Storage

The design and features of the storage layer are important not just because of the interface with MapReduce, but also because they affect the ease with which data can be loaded and the results of computation extracted and searched.

Hadoop Distributed File System

The standard storage mechanism used by Hadoop is the Hadoop Distributed File System, HDFS. A core part of Hadoop, HDFS has the following features, as detailed in the HDFS design document.

  • Fault tolerance -- Assuming that failure will happen allows HDFS to run on commodity hardware.
  • Streaming data access -- HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.
  • Extreme scalability -- HDFS will scale to petabytes; such an installation is in production use at Facebook.
  • Portability -- HDFS is portable across operating systems.
  • Write once -- By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
  • Locality of computation -- Due to data volume, it is often much faster to move the program near to the data, and HDFS has features to facilitate this.

HDFS provides an interface similar to that of regular filesystems. Unlike a database, HDFS can only store and retrieve data, not index it. Simple random access to data is not possible. However, higher-level layers have been created to provide finer-grained functionality to Hadoop deployments, such as HBase.

HBase, the Hadoop Database

One approach to making HDFS more usable is HBase. Modeled after Google's BigTable database, HBase is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL universe of databases, and is similar to Cassandra and Hypertable.

HBase and MapReduce

HBase uses HDFS as a storage system, and thus is capable of storing a large volume of data through fault-tolerant, distributed nodes. Like similar column-store databases, HBase provides REST- and Thrift-based API access.

Because it creates indexes, HBase offers fast, random access to its contents, though only for simple queries. For complex operations, HBase acts as both a source and a sink (destination for computed data) for Hadoop MapReduce. HBase thus allows systems to interface with Hadoop as a database, rather than at the lower level of HDFS.


Hive

Data warehousing, or storing data in such a way as to make reporting and analysis easier, is an important application area for SMAQ systems. Developed originally at Facebook, Hive is a data warehouse framework built on top of Hadoop. Similar to HBase, Hive provides a table-based abstraction over HDFS and makes it easy to load structured data. In contrast to HBase, Hive can only run MapReduce jobs and is suited for batch data analysis. Hive provides a SQL-like query language to execute MapReduce jobs, described in the Query section below.

Cassandra and Hypertable

Cassandra and Hypertable are both scalable column-store databases that follow the pattern of BigTable, similar to HBase.

An Apache project, Cassandra originated at Facebook and is now in production in many large-scale websites, including Twitter, Facebook, Reddit and Digg. Hypertable was created at Zvents and spun out as an open source project.

Cassandra and MapReduce

Both databases offer interfaces to the Hadoop API that allow them to act as a source and a sink for MapReduce. At a higher level, Cassandra offers integration with the Pig query language (see the Query section below), and Hypertable has been integrated with Hive.

NoSQL database implementations of MapReduce

The storage solutions examined so far have all depended on Hadoop for MapReduce. Other NoSQL databases have built-in MapReduce features that allow computation to be parallelized over their data stores. In contrast with the multi-component SMAQ architectures of Hadoop-based systems, they offer a self-contained system comprising storage, MapReduce and query all in one.

Whereas Hadoop-based systems are most often used for batch-oriented analytical purposes, the usual function of NoSQL stores is to back live applications. The MapReduce functionality in these databases tends to be a secondary feature, augmenting other primary query mechanisms. Riak, for example, has a default timeout of 60 seconds on a MapReduce job, in contrast to the expectation of Hadoop that such a process may run for minutes or hours.

These prominent NoSQL databases contain MapReduce functionality:

  • CouchDB is a distributed database, offering semi-structured document-based storage. Its key features include strong replication support and the ability to make distributed updates. Queries in CouchDB are implemented using JavaScript to define the map and reduce phases of a MapReduce process.
  • MongoDB is very similar to CouchDB in nature, but with a stronger emphasis on performance, and less suitability for distributed updates, replication, and versioning. MongoDB MapReduce operations are specified using JavaScript.
  • Riak is another database similar to CouchDB and MongoDB, but places its emphasis on high availability. MapReduce operations in Riak may be specified with JavaScript or Erlang.

Integration with SQL databases

In many applications, the primary source of data is in a relational database using platforms such as MySQL or Oracle. MapReduce is typically used with this data in two ways:

  • Using relational data as a source (for example, a list of your friends in a social network).
  • Re-injecting the results of a MapReduce operation into the database (for example, a list of product recommendations based on friends' interests).

It is therefore important to understand how MapReduce can interface with relational database systems. At the most basic level, delimited text files serve as an import and export format between relational databases and Hadoop systems, using a combination of SQL export commands and HDFS operations. More sophisticated tools do, however, exist.

The Sqoop tool is designed to import data from relational databases into Hadoop. It was developed by Cloudera, an enterprise-focused distributor of Hadoop platforms. Sqoop is database-agnostic, as it uses the Java JDBC database API. Tables can be imported either wholesale, or using queries to restrict the data import.

Sqoop also offers the ability to re-inject the results of MapReduce from HDFS back into a relational database. As HDFS is a filesystem, Sqoop expects delimited text files and transforms them into the SQL commands required to insert data into the database.
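The kind of transformation involved can be sketched as follows. This is not Sqoop's implementation, and the table name, column names, and delimiter are hypothetical; a real tool also handles types, NULLs, escaping rules, and batching.

```java
import java.util.*;

// Illustration of the transformation an export tool performs when moving
// results from HDFS back into a relational database: each delimited text
// record becomes a SQL INSERT statement.
public class DelimitedToSql {

    static String toInsert(String table, List<String> columns, String line) {
        String[] fields = line.split(",", -1);
        StringJoiner values = new StringJoiner(", ", "(", ")");
        for (String field : fields) {
            // Quote every field and escape embedded single quotes.
            values.add("'" + field.replace("'", "''") + "'");
        }
        return "INSERT INTO " + table + " ("
                + String.join(", ", columns) + ") VALUES " + values + ";";
    }

    public static void main(String[] args) {
        System.out.println(
            toInsert("recommendations", Arrays.asList("user_id", "product_id"),
                     "42,1001"));
    }
}
```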

For Hadoop systems that utilize the Cascading API (see the Query section below) the cascading.jdbc and cascading-dbmigrate tools offer similar source and sink functionality.

Integration with streaming data sources

In addition to relational sources, streaming data sources such as web server log files or sensor output are among the most common inputs to big data systems. The Cloudera Flume project aims to provide convenient integration between Hadoop and streaming data sources. Flume aggregates data from both network and file sources spread over a cluster of machines, and continuously pipes these into HDFS. The Scribe server, developed at Facebook, offers similar functionality.
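The core job of such a collector can be shown with a toy sketch: buffer incoming events and flush them to the store in batches, which suits HDFS's preference for large, write-once files. All names here are invented; this is not Flume code, and a real collector also handles network transport, retries, and durability.

```java
import java.util.*;

// Toy event collector: events accumulate in a buffer and are written to
// the sink (standing in for HDFS) in batches once a threshold is reached.
public class BatchingCollector {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> sink = new ArrayList<>();
    private final int batchSize;

    public BatchingCollector(int batchSize) {
        this.batchSize = batchSize;
    }

    // Accept one event; flush automatically when the buffer fills.
    public void collect(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Write any buffered events to the sink as one batch.
    public void flush() {
        if (!buffer.isEmpty()) {
            sink.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public List<List<String>> batches() {
        return sink;
    }

    public static void main(String[] args) {
        BatchingCollector c = new BatchingCollector(2);
        c.collect("GET /index.html");
        c.collect("GET /about.html");
        c.collect("GET /news.html");
        c.flush();
        System.out.println(c.batches());
    }
}
```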

Commercial SMAQ solutions

Several massively parallel processing (MPP) database products have MapReduce functionality built in. MPP databases have a distributed architecture with independent nodes that run in parallel. Their primary application is in data warehousing and analytics, and they are commonly accessed using SQL.

  • The Greenplum database is based on the open source PostgreSQL DBMS, and runs on clusters of distributed hardware. The addition of MapReduce to the regular SQL interface enables fast, large-scale analytics over Greenplum databases, reducing query times by several orders of magnitude. Greenplum MapReduce permits the mixing of external data sources with the database storage. MapReduce operations can be expressed as functions in Perl or Python.
  • Aster Data's nCluster data warehouse system also offers MapReduce functionality. MapReduce operations are invoked using Aster Data's SQL-MapReduce technology. SQL-MapReduce enables the intermingling of SQL queries with MapReduce jobs defined using code, which may be written in languages including C#, C++, Java, R or Python.

Other data warehousing solutions have opted to provide connectors with Hadoop, rather than integrating their own MapReduce functionality.

  • Vertica, famously used by Farmville creator Zynga, is an MPP column-oriented database that offers a connector for Hadoop.
  • Netezza is an established manufacturer of hardware data warehousing and analytical appliances. Recently acquired by IBM, Netezza is working with Hadoop distributor Cloudera to enhance the interoperation between their appliances and Hadoop. While it solves similar problems, Netezza falls outside of our SMAQ definition, lacking both the open source and commodity hardware aspects.

Although a Hadoop-based system can be built entirely from open source software, integrating the pieces takes some effort. Cloudera aims to make Hadoop enterprise-ready and has created a unified distribution in its Cloudera Distribution for Hadoop (CDH). CDH parallels the work of Red Hat or Ubuntu in creating Linux distributions. It comes in both a free edition and an Enterprise edition with additional proprietary components and support. CDH is an integrated and polished SMAQ environment, complete with user interfaces for operation and query. Cloudera's work has resulted in some significant contributions to the Hadoop open source ecosystem.


Specifying MapReduce jobs in terms of defining distinct map and reduce functions in a programming language is unintuitive and inconvenient, as is evident from the Java code listings shown above. To mitigate this, SMAQ systems incorporate a higher-level query layer to simplify both the specification of the MapReduce operations and the retrieval of the result.

SMAQ Stack - Query

Many organizations using Hadoop will have already written in-house layers on top of the MapReduce API to make its operation more convenient. Several of these have emerged either as open source projects or commercial products.

Query layers typically offer features that handle not only the specification of the computation, but the loading and saving of data and the orchestration of the processing on the MapReduce cluster. Search technology is often used to implement the final step in presenting the computed result back to the user.


Pig

Developed by Yahoo, Pig provides a new high-level language, Pig Latin, for describing and running Hadoop MapReduce jobs. It is intended to make Hadoop accessible for developers familiar with data manipulation using SQL, and provides an interactive interface as well as a Java API. Pig integration is available for the Cassandra and HBase databases.

Below is the word-count example in Pig, including both the data loading and storing phases (the notation $0 refers to the first field in a record).

input = LOAD 'input/sentences.txt' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();

Expressive as Pig is, there are cases it cannot cover; for these, developers can write custom steps as User Defined Functions (UDFs), in the same way that many SQL databases support the addition of custom functions. UDFs are written in Java against the Pig API.

Though much simpler to understand and use than the MapReduce API, Pig suffers from the drawback of being yet another language to learn. It is SQL-like in some ways, but it is sufficiently different from SQL that it is difficult for users familiar with SQL to reuse their knowledge.


Hive

As introduced above, Hive is an open source data warehousing solution built on top of Hadoop. Created by Facebook, it offers a query language very similar to SQL, as well as a web interface that offers simple query-building functionality. As such, it is suited to non-developer users who may have some familiarity with SQL.

Hive's particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading. Hive is a natural starting point for more full-featured business intelligence systems, which offer a user-friendly interface for non-technical users.

The Cloudera Distribution for Hadoop integrates Hive, and provides a higher-level user interface through the HUE project, enabling users to submit queries and monitor the execution of Hadoop jobs.

Cascading, the API Approach

The Cascading project provides a wrapper around Hadoop's MapReduce API to make it more convenient to use from Java applications. It is an intentionally thin layer that makes the integration of MapReduce into a larger system more convenient. Cascading's features include:

  • A data processing API that aids the simple definition of MapReduce jobs.
  • An API that controls the execution of MapReduce jobs on a Hadoop cluster.
  • Access via JVM-based scripting languages such as Jython, Groovy, or JRuby.
  • Integration with data sources other than HDFS, including Amazon S3 and web servers.
  • Validation mechanisms to enable the testing of MapReduce processes.

Cascading's key feature is that it lets developers assemble MapReduce operations as a flow, joining together a selection of “pipes”. It is well suited for integrating Hadoop into a larger system within an organization.
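The flow-of-pipes idea can be sketched without Cascading itself: each stage transforms a stream of records, and stages compose into a flow that is executed against a source. The names below are illustrative and none of Cascading's classes are used.

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Sketch of the "pipes" idea behind Cascading: each pipe is a function
// over a stream of records, and pipes join into a flow that only runs
// when applied to a source.
public class PipeFlow {

    // Two example pipes: split lines into tokens, and lowercase tokens.
    static Function<Stream<String>, Stream<String>> tokenize =
            s -> s.flatMap(line -> Arrays.stream(line.split("\\s+")));
    static Function<Stream<String>, Stream<String>> lowercase =
            s -> s.map(String::toLowerCase);

    // Apply each stage of the flow in order to the source records.
    static List<String> run(List<String> source,
            List<Function<Stream<String>, Stream<String>>> flow) {
        Stream<String> s = source.stream();
        for (Function<Stream<String>, Stream<String>> stage : flow) {
            s = stage.apply(s);
        }
        return s.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("The Quick Fox"),
                Arrays.asList(tokenize, lowercase)));
    }
}
```

Because each pipe has the same shape, flows can be rearranged and extended without touching the stages themselves, which is the convenience Cascading offers over raw MapReduce.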

While Cascading itself doesn't provide a higher-level query language, a derivative open source project called Cascalog does just that. Using the Clojure JVM language, Cascalog implements a query language similar to that of Datalog. Though powerful and expressive, Cascalog is likely to remain a niche query language, as it offers neither the ready familiarity of Hive's SQL-like approach nor Pig's procedural expression. The listing below shows the word-count example in Cascalog: it is significantly terser, if less transparent.

(defmapcatop split [sentence]
  (seq (.split sentence "\\s+")))

(?<- (stdout) [?word ?count]
     (sentence ?s) (split ?s :> ?word)
     (c/count ?count))

Search with Solr

An important component of large-scale data deployments is retrieving and summarizing data. The addition of database layers such as HBase provides easier access to data, but does not provide sophisticated search capabilities.

To solve the search problem, the open source search and indexing platform Solr is often used alongside NoSQL database systems. Solr uses Lucene search technology to provide a self-contained search server product.

For example, consider a social network database where MapReduce is used to compute each person's influence, according to some suitable metric. This ranking would then be reinjected into the database. Indexing with Solr then enables operations over the social network, such as finding the most influential people whose interest profiles mention mobile phones.
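A toy sketch of that combined query follows, with invented names and scores: it filters people by profile text and ranks them by a precomputed influence score, mimicking what a filtered, sorted search query achieves over MapReduce output.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch of the social-network query described above; the data
// and names are invented, and no Solr or Lucene classes are used.
public class InfluenceSearch {

    // A person with a precomputed influence score and an interest profile.
    static class Person {
        final String name;
        final double influence;
        final String profile;

        Person(String name, double influence, String profile) {
            this.name = name;
            this.influence = influence;
            this.profile = profile;
        }
    }

    // Return the n most influential people whose profile mentions the term.
    static List<String> topMatching(List<Person> people, String term, int n) {
        String needle = term.toLowerCase();
        return people.stream()
                .filter(p -> p.profile.toLowerCase().contains(needle))
                .sorted(Comparator.comparingDouble((Person p) -> p.influence)
                        .reversed())
                .limit(n)
                .map(p -> p.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("ana", 0.9, "mobile phones, hiking"),
                new Person("bob", 0.7, "cooking"),
                new Person("cho", 0.8, "mobile phones, jazz"));
        System.out.println(topMatching(people, "mobile", 2));
    }
}
```

In a real deployment the filtering and ranking happen inside the search index rather than in application code, which is what makes the approach scale.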

Originally developed at CNET and now an Apache project, Solr has evolved from being just a text search engine to supporting faceted navigation and results clustering. Additionally, Solr can manage large data volumes over distributed servers. This makes it an ideal solution for result retrieval over big data sets, and a useful component for constructing business intelligence dashboards.


Conclusion

MapReduce, and Hadoop in particular, offers a powerful means of distributing computation among commodity servers. Combined with distributed storage and increasingly user-friendly query mechanisms, the resulting SMAQ architecture brings big data processing within reach of even small teams and solo developers.

It is now economical to conduct extensive investigation into data, or to create data products that rely on complex computations. The resulting explosion in capability has forever altered the landscape of analytics and data warehousing systems, lowering the bar to entry and fostering a new generation of products, services and organizational attitudes, a trend explored more broadly in Mike Loukides' "What is Data Science?" report.

The emergence of Linux gave power to the innovative developer with merely a small Linux server at their desk: SMAQ has the same potential to streamline data centers, foster innovation at the edges of an organization, and enable new startups to cheaply create data-driven businesses.


June 17 2010

From Apache to Health and Human Services

Brian Behlendorf, one of the founders of the Apache web server project and the CollabNet cooperative software development company, is contracting now with the Department of Health and Human Services (HHS) on the CONNECT software project. CONNECT helps hospitals and agencies exchange medical data, which gives doctors critical information to improve patient care.

Behlendorf, along with project leader David Riley, will speak at OSCON about the importance of CONNECT and the way they and their colleagues built a robust community of government staff, volunteers, and healthcare IT vendors around it.

Behlendorf discusses the following in this 18-minute podcast:

  • The role of health data in promoting quality care, in improving our knowledge of what works, and in reducing healthcare costs.
  • How HHS is trying to improve the exchange of patient data for hospitals and doctors, agencies monitoring quality of care, and eventually patients themselves.
  • How, with Behlendorf's help, HHS opened up the CONNECT project, attracted both volunteers and vendors to improve it, and created a community with a sense of ownership.

May 26 2010

Four short links: 26 May 2010

  1. PSTSDK -- Apache-licensed code from Microsoft to read Outlook files. Covered by Microsoft's Open Specification Promise not to assert related patents against users of this library.
  2. Cheap Android Tablet -- not multitouch, but only $136. Good for hacking with in the meantime. (via Hacker News)
  3. Real-Time Collaborative Editing with Websockets, node.js, and Redis -- uses Chrome's websockets alternative to Comet and other long-polling web connections.
  4. XMPP Library for Node.js -- I'm intrigued to see how quickly Node.js, the Javascript server environment, has taken off.

March 15 2010

Four short links: 15 March 2010

  1. A German Library for the 21st Century (Der Spiegel) -- But browsing in Europeana is just not very pleasurable. The results are displayed in thumbnail images the size of postage stamps. And if you click through for a closer look, you're taken to the corresponding institute. Soon you're wandering helplessly around a dozen different museum and library Web sites -- and you end up lost somewhere between the "Vlaamse Kunstcollectie" and the "Wielkopolska Biblioteka Cyfrowa." Would it not be preferable to incorporate all the exhibits within the familiar scope of Europeana? "We would have preferred that," says Gradmann. "But then the museums would not have participated." They insist on presenting their own treasures. This is a problem encountered everywhere around the world: users hate silos but institutions hate the thought of letting go of their content. We're going to have to let go to win. (via Penny Carnaby)
  2. StoryGarden -- a web-based tool for gathering and analyzing a large number of stories contributed by the public. The content of the stories, along with some associated survey questions, are processed in an automated semantic computing process for an immediate, interactive display for the lay public, and in a more thorough manual process for expert analysis.
  3. Google Apps Script -- VBA for the 2010s. Currently mainly for spreadsheets, but some hooks into Gmail and Google Calendar.
  4. There's a Rootkit in the Closet -- lovely explanation of finding and isolating a rootkit, reconstructing how it got there and deconstructing the rootkit to figure out what it did. It's a detective story, no less exciting than when Cliff Stoll wrote The Cuckoo's Egg.
