Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

December 09 2013

How to analyze 100 million images for $624

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.

Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.

I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and only cost $12.48! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)

The Foundations

So, how do I do all that work so quickly and cheaply? The building blocks are the open source Hadoop, HIPI, and OpenCV projects. Unless you were frozen in an Arctic ice-cave for the last decade, you’ll have heard of Hadoop, but the other two are a bit less famous.

HIPI is a Java framework that lets you efficiently process images on a Hadoop cluster. It’s needed because HDFS can’t handle large numbers of files, so it provides a way of bundling images together into much bigger files, and unbundling them on the fly as you process them. It’s been growing in popularity in research and academia, but hasn’t had widespread commercial use yet. I had to fork it and add meta-data support so I could keep track of the files as they went through the system, along with some other fixes. It’s running like a champion now, though, and has enabled everything else we’re doing.

OpenCV is written in C++, but has a recently added Java wrapper. It supports a lot of the fundamental operations you need to implement image-processing algorithms, so I was able to write my advanced (and fairly cunning, if I do say so myself, especially for mustaches!) image analysis and object detection routines using OpenCV for the basic operations.


The first and most time-consuming step is getting your images downloaded. HIPI has a basic distributed image downloader as an example, but you’ll want to make sure the site you’re accessing won’t be overwhelmed. I was focused on large social networks with decent CDNs, so I felt comfortable running several servers in parallel. I did have to alter the HIPI downloader code to add a user agent string so the admins of the sites could contact me if the downloading was causing any problems.

If you have three different external services you’re pulling from, with four servers assigned to each service, you end up taking about 30 days to download 100 million photos. That’s $4,492.80 for the initial pull, which is not chump change, but not wildly outside a startup budget, especially if you plan ahead and use reserved instances to reduce the cost.

Object Recognition

Now you have 100 million images, you need to do something useful with them. Photos contain a massive amount of information, and OpenCV has a lot of the building blocks you’ll need to extract some of the parts you care about. Think about the properties you’re interested in — maybe you want to exclude pictures containing people if you’re trying to create slideshows about places — and then look around at the tools that the library offers; for this example you could search for faces, and identify photos that are faceless as less likely to contain people. Anything you could do on a single image, you can now do in parallel on millions of them.

Before you go ahead and do a distributed run, though, make sure it works. I like to write a standalone command-line version of my image processing step, so I can debug it easily and iterate fast on the algorithm. OpenCV has a lot of convenience functions that make it easy to load images from disk, or even capture them from your webcam, display them, and save them out at the end. Their Java support is quite new, and I hit a few hiccups, but overall, it works very well. This made it possible to write a wrapper that is handed the images from HIPI’s decoding engine, does some processing, and then writes the results out in a text format, one line per image, all within a natively-Java Hadoop job.

Once you’re happy with the way the algorithm is working on your test data, and you’ve wrapped it in a HIPI wrapper, you’re ready to apply it to all the images you’ve downloaded. Just spin up a Hadoop job with your jar file, pointing at the HDFS folders containing the output of the downloader. In our experience, we were able to process around two million 700×700-sized JPEG images on each server per day, so you can use that as a rule of thumb for the size/speed tradeoffs you want to make when you choose how many machines to put in your cluster. Surprisingly, the actual processing we did within our algorithm didn’t affect the speed very much, apparently the object recognition and image-processing work ran fast enough that it was swamped by the time spent on IO.

I hope I’ve left you excited and ready to tackle your own large-scale image processing challenges. Whether you’re a data person who’s interested in image processing or a computer-vision geek who wants to widen your horizons, there’s a lot of new ground to be explored, and it’s surprisingly easy once you put together the pieces.

August 07 2013

Predicting the future: Strata 2014 hot topics

Conferences like Strata are planned a year in advance. The logistics and coordination required for an event of this magnitude takes a lot of planning, but it also takes a decent amount of prediction: Strata needs to skate to where the puck is going.

While Strata New York + Hadoop World 2013 is still a few months away, we’re already guessing at what next year’s Santa Clara event will hold. Recently, the team got together to identify some of the hot topics in big data, ubiquitous computing, and new interfaces. We selected eleven big topics for deeper investigation.

  • Deep learning
  • Time-series data
  • The big data “app stack”
  • Cultural barriers to change
  • Design patterns
  • Laggards and Luddites
  • The convergence of two databases
  • The other stacks
  • Mobile data
  • The analytic life-cycle
  • Data anthropology

Here’s a bit more detail on each of them.

Deep learning

Teaching machines to think has been a dream/nightmare of scientists for a long time. Rather than teaching a machine explicitly, or using Watson-like statistics to figure out the best answer from a mass of data, Deep Learning uses simpler, core ideas and then builds upon them — much as a baby learns sounds, then words, then sentences.

It’s been applied to problems like vision (find an edge, then a shape, then an object) and better voice recognition. But advances in processing and algorithms are making it increasingly attractive for a large number of challenges. A Deep Learning model “copes” better with things its creators can’t foresee, or genuinely new situations. A recent MIT Technology Review article said these approaches improved image recognition by 70%, and improved Android voice recognition 25%. But 80% of the benefits come from additional computing power, not algorithms, so this is stuff that’s only become possible with the advent of cheap, on-demand, highly parallel processing.

The main drivers of this approach are big companies like Google (which acquired DNNResearch), IBM and Microsoft. There are also startups in the machine learning space like Vicarious and Grok (née Numenta).

Deep Learning isn’t without its critics. Something learned in a moment of pain or danger might not be true later on, so the system needs to unlearn — or at least reduce the certainty — of a conclusion. What’s more, certain things might only be true after a sequence of events: once we’ve seen a person put a ball in a box and close the lid, we know there is a ball in the box, but a picture of the box afterward wouldn’t reveal this. Inability to take into account time is one of the criticisms Grok founder Jeff Hawkins levels at Deep Learning.

There’s some good debate, and real progress in AI and machine learning, as a result of the new computing systems that make these models possible. They’ll likely supplant the expert systems (yes/no trees) that are used in many industries, but have fundamental flaws. Ben Goldacre described this problem at Strata in 2012: almost every patient who displays the symptoms of a rare disease instead has two, much more common, diseases with those symptoms.

Also this is why House is a terrible doctor’s show.

In 2014, much of the data science content of Strata will focus on making machines smarter, and much of this will come from abundant back-end processing paired with advances in vision, sensemaking, and context.

Time-series data

Data is often structured according to the way it will be used.

  • To data designers, a graph is a mathematical structure that describes how a pair of objects relate to one another. This is why Facebook’s search tool is called Graph Search. To work with large numbers of relationships, we use a Graph database that organizes everything in it according to how it’s related to everything else. This makes it easy to find things that are linked to one another, like routers in a network or friends at a company, even with millions of connections. As a result, it’s often in the core of a social network’s application stack. Companies like Neo4j and Titan and Vertex make them.
  • On the other hand, a relational database keeps several tables of data (your name; a product purchase) and then links them by a common thread (such as the credit card used to buy the product, or the name of the person to whom it belongs). When most traditional enterprise IT people say “database,” they mean a relational database (RDBMS). The RDBMS has been so successful it’s supplanted most other forms of data storage.

(As a sidenote, at the core of the RDBMS is a “join,” an operation that links two tables. Much of the excitement around NoSQL databases was in fact about doing away with the join, which — though powerful — significantly restricts how quickly and efficiently an RDBMS can process large amounts of data. Ironically, the dominant language for querying many of these NoSQL databases through tools like Impala is now SQL. If the NoSQL movement had instead been called NoJoin, things might have been much more clear.)

Book SpiralBook Spiral

Book Spiral – Seattle Central Library by brewbooks, on Flickr

Data systems are often optimized for a specific use.

  • Think of a coin-sorting machine — it’s really good at organizing many coins of a limited variety (nickels, dimes, pennies, etc.).
  • Now think of a library — it’s really good at a huge diversity of books, often only one or two of each, and not very fast.

Databases are the same: a graph database is built differently from a relational database; an analytical database (used to explore and report on data) is different from an operational one (used in production).

Most of the data in your life — from your Facebook feed to your bank statement — has one common element: time. Time is the primary key of the universe.

Since time is often the common thread in data, optimizing databases and processing systems to be really, really good at handling data over time is a huge benefit for many applications, particularly those that try to find correlations between seemingly different data — does the temperature on your NEST thermostat correlate with an increase in asthma inhaler use? Black Swans aside, time is also useful when trying to predict the future from the past.

Time Series data is at the root of life-logging and the Quantified Self movement, and will be critical for the Internet of Things. It’s a natural way to organize things which, as humans, we fundamentally understand. Time series databases have a long history, and there’s a lot of effort underway to modernize them as well as the analytical tools that crunch the data they contain, so we think time-series data deserves deeper study in 2014.

The Big Data app stack

We think we’re about to see the rise of application suites for big data. Consider the following evolution:

  1. On a mainframe, the hardware, operating system, and applications were often indistinguishable.
  2. Much of the growth of consumer PCs happened because of the separation of these pieces — companies like Intel and Phoenix made the hardware; Microsoft and Red Hat made the OS; and developers like WordPerfect, Lotus, and DBase made the applications.
  3. Eventually, we figured out what the PC was “for” and it acquired a core set of applications without which, it seems, a PC wouldn’t be useful. Those are generally described as “office suites,” and while there was once a rivalry for them, today, they’ve been subsumed by OS makers (Apple, Microsoft, Open Source) while those that didn’t have an OS withered on the vine (Corel).
  4. As we moved onto the web, the same thing started to happen — email, social network, blog, and calendar seemed to be useful online applications now that we were all connected, and the big portal makers like Google, Sina, Yahoo, Naver, and Facebook made “suites” of these things. So, too, did the smartphone platforms, from PalmPilot to Blackberry to Apple and Android.
  5. Today’s private cloud platforms are like yesterday’s operating systems, with OpenStack, CloudPlatform, VMWare, Eucalyptus, and a few others competing based on their compatibility with public clouds, hardware, and applications. Clouds are just going through this transition to apps, and we’re learning that their “app suite” includes things like virtual desktops, disaster recovery, on-demand storage — and of course, big data.

Okay, enough history lesson.

We’re seeing similar patterns emerge in big data. But it’s harder to see what the application suite is before it happens. In 2014, we think we’ll be asking ourselves, what’s the Microsoft Office of Big Data? We can make some guesses:

  • Predicting the future
  • Deciding what people or things are related to other people or things
  • Helping to power augmented reality tools like Google Glass with smart context
  • Making recommendations by guessing what products will appeal to which customers
  • Optimizing bottlenecks in supply chains or processes
  • Identifying health risks or anomalies worthy of investigation

Companies like Wibidata are trying to figure this out — and getting backed by investors with deep pockets. Just as most of the interesting stories about operating systems were the apps that ran on them, and the stories about clouds are things like big data, so the good stories about big data are the “office suites” atop it. Put another way, we don’t know yet what big data is for, but I suspect that in 2014 we’ll start to find out.

Cultural barriers to data-driven change

Every time I talk with companies about data, they love the concept but fail on the execution. There are a number of reasons for this:

  • Incumbency. Yesterday’s leaders were those who could convince others to act in the absence of information. Tomorrow’s leaders are those who can ask the right questions. This means there is a lot of resistance from yesterday’s leaders (think Moneyball).
  • Lack of empowerment. I recently ate a meal in the Pittsburgh airport, and my bill came with a purple pen. I’m now wondering if I tipped differently because of that. What ink colour maximizes per-cover revenues in an airport restaurant? (Admittedly, I’m a bit obsessive.) But there’s no reason someone couldn’t run that experiment, and increase revenues. Are they empowered to do so? How would they capture the data? What would they deem a success? These are cultural and organizational questions that need to be tackled by the company if it is to become data-driven.
  • Risk aversion. Steve Blank says a startup is an organization designed to search for a scalable, repeatable business model. Here’s a corollary: a big company is one designed to perpetuate a scalable, repeatable business model. Change is not in its DNA — predictability is. Since the days of Daniel McCallum, organizational charts and processes fundamentally reinforce the current way of doing things. It often takes a crisis (such as German jet planes in World War Two or Netscape’s attack on Microsoft) to evoke a response (the Lockheed Martin Skunk Works or a free web browser).
  • Improper understanding. Correlation is not causality — there is a correlation between ice cream and drowning, but that doesn’t mean we should ban ice cream. Both are caused by summertime. We should hire more lifeguards (and stock up on ice cream!) in the summer. Yet many people don’t distinguish between correlation and causality. As a species, humans are wired to find patterns everywhere  because a false positive (turning when we hear a rustle in the bushes, only to find there’s nothing there) is less dangerous than a false negative (not turning and getting eaten by a sabre-toothed tiger).
  • Focus on the wrong data. Lean Analytics urges founders to be more data-driven and less self-delusional. But when I recently spoke with executives from DHL’s innovation group, they said that innovation in a big company requires a wilful disregard for data. That’s because the preponderance of data in a big company reinforces the status quo; nascent, disruptive ideas don’t stand a chance. Big organizations have all the evidence they need to keep doing what they have always done.

There are plenty of other reasons why big organizations have a hard time embracing data. Companies like IBM, CGI, and Accenture are minting money trying to help incumbent organizations be the next Netflix and not the next Blockbuster.

What’s more, the advent of clouds, social media, and tools like PayPal or the App Store has destroyed many of the barriers to entry on which big companies rely. As Quentin Hardy pointed out in a recent article, fewer and fewer big firms stick around for the long haul.

Design patterns

As any conference matures, we move into best practices. The way these manifest themselves with architecture is in the form of proven architectures — snippets of recipes people can re-use. Just as a baker knows how to make an icing sauce with fat and sugar — and can adjust it to make myriad variations — so, too, can an architect use a particular architecture to build a known, working component or service.

As Mike Loukides points out, a design pattern is even more abstract than a recipe. It’s like saying, “sweet bread with topping,” which can then be instantiated in any number of different kinds of cake recipes. So, we have a design pattern for “highly available storage” and then rely on proven architectural recipes such as load-balancing, geographic redundancy, and eventual consistency to achieve it.

Such recipes are well understood in computing, and they eventually become standards and appliances. We have a “scale-out” architecture for web computing, where many cheap computers can handle a task, as an Application Delivery Controller (a load balancer) “sprays” traffic across those machines. It’s common wisdom today. But once, it was innovative. Same thing with password recovery mechanisms and hundreds of other building blocks.

We’ll see these building blocks emerge for data systems that meet specific needs. For example, a new technology called homomorphic encryption allows us to analyze data while it is still encrypted, without actually seeing the data. That would, for example, allow us to measure the spread of a disease without violating the privacy of the individual patients. (We had a presenter talk about this at DDBD in Santa Clara.) This will eventually become a vital ingredient in a recipe for “data where privacy is maintained.” There will be other recipes optimized for speed, or resiliency, or cost, all in service of the “highly available storage” pattern.

This is how we move beyond vendors. Just as a scale-out web infrastructure can have an ADC from Radware, Citrix, F5, Riverbed, Cisco, and others (with the same pattern), we’ll see design patterns for big data with components that could come from Cloudera, Hortonworks, IBM, Intel, MapR, Oracle, Microsoft, Google, Amazon, Rackspace, Teradata, and hundreds of others.

Note that many vendors who want to sell “software suites” will hate this. Just as stereo vendors tried to sell all-in-one audio systems, which ultimately weren’t very good, many of today’s commercial providers want to sell turnkey systems that don’t allow the replacement of components. Design patterns and the architectures on which they rely are anathema to these closed systems — and are often where standards tracks emerge. 2014 is when that debate will start out in Big Data.

Laggards and Luddites

Certain industries are inherently risk-averse, or not technological. But that changes fast. A few years ago, I was helping a company called FarmsReach connect restaurants to local farmers and turn the public market into a supply chain hub. We spent a ton of effort building a fax gateway because farmers didn’t have mobile phones, and ultimately, the company pivoted to focus on building networks between farmers.

Today, however, farmers are adopting tech quickly, and they rely on things like GPS-based tractor routing and seed sowing (known as “Satellite Farming”) to get the most from their fields.

As the cost of big data drops and the ease of use increases, we’ll see it applied in many other places. Consider, for example, a city that can’t handle waste disposal. Traditionally, the city would buy more garbage trucks and hire more garbage collectors. But now, it can analyze routing and find places to optimize collection. Unfortunately, this requires increased tracking of workers — something the unions will resist very vocally. We already saw this in education, where efforts to track students were shut down by teachers’ unions.

In 2014, big data will be crossing the chasm, welcoming late adopters and critics to the conversation. It’ll mean broadening the scope of the discussion — and addressing newfound skepticism — at Strata.

Convergence of two databases

If you’re running a data-driven product today, you typically have two parallel systems.

  • One’s in production. If you’re an online retailer, this is where the shopping cart and its contents live, or where the user’s shipping address is stored.
  • The other’s used for analysis. An online retailer might make queries to find out what someone bought in order to handle a customer complaint or generate a report to see which products are selling best.

Analytical technology comes from companies like Teradata, IBM (from the Cognos acquisition), Oracle (from the Hyperion acquisition), SAP, and independent Microstrategy, among many others. They use words like “Data Warehouse” to describe these products, and they’ve been making them for decades. Data analysts work with them, running queries and sending reports to corporate bosses. A standalone analytical data warehouse is commonly accepted wisdom in enterprise IT.

But those data warehouses are getting faster and faster. Rather than running a report and getting it a day later, analysts can explore the data in real time — re-sorting it by some dimension, filtering it in some way, and drilling down. This is often called pivoting, and if you’ve ever used a Pivot Table in Excel you know what it’s like. In data warehouses, however, we’re dealing with millions of rows.

At the same time, operational databases are getting faster and sneakier. Traditionally, a database is the bottleneck in an application because it doesn’t handle concurrency well. If a record is being changed in the database by one person, it’s locked so nobody else can touch it. If I am editing a Word document, it makes sense to lock it so someone else doesn’t edit it — after all, what would we do with the changes we’d both made?

But that model wouldn’t work for Facebook or Twitter. Imagine a world where, when you’re updating your status, all your friends can’t refresh their feeds.

We’ve found ways to fix this. When several people edit a Google Doc at once, for instance, each of their changes is made as a series of small transactions. The document doesn’t really exist — instead, it’s a series of transactional updates, assembled to look like a document. Similarly, when you post something to Facebook, those changes eventually find their way to your friends. The same is true on Twitter or Google+.

These kinds of eventually consistent approaches make concurrent editing possible. They aren’t really new, either: your bank statement is eventually consistent, and when you check it online, the bottom of the statement tells you that the balance is only valid up until a period in the past and new transactions may take a while to post. Here’s what mine says:

Transactions from today are reflected in your balance, but may not be displayed on this page if you recently updated your bankbook, if a paper statement was issued, or if a transaction is backdated. These transactions will appear in your history the following business day.

Clearly, if eventual consistency is good enough for my bank account, it’s good enough for some forms of enterprise data.

So, we have analytical databases getting real-time fast and operational databases increasingly able to do things concurrently without affecting production systems. Which begs the question: why do we have two databases?

This is a massive, controversial issue worth billions of dollars. Take, for example, EMC, which recently merged its Greenplum acquisition into Pivotal. Pivotal’s marketing (“help customers build, deploy, scale, and analyze at an unprecedented velocity”) points at this convergence, which may happen as organizations move their applications into cloud environments (which is partly why Pivotal includes Cloud Foundry, which VMWare acquired).

The change will probably create some huge industry consolidation in the coming years (think Oracle buying Teradata, then selling a unified operational/analytical database). There are plenty of reasons it’s a bad idea, and plenty of reasons why it’s a good one. We think this will be a hot topic in 2014.

Cassandra and the other stacks

Big data has been synonymous with Hadoop. The break-out success of the Hadoop ecosystem has been astonishing, but it does other stacks a disservice. There are plenty of other robust data architectures that have furiously enthusiastic tribes behind them. Cassandra, for example, was created by Facebook, released into the wild, and tamed by Reddit to allow the site to scale to millions of daily visitors atop Amazon with only a handful of employees. MongoDB is another great example, and there are dozens more.

Some of these stacks got wrapped around the axle of the NoSQL debate, which, as I mentioned, might have been better framed as NoJoin. But we’re past that now, and there are strong case studies for many of the stacks. There are also proven affinities between a particular stack (such as Cassandra) and a particular cloud (such as Amazon Web Services) because of their various heritages.

In 2014, we’ll be discussing more abstract topics and regarding every one of these stacks as a tool in a good toolbox.

Mobile data

By next year, there will be more mobile phones in the world than there are humans, over one billion of them “smart.” They are the closest thing we have to a tag for people. Whether measuring mall traffic for shoppers or projecting the source of Malarial outbreaks in Africa, it’s big. One carrier recently released mobile data from the Ivory Coast to researchers.

Just as Time Series data has structure, so does geographic data, much of which lives in Strata’s Connected World track. Mobile data is a precursor to the Internet of Everything, and it’s certainly one of the most prolific structured data sources in the world.

I think concentrating on mobility is critical for another reason, too. The large systems created to handle traffic for the nearly 1,000 carriers in the world are big, fast, and rock solid. An AT&T 5ESS switch or one of the large-scale Operational Support Systems, simply does not fall over.

Other than DNS, the Internet doesn’t really have this kind of industrial-grade system for managing billions of devices, each of which can connect to the others with just a single address. That is astonishing scale, and we tend to ignore it as “plumbing.” In 2014 , the control systems for the Internet of Everything are as likely to come from Big Iron made by Ericsson as they are to come from some Web 2.0 titan.

The analytic life-cycle

The book The Theory That Would Not Die begins with a quote from John Maynard Keynes: “When the facts change, I change my opinion. What do you do, sir?” As this New York Times review of the book observes, “If you are not thinking like a Bayesian, perhaps you should be.”

Bayes’ theorem says that beliefs must be updated based on new evidence — and in an information-saturated world, new evidence arrives constantly, which means the cycle turns quickly. To many readers, this is nothing more than explaining the scientific method. But there are plenty of people who weren’t weaned on experimentation and continuous learning — and even those with a background in science make dumb mistakes, as the Boy Or Girl Paradox handily demonstrates.

Ben Lorica, O’Reilly’s chief scientist (and owner of the enviable Twitter handle @BigData) recently wrote about the lifecycle of data analysis. I wrote another piece on the Lean Analytics cycle with Avinash Kaushik a few months ago. In both cases, it’s an iterative process of hypothesis-forming, experimentation, data collection, and readjustment.

In 2014, we’ll be spending more time looking at the whole cycle of data analysis, including collection, storage, interpretation, and the practice of asking good questions informed by new evidence.

Data anthropology

Data seldom tells the whole story. After flooding in Haiti, mobile phone data suggested people weren’t leaving one affected area for a safe haven. Researchers concluded that they were all sick with cholera and couldn’t move. But by interviewing people on the ground, aid workers found out the real problem was that flooding had destroyed the roads, making it hard to leave.

As this example shows, there’s no substitute for context. In Lean Analytics, we say “Instincts are experiments. Data is proof.” For some reason this resonated hugely and is one of the most favorited/highlighted passages in the book. People want a blend of human and machine, of soft, squishy qualitative data alongside cold, hard quantitative data. We joke that in the early stages of a startup, your only metric should be “how many people have I spoken with?” It’s too early to start counting.

In Ash Maurya’s Running Lean, there’s a lot said about customer development. Learning how to conduct good interviews that don’t lead the witness and measuring the cultural factors that can pollute data is hugely difficult. In The Righteous MindJonathan Haidt says all university research is WEIRD: Western, Educated, Industrialized, Rich, and Democratic. That’s because test subjects are most often undergraduates, who fit this bill. To prove his assertion, Haidt replicated studies done on campus at a McDonald’s a few miles away, with vastly different results.

At the first Strata New York, I actually left the main room one morning to go write a blog post. I was so overcome by the examples of data errors — from bad collection, to bad analysis, to wilfully ignoring the results of good data — that it seemed to me we weren’t paying attention to the right things. If “Data is the new Oil,” then its supply chain is a controversial XL pipeline with woefully few people looking for leaks and faults. Anthropology can fix this, tying quantitative assumptions to verification.

Nobody has championed data anthropology as much as O’Reilly’s own Roger Magoulas, who joined Jon Bruner and Jim Stogdill for a podcast on the subject recently.

So, data anthropology can ensure good data collection, provide essential context to data, and check that the resulting knowledge is producing the intended results. That’s why it’s on our list of hot topics for 2014.

Photo: Book Spiral – Seattle Central Library by brewbooks, on Flickr

November 12 2012

Four short links: 12 November 2012

  1. Teaching Programming to a Highly Motivated Beginner (CACM) — I don’t think there is any better way to internalize knowledge than first spending hours upon hours growing emotionally distraught over such struggles and only then being helped by a mentor. Me, too. Not struggle for struggle’s sake, but because you have built a strong mental map of the problem into which the solution can lock.
  2. Corona (GitHub) — Facebook opensources their improvements to Hadoop’s job tracking, in the name of scalability, latency, cluster utilization, and fairness. (via Chris Aniszczyk)
  3. One Man’s Trash (Bunnie Huang) — Bunnie finds a Chumby relic in a Shenzhen market stall.
  4. Dronestagram — posting pictures of drone strike locations to Instagram. (via The New Aesthetic)

October 25 2012

Four short links: 25 October 2012

  1. Big Data: the Big Picture (Vimeo) — Jim Stogdill’s excellent talk: although Big Data is presented as part of the Gartner Hype Cycle, it’s an epoch of the Information Age which will have significant effects on the structure of corporations and the economy.
  2. Impala (github) — Cloudera’s open source (Apache) implementation of Google’s F1 (PDF), for realtime queries across clusters. Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Furthermore, Impala does not leverage MapReduce, allowing Impala to return result in real-time. (via Wired)
  3. druid (github) — open source (GPLv2) a distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing.
  4. Supersonic (Google Code) — an ultra-fast, column oriented query engine library written in C++. It provides a set of data transformation primitives which make heavy use of cache-aware algorithms, SIMD instructions and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper pipelined CPUs. It is designed to work in a single process. Apache-licensed.

August 22 2012

Seven Reasons Why I Like Spark

A large portion of this week’s Amp Camp at UC Berkeley, is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. Here’s why:

Hadoop integration: Spark can work with files stored in HDFS, an important feature given the amount of investment in the Hadoop Ecosystem. Getting Spark to work with MapR is straightforward.

The Spark interactive Shell: Spark is written in Scala, and has it’s own version of the Scala interpreter. I find this extremely convenient for testing short snippets of code.

The Spark Analytic Suite:

(Figure courtesy of Matei Zaharia)

Spark comes with tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive. If you want to run Spark in the cloud, there are a set of EC2 scripts available.

Resilient Distributed Data sets (RDD’s):
RDD’s are distributed objects that can be cached in-memory, across a cluster of compute nodes. They are the fundamental data objects used in Spark. The crucial thing is that fault-tolerance is built-in: RDD’s are automatically rebuilt if something goes wrong. If you need to test something out, RDD’s can even be used interactively from the Spark interactive shell.

Distributed Operators:
Aside from Map and Reduce, there are many other operators one can use on RDD’s. Once I familiarized myself with how they work, I began converting a few basic machine-learning and data processing algorithms into this framework.

Once you get past the learning curve … iterative programs
It takes some effort to become productive in anything, Spark is no exception. I was a complete Scala newbie so I first had to get comfortable with a new language (apparently, they like underscores – see here, here, and here). Beyond Scala one can use Shark (“SQL” on Spark), and relatively new Java and Python API’s.

You can use the examples that come with Spark to get started, but I found the essential thing is to get comfortable with the built-in distributed operators. Once I learned RDD’s and the operators, I started writing iterative programs to implement a few machine-learning and data processing algorithms. (Since Spark distributes & caches data in-memory, you can write pretty fast machine-learning programs on massive data sets.)

It’s already used in production
Is anyone really using Spark? While the list of companies is still small, judging from the size of the SF Spark Meetup and Amp Camp, I expect many more companies to start deploying Spark. (If you’re in the SF Bay Area, we are starting a new Distributed Data Processing Meetup with Airbnb, and Spark is one one of the topics we’ll cover.)


June 07 2012

Strata Week: Data prospecting with Kaggle

Here are a few of the data stories that caught my attention this week:

Prospecting for data

KaggleThe data science competition site Kaggle is extending its features with a new service called Prospect. Prospect allows companies to submit a data sample to the site without having a pre-ordained plan for a contest. In turn, the data scientists using Kaggle can suggest ways in which machine learning could best uncover new insights and answer less-obvious questions — and what sorts of data competitions could be based on the data.

As GigaOm's Derrick Harris describes it: "It's part of a natural evolution of Kaggle from a plucky startup to an IT company with legs, but it's actually more like a prequel to Kaggle's flagship predictive modeling competitions than it is a sequel." It's certainly a good way for companies to get their feet wet with predictive modeling.

Practice Fusion, a web-based electronic health records system for physicians, has launched the inaugural Kaggle Prospect challenge.

HP's big data plans

Last year, Hewlett Packard made a move away from the personal computing business and toward enterprise software and information management. It's a move that was marked in part by the $10 billion it paid to acquire Autonomy. Now we know a bit more about HP's big data plans for its Information Optimization Portfolio, which has been built around Autonomy's Intelligent Data Operating Layer (IDOL).

ReadWriteWeb's Scott M. Fulton takes a closer look at HP's big data plans.

The latest from Cloudera

Cloudera released a number of new products this week: Cloudera Manager 3.7.6; Hue 2.0.1; and of course CDH 4.0, its Hadoop distribution.

CDH 4.0 includes:

"... high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations."

Social data platform DataSift also announced this week that it was powering its Hadoop clusters with CDH to perform the "Big Data heavy lifting to help deliver DataSift's Historics, a cloud-computing platform that enables entrepreneurs and enterprises to extract business insights from historical public Tweets."

Have data news to share?

Feel free to email us.

OSCON 2012 Data Track — Today's system architectures embrace many flavors of data: relational, NoSQL, big data and streaming. Learn more in the Data track at OSCON 2012, being held July 16-20 in Portland, Oregon.

Save 20% on registration with the code RADAR


May 24 2012

Strata Week: Visualizing a better life

Here are a few of the data stories that caught my attention this week:

Visualizing a better life

How do you compare the quality of life in different countries? As The Guardian's Simon Rogers points out, GDP has commonly been the indicator used to show a country's economic strength, but it's insufficient for comparing the quality of life and happiness of people.

To help build a better picture of what quality of life means to people, the Organization for Economic Cooperation and Development OECD built the Your Better Life Index. The index lets people select the things that matter to them: housing, income, jobs, community, education, environment, governance, health, life satisfaction, safety and work-life balance. The OECD launched the tool last year and offered an update this week, adding data on gender and inequality.

Screenshot from OECD's Your Better Life Index.

"It's counted as a major success by the OECD," writes Rogers, "particularly as users consistently rank quality of life indicators such as education, environment, governance, health, life satisfaction, safety and work-life balance above more traditional ones. Designed by Moritz Stefaner and Raureif, it's also rather beautiful."

The countries that come out on top most often based on users' rankings: "Denmark (life satisfaction and work-life balance), Switzerland (health and jobs), Finland (education), Japan (safety), Sweden (environment), and the USA (income)."

Researchers' access to data

The New York Times' John Markoff examines social science research and the growing problem of datasets that are not made available to other scholars. Opening data helps make sure that research results can be verified. But Markoff suggests that in many cases, data is being kept private and proprietary.

Much of the data he's talking about here is:

"... gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers' privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud."

"The debate will only intensify as large companies with deep pockets do more research about their users," Markoff predicts.

Updates to Hadoop

Apache has released the alpha version of Hadoop 2.0.0. We should stress "alpha" here, and as Hortonworks' Arun Murthy notes, it's "not ready to run in production." However, he adds the update "is still an important step forward, as it represents the very first release that delivers new and important capabilities," including: HDFS HA (manual failover) and next generation MapReduce.

In other Hadoop news, MapR has unveiled a series of new features and initiatives for its Hadoop distribution, including release of a fully compliant ODBC 3.52 driver, support for the Linux Pluggable Authentication Modules (PAM), and the availability of the source code for several of its components.

Have data news to share?

Feel free to email me.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR


May 18 2012

Top Stories: May 14-18, 2012

Here's a look at the top stories published across O'Reilly sites this week.

A federal judge learned to code
The judge presiding over the Oracle/Google case learned Java, and that skill came in handy when coding specifics arose during the trial. It's proof that coding is a part of cultural competence, even if you never do it professionally.

The chicken and egg of big data solutions
So, here we are with all of this disruptive big data technology, but we seem to have lost the institutional wherewithal to do anything with it in a lot of large companies, at least until package solutions come along.

DIY learning: Schoolers, Edupunks, and Makers challenge education
Schoolers, Edupunks and Makers are showing us what's possible when learners, not institutions, own the education that will define their lives.

John Allspaw on DevOps
John Allspaw discusses DevOps in high-volume web companies and the importance of cooperation between development and operations.

JavaScript and Dart: Can we do better?
O'Reilly editor Simon St. Laurent talked with Google's Seth Ladd about the challenges of improving the web. How can we build on JavaScript's ubiquity while addressing performance, team, and scale issues?

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif. Save 20% on registration with the code RADAR20.

May 17 2012

Strata Week: Google unveils its Knowledge Graph

Here's what caught my attention in the data space this week.

Google's Knowledge Graph

Google Knowledge Graph"Google does the semantic Web," says O'Reilly's Edd Dumbill, "except they call it the Knowledge Graph." That Knowledge Graph is part of an update to search that Google unveiled this week.

"We've always believed that the perfect search engine should understand exactly what you mean and give you back exactly what you want," writes Amit Singhal, Senior VP of Engineering, in the company's official blog post.

That post makes no mention of the semantic web, although as ReadWriteWeb's Jon Mitchell notes, the Knowledge Graph certainly relies on it, following on and developing from Google's acquisition of the semantic database Freebase in 2010.

Mitchell describes the enhanced search features:

"Most of Google users' queries are ambiguous. In the old Google, when you searched for "kings," Google didn't know whether you meant actual monarchs, the hockey team, the basketball team or the TV series, so it did its best to show you web results for all of them.

"In the new Google, with the Knowledge Graph online, a new box will come up. You'll still get the Google results you're used to, including the box scores for the team Google thinks you're looking for, but on the right side, a box called "See results about" will show brief descriptions for the Los Angeles Kings, the Sacramento Kings, and the TV series, Kings. If you need to clarify, click the one you're looking for, and Google will refine your search query for you."

Yahoo's fumbles

The news from Yahoo hasn't been good for a long time now, with the most recent troubles involving the departure of newly appointed CEO Scott Thompson over the weekend and a scathing blog post this week by Gizmodo's Mathew Honan titled "How Yahoo Killed Flickr and Lost the Internet." Ouch.

Over on GigaOm, Derrick Harris wonders if Yahoo "sowed the seeds of its own demise with Hadoop." While Hadoop has long been pointed to as a shining innovation from Yahoo, Harris argues that:

"The big problem for Yahoo is that, increasingly, users and advertisers want to be everywhere on the web but at Yahoo. Maybe that's because everyone else that's benefiting from Hadoop, either directly or indirectly, is able to provide a better experience for consumers and advertisers alike."

De-funding data gathering

The appropriations bill that recently passed the U.S. House of Representatives axes funding for the Economic Census and the American Community Survey. The former gathers data about 25 million businesses and 1,100 industries in the U.S., while the latter collects data from three million American households every year.

Census Bureau director Robert Groves writes that the bill "devastates the nation's statistical information about the status of the economy and the larger society." BusinessWeek chimes in that the end to these surveys "blinds business," noting that businesses rely "heavily on it to do such things as decide where to build new stores, hire new employees, and get valuable insights on consumer spending habits."

Got data news to share?

Feel free to email me.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR


May 16 2012

The chicken and egg of big data solutions

Before I came to O'Reilly I was building the "big data and disruptive analytics practice" at a major systems integrator. It was a blast to spend every week talking to customers in different industries who were waking up to the possibilities that technologies like Hadoop offered their businesses. Many of these businesses are going to fundamentally change as they embrace this stuff (or be replaced by those that do). But there's a catch.

Twenty years or so ago large integrators made big business building applications on the then-new relational paradigm. They put in Oracle databases with custom code, wrote PowerBuilder apps on Sybase, and of course lots of businesses rolled their own with VB and SQL Server. It was an era of custom coding where Oracle, Sybase, SQL Server, Informix and etc. were thought of as platforms to build stuff on.

Then the market matured and shifted to package solution implementation. ERP, CRM, …, etc. The big guys focused on integrating again and told their clients there was no ROI in building custom stuff. ROI would come from integrating best-of-breed solutions. Databases became commodity back ends to the applications that were always the real focus.

Now along comes big data, NoSQL, data science, and all that stuff and it seems like we're starting the cycle over again. But this time clients, having been well trained over the last decade or so, aren't having any of that "build it from scratch" mentality. They know that Hadoop and other new technologies can be transformative to their business, but they want it packaged up and solution'ified like they are used to. I heard a lot of "let us know when you have a solution already built or available to buy that does X" in the last year.

Also, lots of the shops that do this stuff at scale are built and staffed around the package implementation model and have shed many of the skills they used to have for custom work. Everything from staffing models to methodologies are oriented toward package installation.

So, here we are with all of this disruptive technology, but we seem to have lost the institutional wherewithal to do anything with it in a lot of large companies. Of course that fact was hard on my numbers. I had a great pipeline of companies with pain to solve, and great technologies to solve it, but too much of the time it was hard to close it without readymade solutions.

Every week I talked to the companies building these new platforms to share leads and talk about their direction. After a while I started cutting them off when they wanted to talk about the features of their next release. I just got to the point where I didn't really care, it just wasn't all that relevant to my customers. I mean, it's important that they are making the platforms more manageable and building bridges to traditional BI, ETL, RDBMS, and the like. But the focus was too much on platforms and tools.

I wanted to know "What are you doing to encourage solution development? Are you staffing a support system for ISVs? What startups and/or established players are you aware of that are building solutions on this platform?" So when I saw this tweet I let out a little yelp. Awesome! The lack of ready-to-install solutions was getting attention, and from Mike Olsen.

You can watch the rest of what Mike Olson said here and you'll find he tells a similar story about the RDBMS historical parallel.

I talked to Mike a few weeks ago to find out what was behind his comment and explore what else they are doing to support solution development. It boils down to what he said — he will help connect you with money — plus a newly launched partner program designed to provide better support to ISVs among others. Also, the continued attention to APIs and tools like Pig and Hive should make it easier for the solution ecosystem to develop. It can only be good for his business to have lots of other companies directly solving business problems, and simply pulling in his platform.

Hortonworks also started a partner program in the fall and I think we'll see a lot more emphasis on this across the space this year. However, at the moment wherever I look (Hortonworks partners, Cloudera Partners, Accel big data portfolio) the focus today remains firmly on platform and tools or partnering with integrators. Tresata, a startup focused on financial risk management, pops up in in a lot of lists as the obvious odd one out — an actual domain-specific solution.

What about other people that could be building solutions? Is it the maturity level of the technology, the lack of penetration of Hadoop etc. into your customer's data centers, or some combination of other factors that is slowing things down?

Of course, during the RDBMS adoption it took a lot of years before the custom era was over and thoroughly replaced by the era of package implementation. The question I'm pondering is whether customer expectations and the pace of technology will make it happen faster this time? Or is the disruptive value of big data going to continue to accrue only to risk-taking early adopters for the foreseeable future?

If you are building a startup based on a solution or application that leverages big data technology, and you aren't being stealthy, I'd love to hear about it in the comments.


May 10 2012

Strata Week: Big data boom and big data gaps

Here are a few of the data stories that caught my attention this week.

Big data booming

The call for speakers for Strata New York has closed, but as Edd Dumbill notes, the number of proposals are a solid indication of the booming interest in big data. The first Strata conference, held in California in 2011, elicited 255 proposals. The following event in New York elicited 230. The most recent Strata, held in California again, had 415 proposals. And the number received for Strata's fall event in New York? That came in at 635.

Edd writes:

"That's some pretty amazing growth. I can thus expect two things from Strata New York. My job in putting the schedule together is going to be hard. And we're going to have the very best content around."

The increased popularity of the Strata conference is just one data point from the week that highlights a big data boom. Here's another: According to a recent report by IDC, the "worldwide ecosystem for Hadoop-MapReduce software is expected to grow at a compound annual rate of 60.2 percent, from $77 million in revenue in 2011 to $812.8 million in 2016."

"Hadoop and MapReduce are taking the software world by storm," says IDC's Carl Olofson. Or as GigaOm's Derrick Harris puts it: "All aboard the Hadoop money train."

A big data gap?

Another report released this week reins in some of the exuberance about big data. This report comes from the government IT network MeriTalk, and it points to a "big data gap" in the government — that is, a gap between the promise and the capabilities of the federal government to make use of big data. That's interesting, no doubt, in terms of the Obama administration's recent $200 million commitment to a federal agency big data initiative.

Among the MeriTalk report's findings: 60% of government IT professionals say their agency is analyzing the data it collects and less than half (40%) are using data to make strategic decisions. Those responding to the survey said they felt as though it would take, on average, three years before their agencies were ready to fully take advantage of big data.

Prismatic and data-mining the news

The largest-ever healthcare fraud scheme was uncovered this past week. Arrests were made in seven cities — some 107 doctors, nurses and social workers were charged, with fraudulent Medicare claims totaling about $452 million. The discoveries about the fraudulent behavior were made thanks in part to data-mining — looking for anomalies in the Medicare filings made by various health care providers.

Prismatic penned a post in which it makes the case for more open data so that there's "less friction" in accessing the sort of information that led to this sting operation.

"Both the recent sting and the Prime case show that you need real journalists and investigators working with technology and data to achieve good results. The challenge now is to scale this recipe and force transparency on a larger scale.

"We need to get more technically sophisticated and start analysing the data sets up front to discover the right questions to ask, not just the answer the questions we already know to ask based on up-front human investigation. If we have to discover each fraud ring or singleton abuse as a one-off case, we'll never be able to wipe out fraud on a large enough scale to matter."

Indeed, despite this being the largest bust ever, it's really just a fraction of the estimated $20 to $100 billion a year in Medicare fraud.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.


April 26 2012

April 17 2012

Microsoft opens up

Open Sign by dlofink, on FlickrWhile Microsoft's previous stance on open source systems is well known, it turns out there's been a serious shift as Microsoft looks to bring more non-.NET programmers into the fold.

On April 12, Jean Paoli, president of a new subsidiary of Microsoft called Microsoft Open Technologies, Inc., wrote about the new initiative. In his words, the subsidiary was created "to advance the company's investment in openness — including interoperability, open standards and open source." This is a public step toward working with open source communities and integrating technologies into Microsoft's closed initiatives, which may not be quite so closed in the future. With that in mind, below I take a look at what's new with Microsoft and open source.

While these projects provide proof that the pendulum is swinging in the open source direction, the impact for Microsoft can and will be much more resounding. New markets, programmers, and communities are at play here if this new tact goes well.

Attracting the polyglot programmers

This shift in ideology will likely help Microsoft on a number of fronts, including finding new programmers and communities. For example, Microsoft may lure developers to Windows 8 — rumored to be launching in October — by making it as easy as possible to get up and running. HTML5/JavaScript as well as C++ can be used to create Windows 8 Metro applications, and Microsoft hasn't forgotten its own .NET developers, who will use C#. The common theme you will see with the Windows 8 release and others is that Microsoft is trying to become less isolated from the rest of the programming community, many of whom are now polyglot programmers.

Hadoop's halo effect

Azure, Microsoft's cloud platform, is slowly gaining momentum as enterprises make the shift to cloud services. The key word here is "slowly." On the other hand, Hadoop, an open source Apache project that's become a central part of the big data movement, has a huge and active community that's improving the code minute by minute. Supporting Hadoop on Azure lets Microsoft incorporate the popularity and visibility of an open source project into a Microsoft initiative that needs more exposure.

A marketing signal

With a Microsoft Openness website that speaks to the relationship it has with open source technologies, and an accompanying Twitter account (@OpenatMicrosoft) with more than 6,500 followers, the Microsoft marketing team also seems to think open source exposure is important. (Side note: Gianugo Rabellino, Microsoft senior director of open source communities, and one of the people tweeting from the @OpenatMicrosoft account, will be presenting at the OSCON conference this summer.)

As Microsoft continues to see viable open source projects gain momentum, you can be sure that it will work on including ways for those languages, libraries, and frameworks to be incorporated into its current and future platforms. But the more meaningful change is that Microsoft is seeing that opening its own technologies to programmers will only make its products better, more accessible, and central to the future of programming.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Photo: Open Sign by dlofink, on Flickr


March 14 2012

Now available: "Planning for Big Data"

Planning for Big DataEarlier this month, more than 2,500 people came together for the O'Reilly Strata Conference in Santa Clara, Calif. Though representing diverse fields, from insurance to media and high-tech to healthcare, attendees buzzed with a new-found common identity: they are data scientists. Entrepreneurial and resourceful, combining programming skills with math, data scientists have emerged as a new profession leading the march toward data-driven business.

This new profession rides on the wave of big data. Our businesses are creating ever more data, and as consumers we are sources of massive streams of information, thanks to social networks and smartphones. In this raw material lies much of value: insight about businesses and markets, and the scope to create new kinds of hyper-personalized products and services.

Five years ago, only big business could afford to profit from big data: Walmart and Google, specialized financial traders. Today, thanks to an open source project called Hadoop, commodity Linux hardware and cloud computing, this power is in reach for everyone. A data revolution is sweeping business, government and science, with consequences as far reaching and long lasting as the web itself.

Where to start?

Every revolution has to start somewhere, and the question for many is "how can data science and big data help my organization?" After years of data processing choices being straightforward, there's now a diverse landscape to negotiate. What's more, to become data driven, you must grapple with changes that are cultural as well as technological.

Our aim with Strata is to help you understand what big data is, why it matters, and where to get started. In the wake the recent conference, we're delighted to announce the publication of our "Planning for Big Data" book. Available as a free download, the book contains the best insights from O'Reilly Radar authors over the past three months, including myself, Alistair Croll, Julie Steele and Mike Loukides.

"Planning for Big Data" is for anybody looking to get a concise overview of the opportunity and technologies associated with big data. If you're already working with big data, hand this book to your colleagues or executives to help them better appreciate the issues and possibilities.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at

Related data reports and ebooks

March 12 2012

O'Reilly Radar Show 3/12/12: Best data interviews from Strata California 2012

Below you'll find the script and associated links from the March 12, 2012 episode of O'Reilly Radar. An archive of past shows is available through O'Reilly Media's YouTube channel and you can subscribe to episodes of O'Reilly Radar via iTunes.

In this special edition of the Radar Show we're bringing you three of our best interviews from the 2012 Strata Conference in California.

First up is Hadoop creator Doug Cutting discussing the similarities between Linux and the big data world. [Interview begins 16 seconds in.]

In our second interview from Strata California, Max Gadney from After the Flood explains the benefits of video data graphics. [Begins at 7:04.]

In our final Strata CA interview, Kaggle's Jeremy Howard looks at the difference between big data and analytics. [Begins at 13:46.]


Just a reminder that you can always catch episodes of O'Reilly Radar at and subscribe to episodes through iTunes.

All of the links and resources mentioned during this episode are posted at

That's all we have for this episode. Thanks for joining us and we'll see you again soon.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

February 13 2012

Four short links: 13 February 2012

  1. Rise of the Independents (Bryce Roberts) -- companies that don't take VC money and instead choose to grow organically: indies. +1 for having a word for this.
  2. The Performance Golden Rule (Steve Souders) -- 80-90% of the end-user response time is spent on the frontend. Check out his graphs showing where load times come from for various popular sites. The backend responds quickly, but loading all the Javascript and images and CSS and embedded autoplaying videos and all that kerfuffle takes much much longer.
  3. Starry Night Comes to Life -- wow, beautiful, must-see.
  4. MapReduce Patterns, Algorithms, and Use Cases -- In this article I digest a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found in the web or scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop’s MapReduce model with Mappers, Reduces, Combiners, Partitioners, and sorting.

February 08 2012

Four short links: 8 February 2012

  1. Mavuno -- an open source, modular, scalable text mining toolkit built upon Hadoop. (Apache-licensed)
  2. Cow Clicker -- Wired profile of Cowclicker creator Ian Bogost. I was impressed by Cow Clickers [...] have turned what was intended to be a vapid experience into a source of camaraderie and creativity. People create communities around social activities, even when they are antisocial. (via BoingBoing)
  3. Unicode Has a Pile of Poo Character (BoingBoing) -- this is perfect.
  4. The Research Works Act and the Breakdown of Mutual Incomprehension (Cameron Neylon) -- an excellent summary of how researchers and publishers view each other and their place in the world.

February 03 2012

Top stories: January 30-February 3, 2012

Here's a look at the top stories published across O'Reilly sites this week.

What is Apache Hadoop?
Apache Hadoop has been the driving force behind the growth of the big data industry. But what does it do, and why do you need all its strangely-named friends? (Related: Hadoop creator Doug Cutting on why Hadoop caught on.)

Embracing the chaos of data
Data scientists, it's time to welcome errors and uncertainty into your data projects. In this interview, Jetpac CTO Pete Warden discusses the advantages of unstructured data.

Moneyball for software engineering, part 2
A look at the "Moneyball"-style metrics and techniques managers can employ to get the most out of their software teams.

With GOV.UK, British government redefines the online government platform
A new beta .gov website in Britain is open source, mobile friendly, platform agnostic, and open for feedback.

When will Apple mainstream mobile payments?
David Sims parses the latest iPhone / near-field-communication rumors and considers the impact of Apple's (theoretical) entrance into the mobile payment space.

Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

February 02 2012

What is Apache Hadoop?

HadoopApache Hadoop has been
the driving force behind the growth of the big data industry. You'll
hear it mentioned often, along with associated technologies such as
Hive and Pig. But what does it do, and why do you need all its
strangely-named friends, such as Oozie, Zookeeper and Flume?

Hadoop brings the ability to cheaply process large amounts of
data, regardless of its structure. By large, we mean from 10-100
gigabytes and above. How is this different from what went before?

Existing enterprise data warehouses and relational databases excel
at processing structured data and can store massive amounts of
data, though at a cost: This requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes
data warehouses unsuited for agile exploration of massive
heterogenous data. The amount of effort required to warehouse data
often means that valuable data sources in organizations are never
mined. This is where Hadoop can make a big difference.

This article examines the components of the Hadoop ecosystem and
explains the functions of each.

The core of Hadoop: MapReduce

Created at
in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most of
today's big data processing. In addition to Hadoop, you'll find
MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.

The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple
nodes. Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive
computing arrays.

At its core, Hadoop is an open source MapReduce
implementation. Funded by Yahoo, it emerged in 2006 and, href="">according to its
creator Doug Cutting, reached "web scale" capability in early

As the Hadoop project matured, it acquired further components to enhance
its usability and functionality. The name "Hadoop" has
come to represent this entire ecosystem. There are parallels
with the emergence of Linux: The name refers strictly to the Linux
kernel, but it has gained acceptance as referring to a complete
operating system.

Hadoop's lower levels: HDFS and MapReduce

Above, we discussed the ability of MapReduce to distribute
computation over multiple servers. For that computation to take
place, each server must have access to the data. This is the role of
HDFS, the Hadoop Distributed File System.

HDFS and MapReduce are robust. Servers in a Hadoop cluster can
fail and not abort the computation process. HDFS ensures data is
replicated with redundancy across the cluster. On completion of a
calculation, a node will write its results back into HDFS.

There are no restrictions on the data that HDFS stores. Data may
be unstructured and schemaless. By contrast, relational databases
require that data be structured and schemas be defined before storing
the data. With HDFS, making sense of the data is the responsibility
of the developer's code.

Programming Hadoop at the MapReduce level is a case of working with the
Java APIs, and manually loading data files into HDFS.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at

Improving programmability: Pig and Hive

Working directly with Java APIs can be tedious and error prone.
It also restricts usage of Hadoop to Java programmers. Hadoop offers
two solutions for making Hadoop programming easier.

  • Pig is a programming
    language that simplifies the common tasks of working with Hadoop:
    loading data, expressing transformations on the data, and storing
    the final results. Pig's built-in operations can make sense of
    semi-structured data, such as log files, and the language is
    extensible using Java to add support for custom data types and

  • Hive enables Hadoop
    to operate as a data warehouse. It superimposes structure on data in HDFS
    and then permits queries over the data using a familiar SQL-like
    syntax. As with Pig, Hive's core capabilities are

Choosing between Hive and Pig can be confusing. Hive
is more suitable for data warehousing tasks, with predominantly
static structure and the need for frequent analysis. Hive's closeness
to SQL makes it an ideal point of integration between Hadoop and
other business intelligence tools.

Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming
data flows for incorporation into larger applications. Pig is a
thinner layer over Hadoop than Hive, and its main advantage is to
drastically cut the amount of code needed compared to direct
use of Hadoop's Java APIs. As such, Pig's intended audience remains
primarily the software developer.

Improving data access: HBase, Sqoop and Flume

At its heart, Hadoop is a batch-oriented system. Data are loaded
into HDFS, processed, and then retrieved. This is somewhat of a
computing throwback, and often, interactive and random access to data
is required.

Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google's
the project's goal is to host billions of rows of data for rapid access.
can use HBase as both a source and a destination for its
computations, and Hive and Pig can be used in combination with

In order to grant random access to the data, HBase does impose a
few restrictions: Performance with Hive is 4-5 times slower than plain
HDFS, and the maximum amount of data you can store is approximately
a petabyte, versus HDFS' limit of over 30PB.

HBase is ill-suited to ad-hoc analytics and more appropriate for
integrating big data as part of a larger application. Use cases
include logging, counting and storing time-series data.

The Hadoop Bestiary

Deployment, configuration and monitoring
Collection and import of log and event data
HBase Column-oriented database scaling to billions of rows
HCatalog Schema and data type sharing over Pig, Hive and MapReduce
Distributed redundant file system for Hadoop
Data warehouse with SQL-like access
Library of machine learning and data mining algorithms
Parallel computation on server clusters
High-level programming language for Hadoop computations
Orchestration and workflow management
Imports data from relational databases
Cloud-agnostic deployment of clusters
Configuration management and coordination

Getting data in and out

Improved interoperability with the rest of the data world is
provided by href="">Sqoop and href="">Flume. Sqoop is a tool designed to import data from
relational databases into Hadoop, either directly into HDFS or into
Hive. Flume is designed to import streaming flows of log data
directly into HDFS.

Hive's SQL friendliness means that it can be used as a point of
integration with the vast universe of database tools capable of making
connections via JBDC or ODBC database drivers.

Coordination and workflow: Zookeeper and Oozie

With a growing family of services running as part of a Hadoop
cluster, there's a need for coordination and naming services. As
computing nodes can come and go, members of the cluster need
to synchronize with each other, know where to access services, and
know how they should be configured. This is the purpose of href="">Zookeeper.

Production systems utilizing Hadoop can often contain complex
pipelines of transformations, each with dependencies on each
other. For example, the arrival of a new batch of data will trigger
an import, which must then trigger recalculations in dependent
datasets. The Oozie
component provides features to manage the workflow and dependencies,
removing the need for developers to code custom solutions.

Management and deployment: Ambari and Whirr

One of the commonly added features incorporated into Hadoop by
distributors such as IBM and Microsoft is monitoring and
administration. Though in an early stage, href="">Ambari aims
to add these features to the core Hadoop project. Ambari is intended to help system
administrators deploy and configure Hadoop, upgrade clusters, and
monitor services. Through an API, it may be integrated with other
system management tools.

Though not strictly part of Hadoop, href="">Whirr is a highly complementary
component. It offers a way of running services, including Hadoop, on
cloud platforms. Whirr is cloud neutral and
currently supports the Amazon EC2 and Rackspace services.

Machine learning: Mahout

Every organization's data are diverse and particular
to their needs. However, there is much less diversity in the kinds of
analyses performed on that data. The href="">Mahout project is a library of
Hadoop implementations of common analytical computations. Use cases
include user collaborative filtering, user recommendations,
clustering and classification.

Using Hadoop

Normally, you will use Hadoop href="">in
the form of a distribution. Much as with Linux before it,
vendors integrate and test the components of the Apache Hadoop
ecosystem and add in tools and administrative features of their

Though not per se a distribution, a managed cloud installation
of Hadoop's MapReduce is also available through Amazon's Elastic
MapReduce service

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


February 01 2012

Why Hadoop caught on

Doug Cutting (@cutting) is a founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. When Cutting expresses surprise at Hadoop's growth — as he does below — that carries a lot of weight.

In the following interview, Cutting explains why he's surprised at Hadoop's ascendance, and he looks at the factors that helped Hadoop catch on. He'll expand on some of these points during his Hadoop session at the upcoming Strata Conference.

Why do you think Hadoop has caught on?

Doug CuttingDoug Cutting: Hadoop is a technology whose time had come. As computer use has spread, institutions are generating vastly more data. While commodity hardware offers affordable raw storage and compute horsepower, before Hadoop, there was no commodity software to harness it. Without tools, useful data was simply discarded.

Open source is a methodology for commoditizing software. Google published its technological solutions, and the Hadoop community at Apache brought these to the rest of the world. Commodity hardware combined with the latent demand for data analysis formed the fuel that Hadoop ignited.

Are you surprised at its growth?

Doug Cutting: Yes. I didn't expect Hadoop to become such a central component of data processing. I recognized that Google's techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.

What role do you see Hadoop playing in the near-term future of data science and big data?

Doug Cutting: Hadoop is a central technology of big data and data science. HDFS is where folks store most of their data, and MapReduce is how they execute most of their analysis. There are some storage alternatives — for example, Cassandra and CouchDB, and useful computing alternatives, like S4, Giraph, etc. — but I don't see any of these replacing HDFS or MapReduce soon as the primary tools for big data.

Long term, we'll see. The ecosystem at Apache is a loosely-coupled set of separate projects. New components are regularly added to augment or replace incumbents. Such an ecosystem can survive the obsolescence of even its most central components.

In your Strata session description, you note that "Apache Hadoop forms the kernel of an operating system for big data." What else is in that operating system? How is that OS being put to use?

Doug Cutting: Operating systems permit folks to share resources, managing permissions and allocations. The two primary resources are storage and computation. Hadoop provides scalable storage through HDFS and scalable computation through MapReduce. It supports authorization, authentication, permissions, quotas and other operating system features. So, narrowly speaking, Hadoop alone is an operating system.

But no one uses Hadoop alone. Rather, folks also use HBase, Hive, Pig, Flume, Sqoop and many other ecosystem components. So, just as folks refer to more than the Linux kernel when they say "Linux," folks often refer to the entire Hadoop ecosystem when they say "Hadoop." Apache BigTop combines many of these ecosystem projects together into a distribution, much like RHL and Ubuntu do for Linux.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20
Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.
Get rid of the ads (sfw)

Don't be the product, buy the product!