
February 01 2012

Why Hadoop caught on

Doug Cutting (@cutting) is a founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. When Cutting expresses surprise at Hadoop's growth — as he does below — that carries a lot of weight.

In the following interview, Cutting explains why he's surprised at Hadoop's ascendance, and he looks at the factors that helped Hadoop catch on. He'll expand on some of these points during his Hadoop session at the upcoming Strata Conference.

Why do you think Hadoop has caught on?

Doug Cutting: Hadoop is a technology whose time had come. As computer use has spread, institutions are generating vastly more data. While commodity hardware offers affordable raw storage and compute horsepower, before Hadoop, there was no commodity software to harness it. Without tools, useful data was simply discarded.

Open source is a methodology for commoditizing software. Google published its technological solutions, and the Hadoop community at Apache brought these to the rest of the world. Commodity hardware combined with the latent demand for data analysis formed the fuel that Hadoop ignited.

Are you surprised at its growth?

Doug Cutting: Yes. I didn't expect Hadoop to become such a central component of data processing. I recognized that Google's techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.

What role do you see Hadoop playing in the near-term future of data science and big data?

Doug Cutting: Hadoop is a central technology of big data and data science. HDFS is where folks store most of their data, and MapReduce is how they execute most of their analysis. There are some storage alternatives (for example, Cassandra and CouchDB) and useful computing alternatives (like S4 and Giraph), but I don't see any of these replacing HDFS or MapReduce soon as the primary tools for big data.

Long term, we'll see. The ecosystem at Apache is a loosely-coupled set of separate projects. New components are regularly added to augment or replace incumbents. Such an ecosystem can survive the obsolescence of even its most central components.
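
For readers new to the MapReduce model Cutting refers to, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is an illustration only and does not use Hadoop's actual API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and a real Hadoop job would distribute these steps across many machines reading from HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence, in one process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical input lines standing in for records stored in HDFS.
    print(word_count(["big data big problems", "data science"]))
```

The point of the split is that the map and reduce functions never share state, which is what lets a framework run them in parallel on commodity hardware.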

In your Strata session description, you note that "Apache Hadoop forms the kernel of an operating system for big data." What else is in that operating system? How is that OS being put to use?

Doug Cutting: Operating systems permit folks to share resources, managing permissions and allocations. The two primary resources are storage and computation. Hadoop provides scalable storage through HDFS and scalable computation through MapReduce. It supports authorization, authentication, permissions, quotas and other operating system features. So, narrowly speaking, Hadoop alone is an operating system.

But no one uses Hadoop alone. Rather, folks also use HBase, Hive, Pig, Flume, Sqoop and many other ecosystem components. So, just as folks refer to more than the Linux kernel when they say "Linux," folks often refer to the entire Hadoop ecosystem when they say "Hadoop." Apache BigTop combines many of these ecosystem projects together into a distribution, much like RHL and Ubuntu do for Linux.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

December 13 2011

Tapping into a world of ambient data

More data was transmitted over the Internet in 2010 than in all previous years combined. That's one reason why this year's Web 2.0 Summit used the "data frame" to explore the new landscape of digital business, from mobile to social to location to government.

Microsoft is part of this conversation about big data, given the immense resources and technical talent the Redmond-based software giant continues to hold. During Web 2.0 Summit, I interviewed Microsoft Fellow David Campbell about his big data work and thinking. A video of our interview is below, with key excerpts added afterward.

What's Microsoft's role in the present and future of big data?

David Campbell: I've been a data geek for 25-plus years. You go back five to seven years ago, it was kind of hard to get some of the younger kids to think that the data space was interesting to solve problems. Databases are kind of boring stuff, but the data space is amazingly exciting right now.

It's a neat thing to have one part of the company that's processing petabytes of data on tens and hundreds of thousands of servers and then another part that's a commercial business. In the last couple of years, what's been interesting is to see them come together, with things that scale even on the commercial side. That's the cool part about it, and the cool part of being at Microsoft now.

What's happening now seems like it wasn't technically possible a few years ago. Is that the case?

David Campbell: Yes, for a variety of reasons. If you think about the costs just to acquire the data, you can still pay people to type stuff in. It's roughly $1 per kilobyte. But you go back 25 or 30 years and virtually all of the data that we were working with had come off human fingertips. Now it's just out there. Even inherently analog things like phone calls and pictures — they're just born digital. To store it, we've gone from $1,000-per-megabyte 25 years ago to $40-per-terabyte for raw storage. That's an incredible shift.
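
To put the storage figures Campbell quotes on a common scale, here is a small arithmetic sketch. It assumes decimal units (1 terabyte = 1,000,000 megabytes) and takes the two prices exactly as quoted; the helper names are invented for this example.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB


def cost_per_tb(dollars_per_mb):
    """Convert a price per megabyte into a price per terabyte."""
    return dollars_per_mb * MB_PER_TB


def price_drop_factor(old_dollars_per_mb, new_dollars_per_tb):
    """How many times cheaper a terabyte is at the new price."""
    return cost_per_tb(old_dollars_per_mb) / new_dollars_per_tb


if __name__ == "__main__":
    # Figures quoted in the interview: $1,000 per MB then, $40 per TB now.
    print(f"Then: ${cost_per_tb(1000):,} per terabyte")
    print(f"Drop: {price_drop_factor(1000, 40):,.0f}x cheaper")
```

Under those assumptions, a terabyte that would have cost about $1 billion 25 years ago now costs $40, a drop of roughly 25 million times.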

How is Microsoft working with data startups?

David Campbell: The interesting thing about the data space is that we're talking about a lot of people with machine learning experience. They know a particular domain, but it's really hard for them to go find a set of customers. So, let's say that they've got an algorithm or a model that might be relevant to 5,000 people. It's really hard for them to go find those people.

We built this thing a couple of years ago called the DataMarket. The idea is to change the route to market. So, people can take their model and place it on the DataMarket and then others can go find it.

Here's the example I use inside the company, for those old enough to remember: When people were building Visual Basic controls, it was way harder to write one than it was to consume one. The guys writing the controls didn't have to go find the guy who was building the dentist app. They just published it in this thing called "Programmer's Paradise," which back then was actually on paper, and then the guy who was writing the dentist's app would go there to find what he needed.

It's the same sort of thing here. How do you connect those people, those data scientists, who are going to remain a rare commodity with the set of people who can make use of the models they have?

How are the tools of data science changing?

David Campbell: Tooling is going to be a big challenge and a big opportunity here. We announced a tool recently that we call the Data Explorer, which lets people discover other forms of data — some in the DataMarket, some that they have. They can mash it up, turn it around and then republish it.

One of the things we looked at when we started building the tools is that people tend to do mashups today in what I was calling a "last-mile tool." They might use Access or Excel or some other tool. When they were done, they could share it with anyone else who had the same tool. The idea of the Data Explorer is to back up one step and produce something that is itself a data source that's then consumable by a large number of last-mile tools. You can program against the service itself to produce applications and whatnot.

How should companies collect and use data? What strategic advice would you offer?

David Campbell: From the data side, we've lived in what we'd call a world of scarcity. We thought that data was expensive to store, so we had to get rid of it as soon as possible. You don't want it unless you have a good use for it. Now we think about data from a perspective of abundance.

Part of the challenge, 10 or 15 years ago, was where do I go get the data? Where do I tap in? But in today's world, everything is so interconnected. It's just a matter of teeing into it. The phrase I've used instead of big data is "ambient data." It's just out there and available.

The recommendation would be to stop and think about the latent value in all that data that's there to be collected and that's fairly easy to store now. That's the challenge and the opportunity for all of us.



This interview was edited and condensed.


July 14 2011

Strata Week: There's money in data sifting

Here are a few of the data stories that caught my attention this week.

Big bucks for DataSift and for data from Twitter's firehose

The social media data mining platform DataSift, one of the two companies that have the rights to re-syndicate the data from Twitter's firehose, announced this week that it has raised $6 million in a Series A round. (The other company with those rights is Gnip, whose handling of the firehose we recently covered.) DataSift aggregates data from other social media streams as well as Twitter, including Facebook, WordPress, and Digg. While providing the tools to "sift" this content and layering it with other metadata makes DataSift compelling, it's the company's connection to Twitter that may have piqued the most interest.

DataSift grew out of the company MediaSift, the same business that created Tweetmeme, a tool that fell into disfavor when Twitter launched its own sharing button. That move on the part of Twitter to take over functions that third-party developers once provided has had some negative implications for those in the Twitter ecosystem. At this stage, it seems like Twitter is willing to leave some of the big data processing to other companies.

Investor Mark Suster of GRP Partners, whose firm was one of the leaders in this round of DataSift's investment, made the announcement that he was "doubling down on the Twitter ecosystem." For its part, DataSift "has a product that will turn the stream into a lake," says Suster. In other words, "The Twitter stream like most others is ephemeral. If you don't bottle it as it passes by you it's gone. DataSift has a product that builds a permanent database for you of just the information you want to capture."

But Suster's announcement also reiterates the importance of Twitter, something that seems particularly relevant in light of the new Google Plus. Suster describes Twitter as real-time, open, asymmetric, social, viral, location-aware, a referral network, explicit, and implicit. But as the buzz over Google Plus continues, it's not clear that Twitter still has the market cornered on all of these characteristics.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

What's under the Google Plus hood?

Speaking of Google Plus, there's been lots of commentary and speculation about how successful the launch of the new social network has been, gauged in part by how quickly that network is growing. According to Ancestry.com founder Paul Allen, Google Plus was set to break the 10 million user mark on July 12, just two weeks after its launch. UberMedia's Bill Gross went so far as to predict that Google Plus would become the fastest growing social network in history.

So how does Google do it (in terms of the technology)? According to the project's technical lead and OSCON speaker, Joseph Smarr:

Our stack is pretty standard fare for Google apps these days: we use Java servlets for our server code and JavaScript for the browser-side of the UI, largely built with the (open-source) Closure framework, including Closure's JavaScript compiler and template system. A couple nifty tricks we do: we use the HTML5 History API to maintain pretty-looking URLs even though it's an AJAX app (falling back on hash-fragments for older browsers); and we often render our Closure templates server-side so the page renders before any JavaScript is loaded, then the JavaScript finds the right DOM nodes and hooks up event handlers, etc. to make it responsive (as a result, if you're on a slow connection and you click on stuff really fast, you may notice a lag before it does anything, but luckily most people don't run into this in practice). Our backends are built mostly on top of BigTable and Colossus/GFS, and we use a lot of other common Google technologies such as MapReduce (again, like many other Google apps do).


(Google's Joseph Smarr, a member of the Google+ team, will discuss the future of the social web at OSCON. Save 20% on registration with the code OS11RAD.)

Data products for education

The charitable giving site DonorsChoose has been running a contest called Hacking Education, and the contest's finalists have just been announced. DonorsChoose lets people make charitable contributions to public schools, supporting teachers' projects with a Kickstarter-like site for education. DonorsChoose opened up its data to developers; this data encompassed more than 300,000 classroom projects that have inspired some $80 million in charitable giving.

The finalists were chosen from over 50 apps and analyses and included a visualization of the kinds of projects teachers proposed and the kinds donors supported, a .NET Factbook, and an automatic press release system so that local journalists could be notified about projects. The grand prize winner has yet to be chosen, but that project will receive a trophy — and a big thumbs up — from Stephen Colbert.

Got data news?

Feel free to email me.


