What is Apache Hadoop?

Apache Hadoop has been
the driving force behind the growth of the big data industry. You'll
hear it mentioned often, along with associated technologies such as
Hive and Pig. But what does it do, and why do you need all its
strangely-named friends, such as Oozie, Zookeeper and Flume?

Hadoop brings the ability to cheaply process large amounts of
data, regardless of its structure. By large, we mean from 10-100
gigabytes and above. How is this different from what went before?


Existing enterprise data warehouses and relational databases excel
at processing structured data and can store massive amounts of
data, though at a cost: This requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes
data warehouses unsuited for agile exploration of massive
heterogeneous data. The amount of effort required to warehouse data
often means that valuable data sources in organizations are never
mined. This is where Hadoop can make a big difference.

This article examines the components of the Hadoop ecosystem and
explains the functions of each.

The core of Hadoop: MapReduce

Created at
Google
in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most of
today's big data processing. In addition to Hadoop, you'll find
MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.


The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple
nodes. Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive
computing arrays.
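
To see the shape of the idea without any Hadoop machinery, here is a
toy word count in plain Java: the input is split into chunks, a "map"
step counts words in each chunk in parallel threads, and a "reduce"
step merges the partial counts. This is only an illustration of the
pattern; the real Hadoop API appears later in the article.

    import java.util.*;
    import java.util.concurrent.*;

    // Toy illustration of the MapReduce pattern in plain Java (no Hadoop):
    // split the input, "map" each chunk to partial word counts in parallel,
    // then "reduce" the partial results into a single answer.
    public class ToyMapReduce {

        // Map phase: count words within one chunk of the input.
        static Map<String, Integer> map(String chunk) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : chunk.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
            return counts;
        }

        // Reduce phase: merge the partial counts from all chunks.
        static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
            Map<String, Integer> total = new HashMap<>();
            for (Map<String, Integer> partial : partials) {
                partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
            }
            return total;
        }

        public static void main(String[] args) throws Exception {
            List<String> chunks = Arrays.asList(
                    "the quick brown fox", "jumps over the lazy dog", "the end");

            // Run the map calls in parallel; on Hadoop these would be separate nodes.
            ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                futures.add(pool.submit(() -> map(chunk)));
            }
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (Future<Map<String, Integer>> future : futures) {
                partials.add(future.get());
            }
            pool.shutdown();

            System.out.println(reduce(partials));  // e.g. {the=3, quick=1, ...}
        }
    }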

At its core, Hadoop is an open source MapReduce
implementation. Funded by Yahoo, it emerged in 2006 and, according to its
creator Doug Cutting (http://research.yahoo.com/files/cutting.pdf),
reached "web scale" capability in early 2008.

As the Hadoop project matured, it acquired further components to enhance
its usability and functionality. The name "Hadoop" has
come to represent this entire ecosystem. There are parallels
with the emergence of Linux: The name refers strictly to the Linux
kernel, but it has gained acceptance as referring to a complete
operating system.

Hadoop's lower levels: HDFS and MapReduce

Above, we discussed the ability of MapReduce to distribute
computation over multiple servers. For that computation to take
place, each server must have access to the data. This is the role of
HDFS, the Hadoop Distributed File System.

HDFS and MapReduce are robust: servers in a Hadoop cluster can
fail without aborting the computation. HDFS replicates data
redundantly across the cluster. On completion of a
calculation, a node will write its results back into HDFS.



There are no restrictions on the data that HDFS stores. Data may
be unstructured and schemaless. By contrast, relational databases
require that data be structured and schemas be defined before storing
the data. With HDFS, making sense of the data is the responsibility
of the developer's code.
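
As a minimal sketch of what that looks like (using the standard
org.apache.hadoop.fs.FileSystem client; the namenode address and the
file path below are made up for the example), HDFS reads and writes
are plain byte-stream I/O, and any interpretation of those bytes is
left to the code doing the reading:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: HDFS stores plain byte streams; the reading code decides
    // what, if anything, the bytes mean.
    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode; normally picked up from core-site.xml.
            conf.set("fs.default.name", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write an unstructured log line -- no schema is declared anywhere.
            Path path = new Path("/data/events/2012-01-19.log");
            FSDataOutputStream out = fs.create(path);
            out.writeBytes("2012-01-19T10:00:00 user=42 action=login\n");
            out.close();

            // Read it back; parsing (or not) is entirely up to this code.
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path)));
            System.out.println(reader.readLine());
            reader.close();
            fs.close();
        }
    }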



Programming Hadoop at the MapReduce level is a case of working with the
Java APIs, and manually loading data files into HDFS.
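
For instance, a minimal word-count job against the Hadoop Java
MapReduce API looks roughly like the following; the wiring mirrors
the stock WordCount example shipped with Hadoop, and the input and
output paths are placeholders:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // The canonical word-count job, trimmed down: the Mapper emits
    // (word, 1) pairs, the Reducer sums the counts per word, and the
    // Job ties them together and runs over files in HDFS.
    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);  // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum));  // emit (word, total)
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");  // Job.getInstance(conf) on newer Hadoop
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output live in HDFS; these paths are examples only.
            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }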



Improving programmability: Pig and Hive

Working directly with the Java APIs can be tedious and error prone,
and it restricts the use of Hadoop to Java programmers. Hadoop offers
two higher-level solutions that make its programming easier.


  • Pig is a programming
    language that simplifies the common tasks of working with Hadoop:
    loading data, expressing transformations on the data, and storing
    the final results. Pig's built-in operations can make sense of
    semi-structured data, such as log files, and the language is
    extensible using Java to add support for custom data types and
    transformations.


  • Hive enables Hadoop
    to operate as a data warehouse. It superimposes structure on data in HDFS
    and then permits queries over the data using a familiar SQL-like
    syntax. As with Pig, Hive's core capabilities are
    extensible.



Choosing between Hive and Pig can be confusing. Hive
is more suitable for data warehousing tasks, with predominantly
static structure and the need for frequent analysis. Hive's closeness
to SQL makes it an ideal point of integration between Hadoop and
other business intelligence tools.

Pig gives the developer more agility when exploring large datasets,
allowing the development of succinct scripts that transform data
flows for incorporation into larger applications. Pig is a
thinner layer over Hadoop than Hive, and its main advantage is to
drastically cut the amount of code needed compared to direct
use of Hadoop's Java APIs. As such, Pig's intended audience remains
primarily the software developer.
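
To make that concrete, here is a sketch of a small Pig data flow
driven from Java through Pig's embedded PigServer API; the Pig Latin
statements are passed as strings, and the log format, field names and
paths are invented for the example:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    // Sketch: running a small Pig Latin data flow from Java via PigServer.
    // The input format, field names and output path are illustrative only.
    public class PigSketch {
        public static void main(String[] args) throws Exception {
            // ExecType.MAPREDUCE runs against a cluster; LOCAL is handy for testing.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Load a space-delimited log file, declaring a loose schema on the fly.
            pig.registerQuery(
                    "logs = LOAD 'access.log' USING PigStorage(' ') " +
                    "AS (ip:chararray, ts:chararray, url:chararray);");

            // Group and count requests per URL -- a typical exploratory transformation.
            pig.registerQuery("by_url = GROUP logs BY url;");
            pig.registerQuery(
                    "counts = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hits;");

            // Materialize the result (into HDFS when running on a cluster).
            pig.store("counts", "url_counts");
            pig.shutdown();
        }
    }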

Improving data access: HBase, Sqoop and Flume

At its heart, Hadoop is a batch-oriented system. Data are loaded
into HDFS, processed, and then retrieved. This is something of a
computing throwback, and often interactive and random access to data
is required.

Enter HBase, a column-oriented database that runs on top of HDFS.
Modeled after Google's BigTable
(http://research.google.com/archive/bigtable.html), the project's
goal is to host billions of rows of data for rapid access. MapReduce
can use HBase as both a source and a destination for its
computations, and Hive and Pig can be used in combination with
HBase.

In order to grant random access to the data, HBase does impose a
few restrictions: Performance with Hive is 4-5 times slower than plain
HDFS, and the maximum amount of data you can store is approximately
a petabyte, versus HDFS' limit of over 30PB.

HBase is ill-suited to ad-hoc analytics and more appropriate for
integrating big data as part of a larger application. Use cases
include logging, counting and storing time-series data.
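
A sketch of that access pattern with the HBase Java client (the
HTable API; the table, column family and row-key scheme are invented,
and the table is assumed to already exist):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: random writes and reads against an HBase table keyed for
    // time-series data. Assumes a table 'metrics' with column family 'd'.
    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
            HTable table = new HTable(conf, "metrics");

            // Write one cell; the row key encodes entity and timestamp so that
            // related readings sort together for fast range scans.
            Put put = new Put(Bytes.toBytes("sensor42#20120119100000"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            // Random read of the same row -- the interactive access that plain
            // HDFS does not provide.
            Result row = table.get(new Get(Bytes.toBytes("sensor42#20120119100000")));
            byte[] value = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
            System.out.println("temp = " + Bytes.toString(value));

            table.close();
        }
    }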




The Hadoop Bestiary

  • Ambari: Deployment, configuration and monitoring
  • Flume: Collection and import of log and event data
  • HBase: Column-oriented database scaling to billions of rows
  • HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
  • HDFS: Distributed redundant file system for Hadoop
  • Hive: Data warehouse with SQL-like access
  • Mahout: Library of machine learning and data mining algorithms
  • MapReduce: Parallel computation on server clusters
  • Pig: High-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Sqoop: Imports data from relational databases
  • Whirr: Cloud-agnostic deployment of clusters
  • Zookeeper: Configuration management and coordination



Getting data in and out

Improved interoperability with the rest of the data world is
provided by Sqoop (https://github.com/cloudera/sqoop/wiki) and Flume
(https://cwiki.apache.org/FLUME/). Sqoop is a tool designed to import
data from relational databases into Hadoop, either directly into HDFS
or into Hive. Flume is designed to import streaming flows of log data
directly into HDFS.

Hive's SQL friendliness means that it can be used as a point of
integration with the vast universe of database tools capable of making
connections via JDBC or ODBC database drivers.
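
A sketch of that integration path from Java (the driver class and URL
shown are for the HiveServer2 JDBC driver; older Hive releases use
org.apache.hadoop.hive.jdbc.HiveDriver and a jdbc:hive:// URL, and
the page_views table is hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch: querying Hive over plain JDBC, the same way any SQL-speaking
    // business intelligence tool would connect to it.
    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // A hypothetical table of page views, queried with SQL-like HiveQL.
            ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }

            rs.close();
            stmt.close();
            conn.close();
        }
    }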

Coordination and workflow: Zookeeper and Oozie

With a growing family of services running as part of a Hadoop
cluster, there's a need for coordination and naming services. As
computing nodes can come and go, members of the cluster need
to synchronize with each other, know where to access services, and
know how they should be configured. This is the purpose of Zookeeper
(http://zookeeper.apache.org/).
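
A sketch of the naming and discovery pattern with the ZooKeeper Java
client (the server address, znode paths and payload are invented, the
parent path is assumed to exist, and error handling is omitted):

    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: a worker advertises itself under /services/workers with an
    // ephemeral znode (removed automatically if the worker dies), and a
    // client lists the children of that path to find the live workers.
    public class ZooKeeperSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, event -> { });

            // Register this worker; EPHEMERAL_SEQUENTIAL gives it a unique,
            // self-cleaning name. Assumes /services/workers already exists.
            String path = zk.create("/services/workers/worker-",
                    "host1:9000".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("registered as " + path);

            // Discover all currently live workers.
            List<String> workers = zk.getChildren("/services/workers", false);
            System.out.println("live workers: " + workers);

            zk.close();
        }
    }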

Production systems utilizing Hadoop often contain complex pipelines
of transformations with dependencies between them. For example, the
arrival of a new batch of data will trigger
an import, which must then trigger recalculations in dependent
datasets. The Oozie
component provides features to manage the workflow and dependencies,
removing the need for developers to code custom solutions.
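
A sketch of kicking off such a workflow from Java with the Oozie
client API (the Oozie server URL, the workflow application path in
HDFS and the inputDir property are placeholders; the workflow itself
is defined separately in an XML file stored at that path):

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    // Sketch: submitting an Oozie workflow job from Java and polling its status.
    public class OozieSketch {
        public static void main(String[] args) throws Exception {
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // Job configuration: point at a workflow definition stored in HDFS.
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode:8020/apps/import-and-recalc");
            conf.setProperty("inputDir", "/data/incoming/2012-01-19");

            // Submit and start the workflow, then check on it.
            String jobId = oozie.run(conf);
            System.out.println("workflow job submitted: " + jobId);

            Thread.sleep(10 * 1000);
            WorkflowJob job = oozie.getJobInfo(jobId);
            System.out.println("status: " + job.getStatus());
        }
    }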

Management and deployment: Ambari and Whirr

One of the commonly added features incorporated into Hadoop by
distributors such as IBM and Microsoft is monitoring and
administration. Though in an early stage, Ambari
(http://incubator.apache.org/ambari/) aims
to add these features to the core Hadoop project. Ambari is intended to help system
administrators deploy and configure Hadoop, upgrade clusters, and
monitor services. Through an API, it may be integrated with other
system management tools.



Though not strictly part of Hadoop, Whirr
(http://whirr.apache.org/) is a highly complementary
component. It offers a way of running services, including Hadoop, on
cloud platforms. Whirr is cloud neutral and
currently supports the Amazon EC2 and Rackspace services.

Machine learning: Mahout



Every organization's data are diverse and particular to its needs.
However, there is much less diversity in the kinds of analyses
performed on that data. The Mahout project
(http://mahout.apache.org/) is a library of
Hadoop implementations of common analytical computations. Use cases
include user collaborative filtering, user recommendations,
clustering and classification.
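
As a sketch of the library in use, here is user-based collaborative
filtering with Mahout's Taste API; the ratings.csv file, its
"userID,itemID,preference" format and user 42 are assumptions made
for the example:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // Sketch: user-based collaborative filtering with Mahout's Taste API.
    // ratings.csv is assumed to hold "userID,itemID,preference" lines.
    public class MahoutSketch {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top three item recommendations for (hypothetical) user 42.
            List<RecommendedItem> items = recommender.recommend(42, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " scored " + item.getValue());
            }
        }
    }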


Using Hadoop

Normally, you will use Hadoop in the form of a distribution
(http://radar.oreilly.com/2012/01/big-data-ecosystem.html). Much as
with Linux before it,
vendors integrate and test the components of the Apache Hadoop
ecosystem and add in tools and administrative features of their
own.

Though not per se a distribution, a managed cloud installation
of Hadoop's MapReduce is also available through Amazon's Elastic
MapReduce service.

