
January 27 2012

Top stories: January 23-27, 2012

Here's a look at the top stories published across O'Reilly sites this week.

On pirates and piracy
Mike Loukides: "I'm not willing to have the next Bach, Beethoven, or Shakespeare post their work online, only to have it taken down because they haven't paid off a bunch of executives who think they own creativity."

Microsoft's plan for Hadoop and big data
Strata conference chair Edd Dumbill takes a look at Microsoft's plans for big data. By embracing Hadoop, the company aims to keep Windows and Azure as a standards-friendly option for data developers.

Coming soon to a location near you: The Amazon Store?
Jason Calacanis says an Amazon retail presence isn't out of the question and that AmazonBasics is a preview of what's to come.

Survey results: How businesses are adopting and dealing with data
Feedback from a recent Strata Online Conference suggests there's a large demand for clear information on what big data is and how it will change business.

Why the fuss about iBooks Author?
Apple doesn't have an objective to move the publishing industry forward. With iBooks Author, the company sees an opportunity to reinvent this industry within its own closed ecosystem.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

January 26 2012

Strata Week: Genome research kicks up a lot of data

Here are a few of the data stories that caught my attention this week.

Genomics data and the cloud

GigaOm's Derrick Harris explores some of the big data obstacles and opportunities surrounding genome research. He notes that:

When the Human Genome Project successfully concluded in 2003, it had taken 13 years to complete its goal of fully sequencing the human genome. Earlier this month, two firms — Life Technologies and Illumina — announced instruments that can do the same thing in a day, one for only $1,000. That's likely going to mean a lot of data.

But as Harris observes, the promise of quick and cheap genomics is leading to other problems, particularly as the data reaches a heady scale. A fully sequenced human genome is about 100GB of raw data. But citing DNAnexus founder Andreas Sundquist, Harris says that:

... volume increases to about 1TB by the time the genome has been analyzed. He [Sundquist] also says we're on pace to have 1 million genomes sequenced within the next two years. If that holds true, there will be approximately 1 million terabytes (or 1,000 petabytes, or 1 exabyte) of genome data floating around by 2014.

That makes the promise of a $1,000 genome sequencing service challenging when it comes to storing and processing petabytes of data. Harris posits that it will be cloud computing to the rescue here, providing the necessary infrastructure to handle all that data.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Stanley Fish versus the digital humanities

Literary critic and New York Times opinionator Stanley Fish has been on a bit of a rampage in recent weeks, taking on the growing field of the "digital humanities." Prior to the annual Modern Language Association meeting, Fish cautioned that alongside the traditional panels and papers on Ezra Pound and William Shakespeare and the like, there were going to be a flood of sessions devoted to:

...'the digital humanities,' an umbrella term for new and fast-moving developments across a range of topics: the organization and administration of libraries, the rethinking of peer review, the study of social networks, the expansion of digital archives, the refining of search engines, the production of scholarly editions, the restructuring of undergraduate instruction, the transformation of scholarly publishing, the re-conception of the doctoral dissertation, the teaching of foreign languages, the proliferation of online journals, the redefinition of what it means to be a text, the changing face of tenure — in short, everything.

That "everything" was narrowed down substantially in Fish's editorial this week, in which he blasted the digital humanities for what he sees as its fixation "with matters of statistical frequency and pattern." In other words: data and computational analysis.

According to Fish, the problem with digital humanities is that this new scholarship relies heavily on the machine — and not the literary critic — for interpretation. Fish contends that digital humanities scholars are all teams of statisticians and positivists, busily digitizing texts so they can data-mine them and systematically and programmatically uncover something of interest — something worthy of interpretation.

University of Illinois, Urbana-Champaign English professor Ted Underwood argues that Fish not only mischaracterizes what digital humanities scholars do, but he misrepresents how his own interpretive tradition works:

... by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account, we normally begin with a hypothesis (which seems to have sprung, like Sin, fully-formed from our head), and test it against a single sentence.

One of the most interesting responses to Fish's recent rants about the humanities' digital turn comes from University of North Carolina English professor Daniel Anderson, who demonstrates in a video a far fuller picture of what "digital" "data" — creation and interpretation — looks like.

Hadoop World merges with O'Reilly's Strata New York conference

Two of the big data events announced they'll be merging this week: Hadoop World will now be part of the Strata Conference in New York this fall.

[Disclosure: The Strata events are run by O'Reilly Media.]

Cloudera first started Hadoop World back in 2009, and as Hadoop itself has seen increasing adoption, Hadoop World, too, has become more popular. Strata is a newer event — its first conference was held in Santa Clara, Calif., in February 2011, and it expanded to New York in September 2011.

With the merger, Hadoop World will be a featured program at Strata New York 2012 (Oct. 23-25).

In other Hadoop-related news this week, Strata chair Edd Dumbill took a close look at Microsoft's Hadoop strategy. Although it might be surprising that Microsoft has opted to adopt an open source technology as the core of its big data plans, Dumbill argues that:

Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly-growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.

Also, Cloudera data scientist Josh Willis takes a closer look at one aspect of that ecosystem: the work of scientists whose research falls outside of statistics and machine learning. His blog post specifically addresses one use case for Hadoop — seismology, for which there is now Seismic Hadoop — but the post also provides a broad look at what constitutes the practice of data science.

Got data news?

Feel free to email me.

Photo: Bootstrap DNA by Charles Jencks, 2003 by mira66, on Flickr


January 25 2012

Microsoft's plan for Hadoop and big data

Microsoft has placed Apache Hadoop at the core of its big data strategy. It's a move that might seem surprising to the casual observer, being a somewhat enthusiastic adoption of a significant open source product.

The reason for this move is that Hadoop, by its sheer popularity, has
become the de facto standard for distributed data crunching. By
embracing Hadoop, Microsoft allows its customers to access the
rapidly-growing Hadoop ecosystem and take advantage of a growing
talent pool of Hadoop-savvy developers.

Microsoft's goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptations it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.

Microsoft's Hadoop distribution

The Microsoft distribution of Hadoop is currently in "Customer Technology Preview"
phase. This means it is undergoing evaluation in the field by groups
of customers. The expected release time is toward the middle of
2012, but will be influenced by the results of the technology
preview program.

Microsoft's Hadoop distribution is usable either on-premise with
Windows Server, or in Microsoft's cloud platform, Windows Azure. The
core of the product is in the MapReduce, HDFS, Pig and Hive
components of Hadoop. These are certain to ship in the 1.0
release.

As Microsoft's aim is for 100% Hadoop compatibility, it is likely
that additional components of the Hadoop ecosystem such as
Zookeeper, HBase, HCatalog and Mahout will also be shipped.

Additional components integrate Hadoop with
Microsoft's ecosystem of business intelligence and analytical products:


  • Connectors for Hadoop, integrating it with SQL Server and SQL Server Parallel Data Warehouse.

  • An ODBC driver for Hive, permitting any Windows application to access and run queries against the Hive data warehouse (a minimal sketch follows this list).

  • An Excel Hive Add-in, which enables the movement of data directly from Hive into Excel or PowerPivot.


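To make the ODBC route concrete, here is a minimal sketch of running a Hive query from a Windows application using Python and the pyodbc module. The DSN, table and column names are hypothetical placeholders; the exact connection string depends on how the Hive ODBC driver is configured.

```python
# Minimal sketch: querying Hive over ODBC with pyodbc.
# Assumes a system DSN named "HiveDSN" has been configured to point at the
# cluster's Hive server; the DSN, table and columns are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cursor = conn.cursor()

# An ordinary HiveQL query; results come back as regular ODBC rows.
cursor.execute("SELECT product, SUM(quantity) AS total FROM sales GROUP BY product")
for product, total in cursor.fetchall():
    print(product, total)

conn.close()
```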

On the back end, Microsoft offers Hadoop performance improvements,
integration with Active Directory to facilitate access control, and
with System Center for administration and management.

How Hadoop integrates with the Microsoft ecosystem. (Source: microsoft.com.)

Developers, developers, developers

One of the most interesting features of Microsoft's work with
Hadoop is the addition of a JavaScript API. Working with Hadoop at
a programmatic level can be tedious: this is why higher-level languages
such as Pig emerged.

Driven by its focus on the software developer as an important
customer, Microsoft chose to add a JavaScript layer to the Hadoop
ecosystem. Developers can use it to create MapReduce jobs, and even
interact with Pig and Hive from a browser environment.
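For comparison, here is a rough sketch of the kind of per-record code the JavaScript layer, and higher-level languages such as Pig and Hive, are meant to spare developers from writing. This is not Microsoft's API; it is a generic word count written for Hadoop Streaming, where the mapper and reducer are plain scripts reading and writing tab-separated lines on standard input and output (shown here in Python, with an illustrative launch command in the comments).

```python
# wordcount_streaming.py -- a minimal Hadoop Streaming sketch. One file holds
# both phases, selected by a command-line argument. A typical launch looks
# something like this (the streaming jar's path varies by distribution):
#   hadoop jar hadoop-streaming.jar \
#     -file wordcount_streaming.py \
#     -mapper "python wordcount_streaming.py map" \
#     -reducer "python wordcount_streaming.py reduce" \
#     -input /input -output /output
import sys

def mapper():
    # Emit "word<TAB>1" for every word on standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so counts for the same
    # word arrive contiguously and can be summed in a single pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Higher-level layers exist precisely so that this per-record plumbing doesn't have to be written by hand.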

The real advantage of the JavaScript layer should show itself in
integrating Hadoop into a business environment, making it easy for
developers to create intranet analytical environments accessible by
business users. Combined with Microsoft's focus on bringing
server-side JavaScript to Windows and Azure through Node.js, this gives an interesting glimpse into Microsoft's view
of where developer enthusiasm and talent will lie.

It's also good news for the broader Hadoop community, as
Microsoft intends to contribute its JavaScript API to the Apache
Hadoop open source project itself.



The other half of Microsoft's software development environment is of course the .NET platform. With Microsoft's Hadoop distribution, it will be possible to create MapReduce jobs from .NET, though by using the Hadoop APIs directly. It is likely that higher-level interfaces will emerge in future releases. The same applies to Visual Studio, which over time will get increasing levels of Hadoop project support.

Streaming data and NoSQL

Hadoop covers part of the big data problem, but what about
streaming data processing or NoSQL databases? The answer comes in
two parts, covering existing Microsoft products and future
Hadoop-compatible solutions.

Microsoft has some established products: its streaming data solution is called StreamInsight, and for NoSQL, Windows Azure has a product called Azure Tables (http://www.windowsazure.com/en-us/home/tour/storage/).

Looking to the future, the commitment to Hadoop compatibility means that streaming data solutions and NoSQL databases designed to be part of the Hadoop ecosystem should work with the Microsoft distribution — HBase itself will ship as a core offering. It seems likely that solutions such as S4 (http://incubator.apache.org/s4/) will prove compatible.



Toward an integrated environment

Now that Microsoft is on the way to integrating the major
components of big data tooling, does it intend to
join it all together to provide an integrated data science platform
for businesses?



That's certainly the vision, according to Madhu Reddy, senior
product planner for Microsoft Big Data: "Hadoop is primarily for
developers. We want to enable people to use the tools they
like."

The strategy to achieve this involves entry points at multiple levels: for developers, analysts and business users. Instead of choosing one particular analytical platform, Microsoft will focus on interoperability with existing tools. Excel is an obvious priority, but other tools are also important to the company.

According to Reddy, data scientists represent a spectrum of
preferences. While Excel is a ubiquitous and popular choice, other
customers use Matlab, SAS, or R, for example.

The data marketplace

One thing unique to Microsoft as a big data and cloud platform is its data market, the Windows Azure Marketplace. Mixing external data, such as geographical or social data, with your own can generate revealing insights. But it's hard to find data, be confident of its quality, and purchase it conveniently. That's where data marketplaces meet a need.

The availability of the Azure marketplace integrated with Microsoft's
tools gives analysts a ready source of external data with some
guarantees of quality. Marketplaces are in their infancy now, but
will play a growing role in the future of data-driven business.

Summary

The Microsoft approach to big data has ensured the continuing
relevance of its Windows platform for web-era organizations, and
makes its cloud services a competitive choice for data-centered
businesses.

Appropriately enough for a company with a large and diverse
software ecosystem of its own, the Microsoft approach is one of
interoperability. Rather than laying out a golden path
for big data, as suggested by the appliance-oriented approach of
others, Microsoft is focusing heavily on integration.

The guarantee of this approach lies in Microsoft's choice to
embrace and work with the Apache Hadoop community, enabling the
migration of new tools and talented developers to its
platform.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.


January 20 2012

Top stories: January 16-20, 2012

Here's a look at the top stories published across O'Reilly sites this week.

SOPA and PIPA are bad industrial policy
Tim O'Reilly: SOPA and PIPA not only harm the Internet, they support existing content companies in their attempts to hold back innovative business models that will actually grow the market and deliver new value to consumers. (See also: Why O'Reilly went dark on 1/18/12 and The President's challenge.)

Big data market survey: Hadoop solutions
Edd Dumbill explores the Hadoop-based big data solutions available on the market, contrasts the approaches of EMC Greenplum, IBM, Microsoft and Oracle and provides an overview of Hadoop distributions.

Mobile interfaces: Mistakes to avoid and trends to watch
"Designing Mobile Interfaces" co-author Steven Hoober discusses common mobile interface mistakes, and he offers his thoughts on the latest mobile device trends — including why the addition of gestures and sensors isn't wholly positive.

From SOPA to speech: Seven tech trends to monitor
Mike Loukides weighs in on the tech trends — good and bad — that will exert considerable influence in 2012.

Early thoughts on iBooks Author and Apple's textbook move
James Turner considers Apple's new authoring platform and its restrictive policies. Will those restrictions limit the program's potential?



January 19 2012

Big data market survey: Hadoop solutions


The big data ecosystem can be confusing. The popularity of "big data" as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving into competitive data warehousing solutions.


Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let's remind ourselves of the definition of big data (http://radar.oreilly.com/2012/01/what-is-big-data.html):


"Big data is data that exceeds the processing capacity of
conventional database systems. The data is too big, moves too
fast, or doesn't fit the strictures of your database
architectures. To gain value from this data, you must choose an
alternative way to process it."


Big data problems vary in how heavily they weigh in on the axes
of volume, velocity and variability. Predominantly structured yet
large data, for example, may be most suited to an analytical
database approach.


This survey makes the assumption that a data warehousing
solution alone is not the answer to your problems, and concentrates on
analyzing the commercial Hadoop ecosystem. We'll focus on the
solutions that incorporate storage and data processing,
excluding those products which only sit above those layers, such
as the visualization or analytical workbench software.

Getting started with Hadoop doesn't require a large
investment as the software is open source, and is also available
instantly through the Amazon Web Services cloud. But for
production environments, support, professional services and
training are often required.

Just Hadoop?




Apache Hadoop is unquestionably the center of the latest
iteration of big data solutions. At its heart, Hadoop is a
system for distributing computation among commodity servers. It
is often used with the Hadoop Hive project, which layers data
warehouse technology on top of Hadoop, enabling ad-hoc
analytical queries.


Big data platforms divide along the lines of their approach to Hadoop. The big data offerings from familiar enterprise vendors incorporate a Hadoop distribution, while other platforms offer Hadoop connectors to their existing analytical database systems. This latter category tends to comprise massively parallel processing (MPP) databases that made their name in big data before Hadoop matured: Vertica and Aster Data. Hadoop's strength in these cases is in processing unstructured data in tandem with the analytical capabilities of the existing database on structured data.



Practical big data implementations don't in general fall neatly
into either structured or unstructured data
categories. You will invariably find Hadoop working as part of a
system with a relational or MPP database.



Much as with Linux before it, no Hadoop solution incorporates the raw Apache Hadoop code. Instead, it's packaged into distributions. At a minimum, these distributions have been through a testing process, and often include additional components such as management and monitoring tools. The most well-used distributions now come from Cloudera, Hortonworks and MapR. Not every distribution will be commercial, however: the BigTop project aims to create a Hadoop distribution under the Apache umbrella.



Integrated Hadoop systems


The leading Hadoop enterprise software vendors have aligned their
Hadoop products with the rest of their database and analytical
offerings. These vendors don't require you to source Hadoop from
another party, and offer it as a core part of their big data
solutions. Their offerings integrate Hadoop into a
broader enterprise setting, augmented by analytical and workflow
tools.

EMC Greenplum

  • Database: Greenplum Database
  • Deployment options: Appliance (Modular Data Computing Appliance), Software (Enterprise Linux)
  • Hadoop: Bundled distribution (Greenplum HD); Hive, Pig, Zookeeper, HBase
  • NoSQL component: HBase

Acquired by EMC, and rapidly taken to the heart of the
company's strategy, Greenplum is a relative newcomer to the
enterprise, compared
to other companies in this section. They have turned that to
their advantage in creating an analytic platform, positioned as
taking analytics "beyond BI" with agile data science teams.

Greenplum's Unified Analytics Platform (UAP) comprises three elements: the Greenplum MPP database, for structured data; a Hadoop distribution, Greenplum HD; and Chorus (http://www.greenplum.com/products/chorus), a productivity and groupware layer for data science teams.

The HD Hadoop layer builds on MapR's Hadoop compatible
distribution, which replaces the file system with a faster
implementation and provides other features for
robustness. Interoperability between HD and Greenplum Database
means that a single query can access both database and Hadoop data.

Chorus is a unique feature, and is indicative of Greenplum's commitment
to the idea of data science and the importance of the agile team
element to effectively exploiting big data. It supports
organizational roles from analysts, data scientists and DBAs
through to executive business stakeholders.



As befits EMC's role in the data center market, Greenplum's UAP is
available in a modular appliance configuration.

IBM

  • Database: DB2
  • Deployment options: Software (Enterprise Linux)
  • Hadoop: Bundled distribution (InfoSphere BigInsights); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
  • NoSQL component: HBase

IBM's InfoSphere BigInsights (http://www-01.ibm.com/software/data/infosphere/biginsights/) is their Hadoop distribution, and part of a suite of products offered under the "InfoSphere" information management brand. Everything big data at IBM is helpfully labeled Big, appropriately enough for a company affectionately known as "Big Blue."

BigInsights augments Hadoop with a variety of features,
including
management and administration tools. It also offers textual analysis tools
that aid with entity resolution — identifying people, addresses,
phone numbers and so on.

IBM's Jaql query language provides a point of integration
between Hadoop and other IBM products, such as relational databases
or Netezza data warehouses.

InfoSphere BigInsights is interoperable with IBM's other
database and warehouse products, including DB2, Netezza and its
InfoSphere warehouse and analytics lines. To aid analytical
exploration, BigInsights ships with BigSheets, a spreadsheet
interface onto big data.

IBM addresses streaming big data separately through its InfoSphere Streams product (http://www-01.ibm.com/software/data/infosphere/streams/). BigInsights is not currently offered in an appliance or cloud form.

Microsoft

  • Deployment options: Software (Windows Server), Cloud (Windows Azure)
  • Hadoop: Bundled distribution (Big Data Solution); Hive, Pig

Microsoft have adopted Hadoop as the center of their big data offering, and are pursuing an integrated approach aimed at making big data available through their analytical tool suite, including the familiar tools of Excel and PowerPivot.

Microsoft's Big Data Solution (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx) brings Hadoop to the Windows Server platform, and in elastic form to their cloud platform Windows Azure. Microsoft have packaged their own distribution of Hadoop, integrated with Windows Systems Center and Active Directory. They intend to contribute back changes to Apache Hadoop to ensure that an open source version of Hadoop will run on Windows.

On the server side, Microsoft offer integrations to their SQL Server database and their data warehouse product. Use of their warehouse solutions isn't mandated, however. The Hadoop Hive data warehouse is part of the Big Data Solution, including connectors from Hive to ODBC and Excel.

Microsoft's focus on the developer is evident in their creation
of a JavaScript API for Hadoop. Using JavaScript, developers can
create Hadoop jobs for MapReduce, Pig or Hive, even from a
browser-based environment. Visual Studio and .NET integration
with Hadoop is also provided.

Deployment is possible either on the server or in the cloud, or as a hybrid combination. Jobs written against the Apache Hadoop distribution should migrate with minimal changes to Microsoft's environment.


Oracle

  • Deployment options: Appliance (Big Data Appliance)
  • Hadoop: Bundled distribution (Cloudera's Distribution including Apache Hadoop); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
  • NoSQL component: Oracle NoSQL Database

Announcing their entry into the big data market at the end of 2011, Oracle is taking an appliance-based approach. Their Big Data Appliance (http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html) integrates Hadoop, R for analytics, a new Oracle NoSQL database, and connectors to Oracle's database and Exadata data warehousing product line.

Oracle's approach caters to the high-end enterprise market, and
particularly leans to the rapid-deployment, high-performance end
of the spectrum. It is the only vendor to include the popular R
analytical language integrated with Hadoop, and to ship a NoSQL
database of their own design as opposed to Hadoop HBase.



Rather than developing their own Hadoop distribution, Oracle
have partnered with Cloudera for Hadoop support, which brings them
a mature and established Hadoop solution. Database connectors
again promote the integration of structured Oracle data with the
unstructured data stored in Hadoop HDFS.

Oracle's NoSQL Database (http://www.oracle.com/us/products/database/nosql/overview/index.html) is a scalable key-value database, built on the Berkeley DB technology. In that, Oracle owes double gratitude to Cloudera CEO Mike Olson, as he was previously the CEO of Sleepycat, the creators of Berkeley DB. Oracle are positioning their NoSQL database as a means of acquiring big data prior to analysis.

The Oracle R Enterprise product offers direct integration into
the Oracle database, as well as Hadoop, enabling R scripts to run
on data without having to round-trip it out of the data stores.


Availability

While IBM and Greenplum's offerings are available at the time
of writing, the Microsoft and Oracle solutions are expected to be
fully available early in 2012.

Analytical databases with Hadoop connectivity



MPP (massively parallel processing) databases are specialized for processing structured big data, as distinct from the unstructured data that is Hadoop's specialty. Along with Greenplum, Aster Data and Vertica were early pioneers of big data products, emerging before the mainstream adoption of Hadoop.

These MPP solutions are databases specialized for analytical workloads and data integration, and provide connectors to Hadoop and data warehouses. A recent spate of acquisitions has seen these products become the analytical play by data warehouse and storage vendors: Teradata acquired Aster Data, EMC acquired Greenplum, and HP acquired Vertica.

Quick facts

Aster Data

  • Database: MPP analytical database
  • Hadoop: Hadoop connector available

ParAccel

  • Database: MPP analytical database
  • Deployment options: Software (Enterprise Linux), Cloud (Cloud Edition)
  • Hadoop: Hadoop integration available

Vertica

  • Database: MPP analytical database
  • Deployment options: Appliance (HP Vertica Appliance), Software (Enterprise Linux), Cloud (Cloud and Virtualized)
  • Hadoop: Hadoop and Pig connectors available

Hadoop-centered companies

Directly employing Hadoop is another route to creating a big
data solution, especially where your infrastructure doesn't fall
neatly into the product line of major vendors. Practically every
database now features Hadoop connectivity, and there are multiple
Hadoop distributions to choose from.

Reflecting the developer-driven ethos of the big data world,
Hadoop distributions are frequently offered in a community edition.
Such editions lack enterprise management features, but contain all
the functionality needed for evaluation and development.

The first iterations of Hadoop distributions, from Cloudera and IBM, focused on usability and administration. We are now seeing the addition of performance-oriented improvements to Hadoop, such as those from MapR and Platform Computing. While maintaining API compatibility, these vendors replace slow or fragile parts of the Apache distribution with better performing or more robust components.

Cloudera

The longest-established provider of Hadoop distributions,
Cloudera provides an
enterprise Hadoop solution, alongside
services, training and support options. Along with
Yahoo, Cloudera have made deep open source contributions to Hadoop, and
through hosting industry conferences have done much to establish
Hadoop in its current position.

Hortonworks

Though a recent entrant to the market, Hortonworks (http://www.hortonworks.com/) have a long history with Hadoop. Spun off from Yahoo, where Hadoop originated, Hortonworks aims to stick close to and promote the core Apache Hadoop technology. Hortonworks also have a partnership with Microsoft to assist and accelerate their Hadoop integration.

The Hortonworks Data Platform (http://hortonworks.com/technology/hortonworksdataplatform/) is currently in a limited preview phase, with a public preview expected in early 2012. The company also provides support and training.


An overview of Hadoop distributions

Cloudera

  • Product name: Cloudera's Distribution including Apache Hadoop (CDH)
  • Free edition: CDH, an integrated, tested distribution of Apache Hadoop
  • Enterprise edition: Cloudera Enterprise, which adds a management software layer over CDH
  • Hadoop components: Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
  • Security: Cloudera Manager (Kerberos, role-based administration and audit trails)
  • Admin interface: Cloudera Manager (centralized management and alerting)
  • Job management: Cloudera Manager (job analytics, monitoring and log search)
  • HDFS access: Fuse-DFS (mount HDFS as a traditional filesystem)
  • Installation: Cloudera Manager (wizard-based deployment)

EMC Greenplum

  • Product name: Greenplum HD
  • Free edition: Community Edition, a 100% open source certified and supported version of the Apache Hadoop stack
  • Enterprise edition: Enterprise Edition, which integrates MapR's M5 Hadoop-compatible distribution, replaces HDFS with MapR's C++-based file system, and includes MapR management tools
  • Hadoop components: Hive, Pig, Zookeeper, HBase
  • Admin interface: MapR Heatmap cluster administrative tools
  • Job management: high-availability job management (JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents)
  • Database connectors: Greenplum Database
  • HDFS access: NFS (access HDFS as a conventional network file system)

Hortonworks

  • Product name: Hortonworks Data Platform
  • Hadoop components: Hive, Pig, Zookeeper, HBase, Ambari
  • Admin interface: Apache Ambari (monitoring, administration and lifecycle management for Hadoop clusters)
  • Job management: Apache Ambari (monitoring, administration and lifecycle management for Hadoop clusters)
  • HDFS access: WebHDFS (REST API to HDFS)

IBM

  • Product name: InfoSphere BigInsights
  • Free edition: Basic Edition, an integrated Hadoop distribution
  • Enterprise edition: Enterprise Edition, a Hadoop distribution plus the BigSheets spreadsheet interface, scheduler, text analytics, indexer, JDBC connector and security support
  • Hadoop components: Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
  • Security: LDAP authentication, role-based authorization, reverse proxy
  • Admin interface: administrative features including Hadoop HDFS and MapReduce administration, cluster and server management, and viewing HDFS file content
  • Job management: job creation, submission, cancellation, status and logging
  • Database connectors: DB2, Netezza, InfoSphere Warehouse
  • Installation: quick installation (GUI-driven installation tool)
  • Additional APIs: Jaql, a functional, declarative query language designed to process large data sets

MapR

  • Product name: MapR
  • Free edition: MapR M3 Edition, a free community edition incorporating MapR's performance increases
  • Enterprise edition: MapR M5 Edition, which augments the M3 Edition with high availability and data protection features
  • Hadoop components: Hive, Pig, Flume, HBase, Sqoop, Mahout, Oozie
  • Admin interface: MapR Heatmap cluster administrative tools
  • Job management: high-availability job management (JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents)
  • HDFS access: NFS (access HDFS as a conventional network file system)
  • Additional APIs: REST API
  • Volume management: mirroring, snapshots

Microsoft

  • Product name: Big Data Solution
  • Enterprise edition: Big Data Solution, a Windows Hadoop distribution integrated with Microsoft's database and analytical products
  • Hadoop components: Hive, Pig
  • Security: Active Directory integration
  • Admin interface: System Center integration
  • Database connectors: SQL Server, SQL Server Parallel Data Warehouse
  • Interop features: Hive ODBC Driver, Excel Hive Add-in
  • Additional APIs: JavaScript API (JavaScript MapReduce jobs, Pig-Latin and Hive queries)

Platform Computing

  • Product name: Platform MapReduce
  • Free edition: Platform MapReduce Developer Edition, an evaluation edition that excludes the resource management features of the regular edition
  • Enterprise edition: Platform MapReduce, an enhanced runtime for Hadoop MapReduce that is API-compatible with Apache Hadoop
  • Admin interface: Platform MapReduce Workload Manager
  • Additional APIs: includes R, C/C++, C#, Java, Python



January 12 2012

Strata Week: A .data TLD?

Here are some of the data stories that caught my attention this week.

Should there be a .data TLD?

ICANN is ready to open top-level domains (TLDs) to the highest bidder, and as such, Wolfram Alpha's Stephen Wolfram posits it's time for a .data TLD. In a blog post on the Wolfram site, he argues that the new top-level domains provide an opportunity for the creation of a .data domain that could create a "parallel construct to the ordinary web, but oriented toward structured data intended for computational use. The notion is that alongside a website like wolfram.com, there'd be wolfram.data."

Wolfram continues:

If a human went to wolfram.data, there'd be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it'd find just what it needs to ingest the data, and begin computing with it.

So how would a .data TLD change the way humans and computers interact with data? Or would it change anything? If you've got ideas of how .data could be put to use, please share them in the comments.
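As a purely illustrative sketch of the idea, a computational client visiting a .data host might fetch a machine-readable catalog and start computing with it straight away. The URL and JSON layout below are invented; nothing like this exists today.

```python
# Hypothetical sketch of a program consuming a ".data" endpoint.
# The domain and the JSON structure are invented for illustration only.
import json
import urllib.request

with urllib.request.urlopen("http://wolfram.data/datasets.json") as response:
    catalog = json.load(response)

# A machine-readable catalog might list datasets with names, URLs and formats,
# letting software choose and ingest data without a human in the loop.
for dataset in catalog.get("datasets", []):
    print(dataset.get("name"), dataset.get("url"), dataset.get("format"))
```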


Cloudera addresses what Apache Hadoop 1.0 means to its customers

Last week, the Apache Software Foundation (ASF) announced that Hadoop had reached version 1.0. This week, Cloudera took to its blog to explain what that milestone means to its customers.

The post, in part, explains how Hadoop has branched from its trunk, noting that all of this has caused some confusion for Cloudera customers:

More than a year after Apache Hadoop 0.20 branched, significant feature development continued on just that branch and not on trunk. Two major features were added to branches off 0.20.2. One feature was authentication, enabling strong security for core Hadoop. The other major feature was append, enabling users to run Apache HBase without risk of data loss. The security branch was later released as 0.20.203. These branches and their subsequent release have been the largest source of confusion for users because since that time, releases off of the 0.20 branches had features that releases off of trunk did not have and vice versa.

Cloudera explains to its customers that it's offered the equivalent for "approximately a year now" and compares the Apache Hadoop efforts to its own offerings. The post is an interesting insight into not just how the ASF operates, but how companies that offer services around those projects have to iterate and adapt.

Disqus says that pseudonymous commenters are best

Debates over blog comments have resurfaced recently, with a back and forth about whether or not they're good, bad, evil, or irrelevant. Adding some fuel to the fire (or data to the discussion, at least) comes Disqus with its own research based on its commenting service.

According to the Disqus research, commenters using pseudonyms actually are "the most valuable contributors to communities," as their comments are the highest in both quantity and quality. Those findings run counter to the idea that people who comment online without using their real names lessen rather than enhance the quality of conversations.

Disqus' data indicates that pseudonymity might engender a more engaged and more engaging community. That notion stands in contrast to arguments that anonymity leads to more trollish and unruly behavior.

Got data news?

Feel free to email me.


January 05 2012

Strata Week: Unfortunately for some, Uber's dynamic pricing worked

Here are a few of the data stories that caught my attention this week.

Uber's dynamic pricing

Many passengers using the luxury car service Uber on New Year's Eve suffered from sticker shock when they saw that a hefty surcharge had been added to their bills — a charge ranging from 3 to more than 6 times the regular cost of an Uber fare. Some patrons took to Twitter to complain about the pricing, and Uber responded with several blog posts and Quora answers, trying to explain the startup's usage of "dynamic pricing."

The idea, writes Uber engineer Dom Anthony Narducci, is that:

... when our utilization is approaching too high of levels to continue to provide low ETA's and good dispatches, we raise prices to reduce demand and increase supply. On New Year's Eve (and just after midnight), this system worked perfectly; demand was too high, so the price bumped up. Over and over and over and over again.

In other words, in order to maintain the service that Uber is known for — reliability — the company adjusted prices based on the supply and demand for transportation. And on New Year's Eve, says Narducci, "As for how the prices got that high, at a super simplistic level, it was because things went right."
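Narducci doesn't spell out Uber's actual algorithm, but the underlying mechanism is easy to sketch: as utilization climbs past a target level, apply a growing surge multiplier to dampen demand and pull in more supply. The target, scaling factor and cap below are made up purely for illustration.

```python
def surge_multiplier(active_requests: int, available_cars: int,
                     target_utilization: float = 0.8) -> float:
    """Toy dynamic-pricing rule: the fare multiplier rises as utilization
    exceeds a target. Illustrative only; not Uber's algorithm. The 0.8 target,
    the scaling factor of 5 and the 6x cap are arbitrary choices."""
    if available_cars == 0:
        return 6.0  # cap the multiplier when supply runs out entirely
    utilization = active_requests / available_cars
    if utilization <= target_utilization:
        return 1.0
    # Scale with how far utilization overshoots the target, up to the cap.
    return min(6.0, 1.0 + (utilization - target_utilization) * 5)

print(surge_multiplier(40, 100))   # quiet night: normal fare (1.0)
print(surge_multiplier(300, 100))  # New Year's Eve spike: capped at 6.0
```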

TechCrunch contributor Semil Shah points to other examples of dynamic pricing, such as for airfares and hotels, and argues that we might see more of this in the future. "Starting now, consumers should also prepare to experience the underbelly of this phenomenon, a world where prices for goods and services that are in demand, either in quantity or at a certain time, aren't the same price for each of us."

But Reuters' Felix Salmon argues that this sort of algorithmic and dynamic pricing might not work well for most customers. It isn't simply that the prices for Uber car rides are high (they are always higher than a taxi anyway). He contends that the human brain really can't — or perhaps doesn't want to — handle this sort of complicated cost/benefit analysis for a decision like "should I take a cab or call Uber or just walk home." As such, he calls Uber:

... a car service for computers, who always do their sums every time they have to make a calculation. Humans don't work that way. And the way that Uber is currently priced, it's always going to find itself in a cognitive zone of discomfort as far as its passengers are concerned.


Apache Hadoop reaches v1.0

The Apache Software Foundation announced that Apache Hadoop has reached v1.0, an indication that the big data tool has achieved a certain level of stability and enterprise-readiness.

V1.0 "reflects six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists, and systems engineers, bringing a highly stable, enterprise-ready release of the fastest-growing big data platform," said the ASF in its announcement.

The designation by the Apache Software Foundation reaffirms the interest in and development of Hadoop, a major trend in 2011 and likely to be such again in 2012.

Proposed bill would repeal open access for federal-funded research

What's the future for open data, open science, and open access in 2012? Hopefully, a bill introduced late last month isn't a harbinger of what's to come.

The Research Works Act (HR 3699) is a proposed piece of legislation that would repeal the open-access policy at the National Institutes of Health (NIH) and prohibit similar policies from being introduced at other federal agencies. HR 3699 has been referred to the Committee on Oversight and Government Reform.

The main section of the bill is quite short:

"No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that

  • causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or
  • requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work."

The bill would prohibit the NIH and other federal agencies from requiring that grant recipients publish in open-access journals.

Got data news?

Feel free to email me.


December 30 2011

Four short links: 30 December 2011

  1. Hadoop Hits 1.0 -- open source distributed computation engine, heavily used in big data analysis, hits 1.0.
  2. Sparse and Low-Rank Approximation Wiki -- interesting technique: instead of sampling at 2x the rate you need to discriminate then compressing to trade noise for space, use these sampling algorithms to (intelligently) noisily sample at the lower bit rate to begin with. Promises interesting applications, particularly for sensors (e.g., the Rice single pixel camera); a small reconstruction sketch follows this list. (via siah)
  3. Rise of Printer Malware -- firmware attacks embedded in printed documents. Another reminder that not only is it hard to write safe software, your mistakes can be epically bad. (via Cory Doctorow)
  4. Electric Circuits and Transistors Made From Cotton -- To make it conductive, the researchers coated cotton threads in a variety of other materials. To make conductive “wires,” the team coated the threads with gold nanoparticles, and then a conductive polymer. To turn a cotton wire into a semiconductor, it was dipped in another polymer, and then a further glycol coating to make it waterproof. Neat materials hack that might lend a new twist to wearables.
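The second link above refers to compressed sensing: if a signal is sparse in some basis, a relatively small number of random measurements is enough to reconstruct it. Here is a minimal sketch using NumPy and scikit-learn's orthogonal matching pursuit (both assumed to be installed); the sizes and sparsity level are arbitrary.

```python
# Compressed-sensing sketch: recover a k-sparse signal of length n from only
# m < n random linear measurements, using orthogonal matching pursuit.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, m, k = 256, 64, 5                       # signal length, measurements, sparsity

x = np.zeros(n)                            # build a k-sparse test signal
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix
y = A @ x                                  # the m measurements actually "sampled"

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(A, y)
x_hat = omp.coef_

print("relative recovery error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```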

December 26 2011

The year in big data and data science

Big data and data science have both been with us for a while. According to McKinsey & Company's May 2011 report on big data, back in 2009 "nearly all sectors in the U.S. economy had at least an average of 200 terabytes of stored data ... per company with more than 1,000 employees." And on the data-science front, Amazon's John Rauser used his presentation at Strata New York (below) to trace the profession of data scientist all the way back to 18th-century German astronomer Tobias Mayer.

Of course, novelty and growth are separate things, and in 2011, there were a number of new technologies and companies developed to address big data's issues of storage, transfer, and analysis. Important questions were also raised about how the growing ranks of data scientists should be trained and how data science teams should be constructed.

With that as a backdrop, below I take a look at three evolving data trends that played an important role over the last year.

The ubiquity of Hadoop

It was a big year of investment for Apache Hadoop-based companies. Hortonworks, which was spun out of Yahoo this summer, raised $20 million upon its launch. And when Cloudera announced it had raised $40 million this fall, GigaOm's Derrick Harris calculated that, all told, Hadoop-based startups had raised $104.5 million between May and November of 2011. (Other startups raising investment for their Hadoop software included Platfora, Hadapt and MapR.)

But it wasn't just startups that got in on the Hadoop action this year: IBM announced this fall that it would offer Hadoop in the cloud; Oracle unveiled its own Hadoop distribution running on its new Big Data appliance; EMC signed a licensing agreement with MapR; and Microsoft opted to put its own big data processing system, Dryad, on hold, signing a deal instead with Hortonworks to handle Hadoop on Azure.

The growing number of Hadoop providers and adopters has spurred more solutions for managing and supporting Hadoop. This will become increasingly important in 2012 as Hadoop moves beyond the purview of data scientists to become a tool more businesses and analysts utilize.

More data, more privacy and security concerns

Despite all the promise that better tools for handing and analyzing data holds, there were numerous concerns this year about the privacy and security implications of big data, stemming in part from a series of high-profile data thefts and scandals.

In April, a security breach at Sony led to the theft of the personal data of 77 million users. The intrusion into the PlayStation Network prompted Sony to pull it offline, but Sony failed to notify its users about the issue for a full week (later admitting that it stored usernames and passwords unencrypted). Estimates of the cost of the security breach to Sony ranged from $170 million to $24 billion.

That's a wide range of estimates for the damage done to the company, but the point is clear nonetheless: not only do these sorts of data breaches cost companies millions, but the value of consumers' personal data is also increasing — for both legitimate and illegitimate purposes.

Sony was hardly the only company with security and privacy concerns on its hands. In April, Alasdair Allan and Pete Warden uncovered a file in Apple iOS software that noted users' latitude-longitude coordinates along with a timestamp. Apple responded, insisting that the company "is not tracking the location of your iPhone. Apple has never done so and has no plans to ever do so." Apple fixed what it said was a "bug."

Late this year, almost all handset makers and carriers were implicated by another mobile concern when Android developer Trevor Eckhart reported that the mobile intelligence company Carrier IQ's rootkit software could record all sorts of user data — texts, web browsing, keystrokes, and even phone calls.

That the data from mobile technology was at the heart of these two controversies reflects in some ways our changing data usage patterns. But whether it's mobile or not, as we do more online — shop, browse, chat, check in, "like" — it's clear that we're leaving behind an immense trail of data about ourselves. This year saw the arrival of several open-source efforts, such as the Locker Project and ThinkUp, that strive to give users better control over their personal social data.

And while better control and safeguards can offer some level of protection, it's clear that technology can always be cracked and the goals of data aggregators can shift. So, if digital data is and always will be a moving target, how does that shape our expectations for privacy? In Privacy and Big Data, published this year, co-authors Terence Craig and Mary Ludloff argued that we might be paying too much attention to concerns about "intrusions of privacy" and that instead we need to be thinking about better transparency with how governments and companies are using our data.

Open data's inflection point

Screenshot from the Open Knowledge Foundation's Open Government Data Map.

When it comes to better transparency, 2011 has been a good year for open data, with strong growth in the number of open data efforts. Canada, the U.K., France, the U.S., and Kenya were a few of the countries unveiling open data initiatives.

There were still plenty of open data challenges: budget cuts, for example, threatened the U.S. Data.gov initiative. And in his "state of open data 2011" talk, open data activist David Eaves pointed to the challenges of having different schemas and few standards, making it difficult for some datasets to be used across systems and jurisdictions.

Even with a number of open data "wins" at the government level, a recent survey of the data science community by EMC named the lack of open data as one of the obstacles that data scientists and business intelligence analysts said they faced. Just 22% of the former and 12% of the latter said that they "strongly believed" that the employees at their companies have the access they need to run experiments on data. Arguably, more open data efforts have spawned more interest and better understanding of what this can mean.

The demand for more open data has also spawned a demand for more tools. Importantly, these tools are beginning to be open to more than just data scientists or programmers. They include things like visualization-creator Visual.ly, the scraping tool ScraperWiki, and data-sharing site BuzzData.



December 22 2011

Developer Year in Review: 2011 Edition

This year brought us triumphs and tragedies, new companies born and old ones burning out. Before DWiR takes a holiday hiatus, we're going to look back on the high points of the year that was.

Mobile gains ground

Smartphones

Lost in all the news about lawsuits, patents and speculation was the overarching theme for mobile this year: it has become the primary software platform for many users. The desktop may not be dead, but it's definitely showing its age, and as smartphones and tablets become ubiquitous, the amount of time the average consumer spends in front of a keyboard is declining rapidly.

The good news for software developers is that the maturing app store model has opened up software distribution to a much larger pool of potential software makers. The bad news is that it has also drastically reset the expectation of how much consumers are willing to spend for apps, although prices are climbing marginally. A $1 app can make you a lot of money if you can get millions of users to buy it, but it won't even get you a nice night on the town if you're writing for a niche market.

With RIM's Blackberry market share doing a good imitation of an Olympic high diver, and the new Windows mobile platform not yet gaining significant traction, 2011 was essentially a two-horse race, with Android passing iOS for the first time in new sales. Apple is crying all the way to the bank, though, as the profit margin on iOS devices is pushing Apple's bottom line to new highs and overall unit sales continue to climb steadily. At least for the moment, the smartphone market is not a zero-sum game.

This year also marked the release of Ice Cream Sandwich (ICS) for Android and iOS 5 for the iPhone/iPad/iPod. ICS is the first version of Android that is making serious efforts to tame the tablet situation, but there have been widespread complaints that carriers are slow to pick it up, even in new models. Objective-C developers are finally getting to say goodbye to old friends like retain, release and autorelease, as Apple rolled out the automatic reference count compiler. Few tears were shed for their passing.



The year of HTML5

In future years, 2011 will be remembered as the year Adobe put up the white flag and joined the HTML5 bandwagon, which started an industry death-watch for Flash. Microsoft also sent out signals that Silverlight was being put out to pasture and that it planned to embrace HTML5 as well.

The stampede to adopt HTML5 was prompted, in part, by the increasing robustness of the standard and the implementations of the standard in browsers. It also didn't hurt that it is the only Rich Internet Application platform that will run on the iPad.

Dru-who and Ha-what?

Two packages with funny names became the hot skills to have on your resume this year. Drupal continued to gain popularity as a content management platform, while Apache Hadoop was the must-have technology for data crunching. By the end of the year, developers with experience in either were in short supply and could basically write their own tickets.

Languages emerge, but few stick

It seems like every year, there's a new batch of languages that promise to be the next Big Thing. In past years, the crown has been worn by Scala, Erlang, Clojure and others. But when it comes time to start a project or hire developers, skills in new languages are rarely high on the list of priorities for companies.

This year, Google joined the fun, promoting both Go and Dart. Like most new languages, they face an uphill battle, even with Google's massive resources behind them. Few have what it takes to fight the institutional inertia of existing development decisions and to join winners such as Ruby in the pantheon of well-adopted emerging languages.

Some general thoughts to end the year

The computer industry, more than most others, can make you feel very old at a relatively young age. I've been hacking, in one form or another, for nearly 35 years, and the technology I used in my youth seems like it belongs in another universe.

The flip side of this is that I'm constantly amazed by what science and technology brings forth on a seemingly daily basis. Whether it's having a conversation with a device I can hold in the palm of my hand or watching the aurora light up the heavens, seen from above by occupants of the ISS, I often seem to be living in the future I read about as a kid.

As a species, we may be prone to pettiness, violence, willful ignorance and hatred, but once in a while, we manage to pull ourselves out of the muck and do something insanely great. Let's attempt to honor the vision of an admittedly imperfect man we lost this year and try to make 2012 insanely greater.

Got news?

Please send tips and leads here.


December 14 2011

Five big data predictions for 2012

As the "coming out" year for big data and data science draws to a close, what can we expect over the next 12 months?

More powerful and expressive tools for analysis

This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won't be long before that's dull, yet necessary, infrastructure.

Looking up the stack, there's already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there's still a way to go in making big data more powerful, and the key is decreasing the cost of creating experiments.

Here are two ways in which big data can be made more powerful.

  1. Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idioms that let us focus on the data, rather than on abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we're doing with the data (see the sketch after this list). These abstractions will in turn lend themselves to the creation of better tools for non-programmers.
  2. Better support for interactivity. If Hadoop has any weakness, it's the batch-oriented nature of the computation it fosters. The agile nature of data science will favor any tool that permits more interactivity.
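
To make the first point concrete, here's a minimal Python sketch, not tied to any Hadoop API; the toy corpus and the mapper/reducer helpers are invented for illustration. It contrasts a word count written as explicit map/reduce plumbing with a shorter, data-first expression of the same computation:

    from collections import Counter
    from itertools import chain

    # Toy corpus standing in for records streamed off a cluster (invented data).
    documents = [
        "big data is big",
        "data tools for data scientists",
    ]

    # The "plumbing" style: hand-written map and reduce steps.
    def mapper(doc):
        return [(word, 1) for word in doc.split()]

    def reducer(pairs):
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return counts

    plumbing_counts = reducer(chain.from_iterable(mapper(d) for d in documents))

    # The data-first style the prediction argues for: say what you want
    # from the data and leave the mechanics to the library.
    concise_counts = Counter(chain.from_iterable(d.split() for d in documents))

    assert plumbing_counts == concise_counts
    print(concise_counts.most_common(3))

The point isn't the word count itself but the shape of the code: the second version states what we want from the data and leaves the mechanics to the library, which is the kind of idiom that better language support would make routine on top of Hadoop.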

Streaming data processing

Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.

Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop was born out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.

For some applications, there just isn't enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data or make decisions about discarding it without having to go through the store-compute loop of map/reduce.

Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.
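
To make the store-versus-stream distinction concrete, here's a small Python sketch; the event source and field names are invented, and a real deployment would sit behind a framework such as Storm or S4 rather than a local loop. Aggregates are updated as events arrive, and the raw events are simply dropped:

    from collections import Counter, deque
    import random

    # Invented event source; a real deployment would receive tuples from a
    # streaming framework or a message queue.
    def event_stream(n=10000):
        for _ in range(n):
            yield {"country": random.choice(["us", "uk", "ca"]),
                   "amount": random.randint(1, 100)}

    # Streaming computation: keep only compact aggregates (plus, here, a
    # small sliding window) so raw events can be discarded as they arrive.
    orders_by_country = Counter()
    revenue_by_country = Counter()
    recent = deque(maxlen=100)  # last 100 events, for spot checks

    for event in event_stream():
        orders_by_country[event["country"]] += 1
        revenue_by_country[event["country"]] += event["amount"]
        recent.append(event)
        # The event itself is now dropped; nothing is stored for a later
        # store-then-MapReduce pass.

    print(orders_by_country)
    print(revenue_by_country)

The trade-off is that you can only answer the questions you decided to track up front, which is why streaming systems usually complement, rather than replace, batch analysis.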

Rise of data marketplaces

Your own data can become that much more potent when mixed with other datasets. For instance, add weather conditions to your customer data and discover whether there are weather-related patterns in your customers' purchasing behavior. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft's direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.
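
As a sketch of why that mixing matters, here's a small pandas example with made-up sales and weather tables; the column names and figures are invented, and in practice the weather side is exactly the kind of dataset a marketplace would deliver in a standardized form:

    import pandas as pd

    # Invented in-house sales data.
    sales = pd.DataFrame({
        "date": pd.to_datetime(["2011-12-01", "2011-12-02", "2011-12-03"]),
        "store": ["NYC", "NYC", "NYC"],
        "units_sold": [120, 95, 210],
    })

    # Invented third-party weather data keyed by the same date and location.
    weather = pd.DataFrame({
        "date": pd.to_datetime(["2011-12-01", "2011-12-02", "2011-12-03"]),
        "store": ["NYC", "NYC", "NYC"],
        "precip_mm": [0.0, 12.4, 0.2],
        "temp_c": [4, 1, 6],
    })

    # Once the datasets share keys, mixing them is a join and a correlation.
    combined = sales.merge(weather, on=["date", "store"])
    print(combined[["units_sold", "precip_mm", "temp_c"]].corr())

With the join done, asking whether rainy days depress or boost sales becomes a one-line question rather than a procurement project.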

Development of data science workflows and tools

As data science teams become a recognized part of companies, we'll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company's business operations, as opposed to being a sidecar analysis team.

Software developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners. Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum's Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.

Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the ground-breaking work data teams are doing at news organizations such as The Guardian and The New York Times: on short timescales, these teams take data from raw form to a finished product, working hand in hand with journalists.

Increased understanding of and demand for visualization

Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.
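
As a small illustration of visualization-as-exploration, consider this Python sketch using matplotlib; the data is synthetic and invented for the example:

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented measurements: two populations hiding in one column of data.
    rng = np.random.default_rng(42)
    values = np.concatenate([rng.normal(10, 2, 500), rng.normal(25, 3, 200)])

    # Exploration, not explanation: a quick histogram surfaces structure
    # (here, bimodality) that a summary statistic like the mean would hide,
    # and suggests the next question to ask of the dataset.
    plt.hist(values, bins=40)
    plt.xlabel("measurement")
    plt.ylabel("count")
    plt.title("One population or two?")
    plt.show()

No report would ship this chart, but it does its exploratory job: it tells the data scientist there are probably two populations in the data, and that the next step is to find out why.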

If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.

Throughout a year dominated by businesses' constant demand for data scientists, I've repeatedly heard from data scientists about what they want most: people who know how to create visualizations.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.


December 01 2011

Strata Week: New open-data initiatives in Canada and the UK

Here are a few of the data stories that caught my attention this week.

Open data from StatsCan

Embassy Magazine broke the news this week that all of Statistics Canada's online data will be made available to the public for free, released under the Government of Canada's Open Data License Agreement beginning in February 2012. Statistics Canada is the federal agency charged with producing statistics to help understand the Canadian economy, culture, resources, and population. (It runs the Canadian census every five years.)

The decision to make the data freely and openly available "has been in the works for years," according to Statistics Canada spokesperson Peter Frayne. The Canadian government did launch an open-data initiative earlier this year, and the move on the part of StatsCan dovetails philosophically with that. Frayne said that the decision to make the data free was not a response to the controversial decision last summer when the agency dropped its mandatory long-form census.

Open government activist David Eaves responds with a long list of "winners" from the decision, including all of the consumers of StatsCan's data:

Indirectly, this includes all of us, since provincial and local governments are big consumers of StatsCan data and so now — assuming it is structured in such a manner — they will have easier (and cheaper) access to it. This is also true of large companies and non-profits which have used StatsCan data to locate stores, target services and generally allocate resources more efficiently. The opportunity now opens for smaller players to also benefit.

Eaves continues, stressing the importance of these smaller players:

Indeed, this is the real hope. That a whole new category of winners emerges. That the barrier to use for software developers, entrepreneurs, students, academics, smaller companies and non-profits will be lowered in a manner that will enable a larger community to make use of the data and therefore create economic or social goods.

Moving to Big Data: Free Strata Online Conference — In this free online event, being held Dec. 7, 2011, at 9AM Pacific, we'll look at how big data stacks and analytical approaches are gradually finding their way into organizations as well as the roadblocks that can thwart efforts to become more data driven. (This Strata Online Conference is sponsored by Microsoft.)

Register to attend this free Strata Online Conference

Open data from Whitehall

The British government also announced the availability of new open datasets this week. The Guardian reports that personal health records, transportation data, housing prices, and weather data will be included "in what promises to be the most dramatic release of public data since the 2010 election."

The government will also form an Open Data Institute (ODI), led by Sir Tim Berners-Lee. The ODI will involve both businesses and academic institutions, and will focus on helping transform the data for commercial benefit for U.K. companies as well as for the government. The ODI will also work on the development of web standards to support the government's open-data agenda.

The Guardian notes that the health data that's to be released will be the largest of its kind outside of U.S. veterans' medical records. The paper cites the move as something recommended by the Wellcome Trust earlier this year: "Integrated databases ... would make England unique, globally, for such research." Both medical researchers and pharmaceutical companies will be able to access the data for free.

Dell open sources its Hadoop deployment tool

Hadoop adoption and investment have been among the big data trends of 2011, with stories about Hadoop appearing in almost every edition of Strata Week. GigaOm's Derrick Harris contends that Hadoop's good fortunes will only continue in 2012, listing six reasons why next year may actually go down as "The Year of Hadoop."

This week's Hadoop-related news involves the release of the source code to Crowbar, Dell's Hadoop deployment tool. Silicon Angle's Klint Finley writes that:

Crowbar is an open-source deployment tool developed by Dell originally as part of its Dell OpenStack Cloud service. It started as a tool for installing Open Stack, but can deploy other software through the use of plug-in modules called 'barclamps' ... The goal of the Hadoop barclamp is to reduce Hadoop deployment time from weeks to a single day.

Finley notes that Crowbar isn't a competitor to Cloudera's line of Hadoop management tools.

What Muncie read

"People don't read anymore," Steve Jobs once told The New York Times. It's a fairly common complaint, one that certainly predates the computer age — television was to blame, then video games. But our knowledge about reading habits of the past is actually quite slight. That's what makes the database based on ledgers from the Muncie, Ind., public library so marvelous.

The ledgers, which were discovered by historian Frank Felsenstein, chronicle every book checked out of the library, along with the name of the patron who checked it out, between November 1891 and December 1902. That information is now available in the What Middletown Read database.

In a New York Times story on the database, Anne Trubek notes that even at the turn of the 20th century, most library patrons were not reading "the classics":

What do these records tell us Americans were reading? Mostly fluff, it's true. Women read romances, kids read pulp and white-collar workers read mass-market titles. Horatio Alger was by far the most popular author: 5 percent of all books checked out were by him, despite librarians who frowned when boys and girls sought his rags-to-riches novels (some libraries refused to circulate Alger's distressingly individualist books). Louisa May Alcott is the only author who remains both popular and literary today (though her popularity is far less). "Little Women" was widely read, but its sequel "Little Men" even more so, perhaps because it was checked out by boys, too.

Got data news?

Feel free to email me.


November 03 2011

Strata Week: Cloudera founder has a new data product

Here are a few of the data stories that caught my attention this week:

Odiago: Cloudera founder Christophe Bisciglia's next big data project

Cloudera founder Christophe Bisciglia unveiled his new data startup this week: Odiago. The company's product, WibiData (say it out loud), uses Apache Hadoop and HBase to analyze consumer web data. Database industry analyst Curt Monash describes WibiData on his DBMS2 blog:

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop, is a data management and analytic execution layer. That's where the secret sauce resides.

GigaOm's Derrick Harris posits that Odiago points to "the future of Hadoop-based products." Rather than having to "roll your own" Hadoop solutions, future Hadoop users will be able to build their apps to tap into other products that do the "heavy lifting."

Hortonworks launches its data platform

Hadoop company Hortonworks, which spun out of Yahoo earlier this year, officially announced its products and services this week. The Hortonworks Data Platform is an open source distribution powered by Apache Hadoop. It includes the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase and ZooKeeper, as well as HCatalog and open APIs for integration. The Hortonworks Data Platform also includes Ambari, another Apache project, which will serve as the Hadoop installation and management system.

It's possible Hortonworks' efforts will pick up the pace of the Hadoop release cycle and address what ReadWriteWeb's Scott Fulton sees as the "degree of fragmentation and confusion." But as GigaOm's Derrick Harris points out, there is still "so much Hadoop in so many places, with multiple companies offering their own Hadoop solutions."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.



Save 20% on registration with the code RADAR20

Big education content meets big education data

A couple of weeks ago, the adaptive learning startup Knewton announced that it had raised an additional $33 million. This latest round was led by Pearson, the largest education company in the world. As such, the announcement this week that Knewton and Pearson are partnering is hardly surprising.

But this partnership does mark an important development for big data, textbook publishing, and higher education.

Knewton's adaptive learning platform will be integrated with Pearson's digital courseware, giving students individualized content as they move through the materials. To begin with, Knewton will work with just a few of the subjects within Pearson's MyLab and Mastering catalog. There are more than 750 courses in that catalog, and the adaptive learning platform will be integrated with more of them soon. The companies also say they plan to "jointly develop a line of custom, next-generation digital course solutions, and will explore new products in the K12 and international markets."

The data from Pearson's vast student customer base — some 9 million higher ed students use Pearson materials — will certainly help Knewton refine its learning algorithms. In turn, the promise of adaptive learning systems means that students and teachers will be able to glean insights from the learning process — what students understand, what they don't — in real time. It also means that teachers can provide remediation aimed at students' unique strengths and weaknesses.

Got data news?

Feel free to email me.


October 27 2011

Strata Week: IBM puts Hadoop in the cloud

Here are a few of the data stories that caught my attention this week.

IBM's cloud-based Hadoop offering looks to make data analytics easier

At its conference in Las Vegas this week, IBM made a number of major big-data announcements, including making its Hadoop-based product InfoSphere BigInsights available immediately via the company's SmartCloud platform. InfoSphere BigInsights was unveiled earlier this year, and it is hardly the first offering that Big Blue is making to help its customers handle big data. The last few weeks have seen other major players also move toward Hadoop offerings — namely Oracle and Microsoft — but IBM is offering its service in the cloud, something that those other companies aren't yet doing. (For its part, Microsoft does say that a Hadoop service will come to Azure by the end of the year.)

IBM joins Amazon Web Services as the only other company currently offering Hadoop in the cloud, notes GigaOm's Derrick Harris. "Big data — and Hadoop, in particular — has largely been relegated to on-premise deployments because of the sheer amount of data involved," he writes, "but the cloud will be a more natural home for those workloads as companies begin analyzing more data that originates on the web."

Harris also points out that IBM's Hadoop offering is "fairly unique" insofar as it targets businesses rather than programmers. IBM itself contends that "bringing big data analytics to the cloud means clients can capture and analyze any data without the need for Hadoop skills, or having to install, run, or maintain hardware and software."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Cleaning up location data with Factual Resolve

The data platform Factual launched a new API for developers this week that tackles one of the more frustrating problems with location data: incomplete records. Called Factual Resolve, the new offering is, according to a company blog post, an "entity resolution API that can complete partial records, match one entity against another, and aid in de-duping and normalizing datasets."

Developers using Resolve tell it what they know about an entity (say, a venue name) and the API can return the rest of the information that Factual knows based on its database of U.S. places — address, category, latitude and longitude, and so on.
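
For a rough sense of the workflow, here's a hedged Python sketch of that request/response round trip; the endpoint URL, parameter names, and response structure shown are assumptions made for illustration rather than details drawn from Factual's documentation:

    import json
    import requests

    # Everything service-specific below (endpoint, parameter names, response
    # shape) is an assumption for this sketch, not taken from Factual's docs.
    API_URL = "https://api.v3.factual.com/places/resolve"  # assumed endpoint
    API_KEY = "YOUR_FACTUAL_API_KEY"                        # placeholder

    # Tell Resolve what you already know about the entity...
    partial_record = {"name": "Coupa Cafe", "locality": "Palo Alto"}

    response = requests.get(
        API_URL,
        params={"values": json.dumps(partial_record), "KEY": API_KEY},
    )
    response.raise_for_status()

    # ...and inspect the candidate records it returns (address, category,
    # latitude/longitude and so on), typically ranked by match confidence.
    for candidate in response.json().get("response", {}).get("data", []):
        print(candidate)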

Tyler Bell, Factual's director of product, discussed the intersection of location and big data in a video interview at this year's Where 2.0 conference.

Google and governments' data requests

As part of its efforts toward better transparency, Google has updated its Government Requests tool this week with information about the number of requests the company has received for user data since the beginning of 2011.

This is the first time that Google is disclosing not just the number of requests, but the number of user accounts specified as well. It's also made the raw data available so that interested developers and researchers can study and visualize the information.

According to Google, requests from U.S. government officials for content removal were up 70% in this reporting period (January-June 2011) versus the previous six months, and the number of user data requests was up 29% compared to the previous reporting period. Google also says it received requests from local law enforcement agencies to take down various YouTube videos — one on police brutality, one that was allegedly defamatory — but it did not comply. Of the 5,950 user data requests (impacting some 11,000 user accounts) submitted between January and June 2011, Google says that it complied, either fully or partially, with 93%.

The U.S. was hardly the only government making an increased number of requests to Google. Spain, South Korea, and the U.K., for example, also made more requests. Several countries, including Sri Lanka and the Cook Islands, made their first requests.

Got data news?

Feel free to email me.


October 06 2011

Strata Week: Oracle's big data play

Here are the data stories that caught my attention this week:

Oracle's big data week

Eyes have been on Oracle this week as it holds its OpenWorld event in San Francisco. The company has made a number of major announcements, including unveiling its strategy for handling big data. This includes its Big Data Appliance, which will use a new Oracle NoSQL database as well as an open-source distribution of Hadoop and R.

Edd Dumbill examined the Oracle news, arguing that "it couldn't be a plainer validation of what's important in big data right now or where the battle for technology dominance lies." He notes that whether one is an Oracle customer or not, the company's announcement "moves the big data world forward," pointing out that there is now a de facto agreement that Hadoop and R are core pieces of infrastructure.

GigaOm's Derrick Harris reached out to some of the startups that also offer these core pieces, speaking with Norman Nie, the CEO of Revolution Analytics, and Mike Olson, CEO of Cloudera. Not surprisingly, perhaps, the startups are "keeping brave faces, but the consensus is that Oracle's forays into their respective spaces just validate the work they've been doing, and they welcome the competition."

Oracle's entry as a big data player also brings competition to others in the space, such as IBM and EMC, as all the major enterprise providers wrestle to claim supremacy over whose capabilities are the biggest and fastest. And the claim that "we're faster" was repeated over and over by Oracle CEO Larry Ellison as he made his pitch to the crowd at OpenWorld.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Who wrote Hadoop?

As ReadWriteWeb's Joe Brockmeier notes, ascertaining the contributions to open-source projects is sometimes easier said than done. Who gets credit — companies or individuals — can be both unclear and contentious. Such is the case with a recent back-and-forth between Cloudera's Mike Olson and Hortonworks' Owen O'Malley over who's responsible for the contributions to Hadoop.

O'Malley wrote a blog post titled "The Yahoo! Effect," which, as the name suggests, describes Yahoo's legacy and its continuing contributions to the Hadoop core. O'Malley argues that "from its inception until this past June, Yahoo! contributed more than 84% of the lines of code still in Apache Hadoop trunk." (Editor's note: The link to "trunk" was inserted for clarity.) O'Malley adds that so far this year, the biggest contributors to Hadoop are Yahoo! and Hortonworks.

Lines of code contributed to Apache Hadoop trunk (from Owen O'Malley's post, "The Yahoo! Effect")

That may not be a surprising argument to hear from Hortonworks, the company that was spun out of Yahoo! earlier this year to focus on the commercialization and development of Hadoop.

But Cloudera's Mike Olson challenges that argument — again, not a surprise, as Cloudera has long positioned itself as a major contributor to Hadoop, a leader in the space, and of course now the employer of former Yahoo! engineer Doug Cutting, the originator of the technology. Olson takes issue with O'Malley's calculations and, in a blog post of his own, contends that they don't accurately account for the companies that contributors now work for:

Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today — at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop — the story is much more interesting.

Olson also argues that it isn't simply a matter of who's contributing to the Apache Hadoop core, but rather who is working on:

... the broader ecosystem of projects. That ecosystem has exploded in recent years, and most of the innovation around Hadoop is now happening in new projects. That's not surprising — as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use.

Got data news?

Feel free to email me.


October 05 2011

Four short links: 5 October 2011

  1. Ghostery -- a browser plugin to block trackers, web bugs, dodgy scripts, ads, and anything else you care to remove from your browsing experience. It looks like a very well done adblocker, but it's (a) closed source and (b) for-profit. Blocking trackers is something every browser *should* do, but because browser makers make (or hope to make) money from ads, they don't. In theory, Mozilla should do it. Even if they were to take up the mantle, though, they're unlikely to make anything for IE or Chrome. So it's in the hands of companies with inarticulate business models. (via Andy Baio)
  2. Perspectives -- Firefox plugin that lets you know when you've encountered an SSL certificate that's different from the ones that other Perspectives users see (e.g., you're being man-in-the-middled by Iran). (via Francois Maurier)
  3. Always Connected -- "I've got a full day of staring at glowing rectangles ahead of me! Better get started ...". I have made mornings and evenings backlight-free zones in an effort to carve out some of the day free of glowing rectangles. (I do still read myself to sleep on the Kindle, but it's not backlit.)
  4. Is Teaching MapReduce Healthy for Students? -- Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication. But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization. Detailed and interesting criticism of whether Hadoop is the BASIC of parallel tools. (via Pete Warden)

August 04 2011

Strata Week: Hadoop adds security to its skill set

Here are a few of the data stories that caught my eye this week.

Where big data and security collide

Could security be the next killer app for Hadoop? That's what GigaOm's Derrick Harris suggests: "The open-source, data-processing tool is already popular for search engines, social-media analysis, targeted marketing and other applications that can benefit from clusters of machines churning through unstructured data — now it's turning its attention to security data." Noting the universality of security concerns, Harris suggests that "targeted applications" using Hadoop might be a solid starting point for mainstream businesses to adopt the technology.

Juniper Networks' Chris Hoff has also analyzed the connections between big data and security in a couple of recent posts on his Rational Survivability blog. Hoff contends that while we've had the capabilities to analyze security-related data for some time, that's traditionally happened with specialized security tools, meaning that insights are "often disconnected from the transaction and value of the asset from which they emanate."

Hoff continues:

Even when we do start to be able to integrate and correlate event, configuration, vulnerability or logging data, it's very IT-centric. It's very INFRASTRUCTURE-centric. It doesn't really include much value about the actual information in use/transit or the implication of how it's being consumed or related to.

But as both Harris and Hoff argue, Hadoop might help address this as it can handle all of an organization's unstructured data and can enable security analysis that isn't "disconnected." Both point to Zettaset as an example of a company that is tackling big data and security analysis by using Hadoop.


Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD


What's your most important personal data?

Concerns about data security also occur at the personal level. To that end, The Locker Project, Singly's open source project to help people collect and control their personal data, recently surveyed people about the data they see as most important.

The survey asked people to choose from the following: contacts, messages, events, check-ins, links, photos, music, movies, or browser history. The results are in, and no surprise: photos were listed as the most important, with 37% of respondents (67 out of 179) selecting that option. Forty-six people listed their contacts, and 23 said their messages were most important.

Interestingly, browser history, events, and check-ins were rated the lowest. As Singly's Tom Longson ponders:

Do people not care about where they went? Is this data considered stale to most people, and therefore irrelevant? I personally believe I can create a lot of value from Browser History and Check-ins. For example, what websites are my friends going to that I'm not? Also, what places should I be going that I'm not? These are just a couple of ideas.

But just as revealing as the ranking of data were the reasons that people gave for why certain types were most important, as you can see in the word cloud created from their responses.

Word cloud created from Singly's survey responses. See Singly's associated analysis of this data.

House panel moves forward on data retention law

The U.S. Congress is in recess now, but among the last-minute things it accomplished before vacation was passage by the House Judiciary Committee of "The Protecting Children from Internet Pornographers Act of 2011." Ostensibly aimed at helping track pedophiles and pornographers online, the bill has raised a number of concerns about Internet data and surveillance. If passed, the law would require, among other things, that Internet companies collect and retain the IP addresses of all users for at least one year.

Representative Zoe Lofgren was one of the opponents of the legislation in committee, trying unsuccessfully to introduce amendments that would curb its data retention requirements. She also tried to have the name of the law changed to the "Keep Every American's Digital Data for Submission to the Federal Government Without a Warrant Act of 2011."

In addition to concerns over government surveillance, TechDirt's Mike Masnick and the Cato Institute's Julian Sanchez have also pointed to the potential security issues that could arise from lengthy data retention requirements. Sanchez writes:

If I started storing big piles of gold bullion and precious gems in my home, my previously highly secure apartment would suddenly become laughably insecure, without my changing my security measures at all. If a company significantly increases the amount of sensitive or valuable information stored in its systems — because, for example, a government mandate requires them to keep more extensive logs — then the returns to a single successful intrusion (as measured by the amount of data that can be exfiltrated before the breach is detected and sealed) increase as well. The costs of data retention need to be measured not just in terms of terabytes, or man hours spent reconfiguring routers. The cost of detecting and repelling a higher volume of more sophisticated attacks has to be counted as well.

New data from a very old map

And in more pleasant "storing old data" news: the Gough Map, the oldest surviving map of Great Britain, dating back to the 14th century, has now been digitized and made available online.

The project to digitize the map, which now resides in Oxford University's Bodleian Library, took 15 months to complete. According to the Bodleian, the project explored the map's "'linguistic geographies,' that is the writing used on the map by the scribes who created it, with the aim of offering a re-interpretation of the Gough Map's origins, provenance, purpose and creation of which so little is known."

Among the insights gleaned is the revelation that the text on the Gough Map is the work of at least two different scribes — one from the 14th century and a later one, from the 15th century, who revised some pieces. Furthermore, it was also discovered that the map was made closer to 1375 than 1360, the date often given to it.

Got data news?

Feel free to email me.





July 28 2011

Four short links: 28 July 2011

  1. 23andMe Disproves Its Own Business Model -- a hostile article arguing that there's little predictive power in genetics for diabetes and Parkinson's, so what's the point of buying a 23andMe subscription? The wider issue is that, as we've known for a while, mapping out your genome only helps with a few clearcut conditions. For most medical things that we care about, environment is critical too -- but that doesn't mean that personalized genomics won't help us better target therapies.
  2. jsftp -- lightweight implementation of FTP client protocol for NodeJS. (via Sergi Mansilla)
  3. Really Bad Workshops -- PDF eBook with rock-solid advice for everyone who runs a workshop.
  4. PigEditor (GitHub) -- Eclipse plugin for those working with Pig and Hadoop. (via Josh Patterson)
