
May 13 2013

Big data, cool kids

My data’s bigger than yours!

The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity.

These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn’t help.

POPULAR KID: Look at me! Big data is the hotness!
HADOOP: My data’s bigger than yours!
SCIPY: Size isn’t everything, Hadoop! The bigger they come, the harder they fall. And aren’t you named after a toy elephant?
R: Backward sentences mine be, but great power contains large brain.
SQL: Oh, so you all want to be friends again now, eh?!
POPULAR KID: Yeah, what SQL said! Nobody really needs big data; it’s all about small data, dummy.

The fact is that we’re fumbling toward the adolescence of big data tools, and we’re at an early stage of understanding how data can be used to create value and increase the quality of service people receive from government, business and health care. Big data is trumpeted in mainstream media, but many businesses are better advised to take baby steps with small data.

Data skeptics are not without justification. Our use of “small data” hasn’t exactly worked out uniformly well so far: crude numbers are often misused, knowingly or otherwise. For example, over-reliance by bureaucrats on the results of testing in schools is shaping educational institutions toward a tragically homogeneous mediocrity.

The promise and the gamble of big data is this: that we can advance past the primitive quotas of today’s small data into both a sophisticated statistical understanding of an entire system and insight that focuses down to the level of an individual. Data gives us both telescope and microscope, in detail we’ve never had before.

Inside this tantalizing vision lie many of the debates in today’s data world: the need for highly skilled data scientists to effect this change, and the worry that we’ll inadvertently enslave ourselves to Big Brother, even with the best of intentions.

So, as the data revolution moves forward, it’s important to take the long view. The foment of tools and job titles and algorithms is significant, but ultimately it’s background to our larger purposes as people, businesses and government. That’s one reason why, at O’Reilly, we’ve taken the motto “Making Data Work” for Strata. Data, not technology, is the heartbeat of our world because it relates directly to ourselves and the problems we want to solve.

This is also the reason that the Strata and Hadoop World conferences take a broad view of the subject: ranging from the business topics to the tools and data science. If you talk to Hadoop’s most seasoned advocates, they don’t speak only about the tech; they talk about the problems they’re able to solve. The tools alone are never enough; the real enabler is the framework of people and understanding in which they’re used.

Our mission is to help people make sense of the state of the data world and use this knowledge to become both more competitive and more creative. We believe that’s best served by creating context in which we think about our use of data as well as serving the growing specialist communities in data.

Enjoy the noise and the energy from the growing data ecosystem, but keep your eyes on the problems you want to solve.

The Strata and Hadoop World Call for Proposals is open until midnight EDT, Thursday May 16.

January 19 2012

Big data market survey: Hadoop solutions

The big data ecosystem can be confusing. The popularity of "big data" as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters further, Hadoop-based solutions such as Hive are at the same time evolving into competitive data warehousing solutions.

Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let's remind ourselves of the definition of big data:

"Big data is data that exceeds the processing capacity of
conventional database systems. The data is too big, moves too
fast, or doesn't fit the strictures of your database
architectures. To gain value from this data, you must choose an
alternative way to process it."

Big data problems vary in how heavily they weigh on the axes of volume, velocity and variety. Predominantly structured yet large data, for example, may be best suited to an analytical database approach.

This survey makes the assumption that a data warehousing
solution alone is not the answer to your problems, and concentrates on
analyzing the commercial Hadoop ecosystem. We'll focus on the
solutions that incorporate storage and data processing,
excluding those products which only sit above those layers, such
as the visualization or analytical workbench software.

Getting started with Hadoop doesn't require a large
investment as the software is open source, and is also available
instantly through the Amazon Web Services cloud. But for
production environments, support, professional services and
training are often required.

Just Hadoop?

Apache Hadoop is unquestionably the center of the latest
iteration of big data solutions. At its heart, Hadoop is a
system for distributing computation among commodity servers. It
is often used with the Hadoop Hive project, which layers data
warehouse technology on top of Hadoop, enabling ad-hoc
analytical queries.
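
At its simplest, Hadoop's MapReduce model splits work into a map phase that emits key-value pairs and a reduce phase that aggregates them per key. The sketch below is a minimal, illustrative word count in that style, run in-process rather than on a cluster; with Hadoop Streaming, the two functions would be separate scripts reading stdin.

```python
# A toy word count in the MapReduce style Hadoop popularized.
# Hadoop would run many mappers and reducers in parallel across servers;
# here the phases are chained in one process to show the data flow.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a streaming mapper."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum counts per word; Hadoop delivers pairs grouped by key,
    which sorting simulates here."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    docs = ["Big data is big", "data moves fast"]
    print(dict(reduce_phase(map_phase(docs))))
```

Hive layers an SQL-like query language over this same machinery, compiling queries down to MapReduce jobs so analysts need not write the phases by hand.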

Big data platforms divide along the lines of their approach to
Hadoop. The big data offerings from familiar enterprise vendors
incorporate a Hadoop distribution, while other platforms
offer Hadoop connectors to their existing analytical database
systems. This latter category tends to comprise massively
parallel processing (MPP) databases that made their name in big
data before Hadoop matured: Vertica and Aster Data. Hadoop's
strength in these cases is in processing unstructured data in
tandem with the analytical capabilities of the existing database
on structured or semi-structured data.

Practical big data implementations don't in general fall neatly
into either structured or unstructured data
categories. You will invariably find Hadoop working as part of a
system with a relational or MPP database.

Much as with Linux before it, no Hadoop solution incorporates
the raw Apache Hadoop code. Instead, it's packaged into
distributions. At a minimum, these distributions have been
through a testing process, and often include additional
components such as management and monitoring tools. The most
well-used distributions now come from Cloudera, Hortonworks and
MapR. Not every distribution is commercial, however: there is also an effort to create a Hadoop distribution under the Apache umbrella.


Integrated Hadoop systems

The leading Hadoop enterprise software vendors have aligned their
Hadoop products with the rest of their database and analytical
offerings. These vendors don't require you to source Hadoop from
another party, and offer it as a core part of their big data
solutions. Their offerings integrate Hadoop into a
broader enterprise setting, augmented by analytical and workflow tools.

EMC Greenplum

- Deployment options: appliance (Modular Data Computing Appliance), software (Enterprise Linux)
- Bundled distribution: Greenplum HD

Acquired by EMC, and rapidly taken to the heart of the
company's strategy, Greenplum is a relative newcomer to the
enterprise, compared
to other companies in this section. They have turned that to
their advantage in creating an analytic platform, positioned as
taking analytics "beyond BI" with agile data science teams.

Greenplum's Unified Analytics Platform (UAP) comprises three elements: the Greenplum MPP database, for structured data; a Hadoop distribution, Greenplum HD; and Chorus, a productivity and groupware layer for data science teams.

The HD Hadoop layer builds on MapR's Hadoop-compatible
distribution, which replaces the file system with a faster
implementation and provides other features for
robustness. Interoperability between HD and Greenplum Database
means that a single query can access both database and Hadoop data.

Chorus is a unique feature, and is indicative of Greenplum's commitment
to the idea of data science and the importance of the agile team
element to effectively exploiting big data. It supports
organizational roles from analysts, data scientists and DBAs
through to executive business stakeholders.

As befits EMC's role in the data center market, Greenplum's UAP is
available in a modular appliance configuration.


IBM InfoSphere

- Deployment options: software (Enterprise Linux)
- Bundled distribution: InfoSphere BigInsights

IBM's href="">InfoSphere
BigInsights is their Hadoop distribution, and part of a suite
of products offered under the "InfoSphere" information management
brand. Everything big data at IBM is helpfully labeled
Big, appropriately enough for a company affectionately known as "Big

BigInsights augments Hadoop with a variety of features, including management and administration tools. It also offers textual analysis tools
that aid with entity resolution — identifying people, addresses,
phone numbers and so on.

IBM's Jaql query language provides a point of integration
between Hadoop and other IBM products, such as relational databases
or Netezza data warehouses.

InfoSphere BigInsights is interoperable with IBM's other
database and warehouse products, including DB2, Netezza and its
InfoSphere warehouse and analytics lines. To aid analytical
exploration, BigInsights ships with BigSheets, a spreadsheet
interface onto big data.

IBM addresses streaming big data separately through its InfoSphere Streams product. BigInsights is not currently offered in an appliance or cloud form.



Microsoft

- Deployment options: software (Windows Server), cloud (Windows Azure Cloud)
- Bundled distribution: Big Data Solution

Microsoft have adopted Hadoop as the center of their big data offering, and are pursuing an integrated approach aimed at making big data available through their analytical tool suite, including the familiar tools of Excel and PowerPivot.

The Big Data Solution brings Hadoop to the Windows Server platform, and in elastic form to their cloud platform, Windows Azure. Microsoft have packaged their own distribution of Hadoop, integrated with Windows Systems Center and Active Directory. They intend to contribute changes back to Apache Hadoop to ensure that an open source version of Hadoop will run on Windows.

On the server side, Microsoft offer integrations to their SQL Server database and their data warehouse product. Using their warehouse solutions isn't mandated, however. The Hadoop Hive data warehouse is part of the Big Data Solution, including connectors from Hive to ODBC and Excel.

Microsoft's focus on the developer is evident in their creation
of a JavaScript API for Hadoop. Using JavaScript, developers can
create Hadoop jobs for MapReduce, Pig or Hive, even from a
browser-based environment. Visual Studio and .NET integration
with Hadoop is also provided.

Deployment is possible either on the server or in the cloud, or as a hybrid combination. Jobs written against the Apache Hadoop distribution should migrate with minimal changes to Microsoft's distribution.


Oracle Big Data

- Deployment options: appliance
- Bundled distribution: Cloudera's Distribution including Apache Hadoop
- NoSQL component: Oracle NoSQL Database

Announcing their entry into the big data market at the end of 2011, Oracle is taking an appliance-based approach. Their Big Data Appliance integrates Hadoop, R for analytics, a new Oracle NoSQL database, and connectors to Oracle's database and Exadata data warehousing product line.

Oracle's approach caters to the high-end enterprise market, and
particularly leans to the rapid-deployment, high-performance end
of the spectrum. It is the only vendor to include the popular R
analytical language integrated with Hadoop, and to ship a NoSQL
database of their own design as opposed to Hadoop HBase.

Rather than developing their own Hadoop distribution, Oracle
have partnered with Cloudera for Hadoop support, which brings them
a mature and established Hadoop solution. Database connectors
again promote the integration of structured Oracle data with the
unstructured data stored in Hadoop HDFS.

Oracle's NoSQL Database is a scalable key-value database, built on the Berkeley DB technology. In that, Oracle owes double gratitude to Cloudera CEO Mike Olson, as he was previously the CEO of Sleepycat, the creators of Berkeley DB. Oracle are positioning their NoSQL database as a means of acquiring big data prior to analysis.
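
To make the key-value model concrete, here is a toy in-memory store sketching the put/get/delete interface that key-value databases such as Oracle's (and Berkeley DB before it) are organized around. The class and method names are illustrative only, not Oracle's actual API; a real store persists values to disk and partitions keys across nodes.

```python
# Illustrative only: a toy in-memory key-value store. Nothing here is
# Oracle's real API; it just shows the schema-free data model.
class ToyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # Values are opaque bytes: the store imposes no schema, which is
        # what lets key-value databases ingest data before it is modeled.
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        # Returns True if the key existed and was removed.
        return self._data.pop(key, None) is not None

store = ToyKVStore()
store.put("user/1234/profile", b'{"name": "Ada"}')
print(store.get("user/1234/profile"))
```

The absence of a schema is the point: data can be captured first and structured later, which is why such stores suit acquisition ahead of analysis.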

The Oracle R Enterprise product offers direct integration into
the Oracle database, as well as Hadoop, enabling R scripts to run
on data without having to round-trip it out of the data stores.


While IBM and Greenplum's offerings are available at the time
of writing, the Microsoft and Oracle solutions are expected to be
fully available early in 2012.

Analytical databases with Hadoop connectivity

MPP (massively parallel processing) databases are specialized for processing structured big data, as distinct from the unstructured data that is Hadoop's specialty. Along with Greenplum, Aster Data and Vertica were early pioneers of big data products, predating the mainstream emergence of Hadoop.

These MPP solutions are databases specialized for analytical workloads and data integration, and provide connectors to Hadoop and data warehouses. A recent spate of acquisitions has seen these products become the analytical play of data warehouse and storage vendors: Teradata acquired Aster Data, EMC acquired Greenplum, and HP acquired Vertica.

Quick facts

Aster Data
- MPP analytical database
- Hadoop connector available

- MPP analytical database
- Deployment options: software (Enterprise Linux), cloud (Cloud Edition)
- Hadoop integration available

Vertica
- MPP analytical database
- Deployment options: appliance (HP Vertica Appliance), software (Enterprise Linux), cloud (Cloud and Virtualized)
- Hadoop and Pig connectors available


Hadoop-centered companies

Directly employing Hadoop is another route to creating a big
data solution, especially where your infrastructure doesn't fall
neatly into the product line of major vendors. Practically every
database now features Hadoop connectivity, and there are multiple
Hadoop distributions to choose from.

Reflecting the developer-driven ethos of the big data world,
Hadoop distributions are frequently offered in a community edition.
Such editions lack enterprise management features, but contain all
the functionality needed for evaluation and development.

The first iterations of Hadoop distributions, from Cloudera and
IBM, focused on usability and administration. We are now seeing the
addition of performance-oriented improvements to Hadoop, such as
those from MapR and Platform Computing. While maintaining API
compatibility, these vendors replace slow or fragile parts of the
Apache distribution with better performing or more robust components.


Cloudera

The longest-established provider of Hadoop distributions, Cloudera provides an enterprise Hadoop solution, alongside services, training and support options. Along with Yahoo, Cloudera have made deep open source contributions to Hadoop, and through hosting industry conferences have done much to establish Hadoop in its current position.


Hortonworks

Though a recent entrant to the market, Hortonworks have a long history with Hadoop. Spun off from Yahoo, where Hadoop originated, Hortonworks aims to stick close to and promote the core Apache Hadoop technology. Hortonworks also have a partnership with Microsoft to assist and accelerate their Hadoop efforts.

The Hortonworks Data Platform is currently in a limited preview phase, with a public preview expected in early 2012. The company also provides support and training.

An overview of Hadoop distributions

Vendors and product names:

- Cloudera: Cloudera's Distribution including Apache Hadoop (CDH)
- EMC Greenplum: Greenplum HD
- Hortonworks: Hortonworks Data Platform
- IBM: InfoSphere BigInsights
- MapR: MapR
- Microsoft: Big Data Solution
- Platform Computing: Platform MapReduce

Free Edition

- Cloudera (CDH): integrated, tested distribution of Apache Hadoop
- EMC Greenplum (Community Edition): 100% open source, certified and supported version of the Apache Hadoop stack
- IBM (Basic Edition): an integrated Hadoop distribution
- MapR (M3 Edition): free community edition incorporating MapR's performance increases
- Platform Computing (Platform MapReduce Developer Edition): evaluation edition; excludes the resource management features of the regular edition

Enterprise Edition

- Cloudera (Cloudera Enterprise): adds a management software layer over CDH
- EMC Greenplum (Enterprise Edition): integrates MapR's M5 Hadoop-compatible distribution, which replaces HDFS with MapR's C++-based file system; includes MapR management tools
- IBM (Enterprise Edition): Hadoop distribution plus the BigSheets spreadsheet interface, scheduler, text analytics, indexer, JDBC connector and security support
- MapR (M5 Edition): augments the M3 Edition with high availability and data protection features
- Microsoft (Big Data Solution): Windows Hadoop distribution, integrated with Microsoft's database and analytical products
- Platform Computing (Platform MapReduce): enhanced runtime for Hadoop MapReduce, API-compatible with Apache Hadoop


Security features

- Cloudera Manager: Kerberos, role-based administration and audit trails
- LDAP authentication, role-based authorization, reverse proxy
- Active Directory integration (Microsoft)

Admin Interface

- Cloudera Manager: centralized management and alerting
- MapR Heatmap cluster administrative tools
- Apache Ambari: monitoring, administration and lifecycle management for Hadoop clusters
- Administrative features including Hadoop HDFS and MapReduce administration, cluster and server management, and viewing of HDFS file content
- System Center integration (Microsoft)
- Platform MapReduce Workload Manager

Job Management

- Cloudera Manager: job analytics, monitoring and log search
- High-availability job management: JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents
- Apache Ambari: monitoring, administration and lifecycle management for Hadoop clusters
- Job creation, submission, cancellation, status and logging

Database connectors

- Greenplum Database
- InfoSphere Warehouse
- SQL Server and SQL Server Parallel Data Warehouse, with Hive ODBC Driver and Excel Hive Add-in

HDFS Access

- Mount HDFS as a traditional filesystem
- Access HDFS as a conventional network file system

Installation

- Cloudera Manager: wizard-based deployment
- Quick installation
- GUI-driven installation tool

Additional APIs

- Jaql: a functional, declarative query language designed to process large data sets
- JavaScript API: JavaScript MapReduce jobs, Pig Latin and Hive queries
- R, C/C++, C#, Java and Python

Volume Management

- Mirroring, snapshots


October 07 2011

How data and open government are transforming NYC

"In God We Trust," tweeted New York City Mayor Mike Bloomberg this month. "Everyone else, bring data."

Bloomberg, the billionaire founder of Bloomberg L.P., is now in his third term as mayor of the Big Apple. During his tenure, New York City has embraced a more data-driven approach to governing, even when the results of that data-driven transparency show a slump in city services.

This should be no surprise to anyone familiar with the mission statement of his financial data company:

Bloomberg started out with one core belief: that bringing transparency to capital markets through access to information could increase capital flows, produce economic growth and jobs, and significantly reduce the cost of doing business.

To reshape that mission statement for New York City, one might reasonably suggest that Bloomberg's data-driven approach to government is founded upon that belief that bringing transparency to government through access to information could increase capital flows, produce economic growth and jobs, and significantly reduce the cost of the business of government.

As Gov 2.0 goes local, New York City has become the epicenter for many experiments in governance, from citizensourcing smarter government to participatory budgeting to embracing a broader future as a data platform.

One of the most prominent New Yorkers supporting architecting a city as a platform is the city's first chief digital officer, Rachel Sterne.

Sterne gave a keynote speech at this year's Strata NY conference that explained how data-driven innovation informs New York's aim to be the nation's premier digital city.

"I'm especially excited to be speaking with you because as a city, we need your help," said Sterne to the assembled Strata attendees. "As the data practitioners and data scientists who are at the forefront of this revolution, all of our efforts are for naught if you are not part of them and not helping us to expand them and helping to really take advantage of all of the resources that the city of New York is trying put at your disposal."

Video of Sterne's talk is embedded below.

New York City's digital strategy is focused on access to technology, open government, engagement and industry. "Industry is important because we need to make sure the private sector has all the supports it needs to grow and thrive and help to create these solutions that will help the government to ultimately better serve the public," said Sterne. "Open government is important because if our data and our internal structure and priorities aren't completely open, we're not going to be able to enable increased [open] services, that kind of [open] exchange of information, etc. Engagement is crucial because we need to be constantly gathering feedback from the public, informing and serving. And access is the foundation because everyone needs access to these technologies."

Big data in the Big Apple

What does data-driven innovation look like in New York City? Sterne focused on how data "evolves government," asserting that it leads to a more efficient allocation of resources, a more effective execution, and a better response to the real-time needs of citizens. Although she allowed that, "as everyone knows, data can be manipulated."

Sterne highlighted several data-driven initiatives across the city, including the Metropolitan Transit Authority's Bus Time Initiative. "Initially, it was scoped out to hundreds of millions of dollars. The MTA ended up working with a local open-source development shop, [which] did it for a fraction of that, below a million dollars, and now you can get real-time updates on your phone based on where the buses are located using very low-cost technologies."

New York City is also using data internally, explained Sterne — like applying predictive analytics to building code violations and housing data to try to understand where potential fire risks might exist. If that sounds familiar to Radar readers, it should: Chicago is also looking to use data, developers and citizens to become a smarter city. "This is as much about citizens talking to the infrastructure of the city as infrastructure talking to itself," said Chicago CTO John Tolva in an interview last March. "It's where urban informatics and smarter cities cross over to Gov 2.0."


New York City, however, has a vastly greater "digital reach" than Chicago. It's bigger than many corporations and states, in fact, connecting to more than four million people through its website and social media channels that have expanded to include Twitter, Facebook, Tumblr, YouTube and Foursquare. Sterne envisions the city's 200-plus social media platforms as a kind of "digital switchboard," where citizens ask questions and government workers direct them to the appropriate resources, much in the same way that California connects citizens to e-services with social media.

The web as the 21st century public square

"What we're really seeing that's interesting about all these things is that they're happening in public, so people are informing one another," said Sterne. "They're engaging one another, and it's not so much the city telling you what to do but creating a forum for that conversation to take place." If you visit NYC's custom bitly URL shortener, you can see what content is popular within that community.

Back in May, when NYC's digital roadmap was released, Anil Dash highlighted something important: the roadmap captured New York City government thinking about the web as a public space. This has profound implications about how it should be regulated, treated or described. "The single biggest lesson I got from the 65-page, 11.8MB PDF is a simple one," Dash, a native New Yorker, blogger and entrepreneur, wrote. "The greatest city in the world can take shared public spaces online as seriously as it takes its public spaces in the physical world."

City as a platform

Sterne's description of a "city as a platform" is one of the purest articulations of Tim O'Reilly's "government as a platform" vision that I've heard any public servant articulate this year.

"The thing that's really exciting to me, better than internal data, of course, is open data," Sterne said during her Strata Conference talk. "This, I think, is where we really start to reach the potential of New York City becoming a platform like some of the bigger commercial platforms and open data platforms. How can New York City, with the enormous amount of data and resources we have, think of itself the same way Facebook has an API ecosystem or Twitter does? This can enable us to produce a more user-centric experience of government. It democratizes the exchange of information and services. If someone wants to do a better job than we are in communicating something, it's all out there. It empowers citizens to collaboratively create solutions. It's not just the consumption but the co-production of government services and democracy."

Sterne highlighted the most important open data initiative that the city has pursued to date, the NYC DataMine. Soon, she said, they will be introducing "NYC Platform," which she described as "the city's API." All of their work opening the data, however, "doesn't matter if we're not evangelizing it and making sure people are using it."

NYC has used an app competition to draw more attention to its open data. As I've written elsewhere, by tying specific citizen needs to development, NYC BigApps 3.0 is part of the next generation of government apps competitions that incorporate sustainability, community, and civic value.

"We've had about 150 apps developed," said Sterne. "There are apps that would be a significant cost to the city. Instead, they're at basically no cost because the prize money is all donated. We provide 350 datasets. Until now, they were not API-enabled. They were not dynamic, but we're going to be doing that because that's the overwhelming response that we're receiving from everyone."

That feedback is widespread in the open government data community, where studies show that developers prefer to explore and interact with data online, as opposed to downloading datasets. When it comes to developers working with public data, dynamic access can open up entire new horizons for potential applications, as the release of real-time transit data has demonstrated.

Sterne shared some useful examples of apps that have been created using NYC open government data, including Roadify, which allows you to find parking spots or transit information, and Don't Eat At, a Foursquare app that sends users a text message when they check into a NYC restaurant that is at risk of being closed for health code violations.

Sterne's message to data scientists was generally quite well received at Strata. "Pleased to see @RachelSterne's keynote today," tweeted Alistair Coote, a NYC Web developer at RecordSetter. "If done right, open govt will be far more important than anything announced at #f8 today," he observed, referring to Facebook's new look.

Why open government data matters to New Yorkers

The experience in NYC during Hurricane Irene "once again proved the utility and importance of open data and the NYC DataMine, as several organizations used OEM's Hurricane Evacuation Zone geographic data to build maps that served and informed the public," Sterne told me via email. "This data has been public for over a year. Parties developing tools built on city platforms included WNYC, NYTimes, Google, Mobile Commons and Crisis Commons. NYC Digital was also in regular contact with these parties to alert them of information changes."

The key insight coming out of that August weekend, with respect to the city acting as a platform during unprecedented demands for information, was that the open data that NYC provided on evacuation zones was used by other organizations to build maps. When the city's own site buckled under heavy traffic, the government turned to the Internet to share important resources. "As long as the right information was getting to citizens, that's all that matters," said Sterne at Strata. "It's OK if it's decentralized, as long as the reach is being expanded."

As I reported here on Radar, the growth of an Internet of things is an important evolution. What we saw during Hurricane Irene is the increasing importance of an Internet of people, where citizens act as sensors during an emergency.

"Social media played a critical role in informing New Yorkers," wrote Sterne. "Prior to that weekend, we established clear guidelines and a streamlined approvals process for social media content, which were disseminated to all social media managers. This ensured that even as we communicated in real time, we had accuracy and consistency in city messaging. @NYCMayorsOffice and the city's website were both major communication channels. @NYCMayorsOffice doubled its followers, increasing by nearly 30,000 during the weekend, and was cited by the mayor in press conferences as a resource. The YouTube channel was updated shortly after each press conference and saw nearly 60,000 views over the weekend. Over 32,000 tweets were published (not counting retweets) containing the text 'nycmayorsoffice' from August 25-29. Response was overwhelmingly positive."

Data pitfalls and potential

Legitimate questions have been raised about New York's data-driven policy, where journalists have questioned the crime data behind the city's CompStat program. The city has also faced challenges, including nearly $300 million in expanded costs for its computer system for personnel data, a sobering reminder of how difficult it is even for immensely successful private sector leaders to upgrade public sector IT systems. That's a reality that former U.S. CIO Vivek Kundra can certainly speak to at the federal level.

That said, New York City and its mayor clearly deserve credit for opening data, being transparent about the administration's performance, and continuing to work toward the incremental improvements that tend to be the way that government moves ahead.

For more insight into the IT behind New York City government, Radar's managing editor, Mac Slocum, talked with Carole Post, commissioner of NYC's Department of Information Technology and Telecommunications, about what being a data-driven city means and some of the most valuable ways that data is being applied. Video from that interview is below:

The challenges in opening up data in New York City are two-fold, said Post, and they are ones you see across government. "We first and foremost are a steward of the data that we hold, and so the concerns around privacy, confidentiality and public safety are definitely ones that need to be balanced against accessibility to the information," she said.

That challenge is one that every big city CIO will face in the years ahead, as technology affords new potential for open government and new risks of exposing sensitive personal data. "While we are enormous proponents of having open data," said Post, preserving the integrity of the data and its protections is equally important.

Post acknowledged that city government has "typically not been a very open bastion of sharing all of its information," but pointed to a necessary step in open government's evolution: moving to a standard of open by default, where civic data is considered open unless there is a reason for it not to be.


October 03 2011

Oracle's Big Data Appliance: what it means

Today, Oracle announced their Big Data Appliance. It couldn't be a plainer validation of what's important in big data right now, or where the battle for technology dominance lies.

Oracle's appliance includes some homegrown technology, most specifically a NoSQL database of their own design, and some open source technologies: Hadoop and R. Let's take a look at what these three decisions might mean.

Oracle NoSQL Database: Oracle's core reputation is as a database vendor, and as owners of the Berkeley DB technology, they have a core NoSQL platform to build upon (Berkeley was NoSQL for years before we even had that term). Oracle have no reason to partner with or incorporate other NoSQL tech such as Cassandra or MongoDB, and now pose a significant business threat to those technologies—perhaps Cassandra more than MongoDB, due to its enterprise credentials.

Hadoop: competitive commercial big data solutions such as Greenplum and Aster Data got ahead in the market through incorporating their own MapReduce technologies. Oracle hasn't bothered to do this, and has instead standardized on Hadoop and a system of connectors to its main Oracle product. (Both Greenplum and Aster also have Hadoop connectors.) If it needed any further validation, this confirms Hadoop's arrival as the Linux of big data. It's a standard.

R: big data isn't much use until you can make sense of it, and the inclusion of R in Oracle's big data appliance bears this out. It also sets up R as a new industry standard for analytics: something that will raise serious concern among vendors of established statistical and analytical solutions such as SAS and SPSS.

Whether you use Oracle or not, today's announcement moves the big data world forward. We have de facto agreement on Hadoop and R as core infrastructure, and we have healthy competition at the database and NoSQL layer.

Talk about this at Strata 2012: As the call for participation for Strata 2012 (Feb 28-Mar 1, Santa Clara, CA) nears its close, Oracle's announcement couldn't be more timely. We are opening up new content tracks focusing on the Hadoop ecosystem and on R. Submit your proposal by the end of this week.

August 23 2011

The nexus of data, art and science is where the interesting stuff happens

Jer Thorp (@blprnt), data artist in residence at The New York Times, was tasked a few years ago with designing an algorithm for the placement of the names on the 9/11 memorial. If an algorithm sounds unnecessarily complex for what seems like a basic bit of organization, consider this: Designer Michael Arad envisioned names being arranged according to "meaningful adjacencies," rather than by age or alphabetical order.

The project, says Thorp, is a reminder that data is connected to people, to real lives, and to the real world. I recently spoke with Thorp about the challenges that come with this type of work and the relationship between data, art and science. Thorp will expand on many of these ideas in his session at next month's Strata Conference in New York City.

Our interview follows.

How do aesthetics change our understanding of data?

Jer Thorp: I'm certainly interested in the aesthetic of data, but I rarely think when I start a project "let's make something beautiful." What we see as beauty in a data visualization is typically pattern and symmetry — something that often emerges when you find the "right" way, or one of the right ways, to represent a particular dataset. I don't really set out for beauty, but if the result is beautiful, I've probably done something right.

My work ranges from practical to conceptual. In the utilitarian projects I try not to add aesthetic elements unless they are necessary for communication. In the more conceptual projects, I'll often push the acceptable limits of complexity and disorder to make the piece more effective. Of course, often these more abstract pieces get mistaken for infographics, and I've had my fair share of Internet comment bashing as a result. Which I kind of like, in some sort of masochistic way.

What's it like working as a data artist at the New York Times? What are the biggest challenges you face?

Jer Thorp: I work in the R&D Group at the New York Times, which is tasked to think about what media production and consumption will look like in the next three years or so. So we're kind of a near-futurist department. I've spent the last year working on Project Cascade, which is a really novel system for visualizing large-scale sharing systems in real time. We're using it to analyze how New York Times content gets shared through Twitter, but it could be used to look at any sharing system — meme dispersal, STD spread, etc. The system runs live on a five-screen video wall outside the lab, and it gives us a dynamic, exploratory look at the vast conversation that is occurring at any time around New York Times articles, blog posts, etc.

It's frankly amazing to be able to work in a group where we're encouraged to take the novel path. Too many "R&D" departments, particularly in advertising agencies, are really production departments that happen to do work with augmented reality, or big data, or whatever else is trendy at the moment. There's an "R" in R&D for a reason, and I'm lucky to be in a place where we're given a lot of room to roam. Most of the credit for this goes to Michael Zimbalist, who is a great thinker and has an uncanny sense of the future. Add to that a soundly brilliant design and development team and you get a perfect creative storm.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

I try to straddle the border between design, art and science, and one of my biggest challenges is to not get pulled too far in one direction. I'm always conscious when I'm starting new projects to try to face in a different direction from where I was headed last. This keeps me at that boundary where I think the most interesting things are happening. Right now I'm working on two projects that concern memory and history, which is relatively uncharted territory for me and is getting me into a mix of neurobiology and psychology research alongside a lot of art and design history. So far, it's been tremendously satisfying.

In addition to your position at the Times, you're also a visiting professor at New York University. I'm curious how you see data visualization changing the way art and technology are taught and learned.

Jer Thorp: The class I'm currently teaching is called "Data Representation." Although it does include a fair amount of visualization, we talk a lot about how data can be used in a creative practice in different ways — sculpture, performance, participatory practice, etc. I'm really excited about artists who are representing information in novel media, such as Adrien Segal and Nathalie Miebach, and I try to encourage my students to push into areas that haven't been well explored. It's an exciting time for students because there are a million new niches just waiting to be found.

This interview was edited and condensed.


August 09 2011

There's no such thing as big data

“You know,” said a good friend of mine last week, “there’s really no such thing as big data.”

I sighed a bit inside. In the past few years, cloud computing critics have said similar things: that clouds are nothing new, that they’re just mainframes, that they’re just painting old technologies with a cloud brush to help sales. I’m wary of this sort of techno-Luddism. But this person is sharp, and not usually prone to verbal linkbait, so I dug deeper.

He’s a ridiculously heavy traveler, racking up hundreds of thousands of miles in the air each year. He’s the kind of flier airlines dream of: loyal, well-heeled, and prone to last-minute, business-class trips. He’s exactly the kind of person an airline needs to court aggressively, one who represents a disproportionately large amount of revenue. He’s an outlier of the best kind. He’d been a top-ranked passenger with United Airlines for nearly a decade, using their Mileage Plus program for everything from hotels to car rentals.

And then his company was acquired.

The acquiring firm had a contractual relationship with American Airlines, a competitor of United with a completely separate loyalty program. My friend’s air travel on United and its partner airlines dropped to nearly nothing.

He continued to book hotels in Shanghai, rent cars in Barcelona, and buy meals in Tahiti, and every one of those transactions was tied to his loyalty program with United. So the airline knew he was traveling -- just not with them.

Astonishingly, nobody ever called him to inquire about why he'd stopped flying with them. As a result, he’s far less loyal than he was. But more importantly, United has lost a huge opportunity to try to win over a large company’s business, with a passionate and motivated inside advocate.

And this was his point about big data: that given how little traditional companies put it to work, it might as well not exist. Companies have countless ways they might use the treasure troves of data they have on us. Yet all of this data lies buried, sitting in silos. It seldom sees the light of day.

When data does get put to use, it's usually by a disruptive startup. Zappos and customer service. Amazon and retailing. Craigslist and classified ads. Zillow and house purchases. LinkedIn and recruiting. eBay and payments. Ryanair and air travel. One by one, industry incumbents are withering under the harsh light of data.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Big data and the innovator's dilemma

Large companies with entrenched business models tend to cling to their buggy-whips. They have a hard time breaking their own business models, as Clay Christensen so clearly stated in "The Innovator’s Dilemma," but it's too easy to point the finger at simple complacency.

Early-stage companies have a second advantage over more established ones: they can ask for forgiveness instead of permission. Because they have less to lose, they can make risky bets. In the early days of PayPal, the company could skirt regulations more easily than Visa or Mastercard, because it had far less to fear if it was shut down. This helped it gain marketshare while established credit-card companies were busy with paperwork.

The real problem is one of asking the right questions.

At a big data conference run by The Economist this spring, one of the speakers made a great point: Archimedes had taken baths before.

(Quick historical recap: In an almost certainly apocryphal tale, Hiero of Syracuse had asked Archimedes to devise a way of measuring density, an indicator of purity, in irregularly shaped objects like gold crowns. Archimedes realized that the level of water in a bath changed as he climbed in, making it an indicator of volume. Eureka!)
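The physics behind the anecdote is a one-line formula: density is mass divided by displaced volume, and density flags adulteration. A toy calculation (all figures here are illustrative, not historical):

```python
# Archimedes' method in numbers: density = mass / displaced volume.
# All figures are illustrative assumptions, not historical data.

GOLD_DENSITY = 19.3            # g/cm^3, approximate

mass_g = 1000.0                # the crown's mass
displaced_water_cm3 = 60.0     # the rise in the bath gives the volume

density = mass_g / displaced_water_cm3
print(round(density, 1))       # 16.7 -- well below pure gold
print(density < GOLD_DENSITY)  # True: the goldsmith has cheated
```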

The speaker’s point was this: it was the question that prompted Archimedes’ realization.

Small, agile startups disrupt entire industries because they look at traditional problems with a new perspective. They’re fearless, because they have less to lose. But big, entrenched incumbents should still be able to compete, because they have massive amounts of data about their customers, their products, their employees, and their competitors. They fail because often they just don’t know how to ask the right questions.

In a recent study, McKinsey found that by 2018, the U.S. will face a shortage of 1.5 million managers who are fluent in data-based decision making. It’s a lesson not lost on leading business schools: several of them are introducing business courses in analytics.

Ultimately, this is what my friend’s airline example underscores. It takes an employee deciding that the loss of high-value customers matters, running a query across all that data to find him, and then turning the result into a business advantage. Without the right questions, there really is no such thing as big data -- and today, it’s the upstarts that are asking all the good questions.

When it comes to big data, you either use it or you lose.

This is what we’re hoping to explore at Strata JumpStart in New York next month. Rather than taking a vertical look at a particular industry, we’re looking at the basics of business administration through a big data lens. We'll be looking at applying big data to HR, strategic planning, risk management, competitive analysis, supply chain management, and so on. In a world flooded with too much data and too many answers, tomorrow's business leaders need to learn how to ask the right questions.

June 14 2011

Big data and the semantic web

On Quora, Gerald McCollum asked if big data and the semantic web were indifferent to each other, as there was little discussion of the semantic web topic at Strata this February.

My answer in brief is: big data's going to give the semantic web the massive amounts of metadata it needs to really get traction.

As the chair of the Strata conference, I see a vital link between big data and semantic web, and have my own roots in the semantic web world. Earlier this year however, the interaction was not yet of sufficient utility to make a strong connection in the conference agenda.

Google and the semantic web

A good example of the development of the relationship between big data and the semantic web is Google. Early on, Google search eschewed explicit use of semantics, preferring to infer a variety of signals in order to generate results. They used big data to create signals such as PageRank.
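PageRank is the canonical example of a signal computed from raw data at scale. A minimal power-iteration sketch on a toy link graph (illustrative only; the production signal involves far more than this):

```python
# Minimal PageRank by power iteration on a toy link graph.
# Illustrative only -- Google's production signal involves far more.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a baseline share, plus rank flowing in via links.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c": it collects links from both a and b
```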

Now, as the search algorithms mature, Google's mission is to make their results ever more useful to users. To achieve this, their software must start to understand more about the actual world. Who's an author? What's a recipe? What do my friends find useful? So the connections between entities become more important. To that end, Google are using data from initiatives such as RDFa and microformats.

Google do not use these semantic web techniques to replace their search, but rather to augment it and make it more useful. To get all fancypants about it: Google are starting to promote the information they gather towards being knowledge. They even renamed their search group as "Knowledge".

Metadata is hard: big data can help

Conventionally, semantic web systems generate metadata and identify entities explicitly, i.e., by hand or as the output of database values. But as anybody who's tried to get users to do it will tell you, generating metadata is hard. This is part of why the full semantic web dream isn't yet realized. Analytical approaches work differently: they surface and classify the metadata through analysis of the actual content and data itself. (Freely exposing metadata is also controversial and risky, as open data advocates will attest.)

Once big data techniques have been successfully applied, you have identified entities and the connections between them. If you want to join that information up to the rest of the web, or to concepts outside of your system, you need a language in which to do that. You need to organize, exchange and reason about those entities. It's this framework that has been steadily built up over the last 15 years with the semantic web project.

To give an already widespread example: many data scientists use Wikipedia to help with entity resolution and disambiguation, using Wikipedia URLs to identify entities. This is a classic use of the most fundamental of semantic web technologies: the URI.
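A sketch of that practice, with a hand-built alias table standing in for the disambiguation models a real pipeline would learn from context:

```python
# Resolve free-text mentions to canonical Wikipedia URLs used as URIs.
# The alias table is hand-built for illustration; real pipelines derive
# it from context and trained disambiguation models.

ALIASES = {
    "big blue": "IBM",
    "ibm": "IBM",
    "international business machines": "IBM",
    "google": "Google",
}

def resolve(mention):
    """Return a Wikipedia URL acting as a stable entity identifier."""
    title = ALIASES.get(mention.strip().lower())
    if title is None:
        return None
    return "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")

# Two different surface forms collapse onto one shared URI:
print(resolve("Big Blue"))  # https://en.wikipedia.org/wiki/IBM
print(resolve("IBM"))       # https://en.wikipedia.org/wiki/IBM
```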

For Strata, as our New York series of conferences approaches, we will be starting to include a little more semantic web, but with a strict emphasis on utility.

Strata itself is not so much about big data as about being data-driven, and the ongoing consequences that has for technology, business and society.

April 04 2011

The truth about data: Once it's out there, it's hard to control

The amount of data being produced is increasing exponentially, which raises big questions about security and ownership. Do we need to be more concerned about the information many of us readily give out to join popular social networks, sign up for website community memberships, or subscribe to free online email? And what happens to that data once it's out there?

In a recent interview, Jeff Jonas (@JeffJonas), IBM distinguished engineer and a speaker at the O'Reilly Strata Online Conference, said consumers' willingness to give away their data is a concern, but it's perhaps secondary to the sheer number of data copies produced.

Our interview follows.

What is the current state of data security?

Jeff Jonas: A lot of data has been created, and a boatload more is on its way — we have seen nothing yet. Organizations now wonder how they are going to protect all this data — especially how to protect it from unintended disclosure. Healthcare providers, for example, are just as determined to prevent a "wicked leak" as anyone else. Just imagine the conversation between the CIO and the board trying to explain the risk of the enemy within — the "insider threat" — and the endless and ever-changing attack vectors.

I'm thinking a lot these days about data protection, ranging from reducing the number of copies of data to data anonymization to perpetual insider threat detection.

How are advancements in data gathering, analysis, and application affecting privacy, and should we be concerned?

Jeff Jonas: When organizations only collect what they need in order to conduct business, tell the consumer what they are collecting, why and how they are going to use it, and then use it this way, most would say "fair game." This is all in line with Fair Information Practices (FIPs).

There continues to be some progress in the area of privacy-enhancing technology. For example, tamper-resistant audit logs, which are a way to record how a system was used that even the database administrator cannot alter. On the other hand, the trend that I see involves the willingness of consumers to give up all kinds of personal data in return for some benefit — free email or a fantastic social network site, for example.
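Jonas doesn't describe an implementation, but the core idea of a tamper-resistant audit log can be sketched as a hash chain, where each entry commits to the one before it, so rewriting history invalidates everything that follows:

```python
import hashlib
import json

# Tamper-evident audit log sketch: each entry embeds the hash of the
# previous entry, so altering any record breaks every hash after it.
# Real systems would also anchor the latest hash somewhere external.

def append(log, event):
    prev = log[-1]["hash"] if log else "0" * 64
    digest = hashlib.sha256(
        json.dumps({"event": event, "prev": prev}, sort_keys=True).encode()
    ).hexdigest()
    log.append({"event": event, "prev": prev, "hash": digest})

def verify(log):
    prev = "0" * 64
    for entry in log:
        expected = hashlib.sha256(
            json.dumps({"event": entry["event"], "prev": prev},
                       sort_keys=True).encode()
        ).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

log = []
append(log, "admin read patient record 1234")
append(log, "admin exported billing table")
print(verify(log))               # True
log[0]["event"] = "nothing here" # a DBA tries to edit history...
print(verify(log))               # False: the chain exposes the change
```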

While it is hard to not be concerned about what is happening to our privacy, I have to admit that for the most part technology advances are really delivering a lot of benefit to mankind.

The Strata Online Conference, being held April 6, will look at how information — and the ability to put it to work — will shape tomorrow's markets. Scheduled speakers include: Gavin Starks from AMEE, Jeff Jonas from IBM, Chris Thorpe from Artfinder, and Ian White from Urban Mapping.

Registration is open

What are the major issues surrounding data ownership?

Jeff Jonas: If users continue to give their data away because the benefits are irresistible, then there will be fewer battles, I suppose. The truth about data is that once it is out there, it's hard to control.

I did a back-of-the-envelope calculation a few years ago to estimate the number of copies a single piece of data may experience. Turns out the number is roughly the same as the number of licks it takes to get to the center of a Tootsie Pop — a play on an old TV commercial that basically translates to more than you can easily count.

A well-thought-out data backup strategy alone may create more than 100 copies. Then what about the operational data stores, data warehouses, data marts, secondary systems and their backups? Thousands of copies would not be uncommon. Even if a consumer thought they could own their data — which they can't in many settings — how could they ever do anything to affect it?
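The spirit of that estimate can be put in numbers, with every figure an illustrative assumption rather than anything from Jonas's actual calculation:

```python
# Back-of-the-envelope count of copies of one record, in the spirit of
# Jonas's estimate. Every number here is an illustrative assumption.

stores = {
    "operational databases": 3,
    "data warehouse and marts": 4,
    "secondary/reporting systems": 5,
}
daily_backups_kept = 30   # assumed retention window, per store
replicas_per_backup = 2   # e.g. one on-site copy plus one off-site

live_copies = sum(stores.values())
total = live_copies + live_copies * daily_backups_kept * replicas_per_backup
print(total)  # 732 -- "thousands" is easy to reach with longer retention
```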


March 09 2011

One foot in college, one foot in business

In a recent interview, Joe Hellerstein, a professor in the UC Berkeley computer science department, talked about the disconnect between open source innovation and development. The problem, he said, doesn't lie with funding, but with engineering and professional development:

As I was coming up as a student, really interesting open source was coming out of universities. I'm thinking of things like the Ingres and Postgres database projects at Berkeley and the Mach operating system at Carnegie Mellon. These are things that today are parts of commercial products, but they began as blue-sky research. What has changed now is there's more professionally done open source. It's professional, but it's further disconnected from research.

A lot of the open source that's very important is really "me-too" software — so Linux was a clone of Unix, and Hadoop is a clone of Google's MapReduce. There's a bit of a disconnect between the innovation side, which the universities are good at, and the professionalism of open source that we expect today, which the companies are good at. The question is, can we put those back together through some sort of industrial-academic partnership? I'm hopeful that can be done, but we need to change our way of business.

Hellerstein pointed to the MADlib project being conducted between his group at Berkeley and the project sponsor EMC Greenplum as an example of a new partnership model that could close the gap between innovation and development.

Our sponsor would have been happy to donate money to my research funds, but I said, "You know, what I really need is engineering time."

The thing I cannot do on campus is run a professional engineering shop. There are no career incentives for people to be programmers at the university. But a company has processes and expertise, and they can hire really good people who have a career path in the company. Can we find an arrangement where those people are working on open source code in collaboration with the people at the university?

It's a different way of doing research funding. The company's contributions are not financial. The contributions are in engineering sweat. It's an interesting experiment, and it's going well so far.

In the interview Hellerstein also discusses MAD data analysis and where we are in the industrial revolution of data. The full interview is available in the following video:


March 02 2011

Before you interrogate data, you must tame it

IBM, Wolfram|Alpha, Google, Bing, groups at universities, and others are trying to develop algorithms that parse useful information from unstructured data.

This limitation in search is a dull pain for many industries, but it was sharply felt by data journalists with the WikiLeaks releases. In a recent interview, Simon Rogers (@smfrogers), editor of the Guardian's Datablog and Datastore, talked about the considerable differences between the first batch of WikiLeaks releases — which arrived in a structured form — and the text-filled mass of unstructured cables that came later.

There were three WikiLeaks releases. One and two, Afghanistan and Iraq, were very structured. We got a CSV sheet, which was basically the "SIGACTS" — that stands for "significant actions" — database. It's an amazing data set, and in some ways it was really easy to work with. We could do incredibly interesting things, showing where things happened and events over time, and so on.

With the cables, it was a different kettle of fish. It was just a massive text file. We couldn't just look for one thing and think, "oh, that's the end of one entry and the beginning of the next." We had a few guys working on this for two or three months, just trying to get it into a state where we could have it in a database. Once it was in a database, internally we could give it to our reporters to start interrogating and getting stories out of it.
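Rogers doesn't detail the Guardian's pipeline, but the general task can be sketched: split a raw text dump into records on a guessed delimiter, then load them into a database reporters can interrogate. The delimiter and the ID pattern below are hypothetical; finding the real ones is most of the work.

```python
import re
import sqlite3

# Split a raw text dump into records and load them into SQLite for
# querying. The "=====" delimiter and "ID:" field are hypothetical.

raw = """ID: 0001
Some cable text here.
=====
ID: 0002
More cable text.
"""

records = [chunk.strip() for chunk in raw.split("=====") if chunk.strip()]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cables (id TEXT, body TEXT)")
for rec in records:
    m = re.search(r"ID:\s*(\S+)", rec)
    db.execute("INSERT INTO cables VALUES (?, ?)",
               (m.group(1) if m else None, rec))

count, = db.execute("SELECT COUNT(*) FROM cables").fetchone()
print(count)  # 2
```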

During the same interview, Rogers said that providing readers with the searchable data behind stories is a counter-balance to the public's cynicism toward the media.

When we launched the Datablog, we thought it was just going to be developers [using it]. What it turned out to be, actually, is real people out there in the world who want to know what's going on with a story. And I think part of that is the fact that people don't trust journalists any more, really. They don't trust us to be truthful and honest, so there's a hunger to see the stories behind the stories.

For more about how Rogers' group dealt with the WikiLeaks data and how data journalism works, check out the full interview in the following video:


February 23 2011

Data is a currency

If I talk about data marketplaces, you probably think of large resellers like Bloomberg or Thomson Reuters. Or startups like InfoChimps. What you probably don't think of is that we as consumers trade in data.

Since the advent of computers in enterprises, our interaction with business has caused us to leave a data imprint. In return for this data, we might get lower prices or some other service. The web has only accelerated this, primarily through advertising, and big data technologies are adding further fuel to this change.

When I use Facebook I'm trading my data for their service. I've entered into this commerce perhaps unwittingly, but using the same mechanism humankind has known throughout our history: trading something of mine for something of theirs.

So let's guard our privacy by all means, but recognize this is a bargain and a marketplace we enter into. Consumers will grow more sophisticated about the nature of this trade, and adopt tools to manage the data they give up.

Is this all one-way traffic? Business is certainly ahead of the consumer in the data management game, but there's a race for control on both sides. To continue the currency analogy, browsers have had "wallets" for a while, so we can keep our data in one place.

The maturity of the data currency will be signalled by personal data bank accounts that give us, the consumers, control and traceability. The Locker project is a first step toward this goal, giving users a way to get their data back from disparate sites, but it is just one of many possible future models.

Who runs data banks themselves will be another point of control in the struggle for data ownership.


February 10 2011

Big Data: An opportunity in search of a metaphor

The crowd at the Strata Conference could be divided into two broad contingents:

  1. Those attending to learn more about data, having recently discovered its potential.
  2. Long-time data enthusiasts watching with mixed emotions as their interest is legitimized, experiencing a feeling not unlike when a band that you've been following for years suddenly becomes popular.

A data-oriented event like this, outside a specific vertical, could not have drawn a large crowd with this level of interest, even two years ago. Until recently, data was mainly an artifact of business processes. It now takes center stage; organizationally, data has left the IT department and become the responsibility of the product team.

Of course "data," in its abstract sense, has not changed. But our ability to obtain, manipulate, and comprehend data certainly has. Today, data merits top billing due to a number of confluent factors, not least its increased accessibility via on-demand platforms and tools. Server logs are the new cash-for-gold: act now to realize the neglected riches within your upper drive bay.

But the idea of "big data" as a discipline, as a conference subject, or as a business, remains in its formative years and has yet to be satisfactorily defined.  This immaturity is perhaps best illustrated by the array of language employed to define big data's merits and its associated challenges. Commentators are employing very distinct wording to make the ill-defined idea of "big data" more familiar; their metaphors fall cleanly into three categories:

  • Natural resources ("the new oil," "goldrush" and of course "data mining"): Highlights the singular value inherent in data, tempered by the effort required to realize its potential.
  • Natural disasters ("data tornado," "data deluge," "data tidal wave"): Frames data as a problem of near-biblical scale, with subtle undertones of assured disaster if proper and timely preparations are not considered.
  • Industrial devices ("data exhaust," "firehose," "Industrial Revolution"): A convenient grab-bag of terminologies that usually portrays data as a mechanism created and controlled by us, but one that will prove harmful if used incorrectly.

If Strata's Birds-of-a-Feather conference sessions are anything to go by, the idea of "big data" requires the definition and scope these metaphors attempt to provide. Over lunch you could have met with like-minded delegates to discuss big data analysis, cloud computing, Wikipedia, peer-to-peer collaboration, real-time location sharing, visualization, data philanthropy, Hadoop (natch'), data mining competitions, dev ops, data tools (but "not trivial visualizations"), Cassandra, NLP, GPU computing, or health care data. There are two takeaways here: the first is that we are still figuring out what big data is and how to think about it; the second is that any alternative is probably an improvement on "big data."

Strata is about "making data work" — the tenor of the conference was less of a "how-to" guide, and more about defining the problem and shaping the discussion. Big data is a massive opportunity; we are searching for its identity and the language to define it.

February 08 2011

Will data be too cheap to meter?

Last week at Strata I got into an argument with a journalist over the future of CrunchBase. His position was that we were just in a "pre-commercial" world, that creating the database required a reporter's time, and so after the current aberration had passed we'd return to the old status quo where this kind of information was only available through paid services. I wasn't so sure.

When I explain to people why the Big Data movement is important — why it's a real change instead of a fad — I point to price as the fundamental difference between the old and new worlds. Until a few years ago, the state of the art for doing meaningful analysis of multi-gigabyte data sets was the data warehouse. These custom systems were very capable, but could easily cost millions of dollars. Today I can rent a hundred-machine Hadoop cluster from Amazon for just $10 an hour, and process thousands of gigabytes a day.
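The arithmetic behind that price claim, with an assumed, era-appropriate per-instance rate, and a rough comparison to an amortized warehouse:

```python
# The arithmetic behind "a hundred-machine cluster for $10 an hour".
# The per-instance price is an assumed, era-appropriate figure.

machines = 100
price_per_machine_hour = 0.10   # dollars; assumed small-instance rate

cluster_cost_per_hour = machines * price_per_machine_hour
print(cluster_cost_per_hour)    # -> $10/hour, and nothing to own

# Versus a multi-million-dollar warehouse amortized over three years:
warehouse_cost = 2_000_000      # assumed purchase price, dollars
hours = 3 * 365 * 24
print(round(warehouse_cost / hours))  # ~$76/hour, used or not
```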

This represents a massive discontinuity in price, and it's why Big Data is so disruptive. Suddenly we can imagine a group of kids in their garage building Google-scale systems practically on pocket money. While the drop in the cost of data storage and transmission has been less dramatic, it has followed a steady downward trend over the decades. Now that processing has become cheap too, a whole universe of poverty-stricken hackers, academics, makers, reporters, and startups can do interesting things with massive data sets.

What does this have to do with CrunchBase? The reporter had some implicit assumptions about the cost of the data collection process. He argued that it required extra effort from the journalists to create the additional value captured in the database. To paraphrase him: "It's time they'd rather spend at home playing with their kids, and so we'll end up compensating them for their work if we want them to continue producing it." What I felt was missing from this is that CrunchBase might actually be just a side-effect of work they'd be doing even if it wasn't released for public consumption.

Many news organizations are taking advantage of the dropping cost of data handling by heavily automating their news-gathering and publishing workflows. This can be as simple as Google Alerts or large collections of RSS feeds to scan, using scraping tools to gather public web data, and there's a myriad of other information-processing techniques out there. Internally there's a need to keep track of the results of manual or automated research, and so the most advanced organizations are using some kind of structured database to capture the information for future use.
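A minimal, hypothetical sketch of that kind of pipeline, using only the Python standard library: scan an RSS feed and capture each item in a structured store for future use. The feed content is inlined here so the example is self-contained; a real workflow would fetch feeds on a schedule and persist the database to disk.

```python
# Sketch of an automated news-gathering step: scan an RSS feed and
# record each item in a structured store. The feed is inlined below;
# a real pipeline would fetch it with urllib on a schedule.
import sqlite3
import xml.etree.ElementTree as ET

RSS = """<rss version="2.0"><channel>
  <item><title>Startup X raises Series A</title><link>http://example.com/1</link></item>
  <item><title>Startup Y acquired</title><link>http://example.com/2</link></item>
</channel></rss>"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (title TEXT, link TEXT UNIQUE)")

# The UNIQUE constraint plus INSERT OR IGNORE makes re-scanning idempotent.
for item in ET.fromstring(RSS).iter("item"):
    db.execute("INSERT OR IGNORE INTO items VALUES (?, ?)",
               (item.findtext("title"), item.findtext("link")))

count = db.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(f"captured {count} items")  # -> captured 2 items
```

Once items land in a table like this, publishing something CrunchBase-shaped really is just an export step on top of work already being done.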

That means the only extra effort required to release something like CrunchBase is publishing it to the web. Assuming there are some benefits to doing so (that TechCrunch's reputation as the site-of-record for technology company news is enhanced, for example) and that multiple companies have the data available, the low cost of release means it makes sense to give it away.

I don't actually know whether all these assumptions are true; CrunchBase's approach may not be sustainable. But I hope it illustrates how a truly radical change in price can upset the traditional rules. Even on a competitive, commercial, free-market playing field it sometimes makes sense to behave in ways that appear hopelessly altruistic. We've seen this play out with open-source software. I expect to see pricing forces do something similar to open up more and more sources of data.

I'm usually the contrarian guy in the room arguing that information wants to be paid, so I don't actually believe (as Lewis Strauss famously said about electricity) all data will be too cheap to meter. Instead I'm hoping we'll head toward a world where producers of information are paid for adding real value. Too many "premium" data sets are collated and merged from other computerized sources, and that process should be increasingly automatic, and so increasingly cheap. Give me a raw CrunchBase culled from press releases and filings for free, then charge me for your informed opinion on how likely the companies are to pay their bills if I extend them credit. Just as free, open-source software has served as the foundation for some very lucrative businesses, the new world of free public data will trigger a flood of innovations that will end up generating value in ways we can't foresee, and that we'll be happy to pay for.

February 03 2011

A new challenge looks for a smarter algorithm to improve healthcare

Starting on April 4, the Heritage Health Prize (@HPNHealthPrize) competition, funded by the Heritage Provider Network (HPN), will ask the world's scientists to submit an algorithm that will help them to identify patients at risk of hospitalization before they need to go to the emergency room.

"This competition is to literally predict the probability that someone will go to the hospital in the next year," said Anthony Goldbloom at the Strata Conference. Goldbloom is the founder and CEO of Kaggle, the Australian data mining company that has partnered with HPN on the competition. "The idea is to rank how at risk people are, go through the list and figure out which of the people on the list can be helped," he said.
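As a toy illustration of the task Goldbloom describes (score every patient's risk, then rank the list), here is a hand-tuned logistic-style scorer. The features and weights are invented for illustration only; an actual entry would learn them from HPN's claims data, not hand-code them:

```python
# Toy version of the competition's task: score hospitalization risk for
# each patient and rank the list. Features and weights are invented for
# illustration -- a real entry would train a model on claims data.
import math

def risk_score(age, er_visits_last_year, chronic_conditions):
    # A hand-tuned logistic score standing in for a trained model.
    z = 0.03 * age + 0.8 * er_visits_last_year + 0.5 * chronic_conditions - 4.0
    return 1 / (1 + math.exp(-z))    # probability-like value in (0, 1)

patients = {
    "A": (82, 3, 4),   # elderly, frequent ER visits, several conditions
    "B": (35, 0, 0),   # young and healthy
    "C": (60, 1, 2),
}

ranked = sorted(patients, key=lambda p: risk_score(*patients[p]), reverse=True)
print(ranked)  # highest risk first -> ['A', 'C', 'B']
```

The competition's hard part is not the ranking step but learning a scoring function that actually predicts next-year admissions.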

If successful, HPN estimates that the algorithm produced by this competition could save them billions in healthcare costs. In the process, the development and deployment of the algorithm could provide the rest of the healthcare industry with a successful model for reducing costs.

"Finally, we've got a data competition that has real world benefits," said Pete Warden, author of the "Data Source Handbook" and founder of OpenHeatMap. "This is like the Netflix Prize, but for something far more important."

The importance of reducing healthcare costs can't be overstated. Some $2.8 trillion is spent annually on healthcare in the United States, and that number is expected to grow in the years ahead. "There are two problems with the healthcare reform law," said Jonathan Gluck, a senior executive at HPN. "We pay for quantity — the more services you consume, the more we're going to bill you — and it never addressed personal responsibility."

If patients who would benefit from receiving lower cost preventative care can receive relevant treatments and therapies earlier, the cost issue might be addressed.

Why a prize?

HPN is just the latest organization to turn to a prize to generate a solution to a big problem. The White House has been actively pursuing prizes and competitions as a means of catalyzing collaborative innovation in open government around grand national challenges. From the X-Prize to the Netflix Prize to a growing number of government-sponsored challenges, 2011 might just be the year this method for generating better answers hits the adoption tipping point.

Goldbloom noted that in the eight months that Kaggle has hosted competitions, they've never had one where the benchmark hasn't been outperformed. From tourism forecasting to chess ratings, each time the best method was quickly improved within a few weeks, said Goldbloom.

As David Zax highlighted in his Fast Company article on the competition, adding an algorithm to find patients at risk might suggest that doctors' diagnoses or clinical skills are being subtracted from the equation. The idea here is not necessarily to take away a doctor's skills. Rather, it's to provide them with predictive analytics that augment those capabilities. As Zax writes, that has to be taken in context with the current state of healthcare:

A shortage of primary care physicians in the U.S. means that doctors don't always have time to pick up on the subtle connections that might lead to a Gregory House-style epiphany of what's ailing a patient. More importantly, though, the algorithms may point to connections that a human mind simply would never make in the first place.

Balancing privacy with potential

One significant challenge with this competition is that the data set isn't just about what movies people are watching. It's about healthcare, and that introduces a host of complexities around privacy and compliance with regulations. The data has to be de-identified, which naturally limits what can be done with it. Gluck emphasized that the competition is HIPAA-compliant, and that avoiding a data breach has been prioritized ahead of a successful outcome in the competition. Given the sanctions that exist for such a breach, doing otherwise might well have made the competition a non-starter.

Gluck said that Khaled El Emam, a professor at the University of Ottawa and a noted healthcare privacy expert, has been attempting to de-anonymize the test data sets, using public databases and other techniques to try to triangulate identity with records. To date he has not been successful.
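One standard way to reason about such re-identification risk is k-anonymity: count how many records share each combination of "quasi-identifiers" (age band, ZIP prefix, sex), and flag any combination held by fewer than k people as a potential triangulation target. The sketch below is a generic illustration of that idea, with made-up records, not a description of El Emam's actual method:

```python
# Minimal k-anonymity check: records with no direct identifiers can
# still be triangulated through quasi-identifier combinations. Any
# combination shared by fewer than k records is a re-identification risk.
from collections import Counter

records = [
    {"age_band": "60-70", "zip3": "191", "sex": "F"},
    {"age_band": "60-70", "zip3": "191", "sex": "F"},
    {"age_band": "20-30", "zip3": "940", "sex": "M"},  # unique -> risky
]

K = 2
groups = Counter((r["age_band"], r["zip3"], r["sex"]) for r in records)
risky = [combo for combo, n in groups.items() if n < K]
print(risky)  # -> [('20-30', '940', 'M')]
```

Coarsening the quasi-identifiers (wider age bands, shorter ZIP prefixes) raises k at the cost of analytic detail, which is exactly the trade-off the competition's organizers had to make.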

Hotspotting the big picture

The potential of the Heritage Health Challenge will be familiar to readers of the New Yorker, where Dr. Atul Gawande published a feature on "healthcare hotspotting." In the article, Gawande examines the efforts of physicians like Dr. Jeffrey Brenner, of Camden, New Jersey, to use data to discover the neediest patients and deliver them better care.

The Camden Coalition has been able to measure its long-term effect on its first thirty-six super-utilizers. They averaged sixty-two hospital and E.R. visits per month before joining the program and thirty-seven visits after—a forty-per-cent reduction. Their hospital bills averaged $1.2 million per month before and just over half a million after—a fifty-six-per-cent reduction.

These results don’t take into account Brenner’s personnel costs, or the costs of the medications the patients are now taking as prescribed, or the fact that some of the patients might have improved on their own (or died, reducing their costs permanently). The net savings are undoubtedly lower, but they remain, almost certainly, revolutionary. Brenner and his team are out there on the boulevards of Camden demonstrating the possibilities of a strange new approach to health care: to look for the most expensive patients in the system and then direct resources and brainpower toward helping them.
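The quoted reductions are easy to verify as arithmetic (reading "just over half a million" as the roughly $528,000 implied by the fifty-six-per-cent figure):

```python
# The quoted Camden figures, checked as simple percentages.
visits_before, visits_after = 62, 37
bills_before, bills_after = 1_200_000, 528_000  # "just over half a million"

visit_reduction = (visits_before - visits_after) / visits_before
bill_reduction = (bills_before - bills_after) / bills_before

print(f"{visit_reduction:.0%} fewer visits")  # -> 40% fewer visits
print(f"{bill_reduction:.0%} lower bills")    # -> 56% lower bills
```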

The results of the approach taken in Camden are controversial, as Gawande's response to criticism of his article acknowledges. The promise of applying data science to identifying patients at higher risk, however, comes at a time when the ability of that discipline to deliver meaningful results has never been greater. If a smarter predictive algorithm emerges from this contest, the $3 million in prize money may turn out to have been a bargain.


ePayments Week: How big a bite will Apple take?

Here's what caught my attention in the payment space this week.

How big a bite will Apple take out of the mobile commerce market?

ePayments WeekWith the consensus growing that Apple will probably introduce contactless payment (using NFC technology) soon — most likely with iPhone 5's release this summer — we're all wondering how big a deal that's going to be. Gartner analyst Avivah Litan, quoted in Computerworld, suspects that it will be hugely disruptive since it could empower Apple to let its 160 million iTunes customers bypass the credit card companies. Currently, iTunes customers link credit cards to their accounts, but Apple could eventually encourage customers to link directly to their bank accounts (as PayPal has been doing), thus cutting the credit card companies out of the fun. Keep in mind, iTunes has about twice as many members as PayPal's 81 million.

On the other hand, Apple's move into mobile commerce could create opportunities for banks and credit card companies, according to Andrew Johnson in American Banker. He quotes Richard Doherty, research director of the Envisioneering Group: "Apple's probably someone you want to work with [rather] than try to fend off ... They do touch tens of millions of Americans, hundreds of millions of people around the world." The piece points out the dangers of fragmentation, with certain retailers being "handcuffed" to certain phone operating systems, depending on the hardware they buy to handle transactions at the register. (Some of us remember similar warnings to Apple in early iPod/iTunes days about the dangers of going it alone on hardware and digital music standards, suggesting Apple should join in with recording industry standards. We all know how that turned out.)

Apple's ability to shape a winning platform all on its own is part of the point in Drew Sievers' blog on MFoundry about Apple's potential to make a bigger impact than simply enabling phone purchases. Sievers reminds us how refreshing and unique purchases are at Apple's retail outlets and wonders why they don't offer their point-of-sale (POS) solution to other retailers. Indeed, other retailers already use the same hardware that the Apple Store does, but the software and, more importantly, the sales training that accompany it are Apple's. Apple's stores are famously profitable per square foot. Transferring its whole solution (hardware, software, training, and t-shirted hipsters) to other retailers could be revolutionary, getting clerks out from behind the register and in front of the wavering customer.

The walled gardens dictate coin of the realm

Facebook recently notified its developer community that as of July 1, all social games on Facebook need to include the ability to process payments with Facebook Credits. Facebook promotes Credits as a currency to its game developers on the basis that using them is economical (the system to deploy them is already in place) and it increases usage (perhaps because credits don't feel like real money to people using them?). But Facebook takes a 30% cut of transactions completed with Facebook Credits, which feels like a serious vig.

Facebook isn't the only walled garden that's tightening the rules on what type of coin visitors to the realm can use. The Wall Street Journal reported that in the same week that Apple rejected a digital book application from Sony, it tightened the rules on how publishers can collect payment. Journal blogger Russell Adams reported that digital magazine publisher Zinio had received notice that newspaper and magazine apps would need to handle transactions through iTunes rather than an alternate system. Apparently, Zinio already does this, so their system won't change. But other publishers, like the Journal, may need to change the way they bill for content, with Apple taking a cut.

At the Strata Conference: "Why are consumers comfortable telling us where they are?"

Last week, ahead of Data Privacy Day, Microsoft released results of a survey finding a wide gap between the percentage of consumers who think location services are valuable (94%) and those who are, nonetheless, wary about saying where they are (52%). Even so, that means nearly half of those surveyed weren't concerned about sharing that data. That seems a big shift from only a few years ago when people were nervous about giving their real name on the Internet. So I asked some of the people at this week's O'Reilly Strata conference whether they thought there had been a shift in the way consumers view the trade-off of services for privacy.

Alistair Croll, an analyst and partner at Bitcurrent and one of Strata's co-chairs, suggested that consumers' growing comfort with sharing data reflects a rational economic choice, based on the sense that you get more for it than you used to. "Ten years ago, there wasn't a lot of upside to giving out your personal data," he said. But sharing that data today returns information that helps you lead a more effective life, whether that's driving directions or restaurant locations.

Hilary Mason, a computer science professor working with, added that there are network effects to this value as more people share their locations: "I get better recommendations, I get ambient awareness of people I haven't seen in a while. The network effects of being open begin to outweigh the benefits of not being open."

Some wondered if the missing link here was simply more transparency about where your data was going and who else was using it for what types of purpose. But Tim O'Reilly challenged that notion. "If you had access to that information, what would you do differently? I don't think it would actually change our behavior," he said. "I think the thing to do is to figure out what's bad — what's creepy, if you will — and then punish it when that happens." In other words, whether lured or spurred to it, consumers are heading down a path to reveal more and more of their personal data — at least until some "data apocalypse," as Croll put it, comes along and causes us to rethink our extroversion.

Got news?

News tips and suggestions are always welcome, so please send them along.

If you're interested in learning more about the payment development space, check out PayPal X DevZone, a collaboration between O'Reilly and PayPal.


January 27 2011

Will data warehousing survive the advent of big data?

For more than 25 years, data warehousing has been the accepted architecture for providing information to support decision makers. Despite numerous implementation approaches, it is founded on sound information management principles, most particularly that of integrating information according to a business-directed and predefined model before allowing use by decision makers. Big data, however one defines it, challenges some of the underlying principles behind data warehousing, causing some analysts to question if the data warehouse will survive.

In this article, I address this question directly and propose that data warehousing, and indeed information management as a whole, must evolve in a radically new direction if we are to manage big data properly and solve the key issue of finding implicit meaning in data.

Back in the 1980s I worked for IBM in Ireland, defining the first published data warehouse architecture (Devlin & Murphy, 1988). At that time, the primary driver for data warehousing was to reconcile data from multiple operational systems and to provide a single, easily-understood source of consistent information to decision makers. The architecture defined the "Business Data Warehouse (BDW) ... [as] the single logical storehouse of all the information used to report on the business ... In relational terms, the end user is presented with a view / number of views that contain the accessed data ..." Note the phrase "single logical storehouse" — I'll return to it later.

Big data (or what was big data then — a few hundred MB in many cases!) and the poor performance of early relational databases proved a challenge to the physical implementation of this model. Within a couple of years, the layered model emerged. Shown in Figure 1 (below), this has a central enterprise data warehouse (EDW) as a point of consolidation and reconciliation, and multiple user-access data marts fed from it. This implementation model has stood the test of time. But it does say that all data must (or should) flow through the EDW, the implications of which I'll discuss later.

Operational systems
Figure 1: The Traditional Data Warehouse Architecture.

The current hype around "big data" has caused some analysts and vendors to declare the death of data warehousing, and in some cases, the demise even of the relational database.

A prerequisite to discussing these claims is to understand and clearly define the term "big data." However, it's a fairly nebulous concept. Wikipedia's definition, as of December 2010, is vague and pliable:

Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.

So, it's as big as you want and getting ever larger.

A taxonomy for data — mind over matter

To get a better understanding, we need to look at the different types of data involved and, rather than focus on the actual data volumes, look to the scale and variety of processing required to extract implicit meaning from the raw data.

Figure 2 (below) introduces a novel view of data, its categories and its relationship to meaning, which I call somewhat cheekily "Mind over Matter."

Operational systems
Figure 2: Mind over Matter and the Heart of Meaning.

Broadly speaking, the bottom pyramid represents data gleaned primarily from the physical world, the world of matter. At the lowest level, we have measurement data, sourced from a variety of sensors connected to computers and the Internet. Such physical event data includes location, velocity, flow rate, event count, G-force, chemical signal, and many more. Such measurements are widely used in science and engineering applications, and have grown to enormous volumes in areas such as particle physics, genomics and performance monitoring of complex equipment. This type of big data has been recognized by the scientific and engineering community for many years and is the basis for much modern research and development. When such basic data is combined in meaningful ways, it becomes interesting in the commercial world.

Atomic data thus comprises physical events, meaningfully combined in the context of some human interaction. For example, a combined set of location, velocity and G-force measurements in a specific pattern and time from an automobile monitoring box may indicate an accident. A magnetic card reading of account details, followed by a count of bills issued at an ATM, is clearly a cash withdrawal transaction. More sophisticated combinations include call detail records (CDRs) in telecom systems, web log records, e-commerce transactions and so on. There's nothing new in this type of big data. Telcos, financial institutions and web retailers have statistically analyzed it extensively since the early days of data warehousing for insight into customer behavior and as a basis for advertising campaigns or offers aimed at influencing it.

Derived data, created through mathematical manipulation of atomic data, is generally used to present a more meaningful view of business information to humans. For example, banking transactions can be accumulated and combined to create account status and balance information. Transaction data can be summarized into averages or sampled. Some of these processes result in a loss of detailed data. This data type and the two below it in the lower pyramid comprise hard information: largely numerical and keyword data, well structured for use by computers and amenable to standard statistical processing.
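A minimal sketch of that roll-up, with invented transactions: atomic records are aggregated into derived balances and averages, and the individual transactions are no longer recoverable from the summary alone.

```python
# Sketch of "derived data": atomic banking transactions rolled up into
# account balances and a summary average. Note the loss of detail --
# the summary cannot reproduce the individual transactions.
from collections import defaultdict
from statistics import mean

# Atomic data: (account, amount) pairs, invented for illustration.
transactions = [("acct1", -40.0), ("acct1", 120.0),
                ("acct1", -25.0), ("acct2", 300.0)]

balances = defaultdict(float)
for acct, amount in transactions:
    balances[acct] += amount

avg_amount = mean(amount for _, amount in transactions)

print(dict(balances))  # -> {'acct1': 55.0, 'acct2': 300.0}
print(avg_amount)      # -> 88.75
```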

As we move to the top pyramid, we enter the realm of the mind — information originating from the way we as humans perceive the world and interact socially within it. We also call this soft information — less well structured and requiring more specialized statistical and analytical processing. The top layer is multiplex data, image, video and audio information, often in smaller numbers of very large files and very much part of the big data scene. Very specialized processing is required to extract context and meaning from such data and extensive research is ongoing to create the necessary tools. The layer below — textual data — is more suited to statistical analysis and text analytics tools are widely used against big data of this type.

The final layer in our double pyramid is compound data, a combination of hard and soft information, typically containing the structural, syntactic and model information that adds context and meaning to hard information and bridges the gap between the two categories. Metadata is a very significant subset of compound data. It is part of the data/information continuum; not something to push out to one side of the information architecture as a separate box — as often seen in data warehousing architectures.

Compound data is probably the category of most current interest in big data. It contains much social media information — a combination of hard web log data and soft textual and multimedia data from sources such as Twitter and Facebook.

The width of each layer in the pyramids corresponds loosely to data volumes and numbers of records in each category. The outer color bands in Figure 2 place data warehousing and big data in context. The two concepts overlap significantly in the world of matter. The major difference is that big data includes and even focuses on the world of mind at the detailed, high volume level.

More importantly, the underlying reason we do data warehousing (more correctly, business intelligence, for which data warehousing is the architectural foundation) and analyze big data is essentially the same: we are searching for meaning in the data universe. And meaning resides at the conjoined apexes of the two pyramids.

Both data warehousing and big data begin with highly detailed data, and approach its meaning by moving toward very specific insights that are represented by small data sets that the human mind can grasp. The old nugget, now demoted to urban legend, of "men who buy diapers on Friday evenings are also likely to buy beer" is a case in point. Business intelligence works more from prior hypotheses, whereas big data uses statistics to extract hypotheses.

Now that we understand the different types of data and how big data and data warehousing relate, we can address the key question: does big data spell the end of data warehousing?

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR111RAD

Reports of my death are greatly exaggerated

Data warehousing, as we currently do it — and that's a key phrase — is usually rather difficult to implement and maintain. The ultimate reason is that data warehousing seeks to ensure that enterprise-wide decision making is consistent and trusted. This was and is a valid and worthy objective, but it's also challenging. Furthermore, it has driven two architectural aims:

  1. To define, create and maintain a reconciled, integrated set of enterprise data for decision making.
  2. That this set should be the single source for all decision-making needs, be they immediate or long-term, one-off or ongoing, throw-away or permanent.

The first of these aims makes sense: there are many decisions which should be based on reconciled and integrated information for commercial, legal or regulatory reasons. The second aim was always questionable — as shown, for example, by the pervasive use of spreadsheets — and becomes much more so as data volumes and types grow. Big data offers new, easier and powerful ways to interactively explore even larger data sets, most of which have never seen the inside of a data warehouse and likely never will.

Current data warehousing practices also encourage and, in many ways, drive the creation of multiple copies of data. Data is duplicated across the three layers of the architecture in Figure 1, and further duplicated in the functional silos of the data marts. What is more, the practice of building independent data marts fed directly from the operational environment, bypassing the EDW entirely, is lamentably common. The advent of big data, with its large and growing data volumes, argues strongly against duplication of data. I've explored these issues and more in a series of articles on B-eye-Network (Devlin, 2010), concluding that a new inclusive architecture — Business Integrated Insight (BI2) — is required to extend existing data warehousing approaches.

Big data will give (re)birth to the data warehouse

As promised, it is time to return to the "single logical storehouse" of information required by the business. Back in the 1980s, that information was very limited in comparison to what business needs today, and its uses were similarly circumscribed. Today's business needs both a far broader information environment and a much more integrated processing approach. A single logical storehouse is required with both a well-defined, consistent and integrated physical core, and a loose federation of data whose diversity, timeliness and even inconsistency is valued. In order to discuss this sensibly, we need some new terminology that minimizes confusion and contention between the advocates of the various different technologies and approaches.

The first term is "Business Information Resource" (BIR), introduced in a Teradata-sponsored white paper (Devlin, 2009), and defined as a single logical view of the entire information foundation of the business that aims to differentiate between different data uses and to reduce the tendency to duplicate data multiple times. Within a unified information space, the BIR has a conceptual structure allowing reasonable boundaries of business interest and implementation viability to be drawn (Devlin, 2010a). With such a broad scope, the BIR is clearly instantiated in a number of technologies, of which relational and XML databases, and distributed file and content stores such as Hadoop are key. Thus, the relational database technology of the data warehouse is focused on the creation and maintenance of a set of information that can support common and consistent decision making. Hadoop, MapReduce and similar technologies are directed to their areas of strength such as temporary, throw away data, fast turnaround reports where speed trumps accuracy, text analysis, graphs, large-scale quantitative analytical sand boxes, and web farm reporting. Furthermore, these stores are linked through virtual access technology that presents the separate physical stores to the business user as a single entity as and when required.
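The map/reduce pattern those stores are built on can be shown in miniature in plain Python. Real Hadoop distributes the map and reduce steps across a cluster of machines; the shape of the computation is the same, which is why it suits the throw-away, speed-over-polish analyses described above.

```python
# The MapReduce pattern in miniature: a throw-away word-count analysis
# of the kind the text assigns to Hadoop-style stores.
from itertools import groupby

docs = ["big data meets the warehouse", "the warehouse endures"]

# Map: emit (word, 1) pairs from every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: bring identical keys together (groupby needs sorted input).
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = {word: sum(n for _, n in group)
          for word, group in groupby(mapped, key=lambda kv: kv[0])}

print(counts["warehouse"])  # -> 2
```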

The second term, "Core Business Information" (CBI), from an Attivio-sponsored white paper (Devlin, 2010b), is the set of information that ensures the long-term quality and consistency of the BIR. This information needs to be modeled and defined at an early stage of the design and its content and structure subject to rigorous change management. While other information may undergo changes in definition or relationships over time, the CBI must remain very stable.

While space doesn't permit a more detailed description here of these two concepts, the above-mentioned papers make clear that the CBI contains the information at the heart of a traditional enterprise data warehouse (and, indeed, of modern Master Data Management). The Business Information Resource, on the other hand, is a return to the conceptual basis of the data warehouse — a logical single storehouse of all the information required by the business, which, by definition, encompasses big data in all its glory.


While announcing the death of data warehousing and relational databases makes for attention-grabbing headlines, reality is more complex. Big data is actually a superset of the information and processes that have characterized data warehousing since its inception, with big data focusing on large-scale and often short-term analysis. With the advent of big data, data warehousing itself can return to its roots — the creation of consistency and trust in enterprise information. In truth, there exists a substantial overlap between the two areas; the precepts and methods of both are highly complementary and the two will be mandatory for all forward-looking businesses.


Devlin, B. A. and Murphy, P. T., "An architecture for a business and information system," IBM Systems Journal, Volume 27, Number 1, Page 60 (1988)

Devlin, B., "Business Integrated Insight (BI2) — Reinventing enterprise information management," White Paper (2009)

Devlin, B., "From Business Intelligence to Enterprise IT Architecture," B-eye-Network (2010)

Devlin, B., "Beyond Business Intelligence," Business Intelligence Journal, Volume 15, Number 2, Page 7 (2010a)

Devlin, B., "Beyond the Data Warehouse: A Unified Information Store for Data and Content," White Paper (2010b)


The "dying craft" of data on discs

To prepare for next week's Strata Conference, we're continuing our series of conversations with innovators working with big data and analytics. Today, we hear from Ian White, the CEO of Urban Mapping.

Mapfluence, one of Urban Mapping's products, is a spatial database platform that aggregates data from multiple sources to deliver geographic insights to clients. GIS services online are not a completely new idea, but White said the leading players haven't "risen to the occasion." That's left open some new opportunities, particularly at the lower end of the market. Whereas traditional GIS services still often deliver data by mailing out a CD-ROM or through proprietary client-server systems, Urban Mapping is one of several companies that have updated the model to work through the browser. Their key selling point, White said, is a wider range of licensing levels that allows it to support smaller clients as well as the larger ones.

Geographic data is increasingly free, but the value proposition for companies like Urban Mapping lies in the intelligence behind the data, and the organization that makes it accessible. "We're in a phase now where we're aggregating a lot of high-value data," White said. "The next phase is to offer tools to editorially say what you want."

Urban Mapping aims to provide the domain expertise on the demographic datasets it works with, freeing clients up to focus on the intelligence revealed by the data. "A developer might spend a lot of time looking through a data catalog to find a column name. If, for example, the developer is making an application for commercial real estate and they want demographic information, they might wonder which one of 1,500 different indicators they want." Delivering the right one is obviously of a higher value than delivering a list of all 1,500. "That saves an enormous amount of time."

To achieve those time savings, Urban Mapping considers the end users and their needs when sourcing data. As they design the architecture around it, they think in three layers: the data layer, the application layer, and the user interface layer atop that. "We look to understand the user's ultimate purpose and then work back from there," White said, as they organize tables, add metadata, and make sure data is efficiently accessible to both technical and non-technical users.

"The notion of receiving a CD in the mail, opening it, reading the manual, it's kind of a dying craft," White said. "It's unfortunate that a lot of companies have built processes around having people on staff to do this kind of work. We can effectively allow those people to work in a higher-value area of the business."

You'll find the full interview in the following video:

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD


January 21 2011

Visualization deconstructed: Mapping Facebook's friendships

In the first post in Radar's new "visualization deconstructed" series, I talked about how data visualization originated from cartography (which some now just call "mapping"). Cartography initially focused on mapping physical spaces, but at the end of the 20th century we created and discovered new spaces made possible by the Internet. By abstracting away the constraints of physical space, social networks such as Facebook emerged and opened up new territories, where topology is defined primarily by the social fabric rather than by physical space. But is this social fabric entirely decoupled from physical space?

Mapping Facebook's friendships

Last December, Paul Butler, an intern on Facebook's data infrastructure engineering team, posted a visualization that examined a subset of the relations between Facebook users. Users were positioned in their respective cities and arcs denoted friendships.

Paul extracted the data and started playing with it. As he put it:

Visualizing data is like photography. Instead of starting with a blank canvas, you manipulate the lens used to present the data from a certain angle.

There is definitely discovery involved in creating a visualization: by giving visual attributes to otherwise invisible data, you create a form for the data to embody.

The most striking discovery Paul made while creating his visualization was the emergence of a remarkably detailed map of the world, including the shapes of the continents (remember that only the lines representing relationships are drawn).

If you compare the Facebook visualization with NASA's world at night pictures, you can see how close the two maps are, except for Russia and parts of China. It seems that Facebook has a big growth opportunity in these regions!

So let's have a look at Paul's visualization:

  • A complex network of arcs and lines does a great job communicating the notions of human activity and organic social fabric.
  • The choice of color palette works very well: it immediately makes us think of night shots of Earth, where city lights make human activity visible. The color contrast is well balanced, so we don't see too much blurring or bleeding of colors.
  • Choosing to draw only lines and arcs makes the visualization especially interesting: at first sight it looks as though the outlines of the continents and the cities were pre-drawn. Instead, they emerge from the arcs representing friendships between people in different cities, revealing a possible correlation between physical location and social friendships on the Internet.
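The core of this technique is simple enough to sketch in code. The Python below is a hypothetical recreation, not Paul's actual pipeline: city positions and friendship counts are assumed inputs, each tie becomes a quadratic Bezier arc bowed away from the straight segment, and weights pass through a log-scaled gradient so that a few heavy ties don't wash out the rest:

```python
import math

def arc_points(p1, p2, bend=0.2, steps=32):
    """Quadratic Bezier arc between two (x, y) map positions.

    Bowing each line away from the straight segment is what gives
    this style of map its organic look.
    """
    (x1, y1), (x2, y2) = p1, p2
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    dx, dy = x2 - x1, y2 - y1
    cx, cy = mx - dy * bend, my + dx * bend  # control point, offset perpendicular to the segment
    pts = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x1 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y1 + 2 * (1 - t) * t * cy + t ** 2 * y2
        pts.append((x, y))
    return pts

def weight_to_color(weight, max_weight):
    """Map a friendship count to an RGB tuple on a dim-blue-to-white ramp."""
    t = min(1.0, math.log1p(weight) / math.log1p(max_weight))  # log scale tames heavy ties
    return (0.6 * t, 0.8 * t, 0.4 + 0.6 * t)
```

Feeding the resulting polylines to any plotting library, sorted so the dimmest arcs are drawn first, lets the bright, heavily weighted ties sit on top; the coastlines then emerge from nothing but relationship lines.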


Overall, this is a great visualization that enjoyed a lot of success last December, being mentioned in numerous blogs and liked by more than 2,000 people on Facebook. However, I can see a couple of ways to improve it and open up new possibilities:

  • Play with the color scale -- By using a less linear gradient as a color scale, or by using more than two colors, some other patterns may emerge. For instance, by using a clearer cut-off in the gradient, we could better see relations with a weight above a specific threshold. Also, using more than one color in the gradient might reveal the predominance of one color over another in specific regions. Again, it's something to try, and we'll probably lose some of the graphic appeal in favor of (perhaps) more insights into the data.
  • Play with the drawing of the lines -- Because the lines are spread all over the map, it's a little difficult to identify "streams" of lines that all flow in the same direction. It would be interesting to draw the lines in three parts, where the middle part would be shared by many lines, creating "pipelines" of relationships from one region to another. Of course, this would require a lot of experimentation and it might not even be possible with the tools used to draw the visualization.
  • Use a different reference to position cities -- Cities in the visualization are positioned using their geographical position, but there are other ways they could be placed. For instance, we could position them on a grid, ordered by population or GDP. What kinds of patterns and trends would emerge from that change of perspective?
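The first suggestion, a clearer cut-off in the gradient, is easy to prototype. This hypothetical two-regime ramp compresses every weight below a chosen threshold into a dim band and reserves the bright half of the scale for weights above it (`threshold` and `max_weight` are assumed inputs, with `max_weight` greater than `threshold`):

```python
def cutoff_color(weight, threshold, max_weight):
    """Two-regime gradient: faint below the threshold, bright above it."""
    if weight < threshold:
        t = 0.15 * weight / threshold  # weak ties squeezed into a dim band
    else:
        t = 0.5 + 0.5 * (weight - threshold) / (max_weight - threshold)
    return (t, 0.9 * t, 1.0)  # blue-tinted ramp
```

The deliberate jump at the threshold makes ties above it pop out, at the cost of some of the smooth glow that gives the original its graphic appeal.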

Static requires storytelling

In last week's post, I looked at an interactive visualization, where users can explore the data and its different representations. With the Facebook data, we have a static visualization where we can only look, not touch — it's like gazing at the stars.

Although a static visualization has the potential to evolve into an interactive one, I think creating a static image requires a little more care. Interactive visualizations can serve as exploration tools, but a static visualization needs to present the insight the data explorer had when creating it. It has to tell a story to be interesting.


January 19 2011

Data startups: we want you

The startup showcase at Web 2.0 Expo NY 2010.

The O'Reilly Strata Conference on making data work is almost upon us. There's one final opportunity to be a part of this epoch-defining event: the Startup Showcase.

We're seeking startups that want to pitch the attendees — a broad selection of leaders from the business and investment community, as well as elite developers and savvy data-heads. Successful applicants will receive a couple of Strata registrations gratis, and the chance to be one of three winners who get to give their company pitches on the main stage.

Our judges include Roger Ehrenberg of IA Ventures, one of the leading investors in data-driven companies, and Tim Guleri of Sierra Ventures, whose successes include big data star Greenplum, which was recently acquired by EMC.

You've got until the end of this week to tell us why your startup should be a part of Strata. Submissions close on Friday night (Jan. 21), so apply now.

January 13 2011

Strata Week: Data centers

Here's what caught my attention in the data world this week.


As a former Bostonian, I well remember the Big Dig: a project to sink the city's Central Artery underground and add tunnels and a bridge to relieve traffic congestion. The effort consumed (not unpredictably) several years and billions of dollars more than originally projected. And by the time the dust cleared, traffic had increased so much that congestion was just as bad, if not worse, than when the project began.

As with Massachusetts and vehicles, so with government and data. Fittingly, the Boston Phoenix this week published a look at the challenge of government data.

Digital storage is not a natural resource. The amount of information that government agencies may be required to keep — from tweets and emails to tax histories — is growing faster than the capacity for storage.

While the Obama administration has made strides to address this situation by forming the Office of Government Information Services (OGIS) in 2009 and appointing Vivek Kundra as Chief Information Officer, the need is so large as to remain overwhelming at the moment. From the same Phoenix article:

The United States Census Bureau alone maintains about 2560 terabytes of information — more data than is contained in all the academic libraries in America, and the equivalent of about 50 million four-door filing cabinets of text documents. In addition to the federal deluge, tens of thousands of municipal and state facilities maintain data ranging from driver's-license pics to administrative e-mails — or at least they're required to.
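That comparison holds up as a back-of-the-envelope figure: spreading the Census Bureau's holdings across 50 million cabinets works out to roughly 51 MB of text apiece, on the order of 25,000 pages at a couple of kilobytes per page:

```python
total_bytes = 2560 * 10**12           # 2,560 terabytes of Census Bureau data
cabinets = 50 * 10**6                 # 50 million four-door filing cabinets
per_cabinet = total_bytes / cabinets  # bytes of text per cabinet

print(per_cabinet / 10**6)            # 51.2 (MB per cabinet)
```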

More and more, huge storage requirements meet staffing cuts and tight budgets in a complicated showdown. Many municipal governments have IT staffs of one or two people, if they have IT staffs at all.

This, of course, leads to a lot of outsourcing and privatization, which come with personnel and expertise benefits, but also security drawbacks. Will government digitization and data transport us into the future, or become another big dig against a rising tide?


This data was raised in a barn

Of course, privatization may save the day after all, if it can significantly lower the cost of data centers by saving power. That's one goal of Microsoft's new data center in Quincy, Wash.: it will use outside air for cooling (known as "air-side economizing"), and house the racks in a barn-like building that protects the servers from wind and rain but is otherwise "virtually transparent to ambient outdoor conditions."
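At its core, air-side economizing is a control decision: draw in free outside air whenever it falls within the allowable envelope, and fall back to mechanical cooling otherwise. A minimal sketch, using illustrative setpoints rather than Microsoft's actual ones:

```python
def use_outside_air(outdoor_temp_c, outdoor_humidity_pct,
                    max_temp_c=27.0, max_humidity_pct=80.0):
    """Decide whether outside air alone can cool the racks.

    The limits here are illustrative assumptions, loosely inspired by
    common data-center operating envelopes.
    """
    return outdoor_temp_c <= max_temp_c and outdoor_humidity_pct <= max_humidity_pct
```

In a cool, dry climate, a check like this passes for most hours of the year, which is why siting matters so much for free cooling.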

These new server farms will also make use of Microsoft's IT Pre-Assembled Components (ITPACs), which allow for flexibility and scaling, and will help keep costs down even further.

Microsoft, like Intel in New Mexico and Red Rocks Data Center and Sun Microsystems in Colorado, began experimenting with air-side economizing in late 2008 and 2009.

Air-side economiser of the kind Red Rocks Data Center was using in Colorado.

Intel went on to publish a white paper on air economization as well as a Proof of Concept video in which they report lowering power costs by nearly 74 percent. Red Rocks Data has since closed, but first reported that during the coolest months of the year, "our savings are averaging about $1,600 a month on a $5,000 total bill."

With or without a tractor shed, many more of these air-cooled data centers with a modular approach are likely to be built in the coming years. Microsoft expects to open others in Virginia and Iowa in 2011, and they likely will not be alone.

Maybe the White House should build itself a barn.

Resource room

If you're headed to Strata in a couple of weeks and find yourself in need of some anticipatory reading for the flight, download the recent Big Data reports from PricewaterhouseCoopers and NESTA (the National Endowment for Science, Technology and the Arts in the UK).

The NESTA report lays out some of the key concepts and threads from its November 2010 event, "The Power and Possibilities of Big Data." You can also watch video from the event, which brought together folks like Hans Peter Brøndmo from Nokia, Haakon Overli from Dawn Capital, Max Jolly from dunnhumby, and Megan Smith from Google.

The PwC issue is aimed at CIOs and covers "the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways." Several of these articles will be great background before heading off to Strata — hope to see you there!

Keep up with new developments in the data world with the Strata Week RSS feed.
