
August 07 2013

Predicting the future: Strata 2014 hot topics

Conferences like Strata are planned a year in advance. The logistics and coordination required for an event of this magnitude take a lot of planning, but they also take a decent amount of prediction: Strata needs to skate to where the puck is going.

While Strata New York + Hadoop World 2013 is still a few months away, we’re already guessing at what next year’s Santa Clara event will hold. Recently, the team got together to identify some of the hot topics in big data, ubiquitous computing, and new interfaces. We selected eleven big topics for deeper investigation.

  • Deep learning
  • Time-series data
  • The big data “app stack”
  • Cultural barriers to change
  • Design patterns
  • Laggards and Luddites
  • The convergence of two databases
  • The other stacks
  • Mobile data
  • The analytic life-cycle
  • Data anthropology

Here’s a bit more detail on each of them.

Deep learning

Teaching machines to think has been a dream/nightmare of scientists for a long time. Rather than teaching a machine explicitly, or using Watson-like statistics to figure out the best answer from a mass of data, Deep Learning uses simpler, core ideas and then builds upon them — much as a baby learns sounds, then words, then sentences.

It’s been applied to problems like vision (find an edge, then a shape, then an object) and better voice recognition. But advances in processing and algorithms are making it increasingly attractive for a large number of challenges. A Deep Learning model “copes” better with things its creators can’t foresee, or genuinely new situations. A recent MIT Technology Review article said these approaches improved image recognition by 70%, and improved Android voice recognition 25%. But 80% of the benefits come from additional computing power, not algorithms, so this is stuff that’s only become possible with the advent of cheap, on-demand, highly parallel processing.
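
To make the layering concrete, here's a toy sketch in JavaScript. The weights are invented for illustration (a real network learns them from data), but the shape of the computation is the point: each layer turns the features below it into slightly more abstract ones.

// A toy forward pass: pixels -> "edges" -> "shapes".
// The weights are made up; a trained network would learn them.
var relu = function (x) { return Math.max(0, x); };

function layer(inputs, weights, biases) {
  return weights.map(function (row, i) {
    var sum = biases[i];
    row.forEach(function (w, j) { sum += w * inputs[j]; });
    return relu(sum);
  });
}

var pixels = [0.9, 0.1, 0.8, 0.2];
var edges  = layer(pixels, [[1, -1, 0, 0], [0, 0, 1, -1]], [0, 0]);
var shapes = layer(edges, [[1, 1]], [-0.5]);

console.log(edges);  // low-level features: [0.8, 0.6]
console.log(shapes); // a higher-level feature built from them: [0.9]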

The main drivers of this approach are big companies like Google (which acquired DNNResearch), IBM and Microsoft. There are also startups in the machine learning space like Vicarious and Grok (née Numenta).

Deep Learning isn’t without its critics. Something learned in a moment of pain or danger might not be true later on, so the system needs to unlearn a conclusion — or at least reduce its certainty in it. What’s more, certain things might only be true after a sequence of events: once we’ve seen a person put a ball in a box and close the lid, we know there is a ball in the box, but a picture of the box afterward wouldn’t reveal this. An inability to take time into account is one of the criticisms Grok founder Jeff Hawkins levels at Deep Learning.

There’s some good debate, and real progress in AI and machine learning, as a result of the new computing systems that make these models possible. They’ll likely supplant the expert systems (yes/no trees) that are used in many industries but have fundamental flaws. Ben Goldacre described one such flaw at Strata in 2012: almost every patient who displays the symptoms of a rare disease instead has two much more common diseases with those symptoms.

(This is also why House is a terrible doctor show: in real medicine, the rare diagnosis is almost never the right one.)

In 2014, much of the data science content of Strata will focus on making machines smarter, and much of this will come from abundant back-end processing paired with advances in vision, sensemaking, and context.

Time-series data

Data is often structured according to the way it will be used.

  • To data designers, a graph is a mathematical structure that describes how pairs of objects relate to one another. This is why Facebook’s search tool is called Graph Search. To work with large numbers of relationships, we use a graph database, which organizes everything in it according to how it relates to everything else. This makes it easy to find things that are linked to one another, like routers in a network or friends at a company, even with millions of connections. As a result, a graph database often sits at the core of a social network’s application stack. Neo4j, Titan, and Vertex are examples.
  • On the other hand, a relational database keeps several tables of data (your name; a product purchase) and then links them by a common thread (such as the credit card used to buy the product, or the name of the person to whom it belongs). When most traditional enterprise IT people say “database,” they mean a relational database (RDBMS). The RDBMS has been so successful it’s supplanted most other forms of data storage.

(As a sidenote, at the core of the RDBMS is a “join,” an operation that links two tables. Much of the excitement around NoSQL databases was in fact about doing away with the join, which — though powerful — significantly restricts how quickly and efficiently an RDBMS can process large amounts of data. Ironically, the dominant language for querying many of these NoSQL databases through tools like Impala is now SQL. If the NoSQL movement had instead been called NoJoin, things might have been much clearer.)
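
For readers who have never thought about what a join actually does, here is the idea in miniature, sketched in JavaScript with made-up tables. Every output row requires a lookup in the other table: cheap here, but costly when both tables hold millions of rows spread across many machines, which is the expense the NoSQL crowd wanted to avoid.

// A relational join in miniature (invented data).
var people = [
  { card: "4111", name: "Alice" },
  { card: "4222", name: "Bob" }
];
var purchases = [
  { card: "4111", item: "Kettle" },
  { card: "4222", item: "Mug" }
];

// Join purchases to people on the common thread: the card number.
var joined = purchases.map(function (p) {
  var owner = null;
  people.forEach(function (person) {
    if (person.card === p.card) { owner = person.name; }
  });
  return { item: p.item, boughtBy: owner };
});

console.log(joined);
// [ { item: "Kettle", boughtBy: "Alice" }, { item: "Mug", boughtBy: "Bob" } ]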


Book Spiral – Seattle Central Library by brewbooks, on Flickr

Data systems are often optimized for a specific use.

  • Think of a coin-sorting machine — it’s really good at organizing many coins of a limited variety (nickels, dimes, pennies, etc.).
  • Now think of a library — it’s really good at a huge diversity of books, often only one or two of each, and not very fast.

Databases are the same: a graph database is built differently from a relational database; an analytical database (used to explore and report on data) is different from an operational one (used in production).

Most of the data in your life — from your Facebook feed to your bank statement — has one common element: time. Time is the primary key of the universe.

Since time is often the common thread in data, optimizing databases and processing systems to be really, really good at handling data over time is a huge benefit for many applications, particularly those that try to find correlations between seemingly different data — does the temperature on your Nest thermostat correlate with an increase in asthma inhaler use? Black Swans aside, time is also useful when trying to predict the future from the past.

Time-series data is at the root of life-logging and the Quantified Self movement, and will be critical for the Internet of Things. It’s a natural way to organize things that, as humans, we fundamentally understand. Time-series databases have a long history, and there’s a lot of effort underway to modernize them, as well as the analytical tools that crunch the data they contain, so we think time-series data deserves deeper study in 2014.
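
As a sketch of what “optimized for time” looks like in practice, here is how one might store and slice sensor readings in the MongoDB shell. The collection and field names are hypothetical.

// Hypothetical readings, each keyed by a timestamp.
db.readings.insert({ sensor: "thermostat-1", temp: 21.5, ts: ISODate("2013-08-01T09:00:00Z") });
db.readings.insert({ sensor: "inhaler-7", puffs: 2, ts: ISODate("2013-08-01T09:05:00Z") });

// Index on the timestamp, since nearly every query slices by it.
db.readings.ensureIndex({ ts: 1 });

// Pull an hour of data in order, ready to line up against another series.
db.readings.find({
  ts: { $gte: ISODate("2013-08-01T09:00:00Z"), $lt: ISODate("2013-08-01T10:00:00Z") }
}).sort({ ts: 1 });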

The Big Data app stack

We think we’re about to see the rise of application suites for big data. Consider the following evolution:

  1. On a mainframe, the hardware, operating system, and applications were often indistinguishable.
  2. Much of the growth of consumer PCs happened because of the separation of these pieces — companies like Intel and Phoenix made the hardware; Microsoft and Red Hat made the OS; and developers like WordPerfect, Lotus, and dBase made the applications.
  3. Eventually, we figured out what the PC was “for” and it acquired a core set of applications without which, it seems, a PC wouldn’t be useful. Those are generally described as “office suites,” and while there was once a rivalry for them, today, they’ve been subsumed by OS makers (Apple, Microsoft, Open Source) while those that didn’t have an OS withered on the vine (Corel).
  4. As we moved onto the web, the same thing started to happen — email, social network, blog, and calendar seemed to be useful online applications now that we were all connected, and the big portal makers like Google, Sina, Yahoo, Naver, and Facebook made “suites” of these things. So, too, did the smartphone platforms, from PalmPilot to Blackberry to Apple and Android.
  5. Today’s private cloud platforms are like yesterday’s operating systems, with OpenStack, CloudPlatform, VMware, Eucalyptus, and a few others competing based on their compatibility with public clouds, hardware, and applications. Clouds are just going through this transition to apps, and we’re learning that their “app suite” includes things like virtual desktops, disaster recovery, on-demand storage — and of course, big data.

Okay, enough history lesson.

We’re seeing similar patterns emerge in big data. But it’s harder to see what the application suite is before it happens. In 2014, we think we’ll be asking ourselves, what’s the Microsoft Office of Big Data? We can make some guesses:

  • Predicting the future
  • Deciding what people or things are related to other people or things
  • Helping to power augmented reality tools like Google Glass with smart context
  • Making recommendations by guessing what products will appeal to which customers
  • Optimizing bottlenecks in supply chains or processes
  • Identifying health risks or anomalies worthy of investigation

Companies like Wibidata are trying to figure this out — and getting backed by investors with deep pockets. Just as most of the interesting stories about operating systems were the apps that ran on them, and the stories about clouds are things like big data, so the good stories about big data are the “office suites” atop it. Put another way, we don’t know yet what big data is for, but I suspect that in 2014 we’ll start to find out.

Cultural barriers to data-driven change

Every time I talk with companies about data, they love the concept but fail on the execution. There are a number of reasons for this:

  • Incumbency. Yesterday’s leaders were those who could convince others to act in the absence of information. Tomorrow’s leaders are those who can ask the right questions. This means there is a lot of resistance from yesterday’s leaders (think Moneyball).
  • Lack of empowerment. I recently ate a meal in the Pittsburgh airport, and my bill came with a purple pen. I’m now wondering if I tipped differently because of that. What ink colour maximizes per-cover revenues in an airport restaurant? (Admittedly, I’m a bit obsessive.) But there’s no reason someone couldn’t run that experiment, and increase revenues. Are they empowered to do so? How would they capture the data? What would they deem a success? These are cultural and organizational questions that need to be tackled by the company if it is to become data-driven.
  • Risk aversion. Steve Blank says a startup is an organization designed to search for a scalable, repeatable business model. Here’s a corollary: a big company is one designed to perpetuate a scalable, repeatable business model. Change is not in its DNA — predictability is. Since the days of Daniel McCallum, organizational charts and processes fundamentally reinforce the current way of doing things. It often takes a crisis (such as German jet planes in World War Two or Netscape’s attack on Microsoft) to evoke a response (the Lockheed Martin Skunk Works or a free web browser).
  • Improper understanding. Correlation is not causality — there is a correlation between ice cream and drowning, but that doesn’t mean we should ban ice cream. Both are caused by summertime. We should hire more lifeguards (and stock up on ice cream!) in the summer. Yet many people don’t distinguish between correlation and causality. As a species, humans are wired to find patterns everywhere because a false positive (turning when we hear a rustle in the bushes, only to find there’s nothing there) is less dangerous than a false negative (not turning and getting eaten by a sabre-toothed tiger).
  • Focus on the wrong data. Lean Analytics urges founders to be more data-driven and less self-delusional. But when I recently spoke with executives from DHL’s innovation group, they said that innovation in a big company requires a wilful disregard for data. That’s because the preponderance of data in a big company reinforces the status quo; nascent, disruptive ideas don’t stand a chance. Big organizations have all the evidence they need to keep doing what they have always done.

There are plenty of other reasons why big organizations have a hard time embracing data. Companies like IBM, CGI, and Accenture are minting money trying to help incumbent organizations be the next Netflix and not the next Blockbuster.

What’s more, the advent of clouds, social media, and tools like PayPal or the App Store has destroyed many of the barriers to entry on which big companies rely. As Quentin Hardy pointed out in a recent article, fewer and fewer big firms stick around for the long haul.

Design patterns

As any conference matures, we move into best practices. With architecture, these manifest themselves as proven designs — snippets of recipes people can re-use. Just as a baker knows how to make icing from fat and sugar — and can adjust it to make myriad variations — so, too, can an architect use a particular architecture to build a known, working component or service.

As Mike Loukides points out, a design pattern is even more abstract than a recipe. It’s like saying, “sweet bread with topping,” which can then be instantiated in any number of different kinds of cake recipes. So, we have a design pattern for “highly available storage” and then rely on proven architectural recipes such as load-balancing, geographic redundancy, and eventual consistency to achieve it.

Such recipes are well understood in computing, and they eventually become standards and appliances. We have a “scale-out” architecture for web computing, where many cheap computers can handle a task, as an Application Delivery Controller (a load balancer) “sprays” traffic across those machines. It’s common wisdom today. But once, it was innovative. Same thing with password recovery mechanisms and hundreds of other building blocks.
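
The heart of that scale-out recipe fits in a few lines. Here is a sketch of round-robin "spraying" with hypothetical server addresses; a real ADC adds health checks, session stickiness, and much more.

// Round-robin "spraying" in miniature (hypothetical pool).
var servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"];
var next = 0;

function pickServer() {
  var server = servers[next];
  next = (next + 1) % servers.length; // rotate through the pool
  return server;
}

// Each incoming request lands on the next cheap machine in the pool.
for (var i = 0; i < 5; i++) { console.log(pickServer()); }
// 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1, 10.0.0.2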

We’ll see these building blocks emerge for data systems that meet specific needs. For example, a new technology called homomorphic encryption allows us to analyze data while it is still encrypted, without actually seeing the data. That would, for example, allow us to measure the spread of a disease without violating the privacy of the individual patients. (We had a presenter talk about this at DDBD in Santa Clara.) This will eventually become a vital ingredient in a recipe for “data where privacy is maintained.” There will be other recipes optimized for speed, or resiliency, or cost, all in service of the “highly available storage” pattern.
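
Real homomorphic encryption is heavy mathematics, but the flavor of "computing on data you cannot read" can be shown with a deliberately insecure toy: each patient blinds a count with a secret key, the analyst sums only blinded values, and a separate key holder unmasks nothing but the total. To be clear, this is simple additive masking, not homomorphic encryption, and all the numbers are invented.

// NOT real homomorphic encryption: an insecure toy showing the idea
// of adding values without ever seeing them in the clear.
var P = 1000003; // public modulus (toy-sized)
function mod(n) { return ((n % P) + P) % P; }

var counts = [3, 5, 2];             // each patient's inhaler count
var keys = [417211, 90382, 655019]; // secrets the analyst never sees
var blinded = counts.map(function (c, i) { return mod(c + keys[i]); });

// The analyst sums blinded values only...
var blindedTotal = mod(blinded.reduce(function (a, b) { return a + b; }, 0));

// ...and the key holder strips the combined keys, revealing just the sum.
var keyTotal = mod(keys.reduce(function (a, b) { return a + b; }, 0));
console.log(mod(blindedTotal - keyTotal)); // 10: the total, and nothing else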

This is how we move beyond vendors. Just as a scale-out web infrastructure can have an ADC from Radware, Citrix, F5, Riverbed, Cisco, and others (with the same pattern), we’ll see design patterns for big data with components that could come from Cloudera, Hortonworks, IBM, Intel, MapR, Oracle, Microsoft, Google, Amazon, Rackspace, Teradata, and hundreds of others.

Note that many vendors who want to sell “software suites” will hate this. Just as stereo vendors tried to sell all-in-one audio systems, which ultimately weren’t very good, many of today’s commercial providers want to sell turnkey systems that don’t allow the replacement of components. Design patterns and the architectures on which they rely are anathema to these closed systems — and are often where standards tracks emerge. 2014 is when that debate will get started in big data.

Laggards and Luddites

Certain industries are inherently risk-averse, or not technological. But that changes fast. A few years ago, I was helping a company called FarmsReach connect restaurants to local farmers and turn the public market into a supply chain hub. We spent a ton of effort building a fax gateway because farmers didn’t have mobile phones, and ultimately, the company pivoted to focus on building networks between farmers.

Today, however, farmers are adopting tech quickly, and they rely on things like GPS-based tractor routing and seed sowing (known as “Satellite Farming”) to get the most from their fields.

As the cost of big data drops and the ease of use increases, we’ll see it applied in many other places. Consider, for example, a city that can’t handle waste disposal. Traditionally, the city would buy more garbage trucks and hire more garbage collectors. But now, it can analyze routing and find places to optimize collection. Unfortunately, this requires increased tracking of workers — something the unions will resist very vocally. We already saw this in education, where efforts to track students were shut down by teachers’ unions.

In 2014, big data will be crossing the chasm, welcoming late adopters and critics to the conversation. It’ll mean broadening the scope of the discussion — and addressing newfound skepticism — at Strata.

Convergence of two databases

If you’re running a data-driven product today, you typically have two parallel systems.

  • One’s in production. If you’re an online retailer, this is where the shopping cart and its contents live, or where the user’s shipping address is stored.
  • The other’s used for analysis. An online retailer might make queries to find out what someone bought in order to handle a customer complaint or generate a report to see which products are selling best.

Analytical technology comes from companies like Teradata, IBM (from the Cognos acquisition), Oracle (from the Hyperion acquisition), SAP, and the independent MicroStrategy, among many others. They use words like “data warehouse” to describe these products, and they’ve been making them for decades. Data analysts work with them, running queries and sending reports to corporate bosses. A standalone analytical data warehouse is commonly accepted wisdom in enterprise IT.

But those data warehouses are getting faster and faster. Rather than running a report and getting it a day later, analysts can explore the data in real time — re-sorting it by some dimension, filtering it in some way, and drilling down. This is often called pivoting, and if you’ve ever used a Pivot Table in Excel you know what it’s like. In data warehouses, however, we’re dealing with millions of rows.
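
In the MongoDB shell, that kind of pivot looks something like the sketch below, over a hypothetical orders collection: filter, group by a dimension, re-sort by a measure, then drill into one slice.

// A pivot-style rollup over order rows (hypothetical data).
db.orders.aggregate([
  { $match: { status: "shipped" } },            // filter
  { $group: { _id: "$region",                   // pivot dimension
              revenue: { $sum: "$amount" } } },
  { $sort: { revenue: -1 } }                    // re-sort by the measure
]);

// Drill down: the same rollup for one region, broken out by month.
db.orders.aggregate([
  { $match: { status: "shipped", region: "EMEA" } },
  { $group: { _id: { $month: "$shipDate" }, revenue: { $sum: "$amount" } } }
]);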

At the same time, operational databases are getting faster and sneakier. Traditionally, a database is the bottleneck in an application because it doesn’t handle concurrency well. If a record is being changed in the database by one person, it’s locked so nobody else can touch it. If I am editing a Word document, it makes sense to lock it so someone else doesn’t edit it — after all, what would we do with the changes we’d both made?

But that model wouldn’t work for Facebook or Twitter. Imagine a world where, while you’re updating your status, none of your friends could refresh their feeds.

We’ve found ways to fix this. When several people edit a Google Doc at once, for instance, each of their changes is made as a series of small transactions. The document doesn’t really exist — instead, it’s a series of transactional updates, assembled to look like a document. Similarly, when you post something to Facebook, those changes eventually find their way to your friends. The same is true on Twitter or Google+.

These kinds of eventually consistent approaches make concurrent editing possible. They aren’t really new, either: your bank statement is eventually consistent, and when you check it online, the bottom of the statement tells you that the balance is only valid up until a period in the past and new transactions may take a while to post. Here’s what mine says:

Transactions from today are reflected in your balance, but may not be displayed on this page if you recently updated your bankbook, if a paper statement was issued, or if a transaction is backdated. These transactions will appear in your history the following business day.

Clearly, if eventual consistency is good enough for my bank account, it’s good enough for some forms of enterprise data.
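
A sketch of the bank-statement model in JavaScript: the balance is just a fold over an append-only log, and a read reflects only the transactions that have posted so far. The details are invented, but the shape is the same one Facebook feeds and Google Docs rely on.

// A balance as an append-only log; reads are eventually consistent.
var ledger = []; // entries: { amount, postedAt }

function post(amount, delayMs) {
  // New transactions become visible only after a settling delay.
  ledger.push({ amount: amount, postedAt: Date.now() + delayMs });
}

function balance() {
  var now = Date.now();
  return ledger
    .filter(function (tx) { return tx.postedAt <= now; }) // only posted entries
    .reduce(function (sum, tx) { return sum + tx.amount; }, 0);
}

post(100, 0);    // posts immediately
post(-40, 5000); // takes five seconds to appear
console.log(balance()); // 100 now...
setTimeout(function () { console.log(balance()); }, 6000); // ...60 later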

So, we have analytical databases getting real-time fast, and operational databases increasingly able to do things concurrently without affecting production systems. Which raises the question: why do we have two databases?

This is a massive, controversial issue worth billions of dollars. Take, for example, EMC, which recently merged its Greenplum acquisition into Pivotal. Pivotal’s marketing (“help customers build, deploy, scale, and analyze at an unprecedented velocity”) points at this convergence, which may happen as organizations move their applications into cloud environments (which is partly why Pivotal includes Cloud Foundry, which VMware acquired).

The change will probably create some huge industry consolidation in the coming years (think Oracle buying Teradata, then selling a unified operational/analytical database). There are plenty of reasons it’s a bad idea, and plenty of reasons why it’s a good one. We think this will be a hot topic in 2014.

Cassandra and the other stacks

Big data has been synonymous with Hadoop. The break-out success of the Hadoop ecosystem has been astonishing, but it does other stacks a disservice. There are plenty of other robust data architectures that have furiously enthusiastic tribes behind them. Cassandra, for example, was created by Facebook, released into the wild, and tamed by Reddit to allow the site to scale to millions of daily visitors atop Amazon with only a handful of employees. MongoDB is another great example, and there are dozens more.

Some of these stacks got wrapped around the axle of the NoSQL debate, which, as I mentioned, might have been better framed as NoJoin. But we’re past that now, and there are strong case studies for many of the stacks. There are also proven affinities between a particular stack (such as Cassandra) and a particular cloud (such as Amazon Web Services) because of their various heritages.

In 2014, we’ll be discussing more abstract topics and regarding every one of these stacks as a tool in a good toolbox.

Mobile data

By next year, there will be more mobile phones in the world than there are humans, over one billion of them “smart.” They are the closest thing we have to a tag for people. Whether measuring mall traffic from shoppers’ handsets or projecting the source of malaria outbreaks in Africa, mobile data is a big deal. One carrier recently released mobile data from the Ivory Coast to researchers.

Just as time-series data has structure, so does geographic data, much of which lives in Strata’s Connected World track. Mobile data is a precursor to the Internet of Everything, and it’s certainly one of the most prolific structured data sources in the world.

I think concentrating on mobility is critical for another reason, too. The large systems created to handle traffic for the nearly 1,000 carriers in the world are big, fast, and rock solid. An AT&T 5ESS switch, or one of the large-scale Operational Support Systems, simply does not fall over.

Other than DNS, the Internet doesn’t really have this kind of industrial-grade system for managing billions of devices, each of which can connect to the others with just a single address. That is astonishing scale, and we tend to ignore it as “plumbing.” In 2014, the control systems for the Internet of Everything are as likely to come from Big Iron made by Ericsson as they are to come from some Web 2.0 titan.

The analytic life-cycle

The book The Theory That Would Not Die begins with a quote from John Maynard Keynes: “When the facts change, I change my opinion. What do you do, sir?” As this New York Times review of the book observes, “If you are not thinking like a Bayesian, perhaps you should be.”

Bayes’ theorem says that beliefs must be updated based on new evidence — and in an information-saturated world, new evidence arrives constantly, which means the cycle turns quickly. To many readers, this is nothing more than explaining the scientific method. But there are plenty of people who weren’t weaned on experimentation and continuous learning — and even those with a background in science make dumb mistakes, as the Boy Or Girl Paradox handily demonstrates.
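
The paradox just mentioned yields to brute enumeration, which is the Bayesian move in miniature: list the possibilities, strike the ones the evidence rules out, and recount. A family has two children and you learn at least one is a boy; the chance both are boys is a third, not a half.

// The Boy or Girl paradox by enumeration.
var families = [["B", "B"], ["B", "G"], ["G", "B"], ["G", "G"]];

// Evidence: at least one child is a boy (rules out only G,G).
var atLeastOneBoy = families.filter(function (f) {
  return f.indexOf("B") !== -1;
});
var bothBoys = atLeastOneBoy.filter(function (f) {
  return f[0] === "B" && f[1] === "B";
});

console.log(bothBoys.length / atLeastOneBoy.length); // 0.333...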

Ben Lorica, O’Reilly’s chief scientist (and owner of the enviable Twitter handle @BigData) recently wrote about the lifecycle of data analysis. I wrote another piece on the Lean Analytics cycle with Avinash Kaushik a few months ago. In both cases, it’s an iterative process of hypothesis-forming, experimentation, data collection, and readjustment.

In 2014, we’ll be spending more time looking at the whole cycle of data analysis, including collection, storage, interpretation, and the practice of asking good questions informed by new evidence.

Data anthropology

Data seldom tells the whole story. After flooding in Haiti, mobile phone data suggested people weren’t leaving one affected area for a safe haven. Researchers concluded that they were all sick with cholera and couldn’t move. But by interviewing people on the ground, aid workers found out the real problem was that flooding had destroyed the roads, making it hard to leave.

As this example shows, there’s no substitute for context. In Lean Analytics, we say “Instincts are experiments. Data is proof.” For some reason this resonated hugely and is one of the most favorited/highlighted passages in the book. People want a blend of human and machine, of soft, squishy qualitative data alongside cold, hard quantitative data. We joke that in the early stages of a startup, your only metric should be “how many people have I spoken with?” It’s too early to start counting.

In Ash Maurya’s Running Lean, there’s a lot said about customer development. Learning how to conduct good interviews that don’t lead the witness, and measuring the cultural factors that can pollute data, is hugely difficult. In The Righteous Mind, Jonathan Haidt says all university research is WEIRD: Western, Educated, Industrialized, Rich, and Democratic. That’s because test subjects are most often undergraduates, who fit this bill. To prove his assertion, Haidt replicated studies done on campus at a McDonald’s a few miles away, with vastly different results.

At the first Strata New York, I actually left the main room one morning to go write a blog post. I was so overcome by the examples of data errors — from bad collection, to bad analysis, to wilfully ignoring the results of good data — that it seemed to me we weren’t paying attention to the right things. If “Data is the new Oil,” then its supply chain is a controversial XL pipeline with woefully few people looking for leaks and faults. Anthropology can fix this, tying quantitative assumptions to verification.

Nobody has championed data anthropology as much as O’Reilly’s own Roger Magoulas, who joined Jon Bruner and Jim Stogdill for a podcast on the subject recently.

So, data anthropology can ensure good data collection, provide essential context to data, and check that the resulting knowledge is producing the intended results. That’s why it’s on our list of hot topics for 2014.

Photo: Book Spiral – Seattle Central Library by brewbooks, on Flickr

June 09 2011

Four short links: 9 June 2011

  1. Optimizing MongoDB -- shorter field names, barely hundreds of ops/s when not in RAM, updates hold a lock while they fetch the original from disk ... it's a pretty grim story. (via Artur Bergman)
  2. Is There a New Geek Anti-Intellectualism? -- focus is absolutely necessary if we are to gain knowledge. We will be ignoramuses indeed, if we merely flow along with the digital current and do not take the time to read extended, difficult texts. (via Sacha Judd)
  3. Trend Data for Teens (Pew Internet and American Life Project) -- one in six American teens have used the Internet to look for information online about a health topic that’s hard to talk about, like drug use, sexual health, or depression.
  4. The Guts of Android (Linux Weekly News) -- technical but high-level explanation of the components of an Android system and how they compare to those of a typical Linux system.

May 05 2011

Four short links: 5 May 2011

  1. Why We Chose MongoDB for Guardian.co.uk (SlideShare) -- they're using MongoDB's flexible schema, as schema upgrades were pain in their previous system (Oracle). I think of these as the database equivalent of dynamic typing in languages like Perl and Ruby. (via Paul Rowe)
  2. Solving Problems with Visual Analytics -- This book is the result of a community effort of the partners of the VisMaster Coordinated Action funded by the European Union. The overarching aim of VisMaster was to create a research roadmap that outlines the current state of visual analytics across many disciplines, and to describe the next steps that have to be taken to foster a strong visual analytics community, thus enabling the development of advanced visual analytic applications. (via Mark Madsen)
  3. iOS-Couchbase (GitHub) -- a build of distributed key-value store CouchDB, which will keep your mobile data in sync with a remote store. No mean feat given CouchDB itself has Erlang as a dependency. (via Mike Olson)
  4. SimString -- A fast and simple algorithm for approximate string retrieval in C++ with Python and Ruby bindings, opensourced with modified BSD license. (via Matt Biddulph)

April 13 2011

What VMware's Cloud Foundry announcement is about

I chatted today about VMware's Cloud Foundry with Roger Bodamer, the EVP of products and technology at 10gen. 10gen's MongoDB is one of three back-ends (along with MySQL and Redis) supported from the start by Cloud Foundry.

If I understand Cloud Foundry and VMware's declared "Open PaaS" strategy, it should fill a gap in services. Suppose you are a developer who wants to loosen the bonds between your programs and the hardware they run on, for the sake of flexibility, fast ramp-up, or cost savings. Your choices are:

  • An IaaS (Infrastructure as a Service) product, which hands you an emulation of bare metal where you run an appliance (which you may need to build up yourself) combining an operating system, application, and related services such as DNS, firewall, and a database.

  • You can implement IaaS on your own hardware using a virtualization solution such as VMware's products, Azure, Eucalyptus, or RPM. Alternatively, you can rent space on a service such as Amazon's EC2 or Rackspace.

  • A PaaS (Platform as a Service) product, which operates at a much higher level: the vendor manages the operating system, runtime, and surrounding services, and you supply just the application code.

By now, the popular APIs for IaaS have been satisfactorily emulated so that you can move your application fairly easily from one vendor to another. Some APIs, notably OpenStack, were designed explicitly to eliminate the friction of moving an app and increase the competition in the IaaS space.

Until now, the PaaS situation was much more closed. VMware claims to do for PaaS what Eucalyptus and OpenStack want to do for IaaS. VMware has a conventional cloud service called Cloud Foundry, but will offer the code under an open source license. RightScale has already announced that you can use it to run a Cloud Foundry application on EC2. And a large site could run Cloud Foundry on its own hardware, just as it runs VMware.

Cloud Foundry is aggressively open middleware, offering a flexible way to administer applications with a variety of options on the top and bottom. As mentioned already, you can interact with MongoDB, MySQL, or Redis as your storage. (However, you have to use the particular API offered by each back-end; there is no common Cloud Foundry interface that can be translated to the chosen back end.) You can use Spring, Rails, or Node.js as your programming environment.

So open source Cloud Foundry may prove to be a step toward more openness in the cloud arena, as many people call for and I analyzed in a series of articles last year. VMware will, if the gamble pays off, gain more customers by hedging against lock-in and will sell its tools to those who host PaaS on their own servers. The success of the effort will depend on the robustness of the solution, ease of management, and the rate of adoption by programmers and sites.

December 01 2010

Strata Gems: Try MongoDB without installing anything

Welcome to our series of Strata Gems. We'll be publishing a new one each day all the way through to December 24.

Document store databases such as MongoDB and CouchDB offer a scalable way to store semi-structured data. If you're looking for an easy way to get started, without the pain of installing either a single instance or a cluster of database servers, MongoDB is a good choice.

MongoDB lets you store and query data expressed as JSON documents. It offers conventional indexing, replication and lets you perform map-reduce jobs on database contents. You can get started from the comfort of your browser by heading over to the MongoDB web site and using the online "Try It Out" interface.

The MongoDB shell is an interactive JavaScript interpreter.

> foo = 32;
> print(foo);
32
>

The MongoDB "Try It Out" browser-based shell


Here are a couple of sample JSON documents we'll store in Mongo.

> var fred = {name: "Fred Flintstone", age: 32};
> fred;
{
 "name" : "Fred Flintstone",
 "age" : 32
}

> var barney = {name: "Barney Rubble", age: 31};
> barney;
{
 "name" : "Barney Rubble",
 "age" : 31
}

Now we'll save them into a collection called "characters" - created automatically when first referenced - and query for all the documents in it.

> db.characters.save(fred);
"ok"
> db.characters.save(barney);
"ok"
> db.characters.find();
[ 
  {   "_id" : {   "$oid" : "4cf69bc3cc9374271b0137c0"   },   "name" : "Fred Flintstone",   "age" : 32   },
  {   "_id" : {   "$oid" : "4cf69bc7cc9374271b0137c1"   },   "name" : "Barney Rubble",   "age" : 31   }
]

Let's run a few more complex queries, finding the characters who have age equal to 31, and who have age greater than 30.

> db.characters.find({age: 31});

[
{ "_id" : { "$oid" : "4cf69bc7cc9374271b0137c1" }, "name" : "Barney Rubble", "age" : 31 }
]
> db.characters.find({age: {'$gt': 30}});

[
{ "_id" : { "$oid" : "4cf69bc3cc9374271b0137c0" }, "name" : "Fred Flintstone", "age" : 32 },
{ "_id" : { "$oid" : "4cf69bc7cc9374271b0137c1" }, "name" : "Barney Rubble", "age" : 31 }
]
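
Updates use the same query syntax as find. For instance, $set changes a field in place (the exact output in the browser shell may differ slightly):

> db.characters.update({name: "Fred Flintstone"}, {$set: {age: 33}});
> db.characters.find({name: "Fred Flintstone"});
[
{ "_id" : { "$oid" : "4cf69bc3cc9374271b0137c0" }, "name" : "Fred Flintstone", "age" : 33 }
]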

This is just a small taste of what you can do with MongoDB. To dive further, enter tutorial into the "Try It Out" interface and work through the steps, then download MongoDB and try it out on your own machine.

February 24 2010

NoSQL conference coming to Boston

On March 11 Boston will join several other cities that have hosted conferences on the movement broadly known as NoSQL. Cassandra, CouchDB, HBase, HypergraphDB, Hypertable, Memcached, MongoDB, Neo4j, Riak, SimpleDB, Voldemort, and probably other projects as well will be represented at the one-day affair.

It's generally understood that characterizing a movement by what it's not is awkward, and it's hard to find an elevator speech to encompass all the topics of NoSQL Boston. Are these tools for "big data" problems? Usually, but sometimes even small web sites can find them useful. Are the tools meant for processing streams such as log files? Sometimes, but they can be useful for other text and data processing as well. And do they reject relational principles? Well, so you'd think--but different ones reject different principles, so even there it's hard to find commonality. (I compared them to relational databases in a blog post last year.)

The interviews I had with various projects leaders for this article
turned up a recurring usage pattern for NoSQL. I was seeking
particular domains or types of data where the tools would be useful,
but couldn't see much commonality. What connects the users is that
they carry out web-related data crunching, searching, and other Web
2.0 related work. I think these companies use NoSQL tools because
they're the companies who understand leading-edge technologies and are
willing to take risks in those areas. As the field gets better known,
usage will spread.

I had a talk last week with conference organizer Eliot Horowitz, who
is the founder and CTO of 10gen, the company that makes MongoDB. He
let me know that the conference plans to bypass the head-scratching
and launch into practical applications. The day will contain a coding
session and a schema design session along with keynotes.

The resilience of open source

One question that intrigues me is why all the offerings in the NoSQL area are open source. Some have commercial add-ons, but the core technology is provided as free software. The few proprietary products and services in the market (such as Citrusleaf) get far less attention. Reasons seem to include:

  • The market is currently too small. Just as most computing innovations
    start off in research settings, this one is being explored by people
    looking for solutions to their own problems, more than ways to extract
    a profit. Numerous in-house projects exist in this space that are not
    free software (Google's Map/Reduce and BigTable, for instance, and
    Amazon's SimpleDB and Dynamo) but they aren't commercialized either.


  • Experimentation is moving too fast. Most of the projects are just a
    couple years old, and are rapidly adding features.


  • The ROI is hard to calculate. Horowitz says, "People won't pay for
    anything they don't really understand yet." (Nevertheless, 10gen and
    other companies are commercializing the open source offerings.)


  • Whatever problem an organization is trying to solve, each NoSQL
    offering tends to be a piece of the solution. It has to be tuned for and
    integrated into the organization's architecture, and combined with
    pieces from other places.

The projects in this conference therefore demonstrate the innovative
power of free software. CouchDB and Cassandra are particularly
interesting in this regard because they are community efforts more
than corporate efforts. Both are Apache top-level projects. (Cassandra
was just moved from the incubator to a top-level project on February
17.) CouchDB committer J. Chris Anderson tells me that the Apache
community process ensures a wide range of voices are heard, leading to
(of course) occasional public wrangling but a superior outcome.

The BBC and (according to Anderson) SXSW are among the users of CouchDB, CouchDB has been integrated into Ubuntu, Mozilla Messaging is basing Raindrop (their next-generation messaging platform) on CouchDB, and even mobile handset manufacturers are looking at it. (O'Reilly Media also uses CouchDB.)

I also talked to Alan Hoffman of Cloudant, which offers a CouchDB cloud service that fills in some of the gaps left by bare CouchDB (consistent hashing, partitioning, quorum, etc.). Although a couple companies offer commercial support, no single company takes responsibility for CouchDB. Its community is highly distributed. Anderson listed 10 Apache committers working for 8 different companies, and nearly 40 other people who contribute patches. Support takes place on mailing lists (roughly one thousand messages a month) and IRC channels.

Jonathan Ellis, project chair of Cassandra, calls it an "open source success story" because it went from a state of near petrification to vibrant regrowth through open sourcing. Facebook invented it and brought it to a state where it satisfied their needs. They open sourced it and moved it into the Apache Incubator in 2008, but declared that they would not be doing further development. It could easily have receded into obscurity.

Ellis says that he was hired at Rackspace and asked to find a distributed data store that was fast and scaled easily; he decided on Cassandra. Soon after he became a public and enthusiastic advocate, Digg and Twitter joined Rackspace as users and developers. Having multiple QA teams test each release--particularly in very different environments--helps quality immensely. Ellis finds that Eric Raymond's "many eyes" characterization of open source bug fixing applies.

Although Cassandra is found mostly as a backing store for web sites
with a lot of users, Ellis thinks it would meet the needs of many
academic and commercial sites, and looks forward to someone offering a
cloud service based on it.

Justin Sheehy, CTO of Basho, maker of
the Riak data store, told me they can confirm the typical advantages
cited for open source. Developers at potential customer sites can try
out the software without going through a bureaucratic procurement
process, and then become internal advocates who function much more
effectively than outside salespeople.

He also says that companies such as Basho offer the best of both
worlds to tentative customers. The backing of a corporation means that
professional services and added tools are available to go along with
the product those customers buy. But because the source is open and
has a community around it, those customers can feel secure that
development and support will continue regardless of the fate of the
originating company. 10gen, of course, plays a similar role for
MongoDB and Anderson's company Couchio
offers support for CouchDB. For projects that are not closely
associated with the backing of one company, the Apache Foundation's
sponsorship helps to ensure continuity.

What are the fault lines in the NoSQL landscape?

Naturally, the projects I've mentioned in this blog borrow ideas from
each other and show tiny variations on common solutions regarding such
things as B-tree storage, replication, solutions to locality of
reference, etc. Experience will eventually lead to a shake-out and a
convergence among surviving projects. In the meanwhile, how can you
get your head around them?

We'll pause here for a word from our sponsors, letting you know that O'Reilly has published books on CouchDB and Hadoop and is developing one about MongoDB.

Horowitz offers an initial subdivision of projects based on data model (document, key-value, or tabular), a theme he explored in another interview.
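
To make that taxonomy concrete, here is the same record sketched in each of the three shapes. This is illustrative JavaScript, not any particular product's API.

// Document model: one self-contained, possibly nested record.
var doc = { _id: "u42", name: "Ada", emails: ["ada@example.com"] };

// Key-value model: an opaque blob behind a key; the store knows
// nothing about what is inside the value.
var kv = {};
kv["user:u42"] = JSON.stringify(doc);

// Tabular model: fixed columns, one value per cell; nesting has to
// be flattened into extra columns or extra rows.
var table = {
  columns: ["id", "name", "email"],
  rows: [["u42", "Ada", "ada@example.com"]]
};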

Roger Magoulas, a research director with O'Reilly, further subdivides
projects into those that crunch large data sets in a batch
manner--such as Hadoop--and those that retrieve views of data to
fulfill visitor search requests on web pages or similar tasks. He goes
on to say that you can compare them on the basis of particular
features, such as automatic replication, auto-sharding or
partitioning, and in-memory caches.

The most comprehensive attempts I've seen to make sense of this gangly crew of projects from a feature standpoint come in a blog post by Ellis and another by Vineet Gupta. (Gupta's post is labeled "Part 1" and I'd love to see more parts.) But Sheehy says the various features of the offerings interact too strongly and have too many subtle variations to fit into an easy taxonomy. "Many people try to classify the projects, everyone does it differently, and nobody gets it quite right."

Community features

So who uses these things? To take Horowitz's MongoDB again as an
example, many web sites gravitate toward it because the document
structure makes some things--adding fields to rows, mapping objects to
fields--easier than a relational database does. A few scientific sites
also use MongoDB.
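
A sketch of what that flexibility looks like in the MongoDB shell, with a hypothetical collection: a new field appears on one document without any migration of the rest.

// Adding a field requires no ALTER TABLE; old and new documents coexist.
db.users.save({ _id: 1, name: "Ada" });
db.users.update({ _id: 1 }, { $set: { lastLogin: new Date() } });

// Only documents that actually have the field will match.
db.users.find({ lastLogin: { $exists: true } });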

Riak also has a large following among web sites and startups, but
their customers also include media companies, ad networks, SMS
gateways, analytics firms, and many other types of organizations.

Magoulas finds that an organization's bent is determined by the
background and expertise of its developers. Programmers with lots of
traditional relational database experience tend to be wary of the
recent upstarts, a position reinforced by legacy investments in tools
that depend on their relational database and are sometimes very
expensive.

On the other hand, web programmers look for tools that conform more closely to the data structures and programming techniques they're used to, and can actually be "flummoxed" by relational database logic or abstraction layers on top of the databases. These programmers may think it intuitive to do the kinds of filtering and sorting that seem like reinventing the wheel to a traditional RDBMS programmer. Anderson likes to quote Jacob Kaplan-Moss, a co-creator of Django, as saying, "Django may be built for the Web, but CouchDB is built of the Web. I've never seen software that so completely embraces the philosophies behind HTTP."

10gen's consultation with MongoDB users includes asking for votes on
new features. They also see a great deal of code contributions in the
driver layer and adapters (sessions, logging, etc.) but not much in
the core. Sheehy said the same is true of Riak: although contributions
to the core are rare, half the client libraries are developed by
outsiders, and many of the tools.

Rapid change is part of life for NoSQL developers. Anderson says of
CouchDB, "The ancillary APIs have been evolving rapidly in preparation
for our 1.0 release, which should come out in the next few months and
won't differ much from today's trunk. The new APIs include
authentication, authorization, details of Map/Reduce, and functions
for transforming and serving JSON documents as other datatypes such as
HTML or CSV." Horowitz stressed that MongoDB will roll out a lot of
new features over the upcoming year.

One hundred people have signed up for NoSQL Boston so far, and more
than 150 are expected. I'll be there to take it in and try to reduce
it to some high-level insights for this blog.
