
September 19 2013

Four short links: 20 September 2013

  1. Researchers Can Slip an Undetectable Trojan into Intel’s Ivy Bridge CPUs (Ars Technica) — The exploit works by severely reducing the amount of entropy the RNG normally uses, from 128 bits to 32 bits. The hack is similar to stacking a deck of cards during a game of Bridge. Keys generated with an altered chip would be so predictable an adversary could guess them with little time or effort required. The severely weakened RNG isn’t detected by any of the “Built-In Self-Tests” required for the SP 800-90 and FIPS 140-2 compliance certifications mandated by the National Institute of Standards and Technology.
  2. rethinkdb: open-source distributed JSON document database with a pleasant and powerful query language.
  3. Teach Kids Programming — a collection of resources. I start on Scratch much sooner, and 12+ definitely need the Arduino, but generally I agree with the things I recognise, and have a few to research …
  4. Raspberry Pi as Ad-Blocking Access Point (AdaFruit) — functionality sadly lacking from my off-the-shelf AP.

September 18 2013

Four short links: 18 September 2013

  1. No Managers: If we could find a way to replace the function of the managers and focus everyone on actually producing for our Students (customers) then it would actually be possible to be a #NoManager company. In my future posts I’ll explain how we’re doing this at Treehouse.
  2. The 20 Smartest Things Jeff Bezos Has Ever Said (Motley Fool) — I feel like the 219th smartest thing Jeff Bezos has ever said is still smarter than the smartest thing most business commentators will ever say. (He says, self-referentially) “Invention requires a long-term willingness to be misunderstood.”
  3. Putting Time in Perspective — nifty representations of relative timescales and history. (via BoingBoing)
  4. Sophia — BSD-licensed small C library implementing an embeddable key-value database “for a high-load environment”.

September 08 2013

March 01 2013

Four short links: 1 March 2013

  1. Drone Journalism: two universities in the US have already incorporated drone use in their journalism programs. The Drone Journalism Lab at the University of Nebraska and the Missouri Drone Journalism Program at the University of Missouri both teach journalism students how to make the most of what drones have to offer when reporting a story. They also teach students how to fly drones, the Federal Aviation Administration (FAA) regulations and ethics.
  2. passivedns: A network sniffer that logs all DNS server replies for use in a passive DNS setup.
  3. IFLA E-Lending Background Paper (PDF) — The global dominance of English language eBook title availability reinforced by eReader availability is starkly evident in the statistics on titles available by country: in the USA: 1,000,000; UK: 400,000; Germany/France: 80,000 each; Japan: 50,000; Australia: 35,000; Italy: 20,000; Spain: 15,000; Brazil: 6,000. Many more stats in this paper prepared as context for the International Federation of Library Associations.
  4. The god Architecture: a scalable, performant, persistent, in-memory data structure server. It allows massively distributed applications to update and fetch common data in a structured and sorted format. Its main inspirations are Redis and Chord/DHash. Like Redis it focuses on performance, ease of use and a small, simple yet powerful feature set, while from the Chord/DHash projects it inherits scalability, redundancy, and transparent failover behaviour.

June 15 2012

Four short links: 15 June 2012

  1. In Flawed, Epic Anonymous Book, the Abyss Gazes Back (Wired) -- Quinn Norton's review of a book about Anonymous is an excellent introduction to Anonymous. Anonymous made us, its mediafags, masters of hedging language. The bombastic claims and hyperbolic declarations must be reported from their mouths, not from our publications. And yet still we make mistakes and publish lies and assumptions that slip through. There is some of this in all of journalism, but in a world where nothing is true and everything is permitted, it’s a constant existential slog. It’s why there’s not many of us on this beat.
  2. Titan (GitHub) -- Apache2-licensed distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster. Cassandra and HBase backends, implements the Blueprints graph API. (via Hacker News)
  3. Extra Second This June -- we're getting a leap second this year: there'll be 2012 June 30, 23h 59m 60s. Calendars are fun.
  4. On Creativity (Beta Knowledge) -- I wanted to create a game where even the developers couldn’t see what was coming. Of course I wasn’t thinking about debugging at this point. The people who did the debugging asked me what was a bug. I could not answer that. — Keita Takahashi, game designer (Katamari Damacy, Noby Noby Boy). Awesome quote.

April 14 2012

MySQL in 2012: Report from Percona Live

The big annual MySQL conference, started by MySQL AB in 2003 and run
by my company O'Reilly for several years, lives on under the able
management of Percona. This
fast-growing company started out doing consulting on MySQL,
particularly in the area of performance, and branched out into
development and many other activities. The principals of this company
wrote the most recent two editions of the popular O'Reilly book High
Performance MySQL.

Percona started offering conferences a couple years ago and decided to
step in when O'Reilly decided not to run the annual MySQL conference
any more. Oracle did not participate in Percona Live, but has
announced its own MySQL
conference for next September.

Percona Live struck me as a success, with about one thousand attendees
and the participation of leading experts from all over the MySQL
world, save for Oracle itself. The big players in the MySQL user
community came out in force: Facebook, HP, Google, Pinterest (the
current darling of the financial crowd), and so on.

The conference followed the pattern laid down by old ones in just
about every way, with the same venue (the Santa Clara Convention
Center, which is near a light-rail but nothing else of interest), the
same food (scrumptious), the standard format of one day of tutorials
and two days of sessions (although with an extra developer day tacked
on, which I will describe later), an expo hall (smaller than before,
but with key participants in the ecosystem), and even community awards
(O'Reilly Media won an award as Corporate Contributor of the Year).
Monty Widenius was back as always with a MariaDB entourage, so it
seemed like old times. The keynotes seemed less well attended than the
ones from previous conferences, but the crowd was persistent and
showed up in impressive numbers for the final events--and I don't
believe it was because everybody thought they might win one of the
door prizes.

Jeremy Zawodny ready to hand out awards.

Two contrasting database deployments

I checked out two well-attended talks by system architects from two
high-traffic sites: Pinterest and craigslist. The radically divergent
paths they took illustrate the range of options open to data centers
nowadays--and the importance of studying these options so a data
center can choose the path appropriate to its mission and

Jeremy Zawodny (co-author of the first edition of High Performance
MySQL) presented
the design of craigslist's site, which illustrates the model of
software accretion over time and an eager embrace of heterogeneity.
Among their components are:

  • Memcache, lying between the web servers and the MySQL database in
    classic fashion.

  • MySQL to serve live postings, handle abuse, data for monitoring
    system, and other immediate needs.

  • MongoDB to store almost 3 billion items related to archived (no longer
    live) postings.

  • HAproxy to direct requests to the proper MySQL server in a cluster.

  • Sphinx for text searches, with
    indexes over all live postings, archived postings, and forums.

  • Redis for temporary items such as counters and blobs.

  • An XFS filesystem for images.

  • Other helper functions that Zawodny lumped together as "async

Care and feeding of this menagerie becomes a job all in itself.
Although craigslist hires enough developers to assign them to
different areas of expertise, they have also built an object layer
that understands MySQL, cache, Sphinx, MongoDB. The original purpose
of this layer was to aid in migrating old data from MySQL to MongoDB
(a procedure Zawodny admitted was painful and time-consuming) but it
was extended into a useful framework that most developers can use
every day.

Zawodny praised MySQL's durability and its method of replication. But
he admitted that they used MySQL also because it was present when they
started and they were familiar with it. So adopting the newer entrants
into the data store arena was by no means done haphazardly or to try
out cool new tools. Each one precisely meets particular needs of the

For instance, besides being fast and offering built-in sharding,
MongoDB was appealing because they don't have to run ALTER TABLE every
time they add a new field to the database. Old entries coexist happily
with newer ones that have different fields. Zawodny also likes using a
Perl client to interact with a database, and the Perl client provided
by MongoDB is unusually robust because it was developed by 10gen
directly, in contrast to many other datastores where Perl was added by
some random volunteer.

The architecture at craigslist was shrewdly chosen to match their
needs. For instance, because most visitors click on the limited set
of current listings, the Memcache layer handles the vast majority of
hits and the MySQL database has a relatively light load.

However, the MySQL deployment is also carefully designed. Clusters are
vertically partitioned in two nested ways. First, different types of
items are stored on separate partitions. Then, within each type, the
nodes are further divided by the type of query:

  • A single master to handle all writes.

  • A group for very fast reads (such as lookups on a primary key)

  • A group for "long reads" taking a few seconds

  • A special node called a "thrash handler" for rare, very complex
    queries.

It's up to the application to indicate what kind of query it is
issuing, and HAproxy interprets this information to direct the query
to the proper set of nodes.
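
The routing idea can be sketched in a few lines. This is only an illustration, not craigslist's code: the pool names, the host names, and the `route` function are all hypothetical, and in the deployment Zawodny described it is HAproxy, not application code, that dispatches the request based on the hint the application supplies.

```python
import itertools

# Hypothetical node pools for one item type; host names are made up.
POOLS = {
    "write": ["db-master"],                      # single master for all writes
    "fast_read": ["db-fast-1", "db-fast-2"],     # primary-key lookups
    "long_read": ["db-long-1", "db-long-2"],     # reads taking a few seconds
    "thrash": ["db-thrash"],                     # rare, very complex work
}

# One round-robin iterator per pool.
_cursors = {name: itertools.cycle(hosts) for name, hosts in POOLS.items()}


def route(query_class):
    """Return the next host for the query class the application declared."""
    if query_class not in _cursors:
        raise ValueError(f"unknown query class: {query_class}")
    return next(_cursors[query_class])


if __name__ == "__main__":
    for query_class in ("write", "fast_read", "fast_read", "long_read"):
        print(query_class, "->", route(query_class))
```

The point of the design is that slow queries can never starve the fast-read group, because the two kinds of traffic land on different machines.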

Naturally, redundancy is built in at every stage (three HAproxy
instances used in round robin, two Memcache instances holding the same
data, two data centers for the MongoDB archive, etc.).

It's also interesting what recent developments have been eschewed by
craigslist. They self-host everything and use no virtualization.
Zawodny admits this leads to an inefficient use of hardware, but
avoids the overhead associated with virtualization. For efficiency,
they have switched to SSDs, allowing them to scale down from 20
servers to only 3. They don't use a CDN, finding that with aggressive
caching and good capacity planning they can handle the load
themselves. They send backups and logs to a SAN.

Let's turn now from the teeming environment of craigslist to the
decidedly lean operation of Pinterest, a much younger and smaller
organization. As presented
by Marty Weiner and Yashh Nelapati, when they started web-scale
growth in the autumn of 2011, they reacted somewhat like craigslist,
but with much less thinking ahead, throwing in all sorts of software
such as Cassandra and MongoDB, and diversifying a bit recklessly.
Finally they came to their senses and went on a design diet. Their
resolution was to focus on MySQL--but the way they made it work is
unique to their data and application.

They decided against using a cluster, afraid that bad application code
could crash everything. Sharding is much simpler and doesn't require
much maintenance. Their advice for implementing MySQL sharding:

  • Make sure you have a stable schema, and don't add features for a
    couple months.

  • Remove all joins and complex queries for a while.

  • Do simple shards first, such as moving a huge table into its own

They use Pyres, a
Python clone of Resque, to move data into shards.

However, sharding imposes severe constraints that led them to
hand-crafted work-arounds.

Many sites want to leave open the possibility for moving data between
shards. This is useful, for instance, if they shard along some
dimension such as age or country, and they suddenly experience a rush
of new people in their 60s or from China. The implementation of such a
plan requires a good deal of coding, described in the O'Reilly book MySQL High
Availability, including the creation of a service that just
accepts IDs and determines what shard currently contains the ID.

The Pinterest staff decided the ID service would introduce a single
point of failure, and decided just to hard-code a shard ID in every ID
assigned to a row. This means they never move data between shards,
although shards can be moved bodily to new nodes. I think this works
for Pinterest because they shard on arbitrary IDs and don't have a
need to rebalance shards.
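
Here is a minimal sketch of what hard-coding a shard ID into a row ID can look like. The bit layout (48 low bits for a per-shard local ID, the remaining high bits for the shard), the function names, and the host name are assumptions chosen for illustration; the talk as reported here does not give Pinterest's actual format.

```python
# Assumed layout (illustrative only): the high bits hold the shard ID,
# the low 48 bits hold an ID that is unique within that shard.
LOCAL_BITS = 48
LOCAL_MASK = (1 << LOCAL_BITS) - 1

# Moving a whole shard to a new node only changes this map;
# the row IDs themselves never change. Host name is hypothetical.
SHARD_HOSTS = {7: "db-node-a"}


def make_id(shard_id, local_id):
    """Pack a shard ID and a per-shard local ID into one integer."""
    if shard_id < 0:
        raise ValueError("shard_id must be non-negative")
    if not 0 <= local_id <= LOCAL_MASK:
        raise ValueError("local_id out of range")
    return (shard_id << LOCAL_BITS) | local_id


def shard_of(row_id):
    """Recover the shard ID from a row ID without any lookup service."""
    return row_id >> LOCAL_BITS


def local_of(row_id):
    """Recover the per-shard local ID."""
    return row_id & LOCAL_MASK


def host_for(row_id):
    """Find the node currently holding the shard this row lives on."""
    return SHARD_HOSTS[shard_of(row_id)]


if __name__ == "__main__":
    row_id = make_id(7, 12345)
    print(row_id, shard_of(row_id), local_of(row_id), host_for(row_id))
```

Because the shard is derived from the ID by arithmetic alone, there is no ID service to fail, which is the trade-off described above: cheap lookups in exchange for never moving an individual row.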

Even more interesting is how they avoid joins. Suppose they want to
retrieve all pins associated with a certain board associated with a
certain user. In classical, normalized relational database practice,
they'd have to do a join on the board, pin, and user tables. But
Pinterest maintains extra mapping tables. One table maps users to
boards, while another maps boards to pins. They query the
user-to-board table to get the right board, query the board-to-pin
table to get the right pin, and then do simple queries without joins
on the tables with the real data. In a way, they implement a custom
NoSQL model on top of a relational database.

Pinterest does use Memcache and Redis in addition to MySQL. As with
craigslist, they find that most queries can be handled by Memcache.
And the actual images are stored in S3, an interesting choice for a
site that is already enormous.

It seems to me that the data and application design behind Pinterest
would have made it a good candidate for a non-ACID datastore. They
chose to stick with MySQL, but like organizations that use NoSQL
solutions, they relinquished key aspects of the relational way of
doing things. They made calculated trade-offs that worked for their
particular needs.

My take-away from these two fascinating and well-attended talks was
that you must understand your application, its scaling and
performance needs, and its data structure, to know what you can
sacrifice and what solution gives you your sweet spot. craigslist
solved its problem through the very precise application of different
tools, each with particular jobs that fulfilled craigslist's
requirements. Pinterest made its own calculations and found an
entirely different solution depending on some clever hand-coding
instead of off-the-shelf tools.

Current and future MySQL

The conference keynotes surveyed the state of MySQL and some
predictions about where it will go.

Conference co-chair Sarah Novotny at keynotes.

The world of MySQL is much more complicated than it was a couple years
ago, before Percona got heavily into the work of releasing patches to
InnoDB, before they created entirely new pieces of software, and
before Monty started MariaDB with the express goal of making a better
MySQL than MySQL. You can now choose among Oracle's official MySQL
releases, Percona's supported version, and MariaDB's supported
version. Because these are all open source, a major user such as
Facebook can even apply patches to get the newest features.

Nor are these different versions true forks, because Percona and
MariaDB create their enhancements as patches that they pass back to
Oracle, and Oracle is happy to include many of them in a later
release. I haven't even touched on the commercial ecosystem around
MySQL, which I'll look at later in this article.

In his keynote, Percona founder Peter Zaitsev praised the latest MySQL
release by Oracle. With graceful balance he expressed pleasure that
the features most users need are in the open (community) edition, but
allowed that the proprietary extensions are useful too. In short, he
declared that MySQL is less buggy and has more features than ever.

The former CEO of MySQL AB, Mårten Mickos, also found that MySQL is
doing well under Oracle's wing. He just chastised Oracle for failing
to work as well as it should with potential partners (by which I
assume he meant Percona and MariaDB). He lauded their community
managers but said the rest of the company should support them more.

Keynote by Mårten Mickos.

Brian Aker presented an OpenStack MySQL service developed by his current
employer, Hewlett-Packard. His keynote retold the story that had led
over the years to his developing Drizzle (a true fork of MySQL
that tries to return it to its lightweight, Web-friendly roots) and
eventually working on cloud computing for HP. He described modularity,
effective use of multiple cores, and cloud deployment as the future of

A panel on the second day of the conference brought together high-level
managers from many of the companies that have entered the MySQL space
from a variety of directions in a high-level discussion of the
database engine's future. Like most panels, the conversation ranged
over a variety of topics--NoSQL, modular architecture, cloud
computing--but hit some depth only on the topic of security, which was
not represented very strongly at the conference and was discussed here
at the insistence of Slavik Markovich from McAfee.

Keynote by Brian Aker.

Many of the conference sessions disappointed me, being either very
high level (although presumably useful to people who are really new to
various topics, such as Hadoop or flash memory) or unvarnished
marketing pitches. I may have judged the latter too harshly though,
because a decent number of attendees came, and stayed to the end, and
crowded around the speakers for information.

Two talks, though, were so fast-paced and loaded with detail that I
couldn't possibly keep my typing up with the speaker.

One such talk was the keynote
by Mark Callaghan of Facebook. (Like the other keynotes, it should be
posted online soon.) A smattering of points from it:

  • Percona and MariaDB are adding critical features that make replication
    and InnoDB work better.

  • When a logical backup runs, it is responsible for 50% of IOPS.

  • Defragmenting InnoDB improves compression.

  • Resharding is not worthwhile for a large, busy site (an insight also
    discovered by Pinterest, as I reported earlier)

The other fact-filled talk was by
Yoshinori Matsunobu of Facebook, and concerned how to achieve
NoSQL-like speeds while sticking with MySQL and InnoDB. Much of the
talk discussed an InnoDB memcached plugin, which unfortunately is
still in the "lab" or "pre-alpha" stage. But he also suggested some
other ways to better performance, some involving Memcache and others
more round-about:

  • Coding directly with the storage engine API, which is storage-engine

  • Using HandlerSocket, which queues write requests and performs them
    through a single thread, avoiding costly fsync() calls. This can
    achieve 30,000 writes per second, robustly.
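
One general way a single writer thread cuts down on fsync() calls is to drain everything that has queued up and sync once per batch rather than once per write. The sketch below shows only that general idea; it is not HandlerSocket's code, and the `BatchedWriter` class and its method names are hypothetical.

```python
import os
import queue
import tempfile


class BatchedWriter:
    """Drain queued writes in one pass and fsync once per batch."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self._pending = queue.Queue()
        self.sync_count = 0  # how many times fsync() was actually called

    def submit(self, record):
        """Called by request handlers; only enqueues, never touches the file."""
        self._pending.put(record)

    def flush_pending(self):
        """Write everything queued so far, then fsync a single time."""
        written = 0
        while True:
            try:
                record = self._pending.get_nowait()
            except queue.Empty:
                break
            self._file.write(record + b"\n")
            written += 1
        if written:
            self._file.flush()
            os.fsync(self._file.fileno())
            self.sync_count += 1
        return written

    def close(self):
        self.flush_pending()
        self._file.close()


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "writes.log")
    writer = BatchedWriter(log_path)
    for number in range(1000):
        writer.submit(b"row %d" % number)
    print("records written:", writer.flush_pending())
    print("fsync calls:", writer.sync_count)
    writer.close()
```

In a real server a dedicated thread would call `flush_pending()` in a loop, so the busier the system gets, the more writes share each sync.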

Matsunobu claimed that many optimizations are available within MySQL
because a lot of data can fit in main memory. For instance, if you
have 10 million users and store 400 bytes per user, the entire user
table can fit in 20 GB. Matsunobu's tests have shown that most CPU time
in MySQL is spent in functions that are not essential for processing
data, such as opening and closing a table. Each statement opens a
separate connection, which in turn requires opening and closing the
table again. Furthermore, a lot of data is sent over the wire besides
the specific fields requested by the client. The solutions in the talk
evade all this overhead.
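
A quick back-of-the-envelope check of the sizing example: the raw row data alone comes to 4 GB, so a 20 GB budget leaves considerable headroom. Reading that headroom as room for indexes and storage overhead is an assumption; the breakdown behind the 20 GB figure is not given here.

```python
users = 10_000_000       # 10 million users, from the example above
bytes_per_user = 400     # 400 bytes stored per user

raw_bytes = users * bytes_per_user
raw_gb = raw_bytes / 1e9  # decimal gigabytes

print(f"raw row data: {raw_gb} GB")
print(f"fits in 20 GB: {raw_gb < 20}")
```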

The commercial ecosystem

Both as vendors and as sponsors, a number of companies have always
lent another dimension to the MySQL conference. Some of these really
have nothing to do with MySQL, but offer drop-in replacements for it.
Others really find a niche for MySQL users. Here are a few that I
happened to talk to:

  • Clustrix provides a very
    different architecture for relational data. They handle sharding
    automatically, permitting such success stories as the massive scaling
    up of the social media site Massive Media NV without extra
    administrative work. Clustrix also claims to be more efficient by
    breaking queries into fragments (such as the WHERE clauses of joins)
    and executing them on different nodes, passing around only the data
    produced by each clause.

  • Akiban also offers faster
    execution through a radically different organization of data. They
    flatten the tables of a normalized database into a single
    data structure: for instance, a customer and his orders may be located
    sequentially in memory. This seems to me an import of the document
    store model into the relational model. Creating, in effect, an object
    that maps pretty closely to the objects used in the application
    program, Akiban allows common queries to be executed very quickly, and
    could be deployed as an adjunct to a MySQL database.

  • Tokutek produced a drop-in
    replacement for InnoDB. The founders developed a new data structure
    called a fractal tree as a faster alternative to the B-tree structures
    normally used for indexes. The existence of Tokutek vindicates both
    the open source distribution of MySQL and its unique modular design,
    because these allowed Tokutek's founders to do what they do
    best--create a new storage engine--without needing to create a whole
    database engine with the related tools and interfaces it would

  • Nimbus Data Systems creates a
    flash-based hardware appliance that can serve as a NAS or SAN to
    support MySQL. They support a large number of standard data transfer
    protocols, such as InfiniBand, and provide such optimizations as
    caching writes in DRAM and making sure they write complete 64KB blocks
    to flash, thus speeding up transfers as well as preserving the life of
    the flash.

Post-conference events

A low-key developer's day followed Percona Live on Friday. I talked to
people in the Drizzle and
Sphinx tracks.

Because Drizzle is a relatively young project, its talks were aimed mostly at
developers interested in contributing. I heard talks about their
kewpie test framework and about build and release conventions. But in
keeping with its goal to make database use easy and light-weight, the
project has added some cool features.

Thanks to a
and a built-in web server, Drizzle now presents you with a
Web interface for entering SQL commands. The Web interface translates
Drizzle's output to simple HTML tables for display, but you can also
capture the JSON directly, making programmatic access to Drizzle
easier. A developer explained to me that you can also store JSON
directly in Drizzle; it is simply stored as a single text column and
the JSON fields can be queried directly. This reminded me of an XQuery
interface added to some database years ago. There too, the XML was
simply stored as a text field and a new interface was added to run the
XQuery selects.

Sphinx, in contrast to Drizzle, is a mature product with commercial
support and (as mentioned earlier in the article) production
deployments at places such as craigslist, as well as an O'Reilly
book. I understood better, after attending today's sessions, what
makes Sphinx appealing. Its quality is unusually high, due to the use
of sophisticated ranking algorithms from the research literature. The
team is looking at recent research to incorporate even better
algorithms. It is also fast and scales well. Finally, integration with
MySQL is very clean, so it's easy to issue queries to Sphinx and pick
up results.

Recent enhancements include an add-on called fSphinx
to make faceted searches faster (through caching) and easier, and
access to Bayesian Sets to find "items similar to this one." In Sphinx
itself, the team is working to add high availability, include a new
morphology (stemming, etc.) engine that handles German, improve
compression, and make other enhancements.

The day ended with a reception and copious glasses of Monty Widenius's
notorious licorice-flavored vodka, an ending that distinguishes the
MySQL conference from others for all time.

April 03 2012

Data's next steps

Steve O'Grady (@sogrady), a developer-focused analyst from RedMonk, views large-scale data collection and aggregation as a problem that has largely been solved. The tools and techniques required for the Googles and Facebooks of the world to handle what he calls "datasets of extraordinary sizes" have matured. In O'Grady's analysis, what hasn't matured are methods for teasing meaning out of this data that are accessible to "ordinary users."

Among the other highlights from our interview:

  • O'Grady on the challenge of big data: "Kevin Weil (@kevinweil) from Twitter put it pretty well, saying that it's hard to ask the right question. One of the implications of that statement is that even if we had perfect access to perfect data, it's very difficult to determine what you would want to ask, how you would want to ask it. More importantly, once you get that answer, what are the questions that derive from that?"
  • O'Grady on the scarcity of data scientists: "The difficulty for basically every business on the planet is that there just aren't many of these people. This is, at present anyhow, a relatively rare skill set and therefore one that the market tends to place a pretty hefty premium on."
  • O'Grady on the reasons for using NoSQL: "If you are going down the NoSQL route for the sake of going down the NoSQL route, that's the wrong way to do things. You're likely to end up with a solution that may not even improve things. It may actively harm your production process moving forward because you didn't implement it for the right reasons in the first place."

The full interview is embedded below and available Here. For the entire interview transcript, click here.

February 10 2012

Top stories: February 6-10, 2012

Here's a look at the top stories published across O'Reilly sites this week.

The NoSQL movement
A relational database is no longer the default choice. Mike Loukides charts the rise of the NoSQL movement and explains how to choose the right database for your application.

Jury to Eolas: Nobody owns the interactive web
A Texas jury has struck down a company's claim to ownership of the interactive web. Eolas, which has been suing technology companies for more than a decade, now faces the prospect of losing the patents.

It's time for a unified ebook format and the end of DRM
The music industry has shown that you need to offer consumers a universal format and content without rights restrictions. So when will publishers pay attention?

Business-government ties complicate cyber security
Is an attack on a U.S. business' network an attack on the U.S. itself? "Inside Cyber Warfare" author Jeffrey Carr discusses the intermingling of corporate and government interests in this interview.

Unstructured data is worth the effort when you've got the right tools
Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.

Photo used with "Unstructured data" story: mess with graphviz.

February 08 2012

The NoSQL movement

In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement, rather than a technology. This description immediately felt right; I've never been comfortable talking about NoSQL, which when taken literally, extends from the minimalist Berkeley DB (commercialized as Sleepycat, now owned by Oracle) to the big iron HBase, with detours into software as fundamentally different as Neo4J (a graph database) and FluidDB (which defies description).

But what does it mean to say that NoSQL is a movement rather than a technology? We certainly don't see picketers outside Oracle's headquarters. Justin said succinctly that NoSQL is a movement for choice in database architecture. There is no single overarching technical theme; a single technology would belie the principles of the movement.

Think of the last 15 years of software development. We've gotten very good at building large, database-backed applications. Many of them are web applications, but even more of them aren't. "Software architect" is a valid job description; it's a position to which many aspire. But what do software architects do? They specify the high-level design of applications: the front end, the APIs, the middleware, the business logic — the back end? Well, maybe not.

Since the '80s, the dominant back end of business systems has been a relational database, whether Oracle, SQL Server or DB2. That's not much of an architectural choice. Those are all great products, but they're essentially similar, as are all the other relational databases. And it's remarkable that we've explored many architectural variations in the design of clients, front ends, and middleware, on a multitude of platforms and frameworks, but haven't until recently questioned the architecture of the back end. Relational databases have been a given.

Many things have changed since the advent of relational databases:

  • We're dealing with much more data. Although advances in storage capacity and CPU speed have allowed the databases to keep pace, we're in a new era where size itself is an important part of the problem, and any significant database needs to be distributed.
  • We require sub-second responses to queries. In the '80s, most
    database queries could run overnight as batch jobs. That's no
    longer acceptable. While some analytic functions can still run as
    overnight batch jobs, we've seen the web evolve from static files
    to complex database-backed sites, and that requires sub-second
    response times for most queries.
  • We want applications to be up 24/7. Setting up redundant
    servers for static HTML files is easy, but database replication
    in a complex database-backed application is another matter.
  • We're seeing many applications in which the database has to
    soak up data as fast as (or even much faster than) it processes
    queries: in a logging application, or a distributed sensor
    application, writes can be much more frequent than reads.
    Batch-oriented ETL (extract, transform, and load) hasn't
    disappeared, and won't, but capturing high-speed data flows is
    increasingly important.
  • We're frequently dealing with changing data or with
    unstructured data. The data we collect, and how we use it, grows
    over time in unpredictable ways. Unstructured data isn't a
    particularly new feature of the data landscape, since unstructured
    data has always existed, but we're increasingly unwilling to force
    a structure on data a priori.
  • We're willing to sacrifice our sacred cows. We know that
    consistency and isolation and other properties are very valuable,
    of course. But so are some other things, like latency and
    availability and not losing data even if our primary server goes
    down. The challenges of modern applications make us realize that
    sometimes we might need to weaken one of these constraints in order
    to achieve another.

These changing requirements lead us to different tradeoffs and compromises when designing software. They require us to rethink what we require of a database, and to come up with answers aside from the relational databases that have served us well over the years. So let's look at these requirements in somewhat more detail.


Size, response, availability

It's a given that any modern application is going to be distributed. The size of modern datasets is only one reason for distribution, and not the most important. Modern applications (particularly web applications) have many concurrent users who demand reasonably snappy response. In their 2009 Velocity Conference talk, Performance Related Changes and their User Impact, Eric Schurman and Jake Brutlag showed results from independent research projects at Google and Microsoft. Both projects demonstrated that imperceptibly small increases in response time cause users to move to another site; if response time is over a second, you're losing a very measurable percentage of your traffic.

If you're not building a web application — say you're doing business analytics, with complex, time-consuming queries — the world has changed, and users now expect business analytics to run in something like real time. Maybe not the sub-second latency required for web users, but queries that run overnight are no longer acceptable. Queries that run while you go out for coffee are marginal. It's not just a matter of convenience; the ability to run dozens or hundreds of queries per day changes the nature of the work you do. You can be more experimental: you can follow through on hunches and hints based on earlier queries. That kind of spontaneity was impossible when research went through the DBA at the data warehouse.

Whether you're building a customer-facing application or doing internal analytics, scalability is a big issue. Vertical scalability (buy a bigger, faster machine) always runs into limits. Now that the laws of physics have stalled Intel-architecture clock speeds in the 3.5GHz range, those limits are more apparent than ever. Horizontal scalability (build a distributed system with more nodes) is the only way to scale indefinitely. You're scaling horizontally even if you're only buying single boxes: it's been a long time since I've seen a server (or even a high-end desktop) that doesn't sport at least four cores. Horizontal scalability is tougher when you're scaling across racks of servers at a colocation facility, but don't be deceived: that's how scalability works in the 21st century, even on your laptop. Even in your cell phone. We need database technologies that aren't just fast on single servers: they must also scale across multiple servers.

Modern applications also need to be highly available. That goes without saying, but think about how the meaning of "availability" has changed over the years. Not much more than a decade ago, a web application would have a single HTTP server that handed out static files. These applications might be data-driven; but "data driven" meant that a batch job rebuilt the web site overnight, and user transactions were queued into a batch processing system, again for processing overnight. Keeping such a system running isn't terribly difficult. High availability doesn't impact the database. If the database is only engaged in batched rebuilds or transaction processing, the database can crash without damage. That's the world for which relational databases were designed. In the '80s, if your mainframe ran out of steam, you got a bigger one. If it crashed, you were down. But when databases became a living, breathing part of the application, availability became an issue. There is no way to make a single system highly available; as soon as any component fails, you're toast. Highly available systems are, by nature, distributed systems.

If a distributed database is a given, the next question is how much work a distributed system will require. There are fundamentally two options: databases that have to be distributed manually, via sharding; and databases that are inherently distributed. Relational databases are split between multiple hosts by manual sharding, or determining how to partition the datasets based on some properties of the data itself: for example, first names starting with A-K on one server, L-Z on another. A lot of thought goes into designing a sharding and replication strategy that doesn't impair performance, while keeping the data relatively balanced between servers. There's a third option that is essentially a hybrid: databases that are not inherently distributed, but that are designed so they can be partitioned easily. MongoDB is an example of a database that can be sharded easily (or even automatically); HBase, Riak, and Cassandra are all inherently distributed, with options to control how replication and distribution work.
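The manual sharding described above, with first names A-K on one server and L-Z on another, can be sketched in a few lines. This is only an illustration: the host names and the `shard_for` function are made up, and a real deployment would also have to handle replication and rebalancing.

```python
import bisect

# Hypothetical shard layout: (first letter of the range, host name).
# The host names are invented for this example.
SHARDS = [("A", "db-a-k.example.internal"), ("L", "db-l-z.example.internal")]
_STARTS = [start for start, _ in SHARDS]


def shard_for(first_name: str) -> str:
    """Return the host responsible for a given first name."""
    if not first_name or not first_name[0].isalpha():
        raise ValueError("expected a name starting with a letter")
    letter = first_name[0].upper()
    # bisect_right locates the last range whose starting letter is <= letter.
    index = bisect.bisect_right(_STARTS, letter) - 1
    return SHARDS[index][1]


print(shard_for("Alice"))
print(shard_for("Maria"))
```

The routing is trivial; the hard part is the one the text points to. Names are not evenly spread across the alphabet, so keeping the two servers balanced takes real thought about the data.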

What database choices are viable when you need good interactive response? There are two separate issues: read latency and write latency. For reasonably simple queries on a database with well-designed indexes, almost any modern database can give decent read latency, even at reasonably large scale. Similarly, just about all modern databases claim to be able to keep up with writes at high speed. Most of these databases, including HBase, Cassandra, Riak, and CouchDB, write data immediately to an append-only file, which is an extremely efficient operation. As a result, writes are often significantly faster than reads.
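To see why appending is cheap, here is a toy append-only store. It is a sketch of the general idea, not of how any of the databases named above is implemented: the `AppendOnlyStore` class and its JSON record format are assumptions, and real systems add compaction, syncing to disk, and crash recovery.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store: every write is appended to the end of one file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offsets = {}  # key -> byte offset of the latest record for that key
        open(path, "ab").close()  # create the file if it does not exist yet

    def put(self, key: str, value) -> None:
        record = (json.dumps({"key": key, "value": value}) + "\n").encode("utf-8")
        with open(self.path, "ab") as log:
            offset = log.seek(0, os.SEEK_END)
            log.write(record)  # sequential write; nothing is updated in place
        self.offsets[key] = offset

    def get(self, key: str):
        with open(self.path, "rb") as log:
            log.seek(self.offsets[key])
            return json.loads(log.readline())["value"]


path = os.path.join(tempfile.mkdtemp(), "data.log")
store = AppendOnlyStore(path)
store.put("user:1", {"name": "Ann"})
store.put("user:1", {"name": "Ann B."})  # the older record stays in the file
print(store.get("user:1"))
```

A write never has to find and modify an existing record, which is why it is fast. A read has to locate the latest record first, here through the in-memory `offsets` index.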

Whether any particular database can deliver the performance you need depends on the nature of the application, and whether you've designed the application in a way that uses the database efficiently: in particular, the structure of queries, more than the structure of the data itself. Redis is an in-memory database with extremely fast response, for both read and write operations; but there are a number of tradeoffs. By default, data isn't saved to disk, and is lost if the system crashes. You can configure Redis for durability, but at the cost of some performance. Redis is also limited in scalability; there's some replication capability, but support for clusters is still coming. But if you want raw speed, and have a dataset that can fit into memory, Redis is a great choice.

It would be nice if there were some benchmarks to cover database performance in a meaningful sense, but as the saying goes, "there are lies, damned lies, and benchmarks." In particular, no small benchmark can properly duplicate a real test-case for an application that might reasonably involve dozens (or hundreds) of servers.

Changing data and cheap lunches

NoSQL databases are frequently called "schemaless" because they don't have the formal schema associated with relational databases. The lack of a formal schema, which typically has to be designed before any code is written, means that schemaless databases are a better fit for current software development practices, such as agile development. Starting from the simplest thing that could possibly work and iterating quickly in response to customer input doesn't fit well with designing an all-encompassing data schema at the start of the project. It's impossible to predict how data will be used, or what additional data you'll need as the project unfolds. For example, many applications are now annotating their data with geographic information: latitudes and longitudes, addresses. That almost certainly wasn't part of the initial data design.

How will the data we collect change in the future? Will we be collecting biometric information along with tweets and Foursquare checkins? Will music sites such as Last.FM and Spotify incorporate factors like blood pressure into their music selection algorithms? If you think these scenarios are futuristic, think about Twitter. When it started out, it just collected bare-bones information with each tweet: the tweet itself, the Twitter handle, a timestamp, and a few other bits. Over its five-year history, though, lots of metadata has been added. A tweet may be 140 characters at most, but a couple KB is actually sent to the server, and all of this is saved in the database. Up-front schema design is a poor fit in a world where data requirements are fluid.

In addition, modern applications frequently deal with unstructured data: blog posts, web pages, voice transcripts, and other data objects that are essentially text. O'Reilly maintains a substantial database of job listings for some internal research projects. The job descriptions are chunks of text in natural languages. They're not unstructured because they don't fit into a schema; you can easily create a JOBDESCRIPTION column in a table, and stuff text strings into it. They're unstructured because knowing the data type and where it fits in the overall structure doesn't help. What are the questions you're likely to ask? Do you want to know about skills, certifications, the employer's address, the employer's industry? Those are all valid columns for a table, but you don't know what you care about in advance; you won't find equivalent information in each job description; and the only way to get from the text to the data is through various forms of pattern matching and classification. Doing the classification up front, so you could break a job listing down into skills, certifications, etc., is a huge effort that would largely be wasted. The guys who work with this data recently had fits disambiguating "Apple Computer" from "apple orchard." Would you even know this was a problem outside of a concrete research project based on a concrete question? If you're just pre-populating an INDUSTRY column from raw data, would you notice that lots of computer industry jobs were leaking into fruit farming? A JOBDESCRIPTION column doesn't hurt, but doesn't help much either; going further, and trying to design a schema around the data that you'll find in the unstructured text, definitely hurts. The kinds of questions you're likely to ask have everything to do with the data itself, and little to do with that data's relations to other data.

However, it's really a mistake to say that NoSQL databases have no schema. In a document database, such as CouchDB or MongoDB, documents are key-value pairs. While you can add documents with differing sets of keys (missing keys or extra keys), or even add keys to documents over time, applications still must know that certain keys are present to query the database; indexes have to be set up to make searches efficient. The same thing applies to column-oriented databases, such as HBase and Cassandra. While any row may have as many columns as needed, some up-front thought has to go into what columns are needed to organize the data. In most applications, a NoSQL database will require less up-front planning, and offer more flexibility as the application evolves. As we'll see, data design revolves more around the queries you want to ask than the domain objects that the data represents. It's not a free lunch; possibly a cheap lunch, but not free.

What kinds of storage models do the more common NoSQL databases support? Redis is a relatively simple key-value store, but with a twist: values can be data structures (lists and sets), not just strings. It supplies operations for working directly with sets and lists (for example, union and intersection).

CouchDB and MongoDB both store documents in JSON format, where JSON is a format originally designed for representing JavaScript objects, but now available in many languages. So on one hand, you can think of CouchDB and MongoDB as object databases; but you could also think of a JSON document as a list of key-value pairs. Any document can contain any set of keys, and any key can be associated with an arbitrarily complex value that is itself a JSON document. CouchDB queries are views, which are themselves documents in the database that specify searches. Views can be very complex, and can use a built-in MapReduce facility to process and summarize results. Similarly, MongoDB queries are JSON documents, specifying fields and values to match, and query results can be processed by a built-in MapReduce. To use either database effectively, you start by designing your views: what do you want to query, and how. Once you do that, it will become clear what keys are needed in your documents.
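The idea of a query that is itself a document of fields and values can be mimicked in plain Python. This is not MongoDB's query engine, which also supports operators, nested fields, and indexes. The `matches` and `find` helpers and the sample job documents are invented for illustration.

```python
def matches(document: dict, query: dict) -> bool:
    """True if every field named in the query is present with the given value."""
    return all(
        key in document and document[key] == value
        for key, value in query.items()
    )


def find(collection: list, query: dict) -> list:
    """Return the documents in the collection that satisfy the query."""
    return [doc for doc in collection if matches(doc, query)]


# Documents in one collection need not share the same set of keys.
jobs = [
    {"title": "DBA", "city": "Boston", "skills": ["SQL"]},
    {"title": "Developer", "city": "Boston"},
    {"title": "Developer", "city": "Austin", "remote": True},
]

boston_devs = find(jobs, {"title": "Developer", "city": "Boston"})
print(boston_devs)
```

Even in this toy version the application has to know that `title` and `city` exist before it can ask about them, which is why design starts from the queries.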

Riak can also be viewed as a document database, though with more flexibility about document types. It natively handles JSON, XML, and plain text, and a plug-in architecture allows you to add support for other document types. Searches "know about" the structure of JSON and XML documents. Like CouchDB, Riak incorporates MapReduce to perform complex queries efficiently.

Cassandra and HBase are usually called column-oriented databases, though a better term is a "sparse row store." In these databases, the equivalent to a relational "table" is a set of rows, identified by a key. Each row consists of an unlimited number of columns; columns are essentially keys that let you look up values in the row. Columns can be added at any time, and columns that are unused in a given row don't occupy any storage. NULLs don't exist. And since columns are stored contiguously, and tend to have similar data, compression can be very efficient, and searches along a column are likewise efficient. HBase describes itself as a database that can store billions of rows with millions of columns.

How do you design a schema for a database like this? As with the document databases, your starting point should be the queries you'll want to make. There are some radically different possibilities. Consider storing logs from a web server. You may want to look up the IP addresses that accessed each URL you serve. The URLs can be the primary key; each IP address can be a column. This approach will quickly generate thousands of unique columns, but that's not a problem — and a single query, with no joins, gets you all the IP addresses that accessed a single URL. If some URLs are visited by many addresses, and some are only visited by a few, that's no problem: remember that NULLs don't exist. This design isn't even conceivable in a relational database. You can't have a table that doesn't have a fixed number of columns.
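The web log design can be approximated with nested dictionaries: the outer key plays the role of the row key (the URL), and each inner key is a column (an IP address). This is a sketch of the data model only, not of HBase or Cassandra themselves. The function names are hypothetical, and the addresses come from ranges reserved for documentation.

```python
from collections import defaultdict

# Row key: URL. Column name: IP address. Column value: number of hits.
access_log = defaultdict(dict)


def record_hit(url: str, ip: str) -> None:
    """Add a column for this IP to the URL's row, or bump its counter."""
    row = access_log[url]
    row[ip] = row.get(ip, 0) + 1


def visitors(url: str) -> list:
    """One lookup, no joins: all IP addresses that accessed the URL."""
    return sorted(access_log.get(url, {}))


record_hit("/index.html", "192.0.2.10")
record_hit("/index.html", "198.51.100.7")
record_hit("/index.html", "192.0.2.10")
record_hit("/about.html", "203.0.113.5")

print(visitors("/index.html"))
print(len(access_log["/about.html"]))  # rows hold only the columns they use
```

A row visited by one address stores one column and a row visited by thousands stores thousands. Nothing is reserved for the columns a row does not have.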

Now, let's make it more complex: you're writing an ecommerce application, and you'd like to access all the purchases that a given customer has made. The solution is similar. The column family is organized by customer ID (the primary key); you have columns for first name, last name, address, and all the normal customer information, plus as many columns as are needed for each purchase. In a relational database, this would probably involve several tables and joins. In the NoSQL databases, it's a single lookup. Schema design doesn't go away, but it changes: you think about the queries you'd like to execute, and how you can perform those efficiently.
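One way to picture the ecommerce layout is a single wide row per customer, with each purchase stored as an extra column. The `purchase:<timestamp>` naming convention, the customer ID, and the sample data are all assumptions made for this sketch.

```python
# Hypothetical wide row for one customer; every key in the inner dict is a column.
customers = {
    "cust-1001": {
        "first_name": "Pat",
        "last_name": "Lee",
        "address": "12 Elm St",
        "purchase:2011-09-02T10:15": "USB cable",
        "purchase:2011-10-14T16:40": "Laptop stand",
    }
}


def purchases(customer_id: str) -> list:
    """Fetch one row, then pick out the purchase columns in timestamp order."""
    row = customers[customer_id]
    return [
        value
        for column, value in sorted(row.items())
        if column.startswith("purchase:")
    ]


print(purchases("cust-1001"))
```

A relational design would typically keep purchases in a separate table and join on the customer ID. Here everything the query needs arrives with the one row.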

This isn't to say that there's no value to normalization, just that data design starts from a different place. With a relational database, you start with the domain objects, and represent them in a way that guarantees that virtually any query can be expressed. But when you need to optimize performance, you look at the queries you actually perform, then merge tables to create longer rows, and do away with joins wherever possible. With the schemaless databases, whether we're talking about data structure servers, document databases, or column stores, you go in the other direction: you start with the query, and use that to define your data objects.

The sacred cows

The ACID properties (atomicity, consistency, isolation, durability) have been drilled into our heads. But even these come into play as we start thinking seriously about database architecture. When a database is distributed, for instance, it becomes much more difficult to achieve the same kind of consistency or isolation that you can on a single machine. And the problem isn't just that it's "difficult" but rather that achieving them ends up in direct conflict with some of the reasons to go distributed. It's not that properties like these aren't very important — they certainly are — but today's software architects are discovering that they require the freedom to choose when it might be worth a compromise.

What about transactions, two-phase commit, and other mechanisms inherited from big iron legacy databases? If you've read almost any discussion of concurrent or distributed systems, you've heard that banking systems care a lot about consistency. What if you and your spouse withdraw money from the same account at the same time? Could you overdraw the account? That's what ACID is supposed to prevent. But a few months ago, I was talking to someone who builds banking software, and he said "If you really waited for each transaction to be properly committed on a world-wide network of ATMs, transactions would take so long to complete that customers would walk away in frustration. What happens if you and your spouse withdraw money at the same time and overdraw the account? You both get the money; we fix it up later."

This isn't to say that bankers have discarded transactions, two-phase commit and other database techniques; they're just smarter about it. In particular, they're distinguishing between local consistency and absolutely global consistency. Gregor Hohpe's classic article Starbucks Does Not Use Two-Phase Commit makes a similar point: in an asynchronous world, we have many strategies for dealing with transactional errors, including write-offs. None of these strategies are anything like two-phase commit. They don't force the world into inflexible, serialized patterns.
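The "fix it up later" approach can be caricatured in a few lines. This is a toy model, not a description of any real banking system. The function names and amounts are invented, and each ATM is assumed to check only the balance it last saw.

```python
def atm_approves(last_known_balance: int, amount: int) -> bool:
    """Each ATM decides locally, using a possibly stale view of the balance."""
    return amount <= last_known_balance


def reconcile(balance: int, approved_withdrawals: list) -> tuple:
    """Apply all approved withdrawals later and flag an overdraft if one occurred."""
    new_balance = balance - sum(approved_withdrawals)
    return new_balance, new_balance < 0


balance = 100
requests = [80, 80]  # you and your spouse, at two ATMs, at the same time

# Both ATMs still see a balance of 100, so both requests are approved.
approved = [amount for amount in requests if atm_approves(balance, amount)]
final_balance, overdrawn = reconcile(balance, approved)
print(approved, final_balance, overdrawn)
```

Neither customer waits on a global commit. The inconsistency surfaces at reconciliation, where it is handled as a business problem rather than prevented as a database problem.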

The CAP theorem is more than a sacred cow; it's a law of the database universe that can be expressed as "Consistency, Availability, Partition Tolerance: choose two." But let's rethink relational databases in light of this theorem. Databases have stressed consistency. The CAP theorem is really about distributed systems, and as we've seen, relational databases were developed when distributed systems were rare and exotic at best. If you needed more power, you bought a bigger mainframe. Availability isn't an issue on a single server: if it's up, it's up, if it's down, it's down. And partition tolerance is meaningless when there's nothing to partition. As we saw at the beginning of this article, distributed systems are a given for modern applications; you won't be able to scale to the size and performance you need on a single box. So the CAP theorem is historically irrelevant to relational databases: they're good at providing consistency, and they have been adapted to provide high availability with some success, but they are hard to partition without extreme effort or extreme cost.

Since partition tolerance is a fundamental requirement for distributed applications, it becomes a question of what to sacrifice: consistency or availability. There have been two approaches: Riak and Cassandra stress availability, while HBase has stressed consistency. With Cassandra and Riak, the tradeoff between consistency and availability is tunable. CouchDB and MongoDB are essentially single-headed databases, and from that standpoint, availability is a function of how long you can keep the hardware running. However, both have add-ons that can be used to build clusters. In a cluster, CouchDB and MongoDB are eventually consistent (like Riak and Cassandra); availability depends on what you do with the tools they provide. You need to set up sharding and replication, and use what's essentially a proxy server to present a single interface to the cluster's clients. BigCouch is an interesting effort to integrate clustering into CouchDB, making it more like Riak. Now that Cloudant has announced that it is merging BigCouch and CouchDB, we can expect to see clustering become part of the CouchDB core.
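A common way to reason about a tunable tradeoff is with replica counts: data is stored on N replicas, a write waits for W acknowledgments, and a read consults R replicas. The sketch below assumes that Dynamo-style model, which is not spelled out in the text above. It checks by brute force that every read overlaps every write exactly when R + W > N.

```python
from itertools import combinations


def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible read set shares a replica with every write set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


N = 3
for r in range(1, N + 1):
    for w in range(1, N + 1):
        # The brute-force check agrees with the simple rule R + W > N.
        assert always_overlap(N, r, w) == (r + w > N)

print(always_overlap(3, 2, 2))  # quorum reads and writes
print(always_overlap(3, 1, 1))  # fastest, but a read may miss the latest write
```

Small values of R and W keep the system responsive when nodes are slow or down, at the price of occasionally stale reads. Larger values buy consistency by making each operation wait on more replicas.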

We've seen that absolute consistency isn't a hard requirement for banks, nor is it the way we behave in our real-world interactions. Should we expect it of our software? Or do we care more about availability?

It depends. The consistency requirements of many social applications are very soft. You don't need to get the correct number of Twitter or Facebook followers every time you log in. If you search, you probably don't care if the results don't contain the comments that were posted a few seconds ago. And if you're willing to accept less-than-perfect consistency, you can make huge improvements in performance. In the world of big-data-backed web applications, with databases spread across hundreds (or potentially thousands) of nodes, the performance penalty of locking down a database while you add or modify a row is huge; if your application has frequent writes, you're effectively serializing all the writes and losing the advantage of the distributed database. In practice, in an "eventually consistent" database, changes typically propagate to the nodes in tenths of a second; we're not talking minutes or hours before the database arrives in a consistent state.

Given that we have all been battered with talk about "five nines" reliability, and given that it is a big problem for any significant site to be down, it seems clear that we should prioritize availability over consistency, right? The architectural decision isn't so easy, though. There are many applications in which inconsistency must eventually be dealt with. If consistency isn't guaranteed by the database, it becomes a problem that the application has to manage. When you choose availability over consistency, you're potentially making your application more complex. With proper replication and failover strategies, a database designed for consistency (such as HBase) can probably deliver the availability you require; but this is another design tradeoff. Regardless of the database you're using, more stringent reliability requirements will drive you toward exotic engineering. Only you can decide the right balance for your application. The point isn't that any given decision is right or wrong, but that you can (and have to) choose, and that's a good thing.

Other features

I've completed a survey of the major tradeoffs you need to think about in selecting a database for a modern big data application. But the major tradeoffs aren't the only story. There are many database projects with interesting features. Here are some of the ideas and projects I find most interesting:

  • Scripting: Relational databases all come with some variation of the SQL language, which can be seen as a scripting language for data. In the non-relational world, a number of scripting languages are available. CouchDB and Riak support JavaScript, as does MongoDB. The Hadoop project has spawned several data scripting languages that are usable with HBase, including Pig and Hive. The Redis project is experimenting with integrating the Lua scripting language.
  • RESTful interfaces: CouchDB and Riak are unique in offering
    RESTful interfaces. These are interfaces based on HTTP and the
    architectural style elaborated in Roy Fielding's doctoral
    dissertation and in RESTful Web Services. CouchDB goes so far as
    to serve as a web application framework. Riak also offers a more
    traditional protocol buffer interface, which is a better fit if
    you expect a high volume of small requests.
  • Graphs: Neo4J is a special
    purpose database designed for maintaining large graphs: data where
    the data items are nodes, with edges representing the connections
    between the nodes. Because graphs are extremely flexible data
    structures, a graph database can emulate any other kind of database.
  • SQL: I've been discussing the NoSQL movement, but SQL is a
    familiar language, and is always just around the corner. A couple
    of startups are working on adding SQL to Hadoop-based datastores:
    DrawnToScale (which focuses on low-latency, high-volume web
    applications) and Hadapt (which focuses on analytics and
    bringing data warehousing into the 20-teens). In a few years, will
    we be looking at hybrid databases that take advantage of both
    relational and non-relational models? Quite possibly.
  • Scientific data: Yet another direction comes from SciDB, a database project aimed at the
    largest scientific applications (particularly the Large Synoptic Survey Telescope).
    The storage model is based on multi-dimensional arrays. It is
    designed to scale to hundreds of petabytes of storage, collecting
    tens of terabytes per night. It's still in the relatively early stages.
  • Hybrid architectures: NoSQL is really about architectural
    choice. And perhaps the biggest expression of architectural choice
    is a hybrid architecture: rather than using a single database
    technology, mixing and matching technologies to play to their
    strengths. I've seen a number of applications that use traditional
    relational databases for the portion of the data for which the
    relational model works well, and a non-relational database for the
    rest. For example, customer data could go into a relational
    database, linked to a non-relational database for unstructured data
    such as product reviews and recommendations. It's all about
    flexibility. A hybrid architecture may be the best way to integrate
    "social" features into more traditional ecommerce sites.

These are only a few of the interesting ideas and projects that are floating around out there. Roughly a year ago, I counted a couple dozen non-relational database projects; I'm sure there are several times that number today. Don't hesitate to add notes about your own projects in the comments.

In the end

In a conversation with Eben Hewitt, author of Cassandra: The Definitive Guide, Eben summarized what you need to think about when architecting the back end of a data-driven system. They're the same issues software architects have been dealing with for years: you need to think about the whole ecosystem in which the application works; you need to consider your goals (Do you require high availability? Fault tolerance?); you need to consider support options; you need to isolate what will change over the life of the application, and separate that from what remains the same. The big difference is that now there are options; you don't have to choose the relational model. There are other options for building large databases that scale horizontally, are highly available, and can deliver great performance to users. And these options, the databases that make up the NoSQL movement, can often achieve these goals with greater flexibility and lower cost.

It used to be said that nobody got fired for buying IBM. Then nobody got fired for buying Microsoft. Now, I suppose, nobody gets fired for buying Oracle. But just as the landscape changed for IBM and Microsoft, it's shifting again, and even Oracle has a NoSQL solution. Rather than relational databases being the default, we're moving into a world where developers are considering their architectural options, and deciding which products fit their application: how the databases fit into their programming model, whether they can scale in ways that make sense for the application, whether they have strong or relatively weak consistency requirements.

For years, the relational default has kept developers from understanding their real back-end requirements. The NoSQL movement has given us the opportunity to explore what we really require from our databases, and to find out what we already knew: there is no one-size-fits-all solution.



October 21 2011

Developer Week in Review: Talking to your phone

I've spent the last week or so getting up to speed on the ins and outs of Vex Robotics tournaments since I foolishly volunteered to be competition coordinator for an event this Saturday. I've also been helping out my son's team, offering design advice where I could. Vex is similar to Dean Kamen's FIRST Robotics program, but the robots are much less expensive to build. That means many more people can field robots from a given school and more people can be hands-on in the build. If you happen to be in southern New Hampshire this Saturday, drop by Pinkerton Academy and watch two dozen robots duke it out.

In non-robotic news ...

Why Siri matters

It's easy to dismiss Siri, Apple's new voice-driven "assistant" for the iPhone 4S, as just another refinement of the chatbot model that's been entertaining people since the days of ELIZA. No one would claim that Siri could pass the Turing test, for example. But, at least in my opinion, Siri is important for several reasons.

On a pragmatic level, Siri makes a lot of common smartphone tasks much easier. For example, I rarely used reminders on the iPhone and preferred to use a real keyboard when I had to create appointments. But Siri makes adding a reminder or appointment so easy that I have made it pretty much my exclusive method of entering them. It also is going to be a big win for drivers trying to use smartphones in their cars, especially in states that require hands-free operations.

I suspect Siri will also end up being a classic example of crowdsourcing. If I were Apple, I would be capturing every "miss" that Siri couldn't handle and looking for common threads. Since Siri is essentially doing natural language processing and applying rules to your requests, Apple can improve Siri progressively by adding the low-hanging fruit. For example, at the moment, Siri balks at a question like, "How are the Patriots doing?" I'd be shocked if it fails to answer that question in a year since sports scores and standings will be at the heart of commonly asked questions.

For developers, the benefits of Siri are obvious. While it's a closed box right now, if Apple follows its standard model, we should expect to see API and SDK support for it in future releases of iOS. At the moment, apps that want voice control (and they are few and far between) have to implement it themselves. Once apps can register with Siri, any app will be able to use voice.


Can Open Office survive?

Long-time WIR readers will know that I'm no fan of how Oracle has treated its acquisitions from Sun. A prime example is OpenOffice. In June, OpenOffice was spun off from Oracle, and therefore lost its allowance. Now the OpenOffice team is passing around the hat, looking for funds to keep the project going.

We need to support Open Office because it's the only project that really keeps Microsoft honest as far as providing open standards access to Microsoft Office products. It's also the only way that Linux users can deal with the near-ubiquitous use of Office document formats in the real world (short of running Office in a VM or with Wine).

The revenge of SQL

The NoSQL crowd has always had Google App Engine as an ally since the only database available to App Engine apps has been the App Engine Datastore, which (among other things) doesn't support joins. But much as Apple initially rejected multitasking on the iPhone (until it decided to embrace it), Google appears to have thrown in the towel as far as SQL goes.

It's always dangerous to hold an absolutist position (with obvious exceptions, such as despising Jar Jar Binks). SQL may have been overused in the past, but it's foolish to reject SQL altogether. It can be far too useful at times. SQL can be especially handy, as an example, when developing pure REST-like web services. It's nice to see that Google has taken a step back from the edge. Or, to put it more pragmatically, that it listens to its customer base on occasion.
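To make the join point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The authors/posts schema and both function names are hypothetical, invented for illustration; nothing here is App Engine code. The first function lets the database do the join, and the second shows the extra work the application takes on when the datastore can't.

```python
import sqlite3

# Hypothetical schema for illustration: authors and their posts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO posts VALUES (10, 1, 'Joins'), (11, 2, 'Keys'), (12, 1, 'Views');
""")


def titles_with_join(conn):
    """One query: the database matches each post to its author."""
    rows = conn.execute(
        "SELECT authors.name, posts.title FROM posts "
        "JOIN authors ON authors.id = posts.author_id "
        "ORDER BY posts.id"
    )
    return [tuple(row) for row in rows]


def titles_without_join(conn):
    """Two queries: the application stitches the results together itself."""
    names = dict(conn.execute("SELECT id, name FROM authors"))
    posts = conn.execute("SELECT author_id, title FROM posts ORDER BY id")
    return [(names[author_id], title) for author_id, title in posts]
```

Both functions return the same rows. The difference is where the matching happens, and how much code you maintain once the relationships get more complicated than one lookup table.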

Got news?

Please send tips and leads here.


October 07 2011

Top Stories: October 3-7, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Oracle's Big Data Appliance: what it means
Oracle's new Big Data Appliance couldn't be a plainer validation of what's important in big data right now, or where the battle for technology dominance lies.

PhoneGap basics: What it is and what it can do for mobile developers
Joe Bowser, the developer of the Android version of PhoneGap, on the pros and cons of developing with the PhoneGap cross-platform application framework.

How data and open government are transforming NYC
New York City has become the epicenter for experiments in data-driven governance. Here, NYC officials Rachel Sterne and Carole Post discuss the city's data initiatives.

The making of a "minimum awesome product"
In this podcast, Evan Doll, the co-founder of Flipboard sat down with Joe Wikert to discuss Flipboard's focus on design and social integration.

iPad vs. Kindle Fire: Early impressions and a few predictions
Few have actually held the Kindle Fire, let alone put it through its paces, so Pete Meyers chose a novel analytical approach: Examine his own iPad habits and look for spots where the Fire can find a foothold.

Android Open, being held October 9-11 in San Francisco, is a big-tent meeting ground for app and game developers, carriers, chip manufacturers, content creators, OEMs, researchers, entrepreneurs, VCs, and business leaders. Save 20% on registration with the code AN11RAD.

October 06 2011

Oracle's NoSQL

Oracle's turn-about announcement of a NoSQL product wasn't really surprising. When Oracle spends time and effort putting down a technology, you can bet that it's secretly impressed, and trying to re-implement it in its back room. So Oracle's paper "Debunking the NoSQL Hype" should really have been read as a backhanded product announcement. (By the way, don't click that link; the paper appears to have been taken down. Surprise.)

I have to agree with DataStax and other developers in the NoSQL movement: Oracle's announcement is a validation, more than anything else. It's certainly a validation of NoSQL, and it's worth thinking about exactly what that means. It's long been clear that NoSQL isn't about any particular architecture. When databases as fundamentally different as MongoDB, Cassandra, and Neo4J can all be legitimately characterized as "NoSQL," it's clear that NoSQL isn't a "thing." We've become accustomed to talking about the NoSQL "movement," but what does that mean?

As Justin Sheehy, CTO of Basho Technologies, said, the NoSQL movement isn't about any particular architecture, but about architectural choice. For as long as I can remember, application developers have debated software architecture choices with gusto. There were many choices for the front end; many choices for middleware; and careers rose and fell based on those choices. Somewhere along the way, "Software Architect" even became a job title. But for the backend, for the past 20 years there has really been only one choice: a relational database that looks a lot like Oracle (or MySQL, if you'd prefer). And choosing between Oracle, MySQL, PostgreSQL, or some other relational database just isn't that big a choice.

Did we really believe that one size fits all for database problems? If we ever did, the last three years have made it clear that the model was broken. I've got nothing against SQL (well, actually, I do, but that's purely personal), and I'm willing to admit that relational databases solve many, maybe even most, of the database problems out there. But just as it's clear that the universe is a more complicated place than physicists thought it was in 1990, it's also clear that there are data problems that don't fit 20-year-old models. NoSQL doesn't use any particular model for storing data; it represents the ability to think about and choose your data architecture. It's important to see Oracle recognize this. The company's announcement isn't just a validation of key-value stores, but of the entire discussion of database architecture.

Of course, there's more to the announcement than NoSQL. Oracle is selling a big data appliance: an integrated package including Hadoop and R. The software is available standalone, though Oracle clearly hopes that the package will be running on its Exadata Database hardware (or equivalent), which is an impressive monster of a database machine (though I agree with Mike Driscoll that machines like these are on the wrong side of history). There are other bits and pieces to solve ETL and other integration problems. And it's fair to say that Oracle's announcement validates more than just NoSQL; it validates the "startup stack" or "data stack" that we've seen in many of the most exciting new businesses that we watch. Hadoop plus a non-relational database (often MongoDB, HBase, or Cassandra), with R as an analytics platform, is a powerful combination. If nothing else, Oracle has given more conservative (and well-funded) enterprises permission to make the architectural decisions that the startups have been making all along, and to work with data that goes beyond what traditional data warehouses and BI technologies allow. That's a good move, and it grows the pie for everyone.

I don't think many young companies will be tempted to invest millions in Oracle products. Some larger enterprises should, and will, question whether investing in Oracle products is wise when there are much less expensive solutions. And I am sure that Oracle will take its share of the well-funded enterprise business. It's a win all around.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR


Strata Week: Oracle's big data play

Here are the data stories that caught my attention this week:

Oracle's big data week

Eyes have been on Oracle this week as it holds its OpenWorld event in San Francisco. The company has made a number of major announcements, including unveiling its strategy for handling big data. This includes its Big Data Appliance, which will use a new Oracle NoSQL database as well as an open-source distribution of Hadoop and R.

Edd Dumbill examined the Oracle news, arguing that "it couldn't be a plainer validation of what's important in big data right now or where the battle for technology dominance lies." He notes that whether one is an Oracle customer or not, the company's announcement "moves the big data world forward," pointing out that there is now a de facto agreement that Hadoop and R are core pieces of infrastructure.

GigaOm's Derrick Harris reached out to some of the startups who also offer these core pieces, including Norman Nie, the CEO of Revolution Analytics, and Mike Olson, CEO of Cloudera. Not surprisingly perhaps, the startups are "keeping brave faces, but the consensus is that Oracle's forays into their respective spaces just validate the work they've been doing, and they welcome the competition."

Oracle's entry as a big data player also brings competition to others in the space, such as IBM and EMC, as all the major enterprise providers wrestle to claim supremacy over whose capabilities are the biggest and fastest. And the claim that "we're faster" was repeated over and over by Oracle CEO Larry Ellison as he made his pitch to the crowd at OpenWorld.

Who wrote Hadoop?

As ReadWriteWeb's Joe Brockmeier notes, ascertaining the contributions to open-source projects is sometimes easier said than done. Who gets credit — companies or individuals — can be both unclear and contentious. Such is the case with a recent back-and-forth between Cloudera's Mike Olson and Hortonworks' Owen O'Malley over who's responsible for the contributions to Hadoop.

O'Malley wrote a blog post titled "The Yahoo! Effect," which, as the name suggests, describes Yahoo's legacy and its continuing contributions to the Hadoop core. O'Malley argues that "from its inception until this past June, Yahoo! contributed more than 84% of the lines of code still in Apache Hadoop trunk." (Editor's note: The link to "trunk" was inserted for clarity.) O'Malley adds that so far this year, the biggest contributors to Hadoop are Yahoo! and Hortonworks.

Lines of code contributed to Apache Hadoop trunk (from Owen O'Malley's post, "The Yahoo! Effect")

That may not be a surprising argument to hear from Hortonworks, the company that was spun out of Yahoo! earlier this year to focus on the commercialization and development of Hadoop.

But Cloudera's Mike Olson challenges that argument — again, not a surprise, as Cloudera has long positioned itself as a major contributor to Hadoop, a leader in the space, and of course now the employer of former Yahoo! engineer Doug Cutting, the originator of the technology. Olson takes issue with O'Malley's calculations and in a blog post of his own, contends that these calculations don't accurately take into account the companies that people now work for:

Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today — at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop — the story is much more interesting.

Olson also argues that it isn't simply a matter of who's contributing to the Apache Hadoop core, but rather who is working on:

... the broader ecosystem of projects. That ecosystem has exploded in recent years, and most of the innovation around Hadoop is now happening in new projects. That's not surprising — as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use.

Got data news?

Feel free to email me.


October 04 2011

Four short links: 4 October 2011

  1. -- Singaporean version of TechStars, with 100-day program ("the bootcamp") Jan-Apr 2012. Startups from anywhere in the world can apply, and will want to because Singapore is the gateway to Asia. They'll also have mentors from around the world.
  2. Oracle NoSQLdb -- Oracle want to sell you a distributed key-value store. It's called "Oracle NoSQL" (as opposed to PostgreSQL, which is SQL No-Oracle). (via Edd Dumbill)
  3. Facebook Browser -- interesting thoughts about why the browser might be a good play for Facebook. I'm not so sure: browsers don't lend themselves to small teams, and search advertising doesn't feel like a good fit with Facebook's existing work. Still, it makes me grumpy to see browsers become weapons again.
  4. Bitbucket -- a competitor to Github, from the folks behind the widely-respected Jira and Confluence tools. I'm a little puzzled, to be honest: Github doesn't seem to have weak spots (the way, for example, that Sourceforge did).

October 03 2011

Oracle's Big Data Appliance: what it means

Today, Oracle announced their Big Data Appliance. It couldn't be a plainer validation of what's important in big data right now, or where the battle for technology dominance lies.

Oracle's appliance includes some homegrown technology, most specifically a NoSQL database of their own design, and some open source technologies: Hadoop and R. Let's take a look at what these three decisions might mean.

Oracle NoSQL Database: Oracle's core reputation is as a database vendor, and as owners of the Berkeley DB technology, they have a core NoSQL platform to build upon (Berkeley was NoSQL for years before we even had that term). Oracle have no reason to partner with or incorporate other NoSQL tech such as Cassandra or MongoDB, and now pose a significant business threat to those technologies—perhaps Cassandra more than MongoDB, due to its enterprise credentials.

Hadoop: competitive commercial big data solutions such as Greenplum and Aster Data got ahead in the market through incorporating their own MapReduce technologies. Oracle hasn't bothered to do this, and has instead standardized on Hadoop and a system of connectors to its main Oracle product. (Both Greenplum and Aster also have Hadoop connectors.) If it needed any further validation, this confirms Hadoop's arrival as the Linux of big data. It's a standard.

R: big data isn't much use until you can make sense of it, and the inclusion of R in Oracle's big data appliance bears this out. It also sets up R as a new industry standard for analytics: something that will raise serious concern among vendors of established statistical and analytical solutions SAS and SPSS.

Whether you use Oracle or not, today's announcement moves the big data world forward. We have de facto agreement on Hadoop and R as core infrastructure, and we have healthy competition at the database and NoSQL layer.

Talk about this at Strata 2012: As the call for participation for Strata 2012 (Feb 28-Mar 1, Santa Clara, CA) nears its close, Oracle's announcement couldn't be more timely. We are opening up new content tracks focusing on the Hadoop ecosystem and on R. Submit your proposal by the end of this week.

August 01 2011

Four short links: 1 August 2011

  1. The Flashed Face Effect Video -- your brain is not perfect, and it reduces faces to key details. When they flash by in the periphery of your vision, you perceive them as gross and freakish. I like to start the week by reminding myself how fallible I am. Good preparation for the rest of the week... (via BERG London)
  2. The Newsonomics of Netflix and the Digital Shift -- Netflix changed prices, tilting people toward digital and away from physical. This post argues that the same will happen in newspapers. Imagine 2020, and the always-out-there-question: Will we still have print newspapers? Well, maybe, but imagine how much they’ll cost — $3 for a local daily? — and consumers will compare that to the “cheap” tablet pricing, and decide, just as they are doing now with Netflix, which product to take and which to let go. The print world ends not with a bang, but with price increase after price increase. (via Tim O'Reilly)
  3. Phonegap -- just shipped 1.0 of an HTML5 app platform that allows you to author native applications with web technologies and get access to APIs and app stores.
  4. UnQL -- query language for document store databases, from the creators of CouchDB and SQLite. (via Francisco Reyes)

July 13 2011

Who are the OSCON data geeks?

This podcast highlights some of the sessions in OSCON Data and who might be interested in them.

Edd Dumbill, Bradford Stephens and I took the liberty of making irreverent monikers for several of the types of attendees we expect at OSCON Data. These include:

  • DBA Dude
  • Data Scientist
  • NOSQL Nerd
  • Scaling Geek
  • Real-time Traveler

(Podcast production by Rich Goyette Audio.)

    OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data.

    Save 20% on registration with the code OS11RAD

    June 29 2011

    What CouchDB can do for HTML5, web apps and mobile

    CouchApps are JavaScript and HTML5 applications served directly from the document-oriented database CouchDB. In the following interview, Found Line co-founder and OSCON speaker Bradley Holt (@BradleyHolt) talks about the utility of CouchApps, what CouchDB offers web developers, and how the database works with HTML5.

    How do CouchApps work?

    Bradley Holt: CouchApps are web applications built using CouchDB, JavaScript, and HTML5. They skip the middle tier and allow a web application to talk directly to the database — the CouchDB database could even be running on the end-user's machine or Android / iOS device.

    What are the benefits of building CouchApps?

    Bradley Holt: Streamlining of your codebase (no middle tier), replication, the ability to deploy/replicate an application along with its data, and the side benefits that come with going "with the grain" of how the web works are some of the benefits of building CouchApps.

    To be perfectly honest though, I don't think CouchApps are quite ready for widespread developer adoption yet. The biggest impediment is tooling. The current set of development tools needs refinement, and the process of building a CouchApp can be a bit difficult at times. The term "CouchApp" can also have many different meanings. That said, the benefits of CouchApps are compelling and the tools will catch up soon.

    OSCON JavaScript and HTML5 Track — Discover the new power offered by HTML5, and understand JavaScript's imminent colonization of server-side technology.

    Save 20% on registration with the code OS11RAD

    HTML5 addresses a lot of storage issues. Where does CouchDB fit in?

    Bradley Holt: The HTML5 Web Storage specification describes an API for persistent storage of key/value pairs locally within a user's web browser. Unlike previous attempts at browser local storage specifications, the HTML5 storage specification has achieved significant cross-browser support.

    One thing that the HTML5 Web Storage API lacks, however, is a means of querying for values by anything other than a specific key. You can't query across a set of keys or values. IndexedDB addresses this and allows for indexed database queries, but IndexedDB is not currently part of the HTML5 specification and is only implemented in a limited number of browsers.

    If you need more than just key/value storage, then you have to look outside of the HTML5 specification. Like HTML5 Web Storage, CouchDB stores key/value pairs. In CouchDB, the key part of the key/value pair is a document ID and the value is a JSON object representing a single document. Unlike HTML5 Web Storage, CouchDB provides a means of indexing and querying data using MapReduce "views." Since CouchDB is accessed using a RESTful HTTP API and stores documents as JSON objects, it is easy to work with CouchDB directly from an HTML5/JavaScript web application.
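The difference between key-only lookup and a MapReduce-style view can be sketched in a few lines. This is an in-memory Python illustration of the idea only, not CouchDB's API: real CouchDB views are JavaScript map functions stored in design documents and queried over HTTP. The sample documents and the names `map_by_author`, `build_view`, and `query_view` are all hypothetical.

```python
import json

# A plain key/value store, in the spirit of HTML5 Web Storage: values are
# strings, and the only lookup available is by exact key.
storage = {
    "doc1": json.dumps({"type": "post", "author": "ann", "title": "Hello"}),
    "doc2": json.dumps({"type": "post", "author": "bob", "title": "Views"}),
    "doc3": json.dumps({"type": "comment", "author": "ann", "text": "Nice"}),
}


def map_by_author(doc):
    """Map function: emit (key, value) pairs for the documents we want indexed."""
    if doc.get("type") == "post":
        yield doc["author"], doc["title"]


def build_view(store, map_fn):
    """Run the map function over every document and sort the emitted rows by key."""
    rows = []
    for doc_id, raw in store.items():
        for key, value in map_fn(json.loads(raw)):
            rows.append((key, doc_id, value))
    return sorted(rows)


def query_view(view, key):
    """Return the values emitted under a single key."""
    return [value for k, _doc_id, value in view if k == key]


view = build_view(storage, map_by_author)
```

With only `storage`, finding Ann's posts means fetching and parsing every value. With the view, `query_view(view, "ann")` answers the question from rows that were computed once, which is roughly the trade a CouchDB view makes.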

    How does CouchDB's replication feature work with HTML5?

    Bradley Holt: Again, CouchDB is not directly related to the HTML5 specification, but CouchDB's replication feature creates unique opportunities for CouchApps built using JavaScript and HTML5 (or any application built using CouchDB, for that matter).

    I've heard J. Chris Anderson use the term "ground computing" as a counterpoint to "cloud computing." The idea is to store a user's data as close to that user as possible — and you can't get any closer than a user's own computer or mobile device! CouchDB's replication feature makes this possible. Data that is relevant to a particular user can be copied to and from that user's own computer or mobile device using CouchDB's incremental replication. This allows for faster access for the user (since his or her application is hitting a local database), offline access, data portability, and potentially more control over his or her own data.
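The "incremental" part is what makes this practical: each sync copies only what changed since the last one. Below is a toy Python sketch of that idea, assuming a sequence-numbered change log and a saved checkpoint. The `Database` class and `replicate` function are hypothetical and heavily simplified; they ignore revisions, conflicts, and deletions, all of which real CouchDB replication has to handle.

```python
class Database:
    """A toy document store that records every write in a numbered change log."""

    def __init__(self):
        self.docs = {}
        self.changes = []  # (sequence number, document ID), oldest first

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.changes.append((len(self.changes) + 1, doc_id))

    def changes_since(self, seq):
        """Return the log entries recorded after sequence number `seq`."""
        return [(s, d) for s, d in self.changes if s > seq]


def replicate(source, target, checkpoint=0):
    """Copy documents changed after `checkpoint` from source to target.

    Returns (new checkpoint, number of documents copied).
    """
    pending = source.changes_since(checkpoint)
    changed_ids = {doc_id for _seq, doc_id in pending}
    for doc_id in sorted(changed_ids):
        target.put(doc_id, source.docs[doc_id])
    new_checkpoint = pending[-1][0] if pending else checkpoint
    return new_checkpoint, len(changed_ids)
```

The caller keeps the returned checkpoint and passes it back on the next sync, so a device that has been offline only pays for the documents that changed while it was away.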

    Now that CouchDB runs on mobile devices, how do you see it shaping mobile app development?

    Bradley Holt: While Android is a great platform, the biggest channel for mobile applications is Apple's iOS. CouchDB has been available on Android for a while now, but it is relatively new to iOS. Now that CouchDB can be used to build iPhone/iPad applications, we will most certainly see many more mobile applications built using CouchDB in order to take advantage of CouchDB's unique features — especially replication.

    The big question is, will these applications be built as native applications or will they be built as CouchApps? I don't know the answer, but I'd like to see more of these applications built on the CouchApps side. With CouchApps, developers can more easily port their applications across platforms, and they can use existing HTML5, JavaScript, and CSS skill sets.

    This interview was edited and condensed.


    June 09 2011

    Four short links: 9 June 2011

    1. Optimizing MongoDB -- shorter field names, barely hundreds of ops/s when not in RAM, updates hold a lock while they fetch the original from disk ... it's a pretty grim story. (via Artur Bergman)
    2. Is There a New Geek Anti-Intellectualism? -- focus is absolutely necessary if we are to gain knowledge. We will be ignoramuses indeed, if we merely flow along with the digital current and do not take the time to read extended, difficult texts. (via Sacha Judd)
    3. Trend Data for Teens (Pew Internet and American Life Project) -- one in six American teens have used the Internet to look for information online about a health topic that’s hard to talk about, like drug use, sexual health, or depression.
    4. The Guts of Android (Linux Weekly News) -- technical but high-level explanation of the components of an Android system and how they compare to those of a typical Linux system.

    May 27 2011

    Four short links: 27 May 2011

    1. flockdb (Github) -- Twitter's open source scalable fault-tolerant distributed graph database. (via Twitter's open source projects page)
    2. How to Kill Innovation in Five Easy Steps (Tech Republic) -- point four is interesting, Rely too heavily on data and dashboards. It's good to be reminded of the contra side to the big-data-can-be-mined-for-all-truths attitudes flying around.
    3. Architecture of Open Source Applications -- CC-licensed book available through Lulu or for free download. Lots of interesting stories and design decisions to draw from. I know when I learned how Perl worked on the inside, I learned a hell of a lot that I could apply later in life and respected its creators all the more.
    4. Bullying in 140 Letters -- it's about an Australian storm in a teacup, but it made me consider the short-form medium. Short-form negativity can have the added colour/resonance of being snarky and funny. Hard to add colour to short-form positive comments, though. Much harder to be funny and positive than to be funny and negative. Have we inadvertently created a medium where, thanks to the quirks of our language and the way we communicate, it favours negativity over positivity?
