
June 29 2011

What CouchDB can do for HTML5, web apps and mobile

CouchApps are JavaScript and HTML5 applications served directly from the document-oriented database CouchDB. In the following interview, Found Line co-founder and OSCON speaker Bradley Holt (@BradleyHolt) talks about the utility of CouchApps, what CouchDB offers web developers, and how the database works with HTML5.

How do CouchApps work?

Bradley Holt: CouchApps are web applications built using CouchDB, JavaScript, and HTML5. They skip the middle tier and allow a web application to talk directly to the database — the CouchDB database could even be running on the end-user's machine or Android / iOS device.

What are the benefits of building CouchApps?

Bradley Holt: Streamlining of your codebase (no middle tier), replication, the ability to deploy/replicate an application along with its data, and the side benefits that come with going "with the grain" of how the web works are some of the benefits of building CouchApps.

To be perfectly honest though, I don't think CouchApps are quite ready for widespread developer adoption yet. The biggest impediment is tooling. The current set of development tools needs refinement, and the process of building a CouchApp can be a bit difficult at times. The term "CouchApp" can also have many different meanings. That said, the benefits of CouchApps are compelling and the tools will catch up soon.

HTML5 addresses a lot of storage issues. Where does CouchDB fit in?

Bradley Holt: The HTML5 Web Storage specification describes an API for persistent storage of key/value pairs locally within a user's web browser. Unlike previous attempts at browser local storage specifications, the HTML5 storage specification has achieved significant cross-browser support.

One thing that the HTML5 Web Storage API lacks, however, is a means of querying for values by anything other than a specific key. You can't query across a set of keys or values. IndexedDB addresses this and allows for indexed database queries, but IndexedDB is not currently part of the HTML5 specification and is only implemented in a limited number of browsers.

If you need more than just key/value storage, then you have to look outside of the HTML5 specification. Like HTML5 Web Storage, CouchDB stores key/value pairs. In CouchDB, the key part of the key/value pair is a document ID and the value is a JSON object representing a single document. Unlike HTML5 Web Storage, CouchDB provides a means of indexing and querying data using MapReduce "views." Since CouchDB is accessed using a RESTful HTTP API and stores documents as JSON objects, it is easy to work with CouchDB directly from an HTML5/JavaScript web application.
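
The contrast Holt draws can be sketched in plain JavaScript. The snippet below is only an in-memory illustration, not the Web Storage or CouchDB API: `store`, `buildView`, and `queryView` are hypothetical names and the documents are made up. A key/value store answers only "what is stored under this key?", while a view runs a map function over every document and indexes whatever keys that function emits.

```javascript
// In-memory illustration only; names and documents are hypothetical.

// Key/value storage: lookups work only by the exact key (here, a document ID).
const store = new Map([
  ["photo-1", { user: "bob", camera: "nikon" }],
  ["photo-2", { user: "john", camera: "canon" }],
  ["photo-3", { user: "bob", camera: "canon" }],
]);

// A view: run a map function over every document and index what it emits.
// CouchDB provides emit as a global; it is passed in here to keep the sketch self-contained.
function buildView(docs, mapFn) {
  const rows = [];
  for (const [id, doc] of docs) {
    mapFn(doc, (key, value) => rows.push({ id, key, value }));
  }
  // Keep rows sorted by emitted key, as an index would.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Return the IDs of all documents that emitted the given key.
function queryView(rows, key) {
  return rows.filter((row) => row.key === key).map((row) => row.id);
}

const byUser = buildView(store, (doc, emit) => emit(doc.user, doc.camera));

console.log(store.get("photo-2").camera); // lookup by key
console.log(queryView(byUser, "bob"));    // lookup by an indexed field
```

With only the `Map`, finding all of Bob's photos would mean scanning every value by hand; the view does that scan once and keeps the result ordered by key.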



How does CouchDB's replication feature work with HTML5?

Bradley Holt: Again, CouchDB is not directly related to the HTML5 specification, but CouchDB's replication feature creates unique opportunities for CouchApps built using JavaScript and HTML5 (or any application built using CouchDB, for that matter).

I've heard J. Chris Anderson use the term "ground computing" as a counterpoint to "cloud computing." The idea is to store a user's data as close to that user as possible — and you can't get any closer than a user's own computer or mobile device! CouchDB's replication feature makes this possible. Data that is relevant to a particular user can be copied to and from that user's own computer or mobile device using CouchDB's incremental replication. This allows for faster access for the user (since his or her application is hitting a local database), offline access, data portability, and potentially more control over his or her own data.
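
The "incremental" part of replication can be pictured with a small model. This is a deliberately simplified sketch, not CouchDB's replication protocol: `TinyDb` and `replicate` are hypothetical names, and revisions and conflict handling are left out. The idea it shows is that each database numbers its writes, and a replication pass copies only what changed after the last recorded checkpoint.

```javascript
// Simplified in-memory model of incremental replication; not CouchDB's API.
class TinyDb {
  constructor() {
    this.docs = new Map(); // id -> document
    this.changes = [];     // ordered log of { seq, id }
    this.seq = 0;          // increases with every write
  }

  put(id, doc) {
    this.seq += 1;
    this.docs.set(id, doc);
    this.changes.push({ seq: this.seq, id });
  }
}

// Copy only the documents changed on `source` after `checkpoint`.
// Returns the new checkpoint and the number of documents transferred.
function replicate(source, target, checkpoint) {
  const ids = new Set();
  for (const change of source.changes) {
    if (change.seq > checkpoint) ids.add(change.id);
  }
  for (const id of ids) target.put(id, source.docs.get(id));
  return { checkpoint: source.seq, copied: ids.size };
}

const server = new TinyDb();
const phone = new TinyDb();
server.put("note-1", { text: "first" });
server.put("note-2", { text: "second" });

// First pass copies everything; the second copies only what is new.
const first = replicate(server, phone, 0);
server.put("note-3", { text: "written while the phone was offline" });
const second = replicate(server, phone, first.checkpoint);

console.log(first.copied, second.copied, phone.docs.size);
```

Because the second pass starts from the saved checkpoint, a device that reconnects after being offline transfers only the missing changes rather than the whole database.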

Now that CouchDB runs on mobile devices, how do you see it shaping mobile app development?

Bradley Holt: While Android is a great platform, the biggest channel for mobile applications is Apple's iOS. CouchDB has been available on Android for a while now, but it is relatively new to iOS. Now that CouchDB can be used to build iPhone/iPad applications, we will most certainly see many more mobile applications built using CouchDB in order to take advantage of CouchDB's unique features, especially replication.

The big question is, will these applications be built as native applications or will they be built as CouchApps? I don't know the answer, but I'd like to see more of these applications built on the CouchApps side. With CouchApps, developers can more easily port their applications across platforms, and they can use existing HTML5, JavaScript, and CSS skill sets.

This interview was edited and condensed.

May 05 2011

Four short links: 5 May 2011

  1. Why We Chose MongoDB for Guardian.co.uk (SlideShare) -- they're using MongoDB's flexible schema, as schema upgrades were a pain in their previous system (Oracle). I think of these as the database equivalent of dynamic typing in languages like Perl and Ruby. (via Paul Rowe)
  2. Solving Problems with Visual Analytics -- This book is the result of a community effort of the partners of the VisMaster Coordinated Action funded by the European Union. The overarching aim of VisMaster was to create a research roadmap that outlines the current state of visual analytics across many disciplines, and to describe the next steps that have to be taken to foster a strong visual analytics community, thus enabling the development of advanced visual analytic applications. (via Mark Madsen)
  3. iOS-Couchbase (GitHub) -- a build of distributed key-value store CouchDB, which will keep your mobile data in sync with a remote store. No mean feat given CouchDB itself has Erlang as a dependency. (via Mike Olson)
  4. SimString -- A fast and simple algorithm for approximate string retrieval in C++ with Python and Ruby bindings, opensourced with modified BSD license. (via Matt Biddulph)

March 31 2011

Improving healthcare in Zambia with CouchDB

A new healthcare project in Zambia is trying to integrate supervisors, clinics, and community healthcare workers (CHW) into a system that can improve patient service and provide more data about the effectiveness of care. Because of the technical challenges in an extreme rural setting, unique solutions are required. According to Cory Zue, chief technology officer of Dimagi, CouchDB went a long way toward keeping a consistent set of records under extreme circumstances. The full story will be laid out in Zue's talk at the upcoming MySQL conference, but here's a sneak peek.



You're involved with a rural healthcare project in Africa. Can you talk a bit about it, and how CouchDB is being used?


Cory Zue: We chose to use CouchDB for very specific reasons for our project, which had to do with it being very good at replicating itself. The project is an effort to deploy health record and data collection systems to extremely remote, rural clinics in Zambia. Working in that environment, we're facing a lot of really challenging technical limitations. If you've only worked in America or in Europe, you don't necessarily run across these types of issues.

We've got computers at clinics that are maintaining patient records. That data needs to sync to cell phones and to a central server, but around 35% of the clinics don't have power, so we're installing solar panels. So one limitation is that the system has to work on low resources.

None of these clinics have Internet out of the box, so most of the time our only Internet connection is through a GSM modem that connects over the local cell network. It's very hard to move data in that environment, and you can't do anything that relies on an always-on Internet connection with a web app that is always accessing data remotely.

CouchDB was a really good option for us because we could install a Couch database at each clinic site, and then that way all the clinic operations would be local. There would be no Internet use in terms of going out and getting the patient records, or entering data at the clinic site. Couch has a replication engine that lets you synchronize databases — both pull replication and push replication — so we have a star network of databases with one central server in the middle and all of these satellite clinic servers that are connecting through that cell network whenever they're able to get on, and sending the data back and forth. That way we're able to get data in and out of these really remote, rural areas without having to write our own synchronization protocols and network stack.



It could be argued that the same money could have been spent on things like medicine and staff. What makes this project cost effective?


Cory Zue: That's something that we are constantly thinking about in my organization. In this case, the goal of the entire project is to prove the benefit. This is actually part of a five-year research study that's looking at improving primary care through better oversight and through community health worker integration with the health system.

What we're looking at are two arms: One is that when the data gets to a central place, people like district supervisors can have performance indicator reports that give them a sense for how well the clinic is doing, how well they're following protocols in terms of things such as taking vitals at every visit. They get a report and then they can come and talk to the clinic about what they're doing right and what they're doing wrong.

The other arm that is more interesting, and I think the arm that has more of an impact, is integrating the community health workers with the primary care system. In many of these places, the community health worker is the first line of defense between a person getting sick and what they do about it. The system creates cases or follow ups, from visits to the clinic, that tell the community health worker: "This person had a problem. You should go follow up with them in five days and make sure they're okay."

By sending data through the cell network and then back down to applications that are running on the phones at the community health centers, community health workers are able to find out about complications and people they need to check on. The entire workflow of patient tracing, from clinic to community and then back to the clinic, can be much more complete.

How has the project been going so far?

Cory Zue: This project kicked off in September and it will be going on through the course of the next five-plus years. Hopefully we'll find that the system was a smashing success and it will continue to live on or evolve past that time.

Anecdotally, what we've heard about so far is the visibility of the data: the people who oversee these clinics value just having any idea of what's happening on the ground. We haven't gotten a lot of feedback from the community health workers yet, and so as we continue to grow out we're really looking to focus on getting testimonials from the actual CHWs and the patients that are affected.


This interview was edited and condensed.

December 24 2010

Strata Gems: CouchDB in the browser

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Whirr makes Hadoop and Cassandra a snap.

An earlier Strata Gem explained how to try MongoDB without installing any software. You can get a similar experience for CouchDB thanks to JSCouch, which also provides a great introduction to writing MapReduce functions.

CouchDB was the first document-store database to become popular under the "NoSQL" label, and offers some unique and interesting features.

  • RESTful JSON API — rather than requiring programming language-specific bindings, CouchDB can be accessed over HTTP, with JSON used as the data format
  • MapReduce — CouchDB provides query and view functionality via MapReduce operations, which are specified in JavaScript
  • Replication — with incremental replication and bi-directional conflict resolution, CouchDB is suitable for connecting replicated databases with intermittent connectivity. The Ubuntu One cloud service uses this feature

To demonstrate the basics of CouchDB, Mu Dynamics and Jan Lehnardt created a partial implementation of CouchDB that runs in the browser, JSCouch.

The sample database in the demonstration contains information about photos and their metadata. A few such documents are shown below.

{"name":"fish.jpg","created_at":"Fri, 24 Dec 2010 06:35:03 GMT",
 "user":"bob","type":"jpeg","camera":"nikon",
 "info":{"width":100,"height":200,"size":12345},"tags":["tuna","shark"]}

{"name":"trees.jpg","created_at":"Fri, 24 Dec 2010 06:33:45 GMT",
 "user":"john","type":"jpeg","camera":"canon",
 "info":{"width":30,"height":250,"size":32091},"tags":["oak"]}

If you click on the "map/reduce" tab, you will find a drop-down for selecting pre-written JavaScript functions for performing certain operations. Let's take a look at a query to compute the sum of all the image sizes. Here's the map function:

function(doc) {
	// Emit every image's size under one shared key so the reduce step sees them all.
	emit("size", doc.info.size);
}

And the reduce function:

function(keys, values, rereduce) {
	// sum() is a built-in helper; partial sums add up, so this also works when rereduce is true.
	return sum(values);
}
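
To see what these two functions compute, they can be run outside the browser demo against the two sample documents shown earlier. The harness below is an assumption for illustration: `emit` and `sum` are normally supplied by CouchDB (or JSCouch), so minimal stand-ins are defined here, and the `keys` argument is simplified to a plain list of emitted keys.

```javascript
// Minimal stand-in for the view engine. In CouchDB or JSCouch, emit and sum
// are provided by the environment; here they are defined by hand.
const docs = [
  { name: "fish.jpg", info: { width: 100, height: 200, size: 12345 } },
  { name: "trees.jpg", info: { width: 30, height: 250, size: 32091 } },
];

const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}
function sum(values) {
  return values.reduce((total, value) => total + value, 0);
}

// The map and reduce functions from the demo, unchanged in behavior.
const map = function (doc) {
  emit("size", doc.info.size);
};
const reduce = function (keys, values, rereduce) {
  return sum(values);
};

// Map phase: one emitted row per document.
docs.forEach((doc) => map(doc));

// Reduce phase: collapse all emitted values into a single total.
const total = reduce(rows.map((row) => row.key), rows.map((row) => row.value), false);
console.log(total); // 12345 + 32091 = 44436
```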

With a variety of NoSQL databases to choose from, and developers the unwitting kingmakers, easy demonstration interfaces such as JSCouch can help smooth the path to technology adoption.

November 18 2010

Strata Week: Keeping it clean

This edition of Strata Week is all about making things easy and tidy. If you're eager to learn more tips and tricks for doing so, come to Santa Clara in February: check out the list of Strata conference speakers and register today.

Languages made easy: R and Clojure

Love "fruitful and fun" data mining with Orange? Wish you had an interface like that for R? Wish no more. Anup Parikh and Kyle Covington have created Red-R to extend the Orange interface.

The goal of this project is to provide access to the massive library of packages in R (and even non-R packages) without any programming expertise. The Red-R framework uses concepts of data-flow programming to make data the center of attention while hiding all the programming complexity.

Similar to Orange, Red-R uses a series of widgets to modify and display data. The beauty of Red-R is that it allows programming novices to leverage R's power and to interact with their data in an analytical way. Such tools are no substitute for actual statistical modeling, of course, but they are a great first step in piquing interest and providing a visual conversation-starter.

Red-R is still in its infancy, but as with all such projects, testing and bug reports are welcome. Check out the forums to get involved.

If R is not your thing, perhaps you've jumped on the Clojure bandwagon (I wouldn't blame you: Clojure is one exciting new language). If that's the case, check out Webmine, a library for mining HTML written by Bradford Cross, Matt Revelle, and Aria Haghighi.

Facts are stubborn things

A team at the Indiana University Center for Complex Networks and Systems Research has built the Truthy system to examine and classify memes on Twitter in an attempt to identify instances of astroturfing, smear campaigns, and other "social pollution."

Truthy looks at streaming Twitter data via the public Twitter API, filters it to extract politically-minded tweets, and then pulls out "memes" like #hashtags, @ replies, phrases, and URLs. Memes that constitute a high volume of tweets, as well as memes that have experienced a significant fluctuation in volume, are flagged and entered into a database for further investigation.
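
The extract-and-flag step described above can be sketched roughly as follows. This is not Truthy's code: the regular expression, the two thresholds, and the sample tweets are all assumptions made for illustration, and the political filtering and phrase detection are left out. It counts how many tweets in each time window mention a hashtag, @ reply, or URL, then flags memes with high volume or a large swing between windows.

```javascript
// Hypothetical sketch: the regular expression, thresholds, and tweets are assumptions.

// Pull out "memes": #hashtags, @mentions, and URLs.
function extractMemes(text) {
  return text.toLowerCase().match(/#\w+|@\w+|https?:\/\/\S+/g) || [];
}

// Count how many tweets in a time window mention each meme.
function countMemes(tweets) {
  const counts = new Map();
  for (const text of tweets) {
    for (const meme of new Set(extractMemes(text))) {
      counts.set(meme, (counts.get(meme) || 0) + 1);
    }
  }
  return counts;
}

// Flag memes with high volume, or with a large swing between two windows.
function flagMemes(previousWindow, currentWindow, { minVolume, minChange }) {
  const before = countMemes(previousWindow);
  const now = countMemes(currentWindow);
  const flagged = [];
  for (const meme of new Set([...before.keys(), ...now.keys()])) {
    const count = now.get(meme) || 0;
    const change = Math.abs(count - (before.get(meme) || 0));
    if (count >= minVolume || change >= minChange) flagged.push(meme);
  }
  return flagged.sort();
}

const previousWindow = ["Vote today #election", "@alice see http://example.com/a"];
const currentWindow = [
  "#election results",
  "#Election fraud?",
  "#election now",
  "@alice hi",
  "new #meme",
];

const flagged = flagMemes(previousWindow, currentWindow, { minVolume: 3, minChange: 2 });
console.log(flagged);
```

Here only `#election` crosses either threshold; in a real system the flagged memes would then be stored for the further investigation the article describes.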

The Truthy system then visualizes a timeline, map, and diffusion network for each meme, and applies sentiment analysis in order to better study and understand "social epidemics." It also relies on crowdsourcing to train its algorithms. Users can visit the project's website and are asked to click the "Truthy" button on a meme's detail page when they suspect a meme contains misinformation masquerading as fact.

Check out the gallery for some fascinating network visuals and the stories behind them.

A clean bill of health

Kudos to Dimagi and CIDRZ for a creative solution to a serious problem. In order to provide standard interventions to reduce maternal and infant mortality rates in rural Zambia for the BHOMA (Better Health Outcomes through Mentoring and Assessments) project, they needed a distributed system for capturing and relaying health data.

As in many other places in Africa, reliable internet is not easy to find in rural Zambian communities. But cell phones are nearly ubiquitous, and are the best communication devices for relaying patient information from clinics and field workers and back again.

Enter Apache's CouchDB, which saved the day with its continuous replication. A lightweight server in each clinic now replicates filtered data to a national CouchDB database via a modem connection, and two-way replication allows data collected on phones to propagate back to each clinic.
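
Filtered replication means a function decides, document by document, what gets sent. The sketch below is an illustration under stated assumptions, not the BHOMA project's configuration: plain objects stand in for databases, the document fields and `replicateFiltered` helper are hypothetical, and only the `(doc, req)` shape of the filter mirrors how CouchDB filter functions are written.

```javascript
// Illustration only: plain objects stand in for databases, and the
// document fields and query parameter are hypothetical.

// Same (doc, req) shape as a CouchDB filter function: return true to replicate.
function byType(doc, req) {
  return doc.type === req.query.type;
}

// Copy only the documents the filter accepts; return the IDs that were sent.
function replicateFiltered(source, target, filter, query) {
  const copied = [];
  for (const [id, doc] of Object.entries(source)) {
    if (filter(doc, { query })) {
      target[id] = doc;
      copied.push(id);
    }
  }
  return copied;
}

const clinic = {
  "visit-17": { type: "visit", patient: "p-3", vitalsTaken: true },
  "visit-18": { type: "visit", patient: "p-9", vitalsTaken: false },
  "local-settings": { type: "config", printer: "front-desk" },
};
const national = {};

const sent = replicateFiltered(clinic, national, byType, { type: "visit" });
console.log(sent);
```

Keeping local-only documents out of the transfer matters on a slow modem link, since every document that is filtered out is bandwidth saved.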

Read more details of the case study here.

Refinement rather than fashion

You may recall that among a spate of Google acquisitions over the summer was Metaweb, the company responsible for Freebase. Now, a nifty open source tool formerly called Freebase Gridworks has been renamed Google Refine, and version 2.0 was released just last week.

Refine is a powerful tool for cleaning up data. It allows you to easily sort and transform inconsistent cells to correct typos and merge variants; filter, then remove or change certain rows; apply custom text transformations; examine numerical columns via histograms; and perform many more complex operations to make data more consistent and useful.

Refine really shines when it is used to combine or transform data from multiple sources, so it's no surprise that it has been popular for open government and data journalism tasks.

Also notable is the fact that Refine is a downloadable desktop app, not a web service. This means you don't have to upload your data anywhere in order to use it. Best of all, Google Refine keeps a running changelog that lets you review and revert changes to your data -- so go ahead: play around. A great set of video tutorials on Google's blog can help you do just that.

October 15 2010

Strata Week: Army anomalies

Here's a look at the latest data news and developments that caught my eye.

Algorithms to sniff out disloyal troops

The US Army is vexed by the problem of troops who become disaffected and, by extension, a risk to the operations they're involved in. Through a DARPA-sponsored research project, the Army hopes to use big data analytics to identify individuals likely to pose a threat.

In an announcement for an investigative "industry day" for the ADAMS (Anomaly Detection at Multiple Scales) project, DARPA framed the problem:

Each time we see an incident like a soldier in good mental health becoming homicidal or suicidal or an innocent insider becoming malicious we wonder why we didn't see it coming. When we look through the evidence after the fact, we often find a trail -- sometimes even an "obvious" one.

... The focus is on malevolent insiders that started out as "good guys." The specific goal of ADAMS is to detect anomalous behaviors before or shortly after they turn. Operators in the counter-intelligence community are the target end-users for ADAMS insider threat detection technology.

Reporting on ADAMS, Wired notes that there's still much to consider here:

All this suggests the blind are still leading the blind when it comes to stopping internal military subversion. It's far from clear what kind of data -- troops' e-mail? web trails? book orders? -- DARPA would use to ferret out troops who pose a risk to themselves or others.

CouchDB in the movies

A NoSQL document-store database, CouchDB is noted for its support of replication and synchronization. Notably used in Ubuntu's personal cloud technology for synchronization, CouchDB provides a great substrate for replicating periodically disconnected data services.

These features for synchronization and replication made CouchDB an attractive solution for the needs of Novacut, a new open source video editing solution inspired by the decentralized version control systems available to programmers. By using cloud storage and sharing metadata documenting changes made to a video, many people can work on a single video in a distributed fashion.

In the words of the Novacut home page:

Such an editor will help artists reduce costs, work faster, and collaborate with the right people. Such an editor will help independent TV and film succeed. We want artists to win!

Riak adds full-text searching

A NoSQL database modeled on Amazon's Dynamo architecture, Riak is maturing rapidly. Riak provides a decentralized key-value store, a MapReduce engine, and an HTTP/JSON query interface. Aimed at deployment in web applications, Riak's architecture is scalable and fault tolerant.

Basho, the creators of Riak, announced the release of Riak 0.13, including a full-text search engine, Riak Search. Like the database itself, Riak Search is scalable and fault tolerant, operating in real time.

Search and indexing are typically missing from key-value stores, and they're often solved by the addition of a search engine such as SOLR. By adding search, Riak is now a good general solution for scalable web applications, functioning as a self-contained SMAQ system. This web-app-friendly positioning is reinforced by Riak's integration with JavaScript, PHP, Python and Ruby.

Riak's 0.13 release also improves the performance of its MapReduce functionality, and is generally less resource-hungry than previous versions. Both open source and enterprise editions are available.

Hadoop World announcements

This week saw Cloudera's Hadoop World conference in New York, and a burst of product announcements from companies in the Hadoop ecosystem.

  • SD Times reports that Revolution Analytics, which sells a commercial distribution of R, has hired the author of the RHIPE package, which integrates R with Hadoop. In hiring Saptarshi Guha, Revolution Analytics joins the trend of analytics and data vendors getting behind Hadoop as a common substrate for their platforms.
  • Karmasphere announced the release of the Professional Edition of their Karmasphere Studio product, a graphical environment that eases the development, debugging, deployment and monitoring of Hadoop jobs. The Professional Edition adds graphical instrumentation and rule-based diagnostic tools for monitoring performance.
  • Datameer announced the first formal release of their Datameer Analytics Solution (DAS), a spreadsheet-driven interface for data analysis with Hadoop. DAS facilitates end-to-end big data processing, from import through to reporting.
  • Quest Software announced OraOop, a connector that allows rapid and scalable data transfer between Oracle databases and Hadoop. OraOop takes the form of a plugin to Cloudera's Hadoop database connector, Sqoop, offering improved performance when used with Oracle.
  • Also connecting their database to Hadoop is Membase (formerly Northscale). Membase is a fast key-value data store. Through integration with Cloudera's Hadoop Distribution, data can be accumulated in Membase and streamed through to Hadoop for processing. A Sqoop-derived connector also allows the rapid loading of data between the two systems.



Send us news

Email us news, tips and interesting tidbits at strataweek@oreilly.com.


February 24 2010

NoSQL conference coming to Boston

On March 11 Boston will join several other cities that have hosted
conferences on the movement broadly known as NoSQL. Cassandra,
CouchDB, HBase, HypergraphDB, Hypertable, Memcached, MongoDB,
Neo4j, Riak, SimpleDB, Voldemort, and probably other projects as
well will be represented at the one-day affair.

It's generally understood that characterizing a movement by what it's
not is awkward, and it's hard to find an elevator speech to
encompass all the topics of NoSQL Boston. Are these tools for "big
data" problems? Usually, but sometimes even small web sites can find
them useful. Are the tools meant for processing streams such as log
files? Sometimes, but they can be useful for other text and data
processing as well. And do they reject relational principles? Well, so
you'd think--but different ones reject different principles, so even
there it's hard to find commonality. (I compared them to relational
databases in a blog last year.)

The interviews I had with various project leaders for this article
turned up a recurring usage pattern for NoSQL. I was seeking
particular domains or types of data where the tools would be useful,
but couldn't see much commonality. What connects the users is that
they carry out web-related data crunching, searching, and other Web
2.0 related work. I think these companies use NoSQL tools because
they're the companies who understand leading-edge technologies and are
willing to take risks in those areas. As the field gets better known,
usage will spread.

I had a talk last week with conference organizer Eliot Horowitz, who
is the founder and CTO of 10gen, the company that makes MongoDB. He
let me know that the conference plans to bypass the head-scratching
and launch into practical applications. The day will contain a coding
session and a schema design session along with keynotes.

The resilience of open source

One question that intrigues me is why all the offerings in the NoSQL
area are open source. Some have commercial add-ons, but the core
technology is provided as free software. The few proprietary products
and services in the market (such as Citrusleaf) get far less
attention. Reasons seem to include:

  • The market is currently too small. Just as most computing innovations
    start off in research settings, this one is being explored by people
    looking for solutions to their own problems, more than ways to extract
    a profit. Numerous in-house projects exist in this space that are not
    free software (Google's Map/Reduce and BigTable, for instance, and
    Amazon's SimpleDB and Dynamo) but they aren't commercialized either.


  • Experimentation is moving too fast. Most of the projects are just a
    couple years old, and are rapidly adding features.


  • The ROI is hard to calculate. Horowitz says, "People won't pay for
    anything they don't really understand yet." (Nevertheless, 10gen and
    other companies are commercializing the open source offerings.)


  • Whatever problem an organization is trying to solve, each NoSQL
    offering tends to be a piece of the solution. It has to be tuned for and
    integrated into the organization's architecture, and combined with
    pieces from other places.

The projects in this conference therefore demonstrate the innovative
power of free software. CouchDB and Cassandra are particularly
interesting in this regard because they are community efforts more
than corporate efforts. Both are Apache top-level projects. (Cassandra
was just moved from the incubator to a top-level project on February
17.) CouchDB committer J. Chris Anderson tells me that the Apache
community process ensures a wide range of voices are heard, leading to
(of course) occasional public wrangling but a superior outcome.

The BBC and (according to Anderson) SXSW are among the users of
CouchDB, CouchDB has been integrated into Ubuntu, Mozilla Messaging is
basing Raindrop (their next-generation messaging platform) on CouchDB,
and even mobile handset manufacturers are looking at it. (O'Reilly
Media also uses CouchDB.)

I also talked to Alan Hoffman of Cloudant, which offers a CouchDB cloud
service that fills in some of the gaps left by bare CouchDB
(consistent hashing, partitioning, quorum, etc.). Although a couple
companies offer commercial support, no single company takes
responsibility for CouchDB. Its community is highly
distributed. Anderson listed 10 Apache committers working for 8
different companies, and nearly 40 other people who contribute
patches. Support takes place on mailing lists (roughly one thousand
messages a month) and IRC channels.

Jonathan Ellis, project chair of Cassandra, calls it an "open source
success story" because it went from a state of near petrification to
vibrant regrowth through open sourcing. Facebook invented it and
brought it to a state where it satisfied their needs. They made it
open and moved it into the Apache Incubator in 2008 but declared
that they would not be doing further development. It could easily have
receded into obscurity.

Ellis says that he was hired at Rackspace and asked to find a
distributed data store that was fast and scaled easily; he decided on
Cassandra. Soon after he became a public and enthusiastic advocate,
Digg and Twitter joined Rackspace as users and developers. Having
multiple QA teams test each release--particularly in very different
environments--helps quality immensely. Ellis finds that Eric Raymond's
"many eyes" characterization of open source bug fixing applies.

Although Cassandra is found mostly as a backing store for web sites
with a lot of users, Ellis thinks it would meet the needs of many
academic and commercial sites, and looks forward to someone offering a
cloud service based on it.

Justin Sheehy, CTO of Basho, maker of
the Riak data store, told me they can confirm the typical advantages
cited for open source. Developers at potential customer sites can try
out the software without going through a bureaucratic procurement
process, and then become internal advocates who function much more
effectively than outside salespeople.

He also says that companies such as Basho offer the best of both
worlds to tentative customers. The backing of a corporation means that
professional services and added tools are available to go along with
the product those customers buy. But because the source is open and
has a community around it, those customers can feel secure that
development and support will continue regardless of the fate of the
originating company. 10gen, of course, plays a similar role for
MongoDB and Anderson's company Couchio
offers support for CouchDB. For projects that are not closely
associated with the backing of one company, the Apache Foundation's
sponsorship helps to ensure continuity.

What are the fault lines in the NoSQL landscape?

Naturally, the projects I've mentioned in this blog borrow ideas from
each other and show tiny variations on common solutions regarding such
things as B-tree storage, replication, solutions to locality of
reference, etc. Experience will eventually lead to a shake-out and a
convergence among surviving projects. In the meanwhile, how can you
get your head around them?

We'll pause here for a word from our sponsors, letting you know that
O'Reilly has published books on CouchDB and Hadoop and is
developing one about MongoDB.

Horowitz offers an initial subdivision of projects based on data model
(document, key-value, or tabular), a theme he explored in another
interview.

Roger Magoulas, a research director with O'Reilly, further subdivides
projects into those that crunch large data sets in a batch
manner--such as Hadoop--and those that retrieve views of data to
fulfill visitor search requests on web pages or similar tasks. He goes
on to say that you can compare them on the basis of particular
features, such as automatic replication, auto-sharding or
partitioning, and in-memory caches.

The most comprehensive attempts I've seen to make sense of this gangly
crew of projects from a feature standpoint come in a blog by Ellis
(http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/) and
one by Vineet Gupta
(http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html).
(Gupta's blog is labeled "Part 1" and I'd love to
see more parts.) But Sheehy says the various features of the offerings
interact too strongly and have too many subtle variations to fit into
an easy taxonomy. "Many people try to classify the projects, everyone
does it differently, and nobody gets it quite right."

Community features

So who uses these things? To take Horowitz's MongoDB again as an
example, many web sites gravitate toward it because the document
structure makes some things--adding fields to rows, mapping objects to
fields--easier than a relational database does. A few scientific sites
also use MongoDB.
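
As a rough illustration of why the document structure makes adding fields and mapping objects easier, the sketch below uses plain Python dictionaries as stand-ins for documents. It does not call MongoDB or any driver; the `users` list, the `twitter` field, and the `User` class are hypothetical names chosen for the example.

```python
# Hypothetical stand-in for a document collection: a list of dicts.
# No MongoDB driver is involved; this only illustrates the data model.
users = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Grace"},
]

# Adding a field to one document needs no schema change; the other
# documents are simply left without it.
users[0]["twitter"] = "@ada"


class User:
    """Hypothetical application object that maps onto a document."""

    def __init__(self, name, twitter=None):
        self.name = name
        self.twitter = twitter


def from_document(doc):
    # A missing field falls back to None instead of raising an error.
    return User(doc["name"], doc.get("twitter"))


objects = [from_document(doc) for doc in users]
```

In a relational table, the same change would typically mean altering the table so that every row gains the new column.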

Riak also has a large following among web sites and startups, but
Basho's customers also include media companies, ad networks, SMS
gateways, analytics firms, and many other types of organizations.

Magoulas finds that an organization's bent is determined by the
background and expertise of its developers. Programmers with lots of
traditional relational database experience tend to be wary of the
recent upstarts, a position reinforced by legacy investments in tools
that depend on their relational database and are sometimes very
expensive.

On the other hand, web programmers look for tools that conform more
closely to the data structures and programming techniques they're used
to, and can actually be "flummoxed" by relational database logic or
abstraction layers on top of the databases. These programmers may
think it intuitive to do the kinds of filtering and sorting that seem
like reinventing the wheel to a traditional RDBMS programmer.
Anderson likes to quote Jacob Kaplan-Moss, the creator of Django, as
saying, "Django may be built for the Web, but CouchDB is built of the
Web. I've never seen software that so completely embraces the
philosophies behind HTTP."

10gen's consultation with MongoDB users includes asking for votes on
new features. The company also sees many code contributions in the
driver layer and adapters (sessions, logging, etc.) but not much in
the core. Sheehy said the same is true of Riak: although contributions
to the core are rare, half the client libraries are developed by
outsiders, as are many of the tools.

Rapid change is part of life for NoSQL developers. Anderson says of
CouchDB, "The ancillary APIs have been evolving rapidly in preparation
for our 1.0 release, which should come out in the next few months and
won't differ much from today's trunk. The new APIs include
authentication, authorization, details of Map/Reduce, and functions
for transforming and serving JSON documents as other datatypes such as
HTML or CSV." Horowitz stressed that MongoDB will roll out a lot of
new features over the upcoming year.
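
To give a feel for what "transforming and serving JSON documents as other datatypes such as HTML or CSV" can mean, here is a minimal sketch in Python using only the standard library. It is not CouchDB's API; the function names `to_csv` and `to_html` and the sample documents are assumptions made up for the illustration.

```python
import csv
import html
import io
import json


def to_csv(docs, fields):
    """Render a list of JSON-style documents as CSV text (illustrative only)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for doc in docs:
        # Fields missing from a document become empty cells.
        writer.writerow({field: doc.get(field, "") for field in fields})
    return buffer.getvalue()


def to_html(doc):
    """Render one document as an HTML list, escaping keys and values."""
    items = "".join(
        f"<li>{html.escape(str(key))}: {html.escape(str(value))}</li>"
        for key, value in doc.items()
    )
    return f"<ul>{items}</ul>"


# Sample documents, invented for this example.
docs = json.loads(
    '[{"name": "CouchDB", "model": "document"},'
    ' {"name": "Riak", "model": "key-value"}]'
)
csv_text = to_csv(docs, ["name", "model"])
html_text = to_html(docs[0])
```

The point is only that the stored JSON stays the same while the representation sent to the client varies with the request.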

One hundred people have signed up for NoSQL Boston so far, and more
than 150 are expected. I'll be there to take it in and try to reduce
it to some high-level insights for this blog.
