October 20 2011

Strata Week: A step toward personal data control

Here are a few of the data stories that caught my attention this week.

Your data in your locker

Earlier this month, John Battelle wrote a blog post wishing for a service that would counter the way our personal data is scattered across so many applications and devices. He was looking for a tool that would pull together the data from these various places into something that "queries all my various social actions and curates them into one publicly addressable instance independent of any larger platform like AOL, Facebook, Apple, or Google ... I'm pretty sure this is what Singly and the Locker Project will make theoretically possible."

Battelle and Singly's Jason Cavnar discussed the Locker Project in more detail in another post on Battelle's blog this week.

As Cavnar argued:

Data doesn't do us justice. This is about LIFE. Our lives. Or as our colleague Lindsay (@lschutte) says — 'your story.' Not data. Data is just a manifestation of the actual life we are leading. Our data (story) should be ours to own, remember, re-use, discover with and share.

If that sounds appealing then there's good news ahead. Singly 1.0 begins its roll-out to developers this week, as ReadWriteWeb's Marshall Kirkpatrick reports. Developers will be able to build apps that "search, sort and visualize contacts, links and photos that have been published by their own accounts on various social networks but also by all the accounts they are subscribed to there." For now, the apps will be hosted and deployed on GitHub. There are also several restrictions on using other people's apps: for example, you can use them only to visualize your own data.

Even with these limitations, Singly is a first step in what will be a much-anticipated and hugely important move toward personal data control.

Bad graphics and good data journalism

Last week, New York Times senior software architect Jacob Harris issued a challenge to the growing number of data journalists. Want to visualize your work? Avoid word clouds.

Word clouds are, he argued, much like tag clouds before them: "the mullets of the Internet." That is, taking a particular dataset and merely visualizing the frequency of words therein via tools like Wordle is simply "filler visualization." (And Harris said it's also personally painful to the NYT data science team.)

Harris pointed to the numerous problems with using word clouds as the sole form of textual analysis. At the very least, they rely only on word frequency, which doesn't necessarily tell you that much:

For starters, word clouds support only the crudest sorts of textual analysis, much like figuring out a protein by getting a count only of its amino acids. This can be wildly misleading; I created a word cloud of Tea Party feelings about Obama, and the two largest words were implausibly "like" and "policy," mainly because the importuned word "don't" was automatically excluded. (Fair enough: Such stopwords would otherwise dominate the word clouds.) A phrase or thematic analysis would reach more accurate conclusions. When looking at the word cloud of the War Logs, does the equal sizing of the words "car" and "blast" indicate a large number of reports about car bombs or just many reports about cars or explosions? How do I compare the relative frequency of lesser-used words? Also, doesn't focusing on the occurrence of specific words instead of concepts or themes miss the fact that different reports about truck bombs might be use the words "truck," "vehicle," or even "bongo" (since the Kia Bongo is very popular in Iraq)?
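To make Harris's stopword point concrete, here is a minimal sketch in Python. The sample sentences and the tiny stopword list are invented for illustration and are not Harris's data; the sketch only shows how dropping "don't" as a stopword leaves "like" and "policy" on top, while counting adjacent word pairs keeps the negation visible.

```python
from collections import Counter

# Hypothetical sample sentences, invented for illustration only.
comments = [
    "I don't like his policy",
    "I don't like the policy at all",
    "we don't like this policy",
]

# A tiny illustrative stopword list; real tools ship much longer ones.
STOPWORDS = {"i", "don't", "his", "the", "at", "all", "we", "this"}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def word_counts(texts, stopwords):
    """Count single words after dropping stopwords, as a word cloud would."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in tokenize(text) if w not in stopwords)
    return counts


def bigram_counts(texts):
    """Count adjacent word pairs, keeping stopwords so negation survives."""
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        counts.update(zip(words, words[1:]))
    return counts


unigrams = word_counts(comments, STOPWORDS)
bigrams = bigram_counts(comments)

print(unigrams.most_common(2))      # the words a word cloud would draw largest
print(bigrams[("don't", "like")])   # how often "like" is actually negated
```

The single-word counts suggest approval, while the pair counts show every "like" was preceded by "don't". A phrase-level count is still crude, but it is closer to the thematic analysis Harris asks for.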

The Guardian's Simon Rogers responded to Harris. Rogers acknowledged there are plenty of poor visualizations out there, but he added an important point:

Calling for better graphics is also like calling for more sunshine and free chocolate — who's going to disagree with that? What they do is ignore why people produce their own graphics. We often use free tools because they are quick and tell the story simply. But, when we have the time, nothing beats having a good designer create something beautiful — and the Guardian graphics team produces lovely visualisation for the Datablog all the time — such as this one. What is the alternative online for those who don't have access to a team of trained designers?

That last question is crucial, particularly as not everyone has access to the designers or software needed to do much more with their data than create simple visualizations (word clouds, for example). Rogers said that it's probably fine to have a lot of less-than-useful graphics, because, if nothing else, it "shows that data analysis is part of all our lives now, not just the preserve of a few trained experts handing out pearls of wisdom."

Mary Meeker examines the global growth of mobile data

Among the most-anticipated speakers at Web 2.0 Summit this week was Mary Meeker. The former Morgan Stanley analyst and now partner at Kleiner Perkins gave her annual "Internet Trends" presentation, which is always chock full of data.

Meeker noted that 81% of users of the top 10 global Internet properties come from outside the U.S. Furthermore, in the last three years alone, China has added more Internet users than there are in all of the United States (246 million new Chinese users online versus 244 million total U.S. users online). Although companies like Apple, Amazon, and Google continue to dominate, Meeker pointed out that some of the largest and fastest-growing Internet companies are also based outside the U.S., including Chinese companies like Baidu and Tencent and Russian companies like Mail.ru. And beyond just market value, she pointed to global innovations, such as Sweden's Spotify and Israel's Waze.

The growth in Internet usage continues to be in mobile. Meeker highlighted the global scale and spread of mobile growth, noting that it's in countries like Turkey, India, Brazil and China where we are seeing the largest year-over-year expansion in mobile subscribers.

Suggesting that it may be time to reevaluate Maslow's hierarchy of needs, Meeker posited that Internet access is rapidly becoming a crucial need that sits at the top of a new hierarchy.

Apache Cassandra reaches 1.0

The Apache Software Foundation announced this week the release of Cassandra v1.0.

Cassandra, originally developed by Facebook to power its Inbox Search, was open sourced in 2008. Although it's been a top-level Apache project for more than a year now, the 1.0 release marks Cassandra's maturity and readiness for more widespread implementation. The technology has been adopted beyond Facebook by companies like Cisco, Cloudkick, Digg, Reddit, Twitter and Walmart Labs.

Of course, Cassandra is just one of many non-relational databases on the market, with the most recent addition coming from Oracle. But Jonathan Ellis, the vice president of the Apache Cassandra project, explained to PCWorld why Cassandra remains competitive:

[Its] architecture is suited for multi-data center environments, because it does not rely on a leader node to coordinate activities of the database. Data can be written to a local node, thereby eliminating the additional network communications needed to coordinate with a sometimes geographically distant master node. Also, because Cassandra is a column-based storage engine, it can store richer data sets than the typical key-value storage engine.
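Ellis's last point, the difference between a key-value engine and a column-based one, can be illustrated loosely with plain Python dictionaries. This is only a sketch of the two data shapes, not Cassandra's API or storage format; the row key, column names, values, and helper functions are all hypothetical.

```python
import json

# Key-value style: the store sees each value as one opaque blob.
kv_store = {"user:42": json.dumps({"name": "Ada", "city": "Boston"})}


def kv_set_field(store, key, field, value):
    """Changing one field means reading, decoding, and rewriting the blob."""
    record = json.loads(store[key])
    record[field] = value
    store[key] = json.dumps(record)


# Column style: each row key maps to individually addressable columns.
column_store = {"user:42": {"name": "Ada", "city": "Boston"}}


def column_set(store, row_key, column, value):
    """Write one column without touching the rest of the row."""
    store.setdefault(row_key, {})[column] = value


def column_get(store, row_key, columns):
    """Read back only the requested columns of a row."""
    row = store.get(row_key, {})
    return {name: row[name] for name in columns if name in row}


kv_set_field(kv_store, "user:42", "city", "Santa Clara")
column_set(column_store, "user:42", "city", "Santa Clara")
column_set(column_store, "user:42", "last_login", "2011-10-20")
print(column_get(column_store, "user:42", ["city", "last_login"]))
```

In the key-value version the application owns the structure inside the blob; in the column version the store itself knows about the columns, so rows can grow new ones and be read a slice at a time.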

Got data news?

Feel free to email me.

February 24 2010

NoSQL conference coming to Boston

On March 11 Boston will join several other cities that have hosted
conferences on the movement broadly known as NoSQL. Cassandra
(http://incubator.apache.org/cassandra/), CouchDB
(http://couchdb.apache.org/), HBase, HypergraphDB, Hypertable,
Memcached, MongoDB, Neo4j, Riak, SimpleDB
(http://aws.amazon.com/simpledb/), Voldemort, and probably other
projects as well will be represented at the one-day affair
(http://nosqlboston.eventbrite.com/).

It's generally understood that characterizing a movement by what it's
not is awkward, and it's hard to find an elevator speech to
encompass all the topics of NoSQL Boston. Are these tools for "big
data" problems? Usually, but sometimes even small web sites can find
them useful. Are the tools meant for processing streams such as log
files? Sometimes, but they can be useful for other text and data
processing as well. And do they reject relational principles? Well, so
you'd think--but different ones reject different principles, so even
there it's hard to find commonality. (I compared them to relational
databases in a blog post last year; see
http://broadcast.oreilly.com/2009/07/relational-databases-as-realit.html)

The interviews I had with various project leaders for this article
turned up a recurring usage pattern for NoSQL. I was seeking
particular domains or types of data where the tools would be useful,
but couldn't see much commonality. What connects the users is that
they carry out web-related data crunching, searching, and other Web
2.0 related work. I think these companies use NoSQL tools because
they're the companies who understand leading-edge technologies and are
willing to take risks in those areas. As the field gets better known,
usage will spread.

I had a talk last week with conference organizer Eliot Horowitz, who
is the founder and CTO of 10gen, the company that makes MongoDB. He
let me know that the conference plans to bypass the head-scratching
and launch into practical applications. The day will contain a coding
session and a schema design session along with keynotes.

The resilience of open source

One question that intrigues me is why all the offerings in the NoSQL
area are open source. Some have commercial add-ons, but the core
technology is provided as free software. The few proprietary products
and services in the market (such as Citrusleaf,
http://citrusleaf.net/index.html) get far less attention. Reasons
seem to include:

  • The market is currently too small. Just as most computing innovations
    start off in research settings, this one is being explored by people
    looking for solutions to their own problems, more than ways to extract
    a profit. Numerous in-house projects exist in this space that are not
    free software (Google's Map/Reduce and BigTable, for instance, and
    Amazon's SimpleDB and Dynamo) but they aren't commercialized either.


  • Experimentation is moving too fast. Most of the projects are just a
    couple of years old, and are rapidly adding features.


  • The ROI is hard to calculate. Horowitz says, "People won't pay for
    anything they don't really understand yet." (Nevertheless, 10gen and
    other companies are commercializing the open source offerings.)


  • Whatever problem an organization is trying to solve, each NoSQL
    offering tends to be a piece of the solution. It has to be tuned for
    and integrated into the organization's architecture, and combined
    with pieces from other places.

The projects in this conference therefore demonstrate the innovative
power of free software. CouchDB and Cassandra are particularly
interesting in this regard because they are community efforts more
than corporate efforts. Both are Apache top-level projects. (Cassandra
was just moved from the incubator to a top-level project on February
17.) CouchDB committer J. Chris Anderson tells me that the Apache
community process ensures a wide range of voices are heard, leading to
(of course) occasional public wrangling but a superior outcome.

The BBC and (according to Anderson) SXSW are among the users of
CouchDB (http://wiki.apache.org/couchdb/CouchDB_in_the_wild). CouchDB
has also been integrated into Ubuntu, Mozilla Messaging is basing
Raindrop (its next-generation messaging platform) on CouchDB, and
even mobile handset manufacturers are looking at it. (O'Reilly Media
also uses CouchDB.)

I also talked to Alan Hoffman of Cloudant (http://cloudant.com/),
which offers a CouchDB cloud service that fills in some of the gaps
left by bare CouchDB (consistent hashing, partitioning, quorum,
etc.). Although a couple of companies offer commercial support, no
single company takes
responsibility for CouchDB. Its community is highly
distributed. Anderson listed 10 Apache committers working for 8
different companies, and nearly 40 other people who contribute
patches. Support takes place on mailing lists (roughly one thousand
messages a month) and IRC channels.
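For readers who haven't met consistent hashing, one of the gaps mentioned above, here is a minimal sketch of the general technique in Python. It is not Cloudant's or CouchDB's implementation; the HashRing class, the node names, the document keys, and the choice of MD5 with 100 virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """A toy consistent-hash ring with virtual points for each node."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points placed per node
        self._keys = []           # sorted positions on the ring
        self._ring = {}           # position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        """Map a string to a large integer position on the ring."""
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        """Place the node's virtual points on the ring."""
        for i in range(self.replicas):
            position = self._hash(f"{node}#{i}")
            self._ring[position] = node
            bisect.insort(self._keys, position)

    def get_node(self, key):
        """Return the node owning the first point clockwise from the key."""
        if not self._keys:
            raise ValueError("ring is empty")
        index = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[self._keys[index]]


# Hypothetical node names and document keys.
ring = HashRing(["node-a", "node-b", "node-c"])
doc_keys = [f"doc-{i}" for i in range(200)]
before = {key: ring.get_node(key) for key in doc_keys}

ring.add_node("node-d")
after = {key: ring.get_node(key) for key in doc_keys}

moved = [key for key in doc_keys if before[key] != after[key]]
print(f"{len(moved)} of {len(doc_keys)} keys moved, all to node-d")
```

The property worth noticing is that adding a node relocates only the keys the new node takes over; everything else stays where it was, which is what makes the scheme attractive for partitioning data across a growing cluster.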

Jonathan Ellis, project chair of Cassandra, calls it an "open source
success story" because it went from a state of near petrification to
vibrant regrowth through open sourcing. Facebook invented it and
brought it to a state where it satisfied their needs. They made it
open source and moved it into the Apache Incubator in 2008 but declared
that they would not be doing further development. It could easily have
receded into obscurity.

Ellis says that he was hired at Rackspace (http://www.rackspace.com/)
and asked to find a distributed data store that was fast and scaled
easily; he decided on Cassandra. Soon after he became a public and
enthusiastic advocate, Digg and Twitter joined Rackspace as users and
developers. Having multiple QA teams test each release--particularly
in very different environments--helps quality immensely. Ellis finds
that Eric Raymond's "many eyes" characterization of open source bug
fixing applies.

Although Cassandra is found mostly as a backing store for web sites
with a lot of users, Ellis thinks it would meet the needs of many
academic and commercial sites, and looks forward to someone offering a
cloud service based on it.

Justin Sheehy, CTO of Basho, maker of
the Riak data store, told me they can confirm the typical advantages
cited for open source. Developers at potential customer sites can try
out the software without going through a bureaucratic procurement
process, and then become internal advocates who function much more
effectively than outside salespeople.

He also says that companies such as Basho offer the best of both
worlds to tentative customers. The backing of a corporation means that
professional services and added tools are available to go along with
the product those customers buy. But because the source is open and
has a community around it, those customers can feel secure that
development and support will continue regardless of the fate of the
originating company. 10gen, of course, plays a similar role for
MongoDB, and Anderson's company, Couchio, offers support for CouchDB.
For projects that are not closely
associated with the backing of one company, the Apache Foundation's
sponsorship helps to ensure continuity.

What are the fault lines in the NoSQL landscape?

Naturally, the projects I've mentioned in this blog borrow ideas from
each other and show tiny variations on common solutions regarding such
things as B-tree storage, replication, solutions to locality of
reference, etc. Experience will eventually lead to a shake-out and a
convergence among surviving projects. In the meanwhile, how can you
get your head around them?

We'll pause here for a word from our sponsors, letting you know that
O'Reilly has published books on CouchDB
(http://oreilly.com/catalog/9780596155896/) and Hadoop
(http://oreilly.com/catalog/9780596521974/) and is developing one
about MongoDB.

Horowitz offers an initial subdivision of projects based on data model
(document, key-value, or tabular), a theme he explored in another
interview
(http://howsoftwareisbuilt.com/2010/02/13/interview-with-eliot-horowitz-cto-of-10gen-mongodb/).

Roger Magoulas, a research director with O'Reilly, further subdivides
projects into those that crunch large data sets in a batch
manner--such as Hadoop--and those that retrieve views of data to
fulfill visitor search requests on web pages or similar tasks. He goes
on to say that you can compare them on the basis of particular
features, such as automatic replication, auto-sharding or
partitioning, and in-memory caches.

The most comprehensive attempts I've seen to make sense of this gangly
crew of projects from a feature standpoint come in a blog post by
Ellis (http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/)
and one by Vineet Gupta
(http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html).
(Gupta's post is labeled "Part 1" and I'd love to see more parts.)
But Sheehy says the various features of the offerings
interact too strongly and have too many subtle variations to fit into
an easy taxonomy. "Many people try to classify the projects, everyone
does it differently, and nobody gets it quite right."

Community features

So who uses these things? To take Horowitz's MongoDB again as an
example, many web sites gravitate toward it because the document
structure makes some things--adding fields to rows, mapping objects to
fields--easier than a relational database does. A few scientific sites
also use MongoDB.
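A rough sketch of why the document structure appeals to web programmers, using plain Python dictionaries as stand-ins for documents. This is not MongoDB's API; the Article class, the collection list, and the find helper are hypothetical names invented for the example.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Article:
    """A hypothetical application object."""
    title: str
    tags: list = field(default_factory=list)


def find(documents, **criteria):
    """Return documents whose fields match every given criterion."""
    return [
        doc for doc in documents
        if all(doc.get(name) == value for name, value in criteria.items())
    ]


collection = []  # stand-in for a document collection

# Mapping an object to a document is a direct field-by-field copy.
collection.append(asdict(Article("NoSQL Boston", ["nosql", "events"])))

# A later document can carry an extra field with no schema migration.
doc = asdict(Article("Cassandra 1.0"))
doc["release"] = "1.0"
collection.append(doc)

print(find(collection, release="1.0"))
```

In a relational database the extra field would typically mean altering a table before the second insert; here the older document simply lacks the field, and queries on it skip that document.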

Riak also has a large following among web sites and startups, but
its customers also include media companies, ad networks, SMS
gateways, analytics firms, and many other types of organizations.

Magoulas finds that an organization's bent is determined by the
background and expertise of its developers. Programmers with lots of
traditional relational database experience tend to be wary of the
recent upstarts, a position reinforced by legacy investments in tools
that depend on their relational database and are sometimes very
expensive.

On the other hand, web programmers look for tools that conform more
closely to the data structures and programming techniques they're used
to, and can actually be "flummoxed" by relational database logic or
abstraction layers on top of the databases. These programmers may
think it intuitive to do the kinds of filtering and sorting that seem
like reinventing the wheel to a traditional RDBMS programmer.
Anderson likes to quote Jacob Kaplan-Moss, the creator of Django, as
saying, "Django may be built for the Web, but CouchDB is built of the
Web. I've never seen software that so completely embraces the
philosophies behind HTTP."

10gen's consultation with MongoDB users includes asking for votes on
new features. They also see a great deal of code contributions in the
driver layer and adapters (sessions, logging, etc.) but not much in
the core. Sheehy said the same is true of Riak: although contributions
to the core are rare, half the client libraries are developed by
outsiders, as are many of the tools.

Rapid change is part of life for NoSQL developers. Anderson says of
CouchDB, "The ancillary APIs have been evolving rapidly in preparation
for our 1.0 release, which should come out in the next few months and
won't differ much from today's trunk. The new APIs include
authentication, authorization, details of Map/Reduce, and functions
for transforming and serving JSON documents as other datatypes such as
HTML or CSV." Horowitz stressed that MongoDB will roll out a lot of
new features over the upcoming year.

One hundred people have signed up for NoSQL Boston so far, and more
than 150 are expected. I'll be there to take it in and try to reduce
it to some high-level insights for this blog.
