August 28 2013

The next wave of startups will be built on security - Technology Review

Intrusion threats are multiplying while cybersecurity is never solved, says investor David Cowan. For him, the future lies in security, and especially in the security of cloud computing, which is for now one of the least mature areas of the cloud... And one that is promising, because these alternatives will be easy to deploy and manage, and more easily updated... Better still, login information itself will tomorrow be the most effective signals (...)

#cloudcomputing #securite #innovation

May 02 2012

Recombinant Research: Breaking open rewards and incentives

In the previous articles in this series I've looked at problems in current medical research, and at the legal and technical solutions proposed by Sage Bionetworks. Pilot projects have shown encouraging results but to move from a hothouse environment of experimentation to the mainstream of one of the world's most lucrative and tradition-bound industries, Sage Bionetworks must aim for its nucleus: rewards and incentives.

Previous article in the series: Sage Congress plans for patient engagement.

Think about the publication system, that wretchedly inadequate medium for transferring information about experiments. Getting the data on which a study was based is incredibly hard; getting the actual samples or access to patients is usually impossible. Just as boiling vegetables drains most of their nutrients into the water, publishing results of an experiment throws away what is most valuable.

But the publication system has been built into the foundation of employment and funding over the centuries. A massive industry provides distribution of published results to libraries and research institutions around the world, and maintains iron control over access to that network through peer review and editorial discretion. Even more important, funding grants require publication (but the data behind the study only very recently). And of course, advancement in one's field requires publication.

Lawrence Lessig, in his keynote, castigated for-profit journals for restricting access to knowledge in order to puff up profits. A chart in his talk showed skyrocketing prices for for-profit journals in comparison to non-profit journals. Lessig is not out on the radical fringe in this regard; Harvard Library is calling the current pricing situation "untenable" in a move toward open access echoed by many in academia.

Lawrence Lessig keynote at Sage Congress.

How do we open up this system that seemed to serve science so well for so long, but is now becoming a drag on it? One approach is to expand the notion of publication. This is what Sage Bionetworks is doing with Science Translational Medicine in publishing validated biological models, as mentioned in an earlier article. An even more extensive reset of the publication model is found in Open Network Biology (ONB), an online journal. The publishers require that an article be accompanied by the biological model, the data and code used to produce the model, a description of the algorithm, and a platform to aid in reproducing results.

But neither of these worthy projects changes the external conditions that prop up the current publication system.

When one tries to design a reward system that gives deserved credit to other things besides the final results of an experiment, as some participants did at Sage Congress, great unknowns loom up. Is normalizing and cleaning data an activity worth praise and recognition? How about combining data sets from many different projects, as a Synapse researcher did for the TCGA? How much credit do you assign researchers at each step of the necessary procedure for a successful experiment?

Let's turn to the case of free software to look at an example of success in open sharing. It's clear that free software has swept the computer world. Most web sites use free software ranging from the server on which they run to the language compilers that deliver their code. Everybody knows that the most popular mobile platform, Android, is based on Linux, although fewer realize that the next most popular mobile platforms, Apple's iPhones and iPads, run on a modified version of the open BSD operating system. We could go on and on citing ways in which free and open source software have changed the field.

The mechanism by which free and open source software staked out its dominance in so many areas has not been authoritatively established, but I think many programmers agree on a few key points:

  • Computer professionals encountered free software early in their careers, particularly as students or tinkerers, and brought their predilection for it into jobs they took at stodgier institutions such as banks and government agencies. Their managers deferred to them on choices for programming tools, and the rest is history.

  • Of course, computer professionals would not have chosen the free tools had they not been fit for the job (and often best for the job). Why is free software so good? Probably because the people creating it have complete jurisdiction over what to produce and how much time to spend producing it, unlike in commercial ventures with requirements established through marketing surveys and deadlines set unreasonably by management.

  • Different pieces of free software are easy to hook up, because one can alter their interfaces as necessary. Free software developers tend to look for other tools and platforms that could work with their own, and provide hooks into them (Apache, free database engines such as MySQL, and other such platforms are often accommodated.) Customers of proprietary software, in contrast, experience constant frustration when they try to introduce a new component or change components, because the software vendors are hostile to outside code (except when they are eager to fill a niche left by a competitor with market dominance). Formal standards cannot overcome vendor recalcitrance--a painful truth particularly obvious in health care with quasi-standards such as HL7.

  • Free software scales. Programmers work on it tirelessly until it's as efficient as it needs to be, and when one solution just can't scale any more, programmers can create new components such as Cassandra, CouchDB, or Redis that meet new needs.

Are there lessons we can take from this success story? Biological research doesn't fit the circumstances that made open source software a success. For instance, researchers start out low on the totem pole in very proprietary-minded institutions, and don't get to choose new ways of working. But the cleverer ones are beginning to break out and try more collaboration. Software and Internet connections help.

Researchers tend to choose formats and procedures on an ad hoc, project by project basis. They haven't paid enough attention to making their procedures and data sets work with those produced by other teams. This has got to change, and Sage Bionetworks is working hard on it.

Research is labor-intensive. It needs desperately to scale, as I have pointed out throughout this article, but to do so it needs entire new paradigms for thinking about biological models, workflow, and teamwork. This too is part of Sage Bionetworks' mission.

Certain problems are particularly resistant in research:

  • Conditions that affect small populations have trouble raising funds for research. The Sage Congress initiatives can lower research costs by pooling data from the affected population and helping researchers work more closely with patients.

  • Computation and statistical methods are very difficult fields, and biological research is competing with every other industry for the rare individuals who know these well. All we can do is bolster educational programs for both computer scientists and biologists to get more of these people.

  • There's a long lag time before one knows the effects of treatments. As Heywood's keynote suggested, this is partly solved by collecting longitudinal data on many patients and letting them talk among themselves.

Another process change has revolutionized the computer field: agile programming. That paradigm stresses close collaboration with the end-users whom the software is supposed to benefit, and a willingness to throw out old models and experiment. BRIDGE and other patient initiatives hold out the hope of a similar shift in medical research.

All these things are needed to rescue the study of genetics. It's a lot to do all at once. Progress on some fronts was more apparent than on others at this year's Sage Congress. But as more people get drawn in, and sometimes fumbling experiments produce maps for changing direction, we may start to see real outcomes from the efforts in upcoming years.

All articles in this series, and others I've written about Sage Congress, are available through a bundle.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR20

May 01 2012

Recombinant Research: Sage Congress plans for patient engagement

Clinical trials are the pathway for approving drug use, but they aren't good enough. That has become clear as a number of drugs (Vioxx being the most famous) have been blessed by the FDA, but disqualified after years of widespread use reveal either lack of efficacy or dangerous side effects. And the measures taken by the FDA recently to solve this embarrassing problem continue the heavy-weight bureaucratic methods it has always employed: more trials, raising the costs of every drug and slowing down approval. Although I don't agree with the opinion of Avik S. A. Roy (reprinted in Forbes) that Phase III trials tend to be arbitrary, I do believe it is time to look for other ways to test drugs for safety and efficacy.

First article in the series: Recombinant Research: Sage Congress Promotes Data Sharing in Genetics.

But the Vioxx problem is just one instance of the wider malaise afflicting the drug industry. They just aren't producing enough new medications, either to solve pressing public needs or to keep up their own earnings. Vicki Seyfert-Margolis of the FDA built on her noteworthy speech at last year's Sage Congress (reported in one of my articles about the conference) with the statistic that drug companies submitted 20% fewer medications to the FDA between 2001 and 2007. Their blockbuster drugs produce far fewer profits than before as patents expire and fewer new drugs emerge (a predicament called the "patent cliff"). Seyfert-Margolis intimated that this crisis is the cause of layoffs in the industry, although I heard elsewhere that the companies are outsourcing more research, so perhaps the downsizing is just a reallocation of the same money.

Benefits of patient involvement

The field has failed to rise to the challenges posed by new complexity. Speakers at Sage Congress seemed to feel that genetic research has gone off the tracks. As the previous article in this series explained, Sage Bionetworks wants researchers to break the logjam by sharing data and code in GitHub fashion. And surprisingly, pharma is hurting enough to consider going along with an open research system. They're bleeding from a situation where as much as 80% of each clinical analysis is spent retrieving, formatting, and curating the data. Meanwhile, Kathy Giusti of the Multiple Myeloma Research Foundation says that in their work, open clinical trials are 60% faster.

Attendees at a breakout session where I sat in, including numerous managers from major pharma companies, expressed confidence that they could expand public or "pre-competitive" research in the direction Sage Congress proposed. The sector left to engage is the one that's central to all this work--the public.

If we could collect wide-ranging data from, say, 50,000 individuals (a May 2013 goal cited by John Wilbanks of Sage Bionetworks, a Kauffman Foundation Fellow), we could uncover a lot of trends that clinical trials are too narrow to turn up. Wilbanks ultimately wants millions of such data samples, and another attendee claimed that "technology will be ready by 2020 for a billion people to maintain their own molecular and longitudinal health data." And Jamie Heywood of PatientsLikeMe, in his keynote, claimed to have demonstrated through shared patient notes that some drugs were ineffective long before the FDA or manufacturers made the discoveries. He decried the current system of validating drugs for use and then failing to follow up with more studies, snorting that, "Validated means that I have ceased the process of learning."

But patients have good reasons to keep a close hold on their health data, fearing that an insurance company, an identity thief, a drug marketer, or even their own employer will find and misuse it. They already have little enough control over it, because the annoying consent forms we always have shoved in our faces when we come to a clinic give away a lot of rights. Current laws allow all kinds of funny business, as shown in the famous case of the Vermont law against data mining, which gave the Supreme Court a chance to say that marketers can do anything they damn please with your data, under the excuse that it's de-identified.

In a noteworthy poll by Sage Bionetworks, 80% of academics claimed they were comfortable sharing their personal health data with family members, but only 31% of citizen advocates would do so. If that 31% is more representative of patients and the general public, how many would open their data to strangers, even when supposedly de-identified?

The Sage Bionetworks approach to patient consent

It's basic research that loses. So Wilbanks and a team have been working for the past year on a "portable consent" procedure. This is meant to overcome the hurdle by which a patient has to be contacted and give consent anew each time a new researcher wants data related to his or her genetics, conditions, or treatment. The ideal behind portable consent is to treat the entire research community as a trusted user.

The current plan for portable consent provides three tiers:

Tier 1

No restrictions on data, so long as researchers follow the terms of service. Hopefully, millions of people will choose this tier.

Tier 2

A middle ground. Someone with asthma may state that his data can be used only by asthma researchers, for example.

Tier 3

Carefully controlled. Meant for data coming from sensitive populations, along with anything that includes genetic information.

Synapse provides a trusted identification service. If researchers find a person with useful characteristics in the last two tiers, and are not authorized automatically to use that person's data, they can contact Synapse with the random number assigned to the person. Synapse keeps the original email address of the person on file and will contact him or her to request consent.

Portable consent also involves a lot of patient education. People will sign up through a software wizard that explains the risks. After choosing portable consent, the person decides how much to put in: 23andMe data, prescriptions, or whatever they choose to release.

Sharon Terry of the Genetic Alliance said that patient advocates currently try to control patient data in order to force researchers to share the work they base on that data. Portable consent loosens this control, but the field may be ready for its more flexible conditions for sharing.

Pharma companies and genetics researchers have lots to gain from access to enormous repositories of patient data. But what do the patients get from it? Leaders in health care already recognize that patients are more than experimental subjects and passive recipients of treatment. The recent ONC proposal for Stage 2 of Meaningful Use includes several requirements to share treatment data with the people being treated (which seems kind of a no-brainer when stated this baldly) and the ONC has a Consumer/Patient Engagement Power Team.

Sage Congress is fully engaged in the patient engagement movement too. One result is the BRIDGE initiative, a joint project of Sage Bionetworks and Ashoka with funding from the Robert Wood Johnson Foundation, to solicit questions and suggestions for research from patients. Researchers can go for years researching a condition without even touching on some symptom that patients care about. Listening to patients in the long run produces more cooperation and more funding.

Portable consent requires a leap of faith, because as Wilbanks admits, releasing aggregates of patient data means that over time, a patient is almost certain to be re-identified. Statistical techniques are just getting too sophisticated and compute power growing too fast for anyone to hide behind current tricks such as using only the first three digits of a five-digit postal code. Portable consent requires the data repository to grant access only to bona fide researchers and to set terms of use, including a ban on re-identifying patients. Still, researchers will have rights to do research, redistribute data, and derive products from it. Audits will be built in.

But as mentioned by Kelly Edwards of the University of Washington, tools and legal contracts can contribute to trust, but trust is ultimately based on shared values. Portable consent, properly done, engages with frameworks like Synapse to create a culture of respect for data.

In fact, I think the combination of the contractual framework in portable consent and a platform like Synapse, with its terms of use, might make a big difference in protecting patient privacy. Seyfert-Margolis cited predictions that 500 million smartphone users will be using medical apps by 2015. But mobile apps are notoriously greedy for personal data and cavalier toward user rights. Suppose all those smartphone users stored their data in a repository with clear terms of use and employed portable consent to grant access to the apps? We might all be safer.

The final article in this series will evaluate the prospects for open research in genetics, with a look at the grip of journal publishing on the field, and some comparisons to the success of free and open source software.

Next: Breaking Open Rewards and Incentives. All articles in this series, and others I've written about Sage Congress, are available through a bundle.

April 23 2012

April 14 2012

MySQL in 2012: Report from Percona Live

The big annual MySQL conference, started by MySQL AB in 2003 and run
by my company O'Reilly for several years, lives on under the able
management of Percona. This
fast-growing company started out doing consulting on MySQL,
particularly in the area of performance, and branched out into
development and many other activities. The principals of this company
wrote the most recent two editions of the popular O'Reilly book High
Performance MySQL.

Percona started offering conferences a couple years ago and decided to
step in when O'Reilly decided not to run the annual MySQL conference
any more. Oracle did not participate in Percona Live, but has
announced its own MySQL
conference for next September.

Percona Live struck me as a success, with about one thousand attendees
and the participation of leading experts from all over the MySQL
world, save for Oracle itself. The big players in the MySQL user
community came out in force: Facebook, HP, Google, Pinterest (the
current darling of the financial crowd), and so on.

The conference followed the pattern laid down by old ones in just
about every way, with the same venue (the Santa Clara Convention
Center, which is near a light-rail but nothing else of interest), the
same food (scrumptious), the standard format of one day of tutorials
and two days of sessions (although with an extra developer day tacked
on, which I will describe later), an expo hall (smaller than before,
but with key participants in the ecosystem), and even community awards
(O'Reilly Media won an award as Corporate Contributor of the Year).
Monty Widenius was back as always with a MariaDB entourage, so it
seemed like old times. The keynotes seemed less well attended than the
ones from previous conferences, but the crowd was persistent and
showed up in impressive numbers for the final events--and I don't
believe it was because everybody thought they might win one of the
door prizes.

Jeremy Zawodny ready to hand out awards.

Two contrasting database deployments

I checked out two well-attended talks by system architects from two
high-traffic sites: Pinterest and craigslist. The radically divergent
paths they took illustrate the range of options open to data centers
nowadays--and the importance of studying these options so a data
center can choose the path appropriate to its mission and

Jeremy Zawodny (co-author of the first edition of High Performance MySQL)
presented the design of craigslist's site, which illustrates the model of
software accretion over time and an eager embrace of heterogeneity.
Among their components are:

  • Memcache, lying between the web servers and the MySQL database in
    classic fashion.

  • MySQL to serve live postings, handle abuse, store data for the
    monitoring system, and meet other immediate needs.

  • MongoDB to store almost 3 billion items related to archived (no longer
    live) postings.

  • HAproxy to direct requests to the proper MySQL server in a cluster.

  • Sphinx for text searches, with
    indexes over all live postings, archived postings, and forums.

  • Redis for temporary items such as counters and blobs.

  • An XFS filesystem for images.

  • Other helper functions that Zawodny lumped together as "async

Care and feeding of this menagerie becomes a job all in itself.
Although craigslist hires enough developers to assign them to
different areas of expertise, they have also built an object layer
that understands MySQL, cache, Sphinx, and MongoDB. The original purpose
of this layer was to aid in migrating old data from MySQL to MongoDB
(a procedure Zawodny admitted was painful and time-consuming) but it
was extended into a useful framework that most developers can use
every day.

Zawodny praised MySQL's durability and its method of replication. But
he admitted that they used MySQL also because it was present when they
started and they were familiar with it. So adopting the newer entrants
into the data store arena was by no means done haphazardly or to try
out cool new tools. Each one precisely meets particular needs of the

For instance, besides being fast and offering built-in sharding,
MongoDB was appealing because they don't have to run ALTER TABLE every
time they add a new field to the database. Old entries coexist happily
with newer ones that have different fields. Zawodny also likes using a
Perl client to interact with a database, and the Perl client provided
by MongoDB is unusually robust because it was developed by 10gen
directly, in contrast to many other datastores where Perl was added by
some random volunteer.

The architecture at craigslist was shrewdly chosen to match their
needs. For instance, because most visitors click on the limited set
of current listings, the Memcache layer handles the vast majority of
hits and the MySQL database has a relatively light load.
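The "classic fashion" of putting a cache in front of the database can be sketched as a read-through lookup. This is only an illustration of the general pattern, not craigslist's code: the `CacheAside` class is hypothetical, a plain dictionary stands in for Memcache, and the `load_from_db` callable stands in for a MySQL query.

```python
class CacheAside:
    """Read-through cache: check the cache first, fall back to the database."""

    def __init__(self, load_from_db):
        self._cache = {}            # stand-in for Memcache (assumption)
        self._load = load_from_db   # stand-in for a MySQL lookup (assumption)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            # Popular listings end up here, so the database is never touched.
            self.hits += 1
            return self._cache[key]
        # Cache miss: read from the database once, then remember the result.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value


# Hypothetical data: one live posting keyed by an invented ID.
fake_db = {"posting-1": "used bicycle"}
cache = CacheAside(fake_db.__getitem__)
for _ in range(3):
    cache.get("posting-1")
print(cache.hits, cache.misses)
```

When most requests target a small set of current listings, nearly every call takes the first branch, which is why the database behind the cache sees a light load.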

However, the MySQL deployment is also carefully designed. Clusters are
vertically partitioned in two nested ways. First, different types of
items are stored on separate partitions. Then, within each type, the
nodes are further divided by the type of query:

  • A single master to handle all writes.

  • A group for very fast reads (such as lookups on a primary key)

  • A group for "long reads" taking a few seconds

  • A special node called a "thrash handler" for rare, very complex

It's up to the application to indicate what kind of query it is
issuing, and HAproxy interprets this information to direct the query
to the proper set of nodes.
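The routing idea can be sketched in a few lines. At craigslist the routing is done by HAproxy; the Python below is only a stand-in to show the logic. The pool names, host names, and the `QueryRouter` class are all hypothetical, and round robin within a pool is an assumption.

```python
import itertools

# Hypothetical node pools for one partition; host names are invented.
POOLS = {
    "write": ["master"],
    "fast_read": ["fast1", "fast2"],
    "long_read": ["long1", "long2"],
    "thrash": ["thrash1"],
}


class QueryRouter:
    """Pick a node according to the query kind the application declares."""

    def __init__(self, pools):
        # One round-robin iterator per pool of nodes.
        self._cycles = {kind: itertools.cycle(nodes)
                        for kind, nodes in pools.items()}

    def route(self, kind):
        if kind not in self._cycles:
            raise ValueError(f"unknown query kind: {kind!r}")
        return next(self._cycles[kind])


router = QueryRouter(POOLS)
print(router.route("write"))
print(router.route("fast_read"))
```

The key design point is that the application, not the proxy, knows whether a query is a write, a quick primary-key lookup, or something expensive, so it has to label each request.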

Naturally, redundancy is built in at every stage (three HAproxy
instances used in round robin, two Memcache instances holding the same
data, two data centers for the MongoDB archive, etc.).

It's also interesting what recent developments have been eschewed by
craigslist. They self-host everything and use no virtualization.
Zawodny admits this leads to an inefficient use of hardware, but
avoids the overhead associated with virtualization. For efficiency,
they have switched to SSDs, allowing them to scale down from 20
servers to only 3. They don't use a CDN, finding that with aggressive
caching and good capacity planning they can handle the load
themselves. They send backups and logs to a SAN.

Let's turn now from the teeming environment of craigslist to the
decidedly lean operation of Pinterest, a much younger and smaller
organization. As presented
by Marty Weiner and Yashh Nelapati, when they started web-scale
growth in the Autumn of 2011, they reacted somewhat like craigslist,
but with much less thinking ahead, throwing in all sorts of software
such as Cassandra and MongoDB, and diversifying a bit recklessly.
Finally they came to their senses and went on a design diet. Their
resolution was to focus on MySQL--but the way they made it work is
unique to their data and application.

They decided against using a cluster, afraid that bad application code
could crash everything. Sharding is much simpler and doesn't require
much maintenance. Their advice for implementing MySQL sharding:

  • Make sure you have a stable schema, and don't add features for a
    couple months.

  • Remove all joins and complex queries for a while.

  • Do simple shards first, such as moving a huge table into its own

They use Pyres, a
Python clone of Resque, to move data into shards.

However, sharding imposes severe constraints that led them to
hand-crafted work-arounds.

Many sites want to leave open the possibility for moving data between
shards. This is useful, for instance, if they shard along some
dimension such as age or country, and they suddenly experience a rush
of new people in their 60s or from China. The implementation of such a
plan requires a good deal of coding, described in the O'Reilly book MySQL High
Availability, including the creation of a service that just
accepts IDs and determines what shard currently contains the ID.

The Pinterest staff decided the ID service would introduce a single
point of failure, and decided just to hard-code a shard ID in every ID
assigned to a row. This means they never move data between shards,
although shards can be moved bodily to new nodes. I think this works
for Pinterest because they shard on arbitrary IDs and don't have a
need to rebalance shards.
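One way to hard-code a shard ID into every row ID is to pack the two numbers into a single integer. The sketch below assumes a 64-bit ID with 16 bits for the shard and 48 bits for the per-shard row number; those widths, the function names, and the node names are my assumptions, not details from the talk.

```python
SHARD_BITS = 16   # assumed width of the shard field
LOCAL_BITS = 48   # assumed width of the per-shard row number


def make_id(shard_id, local_id):
    """Pack a shard ID and a per-shard row number into one integer."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << LOCAL_BITS) | local_id


def split_id(row_id):
    """Recover (shard_id, local_id) without asking any lookup service."""
    return row_id >> LOCAL_BITS, row_id & ((1 << LOCAL_BITS) - 1)


# A shard can still move bodily to a new machine: only this small map
# changes, never the IDs themselves. Node names are hypothetical.
SHARD_TO_NODE = {7: "db-node-a"}

row_id = make_id(7, 12345)
shard, local = split_id(row_id)
print(shard, local, SHARD_TO_NODE[shard])
```

Because the shard is recoverable from the ID alone, no ID service sits on the request path, which removes the single point of failure at the cost of never moving an individual row to another shard.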

Even more interesting is how they avoid joins. Suppose they want to
retrieve all pins associated with a certain board associated with a
certain user. In classical, normalized relational database practice,
they'd have to do a join on the comment, pin, and user tables. But
Pinterest maintains extra mapping tables. One table maps users to
boards, while another maps boards to pins. They query the
user-to-board table to get the right board, query the board-to-pin
table to get the right pin, and then do simple queries without joins
on the tables with the real data. In a way, they implement a custom
NoSQL model on top of a relational database.
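The mapping-table approach can be sketched with a few single-table queries. The example uses SQLite from the Python standard library rather than MySQL so that it runs on its own, and the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pins (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE user_boards (user_id INTEGER, board_id INTEGER);
    CREATE TABLE board_pins (board_id INTEGER, pin_id INTEGER);
    INSERT INTO pins VALUES (100, 'soup'), (101, 'bread'), (102, 'lisbon');
    INSERT INTO user_boards VALUES (1, 10), (1, 11);
    INSERT INTO board_pins VALUES (10, 100), (10, 101), (11, 102);
""")


def pins_on_board(conn, user_id, board_id):
    """Fetch pin titles for a user's board with simple queries and no JOIN."""
    # Step 1: the user-to-board mapping confirms the board belongs to the user.
    owned = conn.execute(
        "SELECT 1 FROM user_boards WHERE user_id = ? AND board_id = ?",
        (user_id, board_id),
    ).fetchone()
    if owned is None:
        return []
    # Step 2: the board-to-pin mapping yields the pin IDs.
    pin_ids = [row[0] for row in conn.execute(
        "SELECT pin_id FROM board_pins WHERE board_id = ? ORDER BY pin_id",
        (board_id,),
    )]
    if not pin_ids:
        return []
    # Step 3: a plain primary-key lookup on the table holding the real data.
    marks = ",".join("?" * len(pin_ids))
    rows = conn.execute(
        f"SELECT title FROM pins WHERE id IN ({marks}) ORDER BY id", pin_ids
    )
    return [row[0] for row in rows]


print(pins_on_board(conn, 1, 10))
```

Each step touches exactly one table, so in a sharded deployment each query could go to whichever shard holds that table's rows, something a multi-table join would not allow.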

Pinterest does use Memcache and Redis in addition to MySQL. As with
craigslist, they find that most queries can be handled by Memcache.
And the actual images are stored in S3, an interesting choice for a
site that is already enormous.

It seems to me that the data and application design behind Pinterest
would have made it a good candidate for a non-ACID datastore. They
chose to stick with MySQL, but like organizations that use NoSQL
solutions, they relinquished key aspects of the relational way of
doing things. They made calculated trade-offs that worked for their
particular needs.

My take-away from these two fascinating and well-attended talks was
that you must understand your application, its scaling and
performance needs, and its data structure, to know what you can
sacrifice and what solution gives you your sweet spot. craigslist
solved its problem through the very precise application of different
tools, each with particular jobs that fulfilled craigslist's
requirements. Pinterest made its own calculations and found an
entirely different solution depending on some clever hand-coding
instead of off-the-shelf tools.

Current and future MySQL

The conference keynotes surveyed the state of MySQL and some
predictions about where it will go.

Conference co-chair Sarah Novotny at keynotes.

The world of MySQL is much more complicated than it was a couple years
ago, before Percona got heavily into the work of releasing patches to
InnoDB, before they created entirely new pieces of software, and
before Monty started MariaDB with the express goal of making a better
MySQL than MySQL. You can now choose among Oracle's official MySQL
releases, Percona's supported version, and MariaDB's supported
version. Because these are all open source, a major user such as
Facebook can even apply patches to get the newest features.

Nor are these different versions true forks, because Percona and
MariaDB create their enhancements as patches that they pass back to
Oracle, and Oracle is happy to include many of them in a later
release. I haven't even touched on the commercial ecosystem around
MySQL, which I'll look at later in this article.

In his
keynote, Percona founder Peter Zaitsev praised the latest MySQL
release by Oracle. With graceful balance he expressed pleasure that
the features most users need are in the open (community) edition, but
allowed that the proprietary extensions are useful too. In short, he
declared that MySQL is less buggy and has more features than ever.

The former
CEO of MySQL AB, Mårten Mickos, also found that MySQL is
doing well under Oracle's wing. He just chastised Oracle for failing
to work as well as it should with potential partners (by which I
assume he meant Percona and MariaDB). He lauded their community
managers but said the rest of the company should support them more.

Keynote by Mårten Mickos.

Brian Aker presented an OpenStack MySQL service developed by his current
employer, Hewlett-Packard. His keynote retold the story that had led
over the years to his developing Drizzle (a true fork of MySQL
that tries to return it to its lightweight, Web-friendly roots) and
eventually working on cloud computing for HP. He described modularity,
effective use of multiple cores, and cloud deployment as the future of

A panel
on the second day of the conference brought together high-level
managers from many of the companies that have entered the MySQL space
from a variety of directions in a high-level discussion of the
database engine's future. Like most panels, the conversation ranged
over a variety of topics--NoSQL, modular architecture, cloud
computing--but hit some depth only on the topic of security, which was
not represented very strongly at the conference and was discussed here
at the insistence of Slavik Markovich from McAfee.

Keynote by Brian Aker.

Many of the conference sessions disappointed me, being either very
high level (although presumably useful to people who are really new to
various topics, such as Hadoop or flash memory) or unvarnished
marketing pitches. I may have judged the latter too harshly though,
because a decent number of attendees came, and stayed to the end, and
crowded around the speakers for information.

Two talks, though, were so fast-paced and loaded with detail that I
couldn't possibly keep my typing up with the speaker.

One such talk was the keynote
by Mark Callaghan of Facebook. (Like the other keynotes, it should be
posted online soon.) A smattering of points from it:

  • Percona and MariaDB are adding critical features that make replication
    and InnoDB work better.

  • When a logical backup runs, it is responsible for 50% of IOPS.

  • Defragmenting InnoDB improves compression.

  • Resharding is not worthwhile for a large, busy site (an insight also
    discovered by Pinterest, as I reported earlier).

The other fact-filled talk was by
Yoshinori Matsunobu of Facebook, and concerned how to achieve
NoSQL-like speeds while sticking with MySQL and InnoDB. Much of the
talk discussed an InnoDB memcached plugin, which unfortunately is
still in the "lab" or "pre-alpha" stage. But he also suggested some
other ways to better performance, some involving Memcache and others
more round-about:

  • Coding directly with the storage engine API, which is storage-engine

  • Using HandlerSocket, which queues write requests and performs them
    through a single thread, avoiding costly fsync() calls. This can
    achieve 30,000 writes per second, robustly.
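The idea behind the HandlerSocket point above, queuing writes and funneling them through one thread so that a single sync covers many requests, can be sketched in a few lines. This is not HandlerSocket's code or API; the class `BatchedWriter`, its `max_batch` parameter, and the `demo` helper are hypothetical names invented for illustration, and a plain append-only file stands in for the storage engine.

```python
import os
import queue
import tempfile
import threading


class BatchedWriter:
    """Funnel writes from many callers through one thread, one fsync per batch."""

    def __init__(self, path, max_batch=1000):
        self._file = open(path, "ab")
        self._queue = queue.Queue()
        self._max_batch = max_batch
        self.fsync_calls = 0  # exposed so callers can see the batching effect
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, record):
        """Queue a record; the returned event is set once it is on disk."""
        done = threading.Event()
        self._queue.put((record, done))
        return done

    def close(self):
        self._queue.put(None)  # sentinel: stop after the pending batch
        self._thread.join()
        self._file.close()

    def _run(self):
        stopping = False
        while not stopping:
            item = self._queue.get()  # block until there is work
            if item is None:
                break
            batch = [item]
            # Drain whatever else has queued up while we were busy.
            while len(batch) < self._max_batch:
                try:
                    extra = self._queue.get_nowait()
                except queue.Empty:
                    break
                if extra is None:
                    stopping = True
                    break
                batch.append(extra)
            self._file.write(b"".join(record for record, _ in batch))
            self._file.flush()
            os.fsync(self._file.fileno())  # one sync covers the whole batch
            self.fsync_calls += 1
            for _, done in batch:
                done.set()


def demo(count=200):
    """Write `count` records and report the file contents and sync count."""
    path = os.path.join(tempfile.mkdtemp(), "writes.log")
    writer = BatchedWriter(path)
    events = [writer.write(b"row%03d\n" % i) for i in range(count)]
    for event in events:
        event.wait()
    writer.close()
    with open(path, "rb") as f:
        return f.read(), writer.fsync_calls
```

Calling `demo()` returns the bytes written and the number of `fsync()` calls. The sync count can never exceed the number of records, and it is usually far smaller when writers outpace the disk, which is the effect the talk attributed to the single-threaded design.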

Matsunobu claimed that many optimizations are available within MySQL
because a lot of data can fit in main memory. For instance, if you
have 10 million users and store 400 bytes per user, the entire user
table can fit in 20 GB. Matsunobu's tests have shown that most CPU time
in MySQL is spent in functions that are not essential for processing
data, such as opening and closing a table. Each statement opens a
separate connection, which in turn requires opening and closing the
table again. Furthermore, a lot of data is sent over the wire besides
the specific fields requested by the client. The solutions in the talk
evade all this overhead.
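The sizing claim above is easy to check with back-of-the-envelope arithmetic. The sketch below computes only the raw row payload; reading the 20 GB figure as a budget that also leaves room for indexes and storage overhead is my assumption, not something stated in the talk. The function name `table_bytes` is made up for this example.

```python
def table_bytes(rows, bytes_per_row):
    """Raw payload size of a table, ignoring indexes and engine overhead."""
    return rows * bytes_per_row


raw = table_bytes(10_000_000, 400)  # 10 million users, 400 bytes each
raw_gb = raw / 10**9                # decimal gigabytes of row data
budget_gb = 20                      # the figure quoted in the talk
headroom = budget_gb / raw_gb       # how many times the raw data fits

print(f"raw payload: {raw_gb:.1f} GB, headroom within 20 GB: {headroom:.1f}x")
```

The raw data comes to 4 GB, so a 20 GB memory budget leaves a fivefold margin for everything else.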

The commercial ecosystem

Both as vendors and as sponsors, a number of companies have always
lent another dimension to the MySQL conference. Some of these really
have nothing to do with MySQL, but offer drop-in replacements for it.
Others really find a niche for MySQL users. Here are a few that I
happened to talk to:

  • Clustrix provides a very
    different architecture for relational data. They handle sharding
    automatically, permitting such success stories as the massive scaling
    up of the social media site Massive Media NV without extra
    administrative work. Clustrix also claims to be more efficient by
    breaking queries into fragments (such as the WHERE clauses of joins)
    and executing them on different nodes, passing around only the data
    produced by each clause.

  • Akiban also offers faster
    execution through a radically different organization of data. They
    flatten the tables of a normalized database into a single
    data structure: for instance, a customer and his orders may be located
    sequentially in memory. This seems to me an import of the document
    store model into the relational model. Creating, in effect, an object
    that maps pretty closely to the objects used in the application
    program, Akiban allows common queries to be executed very quickly, and
    could be deployed as an adjunct to a MySQL database.

  • Tokutek produced a drop-in
    replacement for InnoDB. The founders developed a new data structure
    called a fractal tree as a faster alternative to the B-tree structures
    normally used for indexes. The existence of Tokutek vindicates both
    the open source distribution of MySQL and its unique modular design,
    because these allowed Tokutek's founders to do what they do
    best--create a new storage engine--without needing to create a whole
    database engine with the related tools and interfaces it would

  • Nimbus Data Systems creates a
    flash-based hardware appliance that can serve as a NAS or SAN to
    support MySQL. They support a large number of standard data transfer
    protocols, such as InfiniBand, and provide such optimizations as
    caching writes in DRAM and making sure they write complete 64KB blocks
    to flash, thus speeding up transfers as well as preserving the life of
    the flash.
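To picture the flattening described for Akiban above, where a customer and his orders sit next to each other, here is a minimal sketch. It is not Akiban's implementation or storage format; the functions `flatten` and `customer_with_orders` and the column names `id` and `customer_id` are assumptions chosen for illustration.

```python
from collections import defaultdict


def flatten(customers, orders):
    """Interleave each customer row with its order rows in one sequence."""
    by_customer = defaultdict(list)
    for order in orders:
        by_customer[order["customer_id"]].append(order)
    flat = []
    for customer in sorted(customers, key=lambda c: c["id"]):
        flat.append(("customer", customer))
        for order in sorted(by_customer[customer["id"]], key=lambda o: o["id"]):
            flat.append(("order", order))
    return flat


def customer_with_orders(flat, customer_id):
    """Fetch a customer and all of its orders with one sequential scan, no join."""
    result = []
    collecting = False
    for kind, row in flat:
        if kind == "customer":
            if collecting:
                break  # reached the next customer, so the group is complete
            collecting = row["id"] == customer_id
        if collecting:
            result.append((kind, row))
    return result


customers = [{"id": 2, "name": "Bo"}, {"id": 1, "name": "Al"}]
orders = [
    {"id": 11, "customer_id": 1},
    {"id": 10, "customer_id": 1},
    {"id": 12, "customer_id": 2},
]
flat = flatten(customers, orders)
```

Because related rows are adjacent, reading a customer together with his orders touches one contiguous run of the structure, which is the property that makes the common "object-shaped" queries fast.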

Post-conference events

A low-key developer's day followed Percona Live on Friday. I talked to
people in the Drizzle and Sphinx tracks.

Because Drizzle is a relatively young project, its talks were aimed
mostly at developers interested in contributing. I heard talks about
their kewpie test framework and about build and release conventions.
But in keeping with its goal to make database use easy and
light-weight, the project has added some cool features.

Thanks to a
and a built-in web server, Drizzle now presents you with a
Web interface for entering SQL commands. The Web interface translates
Drizzle's output to simple HTML tables for display, but you can also
capture the JSON directly, making programmatic access to Drizzle
easier. A developer explained to me that you can also store JSON
directly in Drizzle; it is simply stored as a single text column and
the JSON fields can be queried directly. This reminded me of an XQuery
interface added to some database years ago. There too, the XML was
simply stored as a text field and a new interface was added to run the
XQuery selects.
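The pattern of keeping JSON in an ordinary text column and layering a query interface on top can be sketched with Python's standard library. This uses SQLite rather than Drizzle and says nothing about how Drizzle implements the feature; the SQL function name `json_field`, the table `docs`, and the helper functions are hypothetical.

```python
import json
import sqlite3


def json_field(document, key):
    """Return one top-level field of a JSON document stored as text."""
    return json.loads(document).get(key)


def open_store():
    """Create an in-memory database whose only payload column is plain text."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL statements can call it.
    conn.create_function("json_field", 2, json_field)
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    return conn


def insert(conn, doc):
    conn.execute("INSERT INTO docs (body) VALUES (?)", (json.dumps(doc),))


def find_ids(conn, key, value):
    """Select rows by a field inside the stored JSON text."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE json_field(body, ?) = ? ORDER BY id",
        (key, value),
    )
    return [row[0] for row in rows]


conn = open_store()
for doc in ({"city": "Boston"}, {"city": "Portland"}, {"city": "Boston"}):
    insert(conn, doc)
matches = find_ids(conn, "city", "Boston")
```

The database itself knows nothing about JSON here. As with the XQuery interface mentioned above, the structure lives entirely in the added query layer, at the cost of parsing each document on every lookup.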

Sphinx, in contrast to Drizzle, is a mature product with commercial
support and (as mentioned earlier in the article) production
deployments at places such as craigslist, as well as an O'Reilly
book. I understood better, after attending today's sessions, what
makes Sphinx appealing. Its quality is unusually high, due to the use
of sophisticated ranking algorithms from the research literature. The
team is looking at recent research to incorporate even better
algorithms. It is also fast and scales well. Finally, integration with
MySQL is very clean, so it's easy to issue queries to Sphinx and pick
up results.

Recent enhancements include an add-on called fSphinx
to make faceted searches faster (through caching) and easier, and
access to Bayesian Sets to find "items similar to this one." In Sphinx
itself, the team is working to add high availability, include a new
morphology (stemming, etc.) engine that handles German, improve
compression, and make other enhancements.

The day ended with a reception and copious glasses of Monty Widenius's
notorious licorice-flavored vodka, an ending that distinguishes the
MySQL conference from others for all time.

January 16 2012

Medical imaging in the cloud: a conversation about eMix

Over the past few weeks I've been talking to staff at DR Systems about medical imaging in the cloud. DR Systems boasts of offering the first cloud solution for sharing medical images, called eMix. According to my contact Michael Trambert, Lead Radiologist for PACS Reengineering for the Cottage Health System and Sansum Clinic in Santa Barbara, California, eMix started off by offering storage for both images and the reports generated by radiologists, cardiologists, and other imaging specialists. It then expanded to include other medical records in HL7, CDA, CCR, PDF, RTF, and plain text formats. It is vendor neutral, thanks to DICOM (a standard that covers both images and reports) and HL7.

First a bit of background (some of which I offered in an earlier posting). In the U.S., currently, an estimated 30 billion dollars are wasted each year through re-imaging that could be avoided. In addition to cost, there are many reasons to cut down on images: many systems expose patients to small amounts of radiation that pose a cumulative risk over time, and in an emergency situation it's better to reuse a recent image than to waste time taking another.

The situation was brought home by a conversation I had with CIO Chuck Christian of Vincennes, Indiana's Good Samaritan Hospital, a customer of eMix. Patients are often tasked with carrying their own images around (originally as print-outs, and more recently as CD-ROMs). These things often get misplaced, or the CDs turn out to be corrupt or incompatible with the receiving IT system. It's a situation crying out for networked transfer, but HIPAA requires careful attention to security and privacy.

eMix is currently used by about 300 sites, most in the US, and a few in Europe. Uses include remote consulting and sending an eMix image and report "package" to an emergency treatment center ahead of the patient. The eMix package has a built-in viewing capability, so the recipient needs nothing beyond a web browser. Data is protected by encryption on the eMix site and through SSL during transmission.

Sharing is so easy that according to eMix General Manager Florent Saint-Clair, the chief privacy risk in eMix is user error. A sender may type in the wrong email address or accede to a request for an image without ensuring that the recipient is properly authorized to receive it.

This will be an issue with the Direct project, too, when that enters common use. The Direct project will allow the exchange of data over email, but because most doctors' email accounts are not currently secure, eMix just uses email to notify a recipient that an image is ready. Everything else takes place over the Web. The company stresses a number of measures they take to ensure security: for instance, data is always deleted after 30 days, physical security is robust, and storage is on redundant servers.

January 11 2012

Can Maryland's other "CIO" cultivate innovation in government?

Bryan Sivak at OSCON 2010

When Maryland hired Bryan Sivak last April as the state's chief innovation officer, the role had yet to be defined in government. After all, like most other states in the union, Maryland had never had a chief innovation officer before.

Sivak told TechPresident on his second day at work that he wanted to define what it means to build a system for innovation in government:

If you can systemize what it means to be innovative, what it means to challenge the status quo without a budget, without a lot of resources, then you've created something that can be replicated anywhere.

Months later, Sivak (@BryanSivak) has been learning — and sharing — as he goes. That doesn't mean he walked into the role without ideas about how government could be more innovative. Anything but. Sivak's years in the software industry and his tenure as the District of Columbia's chief technology officer equipped him with plenty of ideas, along with some recognition as a Gov 2.0 Hero from Govfresh.

Sivak was a genuine change agent during his tenure in DC. As DCist reported, Sivak oversaw the development of several projects while he was in office, like the District's online service request center and "the incredibly useful TrackDC."

Citizensourcing better ideas

One of the best ideas that Sivak brought to his new gig was culled directly from the open government movement: using collective intelligence to solve problems.

"My job is to fight against the entrenched status quo," said Sivak in an interview this winter. "I'm not a subject expert in 99% of issues. The people who do those jobs, live and breathe them, do know what's happening. There are thousands and thousands of people asking 'why can't we do this this way?' My job is to find them, help them, get them discovered, and connect them."

That includes both internal and external efforts, like a pilot partnership with citizens to report downed trees last year.

An experiment with SeeClickFix during Hurricane Irene in August 2011 had a number of positive effects, explained Sivak. "It made emergency management people realize that they needed to look at this stuff," he said. "Our intention was to get people thinking. The new question is now, 'How do we figure out how to use it?' They're thinking about how to integrate it into their process."

Gathering ideas for making government work better from the public presents some challenges. For instance, widespread public frustration with the public sector can also make citizensourcing efforts a headache to architect and govern. Sivak suggested trying to get upset citizens involved in addressing the problems they highlight in public comments.

"Raise the issue, then channel the negative reactions into fixing the issues," he said. "Why not get involved? There are millions of civil servants trying to do the right thing every day."

In general, Sivak said, "the vast majority of people are there to do a good job. The problem is rules and regulations built up over centuries that prevent us from doing that the best way."

Doing more with less

If innovation is driven by resource constraints, by "doing more with less," Sivak will be in the right place at the right time. Maryland Governor Martin O'Malley's 2012 budget included billions in proposed cuts, including hundreds of millions pared from state agencies. More difficult decisions will be in the 2013 budget as well.

The challenge now is how to bring ideas to fruition in the context of state government, where entrenched bureaucracy, culture, aging legacy IT systems and more prosaic challenges around regulations stand in the way.

One clear direction is to find cost savings in modern IT, where possible. Moving government IT systems into the cloud won't be appropriate in all circumstances. Enterprise email, however, appears to be a ripe area for migration. Maryland is moving to the cloud for email, specifically to Google Apps for Enterprise.

This will merge 57 systems into one, said Sivak. "Everyone is really jazzed." The General Services Administration saved an estimated $11 million for 13,000 employees, according to Sivak. "We hope to save more. People don't factor in upgrade costs."

He's found, however, that legacy IT systems aren't the most significant hindrance to innovation in government. "When I started public service [in D.C. government], procurement and HR were the things I was least interested in," said Sivak. "Now, they are the things I'm most interested in. Fix them and you fix many of the problems."

The problem with reform of these areas, however, is that it's neither a particularly sexy issue for politicians to run on in election years nor one to focus upon in office. Sivak told TechPresident last year "that to be successful with technology initiatives, you need to attack the underlying culture and process first."

Given the necessary focus on the economy and job creation, the Maryland governor's office is also thinking about how to attract and sustain entrepreneurs and small businesses. "We're also working on making research and development benefit the state more," Sivak said. "Maryland is near the top of R&D spending but very low on commercializing the outcomes."

In an extended interview conducted via email (portions posted below), Sivak expanded further about his view of innovation, what's on his task list and some of the projects that he's been working on to date.

How do you define innovation? What pockets of innovation in government inspire you?

Bryan Sivak: Innovation is an overused term in nearly every vertical — both the public and private sectors — which is why a definition is important. My current working definition is something like this:

Innovation challenges existing processes and systems, resulting in the injection, rapid execution and validation of new ideas into the ecosystem. In short, innovation asks "why?" a lot.

Note that this is my current working definition and might change without notice.

What measures has Maryland taken to attract and retain startups?

Bryan Sivak: There are a number of entities across the state that are focused on Maryland's startup ecosystem. Many are on the local level and the private academic side (incubators, accelerators, etc.), but as a state we have organizations that are — at least partially — focused on this as well. TEDCO, for example, is a quasi-public entity focused on encouraging technology development across the state. And the Department of Business and Economic Development has a number of people who are focused on building the state's startup infrastructure.

One of the things I've been focusing on is the "commercialization gap," specifically the fact that Maryland ranks No. 1 per capita in PhD scientists and engineers, No. 1 in federal research and development dollars per capita, and No. 1 in the best public schools in the country, but it is ranked No. 37 in terms of entrepreneurial activity. We are working on coming up with a package to address this gap and to help commercialize technologies that are a result of R&D investment into our academic and research institutions.

What about the cybersecurity industry?

Bryan Sivak: Cybersecurity is a big deal in Maryland, and in 2010, the Department of Business and Economic Development released its Cyber Maryland plan, which contains 10 priorities that the state is working on to make Maryland the cybersecurity hub of the U.S. Given the preponderance of talent and specific institutions in the state, it makes a ton of sense and builds on assets we already have in place.

What have you learned from Maryland's crowdsourcing efforts to date?

Bryan Sivak: We've really just started to dip our toes in the crowdsourcing waters, but it's been very interesting so far. The desire is there — people definitely want to contribute — but what's become very clear is that we need a process in place on the back end to handle incoming items. On the public safety front, for example, most of the issues that get reported by citizens will be dealt with by the locals, as opposed to the state. We need a mechanism for issues to be reported and tracked in a single interface but acted upon by the appropriate entity.

This is much easier on the local side since all groups are theoretically on the same page. We are also building ad-hoc processes on the fly to handle responses to other crowdsourced inputs. For example, we recently asked citizens and businesses for ideas for regulatory reform in the state. In order to make sure these inputs were handled correctly, we created a manual, human-based process to filter the ideas and make sure the right people at the right agencies saw them. This worked well for this initiative, but it is obviously not scalable for implementation on a broad scale.

The conclusion is that the desire and ability for people external to the government to contribute is not going to decrease, so if we are proactive on this issue and try to stay ahead of or with the curve, everyone — government, residents, and businesses — will benefit.

What roles do data and analytics play in Maryland's governance processes and policy making?

Bryan Sivak: They play huge roles. The governor [Martin O'Malley] is well known for his belief in data-driven decision making, which was the impetus behind the creation of CitiStat in Baltimore and StateStat in Maryland. We use dashboards to track nearly every initiative, and this data features prominently in almost every policy discussion. As an example, check out the Governor's Delivery Unit website, where we publish a good amount of analysis we use to track achievement of goals. We are now working on building a robust data warehouse that will not only enable us to provide a deeper level of analysis on the fly, and on both a preset and an ad-hoc basis, but also give us the added benefit of easily publishing raw data to the community at large.

What have you learned through StateStat? How can you realize more value from it through automation?

Bryan Sivak: The StateStat program is incredibly effective in terms of focusing agencies on a set of desired outcomes and rigorously tracking their progress. One of the big challenges, however, is data collection and analysis. Currently, most of the data is collected by hand and entered into Excel spreadsheets for analysis and distribution. This was a great mechanism to get the program up and running, but by building the data warehouse, we will be able to automate a great deal of the data collection and processing that is currently being done manually. We also hope that by connecting data sources directly to the warehouse, we'll be able to get a much more real-time view of the leading indicators and have dashboards that reflect the current moment, as opposed to historical data.

This interview was edited and condensed. Photo by James Duncan Davidson.


October 07 2011

OpenStack Foundation requires further definition

For outsiders, the major news of interest from this week's OpenStack conference in Boston was the announcement of an OpenStack Foundation. I attended the conference yesterday where the official announcement was made, and tried to find out more about the move. But this will be a short posting because there's not much to say. The thinness of detail about the Foundation is probably a good sign, because it means that Rackspace and its partners are seeking input from the community about important parameters.

OpenStack is going to be the universal cloud platform of the future. This is assured by the huge backing and scads of funding from major companies, both potential users and vendors. (Dell and HP had big presences at the show.) Even if the leadership flubs a few things, the backers will pick them up, dust them off, and propel them on their way forward.

But the leadership has made some flubs--just the garden-variety types made by other leaders of other open source projects that are not so fortunate (or unfortunate) to be under such a strong spotlight. Most of the attendees expressed the view that the project, barely a year old, just needs to mature a bit and get through its awkward adolescent phase.

The whole premise of OpenStack is freedom from vendor lock-in. So Rackspace knew its stewardship had to come to an end. One keynoter today suggested that OpenStack invite seasoned leaders from other famous foundations taking the helm of free software projects--Apache, Mozilla, Linux, GNOME--to join its board and give it sage advice. But OpenStack is in a unique position. These other projects had a few years to achieve code stability and gather a robust community before becoming the intense objects of desire among major corporations who, although they undoubtedly benefited the projects, brought competing agendas. OpenStack got the corporate attention first.

It's also making a pilgrimage into a land dominated by giants such as VMware and Microsoft. Interestingly, the people at this conference expressed less concern about the competition presented by those companies than the ambiguous love expressed by companies with complicated relationships to OpenStack, notably Red Hat.

Will the OpenStack Foundation control the code or just manage the business side of the project? How will it attract developers and build community? What role do governments play, given that cloud computing raises substantial regulatory issues? I heard lots of questions like these, all apparently to be decided in the months to come. As one attendee said at the governance forum, "Let's not talk here about details, but about how we're going to talk about details."

And a colleague said to me afterward, "It's exciting to be in at the start of something big." I agree, but other than saying it's big, we don't know much about it.

July 28 2011

July 27 2011

Nebula looks to democratize cloud computing with open source hardware

A new company launched at the Open Source Convention (OSCON) in Portland, Ore. today is making a bid to disrupt the enterprise information technology market. Nebula will combine open source technology developed at NASA with open source hardware developed at Facebook into an appliance that Nebula CEO Chris Kemp is calling a "cloud controller." If Facebook's Open Compute Project looked like a big step forward for infrastructure, operations and the web, Nebula looks like it might be a giant leap. If Nebula succeeds, it could enable every company to implement cloud computing.

"As people face this industrial revolution of big data, they can't use Oracle anymore," said Kemp in an interview at OSCON. "It doesn't scale. We want to be the platform that enables that. We really believe that, if all of this stuff will achieve its potential, in being open, it will reshape the core of computing. We really think there's this new paradigm of computing where people are building on top of infrastructure services instead of infrastructure."

Nebula was founded by Kemp, the former CTO for IT at NASA. The company has recruited tech talent from Google, Amazon, Microsoft and NASA. It is funded by venture capital firms Kleiner Perkins Caufield and Byers and Highland Capital Partners, along with the first three people who invested in Google: Andy Bechtolsheim, David Cheriton and Ram Shriram.

The question that Kemp and his team asked themselves was how they could take OpenStack in its current open source state and make it accessible to everyone. "OpenStack is a great platform," he said. "It's where Linux was 25 years ago. It's like Sun in the early '80s." Now, Kemp says, he hopes to see open source hardware grow in the same way. "I'm eager to see Facebook open source hardware turn into a community," he said. "I want to bring hardware into the conversation. I really want people to start to innovate, breaking out of these monolithic, ivory towers of computing that want to lock you in to Infiniband or Fibre channel or certain blade server sizes. We want open standards, like 10 gigabit ethernet."

Nebula will supply the appliance. "If it fails, FedEx it back to us, and we'll send you another one," Kemp said. "Our little box has a 10 gigabit ethernet switch built into it. You can plug cheap commodity servers into the rack. You don't have to turn them on. It will do that. The interface is like Amazon Services." These servers are monitored by this appliance, including log files and flow data. "What we do is create interface points to all of the common CMDB tools, managing tools, security tools, like ArcSight or Splunk," said Kemp. "We will create integration points for those particular products."

The big bet with Nebula is that this next generation of computing will be based upon open source hardware, and that the community that has made open source an elemental component of the Internet will continue to innovate on top of this platform. The open source model for Nebula isn't novel, as Deborah Gage pointed out in the Wall Street Journal: Cloudera has raised more than $30 million to commercialize Hadoop. Red Hat went public on the strength of its value-added services for Linux.

The paradigm of commodity hardware that's networked together with software isn't new either. That's precisely how Google built its cloud computing infrastructure, as Steven Levy documented in his recent book, "In the Plex." A "data center in a box" isn't a new idea, either. Dell, HP, Cisco and Google have been innovating in that footprint for years.

With the Nebula appliance, "you fill the triple rack full of the cheapest servers money can buy and end up with an Amazon-compatible compute cloud behind your firewall," Kemp said in the interview. That's where Kemp sees an opportunity, in terms of a value proposition for Nebula. Nebula would deliver OpenStack to the enterprise on Open Compute project servers with economics very close to what Google sees with their infrastructure. "We're democratizing web-scale cloud computing and making it turn-key so that you don't have to hire a professional Web services team," he said.

Making that case to enterprise CIOs and business owners that have invested in a given set of systems will require a powerful value proposition. "You don't have huge cost structure, like a Microsoft or VMware, when you're powered by open source technologies," said Kemp. "You can pay yesterday's tech companies and implement yesterday's systems, where you will pay an order of less money for an order of magnitude less capability."

Where Nebula seems to offer something new is in combining open source software with commodity hardware and turning it into a massive private compute cloud that, in theory, businesses with minimal IT experience can deploy. That's a big vision, and one that the world won't be able to fully evaluate until later in 2011. Nebula will be rolled out to six pilot customers this fall in finance, biotech and other industry verticals, said Kemp.

It's worth noting that the technology that drives Nebula comes from one of NASA's flagship open government initiatives, NASA Nebula. Open source has been a key component of NASA's open government work. With the launch of Nebula, an open government initiative looks set to create significant value — and jobs — in the private sector, along with driving open innovation in information technology.

"The next generation of computing will be open, not closed," said Kemp. "We want to see the next 25 years of computing filled with open standards."

April 20 2011

Why the cloud may finally end the reign of the work computer

Work Place by cell105, on Flickr

It's been a debate within organizations as long as I can remember: whether it's possible to support a workforce that has the choice to use their own computers to perform their work. Recently the discussion has reached new levels of excitement as some big name organizations have initiated pilot programs. For IT leaders it's a prospect that's both compelling and daunting.

Technology developments over the years have made software more hardware agnostic, such as the introduction of the web browser and Java. Personal computers have largely become commodity items and their reliability has significantly improved. Yet, despite these events, bringing your own computer (BYOC) to work has remained an elusive goal.

Why bring your own computer to work?

From an IT leader's perspective, the reasons for supporting BYOC are pretty clear. In an environment where CEOs want more of the organization's dollars assigned to value-creating investments and innovation, the ongoing cost of asset management continues to be an unfortunate overhead. From procurement and assignment to repairs and disposal, managing large numbers of personal computers represents a significant dollar amount on a CIO's budget.

The second driver is the desire of employees to use the equipment they are most comfortable with to do their jobs. We know that for most, a personal computer is not simply a black box. From wallpaper to icon positions, a computer often represents an extension of the individual. If anyone needs more convincing, just try and pry an Apple computer away from its user and replace it with a Windows machine (and vice versa). People have preferences. Enterprise-provided computers are a reluctantly accepted reality.

Why can't we bring our own computers to work?

With these compelling reasons and more supporting BYOC, why has it not happened? The first reason that comes to mind for most IT leaders is the nightmare of trying to support hardware from a myriad of vendors. It flies in the face of standardization, which largely helps to keep costs and complexity down. In addition, organizations have continued to build solutions that rely on specific software and hardware requirements and configurations. Finally, there is both a real and perceived loss of control that makes most security and risk professionals shudder.

With all that said, there are now some substantive reasons to believe BYOC may soon become a reality for many organizations.

Times they are a changing

[Many of you can skip this brief history recap] When the web browser emerged in the 1990s, there was some optimism that it would herald the beginning of a world where software would largely become hardware agnostic. Many believed it would make the operating system (OS) largely irrelevant. Of course we know this didn't happen, and software vendors continued to build OS-dependent solutions and organizations recommitted to large-scale, in-house ERP implementations that created vendor lock-ins. At the time, browser technology was inadequate, hosted enterprise applications were weak and often absent for many business functions, and broadband was expensive, inconsistent, and often unreliable across the U.S.

Skip forward and the situation is markedly different. Today we have robust browsers and supporting languages, reliable broadband, and enterprise-class applications that are delivered from hosted providers. It's also not uncommon anymore for staff to use non-business provided, cloud-based consumer applications to perform their work.

Oh to be a start-up! If we could all redo our businesses today, we'd likely avoid building our own data centers and most of our applications. This is one of the promises of cloud computing. And while there will be considerable switching costs for existing organizations, the trend suggests a future where major business functions that are provided by technology will largely be non-competitive, on-demand utilities. In this future state it's entirely possible that hardware independence will become a viable reality. With the application, data, business logic, and security all provisioned in the cloud, the computer really does simply become a portal to information and utility.

Smartphones are already a "bring your own computer" to work device

The smartphone demonstrates all the characteristics of the cloud-provisioned services I've discussed. In many organizations bringing your own smartphone to work is standard practice. Often the employee purchases the device, gets vendor support, and pays for the service themselves (a large number of organizations reimburse the service cost). It's a model that may be emulated with personal computers. (That is, if smartphones don't evolve to become the personal computer. That's another possible outcome.)

I believe fully-embraced cloud computing makes BYOC entirely possible. There will continue to be resistance and indeed, there will be industries where security and control are so inflexible that BYOC will be difficult to attain. There will also be cultural issues. We'll need to overcome the notion that providing a computer is an organizational responsibility. There was a time when most organizations provided salespeople with cars (some still do). Today we expect employees to provide and maintain their own cars, but we do provide mileage reimbursement when they're used for business purposes. Could there be a similar model for employees who use their own computers? Today, for BYOC, some enterprises simply provide a stipend. What works and what doesn't will need to be figured out.

So what now?

So what are the takeaways from all of this? First, BYOC is a real likelihood for many organizations and it's time for IT leadership to grapple with the implications. Second, the emergence of cloud computing will have unanticipated downstream impacts in organizations, and strategies to address those issues will need to be created. Lastly, we've already entered into a slow and painful convergence between smartphones, personal computers, consumer applications and devices, and cloud computing. This needs to be reconciled in a way appropriate to each industry and organization. And it has to happen sooner rather than later.

When the dust settles, the provision of computing services in the enterprise will be entirely different. IT leadership had better be prepared.

Photo: Work Place by cell105, on Flickr


April 15 2011

Wrap-up of 2011 MySQL Conference

Two themes: mix your relational database with less formal solutions and move to the cloud. Those were the messages coming from O'Reilly's MySQL conference this week. Naturally it included many other talks of a more immediate practical nature: data warehousing and business intelligence, performance (both in MySQL configuration and in the environment, which includes the changes caused by replacing disks with Flash), how to scale up, and new features in both MySQL and its children. But everyone seemed to agree that MySQL does not stand alone.

The world of databases has changed both in scale and in use. As Baron Schwartz said in his broad-vision keynote, databases are starting to need to handle petabytes. And he criticized open source database options as having poorer performance than proprietary ones. As for use, the databases struggle to meet two types of requirements: requests from business users for expeditious reports on new relationships, and data mining that traverses relatively unstructured data such as friend relationships, comments on web pages, and network traffic.

Some speakers introduced NoSQL with a bit of sarcasm, as if they had to provide an interface to HBase or MongoDB as a check-off item. At the other extreme, in his keynote, Brian Aker summed up his philosophy about Drizzle by saying, "We are not an island unto ourselves, we are infrastructure."

Judging from the informal audience polls, most of the audience had not explored NoSQL or the cloud yet. Most of the speakers about these technologies offered a mix of basic introductory material and useful practical information to meet the needs of their audience, who came, listened, and asked questions. I heard more give and take during the talks about traditional MySQL topics, because the audience seemed well versed in them.

The analysts and experts are telling us we can save money and improve scalability using EC2-style cloud solutions, and adopt new techniques to achieve the familiar goals of reliability and fast response time. I think a more subtle challenge of the cloud was barely mentioned: it encourages a kind of replication that fuzzes the notion of consistency and runs against the ideal of a unique instance for each item of data. Of course, everyone uses replication for production relational databases anyway, both to avert disaster and for load balancing, so the ideal has been blurred for a long time. As we explore the potential of cloud systems as content delivery networks, they blur the single-instance ideal. Sarah Novotny, while warning about the risks of replication, gave a talk about some practical considerations in making it work, such as tolerating inconsistency for data about sessions.

What about NoSQL solutions, which have co-existed with relational databases for decades? Everybody knows about key-value stores, and Memcache has always partnered with MySQL to serve data quickly. I had a talk with Roger Magoulas, an O'Reilly researcher, about some of the things you sacrifice if you use a NoSQL solution instead of a relational database and why that might be OK.

Redundancy and consistency

Instead of storing an attribute such as "supplier" or "model number" in a separate table, most NoSQL solutions make it a part of the record for each individual member of the database. The increased disk space or RAM required becomes irrelevant in an age when those resources are so cheap and abundant. What's more significant is that a programmer can store any supplier or model number she wants, instead of having to select from a fixed set enforced by foreign key constraints. This can introduce inconsistencies and errors, but practical database experts have known for a long time that perfect accuracy is a chimera (see the work of Jeff Jonas) and modern data analytics work around the noise. When you're looking for statistical trends, whether in ocean samples or customer preferences, you don't care whether 800 of your 8 million records have corrupt data in the field you're aggregating.
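The trade-off can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the table names, the "Acme" supplier, and the misspelled "Acmee" are invented, SQLite stands in for a relational database, and a plain list of dictionaries stands in for a NoSQL store.

```python
import sqlite3

# Relational style: supplier names live in their own table, and a foreign
# key constraint rejects any widget that names an unknown supplier.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
db.execute("CREATE TABLE supplier (name TEXT PRIMARY KEY)")
db.execute(
    "CREATE TABLE widget ("
    " model TEXT PRIMARY KEY,"
    " supplier TEXT NOT NULL REFERENCES supplier(name))"
)
db.execute("INSERT INTO supplier VALUES ('Acme')")
db.execute("INSERT INTO widget VALUES ('W-100', 'Acme')")

try:
    db.execute("INSERT INTO widget VALUES ('W-200', 'Acmee')")  # misspelled
    relational_accepted_typo = True
except sqlite3.IntegrityError:
    relational_accepted_typo = False

# Document style: each record carries its own copy of the supplier name,
# so nothing stops the misspelled value from being stored.
documents = [
    {"model": "W-100", "supplier": "Acme"},
    {"model": "W-200", "supplier": "Acmee"},
]
distinct_suppliers = {doc["supplier"] for doc in documents}

# The scale of noise mentioned above: 800 bad records out of 8 million.
corrupt_fraction = 800 / 8_000_000  # one record in ten thousand
```

The relational insert fails on the typo while the document store quietly ends up with two spellings of one supplier. Whether that matters depends on the analysis: at one bad record in ten thousand, an aggregate trend is unlikely to move.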

Decentralizing decisions and control

A relational database potentially gives the DBA most of the power: it is the DBA that creates the schema, defines stored procedures and triggers, and detects and repairs inconsistencies. Modern databases such as MySQL have already blurred the formerly rigid boundaries between DBA and application programmer. In the early days, the programmer had to do a lot of the work that was reserved for DBAs in traditional databases. NoSQL clearly undermines the control freaks even more. As I've already said, enforcing consistency is not as important nowadays as it once seemed, and modern programming languages offer other ways (such as enums and sets) to prevent errors.


I think this is still the big advantage of relational databases. Their complicated schemas and join semantics allow data mining and extended uses that evolve over the years. Many NoSQL databases are designed around the particular needs of an organization at a particular time, and require records to be ordered in the way you want to access them. And I think this is why, as discussed in some of the sessions at this conference, many people start with their raw data in some NoSQL data store and leave the results of their processing in a relational database.

The mood this week was business-like. I've attended conferences held by emerging communities that crackle with the excitement of building something new that no one can quite anticipate; the MySQL conference wasn't like that. The attendees have a job to do and an ROI to prove; they wanted to learn whatever would help them do that.

But the relational model still reflects most of the data handling needs we have, so MySQL will stick around. This may actually be the best environment it has ever enjoyed. Oracle still devotes a crack programming team (which includes several O'Reilly authors) to meeting corporate needs through performance improvements and simple tools. Monty Program has forked off MariaDB and Percona has popularized XtraDB, all the while contributing new features under the GPL that any implementation can use. Drizzle strips MySQL down to a core while making old goals such as multi-master replication feasible. A host of companies in layered applications such as business intelligence cluster around MySQL.

MySQL spawns its own alternative access modes, such as the HandlerSocket plugin that returns data quickly to simple queries while leaving the full relational power of the database in place for other uses. And vendors continue to find intriguing alternatives such as the Xeround cloud service that automates fail-over, scaling, and sharding while preserving MySQL semantics. I don't think any DBA's skills will become obsolete.

April 13 2011

What VMware's Cloud Foundry announcement is about

I chatted today about VMware's Cloud Foundry with Roger Bodamer, the EVP of products and technology at 10Gen. 10Gen's MongoDB is one of three back-ends (along with MySQL and Redis) supported from the start by Cloud Foundry.

If I understand Cloud Foundry and VMware's declared "Open PaaS" strategy, it should fill a gap in services. Suppose you are a developer who wants to loosen the bonds between your programs and the hardware they run on, for the sake of flexibility, fast ramp-up, or cost savings. Your choices are:

  • An IaaS (Infrastructure as a Service) product, which hands you an emulation of bare metal where you run an appliance (which you may need to build up yourself) combining an operating system, application, and related services such as DNS, firewall, and a database.

  • You can implement IaaS on your own hardware using a virtualization solution such as VMware's products, Azure, Eucalyptus, or RPM. Alternatively, you can rent space on a service such as Amazon's EC2 or Rackspace.

  • A PaaS (Platform as a Service) product, which operates at a much higher level. A vendor such as

By now, the popular APIs for IaaS have been satisfactorily emulated so that you can move your application fairly easily from one vendor to another. Some APIs, notably OpenStack, were designed explicitly to eliminate the friction of moving an app and increase the competition in the IaaS space.

Until now, the PaaS situation was much more closed. VMware claims to do for PaaS what Eucalyptus and OpenStack want to do for IaaS. VMware has a conventional cloud service called Cloud Foundry, but will offer the code under an open source license. RightScale has already announced that you can use it to run a Cloud Foundry application on EC2. And a large site could run Cloud Foundry on its own hardware, just as it runs VMware.

Cloud Foundry is aggressively open middleware, offering a flexible way to administer applications with a variety of options on the top and bottom. As mentioned already, you can interact with MongoDB, MySQL, or Redis as your storage. (However, you have to use the particular API offered by each back-end; there is no common Cloud Foundry interface that can be translated to the chosen back end.) You can use Spring, Rails, or Node.js as your programming environment.

So open source Cloud Foundry may prove to be a step toward more openness in the cloud arena, as many people call for and I analyzed in a series of articles last year. VMware will, if the gamble pays off, gain more customers by hedging against lock-in and will sell its tools to those who host PaaS on their own servers. The success of the effort will depend on the robustness of the solution, ease of management, and the rate of adoption by programmers and sites.

December 22 2010

Reaching the pinnacle: truly open web services and clouds

Previous section:

Why web services should be released as free software

Free software in the cloud isn't just a nice-sounding ideal or even an efficient way to push innovation forward. Opening the cloud also opens the path to a bountiful environment of computing for all. Here are the steps to a better computing future.

Provide choice

The first layer of benefits when companies release their source code
is incremental: incorporating bug fixes, promoting value-added
resellers, finding new staff among volunteer programmers. But a free
software cloud should go far beyond this.

Remember that web services can be run virtually now. When you log in
to a site to handle mail, CRM, or some other service, you may be
firing up a virtual service within a hardware cloud.

So web and cloud providers can set up a gallery of alternative
services, trading off various features or offering alternative
look-and-feel interfaces. Instead of just logging into a site and accepting whatever the administrators have put up
that day, users could choose from a menu, and perhaps even upload
their own preferred version of the service. The SaaS site would then
launch the chosen application in the cloud. Published APIs would allow
users on different software versions to work together.

If a developer outside the company creates a new version with
substantial enhancements, the company can offer it as an option. If
new features slow down performance, the company can allow clients to
decide whether the delays are worth it. To keep things simple for
casual clients, there will probably always be a default service, but
those who want alternatives can have them.

Vendors can provide "alpha" or test sites where people can try out new
versions created by the vendor or by outsiders. Like stand-alone
software, cloud software can move through different stages of testing
and verification.

And providing such sandboxes can also be helpful to developers in
general. A developer would no longer have to take the trouble to
download, install, and configure software on a local computer to do
development and testing. Just log into the sandbox and play.
Google offers
The Go Playground
to encourage students of their Go language. CiviCRM,
which is a free software server (not a cloud or web service) offers a
sandbox for testing new
features. A web service company in electronic health records,
Practice Fusion,
which issued an API challenge in September, is now creating a sandbox
for third-party developers to test the API functionality on its
platform. I would encourage web and cloud services to go even
farther: open their own source code and provide sandboxes for people
to rewrite and try out new versions.

Let's take a moment for another possible benefit of running a
service as a virtual instance. Infected computer systems present a
serious danger to users (who can suffer from identity theft if their
personal data is scooped up) and other systems, which can be
victimized by denial-of-service attacks or infections of their own.
One proposed remedy is trusted computing, an awkward tower of
authorizations reaching right down into the firmware or hardware. In
trusted computing, the computer itself checks
to make sure that a recognized and uncompromised operating system is
running at boot time. The operating system then validates each
application before launching it.

Trusted computing is Byzantine and overly controlling. The hardware
manufacturer gets to decide which operating system you use, and
through that which applications you use. Wouldn't users prefer
to run cloud instances that are born anew each time they log in? That
would wipe out any infection and ensure a trusted environment at the
start of each session without cumbersome gatekeeping.

Loosen the bonds on data

As we've seen, one of the biggest fears keeping potential clients away
from web services and cloud computing is the risk entailed in leaving
their data in the hands of another company. Here it can get lost,
stolen, or misused for nefarious purposes.

But data doesn't have to be stored on the computer where the
processing is done, or even at the same vendor. A user could fire up a
web or cloud service, submit a data source and data store, and keep
results in the data store. IaaS-style cloud computing involves
encrypted instances of operating systems, and if web services did the
same, users would automatically be protected from malicious
prying. There is still a potential privacy issue whenever a user runs
software on someone else's server, because it could skim off private
data and give it to a marketing firm or law enforcement.

Alert web service vendors such as Google know they have to assuage
user fears of locked-in data. In Google's case, they created an
engineering team called the Data Liberation Front (see an article by two
Google employees,

The Case Against Data Lock-in
). This will allow users to extract
their data in a format that makes it feasible to reconstitute it in
its original format on another system, but it doesn't actually sever
the data from the service as I'm suggesting.

A careful client would store data in several places (to guard against
loss in case one has a disk failure or other catastrophe). The client
would then submit one location to the web service for processing, and
store the data back in all locations or store it in the original
source and then copy it later, after making sure it has not been

A liability issue remains when calculation and data are separated. If
the client experiences loss or corruption, was the web service or the
data storage service responsible? A ping-pong scenario could easily
develop, with the web services provider saying the data storage
service corrupted a disk sector, the data storage service saying the
web service produced incorrect output, and the confused client left
furious with no recourse.

This could perhaps be solved by a hash or digest, a stable and
widely used technique that ensures that any change to the data, even
the flip of a single bit, produces a different output value. A digest
is a small number that represents a larger batch of data. Algorithms
that create digests are fast but generate output that's reasonably
unguessable. Each time the same input is submitted to the algorithm,
it is guaranteed to generate the same digest, but any change to the
input (through purposeful fiddling or an inadvertent error) will
produce a different digest.
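A minimal sketch of those two properties, using Python's standard hashlib. The choice of SHA-256 and the sample data are my own assumptions; the argument here does not depend on any particular algorithm.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


original = b"patient-count=1024"

# Flip the lowest bit of the first byte and leave everything else alone.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

same_input_same_digest = digest(original) == digest(original)
one_bit_changes_digest = digest(original) != digest(flipped)

# The digest has a fixed size no matter how large the input is:
# 256 bits, written here as 64 hexadecimal characters.
digest_length = len(digest(original))
```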

The web service could log each completed activity along with the
digest of the data it produces. The data service writes the data,
reads it back, and computes a new digest. Any discrepancy signals a
problem on the data service side, which it can fix by repeating the
write. In the future, if data is corrupted but has the original
digest, the client can blame the web service, because the web service
must have written corrupt data in the first place.
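The bookkeeping might look something like the following sketch. Everything in it is hypothetical: the DataService class, the activity log, and the who_is_responsible helper are names I made up to illustrate the idea, and an in-memory dictionary stands in for real storage.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DataService:
    """Hypothetical storage service that checks each write by reading it back."""

    def __init__(self):
        self._store = {}

    def write(self, key: str, data: bytes, expected_digest: str,
              max_attempts: int = 3) -> bool:
        for _ in range(max_attempts):
            self._store[key] = bytes(data)
            # Read back what was stored and compare digests; on a
            # mismatch, repeat the write.
            if digest(self._store[key]) == expected_digest:
                return True
        return False

    def read(self, key: str) -> bytes:
        return self._store[key]


def who_is_responsible(logged_digest: str, stored_data: bytes,
                       data_is_wrong: bool) -> str:
    """Assign blame by comparing stored data with the web service's log."""
    if digest(stored_data) != logged_digest:
        return "data service"  # the data changed after it was written
    return "web service" if data_is_wrong else "nobody"


# The web service logs the digest of each result it produces.
activity_log = {}
service = DataService()
result = b"total=42"
activity_log["job-1"] = digest(result)
stored_ok = service.write("job-1", result, activity_log["job-1"])
```

If the client later finds bad data whose digest still matches the log, the data service stored exactly what it was given, so the fault lies upstream.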

Sascha Meinrath, a wireless networking expert, would like to see
programs run both on local devices and in the cloud. Each
program could exploit the speed and security of the local device but
reach seamlessly back to remote resources when necessary, rather like
a microprocessor uses the local caches as much as possible and faults
back to main memory when needed. Such a dual arrangement would offer
flexibility, making it possible to continue work offline, keep
particularly sensitive data off the network, and let the user trade
off compute power for network usage on a case-by-case basis. (Wireless
use on a mobile device can also run down the battery real fast.)
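The cache analogy suggests a simple shape for such a program. The sketch below is my own illustration, not anything Meinrath has specified: HybridStore and its methods are invented names, and a dictionary stands in for the remote service.

```python
class HybridStore:
    """Serve reads locally when possible; reach out to the remote on a miss."""

    def __init__(self, fetch_remote, online=True):
        self._local = {}
        self._fetch_remote = fetch_remote
        self.online = online
        self.remote_calls = 0

    def put_local(self, key, value):
        # Data stored this way stays on the device and never touches the network.
        self._local[key] = value

    def get(self, key):
        if key in self._local:
            return self._local[key]
        if not self.online:
            raise LookupError(f"{key!r} is not cached and the device is offline")
        self.remote_calls += 1
        value = self._fetch_remote(key)
        self._local[key] = value  # keep a copy for offline work
        return value


remote_data = {"report": "Q4 figures"}  # stand-in for a cloud service
store = HybridStore(fetch_remote=remote_data.__getitem__)
store.put_local("diary", "private notes")

first = store.get("report")         # miss: one network round trip
second = store.get("report")        # hit: served from the local copy
store.online = False
offline_read = store.get("report")  # still available without the network
```

Only the first read costs a network round trip, which also speaks to the battery concern.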

Before concluding, I should touch on another trend that some
developers hope will free users from proprietary cloud services:
peer-to-peer systems. The concept behind peer-to-peer is appealing and
has been gaining more attention recently:
individuals run servers on their systems at home or work and serve up
the data they want. But these systems are hard to implement, for reasons I
laid out in two articles,
From P2P to Web Services: Addressing and Coordination and
From P2P to Web Services: Trust. Running your own
software is somewhat moot anyway, because you're well advised to store
your data somewhere else in addition to your own system. So long as
you're employing a back-up service to keep your data safe in case of
catastrophe, you might as well take advantage of other cloud services
as well.

I also don't believe that individual sites maintained by
individuals will remain the sources for important data, as the
peer-to-peer model postulates. Someone is going to mine that data and
aggregate it--just look at the proliferation of Twitter search
services. So even if users try to live the ideal of keeping control
over their data, and use distributed technologies like the
Diaspora project,
they will end up surrendering at least some control and data to a

A sunny future for clouds and free software together

The architecture I'm suggesting for computing makes free software even
more accessible than the current practice of putting software on the
Internet where individuals have to download and install it. The cloud
can make free software as convenient as Gmail. In fact, for free
software that consumes a lot of resources, the cloud can open it up to
people who can't afford powerful computers to run the software.

Web service offerings would migrate to my vision of a free software
cloud by splitting into several parts, any or all of them free
software. A host would simply provide the hardware and
scheduling for the rest of the parts. A guest or
appliance would contain the creative software implementing
the service. A sandbox with tools for compilation, debugging,
and source control would make it easy for developers to create new
versions of the guest. And data would represent the results
of the service's calculations in a clearly documented
format. Customers would run the default guest, or select another guest
on the vendor's site or from another developer. The guest would output
data in the standardized format, to be stored in a location of the
customer's choice and resubmitted for the next run.
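To make the division of labor concrete, here is a toy model of three of the four parts (the sandbox is left out). The Host class, the two word-counting guests, and the use of JSON as the documented format are all assumptions of mine, chosen only to show how the pieces would hand data to one another.

```python
import json
from typing import Callable, Dict

# A guest is the software implementing the service: data in, data out.
Guest = Callable[[dict], dict]


def default_guest(data: dict) -> dict:
    """The vendor's default service: count the words in a text."""
    return {"words": len(data["text"].split())}


def alternative_guest(data: dict) -> dict:
    """An outside developer's version that adds a character count."""
    text = data["text"]
    return {"words": len(text.split()), "chars": len(text)}


class Host:
    """Supplies the runtime and scheduling, and nothing else."""

    def __init__(self) -> None:
        self._guests: Dict[str, Guest] = {}

    def register(self, name: str, guest: Guest) -> None:
        self._guests[name] = guest

    def run(self, data_in: str, guest_name: str = "default") -> str:
        # Data arrives and leaves in the documented format (JSON here),
        # so any guest can read what another guest produced.
        guest = self._guests[guest_name]
        return json.dumps(guest(json.loads(data_in)), sort_keys=True)


host = Host()
host.register("default", default_guest)
host.register("alternative", alternative_guest)

# The customer keeps the data in a store of their own choosing.
customer_store = {"input": json.dumps({"text": "free software in clouds"})}
customer_store["default"] = host.run(customer_store["input"])
customer_store["alternative"] = host.run(customer_store["input"], "alternative")
```

The host never keeps the data: each run takes its input from the customer's store and hands the result straight back.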

With cloud computing, the platform you're on is no longer
important. The application is everything and the computer is (almost)
nothing. The application itself may also devolve into a variety of
mashed-up components created by different development teams and
communicating over well-defined APIs, a trend I suggested almost a
decade ago in an article titled

Applications, User Interfaces, and Servers in the Soup

The merger of free software with cloud and web services is a win-win.
The convenience of IaaS and PaaS opens up opportunities for
developers, whereas SaaS simplifies the use of software and extends its
reach. Opening the source code, in turn, makes the cloud more
appealing and more powerful. The transition will take a buy-in from
cloud and SaaS providers, a change in the software development
process, a stronger link between computational and data clouds, and
new conventions to be learned by clients of the services. Let's get
the word out.

(I'd like to thank Don Marti for suggesting additional ideas for this
article, including the fear of creating a two-tier user society, the
chance to shatter the tyranny of IT departments, the poor quality of
source code created for web services, and the value of logging
information on user interaction. I would also like to thank Sascha
Meinrath for the idea of seamless computing for local devices and the
cloud, Anne Gentle for her idea about running test and production
systems in the same cloud, and Karl Fogel for several suggestions,
especially the value of usage statistics for programmers of web

December 21 2010

Four short links: 21 December 2010

  1. Cash Cow Disease -- quite harsh on Google and Microsoft for "ingesting not investing" in promising startups, then disconnecting them from market signals. Like pixie dust, potential future advertising revenues can be sprinkled on any revenue-negative scheme to make it look brilliant. (via Dan Martell)
  2. Your Apps Are Watching You (Wall Street Journal) -- the iPhone apps transmitted more data than the apps on phones using Google Inc.'s Android operating system [...] Both the Android and iPhone versions of Pandora, a popular music app, sent age, gender, location and phone identifiers to various ad networks. iPhone and Android versions of a game called Paper Toss—players try to throw paper wads into a trash can—each sent the phone's ID number to at least five ad companies. Grindr, an iPhone app for meeting gay men, sent gender, location and phone ID to three ad companies. [...] Among all apps tested, the most widely shared detail was the unique ID number assigned to every phone. It is effectively a "supercookie," [...] on iPhones, this number is the "UDID," or Unique Device Identifier. Android IDs go by other names. These IDs are set by phone makers, carriers or makers of the operating system, and typically can't be blocked or deleted. "The great thing about mobile is you can't clear a UDID like you can a cookie," says Meghan O'Holleran of Traffic Marketplace, an Internet ad network that is expanding into mobile apps. "That's how we track everything."
  3. On Undo's Undue Importance (Paul Kedrosky) -- The mainstream has money and risks, and so it cares immensely. It wants products and services where big failures aren't catastrophic, and where small failures, the sorts of thing that "undo" fixes, can be rolled back. Undo matters, in other words, because its appearance almost always signals that a market has gone from fringe to mainstream, with profits set to follow. (via Tim O'Reilly on Twitter)
  4. libimobiledevice -- open source library that talks the protocols to support iPhone®, iPod Touch®, iPad® and Apple TV® devices without jailbreaking or proprietary libraries.

December 20 2010

Why web services should be released as free software

Previous section:

Why clouds and web services will continue to take over computing

Let's put together a pitch for cloud and web service providers. We have two hurdles to leap: one persuading them how they'll benefit by releasing the source code to their software, and one addressing their fear of releasing the source code. I'll handle both tasks in this section, which will then give us the foundation to look at a world of free clouds and web services.

Cloud and web service providers already love free software

Reasons for developing software as open source have been told and
retold many times; popular treatments include Eric S. Raymond's
essays in the collection

The Cathedral and the Bazaar
(which O'Reilly puts out
in print),
and Yochai Benkler's Wealth of Networks
(available online
and published by Yale University Press). But cloud and web service
companies don't have to be sold on free software--they use it all the
time.
The cornucopia of tools and libraries produced by projects such as the
open source Ruby on Rails make it the first stop on many
services' search for software. Lots of them still code pages in
other open source tools and languages such as PHP and jQuery. Cloud
providers universally base their offerings on Linux, and many use open
source tools to create their customers' virtual systems.
( currently bases its cloud offerings on Xen, and
KVM, heavily backed
by Red Hat, is also a contender.) The best monitoring tools are also
free software. In general, free software is sweeping through the
cloud. (See also

Open Source Projects for Cloud on Rise, According to Black Duck Software Analysis.)

So cloud and web service providers live the benefits of free software
every day. They know the value of communities who collaborate to
improve and add new layers to software. They groove on the convenience
of loading as much software as they want on any systems without
struggling with a license server. They take advantage of frequent
releases with sparkling new features. And they know that there are
thousands of programmers out in the field familiar with the software,
so hiring is easier.

And they give back to open source communities too: they contribute
money, developer time, and valuable information about performance
issues and other real-life data about the operation of the software.

But what if we ask them to open their own code? We can suggest that
they can have better software by letting their own clients--the
best experts in how their software is used--try it out and look
over the source for problems. Web service developers also realize that
mash-ups and extensions are crucial in bringing them more traffic, so
one can argue that opening their source code will make it easier for
third-party developers to understand it and write to it.

Web and cloud services are always trying to hire top-notch programmers
too, and it's a well-established phenomenon that releasing the
source code to a popular product produces a cadre of experts out in
the field. Many volunteers submit bug fixes and enhancements in order
to prove their fitness for employment--and the vendors can pick
up the best coders.

These arguments might not suffice to assail the ramparts of vendors'
resistance. We really need to present a vision of open cloud computing
and persuade vendors that their clients will be happier with services
based on free software. But first we can dismantle some of the fear
around making source code open.

No reason to fear opening the source code

Some cloud and web providers, even though they make heavy use of free
software internally, may never have considered releasing their own
code because they saw no advantages to it (there are certainly
administrative and maintenance tasks associated with opening source
code). Others are embarrassed about the poor structure and coding
style of their fast-changing source code.

Popular methodologies for creating Web software can also raise a
barrier to change. Companies have chosen over the past decade to
feature small, tight-knit teams who communicate with each other and
stakeholders informally and issue frequent software releases to try
out in the field and then refine. Companies find this process more
"agile" than the distributed, open-source practice of putting
everything in writing online, drawing in as broad a range of
contributors as possible, and encouraging experiments on the side. The
agile process can produce impressive results quickly, but it places an
enormous burden on a small group of people to understand what clients
want and massage it into a working product.

We can't move cloud and SaaS sites to free software, in any case, till
we address the fundamental fear some of these sites share with
traditional proprietary software developers: that someone will take
their code, improve it, and launch a competing service. Let's turn to
that concern.

If a service releases its code under the GNU Affero General Public
License, as mentioned in the
previous section,
anyone who improves it and runs a web site with the improved code is
legally required to release their improvements. So we can chip away at
the resistance with several arguments.

First, web services win over visitors through traits that are
unrelated to the code they run. Traits that win repeat visits include:

  • Staying up (sounds so simple, but needs saying)

  • The network effects that come from people inviting their friends or
    going where the action is--effects that give the innovative
    vendor a first-mover advantage

  • A site's appearance and visual guides to navigation, which
    include aspects that can be trademarked

  • Well-designed APIs that facilitate the third-party applications
    mentioned earlier

So the source code to SaaS software isn't as precious a secret
as vendors might think. Anyway, software is more and more a commodity
nowadays. That's why a mind-boggling variety of JavaScript
frameworks, MVC platforms, and even whole new programming languages
are being developed for the vendors' enjoyment. Scripting
languages, powerful libraries, and other advances speed up the pace of
development. Anyone who likes the look of a web service and wants to
create a competitor can spin it up in record time for low cost.

Maybe we've softened up the vendors some. Now, on to the
pinnacle of cloud computing--and the high point on which this
article will end--a vision of the benefits a free cloud could
offer to vendors, customers, and developers alike.

Next section:
Reaching the pinnacle: truly open web services and clouds.

December 17 2010

Why clouds and web services will continue to take over computing


What are the chances for a free software cloud?

  • Resolving the contradictions between web services, clouds, and open source (12/13)
  • Defining clouds, web services, and other remote computing (12/15)
  • Why clouds and web services will continue to take over computing (12/17)
  • Why web services should be released as free software (12/20)
  • Reaching the pinnacle: truly open web services and clouds (12/22)

Additional posts in this 5-part series are available here.

Previous section:

Definitions: Clouds, web services, and other remote computing

The tech press is intensely occupied and pre-occupied with analyzing the cloud from a business point of view. Should you host your operations in a cloud provider? Should you use web services for office work? The stream of articles and blogs on these subjects shows how indisputably the cloud is poised to take over.

But the actual conclusions these analysts reach are intensely
conservative: watch out, count up your costs carefully, look closely
at regulations and liability issues that hold you back, etc.
The analysts are obsessed with the cloud, but they're not
encouraging companies to actually use it--or at least
they're saying we'd better put lots of thought into it.

My long-term view convinces me we all WILL be in the cloud.
No hope in bucking the trend. The advantages are just too compelling.

I won't try to replicate here the hundreds and hundreds of
arguments and statistics produced by the analysts. I'll just run
quickly over the pros and cons of using cloud computing and web
services, and why they add up to a ringing endorsement. That will help
me get to the question that really concerns this article: what can we
do to preserve freedom in the cloud?

The promise of the cloud shines bright in many projections. The
federal government has committed to a "Cloud First" policy in its
Information Technology reform plan.
The companies offering IaaS, PaaS, and SaaS promulgate
mouth-watering visions of their benefits. But some of the advantages I
see aren't even in the marketing literature--and some of them, I bet,
could make even a free software advocate come around to appreciating
the cloud.

Advantages of cloud services

The standard litany of reasons for moving to IaaS or PaaS can be
summarized under a few categories:

Low maintenance

No more machine rooms, no more disk failures (that is, disk failures you know about and have to deal with), no more late-night calls to go in and reboot a critical server.

These simplifications, despite the fears of some Information
Technology professionals, don't mean companies can fire their system
administrators. The cloud still calls for plenty of care and
feeding. Virtual systems go down at least as often as physical ones,
and while the right way to deal with system failures is to automate
recovery, that takes sophisticated administrators. So the system
administrators will stay employed and will adapt. The biggest change
will be a shift from physical system management to diddling with
software; for an amusing perspective on the shift see my short story
Hardware Guy.

Fast ramp-up and elasticity

To start up a new operation, you no longer have to wait for hardware to arrive and then lose yourself in snaking cables for hours. Just ask the cloud center to spin up as many virtual systems as you want.

Innovative programmers can also bypass IT management, developing new
products in the cloud. Developers worry constantly whether their
testing adequately reproduces the real-life environment in which
production systems will run, but if both the test systems and the
final production systems run in the cloud, the test systems can match
the production ones much more closely.

The CIO of O'Reilly Media, citing the goal of directing
60 percent of IT spending into new projects,
has made internal and external cloud computing into pillars of
O'Reilly's IT strategy.

Because existing companies have hardware and systems for buying
hardware in place already, current cloud users tend to come from
high-tech start-ups. But any company that wants to launch a new
project can benefit from the cloud. Peaks and troughs in usage can
also be handled by starting and stopping virtual systems--you
just have to watch how many get started up, because a lack of
oversight can incur run-away server launches and high costs.
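
The oversight in question can be as simple as a hard ceiling on how many virtual systems an automated scaler may start. Here is a minimal sketch, assuming a hypothetical sizing rule (one server per fixed number of requests per second) and made-up limits. No real cloud API is called.

```python
import math

def desired_servers(requests_per_sec, capacity_per_server=100, max_servers=20):
    """Return how many virtual servers to run for the current load.

    capacity_per_server and max_servers are illustrative assumptions.
    The cap keeps a traffic spike, or a bug in the load metric, from
    launching servers without limit.
    """
    if requests_per_sec < 0:
        raise ValueError("load cannot be negative")
    # Always keep one server up, and round partial servers upward.
    needed = max(1, math.ceil(requests_per_sec / capacity_per_server))
    return min(needed, max_servers)

# Follow a day of peaks and troughs (hypothetical load samples).
for load in (50, 900, 250_000, 120):
    print(load, "->", desired_servers(load))
```

The third sample stands in for a runaway metric: without the cap it would ask for thousands of servers, and the bill would follow.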

Cost savings

In theory, clouds provide economies of scale that undercut anything an individual client could do on their own. How can a private site, chugging away on a few computers, be more efficient than thousands of fungible processors in one room under the eye of a highly trained expert, all strategically located in an area with cheap real estate and electricity?

Currently, the cost factor in the equation is not so cut and dried.
Running multiple servers on a single microprocessor certainly brings
savings, although loads have to be balanced carefully to avoid slowing
down performance unacceptably. But running processors constantly
generates heat, and if enough of them are jammed together the costs of
air conditioning could exceed the costs of the computers. Remote
computing also entails networking costs.

It will not take long, however, for the research applied by cloud
vendors to pay off in immense efficiencies that will make it hard for
organizations to justify buying their own computers.

Elasticity and consolidation make IaaS so attractive that large
companies are trying to build "private clouds" and bring all the
organization's server hardware into one department, where the
hardware is allocated as virtual resources to the rest of the company.
These internal virtualization projects don't incur some of the
disadvantages that this paper addresses, so I won't consider them.

Advantages of web services

SaaS offers some benefits similar to IaaS and PaaS, but also
significant differences.

Low maintenance

No more installation, no more upgrades, no more incompatibilities with other system components or with older versions of the software on other people's systems. Companies licensing data, instead of just buying it on disks, can access it directly from the vendor's site and be sure of always getting the most recent information.

Fast ramp-up and elasticity

As with IaaS, SaaS frees staff from running every innovation past the IT group. They can recreate their jobs and workflows in the manner they want.


To see what's popular and to prioritize future work, companies love to know how many people are using a feature and how long they spend in various product functions. SaaS makes this easy to track because it can log every mouse click.

Enough of the conventional assessment. What hidden advantages lie in clouds and web services?

What particularly should entice free and open software advocates is web services' prospects for making money. Although free software doesn't have to be offered cost-free (as frequently assumed by those who don't know the field), there's no way to prevent people from downloading and installing it, so most of the money in free software is made through consulting and additional services. Web services allow subscriptions instead, a much more stable income. Two popular content management systems exemplify this benefit: WordPress and Drupal both have hosted offerings, all while releasing their software as open source.

But I find another advantage to web services. They're making
applications better than they ever have been in the sixty-year history
of application development.

Compare your own experiences with stand-alone software to web sites. The quality of the visitor's experience on a successful web site is much better. It's reminiscent of the old cliché about restaurant service in capitalist versus socialist economies.

According to this old story, restaurants in capitalist countries
depend on repeat business from you and your friends, driving the
concern for delivering a positive customer experience from management
down to the lowest level of the wait staff. In a socialist economy,
supposedly, the waiters know they will get paid no matter whether you
like their service or not, so they just don't try. Furthermore,
taking pains to make you happy would be degrading to them as heroes of
a workers' society.

I don't know whether this phenomenon is actually true of restaurants,
but an analogous dynamic holds in software. Web sites know that
visitors will vanish in half a second if the experience is not
immediately gripping, gratifying, and productive. Every hour of every
day, the staff concentrate on the performance and usability of the
site. Along with the business pressure on web services to keep users
on the page, the programmers there can benefit from detailed feedback
about which pages are visited, in which order, and for how long.

In contrast, the programmers of stand-alone software measure
their personal satisfaction by the implementation of complex and
sophisticated calculations under the product's surface. Creating
the user interface is a chore relegated to less knowledgeable staff.

Whatever the reason, I find the interfaces of proprietary as well as
free software to be execrable, and while I don't have statistics to
bolster my claim, I think most readers can cite similar experiences.
Games are the main exception, as well as a few outstanding consumer
applications, but these unfortunately do not seem a standard for the
vast hordes of other programmers to follow.

Moving one's aching fingers from stand-alone software to a web
service brings a sudden rush of pleasure, affirming what working with
computers can be. A bit of discipline in the web services world would
be a good cold bath for the vendors and coders.

Drawbacks of clouds and web services

So why are the analysts and customers still wary of cloud computing? They have their reasons, but some dangers are exaggerated.

Managers responsible for sensitive data feel a visceral sense of vulnerability when they entrust that data to some other organization. Web services have indeed had breaches, because they are prisoners of the twin invariants that continue to ensure software flaws: programmers are human, and so are administrators. Another risk comes when data is transmitted to a service such as Amazon's S3, a process during which it could be seen or even, in theory, altered.

Still, I expect the administrators of web and cloud services to be better trained and more zealous in guarding against security breaches than the average system administrator at a private site. The extra layer added by IaaS also creates new possibilities. An article called "Security in the Cloud" by Gary Anthes, published in the November 2010 Communications of the ACM, points to research projects by Hewlett-Packard and IBM that would let physical machines monitor the virtual machines running on them for viruses and other breaches of security, a bit like a projectionist can interrupt a movie.

A cloud or web service provider creates some risk just because it
provides a tasty target to intruders, who know they can find thousands
of victims in one place. On the other hand, if you put your data in
the cloud, you aren't as likely to lose it to some drive-by
trouble-seeker picking it up off of a wireless network that your
administrator failed to secure adequately, as famously happened to
T.J. Maxx (and they weren't alone).

And considering that security experts suspect most data breaches to be
internal, putting data in the cloud might make it more secure by
reducing its exposure to employees outside of the few programmers or
administrators with access rights. If the Department of Defense had
more systems in the cloud, perhaps it wouldn't have suffered such a
sinister security breach in 2008 through a
flash drive with a virus.

In general, the solution to securing data and transactions is to
encrypt everything. Encrypting the operating systems loaded in IaaS,
for instance, gives the client some assurance that no one can figure
out what it's doing in the cloud, even if another client or even the
vendor itself tries to snoop. If some technological earthquake
undermines the integrity of encryption technologies--such as the
development of a viable quantum computer--we'll have to rethink the
foundations of the information age entirely anyway.

The main thing to remember is that most data breaches are caused by
lapses totally unrelated to how servers are provisioned: they happen
because staff stored unencrypted data on laptops or mobile devices,
because intruders slipped into applications by exploiting buffer
overflows or SQL injection, and so on. (See, for instance, a
U.S. Health & Human Services study saying that
"Laptop theft is the most prevalent cause of the breach of health
information affecting more than 500 people.")
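
To show how little such lapses have to do with where a server lives, here is a minimal sketch of SQL injection and its usual remedy, parameterized queries, using Python's built-in sqlite3 module. The table, column names, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, record TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("alice", "record A"), ("bob", "record B")],
)

def lookup_unsafe(name):
    # Vulnerable: the input is pasted into the SQL text itself.
    query = "SELECT record FROM patients WHERE name = '%s'" % name
    return conn.execute(query).fetchall()

def lookup_safe(name):
    # The ? placeholder makes the driver treat the input purely as data.
    return conn.execute(
        "SELECT record FROM patients WHERE name = ?", (name,)
    ).fetchall()

attack = "x' OR '1'='1"
print(len(lookup_unsafe(attack)))  # the injected OR clause matches every row
print(len(lookup_safe(attack)))    # no patient has that literal name
```

The flaw and the fix are identical whether the database runs in a machine room down the hall or on a virtual system in someone else's data center.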

Regulations such as HIPAA can rule out storing some data off-site, and
concerns about violating security regulations come up regularly during
cloud discussions. But these regulations affect only a small amount of
the data and computer operations, and the regulations can be changed
once the computer industry shows that clouds are both valuable and
acceptably secure.

Bandwidth is a concern, particularly in less technologically developed
parts of the world (like much of the United States, come to think of
it), where bandwidth is inadequate. But in many of these areas, people
often don't even possess computers. SaaS is playing a major role
in underdeveloped areas because it leverages the one type of computer
in widespread use (the cell phone) and the one digital network
that's widely available (the cellular grid). So in some ways,
SaaS is even more valuable in underdeveloped areas, just in a
different form from regions with high bandwidth and universal access.

Nevertheless, important risks and disadvantages have been identified
in clouds and web services. IaaS and PaaS are still young enough (and
their target customers sophisticated enough) for the debate to keep up
pretty well with trends; in contrast, SaaS has been crying out quite a
while for remedies to be proposed, such as the
best practices
recently released by the Consumer Federation of America. This article
will try to raise the questions to a higher level, to find more
lasting solutions to problems such as the following.


Every system has down time, but no company wants to be at the mercy of a provider that turns off service, perhaps for 24 hours or more, because they failed to catch a bug in their latest version or provide adequate battery backup during a power failure.

When Wikileaks was forced off of Amazon's cloud service, it sparked outrage whose echo reached as far as a Wall Street Journal blog and highlighted the vulnerability of depending on clouds. Similarly, the terms of service on social networks and other SaaS sites alienate some people who feel they have legitimate content that doesn't pass muster on those sites.


One of the big debates in the legal arena is how to apportion blame when a breach or failure happens in a cascading service, where one company leases virtual systems in the cloud to provide a higher-level service to other companies.


How can you tell whether the calculation that a service ran over your corporate data produced the correct result? This is a lasting problem with proprietary software, which the free software developers argue they've solved, but which most customers of proprietary software have learned to live with and which therefore doesn't turn them against web services.

But upgrades can present a problem. When a new version of stand-alone
software comes out, typical consumers just click "Yes" on the upgrade
screen and live with the consequences. Careful system administrators
test the upgrade first, even though the vendor has tested it, in case
it interacts perniciously with some factor on the local site and
reveals a bug. Web services reduce everyone to the level of a passive
consumer by upgrading their software silently. There's no
recourse for clients left in the lurch.


Leaving the software on the web service's site also removes all end-user choice. Some customers of stand-alone software choose to leave old versions in place because the new version removed a feature the customers found crucial, or perhaps just because they didn't want the features in the new version and found its performance worse. Web services offer one size to fit all.

Because SaaS is a black box, and one that can change behavior without
warning to the visitors, it can provoke concerns among people
sensitive about consistency and reliability. See my article
Results from Wolfram Alpha: All the Questions We Ever Wanted to Ask About Software as a Service.


Web services have been known to mine customer data and track customer behavior for marketing purposes, and have given data to law enforcement authorities. It's much easier to monitor millions of BlackBerry messages traveling through a single server maintained by the provider than the messages bouncing in arbitrary fashion among thousands of Sendmail servers. If a customer keeps the data on its own systems, law enforcement can still subpoena it, but at least the customer knows she's being investigated.

In the United States, furthermore, the legal requirements that investigators must meet to get data are higher for customers' systems than for data stored on a third-party site such as a web service. Recent Congressional hearings (discussed on O'Reilly's Radar site) highlighted the need to update US laws to ensure privacy for cloud users.

These are knotty problems, but one practice could tease them apart:
making the software running clouds or web services open source.

A number of proponents for this viewpoint can be found, such as the Total Information Outsourcing group, as well as a few precedents. Besides the WordPress and Drupal services mentioned earlier, StatusNet runs a microblogging site and opens up its code so that other people can run sites that interoperate with it. Source code for Google's AppEngine, mentioned earlier as a leading form of IaaS, has been offered for download by Google under a free license. Talend offers data integration and business intelligence as both free software and SaaS.

The Free Software Foundation, a leading free software organization that provides a huge amount of valuable software to Linux and other systems through the GNU project, has created a license called the GNU Affero General Public License that encourages open code for web services. When sites such as StatusNet release code under that license, other people are free to build web services on it but must release all their enhancements and bug fixes to the world as well.

What problems can be ameliorated by freeing the cloud and web service software? Can the companies who produced that software be persuaded to loosen their grip on the source code? And what could a world of free cloud and web services look like? That is where we will turn next.

Next section:
Why web services should be released as free software.

December 15 2010

Defining clouds, web services, and other remote computing


Technology commentators are a bit trapped by the term "cloud," which has been kicked and slapped around enough to become truly shapeless. Time for confession: I stuck the term in this article's title because I thought it useful to attract readers' attention. But what else should I do? To run away from "cloud" and substitute any other term ("web services" is hardly more precise, nor is the phrase "remote computing" I use from time to time) just creates new confusions and ambiguities.

So in this section I'll offer a history of services that have
led up to our cloud-obsessed era, hoping to help readers distinguish
the impacts and trade-offs created by all the trends that lie in the

Computing and storage

The basic notion of cloud computing is simply this: one person uses a
computer owned by another in some formal, contractual manner. The
oldest precedent for cloud computing is therefore timesharing, which
was already popular in the 1960s. With timesharing, programmers could
enter their programs on teletype machines and transmit them over
modems and phone lines to central computer facilities that rented out
CPU time in units of one-hundredth of a second.

Some sites also purchased storage space on racks of large magnetic
tapes. The value of storing data remotely was to recover from flood,
fire, or other catastrophe.

The two major, historic cloud services offered by Amazon--the Elastic
Compute Cloud (EC2) and Simple Storage Service (S3)--are the
descendants of timesharing and remote backup, respectively.

EC2 provides complete computer systems to clients, who can request any
number of systems and dismiss them again when they are no longer
needed. Pricing is quite flexible (even including an option for an
online auction) but is essentially a combination of hourly rates and
data transfer charges.

S3 is a storage system that lets clients reserve as much or as little
space as needed. Pricing reflects the amount of data stored and the
amount of data transferred in and out of Amazon's storage. EC2 and S3
complement each other well, because EC2 provides processing but no
persistent storage.
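
To make the pricing structure concrete, here is a back-of-the-envelope sketch. Every rate below is a made-up placeholder, not an actual Amazon price. Only the shape of the calculation (hours plus transfer for computing, storage plus transfer for data) comes from the description above.

```python
# Hypothetical rates in US dollars; substitute the vendor's real prices.
HOURLY_RATE = 0.10      # per virtual system per hour
TRANSFER_RATE = 0.15    # per gigabyte transferred in or out
STORAGE_RATE = 0.14     # per gigabyte stored per month

def compute_cost(systems, hours, gb_transferred):
    """EC2-style charge: time on the systems plus data transfer."""
    return systems * hours * HOURLY_RATE + gb_transferred * TRANSFER_RATE

def storage_cost(gb_stored, gb_transferred):
    """S3-style charge: space reserved plus data transfer."""
    return gb_stored * STORAGE_RATE + gb_transferred * TRANSFER_RATE

# Four systems running all month (720 hours) plus half a terabyte stored.
month = (compute_cost(systems=4, hours=720, gb_transferred=200)
         + storage_cost(gb_stored=500, gb_transferred=50))
print(f"Estimated monthly bill: ${month:.2f}")
```

Because both terms scale with use, dismissing systems when they are no longer needed shows up directly on the bill.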

Timesharing and EC2-style services work a bit like renting a community
garden. Just as community gardens let apartment dwellers without
personal back yards grow fruits and vegetables, timesharing in the
1960s brought programming within reach of people who couldn't
afford a few hundred thousand dollars to buy a computer. All the
services discussed in this section provide hardware to people who run
their own operations, and therefore are often called
Infrastructure as a Service or IaaS.

We can also trace back cloud computing in another direction as the
commercially viable expression of grid computing, an idea
developed through the first decade of the 2000s but whose
implementations stayed among researchers. The term "grid"
evokes regional systems for delivering electricity, which hide the
origin of electricity so that I don't have to strike a deal with
a particular coal-burning plant, but can simply plug in my computer
and type away. Similarly, grid computing combined computing power from
far-flung systems to carry out large tasks such as weather modeling.
These efforts were an extension of earlier cluster technology
(computers plugged into local area networks), and effectively
scattered the cluster geographically. Such efforts were also inspired
by the well-known
SETI@home program,
an early example of Internet crowdsourcing that millions of people have
downloaded to help process signals collected from telescopes.

Another form of infrastructure became part of modern life in the 1990s
when it seemed like you needed your own Web site to be anybody.
Internet providers greatly expanded their services, which used to
involve bare connectivity and an email account. Now they also offer
individualized Web sites and related services. Today you can find a
wealth of different hosting services at different costs depending on
whether you want a simple Web presence, a database, a full-featured
content management system, and so forth.

These hosting services keep costs low by packing multiple users onto
each computer. A tiny site serving up occasional files, such as my own, needs nothing that
approaches the power of a whole computer system. Thanks to virtual
hosting, I can use a sliver of a web server that dozens of other sites
share and enjoy my web site for very little cost. But my site
still looks and behaves like an independent, stand-alone web server.
We'll see more such legerdemain as we explore virtualization and
clouds further.

The glimmer of the cloud in the World Wide Web

The next great breakthrough in remote computing was the concept of an
Application Service Provider. This article started with one
contemporary example, Gmail. Computing services such as payroll
processing had been outsourced for some time, but in the 1990s, the
Web made it easy for a business to reach right into another
organization's day-to-day practice, running programs on central
computers and offering interfaces to clients over the Internet. People
used to filling out forms and proceeding from one screen to the next
on a locally installed program could do the same on a browser with
barely any change in behavior.

Using an Application Service Provider is a little like buying a house
in the suburbs with a yard and garden, but hiring a service to
maintain them. Just as the home-owner using a service doesn't
have to get his hands dirty digging holes for plants, worry about the
composition of the lime, or fix a broken lawnmower, companies who
contract with Application Service Providers don't have to
wrestle with libraries and DLL hell, rush to upgrade software when
there's a security breach, or maintain a license server. All
these logistics are on the site run by the service, hidden away from
the user.

Early examples of Application Service Providers for everyday personal
use include blogging sites.
These sites offer web interfaces for everything from customizing the
look of your pages to putting up new content (although advanced users
have access to back doors for more complex configuration).

Interestingly, many companies recognized the potential of web browsers
to deliver services in the early 2000s. But browsers' and
JavaScript's capabilities were too limited for rich interaction.
These companies had to try to coax users into downloading plugins that
provided special functionality. The only plugin that ever caught on
was Flash (which, of course, enables many other applications). True
web services had to wait for the computer field to evolve along
several dimensions.

As broadband penetrated to more and more areas, web services became a
viable business model for delivering software to individual users.
First of all, broadband connections are "always on," in
contrast to dial-up. Second, the XMLHttpRequest extension allows browsers
to fetch and update individual snippets of a web page, a practice that
programmers popularized under the acronym AJAX.

Together, these innovations allow web applications to provide
interfaces almost as fast and flexible as native applications running
on your computer, and a new version of HTML takes the process even
farther. The movement to the web is called Software as a
Service or SaaS.


The pinned web site feature introduced in Internet Explorer 9
encourages users to create menu items or icons representing web sites,
making them as easy to launch as common applications on their
computer. This feature is a sign of the shift of applications from
the desktop to the Web.

Every trend has its logical conclusion, even if it's farther
than people are willing to go in reality. The logical conclusion of
SaaS is a tiny computer with no local storage and no software except
the minimal operating system and networking software to access servers
that host the software to which users have access.

Such thin clients were already prominent in the work world
before Web services became popular; they connected terminals made by
companies such as Wyse with local servers over cables. (Naturally,
Wyse has recently latched on to the cloud hype.)
The Web equivalent of thin clients is mobile devices such as iPhones
with data access, or
Google Chrome OS,
which Google is hoping will wean people away from popular software
packages in favor of Web services like Google Docs. Google is planning
to release a netbook running Chrome OS in about six months. Ray
Ozzie, chief software architect of Microsoft, also speaks of an
upcoming reality of
continuous cloud services delivered to thin appliances.
The public hasn't followed the Web services revolution this far,
though; most are still lugging laptops.

Data, data everywhere

Most of the world's data is now in digital form, probably in some
relational database such as Oracle, IBM's DB2, or MySQL. If the
storage of the data is anything more formal than a spreadsheet on some
clerical worker's PC (and a shameful amount of critical data is still
on those PCs), it's probably already in a kind of cloud.

Database administrators know better than to rely on a single disk to
preserve those millions upon millions of bytes, because tripping over
an electric cable can lead to a disk crash and critical information
loss. So they not only back up their data on tape or some other
medium, but duplicate it on a series of servers in a strategy called
replication. They often transmit data second by second over
hundreds of miles of wire so that flood or fire can't lead to
permanent loss.

Replication strategies can get extremely complex (for instance, code
that inserts the "current time" can insert different
values as the database programs on various servers execute it), and
they are supplemented by complex caching strategies. Caches are
necessary because public-facing systems should have the most commonly
requested data--such as current pricing information for company
products--loaded right into memory. An extra round-trip over the
Internet for each item of data can leave users twiddling their thumbs in
annoyance. Loading or "priming" these caches can take
hours, because primary memories on computers are so large.
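
The "current time" problem can be sketched in a few lines. This is a toy model, not how any particular database implements replication: the Replica class and both replication functions are hypothetical. Replaying the statement lets each server consult its own clock, whereas shipping the value computed once on the primary keeps the copies identical.

```python
import itertools

class Replica:
    """A toy server with its own clock and a one-column 'table'."""
    def __init__(self, clock):
        self.clock = clock      # callable returning this server's time
        self.rows = []

    def insert_now(self):
        # Statement-style: the server evaluates "current time" itself.
        self.rows.append(self.clock())

    def insert_value(self, value):
        # Value-style: the server stores a value computed elsewhere.
        self.rows.append(value)

def make_clock(start):
    """Fake clock that advances one tick per reading."""
    ticks = itertools.count(start)
    return lambda: next(ticks)

def replicate_statement(primary, replicas):
    primary.insert_now()
    for r in replicas:
        r.insert_now()          # re-executed against a different clock

def replicate_value(primary, replicas):
    primary.insert_now()
    stamp = primary.rows[-1]
    for r in replicas:
        r.insert_value(stamp)   # same value on every server

a, b = Replica(make_clock(100)), Replica(make_clock(103))
replicate_statement(a, [b])
print(a.rows[-1] == b.rows[-1])   # False: the copies disagree
replicate_value(a, [b])
print(a.rows[-1] == b.rows[-1])   # True: the copies match
```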

The use of backups and replication can be considered a kind of private
cloud, and if a commercial service becomes competitive in reliability
or cost, we can expect businesses to relax their grip and entrust
their data to such a service.

We've seen how Amazon's S3 allowed people to store
data on someone else's servers. But as a primary storage area,
S3 isn't cost-effective. It's probably most valuable when
used in tandem with an IaaS service such as EC2: you feed your data
from the data cloud service into the compute cloud service.

Some people also use S3, or one of many other data storage services,
as a backup to their local systems. Although it may be hard to get
used to trusting some commercial service over a hard drive you can
grasp in your hand, the service has some advantages. Its operators are
not as likely as you are to drop the hard drive on the floor and break
it, or have it go up in smoke when a malfunctioning electrical system
starts a fire.

But data in the cloud has a much more powerful potential. Instead of
Software as a Service, a company can offer its data online for others
to use.

Probably the first company to try this radical exposure of data was
Amazon, who can also be credited for starting the cloud services
mentioned earlier. Amazon released a service that let programmers
retrieve data about its products, so that instead of having to visit
dozens of web pages manually and view the data embedded in the text,
someone could retrieve statistics within seconds.

Programmers loved this. Data is empowering, even if it's just
sales from one vendor, and developers raced to use the application
programming interface (API) to create all kinds of intriguing
applications using data from Amazon. Effectively, they leave it up to
Amazon to collect, verify, maintain, search through, and correctly
serve up data on which their applications depend. Seen as an aspect of
trust, web APIs are an amazing shift in the computer industry.

Amazon's API was a hack of the Web, which had been designed to
exchange pages of information. Like many other Internet services, the
Web's HTTP protocol offers a few basic commands: GET, PUT, POST,
and DELETE. The API used the same HTTP protocol to get and put
individual items of data. And because it used HTTP, it could easily be
implemented in any language. Soon there were libraries of programming
code in all popular languages to access services such as Amazon's data.
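To make the idea of getting and putting individual items concrete, here is a minimal in-memory sketch in Python. It is not Amazon's API: the ItemStore class, the /items path, and the status codes chosen are assumptions for illustration, and no network is involved. It only shows how the four HTTP commands can map onto operations on single pieces of data.

```python
class ItemStore:
    """In-memory stand-in for a web API exposing data items (hypothetical)."""

    def __init__(self):
        self.items = {}
        self.next_id = 1

    def handle(self, method, path, body=None):
        """Dispatch one request and return a (status_code, payload) pair."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "items" or len(parts) > 2:
            return 404, None
        item_id = parts[1] if len(parts) == 2 else None

        if method == "POST" and item_id is None:
            # POST to the collection creates a new item with a fresh ID.
            item_id = str(self.next_id)
            self.next_id += 1
            self.items[item_id] = body
            return 201, {"id": item_id}
        if item_id is None:
            return 405, None  # this sketch supports only POST on the collection
        if method == "GET":
            if item_id in self.items:
                return 200, self.items[item_id]
            return 404, None
        if method == "PUT":
            # PUT stores the item at the URL the client named.
            self.items[item_id] = body
            return 200, body
        if method == "DELETE":
            if item_id in self.items:
                del self.items[item_id]
                return 204, None
            return 404, None
        return 405, None


store = ItemStore()
created = store.handle("POST", "/items", {"title": "Book", "price": 12.5})
fetched = store.handle("GET", "/items/1")
updated = store.handle("PUT", "/items/1", {"title": "Book", "price": 9.99})
deleted = store.handle("DELETE", "/items/1")
missing = store.handle("GET", "/items/1")
print(created, fetched, updated, deleted, missing)
```

Because the whole exchange is a method name, a path, and a body, a client library for such a service can be written in nearly any language that can send an HTTP request.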

Another early adopter of Web APIs was Google. Because its
Google Maps service exposed
data in a program-friendly form, programmers started to build useful
services on top of it. One famous example combined Google Maps with a
service that published information on properties available for rent;
users could quickly pull up a map showing where to rent a room in
their chosen location. Such combinations of services were called
mash-ups, with interesting cultural parallels to the
practices of musicians and artists in the digital age who combine
other people's work from many sources to create new works.
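A mash-up of this kind boils down to joining the output of two services on a shared key, here a street address. The sketch below uses Python with made-up data: fetch_listings and geocode are hypothetical stand-ins for a rental-listing service and a mapping service, not real APIs.

```python
def fetch_listings():
    """Stand-in for a rental-listing service (hypothetical data)."""
    return [
        {"address": "12 Elm St", "rent": 950},
        {"address": "48 Oak Ave", "rent": 1200},
        {"address": "7 Unknown Rd", "rent": 800},
    ]


def geocode(address):
    """Stand-in for a mapping service's address lookup (hypothetical data)."""
    known = {
        "12 Elm St": (42.37, -71.11),
        "48 Oak Ave": (42.36, -71.06),
    }
    return known.get(address)


def build_markers(max_rent):
    """Combine both services into map markers for affordable rentals."""
    markers = []
    for listing in fetch_listings():
        if listing["rent"] > max_rent:
            continue
        coords = geocode(listing["address"])
        if coords is None:
            continue  # skip listings the mapping service cannot place
        lat, lon = coords
        markers.append(
            {"lat": lat, "lon": lon, "label": f"${listing['rent']}/month"}
        )
    return markers


markers = build_markers(max_rent=1000)
print(markers)
```

Neither service needs to know about the other; the combining program depends only on each one returning data in a program-friendly form.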

The principles of using the Web for such programs evolved over several
years in the late 1990s, but the most popular technique was codified
in a 2000 PhD thesis by HTTP designer Roy Thomas Fielding, who
invented the now-famous term REST (standing for Representational State
Transfer) to cover the conglomeration of practices for defining URLs
and exchanging messages. Different services adhere to these principles
to a greater or lesser extent. But any online service that wants to
garner serious, sustained use now offers an API.

A new paradigm for programmers

SaaS has proven popular for programmers. In 1999, a company named VA
Linux created a site called SourceForge
with the classic SaaS goal of centralizing the administration of
computer systems and taking that burden off programmers' hands. A
programmer could upload his program there and, as is typical for free
software and open source, accept code contributions from anyone else
who chose to download the program.

VA Linux at that time made its money selling computers that ran the
GNU/Linux operating system. It set up SourceForge as a donation to the
free software community, to facilitate the creation of more free
software and therefore foster greater use of Linux. Eventually the
hardware business dried up, so SourceForge became the center of the
company's business: corporate history anticipated cloud
computing history.

SourceForge became immensely popular, quickly coming to host hundreds
of thousands of projects, some quite heavily used. It has also
inspired numerous other hosting sites for programmers, such as
GitHub. But these sites don't
completely take the administrative hassle out of being a programmer.
You still need to run development software--such as a compiler
and debugger--on your own computer.

Google leapt up to the next level of programmer support with
Google App Engine,
a kind of programmer equivalent to Gmail or
Google Docs.
App Engine is a cocoon within which you can plant a software larva and
carry it through to maturity. As with SaaS, the programmer does the
coding, compilation, and debugging all on the App Engine site. Also
like SaaS, the completed program runs on the site and offers a web
interface to the public. But in terms of power and flexibility, App
Engine is like IaaS because the programmer can use it to offer any
desired service. This new kind of development paradigm is called
Platform as a Service or PaaS.

Microsoft offers both IaaS and PaaS in
Windows Azure.

Hopefully you now see how various types of remote computing are alike,
as well as different.

December 13 2010

Resolving the contradictions between web services, clouds, and open source


What are the chances for a free software cloud?

Additional posts in this 5-part series are available here.

Predicting trends in computer technology is an easy way to get into trouble, but two developments have been hyped so much over the past decade that there's little risk in jumping on their bandwagons: free software and cloud computing. What's odd is that both are so beloved of crystal-gazers, because on the surface they seem incompatible.

The first trend promises freedom, the second convenience. Both freedom and convenience inspire people to adopt new technology, so I believe the two trends will eventually coexist and happily lend power to each other. But first, the proponents of each trend will have to get jazzed up about why the other trend is so compelling.

Freedom is promised by the free and open source software movement. Its
foundation is the principle of radical sharing: the knowledge one
produces should be offered to others. Starting with a few
break-through technologies that surprised outsiders by coming to
dominate their industries--the GNU C compiler, the Linux kernel,
the Apache web server--free software has insinuated itself into
every computing niche.

The trend toward remote computing--web services and the vaguely
defined cloud computing--promises another appealing kind of
freedom: freedom from having to buy server hardware and set up
operations, freedom from installations and patches and upgrades,
freedom in general from administrative tasks. Of course, these
advantages are merely convenience, not the kind of freedom championed
by the free software movement.

Together with the mobile revolution (not just programs on cell phones,
but all kinds of sensors, cameras, robots, and specialized devices for
recording and transmitting information) free software and remote
computing are creating new environments for us to understand
information, ourselves, and each other.

The source of the tension

Remote computing, especially the layer most of us encounter as web
services, is offered on a take-it-or-leave-it basis. Don't like
Facebook's latest change to its privacy settings? (Or even where
it locates its search box?) Live with it or break your Facebook habit
cold turkey.

Free software, as we'll see, was developed in resistance to such
autocratic software practices. And free software developers were among
the first to alert the public about the limitations of clouds and web
services. These developers--whose ideals are regularly challenged
by legal, social, and technological change--fear that remote
computing undermines the premises of free software. To understand the
tension, let's contrast traditional mail delivery with a popular
online service such as
Gmail, a textbook example of a web
service familiar to many readers.

For years, mail was transmitted by free software. The most popular
mail server was Sendmail, which could stand with the examples I listed
at the beginning of this article as one of the earliest examples of free
software in widespread use. Sendmail's source code has been
endlessly examined, all too often for its many security flaws.

Lots of organizations still use free software mail servers, even
though in the commercial world, Microsoft's closed-source
Exchange is the standard. But organizations are flocking now to Gmail,
which many people find the most appealing interface for email.

Not only is Gmail closed, but the service would remain closed even if
Google released all the source code. This is because nobody who uses
Gmail software actually loads it on their systems (except some
JavaScript that handles user interaction). We all simply fire up a
browser to send a message to code running on Google servers. And if
Google hypothetically released the source code and someone set up a
competing Gmail, that would be closed for the same reason. A web
service runs on a privately owned computer and therefore is always
closed.

So the cloud--however you define it--seems to render the notion of
software freedom meaningless. But things seem to get even worse. The
cloud takes the client/server paradigm to its limit. There is forever
an unbreachable control gap between those who provide the service and
those who sign up for it.

And this is apparently a step backward in computing history. Closed,
proprietary software erected a gateway between the all-powerful
software developers and the consumers of the software. Free software
broke the gate down by giving the consumers complete access to source
code and complete freedom to do what they wanted. Amateurs around the
world have grabbed the opportunity to learn programming techniques
from free software and to make it fit their whims and needs. Now, once
again, software hidden behind a server commands the user to relinquish
control--and as the popularity of Gmail and other services shows,
users are all too ready to do it.

Cloud computing is leading to the bifurcation of computing into a
small number of developers with access to the full power and
flexibility that computers can offer, contrasted with a world full of
small devices offering no say in what the vendors choose for us to
run, a situation predicted in Jonathan Zittrain's book
The Future of the Internet.

Tim Berners-Lee, inventor of the World Wide Web, as part of a major
Scientific American article, criticized social networks like Facebook
as silos that commit the
sin of hoarding data entered by visitors instead of exposing it openly
on the Internet. Ho, Sir Berners-Lee, that's exactly why many visitors
use social networks: to share their personal thoughts and activities
with a limited set of friends or special-interest groups. Social
networks and their virtual walls therefore contribute to the potential
of the Internet as a place to form communities.

But Berners-Lee was airing his complaint as part of a larger point
about the value of providing data for new and unanticipated
applications, and his warning does raise the question of scale. If
Facebook-type networks became the default and people "lived" on them
all the time instead of the wider Web, opportunities for
interconnection and learning would diminish.

Complementary trends

But one would be jumping to conclusions to assume that cloud computing
is inimical to free software. Google is one of the world's great
consumers of free software, and a supporter as well. Google runs its
servers on Linux, and has placed it at the core of its fast-growing
Android mobile phone system. Furthermore, Google submits enhancements
to free software projects, releases many of its peripheral
technologies as open source, and runs projects such as
Summer of Code to develop
new free software programs and free software programmers in tandem.

This is the trend throughout computing. Large organizations with banks
of servers tend to run free software on them. The tools with which
they program and administer the servers are also free.

A "free software cloud" may seem to be an oxymoron, like
"non-combat troops." But I believe that free software and
remote computing were made for each other; their future lies together
and the sooner they converge, the faster they will evolve and gain
adoption. In fact, I believe a free software cloud--much more
than the "open cloud" that
many organizations are working on--lies
in our future. This series will explore the traits of each trend and
show why they are meant to join hands.

December 10 2010

Four short links: 10 December 2010

  1. Let it Snow -- bookmarklet from David Flanagan that makes JavaScript snowflakes fall. Awww. (via Mike Loukides)
  2. You Can Work on Great Technology at Startups -- There are more innovative database startups at various stages in their life than I can remember right now. So true--waiting for the inevitable amalgamation, thinning out, etc. (via Nat Friedman on Twitter)
  3. Dropbox for Teams -- an interesting package of features from a very innovative company. (via Hacker News)
  4. Cloud Computing Checklist -- Comparison and Analysis of the Terms and Conditions of Cloud Computing Services. What to look out for when signing a cloud contract. (via Rick Shera in email)
