
June 21 2012

Clinicians, researchers, and patients working together: progress aired at Indivo conference

While thousands of health care professionals were flocking to the BIO International Convention this week, I spent Monday in a small library at the Harvard Medical School listening to a discussion of the Indivo patient health record and related open source projects with about 80 intensely committed followers. Lead Indivo architect Daniel Haas, whom I interviewed a year ago, succeeded in getting the historic 2.0 release of Indivo out on the day of the conference. This article explains the significance of the release in the health care field and the promise of the work being done at Harvard Medical School and by its collaborators.

Although still at the early adoption stages, Indivo and the related SMART and i2b2 projects merit attention and have received impressive backing. The Office of the National Coordinator funded SMART, and NIH funded i2b2. National Coordinator Farzad Mostashari was scheduled to attend Monday's conference (although he ended up having to speak over a video hookup). Indivo inspired both Microsoft HealthVault and Google Health, and a good deal of its code underlies HealthVault. Australia has launched a nationwide PHR initiative inspired by Indivo. A Partners HealthCare representative spoke at the conference, as did someone from the MIT Media Lab. Clayton M. Christensen et al. cited Indivo as a good model in The Innovator's Prescription: A Disruptive Solution for Health Care. Let's take a look at what makes the combination so powerful.

Platform and reference implementation

The philosophy underlying this distributed open source initiative is to get clinicians, health researchers, and patients to share data and work together. Today, patient data is locked up in thousands of individual doctor and hospital repositories; whether they're paper or electronic hardly makes a difference because they can't be combined or queried. The patient usually can't see his own data, as I described in an earlier posting, much less offer it to researchers. Dr. Kenneth Mandl, opening the conference, pointed out that currently, an innovative company in the field of health data will die on the vine because it can't get data without making deals with each individual institution and supporting its proprietary EHR.

The starting point for changing all that, so far as this conference goes, is the SMART platform. It simply provides data models for storing data and APIs to retrieve it. If an electronic health record can translate data into a simple RDF model and support the RESTful API, any other program or EHR that supports SMART can access the data. OAuth supports security and patient control over access.
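
To make that flow concrete, here is a rough sketch of what a SMART-style client call might look like in Python. The container URL, record ID, and token handling are placeholders of my own, not taken from the actual SMART specification (which signs requests with OAuth rather than sending a bare token).

    # Placeholders throughout: the container URL, record ID, and token are
    # invented, and real SMART containers sign requests with OAuth rather
    # than accepting a bare bearer token as shown here.
    import requests
    from rdflib import Graph, URIRef   # SMART payloads are RDF graphs

    SMART_CONTAINER = "https://smart.example.org"
    RECORD_ID = "12345"
    TOKEN = "access-token-obtained-through-an-oauth-dance"

    resp = requests.get(
        "%s/records/%s/medications/" % (SMART_CONTAINER, RECORD_ID),
        headers={"Authorization": "Bearer %s" % TOKEN},
    )
    resp.raise_for_status()

    # Parse the RDF/XML payload and print anything carrying a dcterms:title.
    graph = Graph()
    graph.parse(data=resp.text, format="xml")
    for title in graph.objects(predicate=URIRef("http://purl.org/dc/terms/title")):
        print(title)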

Indivo is a patient health record (or, to use the term preferred by the conference speakers, a personally controlled health record). It used to have its own API, and the big significance of Monday's 2.0 release is that it now supports SMART. The RESTful interface will make Indivo easy to extend beyond its current Java and Python interfaces. So there's a far-reaching platform now for giving patients access to data and working seamlessly with other cooperating institutions.

The big missing piece is apps, and a hackathon on Tuesday (which I couldn't attend) was aimed at jump-starting a few. Already, a number of researchers are using SMART to coordinate data sharing and computation through the i2b2 platform developed by Partners. Ultimately, the SMART and Indivo developers hope to create an app store, inspired by Apple's, where a whole marketplace can develop. Any app written to the SMART standard can run in Indivo or any other system supporting SMART. The concept of an app in SMART and Indivo differs from that of a consumer market, though. The administrator of the EHR or PHR would choose apps, vetting them for quality and safety, and then a doctor, researcher, or patient could use one of the chosen apps.

Shawn Murphy of Partners described the use of i2b2 to choose appropriate patients for a clinical study. Instead of having to check many different data repositories manually for patients meeting the requirements (genetic, etc.), a researcher could issue automated queries over SMART to the databases. The standard also supports teamwork across institutions. Currently, 60 different children's hospitals' registries talk to each other through i2b2.

It should be noted that i2b2 does not write into a vendor's EHR system (which the ONC and many others call an important requirement for health information exchange) because putting data back into a silo isn't disruptive innovation. It's better to give patients a SMART-compatible PHR such as Indivo.

Regarding Tuesday's hackathon, Haas wrote me, "By the end of the day, we had several interesting projects in the works, including an app to do contextualized search based on a patient's Problems list (integration with google.com and MedlinePlus), and app integration with BodyTrack, which displays Indivo labs data in time-series form alongside data from several other open API inputs, such as Fitbit and Zeo devices."

Standards keep things simple

All the projects mentioned are low-budget efforts, so they all borrow and repurpose whatever open source tools they can. As Mostashari said in his video keynote, they believe in "using what you've got." I have already mentioned SMART's dependence on standards, and Indivo is just as beholden to other projects, particularly Django. For instance, Indivo allows data to be stored in Django's data models (Python structures that represent basic relational tables). Indivo also provides an even simpler JSON-based data model.
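
As a rough illustration of the two styles (the field names here are invented for the example, not Indivo's actual schema), a simple measurement could be declared either as a Django model or as the equivalent JSON document:

    # A rough illustration only: these field names are invented for the
    # example and are not Indivo's actual schema.
    from django.db import models

    class BloodPressureReading(models.Model):
        """One measurement stored as an ordinary relational row."""
        patient_id = models.CharField(max_length=64)
        systolic = models.IntegerField()
        diastolic = models.IntegerField()
        taken_at = models.DateTimeField()

    # The simpler JSON-based data model expresses the same record as a document:
    EXAMPLE_JSON_RECORD = {
        "model": "BloodPressureReading",   # document type, invented name
        "patient_id": "sample-patient",
        "systolic": 120,
        "diastolic": 80,
        "taken_at": "2012-06-21T09:30:00Z",
    }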

The format of data is just as important as the exchange protocol if interoperability is to succeed. The SMART team chose to implement several "best-of-breed" standards that would cover 80% of use cases: for instance, SNOMED for medical conditions, RxNORM for medications, and LOINC for labs. Customers using other terminologies will have to translate them into the supported standards, and SMART includes provenance fields indicating the data source.

The software is also rigorously designed to be modular, so both the original developers and other adopters can replace pieces as desired. Indivo already has plenty of fields about patient data and about context (provider names, etc.), but more can be added ad infinitum to support any health app that comes along. Indivo 2.0 includes pluggable data models, which allow a site to customize every step from taking in data to removing it. It also supports new schemas for data of any chosen type.
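
One way to picture the pluggable idea is as a small pipeline in which a site registers its own handler for each stage of a record's life. The sketch below is my own schematic, not Indivo's actual plugin API:

    # My own schematic of a pluggable pipeline, not Indivo's actual API:
    # a site registers handlers for each stage of a record's life.
    class PluggablePipeline:
        def __init__(self):
            self.stages = {"ingest": [], "transform": [], "remove": []}

        def register(self, stage, handler):
            self.stages[stage].append(handler)

        def run(self, stage, record):
            for handler in self.stages[stage]:
                record = handler(record)
            return record

    pipeline = PluggablePipeline()
    pipeline.register("ingest", lambda rec: dict(rec, source="clinic-upload"))
    pipeline.register("transform", lambda rec: dict(rec, systolic=int(rec["systolic"])))

    record = pipeline.run("transform", pipeline.run("ingest", {"systolic": "120"}))
    print(record)   # {'systolic': 120, 'source': 'clinic-upload'}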

The simplicity of Indivo, SMART, and i2b2--so much in contrast with most existing health information exchanges--is reminiscent of Blue Button. Mandl suggested that a Blue Button app would be easy to write. But the difference is that Blue Button aimed to be user-friendly whereas the projects at this conference are developer-friendly. That means they can add some simple structure and leave it up to app developers to present the data to users in a friendly manner.

The last hurdle

Because SMART and Indivo ultimately want the patient to control access to data, trust is a prerequisite. OAuth is widely used by Twitter apps and other sites across the Web, but hasn't been extensively tested in a health care environment. We'll need more experience with OAuth to see whether the user experience and users' sense of security are adequate. And after that, trust is up to the institutions adopting Indivo or SMART. A couple of speakers pointed out that huge numbers of people trust mint.com with the passwords to their financial accounts, so once they see the benefits of access to patient records, they should be willing to adopt Indivo as well. An Indivo study found that 84% of people are willing to share data with social networks for research and learning.

SMART, Indivo, and i2b2 make data sharing easier than ever. But as many have pointed out, none of this will get very far until patients, government, and others demand that institutions open up. Mandl suggested that one of the major reasons Google Health failed was that it could never get enough data to gain traction--the health providers just wouldn't work with the PHR. At least the open source standards take away some of the technical excuses they have used up to now.

May 02 2012

Recombinant Research: Breaking open rewards and incentives

In the previous articles in this series I've looked at problems in current medical research, and at the legal and technical solutions proposed by Sage Bionetworks. Pilot projects have shown encouraging results, but to move from a hothouse environment of experimentation to the mainstream of one of the world's most lucrative and tradition-bound industries, Sage Bionetworks must aim for its nucleus: rewards and incentives.

Previous article in the series: Sage Congress plans for patient engagement.

Think about the publication system, that wretchedly inadequate medium for transferring information about experiments. Getting the data on which a study was based is incredibly hard; getting the actual samples or access to patients is usually impossible. Just as boiling vegetables drains most of their nutrients into the water, publishing results of an experiment throws away what is most valuable.

But the publication system has been built into the foundation of employment and funding over the centuries. A massive industry provides distribution of published results to libraries and research institutions around the world, and maintains iron control over access to that network through peer review and editorial discretion. Even more important, funding grants require publication (though only very recently have funders begun to require the data behind the study). And of course, advancement in one's field requires publication.

Lawrence Lessig, in his keynote, castigated for-profit journals for restricting access to knowledge in order to puff up profits. A chart in his talk showed skyrocketing prices for for-profit journals in comparison to non-profit journals. Lessig is not out on the radical fringe in this regard; Harvard Library is calling the current pricing situation "untenable" in a move toward open access echoed by many in academia.

Lawrence Lessig keynote at Sage Congress.

How do we open up this system that seemed to serve science so well for so long, but is now becoming a drag on it? One approach is to expand the notion of publication. This is what Sage Bionetworks is doing with Science Translational Medicine in publishing validated biological models, as mentioned in an earlier article. An even more extensive reset of the publication model is found in Open Network Biology (ONB), an online journal. The publishers require that an article be accompanied by the biological model, the data and code used to produce the model, a description of the algorithm, and a platform to aid in reproducing results.

But neither of these worthy projects changes the external conditions that prop up the current publication system.

When one tries to design a reward system that gives deserved credit to other things besides the final results of an experiment, as some participants did at Sage Congress, great unknowns loom up. Is normalizing and cleaning data an activity worth praise and recognition? How about combining data sets from many different projects, as a Synapse researcher did for the TCGA? How much credit do you assign researchers at each step of the necessary procedure for a successful experiment?

Let's turn to the case of free software to look at an example of success in open sharing. It's clear that free software has swept the computer world. Most web sites use free software ranging from the server on which they run to the language compilers that deliver their code. Everybody knows that the most popular mobile platform, Android, is based on Linux, although fewer realize that the next most popular mobile platforms, Apple's iPhones and iPads, run on a modified version of the open BSD operating system. We could go on and on citing ways in which free and open source software have changed the field.

The mechanism by which free and open source software staked out its dominance in so many areas has not been authoritatively established, but I think many programmers agree on a few key points:

  • Computer professionals encountered free software early in their careers, particularly as students or tinkerers, and brought their predilection for it into jobs they took at stodgier institutions such as banks and government agencies. Their managers deferred to them on choices for programming tools, and the rest is history.

  • Of course, computer professionals would not have chosen the free tools had they not been fit for the job (and often best for the job). Why is free software so good? Probably because the people creating it have complete jurisdiction over what to produce and how much time to spend producing it, unlike in commercial ventures with requirements established through marketing surveys and deadlines set unreasonably by management.

  • Different pieces of free software are easy to hook up, because one can alter their interfaces as necessary. Free software developers tend to look for other tools and platforms that could work with their own, and provide hooks into them (Apache, free database engines such as MySQL, and other such platforms are often accommodated.) Customers of proprietary software, in contrast, experience constant frustration when they try to introduce a new component or change components, because the software vendors are hostile to outside code (except when they are eager to fill a niche left by a competitor with market dominance). Formal standards cannot overcome vendor recalcitrance--a painful truth particularly obvious in health care with quasi-standards such as HL7.

  • Free software scales. Programmers work on it tirelessly until it's as efficient as it needs to be, and when one solution just can't scale any more, programmers can create new components such as Cassandra, CouchDB, or Redis that meet new needs.

Are there lessons we can take from this success story? Biological research doesn't fit the circumstances that made open source software a success. For instance, researchers start out low on the totem pole in very proprietary-minded institutions, and don't get to choose new ways of working. But the cleverer ones are beginning to break out and try more collaboration. Software and Internet connections help.

Researchers tend to choose formats and procedures on an ad hoc, project by project basis. They haven't paid enough attention to making their procedures and data sets work with those produced by other teams. This has got to change, and Sage Bionetworks is working hard on it.

Research is labor-intensive. It needs desperately to scale, as I have pointed out throughout this article, but to do so it needs entire new paradigms for thinking about biological models, workflow, and teamwork. This too is part of Sage Bionetworks' mission.

Certain problems are particularly resistant in research:

  • Conditions that affect small populations have trouble raising funds for research. The Sage Congress initiatives can lower research costs by pooling data from the affected population and helping researchers work more closely with patients.

  • Computation and statistical methods are very difficult fields, and biological research is competing with every other industry for the rare individuals who know these well. All we can do is bolster educational programs for both computer scientists and biologists to get more of these people.

  • There's a long lag time before one knows the effects of treatments. As Heywood's keynote suggested, this is partly solved by collecting longitudinal data on many patients and letting them talk among themselves.

Another process change has revolutionized the computer field: agile programming. That paradigm stresses close collaboration with the end-users whom the software is supposed to benefit, and a willingness to throw out old models and experiment. BRIDGE and other patient initiatives hold out the hope of a similar shift in medical research.

All these things are needed to rescue the study of genetics. It's a lot to do all at once. Progress on some fronts was more apparent than on others at this year's Sage Congress. But as more people get drawn in, and sometimes fumbling experiments produce maps for changing direction, we may start to see real outcomes from the efforts in upcoming years.

All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.


May 01 2012

Recombinant Research: Sage Congress plans for patient engagement

Clinical trials are the pathway for approving drug use, but they aren't good enough. That has become clear as a number of drugs (Vioxx being the most famous) have been blessed by the FDA, but disqualified after years of widespread use revealed either lack of efficacy or dangerous side effects. And the measures taken by the FDA recently to solve this embarrassing problem continue the heavyweight bureaucratic methods it has always employed: more trials, raising the costs of every drug and slowing down approval. Although I don't agree with the opinion of Avik S. A. Roy (reprinted in Forbes) that Phase III trials tend to be arbitrary, I do believe it is time to look for other ways to test drugs for safety and efficacy.

First article in the series: Recombinant Research: Sage Congress Promotes Data Sharing in Genetics.

But the Vioxx problem is just one instance of the wider malaise afflicting the drug industry. They just aren't producing enough new medications, either to solve pressing public needs or to keep up their own earnings. Vicki Seyfert-Margolis of the FDA built on her noteworthy speech at last year's Sage Congress (reported in one of my articles about the conference) with the statistic that drug companies submitted 20% fewer medications to the FDA between 2001 and 2007. Their blockbuster drugs produce far fewer profits than before as patents expire and fewer new drugs emerge (a predicament called the "patent cliff"). Seyfert-Margolis intimated that this crisis is the cause of layoffs in the industry, although I heard elsewhere that the companies are outsourcing more research, so perhaps the downsizing is just a reallocation of the same money.

Benefits of patient involvement

The field has failed to rise to the challenges posed by new complexity. Speakers at Sage Congress seemed to feel that genetic research has gone off the tracks. As the previous article in this series explained, Sage Bionetworks wants researchers to break the logjam by sharing data and code in GitHub fashion. And surprisingly, pharma is hurting enough to consider going along with an open research system. They're bleeding from a situation where as much as 80% of the effort in each clinical analysis is spent retrieving, formatting, and curating the data. Meanwhile, Kathy Giusti of the Multiple Myeloma Research Foundation says that in their work, open clinical trials are 60% faster.

Attendees at a breakout session I sat in on, including numerous managers from major pharma companies, expressed confidence that they could expand public or "pre-competitive" research in the direction Sage Congress proposed. The sector left to engage is the one that's central to all this work--the public.

If we could collect wide-ranging data from, say, 50,000 individuals (a May 2013 goal cited by John Wilbanks of Sage Bionetworks, a Kauffman Foundation Fellow), we could uncover a lot of trends that clinical trials are too narrow to turn up. Wilbanks ultimately wants millions of such data samples, and another attendee claimed that "technology will be ready by 2020 for a billion people to maintain their own molecular and longitudinal health data." And Jamie Heywood of PatientsLikeMe, in his keynote, claimed to have demonstrated through shared patient notes that some drugs were ineffective long before the FDA or manufacturers made the discoveries. He decried the current system of validating drugs for use and then failing to follow up with more studies, snorting that, "Validated means that I have ceased the process of learning."

But patients have good reasons to keep a close hold on their health data, fearing that an insurance company, an identity thief, a drug marketer, or even their own employer will find and misuse it. They already have little enough control over it, because the annoying consent forms we always have shoved in our faces when we come to a clinic give away a lot of rights. Current laws allow all kinds of funny business, as shown in the famous case of the Vermont law against data mining, which gave the Supreme Court a chance to say that marketers can do anything they damn please with your data, under the excuse that it's de-identified.

In a noteworthy poll by Sage Bionetworks, 80% of academics claimed they were comfortable sharing their personal health data with family members, but only 31% of citizen advocates would do so. If that 31% is more representative of patients and the general public, how many would open their data to strangers, even when supposedly de-identified?

The Sage Bionetworks approach to patient consent

When patients hold their data back, it's basic research that loses. So Wilbanks and a team have been working for the past year on a "portable consent" procedure. This is meant to overcome the hurdle by which a patient has to be contacted and give consent anew each time a new researcher wants data related to his or her genetics, conditions, or treatment. The ideal behind portable consent is to treat the entire research community as a trusted user.

The current plan for portable consent provides three tiers:

  • Tier 1: No restrictions on data, so long as researchers follow the terms of service. Hopefully, millions of people will choose this tier.

  • Tier 2: A middle ground. Someone with asthma may state that his data can be used only by asthma researchers, for example.

  • Tier 3: Carefully controlled. Meant for data coming from sensitive populations, along with anything that includes genetic information.

Synapse provides a trusted identification service. If researchers find a person with useful characteristics in the last two tiers, and are not authorized automatically to use that person's data, they can contact Synapse with the random number assigned to the person. Synapse keeps the original email address of the person on file and will contact him or her to request consent.
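
A schematic sketch of that flow appears below; the data structures and function names are my own inventions for illustration, not Synapse's actual identification service:

    # Invented data structures and function names, for illustration only;
    # this is not Synapse's actual identification service.
    PARTICIPANTS = {
        "a1b2c3": {"email": "participant@example.org", "tier": 2,
                   "allowed_topics": {"asthma"}},
    }

    def send_consent_request(email, topic):
        print("Would email %s to request consent for %s research" % (email, topic))

    def may_use_data(random_id, research_topic):
        """True if use is automatic; None if the person must be emailed."""
        person = PARTICIPANTS[random_id]
        if person["tier"] == 1:
            return True                      # open tier: terms of service only
        if person["tier"] == 2 and research_topic in person["allowed_topics"]:
            return True                      # middle tier with a matching topic
        send_consent_request(person["email"], research_topic)
        return None                          # tier 3, or tier 2 with no match

    print(may_use_data("a1b2c3", "asthma"))    # True
    print(may_use_data("a1b2c3", "diabetes"))  # sends an email, returns None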

Portable consent also involves a lot of patient education. People will sign up through a software wizard that explains the risks. After choosing portable consent, the person decides how much to put in: 23andMe data, prescriptions, or whatever they choose to release.

Sharon Terry of the Genetic Alliance said that patient advocates currently try to control patient data in order to force researchers to share the work they base on that data. Portable consent loosens this control, but the field may be ready for its more flexible conditions for sharing.

Pharma companies and genetics researchers have lots to gain from access to enormous repositories of patient data. But what do the patients get from it? Leaders in health care already recognize that patients are more than experimental subjects and passive recipients of treatment. The recent ONC proposal for Stage 2 of Meaningful Use includes several requirements to share treatment data with the people being treated (which seems kind of a no-brainer when stated this baldly) and the ONC has a Consumer/Patient Engagement Power Team.

Sage Congress is fully engaged in the patient engagement movement too. One result is the BRIDGE initiative, a joint project of Sage Bionetworks and Ashoka with funding from the Robert Wood Johnson Foundation, to solicit questions and suggestions for research from patients. Researchers can go for years researching a condition without even touching on some symptom that patients care about. Listening to patients in the long run produces more cooperation and more funding.

Portable consent requires a leap of faith, because as Wilbanks admits, releasing aggregates of patient data means that over time, a patient is almost certain to be re-identified. Statistical techniques are getting too sophisticated, and computing power is growing too fast, for anyone to hide behind current tricks such as using only the first three digits of a five-digit postal code. Portable consent requires the data repository to grant access only to bona fide researchers and to set terms of use, including a ban on re-identifying patients. Still, researchers will have rights to do research, redistribute data, and derive products from it. Audits will be built in.

As Kelly Edwards of the University of Washington mentioned, tools and legal contracts can contribute to trust, but trust is ultimately based on shared values. Portable consent, properly done, engages with frameworks like Synapse to create a culture of respect for data.

In fact, I think the combination of the contractual framework in portable consent and a platform like Synapse, with its terms of use, might make a big difference in protecting patient privacy. Seyfert-Margolis cited predictions that 500 million smartphone users will be using medical apps by 2015. But mobile apps are notoriously greedy for personal data and cavalier toward user rights. Suppose all those smartphone users stored their data in a repository with clear terms of use and employed portable consent to grant access to the apps? We might all be safer.

The final article in this series will evaluate the prospects for open research in genetics, with a look at the grip of journal publishing on the field, and some comparisons to the success of free and open source software.

Next: Breaking Open Rewards and Incentives. All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.


April 30 2012

Recombinant Research: Sage Congress promotes data sharing in genetics

Given the exponential drop in the cost of personal genome sequencing (you can get a basic DNA test from 23andMe for a couple hundred dollars, and a full sequence will probably soon come down to one thousand dollars in cost), a new dawn seems to be breaking forth for biological research. Yet the assessment of genetics research at the recent Sage Congress was highly cautionary. Various speakers chided their own field for tilling the same ground over and over, ignoring the urgent needs of patients, and just plain researching the wrong things.

Sage Congress also has some plans to fix all that. These projects include tools for sharing data and storing it in cloud facilities, running challenges, injecting new fertility into collaboration projects, and ways to gather more patient data and bring patients into the planning process. Through two days of demos, keynotes, panels, and breakout sessions, Sage Congress brought its vision to a high-level cohort of 230 attendees from universities, pharmaceutical companies, government health agencies, and others who can make change in the field.

In the course of this series of articles, I'll pinpoint some of the pain points that can force researchers, pharmaceutical companies, doctors, and patients to work together better. I'll offer a look at the importance of public input, legal frameworks for cooperation, the role of standards, and a number of other topics. But we'll start by seeing what Sage Bionetworks and its pals have done over the past year.

Synapse: providing the tools for genetics collaboration

Everybody understands that change is driven by people and the culture they form around them, not by tools, but good tools can make it a heck of a lot easier to drive change. To give genetics researchers the best environment available to share their work, Sage Bionetworks created the Synapse platform.

Synapse recognizes that data sets in biological research are getting too large to share through simple data transfers. For instance, in his keynote about cancer research (where he kindly treated us to pictures of cancer victims during lunch), UC Santa Cruz professor David Haussler announced plans to store 25,000 cases at 200 gigabytes per case in the Cancer Genome Atlas, also known as TCGA in what seems to be a clever pun on the four nucleotides in DNA. Storage requirements thus work out to 5 petabytes, which Haussler wants to be expandable to 20 petabytes. In the face of big data like this, the job becomes moving the code to the data, not moving the data to the code.

Synapse points to data sets contributed by cooperating researchers, but also lets you pull up a console in a web browser to run R or Python code on the data. Some effort goes into tagging each data set with associated metadata: tissue type, species tested, last update, number of samples, etc. Thus, you can search across Synapse to find data sets that are pertinent to your research.
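
The idea of metadata-driven search can be pictured with a toy example like the one below; this is only an illustration of the concept, not the Synapse client API:

    # A toy model of metadata-driven search, not the Synapse client API:
    # each data set carries tags, and a query filters on them.
    DATASETS = [
        {"id": "ds-001", "tissue": "breast", "species": "human",
         "samples": 1981, "last_update": "2012-03-15"},
        {"id": "ds-002", "tissue": "liver", "species": "mouse",
         "samples": 420, "last_update": "2012-01-02"},
    ]

    def find_datasets(**criteria):
        """Return data sets whose metadata matches every given criterion."""
        return [d for d in DATASETS
                if all(d.get(key) == value for key, value in criteria.items())]

    for ds in find_datasets(tissue="breast", species="human"):
        print(ds["id"], "with", ds["samples"], "samples")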

One group working with Synapse has already harmonized and normalized the data sets in TCGA so that a researcher can quickly mix and run stats on them to extract emerging patterns. The effort took about one and a half full-time employees for six months, but the project leader is confident that with the system in place, "we can activate a similar size repository in hours."

This contribution highlights an important principle behind Synapse (appropriately called "viral" by some people in the open source movement): when you have manipulated and improved upon the data you find through Synapse, you should put your work back into Synapse. This work could include cleaning up outlier data, adding metadata, and so on. To make work sharing even easier, Synapse has plans to incorporate the Amazon Simple Workflow Service (SWF). It also hopes to add web interfaces to allow non-programmers to do useful work with data.

The Synapse development effort was an impressive one, coming up with a feature-rich Beta version in a year with just four coders. And Synapse code is entirely open source. So not only is the data distributed, but the creators will be happy for research institutions to set up their own Synapse sites. This may make Synapse more appealing to geneticists who are prevented by inertia from visiting the original Synapse.

Mike Kellen, introducing Synapse, compared its potential impact to that of moving research from a world of journals to a world like GitHub, where people record and share every detail of their work and plans. Along these lines, Synapse records who has used a data set. This has many benefits:

  • Researchers can meet up with others doing related work.

  • It gives public interest advocates a hook with which to call on those who benefit commercially from Synapse--as we hope the pharmaceutical companies will--to contribute money or other resources.

  • Members of the public can monitor accesses for suspicious uses that may be unethical.

There's plenty more work to be done to get data in good shape for sharing. Researchers must agree on some kind of metadata--the dreaded notion of ontologies came up several times--and clean up their data. They must learn about data provenance and versioning.

But sharing is critical for such basics of science as reproducing results. One source estimates that 75% of published results in genetics can't be replicated. A later article in this series will examine a new model in which enough metainformation is shared about a study for it to be reproduced, and even more important to be a foundation for further research.

With this Beta release of Synapse, Sage Bionetworks feels it is ready for a new initiative to promote collaboration in biological research. But how do you get biologists around the world to start using Synapse? For one, try an activity that's gotten popular nowadays: a research challenge.

The Sage DREAM challenge

Sage Bionetworks' DREAM challenge asks genetics researchers to find predictors of the progression of breast cancer. The challenge uses data from 2000 women diagnosed with breast cancer, combining information on DNA alterations affecting how their genes were expressed in the tumors, clinical information about their tumor status, and their outcomes over ten years. The challenge is to build models integrating the alterations with molecular markers and clinical features to predict which women will have the most aggressive disease over a ten year period.

Several hidden aspects of the challenge make it a clever vehicle for Sage Bionetworks' values and goals. First, breast cancer is a scourge whose urgency is matched by its stubborn resistance to diagnosis. The famous 2009 recommendations of the U.S. Preventive Services Task Force, after all the controversy was aired, left us with the dismal truth that we don't know a good way to predict breast cancer. Some women get mastectomies in the total absence of symptoms based just on frightening family histories. In short, breast cancer puts the research and health care communities in a quandary.

We need finer-grained predictors to say who is likely to get breast cancer, and standard research efforts up to now have fallen short. The Sage proposal is to marshal experts in a new way that combines their strengths, asking them to publish models that show the complex interactions between gene targets and influences from the environment. Sage Bionetworks will publish data sets at regular intervals that it uses to measure the predictive ability of each model. A totally fresh data set will be used at the end to choose the winning model.
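
The evaluate-on-held-out-data idea can be sketched in a few lines of Python; this is only a schematic of the concept, using random stand-in data and scikit-learn, not the challenge's actual scoring pipeline:

    # Schematic only: random stand-in data and a simple classifier show the
    # shape of held-out evaluation, not the challenge's actual scoring code.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.RandomState(0)
    X_train = rng.normal(size=(200, 50))      # stand-in for molecular features
    y_train = rng.randint(0, 2, size=200)     # stand-in for "aggressive disease"
    X_fresh = rng.normal(size=(100, 50))      # the fresh data set used at the end
    y_fresh = rng.randint(0, 2, size=100)

    model = LogisticRegression().fit(X_train, y_train)
    auc = roc_auc_score(y_fresh, model.predict_proba(X_fresh)[:, 1])
    print("AUC on the held-out data: %.3f" % auc)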

The process behind the challenge--particularly the need to upload code in order to run it on the Synapse site--automatically forces model builders to publish all their code. According to Stephen Friend, founder of Sage Bionetworks, "this brings a level of accountability, transparency, and reproducibility not previously achieved in clinical data model challenges."

Finally, the process has two more effects: it shows off the huge amount of genetic data that can be accessed through Synapse, and it encourages researchers to look at each other's models in order to boost their own efforts. In less than a month, the challenge already received more than 100 models from 10 sources.

The reward for winning the challenge is publication in a respected journal, the gold medal still sought by academic researchers. (More on shattering this obelisk later in the series.) Science Translational Medicine will accept results of the evaluation as a stand-in for peer review, a real breakthrough for Sage Bionetworks because it validates their software-based, evidence-driven process.

Finally, the DREAM challenge promotes use of the Synapse infrastructure, and in particular the method of bringing the code to the data. Google is donating server space for the challenge, which levels the playing field for researchers, freeing them from paying for their own computing.

A single challenge doesn't solve all the problems of incentives, of course. We still need to persuade researchers to put up their code and data on a kind of genetic GitHub, persuade pharmaceutical companies to support open research, and persuade the general public to share data about their phenomes (life data) and genes--all topics for upcoming articles in the series.

Next: Sage Congress Plans for Patient Engagement. All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.


April 19 2012

Sage Congress: The synthesis of open source with genetics

For several years, O'Reilly Radar has been covering the exciting
potential that open source software, open data, and a general attitude
of sharing and cooperation bring to health care. Along with many
exemplary open source projects in areas directly affecting the
public — such as the VA's Blue Button in electronic medical records
and the Direct project (http://wiki.directproject.org/) in data
exchange — the study of disease is undergoing a paradigm shift.

Sage Bionetworks stands at the
center of a wide range of academic researchers, pharmaceutical
companies, government agencies, and health providers realizing that
the old closed system of tiny teams who race each other to a cure has
got to change. Today's complex health problems, such as Alzheimer's,
AIDS, and cancer, are too big for a single team. And these
institutions are slowly wrenching themselves out of the habit of data
hoarding and finding ways to work together.

A couple weeks ago I talked to the founder of Sage Bionetworks,
Stephen Friend, about recent advances in open source in this area, and
the projects to be highlighted at the upcoming Sage Commons congress
(http://sagecongress.org/). Steve is careful
to call this a "congress" instead of a "conference" because all
attendees are supposed to pitch in and contribute to the meme pool. I
covered Sage Congress in a series of articles last year. The
following podcast ranges over
topics such as:

  • what is Sage Bionetworks [Discussed at the 00:25 mark];
  • the commitment of participants to open source software [Discussed at the 01:01 mark];
  • how open source can support a business model in drug development [Discussed at the 01:40 mark];
  • a look at the upcoming congress [Discussed at the 03:47 mark];
  • citizen-led contributions or network science [Discussed at the 06:12 mark];
  • data sharing philosophy [Discussed at the 09:01 mark];
  • when projects are shared with other institutions [Discussed at the 12:43 mark];
  • how to democratize medicine [Discussed at the 17:10 mark];
  • a portable legal consent approach where the patient controls his or her own data [Discussed at the 20:07 mark];
  • solving the problem of non-sharing in the industry [Discussed at the 22:15 mark]; and
  • key speakers at the congress [Discussed at the 26:35 mark].

Sessions from the congress will be broadcast live via webcast and posted on the Internet.

April 14 2012

MySQL in 2012: Report from Percona Live

The big annual MySQL conference, started by MySQL AB in 2003 and run
by my company O'Reilly for several years, lives on under the able
management of Percona. This
fast-growing company started out doing consulting on MySQL,
particularly in the area of performance, and branched out into
development and many other activities. The principals of this company
wrote the most recent two editions of the popular O'Reilly book High
Performance MySQL (http://shop.oreilly.com/product/0636920022343.do).

Percona started offering conferences a couple years ago and decided to
step in when O'Reilly decided not to run the annual MySQL conference
any more. Oracle did not participate in Percona Live, but has
announced its own MySQL conference
(http://www.oracle.com/us/corporate/press/1577449) for next September.

Percona Live struck me as a success, with about one thousand attendees
and the participation of leading experts from all over the MySQL
world, save for Oracle itself. The big players in the MySQL user
community came out in force: Facebook, HP, Google, Pinterest (the
current darling of the financial crowd), and so on.

The conference followed the pattern laid down by old ones in just
about every way, with the same venue (the Santa Clara Convention
Center, which is near a light-rail but nothing else of interest), the
same food (scrumptious), the standard format of one day of tutorials
and two days of sessions (although with an extra developer day tacked
on, which I will describe later), an expo hall (smaller than before,
but with key participants in the ecosystem), and even community awards
(O'Reilly Media won an award as Corporate Contributor of the Year).
Monty Widenius was back as always with a MariaDB entourage, so it
seemed like old times. The keynotes seemed less well attended than the
ones from previous conferences, but the crowd was persistent and
showed up in impressive numbers for the final events--and I don't
believe it was because everybody thought they might win one of the
door prizes.

Jeremy Zawodny ready to hand out awards.

Two contrasting database deployments

I checked out two well-attended talks by system architects from two
high-traffic sites: Pinterest and craigslist. The radically divergent
paths they took illustrate the range of options open to data centers
nowadays--and the importance of studying these options so a data
center can choose the path appropriate to its mission and
applications.

Jeremy Zawodny (co-author of the first edition of High Performance
MySQL) presented the design of craigslist's site
(http://www.percona.com/live/mysql-conference-2012/sessions/living-sql-and-nosql-craigslist-pragmatic-approach),
which illustrates the model of
software accretion over time and an eager embrace of heterogeneity.
Among their components are:


  • Memcache, lying between the web servers and the MySQL database in
    classic fashion.

  • MySQL to serve live postings, handle abuse, provide data for the
    monitoring system, and cover other immediate needs.

  • MongoDB to store almost 3 billion items related to archived (no longer
    live) postings.

  • HAproxy to direct requests to the proper MySQL server in a cluster.

  • Sphinx for text searches, with
    indexes over all live postings, archived postings, and forums.

  • Redis for temporary items such as counters and blobs.

  • An XFS filesystem for images.

  • Other helper functions that Zawodny lumped together as "async
    services."


Care and feeding of this menagerie becomes a job all in itself.
Although craigslist hires enough developers to assign them to
different areas of expertise, they have also built an object layer
that understands MySQL, cache, Sphinx, MongoDB. The original purpose
of this layer was to aid in migrating old data from MySQL to MongoDB
(a procedure Zawodny admitted was painful and time-consuming) but it
was extended into a useful framework that most developers can use
every day.

Zawodny praised MySQL's durability and its method of replication. But
he admitted that they used MySQL also because it was present when they
started and they were familiar with it. So adopting the newer entrants
into the data store arena was by no means done haphazardly or to try
out cool new tools. Each one precisely meets particular needs of the
site.


For instance, besides being fast and offering built-in sharding,
MongoDB was appealing because they don't have to run ALTER TABLE every
time they add a new field to the database. Old entries coexist happily
with newer ones that have different fields. Zawodny also likes using a
Perl client to interact with a database, and the Perl client provided
by MongoDB is unusually robust because it was developed by 10gen
directly, in contrast to many other datastores where Perl was added by
some random volunteer.

The architecture at craigslist was shrewdly chosen to match their
needs. For instance, because most visitors click on the limited set
of current listings, the Memcache layer handles the vast majority of
hits and the MySQL database has a relatively light load.


However, the MySQL deployment is also carefully designed. Clusters are
vertically partitioned in two nested ways. First, different types of
items are stored on separate partitions. Then, within each type, the
nodes are further divided by the type of query:

  • A single master to handle all writes.

  • A group for very fast reads (such as lookups on a primary key).

  • A group for "long reads" taking a few seconds.

  • A special node called a "thrash handler" for rare, very complex
    queries.

It's up to the application to indicate what kind of query it is
issuing, and HAproxy interprets this information to direct the query
to the proper set of nodes.
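
My guess at how the application side might express this is sketched
below; the pool names and port numbers are invented for illustration,
and the real routing logic lives in craigslist's HAproxy
configuration, which I have not seen:

    # Invented pool names and ports; the real mapping lives in craigslist's
    # HAproxy configuration, which is not shown in the talk notes above.
    QUERY_POOLS = {
        "write":     ("haproxy.internal", 3306),  # the single master
        "fast_read": ("haproxy.internal", 3307),  # primary-key lookups
        "long_read": ("haproxy.internal", 3308),  # reads taking a few seconds
        "thrash":    ("haproxy.internal", 3309),  # rare, very complex queries
    }

    def endpoint_for(query_class):
        """Return the host/port whose backend pool should run this query."""
        return QUERY_POOLS[query_class]

    host, port = endpoint_for("fast_read")
    print("connect to %s:%d for a primary-key lookup" % (host, port))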

Naturally, redundancy is built in at every stage (three HAproxy
instances used in round robin, two Memcache instances holding the same
data, two data centers for the MongoDB archive, etc.).


It's also interesting to note which recent developments have been eschewed by
craigslist. They self-host everything and use no virtualization.
Zawodny admits this leads to an inefficient use of hardware, but
avoids the overhead associated with virtualization. For efficiency,
they have switched to SSDs, allowing them to scale down from 20
servers to only 3. They don't use a CDN, finding that with aggressive
caching and good capacity planning they can handle the load
themselves. They send backups and logs to a SAN.

Let's turn now from the teeming environment of craigslist to the
decidedly lean operation of Pinterest, a much younger and smaller
organization. As presented by Marty Weiner and Yashh Nelapati
(http://www.percona.com/live/mysql-conference-2012/sessions/scaling-pinterest),
when they started web-scale
growth in the Autumn of 2011, they reacted somewhat like craigslist,
but with much less thinking ahead, throwing in all sorts of software
such as Cassandra and MongoDB, and diversifying a bit recklessly.
Finally they came to their senses and went on a design diet. Their
resolution was to focus on MySQL--but the way they made it work is
unique to their data and application.

They decided against using a cluster, afraid that bad application code
could crash everything. Sharding is much simpler and doesn't require
much maintenance. Their advice for implementing MySQL sharding
included:

  • Make sure you have a stable schema, and don't add features for a
    couple months.


  • Remove all joins and complex queries for a while.

  • Do simple shards first, such as moving a huge table into its own
    database.

They use Pyres, a
Python clone of Resque, to move data into shards.

However, sharding imposes severe constraints that led them to
hand-crafted work-arounds.

Many sites want to leave open the possibility for moving data between
shards. This is useful, for instance, if they shard along some
dimension such as age or country, and they suddenly experience a rush
of new people in their 60s or from China. The implementation of such a
plan requires a good deal of coding, described in the O'Reilly book
MySQL High Availability
(http://shop.oreilly.com/product/9780596807290.do), including the
creation of a service that just
accepts IDs and determines what shard currently contains the ID.

The Pinterest staff decided the ID service would introduce a single
point of failure, and decided just to hard-code a shard ID in every ID
assigned to a row. This means they never move data between shards,
although shards can be moved bodily to new nodes. I think this works
for Pinterest because they shard on arbitrary IDs and don't have a
need to rebalance shards.
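
A bit-packing sketch of the idea follows; the field widths are my own
guess, not Pinterest's actual ID layout:

    # Field widths are a guess, not Pinterest's actual ID layout.
    SHARD_BITS = 16
    LOCAL_BITS = 46   # auto-increment value local to the shard

    def make_id(shard_id, local_id):
        return (shard_id << LOCAL_BITS) | local_id

    def parse_id(global_id):
        return global_id >> LOCAL_BITS, global_id & ((1 << LOCAL_BITS) - 1)

    pin_id = make_id(shard_id=42, local_id=123456)
    print(parse_id(pin_id))   # (42, 123456): no lookup service needed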

Even more interesting is how they avoid joins. Suppose they want to
retrieve all pins associated with a certain board associated with a
certain user. In classical, normalized relational database practice,
they'd have to do a join on the user, board, and pin tables. But
Pinterest maintains extra mapping tables. One table maps users to
boards, while another maps boards to pins. They query the
user-to-board table to get the right board, query the board-to-pin
table to get the right pin, and then do simple queries without joins
on the tables with the real data. In a way, they implement a custom
NoSQL model on top of a relational database.
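
In rough terms (with table and column names invented for the
example), the lookup becomes three simple keyed queries instead of one
join; in production each query would of course go to the shard encoded
in the relevant ID:

    # Table and column names invented for the example; in production each
    # query would be sent to the shard encoded in the relevant ID.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE user_boards (user_id INTEGER, board_id INTEGER);
        CREATE TABLE board_pins  (board_id INTEGER, pin_id INTEGER);
        CREATE TABLE pins        (pin_id INTEGER PRIMARY KEY, description TEXT);
        INSERT INTO user_boards VALUES (1, 10);
        INSERT INTO board_pins  VALUES (10, 100), (10, 101);
        INSERT INTO pins        VALUES (100, 'first pin'), (101, 'second pin');
    """)

    # Three simple keyed lookups instead of one join:
    board_ids = [r[0] for r in db.execute(
        "SELECT board_id FROM user_boards WHERE user_id = ?", (1,))]
    pin_ids = [r[0] for r in db.execute(
        "SELECT pin_id FROM board_pins WHERE board_id = ?", (board_ids[0],))]
    for pin_id in pin_ids:
        row = db.execute(
            "SELECT description FROM pins WHERE pin_id = ?", (pin_id,)).fetchone()
        print(row[0])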

Pinterest does use Memcache and Redis in addition to MySQL. As with
craigslist, they find that most queries can be handled by Memcache.
And the actual images are stored in S3, an interesting choice for a
site that is already enormous.

It seems to me that the data and application design behind Pinterest
would have made it a good candidate for a non-ACID datastore. They
chose to stick with MySQL, but like organizations that use NoSQL
solutions, they relinquished key aspects of the relational way of
doing things. They made calculated trade-offs that worked for their
particular needs.

My take-away from these two fascinating and well-attended talks was
that you must understand your application, its scaling and
performance needs, and its data structure, to know what you can
sacrifice and which solution gives you your sweet spot. craigslist
solved its problem through the very precise application of different
tools, each with a particular job that fulfilled craigslist's
requirements. Pinterest made its own calculations and found an
entirely different solution, depending on some clever hand-coding
instead of off-the-shelf tools.

Current and future MySQL

The conference keynotes surveyed the state of MySQL and some
predictions about where it will go.

Conference co-chair Sarah Novotny at the keynotes.

The world of MySQL is much more complicated than it was a couple years
ago, before Percona got heavily into the work of releasing patches to
InnoDB, before they created entirely new pieces of software, and
before Monty started MariaDB with the express goal of making a better
MySQL than MySQL. You can now choose among Oracle's official MySQL
releases, Percona's supported version, and MariaDB's supported
version. Because these are all open source, a major user such as
Facebook can even apply patches to get the newest features.

Nor are these different versions true forks, because Percona and
MariaDB create their enhancements as patches that they pass back to
Oracle, and Oracle is happy to include many of them in a later
release. I haven't even touched on the commercial ecosystem around
MySQL, which I'll look at later in this article.

In his opening keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-mysql-evolution),
Percona founder Peter Zaitsev praised the latest MySQL
release by Oracle. With graceful balance he expressed pleasure that
the features most users need are in the open (community) edition, but
allowed that the proprietary extensions are useful too. In short, he
declared that MySQL is less buggy and has more features than ever.


The former CEO of MySQL AB, Mårten Mickos, in his own keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-making-lamp-cloud),
also found that MySQL is
doing well under Oracle's wing. He just chastised Oracle for failing
to work as well as it should with potential partners (by which I
assume he meant Percona and MariaDB). He lauded their community
managers but said the rest of the company should support them more.

Keynote by Mårten Mickos.

href="http://www.percona.com/live/mysql-conference-2012/sessions/keynote-new-mysql-cloud-ecosystem">Brian
Aker presented an OpenStack MySQL service developed by his current
employer, Hewlett-Packard. His keynote retold the story that had led
over the years to his developing href="https://launchpad.net/drizzle">Drizzle (a true fork of MySQL
that tries to return it to its lightweight, Web-friendly roots) and
eventually working on cloud computing for HP. He described modularity,
effective use of multiple cores, and cloud deployment as the future of
databases.

A panel
(http://www.percona.com/live/mysql-conference-2012/sessions/future-perfect-road-ahead-mysql)
on the second day of the conference brought together high-level
managers from many of the companies that have entered the MySQL space
from a variety of directions in a high-level discussion of the
database engine's future. Like most panels, the conversation ranged
over a variety of topics--NoSQL, modular architecture, cloud
computing--but hit some depth only on the topic of security, which was
not represented very strongly at the conference and was discussed here
at the insistence of Slavik Markovich from McAfee.

Keynote by Brian Aker.


Many of the conference sessions disappointed me, being either very
high level (although presumably useful to people who are really new to
various topics, such as Hadoop or flash memory) or unvarnished
marketing pitches. I may have judged the latter too harshly though,
because a decent number of attendees came, and stayed to the end, and
crowded around the speakers for information.

Two talks, though, were so fast-paced and loaded with detail that I
couldn't possibly keep my typing up with the speaker.

One such talk was the keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-what-comes-next)
by Mark Callaghan of Facebook. (Like the other keynotes, it should be
posted online soon.) A smattering of points from it:


  • Percona and MariaDB are adding critical features that make replication
    and InnoDB work better.

  • When a logical backup runs, it is responsible for 50% of IOPS.

  • Defragmenting InnoDB improves compression.

  • Resharding is not worthwhile for a large, busy site (an insight also
    discovered by Pinterest, as I reported earlier)


The other fact-filled talk
(http://www.percona.com/live/mysql-conference-2012/sessions/using-nosql-innodb-memcached)
was by
Yoshinori Matsunobu of Facebook, and concerned how to achieve
NoSQL-like speeds while sticking with MySQL and InnoDB. Much of the
talk discussed an InnoDB memcached plugin, which unfortunately is
still in the "lab" or "pre-alpha" stage. But he also suggested some
other ways to better performance, some involving Memcache and others
more round-about:

  • Coding directly with the storage engine API, which is storage-engine
    independent.


  • Using HandlerSocket, which queues write requests and performs them
    through a single thread, avoiding costly fsync() calls. This can
    achieve 30,000 writes per second, robustly.

Matsunobu claimed that many optimizations are available within MySQL
because a lot of data can fit in main memory. For instance, if you
have 10 million users and store 400 bytes per user, the entire user
table takes only about 4 GB and fits comfortably in memory. Matsunobu's
tests have shown that most CPU time
in MySQL is spent in functions that are not essential for processing
data, such as opening and closing a table. Each statement opens a
separate connection, which in turn requires opening and closing the
table again. Furthermore, a lot of data is sent over the wire besides
the specific fields requested by the client. The solutions in the talk
evade all this overhead.


The commercial ecosystem

Both as vendors and as sponsors, a number of companies have always
lent another dimension to the MySQL conference. Some of these really
have nothing to do with MySQL, but offer drop-in replacements for it.
Others really find a niche for MySQL users. Here are a few that I
happened to talk to:

  • Clustrix provides a very
    different architecture for relational data. They handle sharding
    automatically, permitting such success stories as the massive scaling
    up of the social media site Massive Media NV without extra
    administrative work. Clustrix also claims to be more efficient by
    breaking queries into fragments (such as the WHERE clauses of joins)
    and executing them on different nodes, passing around only the data
    produced by each clause.

  • Akiban also offers faster
    execution through a radically different organization of data. They
    flatten the tables of a normalized database into a single
    data structure: for instance, a customer and his orders may be located
    sequentially in memory. This seems to me an import of the document
    store model into the relational model. Creating, in effect, an object
    that maps pretty closely to the objects used in the application
    program, Akiban allows common queries to be executed very quickly, and
    could be deployed as an adjunct to a MySQL database.

  • Tokutek produced a drop-in
    replacement for InnoDB. The founders developed a new data structure
    called a fractal tree as a faster alternative to the B-tree structures
    normally used for indexes. The existence of Tokutek vindicates both
    the open source distribution of MySQL and its unique modular design,
    because these allowed Tokutek's founders to do what they do
    best--create a new storage engine--without needing to create a whole
    database engine with the related tools and interfaces it would
    require.

  • Nimbus Data Systems creates a
    flash-based hardware appliance that can serve as a NAS or SAN to
    support MySQL. They support a large number of standard data transfer
    protocols, such as InfiniBand, and provide such optimizations as
    caching writes in DRAM and making sure they write complete 64KB blocks
    to flash, thus speeding up transfers as well as preserving the life of
    the flash.


Post-conference events

A low-key developer's day followed Percona Live on Friday. I talked to
people in the Drizzle and
Sphinx tracks.

Because Drizzle is a relatively young project, its talks were aimed
mostly at developers interested in contributing. I heard talks about
their kewpie test framework and about build and release conventions.
But in keeping with its goal of making database use easy and
lightweight, the project has added some cool features.

Thanks to a JSON interface and a built-in web server, Drizzle now
presents you with a
Web interface for entering SQL commands. The Web interface translates
Drizzle's output to simple HTML tables for display, but you can also
capture the JSON directly, making programmatic access to Drizzle
easier. A developer explained to me that you can also store JSON
directly in Drizzle; it is simply stored as a single text column and
the JSON fields can be queried directly. This reminded me of an XQuery
interface added to some database years ago. There too, the XML was
simply stored as a text field and a new interface was added to run the
XQuery selects.
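
As a rough illustration of the kind of programmatic access this
enables, the sketch below POSTs a query to Drizzle's built-in web
server from PHP and decodes the JSON reply. The port, the /sql
endpoint, and the payload shape are my assumptions rather than a
documented contract, so check the json_server plugin's documentation
for the build you are running.

<?php
// Sketch only: querying Drizzle's JSON-over-HTTP interface.
// The port (8086), the /sql endpoint, and the payload format are
// assumptions; consult the plugin docs for your build.

$payload = json_encode(array('query' => 'SELECT id, name FROM users LIMIT 5'));

$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => $payload,
    ),
));

$response = file_get_contents('http://localhost:8086/sql', false, $context);
if ($response === false) {
    die("no response from Drizzle's web server\n");
}

// The web UI renders results as HTML tables; here we keep the raw JSON.
$rows = json_decode($response, true);
print_r($rows);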

Sphinx, in contrast to Drizzle, is a mature product with commercial
support and (as mentioned earlier in the article) production
deployments at places such as craigslist, as well as an O'Reilly
book (http://shop.oreilly.com/product/9780596809539.do). I understood
better, after attending today's sessions, what makes Sphinx
appealing. Its quality is unusually high, due to the use of
sophisticated ranking algorithms from the research literature. The
team is looking at recent research to incorporate even better
algorithms. It is also fast and scales well. Finally, integration
with MySQL is very clean, so it's easy to issue queries to Sphinx and
pick up results.
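
Much of that clean integration comes down to SphinxQL, which lets an
ordinary MySQL client talk to Sphinx's search daemon. The sketch
below shows roughly how that might look from PHP; the port (9306 is
the conventional SphinxQL listener) and the index name are
placeholders for whatever your sphinx.conf defines.

<?php
// Sketch: running a full-text query over SphinxQL, which speaks the MySQL
// wire protocol. Port 9306 is the conventional SphinxQL listener; the index
// name below is a placeholder for one defined in sphinx.conf.

$sphinx = new mysqli('127.0.0.1', '', '', '', 9306);
if ($sphinx->connect_error) {
    die("could not reach searchd: " . $sphinx->connect_error . "\n");
}

// MATCH() runs the full-text search; Sphinx returns document IDs plus any
// attributes the index defines, ranked by its scoring algorithms.
$result = $sphinx->query(
    "SELECT * FROM articles_index WHERE MATCH('fractal tree') LIMIT 10"
);
if ($result === false) {
    die("query failed: " . $sphinx->error . "\n");
}

while ($row = $result->fetch_assoc()) {
    print_r($row);   // typically you then fetch the full rows from MySQL by ID
}
$sphinx->close();

A common pattern is to keep the documents themselves in MySQL and use
the IDs returned by Sphinx to pull the full records, so the
application never has to leave the familiar client libraries.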

Recent enhancements include an add-on called fSphinx
(https://github.com/alexksikes/fSphinx) to make faceted searches
faster (through caching) and easier, and access to Bayesian Sets to
find "items similar to this one." In Sphinx itself, the team is
working to add high availability, include a new morphology (stemming,
etc.) engine that handles German, improve compression, and make other
enhancements.


The day ended with a reception and copious glasses of Monty Widenius's
notorious licorice-flavored vodka, an ending that distinguishes the
MySQL conference from others for all time.

April 06 2012

Promoting and documenting a small software project: VoIP Drupal update

Isn't the integration of mobile phones and the Web one of the hot topics in modern technology? If so, VoIP Drupal should become a fixture of web development and administration. I have been meeting with leaders of the project to help with their documentation and publicity. I reported on my first meeting with them here on Radar, and this posting is one of a series of follow-ups I plan to write.

Immediate pressures

When Leo Burd, the lead developer of VoIP Drupal, first pulled together a small cohort of supporters to help with documentation and promotion, only three weeks were left before DrupalCon, the major Drupal annual conference. Leo was doing some presentations there, and the pressing question was what users would want in order to get interested in VoIP Drupal, learn more about it, and be able to get started.

Leo was indefatigable. He planned to get some new modules finished before the conference, but in addition to coding and preparing his own presentations, he wanted to address the lack of introductory materials because it might get in the way of enticing new users to try VoIP Drupal.

In the end, Leo led a webinar to drum up interest. The little group of half a dozen self-selected fans reviewed his slides, sat in on a preview, and helped whip it into a really well-focused, hard-hitting survey. This webinar had 94 participants from 19 countries and more than 40 companies.

Michele Metts, a non-profit activist and Drupal consultant, also stepped up to create slides and lead webinars explaining VoIP Drupal basics.

Slide from a VoIP Drupal webinar

Although we agreed that many parts of the system needed more documentation, it was a more effective use of time at this point to do webinars. VoIP Drupal has many interactive aspects that are best shown off through demonstrations. As I'll explain, documentation would require more resources.

Perceptions and challenges

Responses to the webinar and to Leo's DrupalCon presentation helped us understand better what it would take to win more adherents to this useful tool. Leo reported that few people knew of the existence of VoIP Drupal, but that when they heard of it, they assumed it would be expensive. His impression was that they knew traditional PBX systems and didn't realize how much more low-cost and light-weight VoIP is.

In my opinion, Web administrators (and many content providers) still see the Web and the telephone as separate worlds, never to meet. This despite the increasing popularity of running queries and Internet apps on mobile devices. We have to address server-side apathy. There should come a day when the things VoIP Drupal does are taken for granted: people leaving phone numbers on Web sites to receive SMS or voice updates, pressing Web buttons on their cell phones to get an interactive voice menu, etc.

The versatility of VoIP Drupal gives it fantastic potential but makes it harder to explain in an elevator speech. The ways in which voice can be integrated with the Web are nearly infinite. In addition, two distinctly different settings can benefit from VoIP. One consists of highly structured corporate sites, which can use VoIP Drupal for such typical bureaucratic tasks as offering interactive voice menus and routing customers to extensions. The other consists of sites serving underdeveloped areas, where people are more adept at using their phones than dealing with text-based Web sites. We need different pitches for each potential user.

Finally (and here my training as an editor shows), there are a number of different audiences who need to understand VoIP Drupal on different levels. Content providers need to understand what it can do and see models for incorporating it into their Drupal sites. Administrators need to understand how to find scripts and offer them to content providers. And programmers need to learn how to write fresh scripts. We also want to attract open source developers who could help enhance and maintain VoIP Drupal. There is even a tool called Visual VoIP Drupal that reduces the programming skill required to create a new script nearly to a minimum--which creates yet another audience.

Documentation needs

My goal in joining the informal VoIP Drupal promotion team was to use this project as a test case for exploring what software projects do for documentation, and how they could do better.

A number of sites have successfully implemented VoIP Drupal, some of them in quite sophisticated ways, but you could call these the alpha developers or early adopters. They managed, for instance, to find the reference documentation that requires several clicks to access, and which I did not notice for weeks. And needless to say, they got by with little other documentation, because the tutorials are extremely rudimentary.

I think that VoIP Drupal documentation is typical of early software projects--and in fact better than many. Projects tend to toss the reader a brief tutorial, provide some examples with inadequate explanatory text, and finally dunk the reader in the middle of an API reference. The tutorial tends to be fragile, in the sense that any problem encountered by the reader (a missing step, a change in environment) leaves her in an unrecoverable state, because there is no documentation about the way the system works and its assumptions. And the context for understanding reference documentation is missing, so that it can be used only by experienced developers who have seen similar programs in other environments.

Useful VoIP Drupal documentation includes (these are examples):

Our team knew that more descriptive text was needed to pull together these pieces, but whoever wrote a document would need some intensive time with a core developer. This was unfeasible in the rush to get the software in shape for DrupalCon. I found out about the difficulties of producing documents first-hand when I decided to tackle Visual VoIP Drupal, which looked simple and intuitive. Unfortunately, there were a lot of oddities in its behavior, and the simplicity of the interface didn't save me from having to know some of the subtler and less documented aspects of VoIP Drupal programming, such as handling input variables.

In a recent teleconference, I asked a bunch of preliminary questions and got a better idea of what documentation already covers, as well as a starting point for doing more documentation.

Current documentation tasks, in my opinion, include:

  • Expand the tutorials to show more of the capabilities of VoIP Drupal.

  • Provide explanations of key topics, such as different ways to handle voice input, keyboard input, and the metainformation provided by VoIP Drupal about each call. The developers' provision of a simple scripting system on top of PHP, and even more the creation of Visual VoIP Drupal, demonstrates their commitment to reaching non-programmers, but we have to follow through by filling in the background they lack.

  • Create a few videos or webinars on Visual VoIP Drupal.

  • Make links to the reference documentation more prominent, and link to it liberally in the tutorials and background documents. The use of doc strings from the code is reasonable for reference documentation, because nobody is asking it to look pretty and we want to maximize the probability that it will stay up to date.

  • Ask the community for examples and case studies, and describe what makes each one an interesting use of VoIP Drupal.

There's plenty that could be done, such as describing how to integrate VoIP Drupal into existing PHP code (and therefore more fully into existing Drupal pages), but that can be postponed. Leo said he's "particularly interested in working with Micky and the rest of this group in the creation of a Visual VoIP Drupal webinar, and one about things we can do with VoIP Drupal right out of the box, with no or minimum programming."

What motivates people like Michele and me to put so much time into this project? Certainly, Michele wants to promote a project that she uses in her own work so that it thrives and continues to evolve. I can use what I learn from this work to provide services to other open source communities. But beyond all these individual rewards is a gut feeling that VoIP Drupal is cool. Using it is fun, and talking about it is also fun. Projects have achieved success for more light-weight reasons.

April 05 2012

Steep climb for National Cancer Institute toward open source collaboration

Although a lot of government agencies produce open source software, hardly any develop relationships with a community of outside programmers, testers, and other contributors. I recently spoke to John Speakman of the National Cancer Institute to learn about their crowdsourcing initiative and the barriers they've encountered.

First let's orient ourselves a bit--forgive me for dumping out a lot of abbreviations and organizational affiliations here. The NCI is part of the National Institutes of Health. Speakman is the Chief Program Officer for NCI's Center for Biomedical Informatics and Information Technology. Their major open source software initiative is the Cancer Biomedical Informatics Grid (caBIG), which supports tools for transferring and manipulating cancer research data. For example, it provides access to data classifying the carcinogenic aspects of genes (The Cancer Genome Atlas) and resources to help researchers ask questions of and visualize this data (the Cancer Molecular Analysis Portal).

Plenty of outside researchers use caBIG software, but it's a one-way street, somewhat in the way the Department of Veterans Affairs used to release its VistA software. NCI sees the advantages of a give-and-take such as the CONNECT project has achieved, through assiduous cultivation of interested outside contributors, and wants to wean its outside users away from the dependent relationship that has been all take and no give. And even the VA decided last year that a more collaborative arrangement for VistA would benefit them, thus putting the software under the guidance of an independent non-profit, the Open Source Electronic Health Record Agent (OSEHRA).

Another model is Forge.mil, which the Department of Defense set up with the help of CollabNet, the well-known organization in charge of the Subversion revision control tool. Forge.mil represents a collaboration between the DoD and private contractors, encouraging them to create shared libraries that hopefully increase each contractor's productivity, but it is not open source.

The OSEHRA model--creating an independent, non-government custodian--seems a robust solution, although it takes a lot of effort and risks failure if the organization can't create a community around the project. (Communities don't just spring into being at the snap of a bureaucrat's fingers, as many corporations have found to their regret.) In the case of CONNECT, the independent Alembic Foundation stepped in to fill the gap after a lawsuit stalled CONNECT's development within the government. According to Alembic co-founder David Riley, with the contract issues resolved, CONNECT's original sponsor--the Office of the National Coordinator--is spinning off CONNECT to a private sector, open source entity, and work is underway to merge the two baselines.

Whether an agency manages its own project or spins off management, it has to invest a lot of work to turn an internal project into one that appeals to outside developers. This burden has been discovered by many private corporations as well as public entities. Tasks include:

  • Setting up public repositories for code and data.

  • Creating a clean software package with good version control that makes downloading and uploading simple.

  • Possibly adding an API to encourage third-party plugins, an effort that may require a good deal of refactoring and a definition of clear interfaces.

  • Substantially adding to the documentation.

  • General purging of internal code and data (sometimes even passwords!) that get in the way of general use.

Companies and institutions have also learned that "build it and they will come" doesn't usually work. An open source or open data initiative must be promoted vigorously, usually with challenges and competitions such as the ones the Department of Health and Human Services offers in its annual Health Data Initiative forums (a.k.a. datapaloozas).

With these considerations in mind, the NCI decided in the summer of 2011 to start looking for guidance and potential collaborators. Here, laws designed long ago to combat cronyism put up barriers. The NCI was not allowed to contact anyone it wanted out of the blue. Instead, it had to issue a Request for Information (RFI) and talk to the people who responded. Although the RFI went online, it obviously wasn't widely seen. After all, do you regularly look for RFIs and RFPs from government agencies? If so, I can safely guess that you're paid by a large company or lobbying agency to follow a particular area of interest.

RFIs and RFPs are released as a gesture toward transparency, but in reality they just make it easier for the usual crowd of established contractors and lobbyists to build on the relationships they already have with agencies. And true to form, the NCI received only a limited set of responses and was frustrated in its attempts to talk to new actors with the expertise it needed for its open source efforts.

And because the RFI had to allow a limited time window for responses, there is no point in responding to it now.

Still, Speakman and his colleagues are educating themselves and meeting with stakeholders. Cancer research is a hot topic drawing zealous attention from many academic and commercial entities, and they're hungry for data. Already, the NCI is encouraged by the initial positive response from the cancer informatics community, many of whom are eager to see the caBIG software deposited in an open repository like GitHub right away. Luckily, HHS has already negotiated terms of service with GitHub and SourceForge, removing at least one important barrier to entry. The NCI is packaging its first tool (a laboratory information management system called caLIMS) for deposit into a public repository. So I'm hoping the NCI is too caBIG to fail.

February 23 2012

Report from HIMSS 2012: toward interoperability and openness

I was wondering how it would feel to be in the midst of 35,000 people whose livelihoods are driven by the decisions of a large institution at the moment when that institution releases a major set of rules. I didn't really find out, though. The 35,000 people I speak of are the attendees of the HIMSS conference and the institution is the Department of Health and Human Services. But HHS just sort of half-released the rules (called Stage 2 of meaningful use), telling us that they would appear online tomorrow and meanwhile rushing over a few of the key points in a presentation that drew overflow crowds in two rooms.

The reaction, I sensed, was a mix of relief and frustration. Relief because Farzad Mostashari, National Coordinator for Health Information Technology, promised us that the rules would be familiar and hew closely to what advisors had requested. Frustration, however, at not seeing the details. The few snippets put up on the screen contained enough ambiguities and poorly worded phrases that I'm glad there's a 60-day comment period before the final rules are adopted.

There isn't much one can say about the Stage 2 rules until they are posted and the experts have a chance to parse them closely, and I'm a bit reluctant to throw onto the Internet one of potentially 35,000 reactions to the announcement, but a few points struck me enough to be worth writing about. Mostashari used his pulpit for several pronouncements about the rules:

  • HHS would push ahead on goals for interoperability and health information exchange. "We can't wait five years," said Mostashari. He emphasized the phrase "standard-based" in referring to HIE.

  • Patient engagement was another priority. To attest to Stage 2, institutions will have to allow at least half their patients to download and transfer their records.

  • They would strive for continuous quality improvement and clinical decision support, key goals enabled by the building blocks of meaningful use.

Two key pillars of the Stage 2 announcement are requirements to use the Direct project for data exchange and HL7's consolidated CDA for the format (the only data exchange I heard mentioned was a summary of care, which is all that most institutions exchange when a patient is referred).

The announcement demonstrates the confidence that HHS has in the Direct project, which it launched just a couple of years ago and which exemplifies a successful joint government/private sector project. Direct will allow health care providers of any size and financial endowment to use email or the Web to share summaries of care. (I mentioned it in yesterday's article.) With Direct, we can hope to leave the cumbersome and costly days of health information exchange behind. The older and more complex CONNECT project will be an option as well.

The other half of that announcement, regarding adoption of the CDA (incarnated as a CCD for summaries of care), is a loss for the older CCR format, which was an option in Stage 1. The CCR was the Silicon Valley version of health data, a sleek and consistent XML format used by Google Health and Microsoft HealthVault. But health care experts criticized the CCR as not rich enough to convey the information institutions need, so it lost out to the more complex CCD.

The news on formats is good overall, though. The HL7 consortium, which has historically funded itself by requiring organizations to become members in order to use its standards, is opening some of them for free use. This is critical for the development of open source projects. And at an HL7 panel today, a spokesperson said they would like to head more in the direction of free licensing and have to determine whether they can survive financially while doing so.

So I'm feeling optimistic that U.S. health care is moving "toward interoperability and openness," the phrase I used in the title of this article and also used in a posting from HIMSS two years ago.

HHS allowed late-coming institutions (those who began the Stage 1 process in 2011) to continue at Stage 1 for another year. This is welcome because they have so much work to do, but means that providers who want to demonstrate Stage 2 information exchange may have trouble because they can't do it with other providers who are ready only for Stage 1.

HHS endorsed some other standards today as well, notably SNOMED for diseases and LRI for lab results. Another nice tidbit from the summit is the requirement to use electronic medication administration (for instance, bar codes to check for errors in giving medicine) to foster patient safety.

February 22 2012

Report from HIMSS: health care tries to leap the chasm from the average to the superb

I couldn't attend the session today on StealthVest--and small surprise. Who wouldn't want to come see an Arduino-based garment that can hold numerous health-monitoring devices in a way that is supposed to feel like a completely normal piece of clothing? As with many events at the HIMSS conference, which has registered over 35,000 people (at least four thousand more than last year), the StealthVest presentation drew an overflow crowd.

StealthVest sounds incredibly cool (and I may have another chance to report on it Thursday), but when I gave up on getting into the talk I walked downstairs to a session that sounds kind of boring but may actually be more significant: Practical Application of Control Theory to Improve Capacity in a Clinical Setting.

The speakers on this session, from Banner Gateway Medical Center in Gilbert, Arizona, laid out a fairly standard use of analytics to predict when the hospital units are likely to exceed their capacity, and then to reschedule patients and provider schedules to smooth out the curve. The basic idea comes from chemical engineering, and requires them to monitor all the factors that lead patients to come in to the hospital and that determine how long they stay. Queuing theory can show when things are likely to get tight. Hospitals care a lot about these workflow issues, as Fred Trotter and David Uhlman discuss in the O'Reilly book Beyond Meaningful Use, and they have a real effect on patient care too.

The reason I find this topic interesting is that capacity planning leads fairly quickly to visible cost savings. So hospitals are likely to do it. Furthermore, once they go down the path of collecting long-term data and crunching it, they may extend the practice to clinical decision support, public health reporting, and other things that can make a big difference to patient care.

A few stats about data in U.S. health care

Do we need a big push to do such things? We sure do, and that's why meaningful use was introduced into HITECH sections of the American Recovery and Reinvestment Act. HHS released mounds of government health data on Health.data.gov hoping to serve a similar purpose. Let's just take a look at how far the United States is from using its health data effectively.

  • Last November, a CompTIA survey (reported by Health Care IT News) found that only 28% of providers have comprehensive EHRs in use, and another 17% have partial implementations. One has to remember that even a "comprehensive" EHR is unlikely to support the sophisticated data mining, information exchange, and process improvement that will eventually lead to lower costs and better care.

  • According to a recent Beacon Partners survey (PDF), half of the responding institutions have not yet set up an infrastructure for pursuing health information exchange, although 70% consider it a priority. The main problem, according to a HIMSS survey, is budget: HIEs are shockingly expensive. There's more to this story, which I reported on from a recent conference in Massachusetts.

Stats like these had to be kept in mind as HIMSS board chair Charlene S. Underwood extolled the organization's achievements in the morning keynote. HIMSS has promoted good causes, but only recently has it addressed the cost, interoperability, and open source issues that can allow health IT to break out beyond the elite of institutions large or sophisticated enough to adopt the right practices.

As signs of change, I am particularly happy to hear of HIMSS's new collaboration with Open Health Tools and their acquisition of the mHealth summit. These should guide the health care field toward more patient engagement and adaptable computer systems. HIEs are another area crying out for change.

An HIE optimist

With the flaccid figures for HIE adoption in mind, I met Charles Parisot, who chairs interoperability standards and testing for EHRA, HIMSS's Electronic Health Records Association. The biggest EHR vendors and HIEs come together in this association, and Parisot was just stoked with positive stories about their advances.

His take on the cost of HIEs is that most of them take a brute-force approach that doesn't work: they copy the data from each institution into a central database, which is hard to manage from many standpoints. The HIEs that have done it right (notably in New York state and parts of Tennessee) are sleek and low-cost. The solution involves the following (a toy sketch in code follows the list):

  • Keeping the data at the health care providers, and storing in the HIE only some glue data that associates the patient and the type of data to the provider.

  • Keeping all metadata about formats out of the HIE itself, so that new formats, new codes, and new types of data can easily be introduced into the system without recoding the HIE.

  • Breaking information exchange down into constituent parts--the data itself, the exchange protocols, identification, standards for encryption and integrity, etc.--and finding standard solutions for each of these.
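
To make the first two points concrete, here is a deliberately toy sketch--not any vendor's schema--of the "glue data" such an HIE might keep: a record locator that maps a patient and a type of data to the provider that actually holds the record, leaving both the clinical data and the format details at the source. All identifiers and endpoints are invented.

<?php
// Toy illustration of a record-locator HIE: the exchange stores only
// pointers (patient, data type, holding provider), never the clinical
// data itself. Identifiers and endpoints are invented for illustration.

$recordLocator = array(
    'patient-12345' => array(
        array('type' => 'summary-of-care', 'provider' => 'https://ehr.generalhospital.example/fetch'),
        array('type' => 'lab-results',     'provider' => 'https://lis.communitylab.example/fetch'),
    ),
);

// A requesting provider asks the HIE where a document lives, then pulls
// it directly from the holder.
function locate($locator, $patientId, $type) {
    if (!isset($locator[$patientId])) {
        return null;
    }
    foreach ($locator[$patientId] as $entry) {
        if ($entry['type'] === $type) {
            return $entry['provider'];
        }
    }
    return null;
}

echo locate($recordLocator, 'patient-12345', 'lab-results'), "\n";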

So EHRA has developed profiles (also known by the ONC term "implementation specifications") that indicate which standard is used for each part of the data exchange. Metadata can be stored in the core HL7 document, the Clinical Document Architecture, and differences between implementations of HL7 documents by different vendors can also be documented.

A view of different architectures in their approach can be found in an EHRA white paper, Supporting a Robust Health Information Exchange Strategy with a Pragmatic Transport Framework. As testament to their success, Parisot claimed that the interoperability lab (a huge part of the exhibit hall floor space, and a popular destination for attendees) could set up the software connecting all the vendors' and HIEs' systems in one hour.

I asked him about the simple email solution promised by the government's Direct project, and whether that may be the path forward for small, cash-strapped providers. He accepted that Direct is part of the solution, but warned that it doesn't make things so simple. Unless two providers have a pre-existing relationship, they need to be part of a directory or even a set of federated directories, and assure their identities through digital signatures.

And what if a large hospital receives hundreds of email messages a day from various doctors who don't even know to whom their patients are being referred? Parisot says metadata must accompany any communications--and he's found that it's more effective for institutions to pull the data they want than for referring physicians to push it.

Intelligence for hospitals

Finally, Parisot told me EHRA has developed standards for submitting data to EHRs from 350 types of devices, and has 50 manufacturers working on devices that use these standards. I visited the booth of iSirona as an example. They accept basic monitoring data such as pulses from different systems that use different formats, and translate over 50 items of information into a simple text format that they transmit to an EHR. They also add networking to devices that communicate only over cables. Outlying values can be rejected by a person monitoring the data. The vendor pointed out that format translation will be necessary for some time to come, because neither vendors nor hospitals will replace their devices simply to implement a new data transfer protocol.

For more about devices, I dropped by one of the most entertaining parts of the conference, the Intelligent Hospital Pavilion. Here, after a badge scan, you are somberly led through a series of locked doors into simulated hospital rooms where you get to watch actors in nursing outfits work with lifesize dolls and check innumerable monitors. I think the information overload is barely ameliorated and may be worsened by the arrays of constantly updated screens.

But the background presentation is persuasive: by attaching RFIDs and all sorts of other devices to everything from people to equipment, and basically making the hospital more like a factory, providers can radically speed up responses in emergency situations and reduce errors. Some devices use the ISM "junk" band, whereas more critical ones use dedicated spectrum. Redundancy is built in throughout the background servers.

Waiting for the main event

The U.S. health care field held its breath for most of last week, waiting for Stage 2 meaningful use guidelines from HHS. The announcement never came, nor did it come this morning as many people had hoped. Because meaningful use is the major theme of HIMSS, and many sessions were planned on helping providers move to Stage 2, the delay in the announcement put the conference in an awkward position.

HIMSS is also frustrated by a delay in another initiative, the adoption of a new standard for the classification of diseases and procedures. ICD-10 is actually pretty old, having been standardized in the 1980s, and the U.S. lags decades behind other countries in adopting it. Advantages touted for ICD-10 are:

  • It incorporates newer discoveries in medicine than the dominant standard in the U.S., ICD-9, and therefore permits better disease tracking and treatment.

  • Additionally, it's much more detailed than ICD-9 (with an order of magnitude more classifications). This allows the recording of more information but complicates the job of classifying a patient correctly.

ICD-10 is rather controversial. Some people would prefer to base clinical decisions on SNOMED, a standard described in the Beyond Meaningful Use book mentioned earlier. Ultimately, doctors lobbied hard against the HHS timeline for adopting ICD-10 because providers are so busy with meaningful use. (But of course, the goals of adopting meaningful use are closely tied to the goals of adopting ICD-10.) It was the pushback from these institutions that led HHS to accede and announce a delay. HIMSS and many of its members were disappointed by the delay.

In addition, there is an upcoming standard, ICD-11, whose sandal some say ICD-10 is not even worthy to lace. A strong suggestion that the industry just move to ICD-11 was aired in Government Health IT, and the possibility was raised in Health Care IT News as well. In addition to reflecting the newest knowledge about disease, ICD-11 is praised for its interaction with SNOMED and its use of Semantic Web technology.

That last point makes me a bit worried. The Semantic Web has not been widely adopted, and if people in the health IT field think ICD-10 is complex, how are they going to deal with drawing up and following relationships through OWL? I plan to learn more about ICD-11 at the conference.

February 17 2012

Documentation strategy for a small software project: launching VoIP Drupal introductions

VoIP Drupal is a window onto the promises and challenges faced by a new open source project, including its documentation. At O'Reilly, we've been conscious for some time that we lack a business model for documenting new collaborative projects--near the beginning, at the stage where they could use the most help with good materials to promote their work, but don't have a community large enough to support a book--and I joined VoIP Drupal to explore how a professional editor can help such a team.

Small projects can reach a certain maturity with poor and sparse documentation. But the critical move from early adopters to the mainstream requires a lot more hand-holding for prospective users. And these projects can spare hardly any developer time for documentation. Users and fans can be helpful here, but their documentation needs to be checked and updated over time; furthermore, reliance on spontaneous contributions from users leads to spotty and unpredictable coverage.

Large projects can hire technical writers, but what they do is very different from traditional documentation; they must be community managers as well as writers and editors (see Anne Gentle's book Conversation and Community: The Social Web for Documentation). So these projects can benefit from research into communities also.

I met at the MIT Media Lab this week with Leo Burd, the inventor of VoIP Drupal, and a couple other supporters, notably Micky Metts of DrupalConnection.com. We worked out some long-term plans for firming up VoIP Drupal's documentation and other training materials. But we also had to deal with an urgent need for materials to offer at DrupalCon, which begins in just over one month.

Challenges

One of the difficulties of explaining VoIP Drupal is that it's just so versatile. The foundations are simple:

  • A thin wrapper around PHP permits developers to write simple scripts that dial phone numbers, send SMS messages, etc. (a sketch follows this list). These scripts run on services that initiate connections and do translation between voice and text (Tropo, Twilio, and the free Plivo are currently supported).

  • Administrators on Drupal sites can use the Drupal interface to configure VoIP Drupal modules and add phone/SMS scripts to their sites.

  • Content providers can use the VoIP Drupal capabilities provided by their administrators to do such things as send text messages to site users, or to enable site users to record messages using their phone or computer.
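
As a taste of that thin wrapper (referenced in the first item above), here is roughly what a trivial script might look like. The general shape follows the VoIP Drupal scripting API, but treat the specific method names as illustrative--addSendText() in particular is my guess at the SMS command--and check the module's script reference for the real ones.

// Illustrative sketch of a VoIP Drupal script: the PHP wrapper builds up a
// list of commands that the underlying service (Tropo, Twilio, or Plivo)
// executes. VoipScript and addSendText() are assumed/guessed names; the
// phone number is a placeholder.

$script = new VoipScript('sms_notice');
$script->addSendText('Your order is ready for pickup.',  // message body
                     '+16175551234');                    // destination number
$script->addHangup();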

Already you can see one challenge: VoIP Drupal has three different audiences that need very different documentation. In fact, we've thought of two more audiences: decision-makers who might build a business or service on top of VoIP Drupal, and potential team members who will maintain and build new features.

Some juicy modules built on top of VoIP Drupal's core extend its versatility to the point where it's hard to explain on an elevator ride what VoIP Drupal could do. Leo tosses out a few ideas such as:

  • Emergency awareness systems that use multiple channels to reach out to a population living in a certain area. That would require a combination of user profiling, mapping, and communication capabilities that tends to be extremely hard to put together in one single package.

  • Community polling/voting systems that are accessible via web, SMS, email, phone, etc.

  • CRM systems that keep track (and even record) phone interactions, organize group conference calls with the click of a button, etc.

  • Voice-based bulletin boards.

  • Adding multiple authentication mechanisms to a site.

  • Sending SMS event notifications based on Google Calendars.

In theory you could create a complete voice and SMS based system out of VoIP Drupal and ignore the web site altogether, but that would be a rather cumbersome exercise. VoIP Drupal is well-suited to integrating voice and the Web--and it leaves lots of room for creativity.

Long-term development

A community project, we agreed, needs to be incremental and will result in widely distributed documents. Some people like big manuals, but most want a quickie getting-started guide and then lots of chances to explore different options at their own pace. Communities are good for developing small documents of different types. The challenge is finding someone to cover any particular feature, as well as to do the sometimes tedious work of updating the document over time.

We decided that videos would be valuable for the administrators and content providers, because they work through graphical interfaces. However, the material should also be documented in plain text. This expands access to the material in two ways. First, VoIP Drupal may be popular in parts of the world where bandwidth limitations make it hard to view videos. Second, the text pages are easier to translate into other languages.

Just as a video can be worth a thousand words, working scripts can replace a dozen explanations. Leo will set up a code contribution site on GitHub. This is more work than it may seem, because malicious or buggy scripts can wreak havoc for users (imagine someone getting a thousand identical SMS messages over the course of a single hour, for instance), so contributions have to be vetted.

Some projects assign a knowledgeable person or two to create an outline, then ask community members to fill it in. I find this approach too restrictive. Having a huge unfilled structure is just depressing. And one has to grab the excitement of volunteers wherever it happens to land. Just asking them to document what they love about a project will get you more material than presenting them with a mandate to cover certain topics.

But then how do you get crucial features documented? Wait and watch forums for people discussing those features. When someone seems particularly knowledgeable and eager to help, ask him or her for a longer document that covers the feature. You then have to reward this person for doing the work, and a couple ways that make sense in this situation include:

  • Get an editor to tighten up the document and work with the author to make a really professional article out of it.

  • Highlight it on your web site and make sure people can find it easily. For many volunteers, seeing their material widely used is the best reward.

We also agreed that we should divide documentation into practical, how-to documents and conceptual documents. Users like to grab a hello-world document and throw together their first program. As they start to shape their own projects, they realize they don't really understand how the system fits together and that they need some background concepts. Here is where most software projects fail. They assume that the reader understands the reasoning behind the design and knows how best to use it.

Good conceptual documentation is hard to produce, partly because the lead developers have the concepts so deeply ingrained that they don't realize what it is that other people don't know. Breaking the problems down into small chunks, though, can make it easier to produce useful guides.

Like many software projects, VoIP Drupal documentation currently starts the reader off with a list of modules. The team members liked an idea of mine to replace these with brief tutorials or use cases. Each would start with a goal or question (what the reader wants to accomplish) and then introduce the relevant module. In general, given the flexibility of VoIP Drupal, we agreed we need a lot more "why and when" documentation.

Immediate preparations

Before we take on a major restructuring and expansion of documentation, though, we have a tight deadline for producing some key videos and documents. Leo is going to lead a development workshop at DrupalCon, and he has to determine the minimum documentation needed to make it a productive experience. He also wants to do a webinar on February 28 or 29, and a series of videos on basic topics such as installing VoIP Drupal, a survey of successful sites using it, and a nifty graphical interface called Visual VoIP Drupal. Visual VoIP Drupal, which will be released in a few weeks, is one of the new features Leo would like to promote in order to excite users. It lets a programmer select blocks and blend them into a script through a GUI, instead of typing all the code.

The next few weeks will bring a flurry of work to realize our vision.

January 20 2012

Developer Week in Review: Early thoughts on iBooks Author

One down, two to go, Patriots-wise. Thankfully, this week's game is on Sunday, so it doesn't conflict with my son's 17th birthday on Saturday. They grow up so quickly; I can remember him playing with his Comfy Keyboard, now he's writing C code for robots.

A few thoughts on iBooks Author and Apple's textbook move

Thursday's announcement of Apple's new iBooks Author package isn't developer news per se, but I thought I'd drop in a few initial thoughts before jumping into the meat of the WIR, because it will have an impact on the community in several ways.

Most directly, it is another insidious lock-in that Apple is wrapping inside a candy-covered package. Since iBooks produced with the tool can only be viewed in full on iOS devices, textbooks and other material produced with iBooks Author will not be available (at least in the snazzy new interactive form) on Kindles or other ereaders. If Apple wanted to play fair, it would make the new iBooks format an open standard. Of course, this would cost Apple its cut of the royalties, as well as the all-important control of the user experience that Steve Jobs instilled as a core value in the company.

On a different level, this could radically change the textbook and publishing industry. It will make it easier to keep textbooks up to date and start to loosen the least-common-denominator stranglehold that huge school districts have on the textbook creation process. On the other hand, I can see a day when pressure from interest groups results in nine different textbooks being used in the same class, one of which ignores evolution, one of which emphasizes the role of Antarctic-Americans in U.S. history, etc.

It's also another step in the disintermediation of publishing since the cost of getting your book out to the world just dropped to zero (not counting proofreading, indexing, editing, marketing, and all the other nice things a traditional publisher does for a writer). I wonder if Apple is going to enforce the same puritanical standards on iBooks as they do on apps. What are they going to do when someone submits a My Little Pony / Silent Hill crossover fanfic as an iBook?

Another item off my bucket list

I've been to Australia. I've had an animal cover book published. And now I've been called a moron (collectively) by Richard Stallman.

The occasion was the previously mentioned panel on the legacy of Steve Jobs, in which I participated this past weekend. As could have been expected, Stallman started in by describing Jobs as someone the world would have been better off without. He spent the rest of the hour defending the position that it doesn't matter how unusable the free alternative to a proprietary platform is, only that it's free. When we disagreed, he shouted us down as "morons."

As I've mentioned before, that position makes a few invalid assumptions. One is that people's lives will be better if they use a crappy free software package over well-polished commercial products. In reality, the perils of commercial software that Stallman demonizes so consistently are largely hypothetical, whereas the usability issues of most consumer-facing free software are very real. For the 99.999% of people who aren't software professionals, the important factor is whether the darn thing works, not if they can swap out an internal module.

The other false premise at play here is that companies are Snidely Whiplash wanna-bes that go out of their way to oppress the masses. Stallman, to his credit as a savvy propagandist, has co-opted the slogans of the Occupy Wall Street movement, referring to the 1% frequently. The reality is that when companies try to pull shady stunts, especially in the software industry, they usually get caught and have to face the music. Remember the furor over Apple's allegedly accidental recording of location data on the iPhone? Stallman's dystopian future, where corporations use proprietary platforms as a tool of subjugation, has pretty much failed every time it's actually been tried on the ground. I'm not saying corporations are angels, or even that they have the consumer's best interests in mind; it's just that they aren't run by demonic beings that eat babies and plot the enslavement of humanity.

Achievement unlocked: Erased user's hard drive

Sometimes life as a software engineer may seem like a game, but Microsoft evidently wants to turn it into a real one. The company has announced a new plug-in for Visual Studio that lets you earn achievements for coding practices and other developer-related activities.

Most of them are tongue in cheek, but I'm terrified that we may start seeing these achievements in live production code as developers compete to earn them all. Among the more fear-inspiring:

  • "Write 20 single letter class-level variables in one file. Kudos to you for being cryptic!"
  • "Write a single line of 300 characters long. Who needs carriage returns?"
  • "More than 10 overloads of a method. You could go with this or you could go with that."

Got news?

Please send tips and leads here.


December 01 2011

Could closed core prove a more robust model than open core?

When participating recently in a sprint held at Google to document four free software projects, I thought about what might have prompted Google to invest in this effort. Their willingness to provide a hotel, work space, and food for some thirty participants, along with staff support all week long, demonstrates their commitment to nurturing open source.

Google is one of several companies for which I'll coin the term "closed core." The code on which they build their business and make their money is secret. (And given the enormous infrastructure it takes to provide a search service, opening the source code wouldn't do much to stimulate competition, as I point out in a posting on O'Reilly's radar blog). But they depend on a huge range of free software, ranging from Linux running on their racks to numerous programming languages and libraries that they've drawn on to develop their services.

So Google contributes a lot back to the free software community. They release code for many non-essential functions. They promote the adoption of standards such as HTML 5. They have been among the first companies to offer APIs for important functions, including their popular Google Maps. They have opened the source code to Android (although its development remains under their control), which has been the determining factor in making Android devices compete with the arguably more highly functioning iOS products. They even created a whole new programming language (Go) and are working on another.

Google is not the only "closed core" company (for instance, Facebook has also built their service around APIs and released their Cassandra project). Microsoft has a whole open source program, including some important contributions to health IT. Scads of other companies, such as IBM, Hewlett Packard, and VMware, have complex relationships to open source software that don't fit a simple "open core" or "closed core" model. But the closed core trend represents a fertile collaboration between communities and companies that have businesses in specific areas. The closed core model requires businesses to determine where their unique value lies and to be generous in offering the public extra code that supports their infrastructure but does not drive revenue.

This model may prove more robust and lasting than open core, which attracts companies occupying minor positions in their industries. The shining example of open core is MySQL, but its complex status, including a long history of dual licensing and simultaneous development by several organizations, makes it a difficult model from which to draw lessons about the whole movement. In particular, Software as a Service redefines the relationships that the free software movement has traditionally defined between open and proprietary. Deploying and monitoring the core SaaS software creates large areas for potential innovation, as we saw with Cassandra, where a company can benefit from turning its code into a community project.

November 21 2011

VoIP Drupal reaches out to the developing world

I don't know why so few of us turned up on Saturday for the VoIP Drupal hackathon. As a way to integrate voice and SMS into a Drupal site, the VoIP modules form a door through which Drupal can move into a vast world of touch-tone telephones, smart telephones, and text messaging, and therefore toward integrating a huge range of users in developing regions who use those technologies instead of desktop or laptop computers. Perhaps Boston isn't the right place or November the right month for a workshop (although the weather was quite nice), but just four of us gathered to get the low-down on VoIP Drupal from Leo Burd, a research scientist at MIT's Media Lab and Center for Civic Media.

Together with just a couple of other developers, he is putting together modules that support Twilio and Tropo, two cloud platforms that are highly scalable and provide telephone and SMS capabilities accessible from different countries. For cases where those services are not available or desirable, VoIP Drupal provides support for FreeSWITCH, an open source telephony platform, via the Plivo communication framework/API.

Most of the time we played with the scripting language using the VoIP Drupal sandbox. The scripting language is a domain-specific language for VoIP built on top of Drupal's module language, PHP. It has about 15 commands to create interactive calls doing such things as recording and playing back audio, handling input from the telephone keypad, managing conference calls, and sending and receiving SMS messages. A trivial script I created went like this:


$script->addSetVoice('woman'); $script->addSetLanguage('fr');

$script->addSay('Voiçi un message. Ne répondez pas.');

$script->addHangup();

(The scripting language requires you to create the $script object first, but the sandbox does that for you silently.) When I played this back, I got a pretty authentic sounding Parisian voice, having even the suitably cavalier tone when she told me not to talk in return (although she was tolerant of my spelling mistakes).

Of course, much richer applications are available through the scripting language. It is mostly linear, although you can define and call subroutines, you can set and retrieve variables, and there is a primitive assembly-language-like statement that lets you branch to a label based on a condition. Furthermore, the modules' full power is available through a PHP API. The Drupal administration menu allows you to specify a script to play when the site makes an outgoing call, a script to play when someone calls the site's phone number, and a script to play when someone sends a text message to the site's phone number.
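
To show the flavor of the keypad handling and the assembly-like branching, here is a slightly longer sketch in the same style as the script above. The addGetInput(), addGotoIf(), addGoto(), addLabel(), and addRecord() names are my guesses at the relevant commands among the roughly 15 available; the reference documentation on drupal.org is the authority, and the condition syntax below is approximate.

// Sketch of an interactive script: ask for a digit, then branch on the
// answer. Command names other than addSay() and addHangup() are guesses.

$script->addSay('Press 1 to hear the announcement, or 2 to leave a message.');
$script->addGetInput('choice', 1);                 // read one keypad digit into a 'choice' variable

$script->addGotoIf('record_message', '^%choice == 2');   // condition syntax is approximate

$script->addSay('Here is this week\'s announcement.');
$script->addGoto('goodbye');

$script->addLabel('record_message');
$script->addSay('Please speak after the tone.');
$script->addRecord('visitor_message');             // guessed name for the recording command

$script->addLabel('goodbye');
$script->addSay('Thank you for calling.');
$script->addHangup();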

Burd showed off a site put together by a non-profit in Dorchester (a low-income area of Boston, Mass.) together with the MIT Center for Civic Media. A group of young students recorded some descriptions of nearby locations of interest. These locations display plaques with a phone number for the web site and an extension unique for their location. Someone dialing in hears the message and is invited to record his or her own opinions or stories about that part of the city. Attendees today were so impressed that they said, if this application could be released as a drop-in module, it would boost the use of the VoIP modules immediately.

Just a few of the many uses for VoIP in Drupal include:

  • The equivalent of mass mailings via voice calls and SMS, so you can send messages, for instance, to people who sign up for political campaigns

  • Letting visitors leave voice mail or add verbal comments to the site

  • Embedding a phone interface on the web page so people can make VoIP calls directly from your site, with no extra stand-alone software such as Skype

  • Providing a conference call service through your site

  • Letting people sign up for groups to receive SMS messages on chosen topics of interest

More modules are under construction; an overview is available on the Drupal web site. A messaging module lets you send a voice message that is delivered by phone, email, or SMS, as preferred by the site's visitor. The modules are developed for Drupal version 6, but Burd plans to create Drupal 7 modules as soon as the version 6 ones reach their 1.0 release, depending on interest from the community.

Of the underlying services supported, Twilio offers voice generation for four languages. Tropo supports voice generation for 24 languages, and can also do speech-to-text. Both of those companies have been very friendly to the VoIP Drupal project and promote it at Drupal conferences. FreeSWITCH and Plivo require you to do a lot of the work that Twilio and Tropo will do for you. But FreeSWITCH is useful for areas without Twilio or Tropo support, and for high-volume use, because it tends to cost less under those circumstances. FreeSWITCH also gives the programmer more control over the server and allows you to run everything from the same box. Overall, VoIP Drupal represents another step toward an Internet where communicating by voice is taken for granted.

October 21 2011

Wrap-up from FLOSS Manuals book sprint at Google

At several points during this week's documentation sprint at Google, I talked with the founder of FLOSS Manuals, Adam Hyde, who developed the doc sprint as it is practiced today. Our conversation often returned to the differences between the group writing experience we had this week and traditional publishing. The willingness of my boss at O'Reilly Media to send me to this conference shows how interested the company is in learning what we might be able to take from sprints.

Some of the differences between sprints and traditional publishing are quite subtle. The collaborative process is obviously different, but many people outside publishing might not realize just how deeply the egoless collaboration of sprints flies in the face of the traditional publishing model. The reason is that publishing has long depended on the star author. In whatever way a person becomes this kind of star, whether by working his way up the journalism hierarchy like Thomas Friedman or bursting on the scene with a dazzling personal story like Greg Mortenson (author of Three Cups of Tea), stardom is almost the only way to sell books in profitable numbers. Authors who use the books themselves to build stardom still need to keep themselves in the public limelight somehow. Without colorful personalities, the publishing industry needs a new way to make money (along with Hollywood, television, and pop music).

But that's not the end of differences. Publishers also need to promise a certain amount of content, whereas sprinters and other free documentation projects can just put out what they feel like writing and say, "If you want more, add it." Traditional publishing will alienate readers if books come out with certain topics missing. Furthermore, if a book lacks a popular topic that a competitor has, the competitor will trounce the less well-endowed book in the market. So publishers are not simply inventing needs to maintain control over the development effort. They're not exerting control just to tamp down on unauthorized distribution or something like that. When they sell content, users have expectations that publishers strive to meet, so they need strong control over the content and the schedule for each book.

But O'Reilly, along with other publishers across the industry, is trying to change expectations. The goal of comprehensiveness conflicts with another goal, timeliness, which is becoming more and more important. We're responding in three ways, each of which brings us closer to what FLOSS Manuals is doing: we put out "early releases" containing parts of books that are underway, we sign contracts for projects that are very short by design and cover deliberately limited topics, and we're experimenting with systems even closer to the FLOSS Manuals model, which allow authors to change a book at will and publish a new version immediately.

Although FLOSS Manuals produces free books and gets almost none of its funding from sales (the funding comes from grants and from the sponsors of sprints), the idea of sprinting is still compatible with traditional publishing, in which sales are the whole point. Traditional publishers tend to give authors several thousand dollars in advances, and if an author takes several months to produce a book, we don't see the royalties that pay back that investment for a long time. Why not spend a few thousand dollars to bring a team of authors to a pleasant but distraction-free location (I have to admit that Google headquarters is not at all distraction-free) and pay for a week of intense writing?

Authors would probably find it much more appealing to take a one-week working vacation and say good-bye to their families for just that week than to spend months stealing time on evenings and weekends and apologizing for not being fully present.

The problem, as I explained in my first posting this week, is that you never quite know what you're going to get from a sprint. In addition, the material is still rough at the end of a week and has to absorb a lot of work to rise to the standards of professional publishing. Still, many technical publishers would be happy to get over a hundred pages of relevant material in a single week.

Publishers who fail to make documents free and open might be at more of a disadvantage when seeking remote contributions. Sprints don't get many contributions from people outside the room where they are conducted, but sometimes remote advice and support weigh in on a critical, highly technical point. The sprints I have participated in (sometimes remotely) benefited from answers that came out of the cloud to resolve difficult questions. For instance, one commenter on this week's KDE book warned us we were using product names all wrong and had us go back through the book to make sure our branding was correct.

Will people offer their time to help authors and publishers develop closed books? O'Reilly has put books online during development, and random visitors do offer fixes and comments. There is some good will toward anyone who wants to offer guidance that a community considers important. But free, open documents are likely to draw even more help from crowdsourcing.

At the summit today, with the books all wrapped up and published, we held a feedback session. The organizers asked us our opinions on the sprint process, the writing tools, and how to make the sprint more effective. Our facilitator raised three issues that, once again, reminded me of the requirements of traditional publishing:

  • Taking long-term responsibility for a document. How does one motivate people to contribute to it? Free software communities need to make updates a communal responsibility and integrate the document into their project life cycle just as they do the software.

  • Promoting the document. Without lots of hype, people will not notice the existence of the book and pick it up. Promotion is pretty much the same no matter how content is produced (social networking, blogging, and video play big roles nowadays), but free books are distinguished by the goal of sharing widely without concern for authorial control or payment. Furthermore, while FLOSS Manuals is conscious of branding, it does not use copyright or trademarks to restrict use of graphics or other trade dress.

  • Integrating a document into a community. This is related to both maintenance and promotion. Every great book has a community around it, and there are lots of ways communities can use a book in training and other member-building activities. Forums and feedback pages are also important.

Over the past decade, a system of information generation has grown up in parallel with the traditional expert-driven system. In the old system everyone defers to an expert, while in the new system the public combines its resources. In the old system, documents are fixed after publication, whereas in the new system they are fluid. The old system is driven by the author's ego and, increasingly, by the demand to generate money, whereas the new system has revenue possibilities but also a strong sense of responsibility for the welfare of communities.

Mixtures of grassroots content generation and unique expertise have existed (Homer, for instance) and more models will be found. Understanding the points of commonality between the systems will help us develop such models.

(All my postings from this sprint are listed in a bit.ly bundle.)

FLOSS Manuals books published after three-day sprint

The final day of the FLOSS Manuals documentation sprint at Google began with a bit of a reprieve from Sprintmeister Adam Hyde's dictum that we should do no new writing. He allowed us to continue working till noon, time that the KDE team spent partly in heated arguments over whether we had provided enough coverage of key topics (the KDE project architecture, instructions for filing bug reports, etc.), partly in scrutinizing dubious material the book had inherited from the official documentation, and partly (for at least a couple of us) in actually writing material for chapters that readers may or may not find useful, such as a glossary.

I worried yesterday that the excitement of writing a complete book would be succeeded by the boring work of checking flow and consistency. Some drudgery was involved, but the final reading allowed groups to revisit their ways of presenting concepts and bringing in the reader.

Having done everything I thought I could do for the KDE team, I switched to the OpenStreetMap team, which produced a short, nicely paced, well-illustrated user guide. I think it's really cool that Google, which invests heavily in its own mapping service, helps OpenStreetMap as well. (OpenStreetMap is often represented in Google Summer of Code.)

After dinner we started publishing our books. The new publication process at FLOSS Manuals uploads the books not only to the FLOSS Manuals main page but also to Lulu for purchase.

Publishing books at doc sprint

Joining the pilgrimage that all institutions are making toward wider data use, FLOSS Manuals is exposing more and more of the writing process. As founder Adam Hyde described in a blog posting today, Visualising your book, recently added tools help participants and friends follow the progress of a book and get a sense of what was done: you can view a list of edited chapters on an RSS feed, for instance, and a timeline with circles representing chapter edits shows you which chapters had the most edits and when activity took place. (Pierre Commenge created the visualization for FLOSS Manuals.)

Participants at doc sprint

(All my postings from this sprint are listed in a bit.ly bundle: https://bitly.com/bundles/praxagora/4.)

October 20 2011

Day two of FLOSS Manuals book sprint at Google Summer of Code summit

We started the second day of the FLOSS Manuals sprint with a circle encounter where each person shared some impressions of the first day. Several reported that they had worked on wikis and other online documentation before, but discovered that doing a book was quite different (I could have told them that, of course). They knew that a book had to be more organized, and offer more background than typical online documentation. More fundamentally, they felt more responsibility toward a wider range of readers, knowing that the book would be held up as an authority on the software they worked on and cared so much about.

We noted how gratifying it was to get questions answered instantly and to be able to go through several rounds of editing in just a couple of minutes. I admitted that I had been infected with the enthusiasm of the KDE developers I was working with, but had to maintain a bit of critical distance, an ability to say, "Hey, you're telling me this piece of software is wonderful, but I find its use gnarly and convoluted."

As I explained in Monday's posting, all the writing had to fit pretty much into two days. Each of the four teams started yesterday by creating an outline, and I'm sure my team was not the only one to revise it constantly throughout the day.

Circle at beginning of the day

Today, the KDE team took a final look at the outline and discussed everything we'd like to add to it. We pretty much finalized it early in the day and just filled in the blanks for the next eleven hours. I continued to raise flags about what I felt were insufficiently detailed explanations, and got impatient enough to write a few passages of my own in the evening.

Celebrating our approach to the end of the KDE writing effort

The KDE book is fairly standard developer documentation, albeit a beginner's guide with lots of practical advice about working in the KDE environment with the community. As a relatively conventional book, it was probably a little easier to write (but also probably less fun) than the books of some other teams, which took more high-level approaches aimed at demonstrating to potential customers that their projects were worth adopting. Story-telling will be hard to detect in the KDE book.

And we finished! Now I'm afraid we'll find tomorrow boring, because we won't be allowed (and probably won't need) to add substantial new material. Instead, we'll be doing things like checking everything for consistency, removing references to missing passages, adding terms to the glossary, and other unrewarding slogs through a document that is far too familiar to us already. The only difference between the other team members and me is that I may be assigned to do this work on some other project.

(All my postings from this sprint are listed in a bit.ly bundle.)

October 19 2011

Day one of FLOSS Manuals book sprint at Google Summer of Code summit

Four teams at Google launched into endeavors that will lead, less than 72 hours from now, to complete books on four open source projects (KDE, OpenStreetMap, OpenMRS, and Sahana Eden). Most participants were recruited on the basis of a dream and a promise, so going through the first third of our sprint was eye-opening for nearly everybody. Although I had participated in one sprint before on-site and two sprints remotely, I found that the modus operandi has changed so much during the past year of experimentation that I too had a lot to learn.

Our doc sprint coordinator, Adam Hyde, told each team to spend an hour making an outline. The team to which I was assigned, KDE, took nearly two, and partway through Adam came in to tell us to stop because we had enough topics for three days of work. We then dug into filling in the outline through a mix of fresh writing and cutting and pasting material from the official KDE docs. The latter often proved to be more than a year out of date and naturally required a complete overhaul.

KDE team at doc sprint

The KDE team's focus on developer documentation spared us the open-ended discussions over scope that the other teams had to undergo. But at key points during the writing, we were still forced to examine passages that appeared too hurried and unsubstantiated, evidence of gaps in information. At each point we had to determine what the hidden topics were, and then whether to remove all references to them (as we did, for instance, on the topic of getting permission to commit code fixes) or to expand them into new chapters of their own (as we did for internationalization). The latter choice created a dilemma of its own, because none of the team members present had experience with internationalization, so we reached out and tried to contact remote KDE experts who could write the chapter.

The biggest kudos today go to Sahana Eden, I think. I reported yesterday that the team expressed deep differences of opinion about the audience they should address and how they should organize their sprint. Today they made some choices and got a huge amount of documentation down on the screen. Much of it was clearly provisional (they were booed for including so many bulleted lists), but it was evidence of their thinking and a framework for further development.

Sahana team at doc sprint

My own team had a lot of people with serious jet lag, and we had some trouble going from 9:00 in the morning to 9:30 at night. But we created (or untangled, as the case may be) some 60 pages of text. We reorganized the book at least once per hour, a process that the FLOSS Manuals interface makes as easy as drag and drop. A good heuristic was to choose a section title for each group of chapters. If we couldn't find a good title, we had to break up the group.

The end of the day brought us to the halfway mark for writing. We are told we need to complete everything by the end of the evening tomorrow and spend the final day rearranging and cleaning up text. More than a race against time, this is proving to be a race against complexity.

Topics for discussion at doc sprint

October 18 2011

FLOSS Manuals sprint starts at Google Summer of Code summit

Five days of intense book production kicked off today at the FLOSS Manuals sprint, hosted by Google. Four free software projects have each sent three to five volunteers to write books about the projects this week. Along the way we'll all learn about the group writing process and the particular use of book sprints to make documentation for free software.

I came here to provide whatever editorial help I can and to see the similarities and differences between conventional publishing and the intense community effort represented by book sprints. I plan to spend time with each of the four projects, participating in their discussions and trying to learn what works best by comparing what they bring in the way of expertise and ideas to their projects. All the work will be done out in the open on the FLOSS Manuals site for the summit, so you are welcome also to log in and watch the progress of the books or even contribute.

A book in a week sounds like a pretty cool achievement, whether for a free software project or a publisher. In fact, the first day (today) and the last day of the sprint are unconferences, so there are only three days for actual writing. The first hour tomorrow will be devoted to choosing a high-level outline for each project, and then the teams will be off and running.

And there are many cautions about trying to apply this model to conventional publishing. First, the books are never really finished at the end of the sprint, even though they go up for viewing and for sale immediately. I've seen that they have many rough spots, such as redundant sections written by different people on the same topic, and mistakes in cross-references or references to non-existent material. Naturally, they also need a copy-edit. This doesn't detract from the value of the material produced. It just means they need some straightening out to be considered professional quality.

Books that come from sprints are also quite short. I think a typical length is 125 pages, growing over time as follow-up sprints are held. The length also depends of course on the number of people working on the sprint. We have the minimum for a good sprint here at Google, because the three to five team members will be joined by one or two people like me who are unaffiliated.

Finally, the content of a sprint book is decided on an ad hoc basis. FLOSS Manuals founder Adam Hyde explained today that his view of outlining and planning has evolved considerably. He quite rationally assumed at first that every book should have a detailed outline before the sprint started. Then he found that one could not impose an outline on sprinters, but had to let them choose subjects they wanted to cover. Each sprinter brings certain passions, and in such an intense environment one can only go with the flow and let each person write what interests him or her. Somehow, the books pull together into a coherent product, but one cannot guarantee they'll have exactly what the market is asking for. I, in fact, was involved in the planning of a FLOSS Manuals sprint for the CiviCRM manual (the first edition of a book that is now in its third) and witnessed the sprinters toss out an outline that I had spent weeks producing with community leaders.

So a sprint is different in every way from a traditional published manual, and I imagine this will be true for community documentation in general.

The discussions today uncovered the desires and concerns of the sprinters, and offered some formal presentations to prepare us, we hope, for the unique experience of doing a book sprint. The concerns expressed by sprinters were pretty easy to anticipate. How does one motivate community members to write? How can a project maintain a book in a timely manner after it is produced? What is the role of graphics and multimedia? How does one produce multiple translations?

Janet Swisher, a documentation expert from Austin who is on the board of FLOSS Manuals, gave a presentation asking project leaders to think about basic questions such as why a user would use their software and what typical uses are. Her goal was to bring home the traditional lessons of good writing: empathy for a well-defined audience. "If I had a nickel for every web site I've visited put up by an open source project that doesn't state what the software is for..." she said. That's just a single egregious instance of the general lack of understanding of the audience that free software authors suffer from.

Later, Michael McAndrew of the CiviCRM project took us several steps further along the path, asking what the project leaders would enjoy documenting and "what would be insane to leave out." I sat with the group from Sahana to watch as they grappled with the pressures these questions created. This is sure one passionate group of volunteers, caring deeply about what they do. Splits appeared concerning how much time to devote to high-level concepts versus practical details, which audiences to serve, and what to emphasize in the outline. I have no doubt, however, listening to them listen to each other, that they'll have their plan after the first hour tomorrow and will be off and running.
