May 04 2012

Four short links: 4 May 2012

  1. Common Statistical Fallacies (Flowing Data) -- once you know to look for them, you see them everywhere. Or is that confirmation bias?
  2. Project Hijack -- Hijacking power and bandwidth from the mobile phone's audio interface. Creating a cubic-inch peripheral sensor ecosystem for the mobile phone.
  3. Peak Plastic -- Deb Chachra points out that if we’re running out of oil, that also means that we’re running out of plastic. Compared to fuel and agriculture, plastic is small potatoes. Even though plastics are made on a massive industrial scale, they still account for less than 10% of the world’s oil consumption. So recycling plastic saves plastic and reduces its impact on the environment, but it certainly isn’t going to save us from the end of oil. Peak oil means peak plastic. And that means that much of the physical world around us will have to change. I hadn't pondered plastics in medicine before. (via BoingBoing)
  4. web.go (GitHub) -- web framework for the Go programming language.

May 02 2012

Recombinant Research: Breaking open rewards and incentives

In the previous articles in this series I've looked at problems in current medical research, and at the legal and technical solutions proposed by Sage Bionetworks. Pilot projects have shown encouraging results but to move from a hothouse environment of experimentation to the mainstream of one of the world's most lucrative and tradition-bound industries, Sage Bionetworks must aim for its nucleus: rewards and incentives.

Previous article in the series: Sage Congress plans for patient engagement.

Think about the publication system, that wretchedly inadequate medium for transferring information about experiments. Getting the data on which a study was based is incredibly hard; getting the actual samples or access to patients is usually impossible. Just as boiling vegetables drains most of their nutrients into the water, publishing results of an experiment throws away what is most valuable.

But the publication system has been built into the foundation of employment and funding over the centuries. A massive industry provides distribution of published results to libraries and research institutions around the world, and maintains iron control over access to that network through peer review and editorial discretion. Even more important, funding grants require publication (but the data behind the study only very recently). And of course, advancement in one's field requires publication.

Lawrence Lessig, in his keynote, castigated for-profit journals for restricting access to knowledge in order to puff up profits. A chart in his talk showed skyrocketing prices for for-profit journals in comparison to non-profit journals. Lessig is not out on the radical fringe in this regard; Harvard Library is calling the current pricing situation "untenable" in a move toward open access echoed by many in academia.

Lawrence Lessig keynote at Sage Congress.

How do we open up this system that seemed to serve science so well for so long, but is now becoming a drag on it? One approach is to expand the notion of publication. This is what Sage Bionetworks is doing with Science Translational Medicine in publishing validated biological models, as mentioned in an earlier article. An even more extensive reset of the publication model is found in Open Network Biology (ONB), an online journal. The publishers require that an article be accompanied by the biological model, the data and code used to produce the model, a description of the algorithm, and a platform to aid in reproducing results.

But neither of these worthy projects changes the external conditions that prop up the current publication system.

When one tries to design a reward system that gives deserved credit to other things besides the final results of an experiment, as some participants did at Sage Congress, great unknowns loom up. Is normalizing and cleaning data an activity worth praise and recognition? How about combining data sets from many different projects, as a Synapse researcher did for the TCGA? How much credit do you assign researchers at each step of the necessary procedure for a successful experiment?

Let's turn to the case of free software to look at an example of success in open sharing. It's clear that free software has swept the computer world. Most web sites use free software ranging from the server on which they run to the language compilers that deliver their code. Everybody knows that the most popular mobile platform, Android, is based on Linux, although fewer realize that the next most popular mobile platforms, Apple's iPhones and iPads, run on a modified version of the open BSD operating system. We could go on and on citing ways in which free and open source software has changed the field.

The mechanism by which free and open source software staked out its dominance in so many areas has not been authoritatively established, but I think many programmers agree on a few key points:

  • Computer professionals encountered free software early in their careers, particularly as students or tinkerers, and brought their predilection for it into jobs they took at stodgier institutions such as banks and government agencies. Their managers deferred to them on choices for programming tools, and the rest is history.

  • Of course, computer professionals would not have chosen the free tools had they not been fit for the job (and often best for the job). Why is free software so good? Probably because the people creating it have complete jurisdiction over what to produce and how much time to spend producing it, unlike in commercial ventures with requirements established through marketing surveys and deadlines set unreasonably by management.

  • Different pieces of free software are easy to hook up, because one can alter their interfaces as necessary. Free software developers tend to look for other tools and platforms that could work with their own, and provide hooks into them (Apache, free database engines such as MySQL, and other such platforms are often accommodated). Customers of proprietary software, in contrast, experience constant frustration when they try to introduce a new component or change components, because the software vendors are hostile to outside code (except when they are eager to fill a niche left by a competitor with market dominance). Formal standards cannot overcome vendor recalcitrance--a painful truth particularly obvious in health care with quasi-standards such as HL7.

  • Free software scales. Programmers work on it tirelessly until it's as efficient as it needs to be, and when one solution just can't scale any more, programmers can create new components such as Cassandra, CouchDB, or Redis that meet new needs.

Are there lessons we can take from this success story? Biological research doesn't fit the circumstances that made open source software a success. For instance, researchers start out low on the totem pole in very proprietary-minded institutions, and don't get to choose new ways of working. But the cleverer ones are beginning to break out and try more collaboration. Software and Internet connections help.

Researchers tend to choose formats and procedures on an ad hoc, project by project basis. They haven't paid enough attention to making their procedures and data sets work with those produced by other teams. This has got to change, and Sage Bionetworks is working hard on it.

Research is labor-intensive. It needs desperately to scale, as I have pointed out throughout this article, but to do so it needs entire new paradigms for thinking about biological models, workflow, and teamwork. This too is part of Sage Bionetworks' mission.

Certain problems are particularly resistant in research:

  • Conditions that affect small populations have trouble raising funds for research. The Sage Congress initiatives can lower research costs by pooling data from the affected population and helping researchers work more closely with patients.

  • Computation and statistical methods are very difficult fields, and biological research is competing with every other industry for the rare individuals who know these well. All we can do is bolster educational programs for both computer scientists and biologists to get more of these people.

  • There's a long lag time before one knows the effects of treatments. As Heywood's keynote suggested, this is partly solved by collecting longitudinal data on many patients and letting them talk among themselves.

Another process change has revolutionized the computer field: agile programming. That paradigm stresses close collaboration with the end-users whom the software is supposed to benefit, and a willingness to throw out old models and experiment. BRIDGE and other patient initiatives hold out the hope of a similar shift in medical research.

All these things are needed to rescue the study of genetics. It's a lot to do all at once. Progress on some fronts was more apparent than on others at this year's Sage Congress. But as more people get drawn in, and sometimes fumbling experiments produce maps for changing direction, we may start to see real outcomes from the efforts in upcoming years.

All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR20

The UK's battle for open standards

Many of you are probably not aware, but there is an ongoing battle within the U.K. that will shape the future of the U.K. tech industry. It's all about open standards.

Last year, the Cabinet Office ran a consultation on open standards covering 970 CIOs and academics. The result of this consultation was a policy (PDF) in favour of royalty-free (RF) open standards in the U.K. I'm not going to go through the benefits of open standards in this space, other than to note that they are essential for the U.K.'s future competitive position, for spurring on innovation and creating a level playing field within the tech field. For those who wish to read more on this subject, Mark Thompson, the only academic I know to have published a paper on open standards in a quality peer reviewed journal, has provided an excellent overview.

Normally, I put these battles into an historical context, and I certainly have a plethora of examples of past industries attempting to lobby against future change. However, to keep this short I'll simply note that the incumbent industry has reacted to the Cabinet Office policy with attempts to redefine open standards to include non-open FRAND (fair, reasonable and non discriminatory) licenses and portray some sort of legitimate debate of RF versus FRAND, which doesn't exist.

Whilst this is clearly wrong and underhanded, there's another story I wish to focus on. It relates to the accusations that the meetings have been filled with "spokespeople for big vendors to argue in favour of paid-for software, specifically giving advocates of FRAND the chance to argue that free software on RF terms would be a bad thing" as reported by TechWeek Europe.

The back story is that since the Government policy on open standards was put in place, the Cabinet Office was pressured into a u-turn and running another consultation by various standards bodies and other vested interests. The arguments used were either fortuitous misunderstandings of the policy or willful misinformation in favour of current business interests. The Cabinet Office then appeared to relent to the pressure and undertake a second set of consultations. What happened next shows the sorry behaviour of lobbyists in our industry.

"Software patent heavyweights piled into the first public meeting," filling the room with unrepresentative views backed up by vendors flying in senior individuals from the U.S. It appears that the chair of the roundtable was himself a paid lobbyist working on behalf of those vested interests, a fact that he forgot to mention to the Cabinet Office. Microsoft has now been "accused of trying to secretly influence government consultation."

What's surprising is that the majority of this had been uncovered by two journalists — Mark Ballard at Computer Weekly and Glyn Moody — who work mainly outside the mainstream media. In fact, the mainstream media has remained silent on the issue, with the notable exception of The Guardian.

The end result of the work of these two journalists is that the Cabinet Office has had to extend the consultation and, as noted by The Guardian, "rerun one of its discussion roundtables after it found that an independent facilitator of one of its discussions was simultaneously advising Microsoft on the consultation."

So, we have two plucky journalists who stand alone uncovering large corporations that are bullying Government to protect profits worth hundreds of millions. Our heroes' journey uncovers gerrymandering, skullduggery, rampant conflicts of interests, dubious ethics and a host of other sordid details and ... hold on, this sounds like a Hollywood script, not real life. Why on earth isn't mainstream media all over this, especially given the leaked Bell Pottinger memo on exploiting citizen initiatives?

The silence makes me wonder whether investigative journalism into things that might matter and might make a positive difference doesn't sell much advertising? Would it help if the open standards battle had celebrity endorsement? Alas, that's not the case and the battle for open standards might have been extended, but it is still ongoing. This issue is as important to the U.K. as SOPA / PIPA were to the U.S., but rather than fighting against a Government trying to do something that harms the growth of future industry, we are fighting with a Government trying to do the right thing and benefit a nation.

If you're too busy to help, that's understandable, but don't ever grumble about why the U.K. Government doesn't do more to support open standards and open source. The U.K. Government is trying to make a difference. It's trying to fight a good fight against a huge and well-funded lobby, but it needs you to turn up.

The battle for open standards needs help, so get involved.

Four short links: 2 May 2012

  1. Punting on SxSW (Brad Feld) -- I came across this old post and thought: if you can make money by being a dick, or make money by being a caring family person, why would you choose to be a dick? As far as I can tell, being a dick is optional. Brogrammers, take note. Be more like Brad Feld, who prioritises his family and acts accordingly.
  2. Probabilistic Structures for Data Mining -- readable introduction to useful algorithms and datastructures showing their performance, reliability, and resources trade-off. (via Hacker News)
  3. Dataset -- a Javascript library for transforming, querying, manipulating data from different sources.
  4. Many HTTPS Servers are Insecure -- 75% still vulnerable to the BEAST attack.

May 01 2012

Recombinant Research: Sage Congress plans for patient engagement

Clinical trials are the pathway for approving drug use, but they aren't good enough. That has become clear as a number of drugs (Vioxx being the most famous) have been blessed by the FDA, but disqualified after years of widespread use reveal either lack of efficacy or dangerous side effects. And the measures taken by the FDA recently to solve this embarrassing problem continue the heavy-weight bureaucratic methods it has always employed: more trials, raising the costs of every drug and slowing down approval. Although I don't agree with the opinion of Avik S. A. Roy (reprinted in Forbes) that Phase III trials tend to be arbitrary, I do believe it is time to look for other ways to test drugs for safety and efficacy.

First article in the series: Recombinant Research: Sage Congress Promotes Data Sharing in Genetics.

But the Vioxx problem is just one instance of the wider malaise afflicting the drug industry. Drug makers just aren't producing enough new medications, either to solve pressing public needs or to keep up their own earnings. Vicki Seyfert-Margolis of the FDA built on her noteworthy speech at last year's Sage Congress (reported in one of my articles about the conference) with the statistic that drug companies submitted 20% fewer medications to the FDA between 2001 and 2007. Their blockbuster drugs produce far less profit than before as patents expire and fewer new drugs emerge (a predicament called the "patent cliff"). Seyfert-Margolis intimated that this crisis is the cause of layoffs in the industry, although I heard elsewhere that the companies are outsourcing more research, so perhaps the downsizing is just a reallocation of the same money.

Benefits of patient involvement

The field has failed to rise to the challenges posed by new complexity. Speakers at Sage Congress seemed to feel that genetic research has gone off the tracks. As the previous article in this series explained, Sage Bionetworks wants researchers to break the logjam by sharing data and code in GitHub fashion. And surprisingly, pharma is hurting enough to consider going along with an open research system. They're bleeding from a situation where as much as 80% of each clinical analysis is spent retrieving, formatting, and curating the data. Meanwhile, Kathy Giusti of the Multiple Myeloma Research Foundation says that in their work, open clinical trials are 60% faster.

Attendees at a breakout session I sat in on, including numerous managers from major pharma companies, expressed confidence that they could expand public or "pre-competitive" research in the direction Sage Congress proposed. The sector left to engage is the one that's central to all this work--the public.

If we could collect wide-ranging data from, say, 50,000 individuals (a May 2013 goal cited by John Wilbanks of Sage Bionetworks, a Kauffman Foundation Fellow), we could uncover a lot of trends that clinical trials are too narrow to turn up. Wilbanks ultimately wants millions of such data samples, and another attendee claimed that "technology will be ready by 2020 for a billion people to maintain their own molecular and longitudinal health data." And Jamie Heywood of PatientsLikeMe, in his keynote, claimed to have demonstrated through shared patient notes that some drugs were ineffective long before the FDA or manufacturers made the discoveries. He decried the current system of validating drugs for use and then failing to follow up with more studies, snorting that, "Validated means that I have ceased the process of learning."

But patients have good reasons to keep a close hold on their health data, fearing that an insurance company, an identity thief, a drug marketer, or even their own employer will find and misuse it. They already have little enough control over it, because the annoying consent forms we always have shoved in our faces when we come to a clinic give away a lot of rights. Current laws allow all kinds of funny business, as shown in the famous case of the Vermont law against data mining, which gave the Supreme Court a chance to say that marketers can do anything they damn please with your data, under the excuse that it's de-identified.

In a noteworthy poll by Sage Bionetworks, 80% of academics claimed they were comfortable sharing their personal health data with family members, but only 31% of citizen advocates would do so. If that 31% is more representative of patients and the general public, how many would open their data to strangers, even when supposedly de-identified?

The Sage Bionetworks approach to patient consent

It's basic research that loses. So Wilbanks and a team have been working for the past year on a "portable consent" procedure. This is meant to overcome the hurdle by which a patient has to be contacted and give consent anew each time a new researcher wants data related to his or her genetics, conditions, or treatment. The ideal behind portable consent is to treat the entire research community as a trusted user.

The current plan for portable consent provides three tiers:

Tier 1

No restrictions on data, so long as researchers follow the terms of service. Hopefully, millions of people will choose this tier.

Tier 2

A middle ground. Someone with asthma may state that his data can be used only by asthma researchers, for example.

Tier 3

Carefully controlled. Meant for data coming from sensitive populations, along with anything that includes genetic information.

Synapse provides a trusted identification service. If researchers find a person with useful characteristics in the last two tiers, and are not authorized automatically to use that person's data, they can contact Synapse with the random number assigned to the person. Synapse keeps the original email address of the person on file and will contact him or her to request consent.
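The three tiers and the identification service described above can be pictured as a small access check. The following is only a sketch under stated assumptions: the names `Tier`, `Participant`, and `access_decision` are hypothetical, the rule that Tier 3 data always requires a fresh request is my own simplification, and none of this reflects the actual Synapse implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    UNRESTRICTED = 1  # Tier 1: any researcher who follows the terms of service
    MIDDLE = 2        # Tier 2: e.g. only researchers studying one condition
    CONTROLLED = 3    # Tier 3: sensitive populations, genetic information


@dataclass(frozen=True)
class Participant:
    random_id: str                         # random number assigned to the person
    tier: Tier
    permitted_field: Optional[str] = None  # used by Tier 2, e.g. "asthma"


def access_decision(participant: Participant, research_field: str) -> str:
    """Return 'granted', or a note that consent must be requested.

    Assumes the researcher has already accepted the terms of service.
    """
    if participant.tier is Tier.UNRESTRICTED:
        return "granted"
    if (participant.tier is Tier.MIDDLE
            and participant.permitted_field == research_field):
        return "granted"
    # Assumption: everything else goes through the identification service,
    # which contacts the person using the email address it keeps on file.
    return "request consent via ID " + participant.random_id


asthma_patient = Participant("483920", Tier.MIDDLE, permitted_field="asthma")
print(access_decision(asthma_patient, "asthma"))    # granted
print(access_decision(asthma_patient, "oncology"))  # request consent via ID 483920
```

The researcher in this sketch never sees an email address, only the random identifier, which matches the description of the identification service above.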

Portable consent also involves a lot of patient education. People will sign up through a software wizard that explains the risks. After choosing portable consent, the person decides how much to put in: 23andMe data, prescriptions, or whatever they choose to release.

Sharon Terry of the Genetic Alliance said that patient advocates currently try to control patient data in order to force researchers to share the work they base on that data. Portable consent loosens this control, but the field may be ready for its more flexible conditions for sharing.

Pharma companies and genetics researchers have lots to gain from access to enormous repositories of patient data. But what do the patients get from it? Leaders in health care already recognize that patients are more than experimental subjects and passive recipients of treatment. The recent ONC proposal for Stage 2 of Meaningful Use includes several requirements to share treatment data with the people being treated (which seems kind of a no-brainer when stated this baldly) and the ONC has a Consumer/Patient Engagement Power Team.

Sage Congress is fully engaged in the patient engagement movement too. One result is the BRIDGE initiative, a joint project of Sage Bionetworks and Ashoka with funding from the Robert Wood Johnson Foundation, to solicit questions and suggestions for research from patients. Researchers can go for years researching a condition without even touching on some symptom that patients care about. Listening to patients in the long run produces more cooperation and more funding.

Portable consent requires a leap of faith, because as Wilbanks admits, releasing aggregates of patient data means that over time, a patient is almost certain to be re-identified. Statistical techniques are just getting too sophisticated and compute power is growing too fast for anyone to hide behind current tricks such as using only the first three digits of a five-digit postal code. Portable consent requires the data repository to grant access only to bona fide researchers and to set terms of use, including a ban on re-identifying patients. Still, researchers will have rights to do research, redistribute data, and derive products from it. Audits will be built in.
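To see why truncating a postal code offers limited protection, here is a minimal sketch that counts how many records remain unique on a few attributes before and after truncation. The records are made up for illustration, and the choice of attributes (postal code, birth year, sex) is an assumption rather than anything presented at Sage Congress.

```python
from collections import Counter

# Made-up records: (five-digit postal code, birth year, sex).
records = [
    ("98101", 1975, "F"),
    ("98102", 1975, "F"),
    ("98103", 1982, "M"),
    ("98199", 1982, "M"),
    ("02139", 1990, "F"),
]


def generalize(record):
    """Keep only the first three digits of the postal code."""
    postal, birth_year, sex = record
    return (postal[:3], birth_year, sex)


def unique_count(rows):
    """Count rows whose combination of attributes appears exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1)


raw_unique = unique_count(records)
generalized_unique = unique_count([generalize(r) for r in records])

print(raw_unique)          # 5
print(generalized_unique)  # 1
```

Even in this toy data one person stays unique after truncation, and every additional attribute released alongside the postal code tends to split the remaining groups further.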

But as mentioned by Kelly Edwards of the University of Washington, tools and legal contracts can contribute to trust, but trust is ultimately based on shared values. Portable consent, properly done, engages with frameworks like Synapse to create a culture of respect for data.

In fact, I think the combination of the contractual framework in portable consent and a platform like Synapse, with its terms of use, might make a big difference in protecting patient privacy. Seyfert-Margolis cited predictions that 500 million smartphone users will be using medical apps by 2015. But mobile apps are notoriously greedy for personal data and cavalier toward user rights. Suppose all those smartphone users stored their data in a repository with clear terms of use and employed portable consent to grant access to the apps? We might all be safer.

The final article in this series will evaluate the prospects for open research in genetics, with a look at the grip of journal publishing on the field, and some comparisons to the success of free and open source software.

Next: Breaking Open Rewards and Incentives. All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.

Four short links: 1 May 2012

  1. Sugata Mitra: Beyond The Hole in the Wall (YouTube) -- great talk by the education researcher Sugata Mitra whose big kick is self-directed learning. Great stories about the deployments and effects he's had with technology and supervision rather than teaching, but the end is a real kicker: the core skills we have are literacy, search, and belief. Of the three, the most problematic is belief: when and how do/should we turn something we've read into something ingrained, accepted, and built-upon? (via Tara Taylor-Jorgenson)
  2. Interview with Bunnie Huang (Makezine) -- fascinating interview with the hardware guy behind the Chumby. It's all gold, from rapid iteration at early stages of hardware through to the need to simplify. I think one of the most gut-wrenching realizations that small companies have to make is that they aren’t Apple. Apple spends over a billion dollars a year on tooling. An injection molding tool may cost around $40k and 2-3 months to make; Apple is known to build five or six simultaneously and then scrap all but one so they can evaluate multiple design approaches. But for them, tossing $200k in tooling to save 2 months time to market is peanuts. But for a startup that raised a million bucks, it’s unthinkable. Apple also has hundreds of staff; a startup has just a few members to do everything. The precision and refinement of Apple’s products come at an enormous cost that is just out of the reach of startups.
  3. ssh as Chrome Extension -- can't help but feel that building a secure login system on top of web browsers on top of operating systems isn't going to be more secure than building a secure login system on top of the operating system.
  4. (Tablet) Size Matters (Luke Wroblewski) -- as the screen gets bigger, we use the Web more.

April 30 2012

Recombinant Research: Sage Congress promotes data sharing in genetics

Given the exponential drop in the cost of personal genome sequencing (you can get a basic DNA test from 23andMe for a couple hundred dollars, and a full sequence will probably soon come down to one thousand dollars in cost), a new dawn seems to be breaking forth for biological research. Yet the assessment of genetics research at the recent Sage Congress was highly cautionary. Various speakers chided their own field for tilling the same ground over and over, ignoring the urgent needs of patients, and just plain researching the wrong things.

Sage Congress also has some plans to fix all that. These projects include tools for sharing data and storing it in cloud facilities, running challenges, injecting new fertility into collaboration projects, and ways to gather more patient data and bring patients into the planning process. Through two days of demos, keynotes, panels, and breakout sessions, Sage Congress brought its vision to a high-level cohort of 230 attendees from universities, pharmaceutical companies, government health agencies, and others who can make change in the field.

In the course of this series of articles, I'll pinpoint some of the pain points that can force researchers, pharmaceutical companies, doctors, and patients to work together better. I'll offer a look at the importance of public input, legal frameworks for cooperation, the role of standards, and a number of other topics. But we'll start by seeing what Sage Bionetworks and its pals have done over the past year.

Synapse: providing the tools for genetics collaboration

Everybody understands that change is driven by people and the culture they form around them, not by tools, but good tools can make it a heck of a lot easier to drive change. To give genetics researchers the best environment available to share their work, Sage Bionetworks created the Synapse platform.

Synapse recognizes that data sets in biological research are getting too large to share through simple data transfers. For instance, in his keynote about cancer research (where he kindly treated us to pictures of cancer victims during lunch), UC Santa Cruz professor David Haussler announced plans to store 25,000 cases at 200 gigabytes per case in the Cancer Genome Atlas, also known as TCGA in what seems to be a clever pun on the four nucleotides in DNA. Storage requirements thus work out to 5 petabytes, which Haussler wants to be expandable to 20 petabytes. In the face of big data like this, the job becomes moving the code to the data, not moving the data to the code.
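The storage figures above are easy to check with a few lines of arithmetic. This sketch assumes decimal units (1 petabyte = 1,000,000 gigabytes); the constant and function names are mine.

```python
# Back-of-the-envelope check of the storage figures quoted above.
# Assumption: decimal units, so 1 petabyte = 1,000,000 gigabytes.
CASES = 25_000
GB_PER_CASE = 200
GB_PER_PB = 1_000_000


def total_petabytes(cases: int, gb_per_case: int) -> float:
    """Return the total storage in petabytes for the given number of cases."""
    return cases * gb_per_case / GB_PER_PB


planned_pb = total_petabytes(CASES, GB_PER_CASE)

# Number of 200 GB cases that would fit in the 20-petabyte expansion target.
cases_at_20_pb = 20 * GB_PER_PB // GB_PER_CASE

print(planned_pb)      # 5.0
print(cases_at_20_pb)  # 100000
```

At that size, copying the full collection to each researcher is impractical, which is the reason for moving the code to the data.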

Synapse points to data sets contributed by cooperating researchers, but also lets you pull up a console in a web browser to run R or Python code on the data. Some effort goes into tagging each data set with associated metadata: tissue type, species tested, last update, number of samples, etc. Thus, you can search across Synapse to find data sets that are pertinent to your research.
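As a rough illustration of what searching on such metadata could look like, here is a minimal sketch in plain Python. It does not use the Synapse API; `DataSetRecord`, `CATALOG`, and `search` are hypothetical names, and the catalog entries are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataSetRecord:
    """Hypothetical metadata record; fields mirror the tags listed above."""
    name: str
    tissue_type: str
    species: str
    last_update: date
    num_samples: int


# Made-up catalog entries, for illustration only.
CATALOG = [
    DataSetRecord("example-breast-a", "breast", "human", date(2012, 3, 1), 500),
    DataSetRecord("example-breast-b", "breast", "mouse", date(2012, 1, 15), 120),
    DataSetRecord("example-colon-a", "colon", "human", date(2011, 11, 30), 80),
]


def search(catalog, min_samples=0, **tags):
    """Return records that match every given tag exactly and have at
    least min_samples samples."""
    return [
        record
        for record in catalog
        if record.num_samples >= min_samples
        and all(getattr(record, key) == value for key, value in tags.items())
    ]


matches = search(CATALOG, min_samples=100, species="human", tissue_type="breast")
print([record.name for record in matches])  # ['example-breast-a']
```

A search like this is only as good as the tagging behind it, which is why the effort spent on metadata matters.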

One group working with Synapse has already harmonized and normalized the data sets in TCGA so that a researcher can quickly mix and run stats on them to extract emerging patterns. The effort took about one and a half full-time employees for six months, but the project leader is confident that with the system in place, "we can activate a similar size repository in hours."

This contribution highlights an important principle behind Synapse (appropriately called "viral" by some people in the open source movement): when you have manipulated and improved upon the data you find through Synapse, you should put your work back into Synapse. This work could include cleaning up outlier data, adding metadata, and so on. To make work sharing even easier, Synapse has plans to incorporate the Amazon Simple Workflow Service (SWF). It also hopes to add web interfaces to allow non-programmers to do useful work with data.

The Synapse development effort was an impressive one, coming up with a feature-rich Beta version in a year with just four coders. And Synapse code is entirely open source. So not only is the data distributed, but the creators will be happy for research institutions to set up their own Synapse sites. This may make Synapse more appealing to geneticists who are prevented by inertia from visiting the original Synapse.

Mike Kellen, introducing Synapse, compared its potential impact to that of moving research from a world of journals to a world like GitHub, where people record and share every detail of their work and plans. Along these lines, Synapse records who has used a data set. This has many benefits:

  • Researchers can meet up with others doing related work.

  • It gives public interest advocates a hook with which to call on those who benefit commercially from Synapse--as we hope the pharmaceutical companies will--to contribute money or other resources.

  • Members of the public can monitor accesses for suspicious uses that may be unethical.

There's plenty more work to be done to get data in good shape for sharing. Researchers must agree on some kind of metadata--the dreaded notion of ontologies came up several times--and clean up their data. They must learn about data provenance and versioning.

But sharing is critical for such basics of science as reproducing results. One source estimates that 75% of published results in genetics can't be replicated. A later article in this series will examine a new model in which enough metainformation is shared about a study for it to be reproduced, and even more important to be a foundation for further research.

With this Beta release of Synapse, Sage Bionetworks feels it is ready for a new initiative to promote collaboration in biological research. But how do you get biologists around the world to start using Synapse? For one, try an activity that's gotten popular nowadays: a research challenge.

The Sage DREAM challenge

Sage Bionetworks' DREAM challenge asks genetics researchers to find predictors of the progression of breast cancer. The challenge uses data from 2000 women diagnosed with breast cancer, combining information on DNA alterations affecting how their genes were expressed in the tumors, clinical information about their tumor status, and their outcomes over ten years. The challenge is to build models integrating the alterations with molecular markers and clinical features to predict which women will have the most aggressive disease over a ten-year period.

Several hidden aspects of the challenge make it a clever vehicle for Sage Bionetworks' values and goals. First, breast cancer is a scourge whose urgency is matched by its stubborn resistance to diagnosis. The famous 2009 recommendations of the U.S. Preventive Services Task Force, after all the controversy was aired, left us with the dismal truth that we don't know a good way to predict breast cancer. Some women get mastectomies in the total absence of symptoms based just on frightening family histories. In short, breast cancer puts the research and health care communities in a quandary.

We need finer-grained predictors to say who is likely to get breast cancer, and standard research efforts up to now have fallen short. The Sage proposal is to marshal experts in a new way that combines their strengths, asking them to publish models that show the complex interactions between gene targets and influences from the environment. Sage Bionetworks will publish data sets at regular intervals that it uses to measure the predictive ability of each model. A totally fresh data set will be used at the end to choose the winning model.

The process behind the challenge--particularly the need to upload code in order to run it on the Synapse site--automatically forces model builders to publish all their code. According to Stephen Friend, founder of Sage Bionetworks, "this brings a level of accountability, transparency, and reproducibility not previously achieved in clinical data model challenges."

The process has two more effects: it shows off the huge amount of genetic data that can be accessed through Synapse, and it encourages researchers to look at each other's models in order to boost their own efforts. In less than a month, the challenge has already received more than 100 models from 10 sources.

The reward for winning the challenge is publication in a respected journal, the gold medal still sought by academic researchers. (More on shattering this obelisk later in the series.) Science Translational Medicine will accept results of the evaluation as a stand-in for peer review, a real breakthrough for Sage Bionetworks because it validates their software-based, evidence-driven process.

Finally, the DREAM challenge promotes use of the Synapse infrastructure, and in particular the method of bringing the code to the data. Google is donating server space for the challenge, which levels the playing field for researchers, freeing them from paying for their own computing.

A single challenge doesn't solve all the problems of incentives, of course. We still need to persuade researchers to put up their code and data on a kind of genetic GitHub, persuade pharmaceutical companies to support open research, and persuade the general public to share data about their phenomes (life data) and genes--all topics for upcoming articles in the series.

Next: Sage Congress Plans for Patient Engagement. All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.

Four short links: 30 April 2012

  1. Chanko (Github) -- trivial A/B testing from within Rails.
  2. OpenMeetings -- Apache project for audio/video conferencing, screen sharing, whiteboard, calendar, and other groupware features.
  3. Low Innovation Internet (Wired) -- I disagree, I think this is a Louis CK Nobody's Happy moment. We renormalize after change and become blind to the amazing things we're surrounded by. Hundreds of thousands (millions?) of people work from home, collaborate to develop software that has saved the world billions of dollars in licensing fees, provide services, write and share books, make voice and video calls, create movies, fund creative projects, buy and sell used goods, and you're unhappy because there aren't "huge changes"? Have you spoken to someone in the publishing, music, TV, film, newspaper, retail, telephone, or indeed any industry that exists outside your cave, you obtuse contrarian pillock? There's no room on my Internet for weenie whiners.
  4. Context-Free Patent Art -- endlessly amusing. (via David Kaneda)

April 26 2012

Four short links: 26 April 2012

  1. Apollo Software -- amazing collection of source code to the software behind the Apollo mission. And memos, and quick references, and operations plans, and .... Just another reminder that the software itself is generally dwarfed by its operation.
  2. flickrapi.js (Github) -- Aaron Straup Cope's Javascript library for Flickr.
  3. t (Github) -- command-line power-tool for Twitter.
  4. Habits of Mind (PDF) -- Much more important than specific mathematical results are the habits of mind used by the people who create those results, and we envision a curriculum that elevates the methods by which mathematics is created, the techniques used by researchers, to a status equal to that enjoyed by the results of that research. Loved it: talks about the habits and mindsets of mathematicians, rather than the set of algorithms and postulates students must be able to recall. (via Dan Meyer)

April 25 2012

Four short links: 25 April 2012

  1. World History Since 1300 (Coursera) -- Coursera expands offerings to include humanities. This content is in books and already in online lectures in many formats. What do you get from these? Online quizzes and the online forum with similar people considering similar things. So it's a book club for a university course?
  2. mod_spdy -- Apache module for the SPDY protocol, Google's "faster than HTTP" HTTP.
  3. The Top 10 Dying Industries in the United States (Washington Post) -- between the Internet and China, yesterday's cash cows are today's casseroles.
  4. Notes from JSConf2012 -- excellent conference report: covers what happens, why it was interesting or not, and even summarizes relevant and interesting hallway conversations. AA++ would attend by proxy again. (via an old Javascript Weekly)

April 23 2012

April 20 2012

Four short links: 20 April 2012

  1. Tupac Coachella Behind the Technology (CBS) -- interesting to me is Dr. Dre and Snoop Dogg were considering taking Shakur with them on tour. Just as Hobbit, Tintin, etc. are CG-ing characters to look normal, is the future of "live" spectacle to be this kind of CG show? Will new acts be competing against the Rolling Stones forever?
  2. Javascript All The Way Down (Alex Russell) -- points out that we're fixing so much like compatibility, performance, accessibility, all this stuff with Javascript. We're moving further and further from declarative programming and more and more back to the days of writing heaps of Xlib or Motif toolkit code to implement our UIs and apps.
  3. wkhtmltopdf (Google Code) -- Simple shell utility to convert html to pdf using the webkit rendering engine, and qt. My first piece of "I wrote this, now you can use it too" open source was an HTML to PS converter (this was 1994 or so) via LaTeX. It's a useful thing, no really.
  4. Nicira (Wired) -- moving network management into software so the network hardware is as dumb as possible. Interesting continuation of the End-to-End principle, whereby smarts live at the edges of the network and the conduits are dumb.

April 19 2012

Sage Congress: The synthesis of open source with genetics

For several years, O'Reilly Radar has been covering the exciting
potential that open source software, open data, and a general attitude
of sharing and cooperation bring to health care. Along with many
exemplary open source projects in areas directly affecting the
public — such as the VA's Blue
Button in electronic medical records and the Direct project
(http://wiki.directproject.org/) in data
exchange — the study of disease is undergoing a paradigm shift.

Sage Bionetworks stands at the
center of a wide range of academic researchers, pharmaceutical
companies, government agencies, and health providers realizing that
the old closed system of tiny teams who race each other to a cure has
got to change. Today's complex health problems, such as Alzheimer's,
AIDS, and cancer, are too big for a single team. And these
institutions are slowly wrenching themselves out of the habit of data
hoarding and finding ways to work together.

A couple weeks ago I talked to the founder of Sage Bionetworks,
Stephen Friend, about recent advances in open source in this area, and
the projects to be highlighted at the upcoming Sage Commons congress
(http://sagecongress.org/). Steve is careful
to call this a "congress" instead of a "conference" because all
attendees are supposed to pitch in and contribute to the meme pool. I
covered Sage Congress in a series of articles last year. The
following podcast ranges over
topics such as:

  • what is Sage Bionetworks [Discussed at the 00:25 mark];
  • the commitment of participants to open source software [Discussed at the 01:01 mark];
  • how open source can support a business model in drug development [Discussed at the 01:40 mark];
  • a look at the upcoming congress [Discussed at the 03:47 mark];
  • citizen-led contributions or network science [Discussed at the 06:12 mark];
  • data sharing philosophy [Discussed at the 09:01 mark];
  • when projects are shared with other institutions [Discussed at the 12:43 mark];
  • how to democratize medicine [Discussed at the 17:10 mark];
  • a portable legal consent approach where the patient controls his or her own data [Discussed at the 20:07 mark];
  • solving the problem of non-sharing in the industry [Discussed at the 22:15 mark]; and
  • key speakers at the congress [Discussed at the 26:35 mark].

Sessions from the congress will be broadcast live via webcast and posted on the Internet.

Four short links: 19 April 2012

  1. Superfastmatch -- open source text comparison tool, used to locate plagiarism/churnalism in online news sites. You can pull out the text engine and use it for your own "find where this text is used elsewhere" applications (e.g., what's being forwarded out in email, how much of this RFP is copy and paste, what's NOT boilerplate in this contract, etc.). (via Pete Warden)
  2. Ten Design Principles for Engaging Math Tasks (Dan Meyer) -- education gold, engagement gold, and some serious ideas you can use in your own apps.
  3. Clustering Related Stories (Jenny Finkel) -- description of how to cluster related stories, talks about some of the tricks. Interesting without being too scary.
  4. Prince of Persia (GitHub) -- I have waited to see if the novelty wore off, but I still find this cool: 1980s source code on GitHub.

April 18 2012

Four short links: 18 April 2012

  1. CartoDB (GitHub) -- open source geospatial database, API, map tiler, and UI. For feature comparison, see Comparing Open Source CartoDB to Fusion Tables (via Nelson Minar).
  2. Future Telescope Array Drives Exabyte Processing (Ars Technica) -- Astronomical data is massive, and requires intense computation to analyze. If it works as planned, Square Kilometer Array will produce over one exabyte (2^60 bytes, or approximately 1 billion gigabytes) every day. This is roughly twice the global daily traffic of the entire Internet, and will require storage capacity at least 10 times that needed for the Large Hadron Collider. (via Greg Linden)
  3. Faster Touch Screens More Usable (Toms Hardware) -- check out that video! (via Greg Linden)
  4. Why Microsoft's New Open Source Division (Simon Phipps) -- The new "Microsoft Open Technologies, Inc." provides an ideal firewall to protect Microsoft from the risks it has been alleging exist in open source and open standards. As such, it will make it "easier and faster" for them to respond to the inevitability of open source in their market without constant push-back from cautious and reactionary corporate process.

April 17 2012

Microsoft opens up

While Microsoft's previous stance on open source systems is well known, it turns out there's been a serious shift as Microsoft looks to bring more non-.NET programmers into the fold.

On April 12, Jean Paoli, president of a new subsidiary of Microsoft called Microsoft Open Technologies, Inc., wrote about the new initiative. In his words, the subsidiary was created "to advance the company's investment in openness — including interoperability, open standards and open source." This is a public step toward working with open source communities and integrating technologies into Microsoft's closed initiatives, which may not be quite so closed in the future. With that in mind, below I take a look at what's new with Microsoft and open source.

While these projects provide proof that the pendulum is swinging in the open source direction, the impact for Microsoft can and will be much more resounding. New markets, programmers, and communities are at play here if this new tack goes well.

Attracting the polyglot programmers

This shift in ideology will likely help Microsoft on a number of fronts, including finding new programmers and communities. For example, Microsoft may lure developers to Windows 8 — rumored to be launching in October — by making it as easy as possible to get up and running. HTML5/JavaScript as well as C++ can be used to create Windows 8 Metro applications, and Microsoft hasn't forgotten its own .NET developers, who will use C#. The common theme you will see with the Windows 8 release and others is that Microsoft is trying to become less isolated from the rest of the programming community, many of whom are now polyglot programmers.

Hadoop's halo effect

Azure, Microsoft's cloud platform, is slowly gaining momentum as enterprises make the shift to cloud services. The key word here is "slowly." On the other hand, Hadoop, an open source Apache project that's become a central part of the big data movement, has a huge and active community that's improving the code minute by minute. Supporting Hadoop on Azure lets Microsoft incorporate the popularity and visibility of an open source project into a Microsoft initiative that needs more exposure.

A marketing signal

With a Microsoft Openness website that speaks to the relationship it has with open source technologies, and an accompanying Twitter account (@OpenatMicrosoft) with more than 6,500 followers, the Microsoft marketing team also seems to think open source exposure is important. (Side note: Gianugo Rabellino, Microsoft senior director of open source communities, and one of the people tweeting from the @OpenatMicrosoft account, will be presenting at the OSCON conference this summer.)

As Microsoft continues to see viable open source projects gain momentum, you can be sure that it will work on including ways for those languages, libraries, and frameworks to be incorporated into its current and future platforms. But the more meaningful change is that Microsoft is seeing that opening its own technologies to programmers will only make its products better, more accessible, and central to the future of programming.

Photo: Open Sign by dlofink, on Flickr

April 16 2012

Four short links: 16 April 2012

  1. Peter Thiel's Class 4 Notes -- in perfect competition, marginal revenues equal marginal costs. So high margins for big companies suggest that two or more businesses might be combined: a core monopoly business (search, for Google), and then a bunch of other various efforts (robotic cars, TV, etc.). Cash builds up because it turns out that it doesn’t cost all that much to run the monopoly piece, and it doesn’t make sense to pump it into all the side projects. In a competitive world, you would have to be funding a lot more side projects to stay even. In a monopoly world, you should pour less into side projects, unless politics demand that the cash be spread around. Amazon currently needs to reinvest just 3% of its profits. It has to keep running to stay ahead, but it’s more easy jog than intense sprint. I liked the whole lecture, but this bit really stood out for me.
  2. Kickstarter Disrupting Consumer Electronics (Amanda Peyton) -- good point that most people wouldn't have thought that consumer electronics would lend itself to the same funding system as CDs of a one-act play about artisanal beadwork comic characters. Consumer electronics as a market has been ripe for disruption all along. That said, it’s ridiculously not obvious that disruption would come from the same place that allows an artist with a sharpie, a hotel room and a webcam a way to make the art she wants.
  3. OmniOS -- OmniTI's JEOS. Their team are engineers par excellence, so this promises to be good.
  4. Understanding Amazon's Ebook Strategy (Charlie Stross) -- By foolishly insisting on DRM, and then selling to Amazon on a wholesale basis, the publishers handed Amazon a monopoly on their customers—and thereby empowered a predatory monopsony. So very accurate.

April 14 2012

MySQL in 2012: Report from Percona Live

The big annual MySQL conference, started by MySQL AB in 2003 and run
by my company O'Reilly for several years, lives on under the able
management of Percona. This
fast-growing company started out doing consulting on MySQL,
particularly in the area of performance, and branched out into
development and many other activities. The principals of this company
wrote the most recent two editions of the popular O'Reilly book High
Performance MySQL (http://shop.oreilly.com/product/0636920022343.do).

Percona started offering conferences a couple years ago and decided to
step in when O'Reilly decided not to run the annual MySQL conference
any more. Oracle did not participate in Percona Live, but has
announced its own MySQL conference
(http://www.oracle.com/us/corporate/press/1577449) for next September.

Percona Live struck me as a success, with about one thousand attendees
and the participation of leading experts from all over the MySQL
world, save for Oracle itself. The big players in the MySQL user
community came out in force: Facebook, HP, Google, Pinterest (the
current darling of the financial crowd), and so on.

The conference followed the pattern laid down by old ones in just
about every way, with the same venue (the Santa Clara Convention
Center, which is near a light-rail but nothing else of interest), the
same food (scrumptious), the standard format of one day of tutorials
and two days of sessions (although with an extra developer day tacked
on, which I will describe later), an expo hall (smaller than before,
but with key participants in the ecosystem), and even community awards
(O'Reilly Media won an award as Corporate Contributor of the Year).
Monty Widenius was back as always with a MariaDB entourage, so it
seemed like old times. The keynotes seemed less well attended than the
ones from previous conferences, but the crowd was persistent and
showed up in impressive numbers for the final events--and I don't
believe it was because everybody thought they might win one of the
door prizes.

Jeremy Zawodny ready to hand out awards.

Two contrasting database deployments

I checked out two well-attended talks by system architects from two
high-traffic sites: Pinterest and craigslist. The radically divergent
paths they took illustrate the range of options open to data centers
nowadays--and the importance of studying these options so a data
center can choose the path appropriate to its mission and
applications.

Jeremy Zawodny (co-author of the first edition of High Performance
MySQL) presented the design of craigslist's site
(http://www.percona.com/live/mysql-conference-2012/sessions/living-sql-and-nosql-craigslist-pragmatic-approach),
which illustrates the model of software accretion over time and an
eager embrace of heterogeneity. Among their components are:


  • Memcache, lying between the web servers and the MySQL database in
    classic fashion.

  • MySQL to serve live postings, handle abuse, store data for the
    monitoring system, and meet other immediate needs.

  • MongoDB to store almost 3 billion items related to archived (no longer
    live) postings.

  • HAproxy to direct requests to the proper MySQL server in a cluster.

  • Sphinx for text searches, with indexes over all live postings,
    archived postings, and forums.

  • Redis for temporary items such as counters and blobs.

  • An XFS filesystem for images.

  • Other helper functions that Zawodny lumped together as "async
    services."


Care and feeding of this menagerie becomes a job all in itself.
Although craigslist hires enough developers to assign them to
different areas of expertise, they have also built an object layer
that understands MySQL, cache, Sphinx, and MongoDB. The original purpose
of this layer was to aid in migrating old data from MySQL to MongoDB
(a procedure Zawodny admitted was painful and time-consuming) but it
was extended into a useful framework that most developers can use
every day.

Zawodny praised MySQL's durability and its method of replication. But
he admitted that they used MySQL also because it was present when they
started and they were familiar with it. So adopting the newer entrants
into the data store arena was by no means done haphazardly or to try
out cool new tools. Each one precisely meets particular needs of the
site.


For instance, besides being fast and offering built-in sharding,
MongoDB was appealing because they don't have to run ALTER TABLE every
time they add a new field to the database. Old entries coexist happily
with newer ones that have different fields. Zawodny also likes using a
Perl client to interact with a database, and the Perl client provided
by MongoDB is unusually robust because it was developed by 10gen
directly, in contrast to many other datastores where Perl was added by
some random volunteer.
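
To make the ALTER TABLE point concrete, here is a minimal sketch in Python. Plain dicts stand in for documents in a schemaless store; the field names and the helper function are invented for illustration and are not craigslist's schema or the MongoDB client API.

```python
# Plain dicts stand in for documents in a schemaless collection.
# All field names here are hypothetical.
archive = [
    # An older posting, written before "neighborhood" existed.
    {"_id": 1, "title": "Used bike", "price": 80},
    # A newer posting that carries the extra field.
    {"_id": 2, "title": "Couch", "price": 150, "neighborhood": "SoMa"},
]


def neighborhood_of(doc):
    """Read the newer field, tolerating documents that predate it."""
    return doc.get("neighborhood", "unknown")


labels = [neighborhood_of(doc) for doc in archive]
print(labels)
```

The cost of skipping the migration moves into the reading code, which has to supply a default for entries that never had the field.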

The architecture at craigslist was shrewdly chosen to match their
needs. For instance, because most visitors click on the limited set
of current listings, the Memcache layer handles the vast majority of
hits and the MySQL database has a relatively light load.


However, the MySQL deployment is also carefully designed. Clusters are
vertically partitioned in two nested ways. First, different types of
items are stored on separate partitions. Then, within each type, the
nodes are further divided by the type of query:

  • A single master to handle all writes.

  • A group for very fast reads (such as lookups on a primary key).

  • A group for "long reads" taking a few seconds.

  • A special node called a "thrash handler" for rare, very complex
    queries.

It's up to the application to indicate what kind of query it is
issuing, and HAproxy interprets this information to direct the query
to the proper set of nodes.
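
The routing idea can be modeled in a few lines of Python. This is only a sketch of the concept, not craigslist's HAproxy configuration: the query-kind labels, the host names, and the round-robin choice within a group are all assumptions made for illustration.

```python
import itertools

# Hypothetical node groups for one partition; host names are invented.
POOLS = {
    "write": ["master-1"],                 # single master for all writes
    "fast_read": ["fast-1", "fast-2"],     # primary-key lookups
    "long_read": ["long-1", "long-2"],     # reads taking a few seconds
    "thrash": ["thrash-1"],                # rare, very complex queries
}

# One round-robin iterator per group (an assumed balancing policy).
_cycles = {kind: itertools.cycle(nodes) for kind, nodes in POOLS.items()}


def route(kind):
    """Pick the next node for a query whose kind the application declared."""
    if kind not in _cycles:
        raise ValueError(f"unknown query kind: {kind!r}")
    return next(_cycles[kind])
```

An application would call something like `route("fast_read")` before issuing a primary-key lookup. The important property is that the application, not the proxy, knows how expensive the query is likely to be.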

Naturally, redundancy is built in at every stage (three HAproxy
instances used in round robin, two Memcache instances holding the same
data, two data centers for the MongoDB archive, etc.).


It's also interesting what recent developments have been eschewed by
craigslist. They self-host everything and use no virtualization.
Zawodny admits this leads to an inefficient use of hardware, but
avoids the overhead associated with virtualization. For efficiency,
they have switched to SSDs, allowing them to scale down from 20
servers to only 3. They don't use a CDN, finding that with aggressive
caching and good capacity planning they can handle the load
themselves. They send backups and logs to a SAN.

Let's turn now from the teeming environment of craigslist to the
decidedly lean operation of Pinterest, a much younger and smaller
organization. As presented by Marty Weiner and Yashh Nelapati
(http://www.percona.com/live/mysql-conference-2012/sessions/scaling-pinterest),
when they started web-scale growth in the autumn of 2011, they reacted
somewhat like craigslist,
but with much less thinking ahead, throwing in all sorts of software
such as Cassandra and MongoDB, and diversifying a bit recklessly.
Finally they came to their senses and went on a design diet. Their
resolution was to focus on MySQL--but the way they made it work is
unique to their data and application.

They decided against using a cluster, afraid that bad application code
could crash everything. Sharding is much simpler and doesn't require
much maintenance. Their advice for implementing MySQL sharding
included:

  • Make sure you have a stable schema, and don't add features for a
    couple months.


  • Remove all joins and complex queries for a while.

  • Do simple shards first, such as moving a huge table into its own
    database.

They use Pyres, a
Python clone of Resque, to move data into shards.

However, sharding imposes severe constraints that led them to
hand-crafted work-arounds.

Many sites want to leave open the possibility for moving data between
shards. This is useful, for instance, if they shard along some
dimension such as age or country, and they suddenly experience a rush
of new people in their 60s or from China. The implementation of such a
plan requires a good deal of coding, described in the O'Reilly book
MySQL High Availability (http://shop.oreilly.com/product/9780596807290.do),
including the creation of a service that just
accepts IDs and determines what shard currently contains the ID.

The Pinterest staff decided the ID service would introduce a single
point of failure, and decided just to hard-code a shard ID in every ID
assigned to a row. This means they never move data between shards,
although shards can be moved bodily to new nodes. I think this works
for Pinterest because they shard on arbitrary IDs and don't have a
need to rebalance shards.
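
A minimal sketch of what hard-coding a shard ID into every row ID might look like follows. The talk as reported here does not give the bit layout, so the field widths, function names, and host map below are assumptions chosen only to show the technique.

```python
SHARD_BITS = 16   # assumed width of the shard field
LOCAL_BITS = 36   # assumed width of the per-shard row counter


def make_id(shard_id, local_id):
    """Pack a shard ID and a per-shard counter into one integer ID."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << LOCAL_BITS) | local_id


def shard_of(row_id):
    """Recover the shard from the ID alone, with no lookup service."""
    return row_id >> LOCAL_BITS


def local_of(row_id):
    """Recover the per-shard counter."""
    return row_id & ((1 << LOCAL_BITS) - 1)


# Shards can still move bodily to new hosts: only this small map changes.
# Host names are hypothetical.
SHARD_TO_HOST = {7: "db-a", 8: "db-b"}


def host_for(row_id):
    return SHARD_TO_HOST[shard_of(row_id)]
```

Because the shard is derived arithmetically from the ID, no central ID service can fail. The trade-off is the one described above: a row can never change shards without changing its ID.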

Even more interesting is how they avoid joins. Suppose they want to
retrieve all pins associated with a certain board associated with a
certain user. In classical, normalized relational database practice,
they'd have to do a join on the board, pin, and user tables. But
Pinterest maintains extra mapping tables. One table maps users to
boards, while another maps boards to pins. They query the
user-to-board table to get the right board, query the board-to-pin
table to get the right pin, and then do simple queries without joins
on the tables with the real data. In a way, they implement a custom
NoSQL model on top of a relational database.
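
The mapping-table approach can be sketched with Python's built-in sqlite3 module standing in for MySQL. The table names, column names, and sample rows are hypothetical; in a sharded deployment each step could hit a different database, which is exactly why a join is not available.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pins (pin_id INTEGER PRIMARY KEY, caption TEXT);
    CREATE TABLE user_boards (user_id INTEGER, board_id INTEGER);
    CREATE TABLE board_pins (board_id INTEGER, pin_id INTEGER);
    INSERT INTO pins VALUES (100, 'teapot'), (101, 'bookshelf'), (102, 'bicycle');
    INSERT INTO user_boards VALUES (1, 10), (1, 11);
    INSERT INTO board_pins VALUES (10, 100), (10, 101), (11, 102);
""")


def pins_for_user_board(user_id, board_id):
    """Fetch a board's pins with three simple queries and no joins."""
    # Step 1: user-to-board mapping confirms the board belongs to the user.
    owned = conn.execute(
        "SELECT 1 FROM user_boards WHERE user_id = ? AND board_id = ?",
        (user_id, board_id),
    ).fetchone()
    if owned is None:
        return []
    # Step 2: board-to-pin mapping yields the pin IDs.
    pin_ids = [
        row[0]
        for row in conn.execute(
            "SELECT pin_id FROM board_pins WHERE board_id = ? ORDER BY pin_id",
            (board_id,),
        )
    ]
    # Step 3: primary-key lookups on the table holding the real data.
    return [
        conn.execute(
            "SELECT caption FROM pins WHERE pin_id = ?", (pin_id,)
        ).fetchone()[0]
        for pin_id in pin_ids
    ]
```

Each query touches one table and filters on an ID, so every step is the kind of lookup a cache or a single shard can answer on its own.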

Pinterest does use Memcache and Redis in addition to MySQL. As with
craigslist, they find that most queries can be handled by Memcache.
And the actual images are stored in S3, an interesting choice for a
site that is already enormous.

It seems to me that the data and application design behind Pinterest
would have made it a good candidate for a non-ACID datastore. They
chose to stick with MySQL, but like organizations that use NoSQL
solutions, they relinquished key aspects of the relational way of
doing things. They made calculated trade-offs that worked for their
particular needs.

My take-away from these two fascinating and well-attended talks was
that you must understand your application, its scaling and
performance needs, and its data structure, to know what you can
sacrifice and what solution gives you your sweet spot. craigslist
solved its problem through the very precise application of different
tools, each with particular jobs that fulfilled craigslist's
requirements. Pinterest made its own calculations and found an
entirely different solution depending on some clever hand-coding
instead of off-the-shelf tools.

Current and future MySQL

The conference keynotes surveyed the state of MySQL and some
predictions about where it will go.

Conference co-chair Sarah Novotny at keynotes.

The world of MySQL is much more complicated than it was a couple years
ago, before Percona got heavily into the work of releasing patches to
InnoDB, before they created entirely new pieces of software, and
before Monty started MariaDB with the express goal of making a better
MySQL than MySQL. You can now choose among Oracle's official MySQL
releases, Percona's supported version, and MariaDB's supported
version. Because these are all open source, a major user such as
Facebook can even apply patches to get the newest features.

Nor are these different versions true forks, because Percona and
MariaDB create their enhancements as patches that they pass back to
Oracle, and Oracle is happy to include many of them in a later
release. I haven't even touched on the commercial ecosystem around
MySQL, which I'll look at later in this article.

In his opening keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-mysql-evolution),
Percona founder Peter Zaitsev praised the latest MySQL
release by Oracle. With graceful balance he expressed pleasure that
the features most users need are in the open (community) edition, but
allowed that the proprietary extensions are useful too. In short, he
declared that MySQL is less buggy and has more features than ever.


The former CEO of MySQL AB, Mårten Mickos
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-making-lamp-cloud),
also found that MySQL is doing well under Oracle's wing. He just
chastised Oracle for failing
to work as well as it should with potential partners (by which I
assume he meant Percona and MariaDB). He lauded their community
managers but said the rest of the company should support them more.

Keynote by Mårten Mickos.

Brian Aker presented an OpenStack MySQL service developed by his current
employer, Hewlett-Packard
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-new-mysql-cloud-ecosystem).
His keynote retold the story that had led over the years to his
developing Drizzle (https://launchpad.net/drizzle), a true fork of MySQL
that tries to return it to its lightweight, Web-friendly roots, and
eventually working on cloud computing for HP. He described modularity,
effective use of multiple cores, and cloud deployment as the future of
databases.

A panel
(http://www.percona.com/live/mysql-conference-2012/sessions/future-perfect-road-ahead-mysql)
on the second day of the conference brought together high-level
managers from many of the companies that have entered the MySQL space
from a variety of directions in a high-level discussion of the
database engine's future. Like most panels, the conversation ranged
over a variety of topics--NoSQL, modular architecture, cloud
computing--but hit some depth only on the topic of security, which was
not represented very strongly at the conference and was discussed here
at the insistence of Slavik Markovich from McAfee.

Keynote by Brian Aker.


Many of the conference sessions disappointed me, being either very
high level (although presumably useful to people who are really new to
various topics, such as Hadoop or flash memory) or unvarnished
marketing pitches. I may have judged the latter too harshly though,
because a decent number of attendees came, and stayed to the end, and
crowded around the speakers for information.

Two talks, though, were so fast-paced and loaded with detail that my
typing couldn't possibly keep up with the speaker.

One such talk was the keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-what-comes-next)
by Mark Callaghan of Facebook. (Like the other keynotes, it should be
posted online soon.) A smattering of points from it:


  • Percona and MariaDB are adding critical features that make replication
    and InnoDB work better.

  • When a logical backup runs, it is responsible for 50% of IOPS.

  • Defragmenting InnoDB improves compression.

  • Resharding is not worthwhile for a large, busy site (an insight also
    discovered by Pinterest, as I reported earlier)


The other fact-filled talk
(http://www.percona.com/live/mysql-conference-2012/sessions/using-nosql-innodb-memcached)
was by Yoshinori Matsunobu of Facebook, and concerned how to achieve
NoSQL-like speeds while sticking with MySQL and InnoDB. Much of the
talk discussed an InnoDB memcached plugin, which unfortunately is
still in the "lab" or "pre-alpha" stage. But he also suggested some
other ways to improve performance, some involving memcached and others
more roundabout:

  • Coding directly with the storage engine API, which is storage-engine
    independent.


  • Using HandlerSocket, which queues write requests and performs them
    through a single thread, avoiding costly fsync() calls. This can
    achieve 30,000 writes per second, robustly.
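The single-writer idea in the last bullet can be sketched in a few lines of Python. This is only an illustration of the general technique (queue the writes, let one thread apply them in batches, and pay for one fsync() per batch rather than one per write). It is not HandlerSocket's code or API, and the names run_writer, STOP, and max_batch are invented for the example.

```python
import os
import queue
import tempfile
import threading

STOP = object()  # hypothetical sentinel that tells the writer to shut down


def run_writer(requests, path, max_batch=64):
    """Drain queued writes from one thread, issuing one fsync() per batch."""
    fsyncs = 0
    with open(path, "ab") as log:
        running = True
        while running:
            item = requests.get()  # block until at least one request arrives
            batch = []
            while True:
                if item is STOP:
                    running = False
                    break
                batch.append(item)
                if len(batch) >= max_batch:
                    break
                try:
                    item = requests.get_nowait()  # take whatever else is waiting
                except queue.Empty:
                    break
            if batch:
                log.write(b"".join(batch))
                log.flush()
                os.fsync(log.fileno())  # one durable flush for the whole batch
                fsyncs += 1
    return fsyncs


def demo_batched_writes(n_rows=1000):
    """Queue n_rows writes, run the writer thread, return (rows, fsyncs)."""
    requests = queue.Queue()
    for i in range(n_rows):
        requests.put(f"row {i}\n".encode())
    requests.put(STOP)
    result = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "writes.log")
        worker = threading.Thread(
            target=lambda: result.append(run_writer(requests, path)))
        worker.start()
        worker.join()
        with open(path, "rb") as f:
            rows = len(f.read().splitlines())
    return rows, result[0]


if __name__ == "__main__":
    print(demo_batched_writes())
```

With every request queued before the writer starts, the thread needs far fewer fsync() calls than there are writes, which is where the savings come from. A real server would also have to report completion back to each client.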

Matsunobu claimed that many optimizations are available within MySQL
because a lot of data can fit in main memory. For instance, if you
have 10 million users and store 400 bytes per user, the entire user
table can fit in 20 GB. Matsunobu's tests have shown that most CPU time
in MySQL is spent in functions that are not essential for processing
data, such as opening and closing a table. Each statement opens a
separate connection, which in turn requires opening and closing the
table again. Furthermore, a lot of data is sent over the wire besides
the specific fields requested by the client. The solutions in the talk
evade all this overhead.
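As a back-of-the-envelope check on that memory example, the raw row data can be computed directly. The function name is mine, and reading the gap between the raw size and the 20 GB figure as headroom for indexes and per-row storage overhead is my assumption, not something stated in the talk.

```python
def table_size_gb(rows, bytes_per_row):
    """Raw data size in decimal gigabytes, ignoring indexes and row overhead."""
    return rows * bytes_per_row / 10**9


raw_gb = table_size_gb(10_000_000, 400)  # figures from the example above
headroom = 20 / raw_gb                   # how many times the raw data fits in 20 GB

if __name__ == "__main__":
    print(f"raw data: {raw_gb} GB, headroom within 20 GB: {headroom}x")
```

By this estimate the rows themselves take about 4 GB, so a 20 GB budget leaves plenty of room for whatever the storage engine adds on top.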


The commercial ecosystem

Both as vendors and as sponsors, a number of companies have always
lent another dimension to the MySQL conference. Some of these really
have nothing to do with MySQL, but offer drop-in replacements for it.
Others really find a niche for MySQL users. Here are a few that I
happened to talk to:

  • Clustrix provides a very
    different architecture for relational data. They handle sharding
    automatically, permitting such success stories as the massive scaling
    up of the social media site Massive Media NV without extra
    administrative work. Clustrix also claims to be more efficient by
    breaking queries into fragments (such as the WHERE clauses of joins)
    and executing them on different nodes, passing around only the data
    produced by each clause.

  • Akiban also offers faster
    execution through a radically different organization of data. They
    flatten the tables of a normalized database into a single
    data structure: for instance, a customer and his orders may be located
    sequentially in memory. This seems to me an import of the document
    store model into the relational model. Creating, in effect, an object
    that maps pretty closely to the objects used in the application
    program, Akiban allows common queries to be executed very quickly, and
    could be deployed as an adjunct to a MySQL database.

  • Tokutek produced a drop-in
    replacement for InnoDB. The founders developed a new data structure
    called a fractal tree as a faster alternative to the B-tree structures
    normally used for indexes. The existence of Tokutek vindicates both
    the open source distribution of MySQL and its unique modular design,
    because these allowed Tokutek's founders to do what they do
    best--create a new storage engine--without needing to create a whole
    database engine with the related tools and interfaces it would
    require.

  • Nimbus Data Systems creates a
    flash-based hardware appliance that can serve as a NAS or SAN to
    support MySQL. They support a large number of standard data transfer
    protocols, such as InfiniBand, and provide such optimizations as
    caching writes in DRAM and making sure they write complete 64KB blocks
    to flash, thus speeding up transfers as well as preserving the life of
    the flash.


Post-conference events

A low-key developer's day followed Percona Live on Friday. I talked to
people in the Drizzle and
Sphinx tracks.

Because Drizzle is a relatively young project, its talks were aimed
mostly at developers interested in contributing. I heard talks about their
kewpie test framework and about build and release conventions. But in
keeping with its goal to make database use easy and light-weight, the
project has added some cool features.

Thanks to a JSON interface and a built-in web server, Drizzle now
presents you with a
Web interface for entering SQL commands. The Web interface translates
Drizzle's output to simple HTML tables for display, but you can also
capture the JSON directly, making programmatic access to Drizzle
easier. A developer explained to me that you can also store JSON
directly in Drizzle; it is simply stored as a single text column and
the JSON fields can be queried directly. This reminded me of an XQuery
interface added to some database years ago. There too, the XML was
simply stored as a text field and a new interface was added to run the
XQuery selects.
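To make the "JSON in a text column" idea concrete, here is a small sketch. It uses SQLite from Python's standard library as a stand-in and a user-defined SQL function that I call json_field. Neither is Drizzle's actual interface, which I did not see in enough detail to reproduce. The point is only that the document stays an ordinary text value and the field lookup is layered on top.

```python
import json
import sqlite3


def json_field(doc, key):
    """Return one top-level field from a JSON document stored as text."""
    try:
        value = json.loads(doc).get(key)
    except (TypeError, ValueError, AttributeError):
        return None  # not JSON, or not a JSON object
    if value is None or isinstance(value, (str, int, float)):
        return value
    return json.dumps(value)  # nested values come back as JSON text


def demo_json_query():
    conn = sqlite3.connect(":memory:")
    # Expose the Python helper to SQL under the (hypothetical) name json_field.
    conn.create_function("json_field", 2, json_field)
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs (body) VALUES (?)",
        [
            (json.dumps({"name": "alice", "city": "Boston"}),),
            (json.dumps({"name": "bob", "city": "Santa Clara"}),),
        ],
    )
    rows = conn.execute(
        "SELECT json_field(body, 'name') FROM docs "
        "WHERE json_field(body, 'city') = ?",
        ("Santa Clara",),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo_json_query())
```

Because every lookup reparses the text, this approach is simple but does nothing for query speed, which is the same trade-off I remember from the XQuery case.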

Sphinx, in contrast to Drizzle, is a mature product with commercial
support and (as mentioned earlier in the article) production
deployments at places such as craigslist, as well as an O'Reilly
book (http://shop.oreilly.com/product/9780596809539.do). I understood
better, after attending today's sessions, what
makes Sphinx appealing. Its quality is unusually high, due to the use
of sophisticated ranking algorithms from the research literature. The
team is looking at recent research to incorporate even better
algorithms. It is also fast and scales well. Finally, integration with
MySQL is very clean, so it's easy to issue queries to Sphinx and pick
up results.

Recent enhancements include an add-on called fSphinx
(https://github.com/alexksikes/fSphinx)
to make faceted searches faster (through caching) and easier, and
access to Bayesian Sets to find "items similar to this one." In Sphinx
itself, the team is working to add high availability, include a new
morphology (stemming, etc.) engine that handles German, improve
compression, and make other enhancements.


The day ended with a reception and copious glasses of Monty Widenius's
notorious licorice-flavored vodka, an ending that distinguishes the
MySQL conference from others for all time.

April 13 2012

Four short links: 13 April 2012

  1. Change the Game (Video) -- Amy Hoy's talk from Webstock '12, on being contrary and being successful. Was one of the standout talks for me.
  2. Rise4Fun -- software engineering tools from Microsoft Research. (via Hacker News)
  3. Why Obama's JOBS Act Couldn't Suck Worse (Rolling Stone) -- get ready for an avalanche of shareholder suits ten years from now, since post-factum civil litigation will be the only real regulation of the startup market.
  4. Socio-economic Return Of FTTH Investment in Sweden (PDF) -- This preliminary study analyses the socio-economic impacts of the investment in FTTH. The goal of the study was: Is it possible to calculate how much a krona (SEK) invested in fibre will give back to society? The conclusion is that a more comprehensive statistical data and more calculations are needed to give an exact estimate. The study, however, provides an indication that 1 SEK invested over four years brings back a minimum of 1.5 SEK in five years time. The study estimates the need for investment to achieve 100% fibre penetration, identifies and quantifies a number of significant effects of fibre deployment, and then calculates the return on investment. (via Donald Clark)

April 10 2012

Open source is interoperable with smarter government at the CFPB

When you look at the government IT landscape of 2012, federal CIOs are being asked to address a lot of needs. They have to accomplish their mission. They need to be able to scale initiatives to tens of thousands of agency workers. They're under pressure to address not just network security but web security and mobile device security. They also need to be innovative, because all of this is supported by the same or less funding. These are common requirements in every agency.

As the first federal "start-up agency" in a generation, some of those needs at the Consumer Financial Protection Bureau (CFPB) are even more pressing. On the other hand, the opportunity for the agency to be smarter, leaner and "open from the beginning" is also immense.

Progress in establishing the agency's infrastructure and culture over the first 16 months has been promising, save for the larger context of getting a director at the helm. Enabling open government by design isn't just a catchphrase at the CFPB. There has been a bold vision behind the CFPB from the outset, where a 21st century regulator would leverage new technologies to find problems in the economy before the next great financial crisis escalates.

In the private sector, there's great interest right now in finding actionable insight in large volumes of data. Making sense of big data is increasingly being viewed as a strategic imperative in the public sector as well. Recently, the White House put its stamp on that reality with a $200 million big data research and development initiative, including a focus on improving the available tools. There's now an entire ecosystem of software around Hadoop, which is itself open source code. The problem that now exists in many organizations, across the public and private sector, is not so much that the technology to manipulate big data isn't available: it's that the expertise to apply big data doesn't exist in-house. The data science talent shortage is real.

People who work and play in the open source community understand the importance of sharing code, especially when that action leads to improving the code base. That's not necessarily an ethic or a perspective that has been pervasive across the federal government. That does seem to be slowly changing, with leadership from the top: the White House used Drupal for its site and has since contributed modules back into the open source community, including one that helps with 508 compliance.

In an in-person interview last week, CFPB CIO Chris Willey (@ChrisWilleyDC) and acting deputy CIO Matthew Burton (@MatthewBurton) sat down to talk about the agency's new open source policy, government IT, security, programming in-house, the myths around code-sharing, and big data.

The fact that this government IT leadership team is strongly supportive of sharing code back to the open source community is probably the most interesting part of this policy, as Scott Merrill picked up in his post on the CFPB and Github.

Our interview follows.

In addition to being the leader of the CFPB's development team over the past year and a half, Burton was just named acting deputy chief information officer. What will that mean?

Willey: He hasn't been leading the software development team the whole time. In fact, we only really had an org chart as of October. In the time that he's been here, Matt has led his team to some amazing things. We're going to talk about one of them today, but we've also got a great intranet. We've got some great internal apps that are being built and that we've built. We've unleashed one version of the supervision system that helps bank examiners do their work in the field. We've got a lot of faith he's going to do great things.

What it actually means is that he's going to be backing me up as CIO. Even though we're a fairly small organization, we have an awful lot going on. We have 76 active IT projects, for example. We're just building a team. We're actually doubling in size this fiscal year, from about 35 staff to 70, as well as adding lots of contractors. We're just growing the whole pie. We've got 800 people on board now. We're going to have 1,100 on board in the whole bureau by the end of the fiscal year. There's a lot happening, and I recognize we need to have some additional hands and brain cells helping me out.

With respect to building an internal IT team, what's the thinking behind having technical talent inside of an agency like this one? What does that change, in terms of your relationship with technology and your capacity to work?

Burton: I think it's all about experimentation. Having technical people on staff allows an organization to do new things. I think the way most agencies work is that when they have a technical need, they don't have the technical people on staff to make it happen so instead, that need becomes larger and larger until it justifies the contract. And by then, the problem is very difficult to solve.

By having developers and designers in-house, we can constantly be addressing things as they come up. In some cases, before the businesses even know it's a problem. By doing that, we're constantly staying ahead of the curve instead of always reacting to problems that we're facing.

How do you use open source technology to accomplish your mission? What are the tools you're using now?

Willey: We're actually trying to use open source in every aspect of what we do. It's not just in software development, although that's been a big focus for us. We're trying to do it on the infrastructure side as well.

As we look at network and system monitoring, we look at the tools that help us manage the infrastructure. As I've mentioned in the past, we are 100% in the cloud today. Open source has been a big help for us in giving us the ability to manipulate those infrastructures that we have out there.

At the end of the day, we want to bring in the tools that make the most sense for the business needs. It's not about only selecting open source or having necessarily a preference for open source.

What we've seen is that over time, the open source marketplace has matured. A lot of tools that might not have been ready for prime time a year ago or two years ago are today. By bringing them into the fold, we potentially save money. We potentially have systems that we can extend. We could more easily integrate with the other things that we have inside the shop that maybe we built or maybe things that we've acquired through other means. Open source gives us a lot of flexibility because there's a lot of opportunities to do things that we might not be able to do with some proprietary software.

Can you share a couple of specific examples of open source tools that you're using and what you actually use them for within mission?

Willey: On network monitoring, for example, we're using ZFS, which is an open source monitoring tool. We've been working with Nagios as well. Nagios, we actually inherited from Treasury — and while Treasury's not necessarily known for its use of open source technologies, it uses that internally for network monitoring. Splunk is another one that we have been using for web analysis. [After the interview, Burton and Willey also shared that they built the CFPB's intranet on MediaWiki, the software that drives Wikipedia.]

Burton: On the development side, we've invested a lot in Django and WordPress. Our site is a hybrid of them. It's WordPress at its core, with Django on top of that.

In November of 2010, it was actually a few weeks before I started here, Merici [Vinton] called me and said, "Matt, what should we use for our website?"

And I said, "Well, what's it going to do?"

And she said, "At first, it's going to be a blog with a few pages."

And this website needed to be up and running by February. And there was no hosting; there was nothing. There were no developers.

So I said, "Use WordPress."

And by early February, we had our website up. I'm not sure that would have been possible if we had to go through a lengthy procurement process for something not open source.

We use a lot of jQuery. We use Linux servers. For development ops, we use Selenium and Jenkins and Git to manage our releases and source code. We actually have GitHub Enterprise, which although not open source, is very sharing-focused. It encourages sharing internally. And we're using GitHub on the public side to share our code. It's great to have the same interface internally as we're using externally.

Developers and citizens alike can go to github.com/cfpb and see code that you've released back to the public and for other federal agencies. What projects are there?

Burton: These are the ones that came up between basic building blocks. They range from code that may not strike an outside developer as that interesting but that's really useful for the government, all the way to things that we created from scratch that are very developer-focused and are going to be very useful for any developer.

On the first side of that spectrum, there's an app that we made for transit subsidy enrollment. Treasury used to manage our transit subsidy balances. That involved going to a webpage that you would print out, write into with a pen and then fax to someone.

Willey: Or scan and email it.

Burton: Right. And then once you'd had your supervisor sign it, faxed it over to someone, eventually, several weeks later, you would get your benefits. We started to take over that process and the human resources office came to us and asked, "How can we do this better?"

Obviously, that should just be a web form that you type into, that will auto fill any detail it knows about you. You press submit and it goes into the database, which goes directly to the DOT [Department of Transportation]. So that's what we made. We demoed that for DOT and they really like it. USAID is also into it. It's encouraging to see that something really simple could prove really useful for other agencies.

On the other side of the spectrum, we use a lot of Django tools. As an example, we have a tool we just released through our website called "Ask CFPB." It's a Django-based question and answer tool, with a series of questions and answers.

Now, the content is managed in Django. All of the content is managed from our staging server behind the firewall. When we need to get that content, we need to get the update from staging over to production.

Before, what we had to do was pick up the entire database, copy it and then move it over to production, which was kind of a nightmare. And there was no Django tool for selectively moving data modifications.

So we sat there and we thought, "Oh, we really need something to do that because we're going to be doing a lot of that. We can't be copying the database over every time we need to correct a copy." So two of our developers developed a Django app called "Nudge." Basically, you go into it and, if you've ever seen a Django admin, you just go into it and assess, "Hey, here's everything that's changed. What do you want to move over?"

You can pick and choose what you want to move over and, with the click of a button, it goes to production. I think that's something that every Django developer will have a use for if they have a staging server.

In a way, we were sort of surprised it didn't exist. So, we needed it. We built it. Now we're giving it back and anybody in the world can use it.

You mentioned the cloud. I know that CFPB is very associated with Treasury. Are you using Treasury's FISMA moderate cloud?

Willey: We have a mix of what I would say are private and public clouds. On the public side, we're using our own cloud environments that we have established. On the private side, we are using Treasury for some of our apps. We're slowly migrating off of Treasury systems onto our own cloud infrastructure or our own cloud.

In the case of email, for example, we're looking at email as a service. So we'll be looking at Google, Microsoft and others just to see what's out there and what we might be able to use.

Why is it important for the CFPB to share code back to the public? And who else in the federal government has done something like this, aside from the folks at the White House?

Burton: We see it the same way that we believe the rest of the open source community sees it: The only way this stuff is going to get better and become more viable is if people share. Without that, then it'll only be hobbyists. It'll only be people who build their own little personal thing. Maybe it's great. Maybe it's not. Open source gets better by the community actually contributing to it. So it's self-interest in a lot of ways. If the tools get better, then what we have available to us, therefore, gets better. We can actually do our mission better.

Using the transit subsidy enrollment application example, it's also an opportunity for government to help itself, for one agency to help another. We've created this thing. Every federal agency has a transit subsidy program. They all need to allow people to enroll in it. Therefore, it's immediately useful to any other agency in the federal government. That's just a matter of government improving its own processes.

If one group does it, why should another group have to figure it out or have to pay lots of money to have it figured out? Why not just share it internally and then everybody benefits?

Why do you think it's taken until 2012 to have that insight actually be made into reality in terms of a policy?

Burton: I think to some degree, the tools have changed. The ability to actually do this easily is a lot better now than it was even a year or two ago. Government also traditionally lags behind the private sector in a lot of ways. I think that's changing, too. With this administration in particular, I think what we've seen is that government has started to become a little bit on parity with the private sector, including some of the thinking around how to use technology to improve business processes. That's really exciting. And I think as a result, there are a lot of great people coming in as developers and designers who want to work in the federal government because they see that change.

Willey: It's also because we're new. There are two things behind that. First, we're able to sort of craft a technology philosophy with a modern perspective. So we can, from our founding, ask "What is the right way to do this?" Other agencies, if they want to do this, have to turn around decades of culture. We don't have that burden. I think that's a big reason why we're able to do this.

The second thing is a lot of agencies don't have the intense need that we do. We have 76 projects to do. We have to use every means available to us.

We can't say, "We're not going to use a large share of the software that's available to us." That's just not an option. We have to say, "Yes, we will consider this as a commercial good, just like any other piece of proprietary software."

In terms of the broader context for technology and policy, how does open source relate to open government?

Willey: When I was working for the District, Apps for Democracy was a big contest that we did around opening data and then asking developers to write applications using that data that could then be used by anybody. We said that the next logical step was to sort of create more participatory government. And in my mind, open sourcing the projects that we do is a way of asking the citizenry to participate in the active government.

So by putting something in the public space, somebody could pick that up. Maybe not the transit subsidy enrollment project — but maybe some other project that we've put out there that's useful outside of government as well as inside of government. Somebody can pick that code up, contribute to it and then we benefit. In that way, the public is helping us make government better.

When you have conversations around open source in government, what do you say about what it means to put your code online and to have people look at it or work on it? Can you take changes that people make to the code base to improve it and then use it yourself?

Willey: Everything that we put out there will be reviewed by our security team. The goal is that, by the time it's out there, not to have any security vulnerabilities. If someone does discover a security vulnerability, however, we'll be sharing that code in a way that makes it much more likely that someone will point it out to us and maybe even provide a fix than they will exploit it because it's out there. They wouldn't be exploiting our instance of the code; they would be working with the code on Github.com.

I've seen people in government with a misperception of what open source means. They hear that it's code that anyone can contribute to. I think that they don't understand that you're controlling your own instance of it. They think that anyone can come along and just write anything into your code that they like. And, of course, it's not like that.

I think as we talk more and more about this to other agencies, we might run into that, but I think it'll be good to have strong advocates in government, especially on the security side, who can say, "No, that's not the case; it doesn't work that way."

Burton: We have a firewall between our public and private instances at Git as well. So even if somebody contributes code, that's also reviewed on the way in. We wouldn't implement it unless we made sure that, from a security perspective, the code was not malicious. We're taking those precautions as well.

I can't point to one specifically, but I know that there have been articles and studies done on the relative security of open source. I think the consensus in the industry is that the peer review process of open source actually helps from a security perspective. It's not that you have a chaos of people contributing code whenever they want to. It improves the process. It's like the thinking behind academic papers. You do peer review because it enhances the quality of the work. I think that's true for open source as well.

We actually want to create a community of peer reviewers of code within the federal government. As we talk to agencies, we want people to actually use the stuff we build. We want them to contribute to it. We actually want them to be a community. As each agency contributes things, the other agencies can actually review that code and help each other from that perspective as well.

It's actually fairly hard. As we build more projects, it's going to put a little bit of a strain on our IT security team, doing an extra level of scrutiny to make sure that the code going out is safe. But the only way to get there is to grow that pie. And I think by talking with other agencies, we'll be able to do that.

A classic open source koan is that "with many eyes, all bugs become shallow." In IT security, is it that with many eyes, all worms become shallow?

Burton: What the Department of Defense said was if someone has malicious intent and the code isn't available, they'll have some way of getting the code. But if it is available and everyone has access to it, then any vulnerabilities that are there are much more likely to be corrected before they're exploited.

How do you see open source contributing to your ability to get insights from large amounts of data? If you're recruiting developers, can they actually make a difference in helping their fellow citizens?

Burton: It's all about recruiting. As we go out and we bring on data people and software developers, we're looking for that kind of expertise. We're looking for people that have worked with PostgreSQL. We're looking for people that have worked with Solr. We're looking for people that have worked with Hadoop, because then we can start to build that expertise in-house. Those tools are out there.

R is an interesting example. What we're finding is that as more people are coming out of academia into the professional world, they're actually used to using R in school. And then they have to come out and learn a different tool and they're actually working in the marketplace.

It's similar with the Mac versus the PC. You get people using the Mac in college — and suddenly they have to go to a Windows interface. Why impose that on them? If they're going to be extremely productive with a tool like R, why not allow that to be used?

We're starting to see, in some pockets of the bureau, push from the business side to actually use some of these tools, which is great. That's another change I think that's happened in the last couple of years.

Before, there would've been big resistance on that kind of thing. Now that we're getting pushed a little bit, we have to respond to that. We also think it's worth it that we do.
