
March 13 2012

Four short links: 13 March 2012

  1. Microsoft Universal Voice Translator -- the promise is that it converts your voice into another language, but the effect is more that it converts your voice into that of Darth You in another language. Still, that's like complaining that the first Wright Brothers flight didn't serve peanuts. (via Hacker News)
  2. Geography of the Basketball Court -- fascinating analytics of where NBA shooters make their shots from. Pretty pictures and sweet summaries even if you don't follow basketball. (via Flowing Data)
  3. Spark Research -- a programming model ("resilient distributed datasets") for applications that reuse an intermediate result in multiple parallel operations; a rough sketch of the idea appears after this list. (via Ben Lorica)
  4. Opening Up -- earlier I covered the problems that University of Washington's 3D printing lab had with the university's new IP policy, which prevented them from being as open as they had been. They've been granted the ability to distribute their work under Creative Commons license and are taking their place again as a hub of the emerging 3D printing world. (via BoingBoing)
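
To make item 3's "resilient distributed datasets" concrete, here is a minimal sketch using Spark's Python API; the log file name and the "ERROR"/"timeout" strings are my own illustration, not from the paper. The point is that one intermediate dataset is cached and then reused by several parallel operations:

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-reuse-sketch")

    # Build an intermediate dataset once: error lines from a (hypothetical) log.
    errors = sc.textFile("server.log").filter(lambda line: "ERROR" in line)
    errors.cache()  # keep the intermediate result in memory for reuse

    # Several parallel operations reuse the same cached dataset.
    total = errors.count()
    timeouts = errors.filter(lambda line: "timeout" in line).count()
    first_few = errors.take(5)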

January 05 2012

Traditional vs self-publishing: Neither is the perfect solution

This post is part of the TOC podcast series, which we'll be featuring here on Radar in the coming months. You can also subscribe to the free TOC podcast through iTunes.


Dan Gillmor (@dangillmor) is one of a growing number of authors who have published both with a traditional house and on their own. Like many others, he's decided neither is the perfect solution. In this video podcast, Dan talks about the pros and cons of both options. He offers valuable insight not only for authors trying to decide between traditional and self-publishing, but also for everyone in publishing thinking about their roles going forward.

Key points from the full video interview (below) include:

  • Creative Commons licensing still trips up publishers — It's disappointing, but true, that some publishers simply refuse to deal with an author who wants to use the Creative Commons license. [Discussed at the 1:08 mark.]
  • Fear of Creative Commons is similar to a fear of being DRM free — Both of these tie back to "control," and far too many publishers feel they lose control when using Creative Commons or abandoning DRM. [Discussed at 4:10.]
  • There's a reason authors like to have publishers — Sometimes the lesson isn't learned until an author self-publishes, but there are tasks and services publishers perform that authors tend to take for granted. [Discussed at 5:58.]
  • Should traditional publishers venture into self-publishing? — Be careful to not open the floodgates completely. There's still a need to have certain guard rails in place. [Discussed at 11:30.]
  • Now is the time for experimentation — And yet, as Dan notes, "the traditional publishing industry is even more risk averse than it used to be." [Discussed at 13:58.]


  • Even a self-published project can be a hybrid — Dan's latest book, Mediactive, was self-published but involved at least one rights deal with a traditional publisher. [Discussed at 15:26.]
  • Errata and other minor updates should be easy to address — But they're not! Despite all our advancements in technology and product distribution, most retailers are still unable to deal with changes to an edition. [Discussed at 23:20.]

You can view the entire interview in the following video.


December 13 2011

Four short links: 13 December 2011

  1. Newton's Notebooks Digitised -- wonderful for historians, professional and amateur. I love (a) his handwriting; (b) the pages full of long division that remind us what an amazing time-saver the calculator and then the computer were; (c) the use of "yn" for "then" (the y is actually a thorn, pronounced "th", and it's from this that we get "ye", actually pronounced "the"). All that and chromatic separation of light, inverse square law, and alchemical mysteries.
  2. Creative Commons Kicks Off 4.0 Round -- public discussion process around issues that will lead to a new version of the CC licenses.
  3. Shred -- an HTTP client library for node.js. (via Javascript Weekly)
  4. Holding Back the Age of Data (Redmonk) -- Absent a market with well understood licensing and distribution mechanisms, each data negotiation - whether the subject is attribution, exclusivity, license, price or all of the above - is a one off. Very good essay into the evolution of a mature software industry into an immature data industry.

June 16 2011

Choosing the right license for open data

You can't copyright a fact. But that doesn't mean that data and databases are exempt from legal discussions and licensing requirements, even if the intention is to share the data openly. Such is the case with the collaborative mapping project OpenStreetMap (OSM).

When OpenStreetMap launched, contributions to the project were licensed under the Creative Commons Attribution/ShareAlike license. That meant that anyone could copy OSM data, but if it was incorporated into another project, those same terms and conditions applied (ShareAlike) and the copyright owner had to be credited (Attribution). Although this license doesn't appear controversial and seems to fit nicely with the OpenStreetMap mission, there were problems — if for no other reason than that the Creative Commons licenses are meant to handle creative works, not data.

After much discussion with lawyers and with the community, OpenStreetMap opted to make the move to the Open Database License (ODbL), arguing it was more suited to OSM's purposes. I recently asked OSM founder Steve Coast about the decision and the process of making the switch.

What compelled OpenStreetMap to change to the Open Database License?

Steve Coast: Licensing is incredibly important for the community to trust that the data won't be closed off. So we need to make sure that data from OpenStreetMap will always be free and open. It's also important that we are able to stop anyone from trying to close it off or derive from it without giving back to the community. We have a multi-year process to re-license based on advice from multiple sources that Creative Commons is not applicable to data. We wish it were, and it probably will be in the future but it wasn't clear when we began. Until that happens we have a process to move to the Open Database License, which explicitly covers data and not just creative works like photographs or text. The ODbL was in fact started as a result of investigations around the needs of Science Commons and we just helped it to its conclusion.

At some point down the line I personally expect the ODbL and CC to be compatible and we will be able to cross-pollinate once more.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD


What are some of the arguments for and against the re-licensing?

Steve Coast: The arguments for are pretty clear. Creative Commons, as they themselves have asserted, doesn't cover the rights around data. We seek that protection.

The arguments against range from staying the course to believing that CC does in fact cover data. Some people feel CC has worked thus far, so why change? Others feel CC does cover data. These are legitimate arguments, and we have discussed them many times. We also have a small amount of data derived from sources (like aerial imagery) whose owners either refuse to change or are unresponsive. With them we have to work closely to find a way through.

What are some of the challenges of the move — in terms of technology, legality and the community?

Steve Coast: Probably the most annoying problem we have is that we cannot release legal advice. The lawyers helping us give clear direction for the most part, but we can't just take their emails and forward them to the community openly. If we did, we would lose legal protections, including privilege over communications between us and our advisers. Therefore we walk a fine line between being as open as we can and at the same time acting on the best advice we can get. That can be frustrating because people in the community can feel alienated and suspect that we might be hiding something bad.

Technically we will get to the point where we will have to remove some data from the project simply because some people will be unreachable if nothing else. It will be fairly insignificant given the corpus of OSM and the rate at which it grows but still, we'd prefer to not remove anything.

For the most part the community has been fantastic around this, especially the core people going through the process week after week for multiple years. We have our loud minority like any open community and we try to be as accommodating as we can. I would describe it as a vast evangelizing exercise. Explaining what we're doing and why to the first 10 people takes about six months. The next 100 takes about another six months — as those initial 10 talk to another 10. Then the next 1,000 take another six months. The next 10,000 take six months. Then the next 100,000 the same again. So it feels to the initial 10 or so like a very long slog explaining everything many, many times. But we have to be mindful that people come along all the time who have no idea why we're changing or what the plan looks like. So we need some very diplomatic people helping the process along. And of course all of those people looking at the license and the process expose bugs in it, which we have then worked out over time.

Having worked through this move for several years now, what are some of the lessons learned?

Steve Coast: I chose CC-BY-SA for the right reasons--community and openness--but I was not and am not a lawyer. You have to choose a license up front so people know that their contributions will not be closed off. On the other hand, if the data had been dual licensed to the OSM Foundation or me personally, then we could have switched the license much more quickly. The lesson, I think, is to have multiple options. You don't know which direction things will go three, four or five years down the line. For all the time and pain this change is costing us, it's extremely healthy and a maturing thing to do. It's been a forcing function on the structure of the foundation that supports the project and has made us build working groups and put a functioning board and finance structures in place. On average it's probably been a good thing.

The other lesson is to just be insanely open, and realize there will always be conspiracy theorists whom you will never convince. All our meetings were and are open. We have open minutes. But there are some who will refuse to use a telephone, or want you to use technology X to hold your meeting, or have it at time Y. You simply can't satisfy everyone. You have to make some choices while trying your best to accommodate demands. Don't get pulled down the rathole of trying to make everyone happy all of the time. You can shift the meeting once a month, or try the occasional new thing, but you have to make progress no matter what.

Looking back, I wish that I'd traveled more to spread the message and talk to more people in person. But that's extremely hard and expensive to do.

Hopefully other projects can start based on the years we have put in to this and license with the ODbL or perhaps dual-license with CC-BY-SA. With luck they will never have to know how much work it took to get here.

This interview was condensed and edited.


May 17 2011

Four short links: 17 May 2011

  1. Sorting Out 9/11 (New Yorker) -- the thorniest problem for the 9/11 memorial was the ordering of the names. Computer science to the rescue!
  2. Tagger -- Python library for extracting tags (statistically significant words or phrases) from a piece of text; a rough sketch of the general idea appears after this list.
  3. Free Science, One Paper at a Time (Wired) -- Jonathan Eisen's attempt to collect and distribute his father's scientific papers (which were written while a federal employee, so in the public domain), thwarted by old-fashioned scientific publishing. “But now,” says Jonathan Eisen, “there’s this thing called the Internet. It changes not just how things can be done but how they should be done.”
  4. Internet Archive Launches Physical Archive -- I'm keen to see how this develops, because physical storage has problems that digital does not. I'd love to see the donor agreement require the donor to give the archive full rights to digitize and distribute under open licenses. That'd put the Internet Archive a step in front of traditional archives, museums, libraries, and galleries, whose donor agreements typically let donors place arbitrary specifications on use and reuse ("must be inaccessible for 50 years", "no commercial use", "no use that compromises the work", etc.), all of which are barriers to wholesale digitization and reuse.
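
Tagger's own API isn't reproduced here; as an illustration of the general idea behind item 2 -- surfacing statistically significant words -- this is a minimal sketch in plain Python, where the stopword list and the frequency-based scoring are my own stand-ins:

    from collections import Counter
    import re

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "from"}

    def extract_tags(text, n=5):
        """Return the n most frequent non-stopword words as candidate tags."""
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in STOPWORDS)
        return [word for word, _ in counts.most_common(n)]

    print(extract_tags("Statistically significant words make good tags, "
                       "and significant phrases make even better tags."))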

May 06 2011

Collaborative genetics, part 5: Next steps for genetic commons

Previous installment: Private practice, how to respect the patient

Sage is growing, and everything they're doing to promote the commons now will likely continue. They'll sign up more pharma companies to contribute data and more researchers to work in teams, such as in the Federation.

Although genetics seems to be a narrow area, it's pretty central to everything that government, hospitals, and even insurers want to achieve in lowering costs and improving care. This research is at the heart of such tasks as:

  • Making drug development faster and cheaper (drugs are now a major source of inflation in health care, particularly among the growing elderly population)

  • Discovering in advance which patients will fail to respond to drugs, thus lowering costs and allowing them to access correct treatments faster

  • Improving our knowledge of the incidence and course of diseases in general

From my perspective--knowing little about medical research but a fair amount about software--the two biggest areas that need attention are standardized formats and software tools to support such activities as network modeling and analyzing results. Each institution tends to be on its own, but there are probably a lot of refined tools out there that could help everybody.

Researchers may well underestimate how much effort needs to go into standardizing software tools and formats, and how much pay-off that work would produce. Researchers tend to be loners, brave mountaineers who like to scale the peaks on their own and solve each problem through heroism along the way. Investing in a few cams and rappels could greatly enhance their success.

Publicity and public engagement are good for any initiative, but my guess is that, if Sage and its collaborators develop some awesome tools and show more of the results we started to see at this conference, other institutions will find their way to them.

This posting is the last of a five-part series.

May 05 2011

Collaborative genetics, part 4: Private practice, how to respect the patient

Previous installment: Dividing the pie, from research to patents

The fear of revealing patient data pervades the medical field, from the Hippocratic Oath to the signs posted all over hospitals reminding staff not to discuss patients in the hallways and elevators. HIPAA's privacy provisions are the parts of that law most routinely cited, and many hospitals overreach their legal mandates, making it even harder than the law requires to get data. Whereas Americans have gotten used to the wanton collection of data in other spheres of life, health care persists in its idyllic island of innocence (and we react with outrage whenever this innocence proves illusory).

In my break-out session about terms of service, a lot of the talk revolved around privacy. The attendees respectfully acknowledged its importance on one level and grumbled on another about how the laws got in the way of good research. Their attitudes struck me as inconsistent and lacking resolve, but overall I felt that these leaders in the field of health lacked an appreciation for the sacredness of privacy as part of the trust a patient has in her doctor and the health care system.

Even Peter Kapitein, in his keynote, railed that concerns for privacy were just excuses used by institutions to withhold information they didn't want the public to know. This is often true, but I felt he went too far when he said: "No patient will say, please don't use my data if it will help me or help someone else in my position." This is not what surveys show, such as Dr. Alan Westin's 2008 report to the FTC. When I spoke to Kapitein afterward, he acknowledged that he had exaggerated his point for the sake of rhetoric, and that he recognized the importance of privacy in many situations. Still, I fear that his strong statement might have a bad effect on his audience.

We all know that de-identified data is vulnerable to re-identification and that many patients have good reason to fear what would happen if certain people got word of their conditions. It's widely acknowledged that many patients withhold information from their own doctors out of embarrassment. They still need to have a choice when researchers ask for data too. Distrust of medical research is common among racial minorities, still angry at the notorious Tuskegee syphilis study and recently irked again by researchers' callous attitude toward the family of Henrietta Lacks.

Wilbanks recommends that the terms of service for the commons prohibit unethical uses of data, and specifically the combination of data from different sources for re-identification of patients.

It's ironic that one vulnerability might be forced on the Sage commons by patients themselves. Many patients offer their data to researchers with the stipulation that the patients can hear back about any information the researchers find out about them; this is called a "patient grant-back."

Grant-backs introduce significant ethical concerns, aside from privacy, because researchers could well find that the patient has a genetic makeup strongly disposing him to a condition for which there's no treatment, such as Huntington's Disease. Researchers may also find out things that sound scary and require professional interpretation to put into context. One doctor I talked to said the researcher should communicate any findings to the patient's doctor, not the patient himself. But that would be even harder to arrange.

In terms of privacy, requiring a researcher to contact the patient introduces a new threat of attack and places a huge administrative burden on the researchers, as well as any repository such as the commons. It means that the de-identified data must be linked in a database to contact information for the patient. Even if careful measures are taken to separate the two databases, an intruder has a much better chance of getting the data than if the patient left no such trace. Patients should be told that this is a really bad deal for them.

This posting is one of a five-part series. Final installment: Next steps for genetic commons

May 04 2011

Collaborative genetics, part 3: Dividing the pie, from research to patents

Previous installment: Five Easy Pieces, Sage's Federation

What motivates scientists, companies, and funders to develop bold new treatments? Of course, everybody enters the field out of a passion to save humanity. But back to our question--what motivates scientists, companies, and funders to develop bold new treatments?

The explanations vary not only for different parts of the industry (academics, pharma companies, biotech firms, government agencies such as NIH, foundations, patient advocacy groups) but for institutions at different levels in their field. And of course, individual scientists differ, some seeking only advancement in their departments and the rewards of publishing, whereas others jump whole-heartedly into the money pipeline.

The most illustrious and successful projects in open source and open culture have found ways to attract both funds and content. Sage, the Federation, and Arch2POCM will have to find their own sources of ammunition.

The fuse for this reaction may have to begin with funders in government and the major foundations. Currently they treat only a published paper as fulfillment of a contract. The journals also have to be brought into the game. All the other elements of the data chain that precede and follow publication need to get their due.

Sage is creating a format for citing data sets, which can be used in the same ways researchers cite papers. The researcher Eric Schadt also announced a new journal, Open Network Biology, committed to publishing research papers along with the network models used, the software behind the results, and the underlying data.

Although researchers are competitive, they also recognize the importance of sharing information, so they are persuadable. If a researcher believes someone else may validate and help to improve her data, she has a strong incentive to make it open. Releasing her data can also raise her visibility in the field, independent of publications.

Arch2POCM has even more ticklish tasks. On the one hand, they want to direct researchers toward work that has a good chance of producing a treatment--and even this goal is muddled by the desire to encourage more risk-taking and a willingness to look beyond familiar genes. (Edwards ran statistics on studies in journals and found them severely clustered among a few genes that had already been thoroughly explored, mostly ignoring the other 80% or 90% of the human genome even though it is known to have characteristics of interest in the health field. His highly skewed graph drew a strong response of concern from the audience.)

According to Teri Melese, Director of Research Technologies and Alliances for UCSF School of Medicine, pharma companies already have programs aiming to promote research that uses their compounds by offering independent researchers the opportunity to submit their ideas for studies. But the company has to approve each project, and although the researcher can publish results, the data used in the experiment remains tightly under the control of the researcher or the company. This kind of program shows that nearly infinite degrees of compromise lie between totally closed research systems and a completely open commons--but to get the benefits of openness, companies and researchers will need to relinquish a lot more control than they have been willing to up till now.

The prospect of better research should attract funders, and Arch2POCM targets pharma companies in particular to pony up millions for research. The companies have a precedent for sharing data under the rubric of "pre-competitive research." According to Sage staffer Lara Mangravite, eight pharma companies have donated some research data to Sage.

The big trick is to leave as much of the results in the commons as possible while finding the right point in the development process where companies can extract compounds or other information and commercialize them as drugs. Sage would like to keep the core genetic information free from patents, but is willing to let commercial results be patented. Stephen Friend told me, "It is important to maintain a freedom of interoperability for the data and the metadata found within the hosted models of disease. Individuals and companies can still reduce to practice some of the proposed functions for proteins and file patents on these findings without contaminating the freedom of the information hosted on the commons platform."

The following diagram tries to show, in a necessarily over-simplified form, the elements that go into research and are produced by research. Naturally, results of some research are fed back in circular form to further research. Each input is represented as a circle, and is accompanied by a list of stake-holders who can assert ownership over it. Medicines, patents, and biological markers are listed at the bottom as obvious outputs that are not considered as part of the research inputs.

[Diagram of research inputs and outputs]

Inputs and outputs of research

The relationships ultimately worked out between the Sage commons and the pharma companies--which could be different for different classes of disease--will be crucial. The risk of being too strict is to drive away funds, while the risk of being too accommodating is to watch the commons collapse into just another consortium that divvies up rewards among participants without giving the rest of the world open access.

What about making researchers who use the commons return the results of their research to the commons? This is another ticklish issue that was endlessly discussed in (and long after) a break-out session I attended. The Sage commons was compared repeatedly to software, with references to well-known licenses such as the GNU GPL, but analogies were ultimately unhelpful. A genetic commons represents a unique relationship among data related to particular patients, information of use in various stages of research, and commercially valuable products (to the tune of billions of dollars).

Sage and its supporters naturally want to draw as much research into the commons as they can. It would be easy to draw up reciprocal terms of service along the lines of "If you use our data, give back your research results"--easy to draw up but hard to enforce. John Wilbanks, a staff person at Creative Commons who has worked heavily with Sage, said such a stipulation would be a suggestion rather than a requirement. If someone uses data without returning the resulting research, members of the community around the commons could express their disapproval by not citing the research.

But all this bothers me. Every open system recognizes that it has to co-exist with a proprietary outer world, and provide resources to that world. Even the GNU GPL doesn't require you to make your application free software if you compile it with the GNU compiler or run it on Linux. Furthermore, the legal basis for strict reciprocity is dubious, because data is not protected by copyright or any other intellectual property regime. (See my article on collections of information.)

I think that outsiders attending the Congress lacked a fundamental appreciation of the purpose of an information commons. One has to think long-term--not "what am I going to get from this particular contribution?" but "what might I be able to do in 10 years that I can't do today once a huge collection of tools and data is in the public domain?" Nobody knows, when they put something into the commons, what the benefits will be. The whole idea is that people will pop up out of nowhere and use the information in ways that the original donors could not imagine. That was the case for Linux and is starting to come true for Android. It's the driving motivation behind the open government movement. You have to have a certain faith to create a commons.

At regular points during the Congress, attendees pointed out that no legitimate motivation exists in health care unless it is aimed ultimately toward improving life for the patients. The medical field refers to the experiences of patients--the ways they react to drugs and other treatments--as "post-market effects," which gives you a hint where the field's priorities currently lie.

Patients were placed front and center in the opening keynote by Peter Kapitein, a middle-aged Dutch banker who suffered a series of wrenching (but successful) treatments for lymphoma. I'll focus in on one of his statements later.

It was even more impressive to hear a central concern for the patient's experience expressed by Vicki Seyfert-Margolis, Senior Science Advisor to the FDA's Chief Scientist. Speaking for herself, not as a representative of the FDA, she chastised industry, academia, and government alike for not moving fast enough to collaborate and crowdsource their work, then suggested that in the end the patients will force change upon us all. While suggesting that the long approval times for drugs and (especially) medical devices lie in factors outside the FDA's control, she also said the FDA is taking complaints about its process seriously and has launched a holistic review of the situation under its current director.

[Photo of Vicki Seyfert-Margolis]

Vicki Seyfert-Margolis keynoting at Sage Commons Congress

The real problem is not slow approval, but a lack of drug candidates. Submissions by pharmaceutical companies have been declining over the years. The problem returns to the old-fashioned ways the industry works: walled off into individual labs that repeat each other's work over and over and can't learn from each other.

One of the Congress's break-out groups was tasked with outreach and building bonds with the public. Not only could Sage benefit if the public understood its mission and accomplishments (one reason I'm writing this blog) but patients are key sources for the information needed in the commons.

There was some discussion about whether Sage should take on the area served by PatientsLikeMe and DIYgenomics, accepting individual donations of information. I'm also a bit dubious about how far a Facebook page will reach. The connection between patient input and useful genetic information is attenuated and greatly segmented. It may be better to partner with the organizations that interact directly with individuals among the public.

It's more promising to form relationships with patient advocacy groups, as a staff person from the Genetic Alliance pointed out. Advocacy groups can find patients for drug trials (many trials are canceled for lack of appropriate patients) and turn over genetic and phenotypic information collected from those patients. (A "phenotype" basically refers to everything interesting about you that expresses your state of health. It could include aspects of your body and mental state, vital statistics, a history of diseases and syndromes you've had, and more.)

This posting is one of a five-part series. Next installment: Private practice, how to respect the patient

May 03 2011

Collaborative genetics, part 2: Five Easy Pieces, Sage's Federation

Previous installment: The ambitious goals of Sage Commons Congress

A pilot project was launched by Sage with four university partners under the moniker of the Federation, which sounds like something out of a spy thriller (or Star Trek, which was honored in stock photos in the PowerPoint presentation about this topic). Hopefully, the only thrill will be that expressed by the participants. Three of them presented the results of their research into aspects of aging and disease at the Congress.

The ultimate goal of the Federation is to bring together labs from different places to act like a single lab. Its current operation is more modest. As a pilot, the Federation received no independent funding. Each institution agreed simply to allow its researchers to collaborate with the other four institutions. Atul Butte of Stanford reported that the lack of explicit funding probably limited collaboration. In particular, the senior staff on each project communicated very little because their attention was always drawn to other tasks demanded by their institutions, such as writing grant proposals. Junior faculty did most of the actual collaboration.

As one audience member pointed out, "passion doesn't scale." But funding will change the Federation as well, skewing it toward whatever gets rewarded.

[Photo of audience and podium]

Audience and podium at Sage Commons Congress

When the Federation grows, the question it faces is whether to incorporate more institutions in a single entity under Sage's benign tutelage or to spawn new Federations that co-exist in their own loose federation. But the issue of motivations and rewards has to be tackled.

Another organization launched by Sage is Arch2POCM, whose name requires a bit of elucidation. The "Arch" refers to "archipelago" as a metaphor for a loose association among collaborating organizations. POCM stands for Proof of Clinical Mechanism. The founders of Arch2POCM believe that if the trials leading to POCM (or the more familiar proof of concept, POC) were done by public/private partnerships free of intellectual property rights, companies could benefit from reduced redundancy while still finding plenty of opportunities to file patents on their proprietary variants.

Arch2POCM, which held its own summit with a range of stakeholders in conjunction with the larger Congress, seeks to establish shared, patent-free research on a firm financial basis, putting organizational processes in place to reward scientists for producing data and research that go into the commons. Arch2POCM's reach is ambitious: to find new biological markers (substances or techniques for tracking what happens in genes), and even the compounds (core components of effective drugs) that treat diseases.

The pay-off for a successful Arch2POCM project is enticing. Not only could drugs be developed much more cheaply and quickly, but we might learn more about the precise ways they affect patients so that we can save patients from taking drugs that are ineffective in their individual cases, and eliminate adverse effects. To get there, incentives once again come to the fore. A related platform called Synapse hosts the data models, providing a place to pick targets and host the clinical data produced by the open-access clinical trials.

This posting is one of a five-part series. Next installment: Dividing the pie, from research to patents

May 02 2011

Collaborative genetics, Part 1: The ambitious goals of Sage Commons Congress

In a field rife with drug-addicted industries that derive billions of dollars from a single product, and stocked with researchers who scramble for government grants (sadly cut back by the recent US federal budget), the open sharing of genetic data and tools may seem a dream. But it must be more than a dream when the Sage Commons Congress can draw 150 attendees (turning away many more) from research institutions such as the Netherlands Bioinformatica Centre and Massachusetts General Hospital, leading universities from the US and Europe, a whole roster of drug companies (Pfizer, Merck, Novartis, Lilly, Genentech), tech companies such as Microsoft and Amazon.com, foundations such as Alfred P. Sloan, and representatives from the FDA and the White House. I felt distinctly ill at ease trying to fit into such a well-educated crowd, but was welcomed warmly and soon found myself using words such as "phenotype" and "patient stratification."

Money is not the only complicating factor when trying to share knowledge about our genes and their effect on our health. The complex relationships of information generation, and how credit is handed out for that information, make biomedical data a case study all its own.

The complexity of health research data

I listened a couple weeks ago as researchers at this congress, held by Sage Bionetworks, questioned some of their basic practices, and I realized that they are on the leading edge of redefining what we consider information. For most of the history of science, information consisted of a published paper, and the scientist tucked his raw data in a moldy desk drawer. Now we are seeing a trend in scientific journals toward requiring authors to release the raw data with the paper (one such repository in biology is Dryad). But this is only the beginning. Consider what remains to be done:

  • It takes 18 to 24 months to get a paper published. The journal and author usually don't want to release the data until the date of publication, and some add an arbitrary waiting period after publication. That's an extra 18 to 24 months (a whole era in some fields) during which that data is withheld from researchers who could have built new discoveries on it.

  • Data must be curated, which includes:

      • Being checked for corrupt data and missing fields (experimental artifacts)

      • Normalization

      • Verifying HIPAA compliance and other assurances that data has been properly de-identified

      • Possible formatting according to some standard

      • Reviewing for internal and external validity

    Advocates of sharing hope this work can be crowdsourced to other researchers who want to use the data. But then who gets credited and rewarded for the work?

  • Negative results--experiments showing that a treatment doesn't work--are extremely important, and the data behind them is even more important. Knowing where others failed could boost the efforts of other researchers and companies. Furthermore, this data may help accomplish patient stratification--that is, show when some patients will benefit and some will not, even when their symptoms seem the same. The medical field is notorious for suppressing negative results, and the data rarely reaches researchers who can use it.

  • When researchers choose to release data--or are forced to do so by their publishers--it can be in an atrocious state because it missed out on the curation steps just mentioned. The data may also be in a format that makes it hard to extract useful information, either because no one has developed and promulgated an appropriate format, or because the researcher didn't have time to adopt it. Other researchers may not even be able to determine exactly what the format is. Sage is working on very simple text-based formats that provide a lowest common denominator to help researchers get started (a hypothetical example appears after this list).

  • Workflows and practices in the workplace have a big effect on the values taken by the data. These are very hard to document, but can help a great deal in reproducing and validating results. Geneticists are starting to use a workflow documentation tool called Taverna to record the ways they coordinate different software tools and data sets.

  • Data can be interpreted in multiple ways. Different teams look for different criteria and apply different standards of quality. It would be useful to share these variations.

  • A repeated theme at the Congress was "going beyond the narrative." The narrative here is the published article. Each article tells a story and draws conclusions. But a lot goes on behind the scenes in the art and science of medicine. Furthermore, letting new hypotheses emerge from data is just as important as verifying the narrative provided by one's initial hypothesis.
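
To picture what such a lowest-common-denominator, text-based format might look like -- the column names and layout here are my own guess, not Sage's actual specification -- consider a tab-separated expression matrix with a single header row, which virtually any tool can parse:

    import csv

    # Hypothetical layout: one row per gene, one column per sample.
    #
    #   gene_id  sample_1  sample_2  sample_3
    #   BRCA1    8.2       7.9       9.1
    #   HER2     12.4      11.8      13.0

    with open("expression.tsv") as f:  # hypothetical file name
        rows = list(csv.DictReader(f, delimiter="\t"))
        # each row is a dict, e.g. {"gene_id": "BRCA1", "sample_1": "8.2", ...}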

One of the big questions raised in my mind--and not covered in the conference--was the effect it would have on the education of the next generation of scientists were teams to expose all those hidden aspects of data: the workflows, the curation and validation techniques, the interpretations. Perhaps you wouldn't need to attend the University of California at Berkeley to get a Berkeley education, or risk so many parking tickets along the way. Certainly, young researchers would have powerful resources for developing their craft, just as programmers have with the source code for free software.

I've just gone over a bit of the material that the organizers of the Sage Commons Congress want their field to share. Let's turn to some of the structures and mechanisms.

Of networks

Take a step back. Why do geneticists need to share data? There are oodles of precedents, of course: the Human Genome Project, biobricks, the Astrophysics Data System (shown off in a keynote by Alyssa A. Goodman from Harvard), open courseware, open access journals, and countless individual repositories put up by scientists. A particularly relevant data sharing initiative is the International HapMap Project, working on a public map of the human genome "which will describe the common patterns of human DNA sequence variation." This is not a loose crowdsourcing project, but more like a consortium of ten large research centers promising to release results publicly and forgo patents on the results.

The field of genetics presents specific challenges that frustrate old ways of working as individuals in labs that hoard data. Basically, networks of genetic expression require networks of researchers to untangle them.

In the beginning, geneticists modeled activities in the cell through linear paths. A particular protein would activate or inhibit a particular gene that would then trigger other activities with ultimate effects on the human body.

They found that relatively few activities could be explained linearly, though. The action of a protein might be stymied by the presence of others. And those other actors have histories of their own, with different pathways triggering or inhibiting pathways at many points. Stephen Friend, President of Sage Bionetworks, offers the example of an important gene implicated in breast cancer, the Human Epidermal growth factor Receptor 2, HER2/neu. The drugs that target this protein are weakened when another protein, Akt, is present.

Trying to map these behaviors, scientists come up with meshes of paths. The field depends now on these network models. And one of its key goals is to evaluate these network models--not as true or false, right or wrong, because they are simply models that represent the life of the cell about as well as the New York subway map represents the life of the city--but for the models' usefulness in predicting outcomes of treatments.

Network models containing many actors and many paths--that's why collaborations among research projects could contribute to our understanding of genetic expression. But geneticists have no forum for storing and exchanging networks. And nobody records them in the same format, which makes them difficult to build, trade, evaluate, and reuse.
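
As a toy illustration of what a network model captures -- my own sketch, not Sage's representation, with names and effects simplified from the HER2/Akt example above -- a signed graph in Python can record who activates or inhibits whom:

    # Nodes are genes/proteins/drugs; signed edges mark activation (+1)
    # or inhibition (-1). All names and effects here are illustrative.
    network = {
        ("drug_X", "HER2"): -1,   # a drug inhibits HER2
        ("HER2", "growth"): +1,   # HER2 promotes tumor growth
        ("Akt", "drug_X"): -1,    # Akt weakens the drug's action
    }

    def effects_on(target):
        """Map each actor to its direct signed effect on the target."""
        return {src: sign for (src, dst), sign in network.items() if dst == target}

    print(effects_on("drug_X"))   # {'Akt': -1}
    print(effects_on("growth"))   # {'HER2': 1}

Sharing models like this across labs only works if everyone agrees on the representation, which is exactly the formatting gap described above.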

The Human Genome Project is a wonderful resource for scientists, but it contains nothing about gene expression, nothing about the network models and workflows and methods of curation mentioned earlier, nothing about software tools and templates to promote sharing, and ultimately nothing that can lead to treatments. This huge, multi-dimensional area is what the Sage Commons Congress is taking on.

More collaboration, and a better understanding of network models, may save a field that is approaching crisis. The return on investment for pharmaceutical research, according to researcher Aled Edwards, has gone down over the past 20 years. In 2009, American companies spent one hundred billion dollars on research but got only 21 drugs approved (nearly five billion dollars per approved drug), and only 7 of those were truly novel. Meanwhile, 90% of drug trials fail. And to throw in a statistic from another talk (Vicki Seyfert-Margolis from the FDA), drug side effects create medical problems in 7% of patients who take the drugs, and require medical interventions in 3% or more of cases.

This posting is one of a five-part series. Next installment: Five Easy Pieces, Sage's Federation

September 15 2010

Four short links: 15 September 2010

  1. Privacy Commission Uses CC License For Content -- The office of the New Zealand Privacy Commissioner is releasing its content under the CC-BY license, including fact sheets, newsletters, guidance, case studies, howtos, and more.
  2. Magic iPad Light Painting (BERG London) -- continuing their stunning work, this concept video uses a form of long-exposure stop-motion to turn the iPad into visual magic.
  3. Implementing TLS and Raw Sockets Using Only Flash and Javascript -- interesting first steps to implementing non-trivial security in Javascript ("The Language Of The Future (tm)"). (via ivanristic on Twitter)
  4. How to Read a Patent in 60 Seconds (Dan Shapiro) -- quick guide to the important parts of a patent. For more detail, check out the docs from the PatentLens.
