Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

May 11 2012

Visualization of the Week: Avengers Assemble

Marvel's "The Avengers" opened in U.S. theaters last weekend, claiming the largest weekend opening so far this year and setting a new three-day domestic box office record. The film features a superhero team comprised of Iron Man, Thor, The Hulk, Captain America, Black Widow, and Hawkeye. But as fans of the comics know, this isn't the original or the only composition of "The Avengers." When the team first appeared in 1963, it was made up of Iron Man, Ant-Man, Wasp, Thor, and the Hulk. Captain America was discovered frozen in ice in Issue #4. The only real constant of "The Avengers" over the years is its rotating roster of superheroes.

That changing makeup of "The Avengers" is the theme of this week's visualization, created by The New York Times' data artist in residence Jer Thorp. On his blog, Thorp has posted a series of visualizations about these superheroes that uses comicvine.com's API.

Thorp writes:

"My first thought was to use images of the characters in my visualizations, but while the Comic Vine API provides images in all kinds of sizes, the styles of drawing are so varied that it ended up not holding together. Instead, then, I built a small tool that let me go through those characters and pick three colours that I thought represented them the best (everybody gets a shield!)."

Below is a depiction of all "The Avengers" characters, ordered by the frequency in which they appear in the series.

Avengers appearance frequence

And sorting by issue (characters appearing in an issue together form a radial line), here is every appearance of every Avenger team member in every issue.

Avengers 1963-2011

Read the full post to see many more visualizations based on "The Avengers," including the gender ratio of the superhero team and the types of villains they often had to assemble to battle.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

More Visualizations:

Reposted bycheg00 cheg00

May 10 2012

Strata Week: Big data boom and big data gaps

Here are a few of the data stories that caught my attention this week.

Big data booming

The call for speakers for Strata New York has closed, but as Edd Dumbill notes, the number of proposals are a solid indication of the booming interest in big data. The first Strata conference, held in California in 2011, elicited 255 proposals. The following event in New York elicited 230. The most recent Strata, held in California again, had 415 proposals. And the number received for Strata's fall event in New York? That came in at 635.

Edd writes:

"That's some pretty amazing growth. I can thus expect two things from Strata New York. My job in putting the schedule together is going to be hard. And we're going to have the very best content around."

The increased popularity of the Strata conference is just one data point from the week that highlights a big data boom. Here's another: According to a recent report by IDC, the "worldwide ecosystem for Hadoop-MapReduce software is expected to grow at a compound annual rate of 60.2 percent, from $77 million in revenue in 2011 to $812.8 million in 2016."

"Hadoop and MapReduce are taking the software world by storm," says IDC's Carl Olofson. Or as GigaOm's Derrick Harris puts it: "All aboard the Hadoop money train."

A big data gap?

Another report released this week reins in some of the exuberance about big data. This report comes from the government IT network MeriTalk, and it points to a "big data gap" in the government — that is, a gap between the promise and the capabilities of the federal government to make use of big data. That's interesting, no doubt, in terms of the Obama administration's recent $200 million commitment to a federal agency big data initiative.

Among the MeriTalk report's findings: 60% of government IT professionals say their agency is analyzing the data it collects and less than half (40%) are using data to make strategic decisions. Those responding to the survey said they felt as though it would take, on average, three years before their agencies were ready to fully take advantage of big data.

Prismatic and data-mining the news

The largest-ever healthcare fraud scheme was uncovered this past week. Arrests were made in seven cities — some 107 doctors, nurses and social workers were charged, with fraudulent Medicare claims totaling about $452 million. The discoveries about the fraudulent behavior were made thanks in part to data-mining — looking for anomalies in the Medicare filings made by various health care providers.

Prismatic penned a post in which it makes the case for more open data so that there's "less friction" in accessing the sort of information that led to this sting operation.

"Both the recent sting and the Prime case show that you need real journalists and investigators working with technology and data to achieve good results. The challenge now is to scale this recipe and force transparency on a larger scale.

"We need to get more technically sophisticated and start analysing the data sets up front to discover the right questions to ask, not just the answer the questions we already know to ask based on up-front human investigation. If we have to discover each fraud ring or singleton abuse as a one-off case, we'll never be able to wipe out fraud on a large enough scale to matter."

Indeed, despite this being the largest bust ever, it's really just a fraction of the estimated $20 to $100 billion a year in Medicare fraud.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:

May 07 2012

A brief history of data journalism

The following is an excerpt from "The Data Journalism Handbook," a collection of essays and resources covering the growing field of data journalism.


Data journalism imagesIn August 2010 some colleagues and I organised what we believe was one of the first international "data journalism" conferences, which took place in Amsterdam. At this time there wasn't a great deal of discussion around this topic and there were only a couple of organizations that were widely known for their work in this area.

The way that media organizations like Guardian and the New York Times handled the large amounts of data released by Wikileaks is one of the major steps that brought the term into prominence. Around that time the term started to enter into more widespread usage, alongside "computer-assisted reporting," to describe how journalists were using data to improve their coverage and to augment in-depth investigations into a given topic.

Speaking to experienced data journalists and journalism scholars on Twitter it seems that one of the earliest formulations of what we now recognise as data journalism was in 2006 by Adrian Holovaty, founder of EveryBlock — an information service which enables users to find out what has been happening in their area, on their block. In his short essay "A fundamental way newspaper sites need to change," he argues that journalists should publish structured, machine-readable data, alongside the traditional "big blob of text":

"For example, say a newspaper has written a story about a local fire. Being able to read that story on a cell phone is fine and dandy. Hooray, technology! But what I really want to be able to do is explore the raw facts of that story, one by one, with layers of attribution, and an infrastructure for comparing the details of the fire — date, time, place, victims, fire station number, distance from fire department, names and years experience of firemen on the scene, time it took for firemen to arrive — with the details of previous fires. And subsequent fires, whenever they happen."

But what makes this distinctive from other forms of journalism which use databases or computers? How — and to what extent — is data journalism different from other forms of journalism from the past?

"Computer-Assisted Reporting" and "Precision Journalism"

Using data to improve reportage and delivering structured (if not machine readable) information to the public has a long history. Perhaps most immediately relevant to what we now call data journalism is "computer-assisted reporting" or "CAR," which was the first organised, systematic approach to using computers to collect and analyze data to improve the news.

CAR was first used in 1952 by CBS to predict the result of the presidential election. Since the 1960s, (mainly investigative, mainly U.S.-based) journalists, have sought to independently monitor power by analyzing databases of public records with scientific methods. Also known as "public service journalism," advocates of these computer-assisted techniques have sought to reveal trends, debunk popular knowledge and reveal injustices perpetrated by public authorities and private corporations. For example, Philip Meyer tried to debunk received readings of the 1967 riots in Detroit — to show that it was not just less-educated Southerners who were participating. Bill Dedman's "The Color of Money" stories in the 1980s revealed systemic racial bias in lending policies of major financial institutions. In his "What Went Wrong," Steve Doig sought to analyze the damage patterns from Hurricane Andrew in the early 1990s, to understand the effect of flawed urban development policies and practices. Data-driven reporting has brought valuable public service, and has won journalists famous prizes.

In the early 1970s the term "precision journalism" was coined to describe this type of news-gathering: "the application of social and behavioral science research methods to the practice of journalism." Precision journalism was envisioned to be practiced in mainstream media institutions by professionals trained in journalism and social sciences. It was born in response to "new journalism," a form of journalism in which fiction techniques were applied to reporting. Meyer suggests that scientific techniques of data collection and analysis rather than literary techniques are what is needed for journalism to accomplish its search for objectivity and truth.

Precision journalism can be understood as a reaction to some of journalism's commonly cited inadequacies and weaknesses: dependence on press releases (later described as "churnalism"), bias towards authoritative sources, and so on. These are seen by Meyer as stemming from a lack of application of information science techniques and scientific methods such as polls and public records. As practiced in the 1960s, precision journalism was used to represent marginal groups and their stories. According to Meyer:

"Precision journalism was a way to expand the tool kit of the reporter to make topics that were previously inaccessible, or only crudely accessible, subject to journalistic scrutiny. It was especially useful in giving a hearing to minority and dissident groups that were struggling for representation."

An influential article published in the 1980s about the relationship between journalism and social science echoes current discourse around data journalism. The authors, two U.S. journalism professors, suggest that in the 1970s and 1980s the public's understanding of what news is broadens from a narrower conception of "news events" to "situational reporting," or reporting on social trends. By using databases of — for example — census data or survey data, journalists are able to "move beyond the reporting of specific, isolated events to providing a context which gives them meaning."

As we might expect, the practise of using data to improve reportage goes back as far as "data" has been around. As Simon Rogers points out, the first example of data journalism at the Guardian dates from 1821. It is a leaked table of schools in Manchester listing the number of students who attended it and the costs per school. According to Rogers this helped to show for the first time the real number of students receiving free education, which was much higher than what official numbers showed.

Data Journalism in the Guardian in 1821
Data Journalism in the Guardian in 1821 (The Guardian)

Another early example in Europe is Florence Nightingale and her key report, "Mortality of the British Army," published in 1858. In her report to the parliament she used graphics to advocate improvements in health services for the British army. The most famous is her "coxcomb," a spiral of sections, each representing deaths per month, which highlighted that the vast majority of deaths were from preventable diseases rather than bullets.

Mortality of the British Army by Florence Nightingale
Mortality of the British Army by Florence Nightingale (Image from Wikipedia)

Data journalism and Computer-Assisted Reporting

At the moment there is a "continuity and change" debate going on around the label "data journalism" and its relationship with these previous journalistic practices which employ computational techniques to analyze datasets.

Some argue that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense data journalism pays as much — and sometimes more — attention to the data itself, rather than using data simply as a means to find or enhance stories. Hence we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.

Another difference is that in the past investigative reporters would suffer from a poverty of information relating to a question they were trying to answer or an issue that they were trying to address. While this is of course still the case, there is also an overwhelming abundance of information that journalists don't necessarily know what to do with. They don't know how to get value out of data. A recent example is the Combined Online Information System, the U.K.'s biggest database of spending information — which was long sought after by transparency advocates, but which baffled and stumped many journalists upon its release. As Philip Meyer recently wrote to me: "When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important."

On the other hand, some argue that there is no meaningful difference between data journalism and computer-assisted reporting. It is by now common sense that even the most recent media practices have histories, as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition, but responding to new circumstances and conditions. Even if there might not be a difference in goals and techniques, the emergence of the label "data journalism" at the beginning of the century indicates a new phase wherein the sheer volume of data that is freely available online combined with sophisticated user-centric tools, self-publishing and crowdsourcing tools enables more people to work with more data more easily than ever before.

Data journalism is about mass data literacy

Digital technologies and the web are fundamentally changing the way information is published. Data journalism is one part in the ecosystem of tools and practices that have sprung up around data sites and services. Quoting and sharing source materials is in the nature of the hyperlink structure of the web and the way we are accustomed to navigate information today. Going further back, the principle that sits at the foundation of the hyperlinked structure of the web is the citation principle used in academic works. Quoting and sharing the source materials and the data behind the story is one of the basic ways in which data journalism can improve journalism, what Wikileaks founder Julian Assange calls "scientific journalism."

By enabling anyone to drill down into data sources and find information that is relevant to them, as well as to verify assertions and challenge commonly received assumptions, data journalism effectively represents the mass democratisation of resources, tools, techniques and methodologies that were previously used by specialists — whether investigative reporters, social scientists, statisticians, analysts or other experts. While currently quoting and linking to data sources is particular to data journalism, we are moving towards a world in which data is seamlessly integrated into the fabric of media. Data journalists have an important role in helping to lower the barriers to understanding and interrogating data, and increasing the data literacy of their readers on a mass scale.

At the moment the nascent community of people who called themselves data journalists is largely distinct from the more mature CAR community. Hopefully in the future we will see stronger ties between these two communities, in much the same way that we see new NGOs and citizen media organizations like ProPublica and the Bureau of Investigative Journalism work hand in hand with traditional news media on investigations. While the data journalism community might have more innovative ways of delivering data and presenting stories, the deeply analytical and critical approach of the CAR community is something that data journalism could certainly learn from.

This excerpt was lightly edited. Links were added for EveryBlock, the Guardian Datablog, Texas Tribune datasets, the Combined Online Information System, and Julian Assange's reference to "scientific journalism."

The Data Journalism Handbook (Early Release) — This collaborative book aims to answer questions like: Where can I find data? What tools can I use? How can I find stories in data? (The digital Early Release edition includes raw and unedited content. You'll receive updates when significant changes are made, as well as the final ebook version.)

Related:

Reposted bynunatak nunatak

May 06 2012

Principles of patient access in Directed Exchange

The Health Insurance Portability and Accountability Act (HIPPA) is good law. HIPPA formalized principles of patient privacy that should have been codified industry norms for more than 50 years (better late than never). HIPPA provided the right to patients in the U.S. to get access to their own healthcare records. The law struck reasonable balances on hundreds of complicated issues in order to achieve these goals. The law solved more problems, by far, than it created. Which is as close to the definition of good government as I can imagine. Patients are better off after HIPPA than before.

Sadly, the "letter of the law" in HIPPA is frequently either ignored or worse, fully embraced, in order to make patient access to their own healthcare data more cumbersome. This is evidenced nowhere better than Regina Holiday's experience with access to her husband's medical records. To make a long story short, she was able to acquire an unpublished manuscript of a Stephen King novel, sooner and for less money than she was able to get her husband's medical records.

Principle zero: Some clinicians will do anything they can to make patient access to their health records impossible or cumbersome.

Regina's work, detailing her experience with her husband is titled 73 cents, because that's how much it cost to get one page of her husband's medical record. HIPPA allows hospitals and clinicians to charge a "reasonable" copying fee for access to patient records. The problem with that is that in the digital age, a single healthcare record print out looks like this:

A single EHR record, printed out
A partial printout of a patient's medical record.



This is what happens when you print out a digital health record. Having patients pay the copying costs for access to medical records makes a simple presumption: there are only a few pages there. Obviously no patient will be able to afford copying costs in the age of all-digital records.

Principle one: Patient access to their own healthcare records must be digital once the record is digital.

Once you concede that access to the patient's medical record must be digital, we can discuss the push vs. pull question. When someone else on the Internet has data that is important to you, you can generally find ways to have it "pushed" to you or you can choose to "pull" it. The simplest example is the weather. You can always check the weather easily online by visiting a website (by pulling). But you can also have software text you when it is going to rain (by pushing).

There are advantages of both push and pull approaches for patient access to data. People who are excited about the pull model tend to focus on the benefits of the "portal" requirements in Meaningful Use, and those that favor the push model are excited about directed exchange. Without getting into the debate, I can posit that there are some cases where push access to patient data is critical. Without supporting patient participation in directed exchange we regulate patients to second-class citizens with regard to healthcare exchange. That is unacceptable. Patients should be first-class citizens in healthcare exchange.

Principle two: Patients should be able to participate in health information exchange as first-class citizens.

The Office of the National Coordinator for Health Information Technology (ONC) should be applauded for requiring directed exchange with patients in the current proposed rule. I hope that ONC does not back off of this new requirement.

The current proposed rule making, however, is silent on a critical issue for directed health information exchange. How do we ensure that providers will not refuse to communicate with patients over directed exchange because of bogus "security concerns"? As we see with the copying costs under HIPPA, every potential barrier to a patient's access to data will be used against patients.

There are already rumors of cases in the pilots of directed exchange where organizations are using the trust architecture of the Direct Project to refuse to communicate with certain parties. While that might be reasonable between institutions (do you really think Planned Parenthood will ever automate communication with Catholic charity clinics or vice-versa?), it is absolutely critical that this not hamper patient-clinician communication.

When we first designed the Direct Project Trust model, we presumed that patient-clinicians communication would take place based on "business-card" identity verification. That meant that when a patient provided a clinician with a public key (no matter how they did that) the clinician would trust it because the patient provided the public key. We did this because we knew that if clinicians could reject a patient's public key based on "security concerns," they would do so. Either the clinicians (or more likely the vendors that they hired) would choose directed exchange "partners" that were "approved" and "secure," ensuring that the patient's experience of directed exchange was merely a more extensive menu of patient portal options. Patient data is very valuable and controlling the flow of patient data is central to more business plans than I care to count.

In order for patients to be first-class citizens in health information exchange, they should have the right to send their records, in an automated fashion, anywhere they want. Even if it meant sending it to a service that the patient was enthusiastic about, but the clinician disapproved of (i.e. qpid.me). In the world of secure email enabled by public-key infrastructure (PKI), that translates to clinicians must accept any public key/direct address presented by a patient in a reasonable manner. This acceptance must be unconditional, but should probably mean limiting the acceptance of that key to communication with just that patient. Anything less than this means that the patient is a second-class citizen with regards to the information exchange of their own data.

Conclusion: ONC should require that clinicians communicate with a patient's chosen directed exchange provider, which means accepting any public key presented by a patient in a reasonable manner.

The community at Direct Trust is working hard to agree on what "reasonable manner" should mean, exactly. Here is my latest proposal on the subject, and here are similar ideas from Dr. David Kibbe. Eventually the Direct Trust community will knock out a firm understanding on the specific ways that might be "reasonable" for a patient to provide a certificate. But we are certainly agreed that without firm requirements on certificate acceptance, this issue will be used by clinicians to limit where patients can send their own data.

As the U.S. federal government is preparing to pay healthcare providers to adopt electronic health records (EHR) they will insist that those doctors/hospitals/etc. show that they are using the new software in clinically meaningful ways. On Monday (May 7, 2012) they will be accepting comments on the second stage of the requirements that clinicians must meet in order to receive compensation. These requirements are usually short-handed as "meaningful use."

I will be submitting this blog post as my comments to that process. Others will be submitting comments that directly contradict the principles and conclusions I write here. Most notably the American Hospital Association (AHA) has argued that the requirements for patient portals and for providing patients with access to their digital record should be entirely removed from the meaningful use standards (PDF). Specifically:

"Our members are particularly concerned with the proposed objective to provide patients with the ability to view, download and transmit large volumes of protected health information via the Internet (a "patient portal"). The AHA believes that this objective is not feasible as proposed, raises significant security issues, and goes well beyond current technical capacity. We also believe that CMS should not include this objective because the Office of Civil Rights, and not CMS, regulates how health care providers and other covered entities fulfill their obligations under the Health Insurance Portability and Accountability Act (HIPAA), including the obligation to give patients access to their health records."

This is fairly ironic, since the report also says:

"To date, OCR has received comments on its own significantly flawed original proposal to implement this section of HITECH, but has yet to finalize the standard."

Apparently, AHA is not satisfied with any government agency's interpretation of giving electronic access to patient data. The AHA would prefer that patients continue to wait the same amount of time for access to their digital records that they do for their paper records. Specifically:

"Further, 30 days are necessary to make determinations about how to respond to a request no matter the format of the protected health information. While providing an electronic copy of protected health information maintained in an EHR eventually may be facilitated more easily by technology, the process of determining which records are relevant and appropriate takes the same amount of time as it does for evaluating paper records."

Of course, this is entirely false. Indeed, HIPPA does maintain that certain parts of healthcare records (i.e. a psychiatrist's notes) and disclosures (i.e. when the FBI asks for records) are not subject to patient access. An EHR should be capable of understanding which parts of an EHR record are subject to HIPPA and which are not. If the EHR system can understand this distinction, then responses to HIPPA requests can be made in near-real-time. If the EHR system cannot make the distinction between which portions of the record to automatically provide to honor a HIPPA patient access request, then having 30 days is not going to be enough. Can you imagine a nurse reading through the entire stack of papers above to ensure that a certain mental health diagnosis is redacted?

One of the most critical features of patient participation in directed exchange is the patient's capacity to prevent the spread of bad information as it is happening. Apparently, the AHA believes that patients should tolerate the spread of mis-information in their health records to other institutions for a month before correcting it. This of course works in every situation where patients can wait a whole month to get correct information to other hospitals and clinicians.

I would like to be the first to welcome the American Hospital Association to the digital age. (Okay, maybe the second.) From a technology perspective, there is nothing at all that would prevent patients from receiving copies of their updated digital health records seconds after it is "signed" by their clinicians. Inside those seconds is plenty of time to digitally determine whether sharing with the patient is appropriate, legal and safe. Seconds after a patient like me receives data, I intend to process it in an automated fashion. It is not unreasonable, in this new digital world, for me to get a text message that a doctor has ordered a medication that I am allergic to. I wish to get that message after the doctor has ordered the medication, but before I receive it in my IV.

In this new digital world, 36 hours is unreasonable. It means that humans continue to be involved in tasks that can be performed perfectly by a computer without errors. Even 36 hours means that doctors, nurses and hospital administrators are still "thinking in paper." Thirty-six hours means that you still do not view me, the patient, as an equal data partner. It means that I am blind to the data in your hospital at the only time it really matters, which is right now. Health data that is 36-hours old can only be analyzed as a post-mortem and data that is 30-days old is already rotting. As a patient, 36 hours is a short-term solution. It is an opportunity for you to rethink how information flows in your hospitals. It is an opportunity for you to rethink the notions of "inside" the hospital and "outside" the hospital.

This is not that I do not take your point regarding the reconciliation of the policies from the perspective of HIPPA and meaningful use. Two time-lines for compliance is difficult. But the reconciliation is to speed HIPPA up, not slow meaningful use down. The notion that you will give patients a stack of paper like the one above 30 days after it is useful is a bad joke. It was a bad joke 20 years ago, when the technologies already existed to fix the problem, but you decided that the patient's experience was not worth that investment.

There is always something you can do, if you feel as strongly about this as I do.

Meaningful Use and Beyond: A Guide for IT Staff in Health Care — Meaningful Use underlies a major federal incentives program for medical offices and hospitals that pays doctors and clinicians to move to electronic health records (EHR). This book is a rosetta stone for the IT implementer who wants to help organizations harness EHR systems.

Photo: Medical record printout by jodi0327, on Flickr

Related:

May 04 2012

Visualization of the Week: The origins of English

Mike Kinde, writer for the site Ideas Illustrated, has created a project that visualizes the etymology of the English words used in various passages — a sports article, a medical article, a United Nations document, Charles Dickens' "Great Expectations," and Mark Twain's "The Adventures of Tom Sawyer."

"Using Douglas Harper's online dictionary of etymology, I paired up words from various passages I found online with entries in the dictionary. For each word, I pulled out the first listed language of origin and then reconstructed the text with some additional HTML infrastructure. The HTML would allow me to associate each word (or word fragment) with a color, title, and hyperlink to a definition."

Below is the passage from "The Adventures of Tom Sawyer," which Kinde says he selected because he assumed it would have an interesting mixture of American and British English words. He found that while 73% of the word fragments are Old English, Twain included words from more than a dozen different sources just in this excerpt alone:

etymology.jpg
Visit the site to hover over the words in the passage to see their origins and definitions. Kinde's blog post also contains interesting etymological findings from other types of literature.

Kinde notes he's not an etymologist and writes:

"I definitely suggest digging in to the full etymology site to explore the full history of each word. I have probably made plenty of translation mistakes as I developed my paragraphs, but I certainly had fun." [Link added.]

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

More Visualizations:

May 03 2012

Strata Week: Google offers big data analytics

Here are the data stories that caught my attention this week.

BigQuery for everyone

Google BigQueryGoogle has released its big data analytics service BigQuery to the public. Initially made available to a small number of developers late last year, now anyone can sign up for the service. A free account lets you query up to 100 GB of data per month, with the option to pay for additional queries and/or storage.

"Google's aim may be to sell data storage in the cloud, as much as it is to sell analytic software," says The New York Times' Quentin Hardy. "A company using BigQuery has to have data stored in the cloud data system, which costs 12 cents a gigabyte a month, for up to two terabytes, or 2,000 gigabytes. Above that, prices are negotiated with Google. BigQuery analysis costs 3.5 cents a gigabyte of data processed."

The interface for BigQuery is meant to lower the bar for these sorts of analytics — there's a UI and a REST interface. In the Times article, Google project manager Ju-kay Kwek says Google is hoping developers build tools that encourage widespread use of the product by executives and other non-developers.

If folks are looking for something to cut their teeth on with BigQuery, GitHub's public timeline is now a publicly available dataset. The data is being synced regularly, so you can query things like popular languages and popular repos. To that end, GitHub is running a data visualization contest.

The Data Journalism Handbook

The Data Journalism Handbook had its release this week at the 2012 International Journalism Festival in Italy. The book, which is freely available and openly licensed, was a joint effort of the European Journalism Centre and the Open Knowledge Foundation. It's meant to serve as a reference for those interested in the field of data journalism.

In the introduction, "Deutsche Welle's" Mirko Lorenz writes:

"Today, news stories are flowing in as they happen, from multiple sources, eye-witnesses, blogs, and what has happened is filtered through a vast network of social connections, being ranked, commented and more often than not, ignored. This is why data journalism is so important. Gathering, filtering and visualizing what is happening beyond what the eye can see has a growing value."


Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.



Save 20% on registration with the code RADAR20

Open data is a joke?

Tim Slee fired a shot across the bow of the open data movement with a post this week arguing that "the open data movement is a joke." Moreover, it's not a movement at all, he contends. Slee turns a critical eye to the Canadian government's open data efforts in particular, noting that: "The Harper government's actions around 'open government,' and the lack of any significant consequences for those actions, show just how empty the word 'open' has become."

Slee is also critical of open data efforts outside the government, calling the open data movement "a phrase dragged out by media-oriented personalities to cloak a private-sector initiative in the mantle of progressive politics."

Open data activist David Eaves responded strongly to Slee's post with one of his own, recognizing his own frustrations with "one of the most — if not the most — closed and controlling [governments] in Canada's history." But Eaves takes exception with the ways in which Slee characterizes the open data movement. He contends that many of the corporations involved with the open data movement — something Slee charges has corrupted open data — are U.S. corporations (and points out that in Canada, "most companies don't even know what open data is"). Eaves adds, too, that many of these corporations are led by geeks.

Eaves writes:

"Just as an authoritarian regime can run on open-source software, so too might it engage in open data. Open data is not the solution for Open Government (I don't believe there is a single solution, or that Open Government is an achievable state of being — just a goal to pursue consistently), and I don't believe anyone has made the case that it is. I know I haven't. But I do believe open data can help. Like many others, I believe access to government information can lead to better informed public policy debates and hopefully some improved services for citizens (such as access to transit information). I'm not deluded into thinking that open data is going to provide a steady stream of obvious 'gotcha moments' where government malfeasance is discovered, but I am hopeful that government data can arm citizens with information that the government is using to inform its decisions so that they can better challenge, and ultimately help hold accountable, said government."

Got data news?

Feel free to email me.

Related:

May 02 2012

Recombinant Research: Breaking open rewards and incentives

In the previous articles in this series I've looked at problems in current medical research, and at the legal and technical solutions proposed by Sage Bionetworks. Pilot projects have shown encouraging results but to move from a hothouse environment of experimentation to the mainstream of one of the world's most lucrative and tradition-bound industries, Sage Bionetworks must aim for its nucleus: rewards and incentives.

Previous article in the series: Sage Congress plans for patient engagement.

Think about the publication system, that wretchedly inadequate medium for transferring information about experiments. Getting the data on which a study was based is incredibly hard; getting the actual samples or access to patients is usually impossible. Just as boiling vegetables drains most of their nutrients into the water, publishing results of an experiment throws away what is most valuable.

But the publication system has been built into the foundation of employment and funding over the centuries. A massive industry provides distribution of published results to libraries and research institutions around the world, and maintains iron control over access to that network through peer review and editorial discretion. Even more important, funding grants require publication (but the data behind the study only very recently). And of course, advancement in one's field requires publication.

Lawrence Lessig, in his keynote, castigated for-profit journals for restricting access to knowledge in order to puff up profits. A chart in his talk showed skyrocketing prices for for-profit journals in comparison to non-profit journals. Lessig is not out on the radical fringe in this regard; Harvard Library is calling the current pricing situation "untenable" in a move toward open access echoed by many in academia.

Lawrence Lessig keynote at Sage Congress
Lawrence Lessig keynote at Sage Congress.

How do we open up this system that seemed to serve science so well for so long, but is now becoming a drag on it? One approach is to expand the notion of publication. This is what Sage Bionetworks is doing with Science Translational Medicine in publishing validated biological models, as mentioned in an earlier article. An even more extensive reset of the publication model is found in Open Network Biology (ONB), an online journal. The publishers require that an article be accompanied by the biological model, the data and code used to produce the model, a description of the algorithm, and a platform to aid in reproducing results.

But neither of these worthy projects changes the external conditions that prop up the current publication system.

When one tries to design a reward system that gives deserved credit to other things besides the final results of an experiment, as some participants did at Sage Congress, great unknowns loom up. Is normalizing and cleaning data an activity worth praise and recognition? How about combining data sets from many different projects, as a Synapse researcher did for the TCGA? How much credit do you assign researchers at each step of the necessary procedure for a successful experiment?

Let's turn to the case of free software to look at an example of success in open sharing. It's clear that free software has swept the computer world. Most web sites use free software ranging from the server on which they run to the language compilers that deliver their code. Everybody knows that the most popular mobile platform, Android, is based on Linux, although fewer realize that the next most popular mobile platforms, Apple's iPhones and iPads, run on a modified version of the open BSD operating system. We could go on and on citing ways in which free and open source software have changed the field.

The mechanism by which free and open source software staked out its dominance in so many areas has not been authoritatively established, but I think many programmers agree on a few key points:

  • Computer professionals encountered free software early in their careers, particularly as students or tinkerers, and brought their predilection for it into jobs they took at stodgier institutions such as banks and government agencies. Their managers deferred to them on choices for programming tools, and the rest is history.

  • Of course, computer professionals would not have chosen the free tools had they not been fit for the job (and often best for the job). Why is free software so good? Probably because the people creating it have complete jurisdiction over what to produce and how much time to spend producing it, unlike in commercial ventures with requirements established through marketing surveys and deadlines set unreasonably by management.

  • Different pieces of free software are easy to hook up, because one can alter their interfaces as necessary. Free software developers tend to look for other tools and platforms that could work with their own, and provide hooks into them (Apache, free database engines such as MySQL, and other such platforms are often accommodated.) Customers of proprietary software, in contrast, experience constant frustration when they try to introduce a new component or change components, because the software vendors are hostile to outside code (except when they are eager to fill a niche left by a competitor with market dominance). Formal standards cannot overcome vendor recalcitrance--a painful truth particularly obvious in health care with quasi-standards such as HL7.

  • Free software scales. Programmers work on it tirelessly until it's as efficient as it needs to be, and when one solution just can't scale any more, programmers can create new components such as Cassandra, CouchDB, or Redis that meet new needs.

Are there lessons we can take from this success story? Biological research doesn't fit the circumstances that made open source software a success. For instance, researchers start out low on the totem pole in very proprietary-minded institutions, and don't get to choose new ways of working. But the cleverer ones are beginning to break out and try more collaboration. Software and Internet connections help.

Researchers tend to choose formats and procedures on an ad hoc, project by project basis. They haven't paid enough attention to making their procedures and data sets work with those produced by other teams. This has got to change, and Sage Bionetworks is working hard on it.

Research is labor-intensive. It needs desperately to scale, as I have pointed out throughout this article, but to do so it needs entire new paradigms for thinking about biological models, workflow, and teamwork. This too is part of Sage Bionetworks' mission.

Certain problems are particularly resistant in research:

  • Conditions that affect small populations have trouble raising funds for research. The Sage Congress initiatives can lower research costs by pooling data from the affected population and helping researchers work more closely with patients.

  • Computation and statistical methods are very difficult fields, and biological research is competing with every other industry for the rare individuals who know these well. All we can do is bolster educational programs for both computer scientists and biologists to get more of these people.

  • There's a long lag time before one knows the effects of treatments. As Heywood's keynote suggested, this is partly solved by collecting longitudinal data on many patients and letting them talk among themselves.

Another process change has revolutionized the computer field: agile programming. That paradigm stresses close collaboration with the end-users whom the software is supposed to benefit, and a willingness to throw out old models and experiment. BRIDGE and other patient initiatives hold out the hope of a similar shift in medical research.

All these things are needed to rescue the study of genetics. It's a lot to do all at once. Progress on some fronts were more apparent than others at this year's Sage Congress. But as more people get drawn in, and sometimes fumbling experiments produce maps for changing direction, we may start to see real outcomes from the efforts in upcoming years.

All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR20

May 01 2012

Recombinant Research: Sage Congress plans for patient engagement

Clinical trials are the pathway for approving drug use, but they aren't good enough. That has become clear as a number of drugs (Vioxx being the most famous) have been blessed by the FDA, but disqualified after years of widespread use reveal either lack of efficacy or dangerous side effects. And the measures taken by the FDA recently to solve this embarrassing problem continue the heavy-weight bureaucratic methods it has always employed: more trials, raising the costs of every drug and slowing down approval. Although I don't agree with the opinion of Avik S. A. Roy (reprinted in Forbes) that Phase III trials tend to be arbitrary, I do believe it is time to look for other ways to test drugs for safety and efficacy.

First article in the series: Recombinant Research: Sage Congress Promotes Data Sharing in Genetics.

But the Vioxx problem is just one instance of the wider malaise afflicting the drug industry. They just aren't producing enough new medications, either to solve pressing public needs or to keep up their own earnings. Vicki Seyfert-Margolis of the FDA built on her noteworthy speech at last year's Sage Congress (reported in one of my articles about the conference) with the statistic that drug companies have submitted 20% fewer medications to the FDA between 2001 and 2007. Their blockbuster drugs produce far fewer profits than before as patents expire and fewer new drugs emerge (a predicament called the "patent cliff"). Seyfert-Margolis intimated that this crisis in the cause of layoffs in the industry, although I heard elsewhere that the companies are outsourcing more research, so perhaps the downsizing is just a reallocation of the same money.

Benefits of patient involvement

The field has failed to rise to the challenges posed by new complexity. Speakers at Sage Congress seemed to feel that genetic research has gone off the tracks. As the previous article in this series explained, Sage Bionetworks wants researchers to break the logjam by sharing data and code in GitHub fashion. And surprisingly, pharma is hurting enough to consider going along with an open research system. They're bleeding from a situation where as much as 80% of each clinical analysis is spent retrieving, formatting, and curating the data. Meanwhile, Kathy Giusti of the Multiple Myeloma Research Foundation says that in their work, open clinical trials are 60% faster.

Attendees at a breakout session where I sat in, including numerous managers from major pharma companies, expressed confidence that they could expand public or "pre-competitive" research in the direction Sage Congress proposed. The sector left to engage is the one that's central to all this work--the public.

If we could collect wide-ranging data from, say, 50,000 individuals (a May 2013 goal cited by John Wilbanks of Sage Bionetworks, a Kauffman Foundation Fellow), we could uncover a lot of trends that clinical trials are too narrow to turn up. Wilbanks ultimately wants millions of such data samples, and another attendee claimed that "technology will be ready by 2020 for a billion people to maintain their own molecular and longitudinal health data." And Jamie Heywood of PatientsLikeMe, in his keynote, claimed to have demonstrated through shared patient notes that some drugs were ineffective long before the FDA or manufacturers made the discoveries. He decried the current system of validating drugs for use and then failing to follow up with more studies, snorting that, "Validated means that I have ceased the process of learning."

But patients have good reasons to keep a close hold on their health data, fearing that an insurance company, an identity thief, a drug marketer, or even their own employer will find and misuse it. They already have little enough control over it, because the annoying consent forms we always have shoved in our faces when we come to a clinic give away a lot of rights. Current laws allow all kinds of funny business, as shown in the famous case of the Vermont law against data mining, which gave the Supreme Court a chance to say that marketers can do anything they damn please with your data, under the excuse that it's de-identified.

In a noteworthy poll by Sage Bionetworks, 80% of academics claimed they were comfortable sharing their personal health data with family members, but only 31% of citizen advocates would do so. If that 31% is more representative of patients and the general public, how many would open their data to strangers, even when supposedly de-identified?

The Sage Bionetworks approach to patient consent

It's basic research that loses. So Wilbanks and a team have been working for the past year on a "portable consent" procedure. This is meant to overcome the hurdle by which a patient has to be contacted and give consent anew each time a new researcher wants data related to his or her genetics, conditions, or treatment. The ideal behind portable consent is to treat the entire research community as a trusted user.

The current plan for portable consent provides three tiers:

Tier 1

No restrictions on data, so long as researchers follow the terms of service. Hopefully, millions of people will choose this tier.

Tier 2

A middle ground. Someone with asthma may state that his data can be used only by asthma researchers, for example.

Tier 3

Carefully controlled. Meant for data coming from sensitive populations, along with anything that includes genetic information.

Synapse provides a trusted identification service. If researchers find a person with useful characteristics in the last two tiers, and are not authorized automatically to use that person's data, they can contact Synapse with the random number assigned to the person. Synapse keeps the original email address of the person on file and will contact him or her to request consent.

Portable consent also involves a lot of patient education. People will sign up through a software wizard that explains the risks. After choosing portable consent, the person decides how much to put in: 23andMe data, prescriptions, or whatever they choose to release.

Sharon Terry of the Genetic Alliance said that patient advocates currently try to control patient data in order to force researchers to share the work they base on that data. Portable consent loosens this control, but the field may be ready for its more flexible conditions for sharing.

Pharma companies and genetics researchers have lots to gain from access to enormous repositories of patient data. But what do the patients get from it? Leaders in health care already recognize that patients are more than experimental subjects and passive recipients of treatment. The recent ONC proposal for Stage 2 of Meaningful Use includes several requirements to share treatment data with the people being treated (which seems kind of a no-brainer when stated this baldly) and the ONC has a Consumer/Patient Engagement Power Team.

Sage Congress is fully engaged in the patient engagement movement too. One result is the BRIDGE initiative, a joint project of Sage Bionetworks and Ashoka with funding from the Robert Wood Johnson Foundation, to solicit questions and suggestions for research from patients. Researchers can go for years researching a condition without even touching on some symptom that patients care about. Listening to patients in the long run produces more cooperation and more funding.

Portable consent requires a leap of faith, because as Wilbanks admits, releasing aggregates of patient data mean that over time, a patient is almost certain to be re-identified. Statistical techniques are just getting too sophisticated and compute power growing too fast for anyone to hide behind current tricks such as using only the first three digits of a five-digit postal code. Portable consent requires the data repository to grant access only to bona fide researchers and to set terms of use, including a ban on re-identifying patients. Still, researchers will have rights to do research, redistribute data, and derive products from it. Audits will be built in.

But as mentioned by Kelly Edwards of the University of Washington, tools and legal contracts can contribute to trust, but trust is ultimately based on shared values. Portable consent, properly done, engages with frameworks like Synapse to create a culture of respect for data.

In fact, I think the combination of the contractual framework in portable consent and a platform like Synapse, with its terms of use, might make a big difference in protecting patient privacy. Seyfert-Margolis cited predictions that 500 million smartphone users will be using medical apps by 2015. But mobile apps are notoriously greedy for personal data and cavalier toward user rights. Suppose all those smartphone users stored their data in a repository with clear terms of use and employed portable consent to grant access to the apps? We might all be safer.

The final article in this series will evaluate the prospects for open research in genetics, with a look at the grip of journal publishing on the field, and some comparisons to the success of free and open source software.

Next: Breaking Open Rewards and Incentives. All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR20

April 30 2012

Recombinant Research: Sage Congress promotes data sharing in genetics

Given the exponential drop in the cost of personal genome sequencing (you can get a basic DNA test from 23andMe for a couple hundred dollars, and a full sequence will probably soon come down to one thousand dollars in cost), a new dawn seems to be breaking forth for biological research. Yet the assessment of genetics research at the recent Sage Congress was highly cautionary. Various speakers chided their own field for tilling the same ground over and over, ignoring the urgent needs of patients, and just plain researching the wrong things.

Sage Congress also has some plans to fix all that. These projects include tools for sharing data and storing it in cloud facilities, running challenges, injecting new fertility into collaboration projects, and ways to gather more patient data and bring patients into the planning process. Through two days of demos, keynotes, panels, and breakout sessions, Sage Congress brought its vision to a high-level cohort of 230 attendees from universities, pharmaceutical companies, government health agencies, and others who can make change in the field.

In the course of this series of articles, I'll pinpoint some of the pain points that can force researchers, pharmaceutical companies, doctors, and patients to work together better. I'll offer a look at the importance of public input, legal frameworks for cooperation, the role of standards, and a number of other topics. But we'll start by seeing what Sage Bionetworks and its pals have done over the past year.

Synapse: providing the tools for genetics collaboration

Everybody understands that change is driven by people and the culture they form around them, not by tools, but good tools can make it a heck of a lot easier to drive change. To give genetics researchers the best environment available to share their work, Sage Bionetworks created the Synapse platform.

Synapse recognizes that data sets in biological research are getting too large to share through simple data transfers. For instance, in his keynote about cancer research (where he kindly treated us to pictures of cancer victims during lunch), UC Santa Cruz professor David Haussler announced plans to store 25,000 cases at 200 gigabytes per case in the Cancer Genome Atlas, also known as TCGA in what seems to be a clever pun on the four nucleotides in DNA. Storage requirements thus work out to 5 petabytes, which Haussler wants to be expandable to 20 petabytes. In the face of big data like this, the job becomes moving the code to the data, not moving the data to the code.

Synapse points to data sets contributed by cooperating researchers, but also lets you pull up a console in a web browser to run R or Python code on the data. Some effort goes into tagging each data set with associated metadata: tissue type, species tested, last update, number of samples, etc. Thus, you can search across Synapse to find data sets that are pertinent to your research.

One group working with Synapse has already harmonized and normalized the data sets in TCGA so that a researcher can quickly mix and run stats on them to extract emerging patterns. The effort took about one and half full-time employees for six months, but the project leader is confident that with the system in place, "we can activate a similar size repository in hours."

This contribution highlights an important principle behind Synapse (appropriately called "viral" by some people in the open source movement): when you have manipulated and improved upon the data you find through Synapse, you should put your work back into Synapse. This work could include cleaning up outlier data, adding metadata, and so on. To make work sharing even easier, Synapse has plans to incorporate the Amazon Simple Workflow Service (SWF). It also hopes to add web interfaces to allow non-programmers do do useful work with data.

The Synapse development effort was an impressive one, coming up with a feature-rich Beta version in a year with just four coders. And Synapse code is entirely open source. So not only is the data distributed, but the creators will be happy for research institutions to set up their own Synapse sites. This may make Synapse more appealing to geneticists who are prevented by inertia from visiting the original Synapse.

Mike Kellen, introducing Synapse, compared its potential impact to that of moving research from a world of journals to a world like GitHub, where people record and share every detail of their work and plans. Along these lines, Synapse records who has used a data set. This has many benefits:

  • Researchers can meet up with others doing related work.

  • It gives public interest advocates a hook with which to call on those who benefit commercially from Synapse--as we hope the pharmaceutical companies will--to contribute money or other resources.

  • Members of the public can monitor accesses for suspicious uses that may be unethical.

There's plenty more work to be done to get data in good shape for sharing. Researchers must agree on some kind of metadata--the dreaded notion of ontologies came up several times--and clean up their data. They must learn about data provenance and versioning.

But sharing is critical for such basics of science as reproducing results. One source estimates that 75% of published results in genetics can't be replicated. A later article in this series will examine a new model in which enough metainformation is shared about a study for it to be reproduced, and even more important to be a foundation for further research.

With this Beta release of Synapse, Sage Bionetworks feels it is ready for a new initiative to promote collaboration in biological research. But how do you get biologists around the world to start using Synapse? For one, try an activity that's gotten popular nowadays: a research challenge.

The Sage DREAM challenge

Sage Bionetworks' DREAM challenge asks genetics researchers to find predictors of the progression of breast cancer. The challenge uses data from 2000 women diagnosed with breast cancer, combining information on DNA alterations affecting how their genes were expressed in the tumors, clinical information about their tumor status, and their outcomes over ten years. The challenge is to build models integrating the alterations with molecular markers and clinical features to predict which women will have the most aggressive disease over a ten year period.

Several hidden aspects of the challenge make it a clever vehicle for Sage Bionetworks' values and goals. First, breast cancer is a scourge whose urgency is matched by its stubborn resistance to diagnosis. The famous 2009 recommendations of U.S. Preventive Services Task Force, after all the controversy was aired, left us with the dismal truth that we don't know a good way to predict breast cancer. Some women get mastectomies in the total absence of symptoms based just on frightening family histories. In short, breast cancer puts the research and health care communities in a quandary.

We need finer-grained predictors to say who is likely to get breast cancer, and standard research efforts up to now have fallen short. The Sage proposal is to marshal experts in a new way that combines their strengths, asking them to publish models that show the complex interactions between gene targets and influences from the environment. Sage Bionetworks will publish data sets at regular intervals that it uses to measure the predictive ability of each model. A totally fresh data set will be used at the end to choose the winning model.

The process behind the challenge--particularly the need to upload code in order to run it on the Synapse site--automatically forces model builders to publish all their code. According to Stephen Friend, founder of Sage Bionetworks, "this brings a level of accountability, transparency, and reproducibility not previously achieved in clinical data model challenges."

Finally, the process has two more effects: it shows off the huge amount of genetic data that can be accessed through Synapse, and it encourages researchers to look at each other's models in order to boost their own efforts. In less than a month, the challenge already received more than 100 models from 10 sources.

The reward for winning the challenge is publication in a respected journal, the gold medal still sought by academic researchers. (More on shattering this obelisk later in the series.) Science Translational Medicine will accept results of the evaluation as a stand-in for peer review, a real breakthrough for Sage Bionetworks because it validates their software-based, evidence-driven process.

Finally, the DREAM challenge promotes use of the Synapse infrastructure, and in particular the method of bringing the code to the data. Google is donating server space for the challenge, which levels the playing field for researchers, freeing them from paying for their own computing.

A single challenge doesn't solve all the problems of incentives, of course. We still need to persuade researchers to put up their code and data on a kind of genetic GitHub, persuade pharmaceutical companies to support open research, and persuade the general public to share data about the phonemes (life data) and genes--all topics for upcoming articles in the series.

Next: Sage Congress Plans for Patient Engagement. All articles in this series, and others I've written about Sage Congress, are available through a bit.ly bundle.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR20

April 27 2012

Passage of CISPA in the U.S. House highlights need for viable cybersecurity legislation

To paraphrase Ben Franklin, he who sacrifices online freedom for the sake of cybersecurity deserves neither. Last night, the Cyber Intelligence Sharing and Protection Act (CISPA) (H.R. 3523) through the United States House of Representatives was sent to a vote a day earlier than scheduled. CISPA passed the House by a vote of 250-180, defying a threatened veto from the White House. The passage of CISPA now sets up a fierce debate in the Senate, where Senate Majority Leader Harry Reid (D-NV) has indicated that he wishes to bring cybersecurity legislation forward for a vote in May.

The votes on H.R. 3523 broke down along largely partisan lines, although dozens of both Democrats and Republicans voted for or against CISPA it in the finally tally. CISPA was introduced last November and approved by the House Intelligence Committee by a 17-1 vote before the end of 2011, which meant that the public has had months to view and comment upon the bill. The bill has 112 cosponsors and received no significant opposition from major U.S. corporations, including the social networking giants and telecommunications companies who would be subject to its contents.

In fact, as an analysis of campaign donations by Maplight showed, over the past two years interest groups that support CISPA have outspent those that oppose it by 12 to 1, ranging from defense contractors, cable and satellite TV providers, software makers, cellular companies and online computer services.

While the version of CISPA that passed shifted before the final vote, ProPublica's explainer on CISPA remains a useful resource for people who wish to understand its contents. Declan McCullagh, CNET's tech policy reporter, has also been following the bill closely since it was introduced and he has published an excellent FAQ explaining how CISPA would affect you.

As TechDirt observed last night, the final version of CISPA — available as a PDF from docs.house.gov contained more scope on the information types collected in the name of security. Specifically, CISPA now would allow the federal government to use information for the purpose of investigation and prosecution of cybersecurity crimes, protection of individuals, and the protection of children. In this context, a "cybersecurity crime" would be defined as any crime that involves network disruption or "hacking."

Civil libertarians, from the Electronic Frontier Foundation (EFF) to the American Civil Liberties Union, have been fiercely resisting CISPA for months. "CISPA goes too far for little reason," said Michelle Richardson, the ACLU legislative counsel, in a statement on Thursday. "Cybersecurity does not have to mean abdication of Americans' online privacy. As we've seen repeatedly, once the government gets expansive national security authorities, there's no going back. We encourage the Senate to let this horrible bill fade into obscurity."

Today, there is widespread alarm online over the passage of CISPA, from David Gewirtz calling it heinous at ZDNet to Alexander Furnas exploring its troubling aspects to it being called a direct threat to Internet privacy over at WebProNews.

The Center for Democracy and Technology issued a statement that it was:

"... disappointed that House leadership chose to block amendments on two core issues we had long identified — the flow of information from the private sector directly to NSA and the use of that information for national security purposes unrelated to cybersecurity. Reps. Thompson, Schakowsky, and Lofgren wrote amendments to address those issues, but the leadership did not allow votes on those amendments. Such momentous issues deserved a vote of the full House. We intend to press these issues when the Senate takes up its cybersecurity legislation."

Alexander Furnas included a warning in his nuanced exploration of the bill at The Atlantic:

"CISPA supporters — a list that surprisingly includes SOPA opponent Congressman Darrell Issa — are quick to point out that the bill does not obligate disclosure of any kind. Participation is 'totally voluntary.' They are right, of course, there is no obligation for a private company to participate in CISPA information sharing. However, this misses the point. The cost of this information sharing — in terms of privacy lost and civil liberties violated — is borne by individual customers and Internet users. For them, nothing about CISPA is voluntary and for them there is no recourse. CISPA leaves the protection of peoples' privacy in the hands of companies who don't have a strong incentive to care. Sure, transparency might lead to market pressure on these companies to act in good conscience; but CISPA ensures that no such transparency exists. Without correctly aligned incentives, where control over the data being gathered and shared (or at least knowledge of that sharing) is subject to public accountability and respectful of individual right to privacy, CISPA will inevitably lead to an eco-system that tends towards disclosure and abuse."

The context that already exists around digital technology, civil rights and national security must also be acknowledged for the purposes of public debate. As the EFF's Trevor Timm emphasized earlier this week, once national security is invoked, both civilian and law enforcement wield enormous powers to track and log information about citizens' lives without their knowledge nor practical ability to gain access to the records involved.

On that count, CISPA provoked significant concerns from the open government community, with the Sunlight Foundation's John Wonderlich calling the bill terrible for transparency because it proposes to limit public oversight of the work of information collection and sharing within the federal government.

"The FOIA is, in many ways, the fundamental safeguard for public oversight of government's activities," wrote Wonderlich. "CISPA dismisses it entirely, for the core activities of the newly proposed powers under the bill. If this level of disregard for public accountability exists throughout the other provisions, then CISPA is a mess. Even if it isn't, creating a whole new FOIA exemption for information that is poorly defined and doesn't even exist yet is irresponsible, and should be opposed."

What's the way forward?

The good news, for those concerned about what passage of the bill will mean for the Internet and online privacy, is that now the legislative process turns to the Senate. The open government community's triumphalism around the passage of the DATA Act and the gathering gloom and doom around CISPA all meet the same reality in this respect: checks and balances in the other chamber of Congress and a threatened veto from the White House.

Well done, founding fathers.

On the latter count, the White House has made it clear that the administration views CISPA as a huge overreach on privacy, driving a truck through existing privacy protections. The Obama administration has stated (PDF) that CISPA:

"... effectively treats domestic cybersecurity as an intelligence activity and thus, significantly departs from longstanding efforts to treat the Internet and cyberspace as civilian spheres. The Administration believes that a civilian agency — the Department of Homeland Security — must have a central role in domestic cybersecurity, including for conducting and overseeing the exchange of cybersecurity information with the private sector and with sector-specific Federal agencies."

At a news conference yesterday in Washington, the Republican leadership of the House characterized the administration's position differently. "The White House believes the government ought to control the Internet, government ought to set standards, and government ought to take care of everything that's needed for cybersecurity," said Speaker of the House John Boehner (R-Ohio), who voted for CISPA. "They're in a camp all by themselves."

Representative Mike Rogers (R-Michigan) -- the primary sponsor of the bill, along with Representative Dutch Ruppersberger (D-Maryland) -- accused opponents of "obfuscation" on the House floor yesterday.

While there are people who are not comfortable with the Department of Homeland Security (DHS) holding the keys to the nation's "cyberdefense" — particularly given the expertise and capabilities that rest in the military and intelligence communities — the prospect of military surveillance of citizens within the domestic United States is not likely to be one that the founding fathers would support, particularly without significant oversight from the Congress.

CISPA does not, however, formally grant either the National Security Agency or DHS any more powers than they already hold under existing legislation, such as the Patriot Act. It would, however, enable more information sharing between private companies and government agencies, including threat information pertinent to legitimate national security concerns.

It's crucial to recognize that cybersecurity legislation has been percolating in the Senate for years now without passage. That issue of civilian oversight is a key issue in the Senate wrangling, where major bills have been circulating for years now without passage, from proposals from Senator Lieberman's office on cybersecurity to the ICE Act from Senator Carper to Senator McCain's proposals.

If the fight over CISPA is "just beginning", as Andy Greenberg wrote in Forbes today, it's important for everyone that's getting involved because of concerns over civil liberties or privacy recognizes that CISPA is not like SOPA, as Brian Fung wrote in the American Prospect, particularly after provisions regarding intellectual property were dropped:

"At some point, privacy groups will have to come to an agreement with Congress over Internet legislation or risk being tarred as obstructionists. That, combined with the fact that most ordinary Americans lack the means to distinguish among the vagaries of different bills, suggests that Congress is likely to win out over the objections of EFF and the ACLU sooner rather than later. Thinking of CISPA as just another SOPA not only prolongs the inevitable — it's a poor analogy that obscures more than it reveals."

That doesn't mean that those objections aren't important or necessary. It does mean, however, that anyone who wishes to join the debate must recognize that genuine security threats do exist, even though massive hype about a potential "Cyber 9/11" perpetuated by contractors that stand to benefit from spending continues to pervade the media. There are legitimate concerns regarding the theft of industrial secrets, "crimesourcing" by organized crime and the reality of digital agents from the Chinese, Iranian and Russian governments — along with non-state actors — exploring the IT infrastructure of the United States.

The simple reality is that in Washington, national security trumps everything. It's not like intellectual property or energy or education or healthcare. What anyone who wishes to get involved in this debate will need to do is to support an affirmative vision for what roles the federal government and the private sector should play in securing the nation's critical infrastructure against electronic attacks. And the relationship of business and government complicates cybersecurity quite a bit, as "Inside Cyber Warfare" author Jeffrey Carr explained here at Radar in February:

"Due to the dependence of the U.S. government upon private contractors, the insecurity of one impacts the security of the other. The fact is that there are an unlimited number of ways that an attacker can compromise a person, organization or government agency due to the interdependencies and connectedness that exist between both."

The good news today is that increased awareness of the issue will drive more public debate about what's to be done. During the week the Web changed Washington in January, the world saw how the Internet can act as a platform for collective action against a bill.

Civil liberties groups have vowed to continue advocating against the passage of any vaguely drafted bill in the Senate.

On Monday, more than 60 distinguished IT security professionals, academics and engineers published an open letter to Congress urging opposition to any "'cybersecurity' initiative that does not explicitly include appropriate methods to ensure the protection of users’ civil liberties."

The open question now, as with intellectual property, is whether major broadcast and print media outlets in the United States will take their role of educating citizens seriously enough for the nation to meaningfully participate in legislative action.

This is a debate that will balance the freedoms that the nation has fought hard to achieve and defend throughout its history against the dangers we collectively face in a century when digital technologies have become interwoven into the everyday lives of citizens. We live in a networked age, with new attendant risks and rewards.

Citizens should hold their legislators accountable for supporting bills that balance civil liberties, public oversight and privacy protections with improvements to how the public and private sector monitors, mitigates and shares information about network security threats in the 21st century.

Visualization of the Week: The living city

This week's visualization comes from Interactive Things, a design and technology studio based in Zurich, Switzerland. The company has been working on a project called "Ville Vivante" — the living city.

Here's how the coordinators describe the project:

"Based on the premise that the mobile phone has become the center of our everyday communication and main source of information, the City of Geneva, in cooperation with the Lift conference, decided to take the challenge to visualize the digital traces created by our mobile phones ... the goal of the project was to visualize the urban flow of everyday life and make it visible to passersby at the local train station."

geneva_atnight.jpg
Screenshot from the "Ville Vivante" visualization. See additional screens and learn more about the project.

In the Interactive Things' blog post, the company talks about some of the design decisions that went into making visualizations that were aimed at the public and installed in public places.

The data comes from the Swiss telecommunications company Swisscom, and the dataset used for the visualization is based on one week's worth of mobile data usage.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

More Visualizations:

April 26 2012

Fitness for geeks

Programmers who spend 14 hours a day in front of a computer terminal writing code know how hard it is to step out of the cubicle and learn how to live a more healthy lifestyle. But getting fit doesn't need to be so daunting, and a growing number of technophiles are finding ways to make the process more appealing and relevant to their interest in data, design, and discovery. The increasing popularity of projects such as Quantified Self, smartphone apps, and gadgets dedicated to monitoring your body, generating metrics and routines for your exercise regime, and tracking your progress has created a community of like-minded geeks to share in your struggle, and even make it fun.

I recently talked with Bruce Perry, author of the just-released Fitness for Geeks, about some of the tools this crowd is using, some others they might be missing, and how the rest of us can use these tips to get healthy too. Highlights from our conversation include:

  • Debug your wetware. A programmer becomes fitter by becoming more knowledgeable about her internal software and learning how to optimize it for maximum performance and efficiency. [Discussed at the 00:21 mark]
  • Get some sleep. This one's pretty obvious, but now there are many new ways to quantify and analyze your sleep. Zeo Sleep Manager monitors your brainwaves during sleep and displays graphs for your review when you wake up, communicating wirelessly to a software-enabled clock and the web, Use your personal dashboard to identify your sleep cycles, analyze your REM, and measure the effects of different daily events (such as a stressful day or a drink before bed) on sleep. [Discussed at the 1:54 mark]
  • Use apps to assist your workouts and quantify your health. Tools such as FitBit, Nike+, Garmin Connect, AlpineReplay, and RestWise connect you and your health to the digital world where so much of the rest of your life is lived. [Discussed at the 3:46 mark]
  • Just get outside. You don't need a sophisticated routine, as long as you're moving. Doing the same thing over and over tends to create a static effect that plateaus. But you can randomize your workouts to make them more interesting. Tools such as GAIN Fitness and CrossFit's Workout of the Day (WOD) Generator will use algorithms to generate your own daily protocol. [Discussed at the 4:53 mark]
  • Fast. Intermittent fasting has been shown to lower blood pressure, normalize insulin and glucose levels, and even provide more efficient workouts while fasting. The basic guidelines for intermittent fasting is to eat only within an 8-hour window (eat dinner, don't eat at night, skip breakfast) and go the remaining16 hours on just water, coffee, and tea. [Discussed at the 7:26 mark]
  • Resist extremes. Bruce says it's okay to do a marathon or similarly challenging event for the experience, but that the oxidative stress can have a significantly negative effect on your overall and long-term health. Instead, revolve your exercise program around short-duration, high-intensity training, such as sprinting, followed by 30-40 minutes of high-intensity weights. [Discussed at the 09:09 mark]
  • Practice good stress. Various forms of acute stresses (known as hormesis) — such as moderate and high-intensity exercise, hot and cold exposure, one drink at night — can improve your health. [Discussed at the 13:02 mark]
  • Personal experiences with fitness apps. Bruce talks about using Endomondo, GPS data, and Google Earth to scout out an off-piste ski area, and I mention my own use of Google's MyTracks Android app for marathon training. [Discussed at the 15:19 mark]

The full interview is available in the following video:

Fitness for Geeks — This guide will help you experiment with one crucial system you usually ignore — your body and its health. Long hours focusing on code or circuits tends to stifle notions of nutrition, but with this book you can approach fitness through science.

April 23 2012

Big data in Europe

The worldwide Big Data Week kicks off today with gatherings in the U.K., U.S., Germany, Finland and Australia. As part of their global focus, Big Data Week founder/organizer Stewart Townsend (@stewarttownsend) of DataSift, and Carlos Somohano (@ds_ldn), founder of the Data Science London community and the Data Science Hackathon, have been tracking big data in Europe. This is an area that we're exploring here at Radar and through October's Strata Conference in London, so I asked Townsend and Somohano to share their perspectives on the European data scene. They combined their thoughts in the following Q&A.

Are the U.S. and Europe at similar stages in big data growth, adoption and application?

Townsend and Somohano: Based on our experience across industry verticals and markets in Europe, we believe the U.S. is leading the way in adopting so called "big data." The U.K. is perhaps the European leader, where the level of adoption is picking up quite quickly, although still lagging behind the U.S.

In Germany, surprisingly, many organizations are not adopting big data as quickly as in the U.S. and U.K. markets. In some southern European markets, big data is still quite a new concept. Practically speaking, it's not on the radar there yet.

Part of our organizational mission at Big Data Week is to promote big data across the whole European ecosystem and act as messengers of the benefits of adopting big data early.

What are the key differences between big data in the U.S. and Europe?

Townsend and Somohano: Many large organizations in Europe are still in the early stages of the adoption cycle. This is perhaps due to the level of confusion around aspects like terminology (i.e. what do we mean by "big data"?) and the acute lack of skills around "new" big data technologies.

Today, most of the new developments and technologies around big data come from the U.S., and this presents somewhat of a challenge for some European organizations as they try to keep up with the rapid level of change. Perhaps the speed of both the decision-making cycle and the organizational change in most European companies is a bit slower than in the U.S. counterparts, and this may be reflected in the adoption of big data. We also see that the concept of data science and the role of the data scientist is well adopted in many U.S. companies, and not so much in Europe.

Where is Europe excelling with big data?

Townsend and Somohano: The financial services sector — particularly investment banking and trading in London — is one of the early adopters of big data. In this sector, both the experience and expertise in big data is on par with big data leaders in the U.S. Equally, the level of investment in big data in this sector — despite the economic downturn — is healthy, with a positive outlook.

Technology startups are also becoming European leaders in big data, primarily around London, and to a lesser degree in Berlin. There is a relatively large number of startups that are not only adopting but developing new business models and value from big data in social media and business-to-consumer services.

Finally, some verticals like oil and gas, utilities, and manufacturing are increasingly adopting big data in areas like sensor data, telemetry, and operational data streams. Our research indicates that retail is perhaps a late adopter.

The U.K. government has a number of robust open data initiatives. How are those shaping the big data space?

Townsend and Somohano: Quite notably, the U.K. government is becoming one of the leaders in open data, although perhaps not so in big data. This is perhaps due to the fact that the key drivers of open data initiatives are mainly information transparency, information privacy, information accessibility for citizens, and European regulatory changes. It's also worth mentioning that the U.K. government has been involved in very large-scale IT projects in the past (e.g. NHS IT), which could qualify as big data program initiatives. For diverse reasons, these projects were not successful and experienced massive budget overruns and delays. We believe this could be a factor in the U.K. government not focusing its main effort on big data for now. However, the open data initiative is driving the open release of massive large public datasets, which eventually will require a strategic big data approach from the U.K. government.

What's the state of data science in Europe?

Townsend and Somohano: Similarly to the status of big data, Europe lags behind in the adoption levels of data science. Even in the U.K. — an early adopter in many ways — data science is still viewed in some sectors with a certain dose of skepticism. That's because data science is still understood as as new name for practices like business intelligence or analytics.

Additionally, in many large organizations in Europe the role of the data scientist is still not associated with a clear job description from the business and HR perspectives — and even in IT in some cases. Contrary to the corporate environment, where the data scientist role is still not fully recognized, in the European startup scene there is a healthy and vibrant data science community. This is perhaps best exemplified by our organization Data Science London. In its mission to promote data science and the data scientist role, our community is dedicated to the free, open dissemination of data science concepts and ideas.

Big Data Week is being held April 23-28. What are you hoping the event yields?

Townsend and Somohano: The main goal of Big Data Week is to bring together the big data communities around the world by hosting a series of events and activities to promote big data. We aim to spread the knowledge and understanding of big data challenges and opportunities from the technology, business, and commercial perspectives.


Big Data Week will feature the Data Science Hackathon, a 24-hour in-person/online event organized by Data Science London. Full Hackathon details are available here.


This interview was edited and condensed.

Related:

April 20 2012

Visualization of the Week: The history of shipping routes

Last week's featured visualization mapped the geographic background and intended destination of the passengers on the ill-fated Titanic. This week's visualization also examines oceanic travel, but of a different sort.

The visualization comes from Ben Schmidt and shows more than 100 years of trade routes, from 1750 onward.

Schmidt writes:

"This shows mostly Spanish, Dutch, and English routes — they are surprisingly constant over the period (although some empires drop in and out of the record), but the individual voyages are fun. And there are some macro patterns — the move of British trade towards India, the effect of the American Revolution and the Napoleonic Wars, and so on."

Schmidt has another visualization on his blog that breaks down the data based on seasonal patterns. A grad student at Princeton, Schmidt admits that "19th century ocean trade isn't exactly much my field," but he does offers some insights into why the maps look this way and what's missing (Darwin's Beagle, U.S. data, and so on).

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

More Visualizations:

April 19 2012

Strata Week: The rise of the robot essay graders

Here are a few of the data stories that caught my attention this week.

Automated essay-scoring software scores as well as humans

Taking a test at the Real Estate Investing College by Casey Serin, on FlickrRobot essay graders: They grade the same as humans. That's the conclusion of a study conducted by University of Akron's Dean of the College of Education Mark Shermis and Kaggle data scientist Ben Hamner. The researchers examined some 22,000 essays that were administered to junior and high school students as part of their states' standardized testing process, comparing the grades given by human graders and those given by automated grading software. They found that "overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genre" (PDF of the report).

"The demonstration showed conclusively that automated essay scoring systems are fast, accurate, and cost effective," says Tom Vander Ark, managing partner at the investment firm Learn Capital, in a press release touting the study's results.

The study coincides with an active competition hosted on Kaggle and sponsored by the Hewlett Foundation, in which data scientists are challenged with developing the best algorithm to automatically grade student essays. "Better tests support better learning," noted the foundation's Education Program Director Barbara Chow in the press release. "This demonstration of rapid and accurate automated essay scoring will encourage states to include more writing in their state assessments. And, the more we can use essays to assess what students have learned, the greater the likelihood they'll master important academic content, critical thinking, and effective communication."

Personally, I like writing for a human audience. Bots leave really stupid blog comments — but I bet there's an algorithm for that too.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

Scaling Instagram

The billion-dollar acquisition of the mobile photo-sharing app Instagram was big news last week. The news coincided with a presentation by co-founder Mike Krieger at an AirBnB Tech Talk about how the startup managed to scale to 30 million users worldwide with a small team of back-end developers (a very small team, in fact). Krieger's presentation is interesting in its own right, of course, but news of the acquisition by Facebook certainly fueled interest — in the deal and in the tech under the Instagram hood.

Krieger's slides can be found here. The presentation details some of the early and ongoing challenges of handling the app's increasing number of users and their photos (including the recent roll-out of an Android app, which added another million new users in just 12 hours). Although Instagram hasn't suffered any major outages of the likes seen by Twitter and Tumblr, Krieger does note a number of early problems, including a missing favicon.ico that was causing a lot of 404 errors in Django.

Auditing data.gov.uk

The UK's National Audit Office has just released its look at the government's open data efforts, reports The Guardian. Although the open data initiative gets good marks for the "tsunami of data" it's released — 8,300 datasets — there remain questions about cost and usage.

Governmental departments estimate they spend between £53,000 and £500,000 each year on publishing the data, with the police crime maps, for example, costing £300,000 to set up and £150,000 per year to maintain. And it's not clear that the data is in demand, according to the National Audit Office report: "None of the departments reported significant spontaneous public demand for the standard dataset releases." This doesn't account for the ways in which third-party vendors may be using the data, however.

Big Data Week

April 23-29 is "Big Data Week," an event created by DataSift that will feature meetups and hackathons in several cities around the world. Big Data Week aims to bring together the "core communities" — data scientists, data technologies, data visualization, and data business. A list of events is available on the Big Data Week website.

Got data news?

Feel free to email me.

Photo: Taking a test at the Real Estate Investing College

April 16 2012

What it takes to build great machine learning products

Machine learning (ML) is all the rage, riding tight on the coattails of the "big data" wave. Like most technology hype, the enthusiasm far exceeds the realization of actual products. Arguably, not since Google's tremendous innovations in the late '90s/early 2000s has algorithmic technology led to a product that has permeated the popular culture. That's not to say there haven't been great ML wins since, but none have as been as impactful or had computational algorithms at their core. Netflix may use recommendation technology, but Netflix is still Netflix without it. There would be no Google if Page, Brin, et al., hadn't exploited the graph structure of the web and anchor text to improve search.

So why is this? It's not for lack of trying. How many startups have aimed to bring natural language processing (NLP) technology to the masses, only to fade into oblivion after people actually try their products? The challenge in building great products with ML lies not in just understanding basic ML theory, but in understanding the domain and problem sufficiently to operationalize intuitions into model design. Interesting problems don't have simple off-the-shelf ML solutions. Progress in important ML application areas, like NLP, come from insights specific to these problems, rather than generic ML machinery. Often, specific insights into a problem and careful model design make the difference between a system that doesn't work at all and one that people will actually use.

The goal of this essay is not to discourage people from building amazing products with ML at their cores, but to be clear about where I think the difficulty lies.

Progress in machine learning

Machine learning has come a long way over the last decade. Before I started grad school, training a large-margin classifier (e.g., SVM) was done via John Platt's batch SMO algorithm. In that case, training time scaled poorly with the amount of training data. Writing the algorithm itself required understanding quadratic programming and was riddled with heuristics for selecting active constraints and black-art parameter tuning. Now, we know how to train a nearly performance-equivalent large-margin classifier in linear time using a (relatively) simple online algorithm (PDF). Similar strides have been made in (probabilistic) graphical models: Markov-chain Monte Carlo (MCMC) and variational methods have facilitated inference for arbitrarily complex graphical models [1]. Anecdotally, take at look at papers over the last eight years in the proceedings of the Association for Computational Linguistics (ACL), the premiere natural language processing publication. A top paper from 2011 has orders of magnitude more technical ML sophistication than one from 2003.

On the education front, we've come a long way as well. As an undergrad at Stanford in the early-to-mid 2000s, I took Andrew Ng's ML course and Daphne Koller's probabilistic graphical model course. Both of these classes were among the best I took at Stanford and were only available to about 100 students a year. Koller's course in particular was not only the best course I took at Stanford, but the one that taught me the most about teaching. Now, anyone can take these courses online.

As an applied ML person — specifically, natural language processing — much of this progress has made aspects of research significantly easier. However, the core decisions I make are not which abstract ML algorithm, loss-function, or objective to use, but what features and structure are relevant to solving my problem. This skill only comes with practice. So, while it's great that a much wider audience will have an understanding of basic ML, it's not the most difficult part of building intelligent systems.

Interesting problems are never off the shelf

The interesting problems that you'd actually want to solve are far messier than the abstractions used to describe standard ML problems. Take machine translation (MT), for example. Naively, MT looks like a statistical classification problem: You get an input foreign sentence and have to predict a target English sentence. Unfortunately, because the space of possible English is combinatorially large, you can't treat MT as a black-box classification problem. Instead, like most interesting ML applications, MT problems have a lot of structure and part of the job of a good researcher is decomposing the problem into smaller pieces that can be learned or encoded deterministically. My claim is that progress in complex problems like MT comes mostly from how we decompose and structure the solution space, rather than ML techniques used to learn within this space.

Machine translation has improved by leaps and bounds throughout the last decade. I think this progress has largely, but not entirely, come from keen insights into the specific problem, rather than generic ML improvements. Modern statistical MT originates from an amazing paper, "The mathematics of statistical machine translation" (PDF), which introduced the noisy-channel architecture on which future MT systems would be based. At a very simplistic level, this is how the model works [2]: For each foreign word, there are potential English translations (including the null word for foreign words that have no English equivalent). Think of this as a probabilistic dictionary. These candidate translation words are then re-ordered to create a plausible English translation. There are many intricacies being glossed over: how to efficiently consider candidate English sentences and their permutations, what model is used to learn the systematic ways in which reordering occurs between languages, and the details about how to score the plausibility of the English candidate (the language model).

The core improvement in MT came from changing this model. So, rather than learning translation probabilities of individual words, to instead learn models of how to translate foreign phrases to English phrases. For instance, the German word "abends" translates roughly to the English prepositional phrase "in the evening." Before phrase-based translation (PDF), a word-based model would only get to translate to a single English word, making it unlikely to arrive at the correct English translation [3]. Phrase-based translation generally results in more accurate translations with fluid, idiomatic English output. Of course, adding phrased-based emissions introduces several additional complexities, including how to how to estimate phrase-emissions given that we never observe phrase segmentation; no one tells us that "in the evening" is a phrase that should match up to some foreign phrase. What's surprising here is that there aren't general ML improvements that are making this difference, but problem-specific model design. People can and have implemented more sophisticated ML techniques for various pieces of an MT system. And these do yield improvements, but typically far smaller than good problem-specific research insights.

Franz Och, one of the authors of the original Phrase-based papers, went on to Google and became the principle person behind the search company's translation efforts. While the intellectual underpinnings of Google's system go back to Och's days as a research scientist at the Information Sciences Institute (and earlier as a graduate student), much of the gains beyond the insights underlying phrase-based translation (and minimum-error rate training, another of Och's innovations) came from a massive software engineering effort to scale these ideas to the web. That effort itself yielded impressive research into large-scale language models and other areas of NLP. It's important to note that Och, in addition to being a world-class researcher, is also, by all accounts, an incredibly impressive hacker and builder. It's this rare combination of skill that can bring ideas all the way from a research project to where Google Translate is today.

Defining the problem

But I think there's an even bigger barrier beyond ingenious model design and engineering skills. In the case of machine translation and speech recognition, the problem being solved is straightforward to understand and well-specified. Many of the NLP technologies that I think will revolutionize consumer products over the next decade are much vaguer. How, exactly, can we take the excellent research in structured topic models, discourse processing, or sentiment analysis and make a mass-appeal consumer product?

Consider summarization. We all know that in some way, we'll want products that summarize and structure content. However, for computational and research reasons, you need to restrict the scope of this problem to something for which you can build a model, an algorithm, and ultimately evaluate. For instance, in the summarization literature, the problem of multi-document summarization is typically formulated as selecting a subset of sentences from the document collection and ordering them. Is this the right problem to be solving? Is the best way to summarize a piece of text a handful of full-length sentences? Even if a summarization is accurate, does the Franken-sentence structure yield summaries that feel inorganic to users?

Or, consider sentiment analysis. Do people really just want a coarse-grained thumbs-up or thumbs-down on a product or event? Or do they want a richer picture of sentiments toward individual aspects of an item (e.g., loved the food, hated the decor)? Do people care about determining sentiment attitudes of individual reviewers/utterances, or producing an accurate assessment of aggregate sentiment?

Typically, these decisions are made by a product person and are passed off to researchers and engineers to implement. The problem with this approach is that ML-core products are intimately constrained by what is technically and algorithmically feasible. In my experience, having a technical understanding of the range of related ML problems can inspire product ideas that might not occur to someone without this understanding. To draw a loose analogy, it's like architecture. So much of the construction of a bridge is constrained by material resources and physics that it doesn't make sense to have people without that technical background design a bridge.

The goal of all this is to say that if you want to build a rich ML product, you need to have a rich product/design/research/engineering team. All the way from the nitty gritty of how ML theory works to building systems to domain knowledge to higher-level product thinking to technical interaction and graphic design; preferably people who are world-class in one of these areas but also good in several. Small talented teams with all of these skills are better equipped to navigate the joint uncertainty with respect to product vision as well as model design. Large companies that have research and product people in entirely different buildings are ill-equipped to tackle these kinds of problems. The ML products of the future will come from startups with small founding teams that have this full context and can all fit in the proverbial garage.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

[1]: Although MCMC is a much older statistical technique, its broad use in large-scale machine learning applications is relatively recent.

[2]: The model is generative, so what's being described here is from the point-of-view of inference; the model's generative story works in reverse.

[3]: IBM model 3 introduced the concept of fertility to allow a given word to generate multiple independent target translation words. While this could generate the required translation, the probability of the model doing so is relatively low.

Related:

April 13 2012

Visualization of the Week: Mapping the Titanic's passengers

It's been 100 years since the tragedy of the Titanic — the sinking of the ship, that is, not the James Cameron movie. One of the worst disasters in maritime history, more than 1,500 people lost their lives when the ship struck an iceberg at 11:40 p.m. on April 14, 2012. It took less than three hours for the "unsinkable" ship to sink.

GIS software maker ESRI has created an interactive map visualizing the fates of the ship's passengers. The map breaks the passengers down by class and by geographic background, highlighting their different origins and destinations as well as their different survival rates. The first-class passengers were primarily from "affluent" European and American cities, while the third-class passengers came from a variety of locations, including Scandinavia, Ireland, Slovenia and Lebanon. Many were emigrants, planning on making a new life in America. Just 183 (27%) of those third-class passengers survived, while 324 (60%) of the first-class passengers survived.

titanic_viz2.jpg
Screenshot from ESRI's Titanic visualization. See the full version.

You can view passengers' origins, where they boarded the Titanic, as well as their intended destination. See the full visualization here.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

More Visualizations:

April 12 2012

Strata Week: Add structured data, lose local flavor?

Here are a few of the data stories that caught my attention this week:

A possible downside to Wikidata

Wikidata data model diagram
Screenshot from the Wikidata Data Model page.

The Wikimedia Foundation — the good folks behind Wikipedia — recently proposed a Wikidata initiative. It's a new project that would build out a free secondary database to collect structured data that could provide support in turn for Wikipedia and other Wikimedia projects. According to the proposal:

"Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata, you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today."

But in The Atlantic this week, Mark Graham, a research fellow at the Oxford Research Institute, takes a look at the proposal, calling these "changes that have worrying connotations for the diversity of knowledge in the world's sixth most popular website." Graham points to the different language editions of Wikipedia, noting that the encyclopedic knowledge contained therein is always highly diverse. "Not only does each language edition include different sets of topics, but when several editions do cover the same topic, they often put their own, unique spin on the topic. In particular, the ability of each language edition to exist independently has allowed each language community to contextualize knowledge for its audience."

Graham fears that emphasizing a standardized, machine-readable, semantic-oriented Wikipedia will lose this local flavor:

"The reason that Wikidata marks such a significant moment in Wikipedia's history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is that fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups."

His arguments raise questions about the perceived universality of data, when in fact what we might find instead is terribly nuanced and localized, particularly when that data is contributed by humans who are distributed globally.

The intricacies of Netflix personalization

Netflix suggestion buttonNetflix's recommendation engine is often cited as a premier example of how user data can be mined and analyzed to build a better service. This week, Netflix's Xavier Amatriain and Justin Basilico penned a blog post offering insights into the challenges that the company — and thanks to the Netflix Prize, the data mining and machine learning communities — have faced in improving the accuracy of movie recommendation engines.

The Netflix post raises some interesting questions about how the means of content delivery have changed recommendations. In other words, when Netflix refocused on its streaming product, viewing interests changed (and not just because the selection changed). The same holds true for the multitude of ways in which we can now watch movies via Netflix (there are hundreds of different device options for accessing and viewing content from the service).

Amatriain and Basilico write:

"Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:

Reposted byRK RK

April 10 2012

Open source is interoperable with smarter government at the CFPB

CFPBWhen you look at the government IT landscape of 2012, federal CIOs are being asked to address a lot of needs. They have to accomplish your mission. They need to be able to scale initiatives to tens of thousands of agency workers. They're under pressure to address not just network security but web security and mobile device security. They also need to be innovative, because all of this is supported by the same of less funding. These are common requirements in every agency.

As the first federal "start-up agency" in a generation, some of those needs at the Consumer Financial Protection Bureau (CFPB) are even more pressing. On the other hand, the opportunity for the agency to be smarter, leaner and "open from the beginning" is also immense.

Progress establishing the agency's infrastructure and culture over the first 16 months has been promising, save for larger context of getting a director at the helm. Enabling open government by design isn't just a catchphrase at the CFPB. There has been a bold vision behind the CFPB from the outset, where a 21st century regulator would leverage new technologies to find problems in the economy before the next great financial crisis escalates.

In the private sector, there's great interest right now is finding actionable insight in large volumes of data. Making sense of big data is increasingly being viewed as a strategic imperative in the public sector as well. Recently, the White House put its stamp on that reality with a $200 million big data research and development initiative, including a focus on improving the available tools. There's now an entire ecosystem of software around Hadoop, which is itself open source code. The problem that now exists in many organizations, across the public and private sector, is not so much that the technology to manipulate big data isn't available: it's that the expertise to apply big data doesn't exist in-house. The data science talent shortage is real.

People who work and play in the open source community understand the importance of sharing code, especially when that action leads to improving the code base. That's not necessarily an ethic or a perspective that has been pervasive across the federal government. That does seem to be slowly changing, with leadership from the top: the White House used Drupal for its site and has since contributed modules back into the open source community, including one that helps with 508 compliance.

In an in-person interview last week, CFPB CIO Chris Willey (@ChrisWilleyDC) and acting deputy CIO Matthew Burton (@MatthewBurton) sat down to talk about the agency's new open source policy, government IT, security, programming in-house, the myths around code-sharing, and big data.

The fact that this government IT leadership team is strongly supportive of sharing code back to the open source community is probably the most interesting part of this policy, as Scott Merrill picked up in his post on the CFPB and Github.

Our interview follows.

In addition to being the leader of the CFPB's development team over the past year and half, Burton was just named acting deputy chief information officer. What will that mean?

Willey: He hasn't been leading the software development team the whole time. In fact, we only really had an org chart as of October. In the time that he's been here, Matt has led his team to some amazing things. We're going to talk about a one of them today, but we've also got a great intranet. We've got some great internal apps that are being built and that we've built. We've unleashed one version of the supervision system that helps bank examiners do their work in the field. We've got a lot of faith he's going to do great things.

What it actually means is that he's going to be backing me up as CIO. Even though we're a fairly small organization, we have an awful lot going on. We have 76 active IT projects, for example. We're just building a team. We're actually doubling in size this fiscal year, from about 35 staff to 70, as well as adding lots of contractors. We're just growing the whole pie. We've got 800 people on board now. We're going to have 1,100 on board in the whole bureau by the end of the fiscal year. There's a lot happening, and I recognize we need to have some additional hands and brain cells helping me out.

With respect to building an internal IT team, what's the thinking behind having technical talent inside of an agency like this one? What does that change, in terms of your relationship with technology and your capacity to work?

Burton: I think it's all about experimentation. Having technical people on staff allows an organization to do new things. I think the way most agencies work is that when they have a technical need, they don't have the technical people on staff to make it happen so instead, that need becomes larger and larger until it justifies the contract. And by then, the problem is very difficult to solve.

By having developers and designers in-house, we can constantly be addressing things as they come up. In some cases, before the businesses even know it's a problem. By doing that, we're constantly staying ahead of the curve instead of always reacting to problems that we're facing.

How do you use open source technology to accomplish your mission? What are the tools you're using now?

Willey: We're actually trying to use open source in every aspect of what we do. It's not just in software development, although that's been a big focus for us. We're trying to do it on the infrastructure side as well.

As we look at network and system monitoring, we look at the tools that help us manage the infrastructure. As I've mentioned in the past, we are 100% in the cloud today. Open source has been a big help for us in giving us the ability to manipulate those infrastructures that we have out there.

At the end of the day, we want to bring in the tools that make the most sense for the business needs. It's not about only selecting open source or having necessarily a preference for open source.

What we've seen is that over time, the open source marketplace has matured. A lot of tools that might not have been ready for prime time a year ago or two years ago are today. By bringing them into the fold, we potentially save money. We potentially have systems that we can extend. We could more easily integrate with the other things that we have inside the shop that maybe we built or maybe things that we've acquired through other means. Open source gives us a lot of flexibility because there's a lot of opportunities to do things that we might not be able to do with some proprietary software.

Can you share a couple of specific examples of open source tools that you're using and what you actually use them for within mission?

Willey: On network monitoring, for example, we're using ZFS, which is an open source monitoring tool. We've been working with Nagios as well. Nagios, we actually inherited from Treasury — and while Treasury's not necessarily known for its use of open source technologies, it uses that internally for network monitoring. Splunk is another one that we have been using for web analysis. [After the interview, Burton and Willey also shared that they built the CFPB's intranet on MediaWiki, the software that drives Wikipedia.]

Burton: On the development side, we've invested a lot in Django and WordPress. Our site is a hybrid of them. It's WordPress at its core, with Django on top of that.

In November of 2010, it was actually a few weeks before I started here, Merici [Vinton] called me and said, "Matt, what should we use for our website?"

And I said, "Well, what's it going to do?"

And she said, "At first, it's going to be a blog with a few pages."

And this website needed to be up and running by February. And there was no hosting; there was nothing. There were no developers.

So I said, "Use WordPress."

And by early February, we had our website up. I'm not sure that would have been possible if we had to go through a lengthy procurement process for something not open source.

We use a lot of jQuery. We use Linux servers. For development ops, we use Selenium and Jenkins and Git to manage our releases and source code. We actually have GitHub Enterprise, which although not open source, is very sharing-focused. It encourages sharing internally. And we're using GitHub on the public side to share our code. It's great to have the same interface internally as we're using externally.

Developers and citizens alike can go to github.com/cfpb and see code that you've released back to the public and for other federal agencies. What projects are there?

Burton: These are the ones that came up between basic building blocks. They range from code that may not strike an outside developer as that interesting but that's really useful for the government, all the way to things that we created from scratch that are very developer-focused and are going to be very useful for any developer.

On the first side of that spectrum, there's an app that we made for transit subsidy involvement. Treasury used to manage our transit subsidy balances. That involved going to a webpage that you would print out, write into with a pen and then fax to someone.

Willey: Or scan and email it.

Burton: Right. And then once you'd had your supervisor sign it, faxed it over to someone, eventually, several weeks later, you would get your benefits. We started to take over that process and the human resources office came to us and asked, "How can we do this better?"

Obviously, that should just be a web form that you type into, that will auto fill any detail it knows about you. You press submit and it goes into the database, which goes directly to the DOT [Department of Transportation]. So that's what we made. We demoed that for DOT and they really like it. USAID is also into it. It's encouraging to see that something really simple could prove really useful for other agencies.

On the other side of the spectrum, we use a lot of Django tools. As an example, we have a tool we just released through our website called "Ask CFPB." It's a Django-based question and answer tool, with a series of questions and answers.

Now, the content is managed in Django. All of the content is managed from our staging server behind the firewall. When we need to get that content, we need to get the update from staging over to production.

Before, what we had to do was pick up the entire database, copy it and them move it over to production, which was kind of a nightmare. And there was no Django tool for selectively moving data modifications.

So we sat there and we thought, "Oh, we really need something to do that because we're going to be doing a lot of that. We can't be copying the database over every time we need to correct a copy. So two of our developers developed a Django app called "Nudge." Basically, you go into a Django and if you've ever seen a Django admin, you just go into it and assess, "Hey, here's everything that's changed. What do you want to move over?"

You can pick and choose what you want to move over and, with the click of a button, it goes to production. I think that's something that every Django developer will have a use for if they have a staging server.

In a way, we were sort of surprised it didn't exist. So, we needed it. We built it. Now we're giving it back and anybody in the world can use it.

You mentioned the cloud. I know that CFPB is very associated with Treasury. Are you using Treasury's FISMA moderate cloud?

Willey: We have a mix of what I would say are private and public clouds. On the public side, we're using our own cloud environments that we have established. On the private side, we are using Treasury for some of our apps. We're slowly migrating off of treasury systems onto our own cloud infrastructure or our own cloud.

In the case of email, for example, we're looking at email as a service. So we'll be looking at Google, Microsoft and others just to see what's out there and what we might be able to use.

Why is it important for the CFPB to share code back to the public? And who else in the federal government has done something like this, aside from the folks at the White House?

Burton:: We see it the same way that we believe the rest of the open source community sees it: The only way this stuff is going to get better and become more viable is if people share. Without that, then it'll only be hobbyists. It'll only be people who build their own little personal thing. Maybe it's great. Maybe it's not. Open source gets better by the community actually contributing to it. So it's self-interest in a lot of ways. If the tools get better, then what we have available to us is, therefore, gets better. We can actually do our mission better.

Using the transit subsidy enrollment application example, it's also an opportunity for government to help itself, for one agency to help another. We've created this thing. Every federal agency has a transit subsidy program. They all need to allow people to enroll in it. Therefore, it's immediately useful to any other agency in the federal government. That's just a matter of government improving its own processes.

If one group does it, why should another group have to figure it out or have to pay lots of money to have it figured out? Why not just share it internally and then everybody benefits?

Why do you think it's taken until 2012 to have that insight actually be made into reality in terms of a policy?

Burton: I think to some degree, the tools have changed. The ability to actually do this easily is a lot better now than it was even a year or two ago. Government also traditionally lags behind the private sector in a lot of ways. I think that's changing, too. With this administration in particular, I think what we've seen is that government has started to become a little bit on parity with the private sector, including some of the thinking around how to use technology to improve business processes. That's really exciting. And I think as a result, there are a lot of great people coming in as developers and designers who want to work in the federal government because they see that change.

Willey: It's also because we're new. There are two things behind that. First, we're able to sort of craft a technology philosophy with a modern perspective. So we can, from our founding, ask "What is the right way to do this?" Other agencies, if they want to do this, have to turn around decades of culture. We don't have that burden. I think that's a big reason why we're able to do this.

The second thing is a lot of agencies don't have the intense need that we do. We have 76 projects to do. We have to use every means available to us.

We can't say, "We're not going to use a large share of the software that's available to us." That's just not an option. We have to say, "Yes, we will consider this as a commercial good, just like any other piece of proprietary software."

In terms of the broader context for technology and policy, how does open source relate to open government?

Willey: When I was working for the District, Apps for Democracy was a big contest that we did around opening data and then asking developers to write applications using that data that could then be used by anybody. We said that the next logical step was to sort of create more participatory government. And in my mind, open sourcing the projects that we do is a way of asking the citizenry to participate in the active government.

So by putting something in the public space, somebody could pick that up. Maybe not the transit subsidy enrollment project — but maybe some other project that we've put out there that's useful outside of government as well as inside of government. Somebody can pick that code up, contribute to it and then we benefit. In that way, the public is helping us make government better.

When you have conversations around open source in government, what do you say about what it means to put your code online and to have people look at it or work on it? Can you take changes that people make to the code base to improve it and then use it yourself?

Willey: Everything that we put out there will be reviewed by our security team. The goal is that, by the time it's out there, not to have any security vulnerabilities. If someone does discover a security vulnerability, however, we'll be sharing that code in a way that makes it much more likely that someone will point it out to us and maybe even provide a fix than they will exploit it because it's out there. They wouldn't be exploiting our instance of the code; they would be working with the code on Github.com.

I've seen people in government with a misperception of what open source means. They hear that it's code that anyone can contribute to. I think that they don't understand that you're controlling your own instance of it. They think that anyone can come along and just write anything into your code that they like. And, of course, it's not like that.

I think as we talk more and more about this to other agencies, we might run into that, but I think it'll be good to have strong advocates in government, especially on the security side, who can say, "No, that's not the case; it doesn't work that way."

Burton: We have a firewall between our public and private instances at Git as well. So even if somebody contributes code, that's also reviewed on the way in. We wouldn't implement it unless we made sure that, from a security perspective, the code was not malicious. We're taking those precautions as well.

I can't point to one specifically, but I know that there have been articles and studies done on the relative security of open source. I think the consensus in the industry is that the peer review process of open source actually helps from a security perspective. It's not that you have a chaos of people contributing code whenever they want to. It improves the process. It's like the thinking behind academic papers. You do peer review because it enhances the quality of the work. I think that's true for open source as well.

We actually want to create a community of peer reviewers of code within the federal government. As we talk to agencies, we want people to actually use the stuff we build. We want them to contribute to it. We actually want them to be a community. As each agency contributes things, the other agencies can actually review that code and help each other from that perspective as well.

It's actually fairly hard. As we build more projects, it's going to put a little bit of a strain on our IT security team, doing an extra level of scrutiny to make sure that the code going out is safe. But the only way to get there is to grow that pie. And I think by talking with other agencies, we'll be able to do that.

A classic open source koan is that "with many eyes, all bugs become shallow." In IT security, is it that with many eyes, all worms become shallow?

Burton: What the Department of Defense said was if someone has malicious intent and the code isn't available, they'll have some way of getting the code. But if it is available and everyone has access to it, then any vulnerabilities that are there are much more likely to be corrected than before they're exploited.

How do you see open source contributing to your ability to get insights from large amounts of data? If you're recruiting developers, can they actually make a difference in helping their fellow citizens?

Burton: It's all about recruiting. As we go out and we bring on data people and software developers, we're looking for that kind of expertise. We're looking for people that have worked with PostgreSQL. We're looking for people that have worked with Solar. We're looking for people that have worked with Hadoop, because then we can start to build that expertise in-house. Those tools are out there.

R is an interesting example. What we're finding is that as more people are coming out of academia into the professional world, they're actually used to using R in school. And then they have to come out and learn a different tool and they're actually working in the marketplace.

It's similar with the Mac versus the PC. You get people using the Mac in college — and suddenly they have to go to a Windows interface. Why impose that on them? If they're going to be extremely productive with a tool like R, why not allow that to be used?

We're starting to see, in some pockets of the bureau, push from the business side to actually use some of these tools, which is great. That's another change I think that's happened in the last couple of years.

Before, there would've been big resistance on that kind of thing. Now that we're getting pushed a little bit, we have to respond to that. We also think it's worth it that we do.

Related:

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.