
June 07 2012

Strata Week: Data prospecting with Kaggle

Here are a few of the data stories that caught my attention this week:

Prospecting for data

The data science competition site Kaggle is extending its features with a new service called Prospect. Prospect allows companies to submit a data sample to the site without having a pre-ordained plan for a contest. In turn, the data scientists using Kaggle can suggest ways in which machine learning could best uncover new insights and answer less-obvious questions — and what sorts of data competitions could be based on the data.

As GigaOm's Derrick Harris describes it: "It's part of a natural evolution of Kaggle from a plucky startup to an IT company with legs, but it's actually more like a prequel to Kaggle's flagship predictive modeling competitions than it is a sequel." It's certainly a good way for companies to get their feet wet with predictive modeling.

Practice Fusion, a web-based electronic health records system for physicians, has launched the inaugural Kaggle Prospect challenge.

HP's big data plans

Last year, Hewlett Packard made a move away from the personal computing business and toward enterprise software and information management. It's a move that was marked in part by the $10 billion it paid to acquire Autonomy. Now we know a bit more about HP's big data plans for its Information Optimization Portfolio, which has been built around Autonomy's Intelligent Data Operating Layer (IDOL).

ReadWriteWeb's Scott M. Fulton takes a closer look at HP's big data plans.

The latest from Cloudera

Cloudera released a number of new products this week: Cloudera Manager 3.7.6; Hue 2.0.1; and of course CDH 4.0, its Hadoop distribution.

CDH 4.0 includes:

"... high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations."

Social data platform DataSift also announced this week that it was powering its Hadoop clusters with CDH to perform the "Big Data heavy lifting to help deliver DataSift's Historics, a cloud-computing platform that enables entrepreneurs and enterprises to extract business insights from historical public Tweets."

Have data news to share?

Feel free to email us.

OSCON 2012 Data Track — Today's system architectures embrace many flavors of data: relational, NoSQL, big data and streaming. Learn more in the Data track at OSCON 2012, being held July 16-20 in Portland, Oregon.

Save 20% on registration with the code RADAR

Related:

May 31 2012

Strata Week: MIT and Massachusetts bet on big data

Here are a few of the big data stories that caught my attention this week.

MIT makes a big data push

MIT unveiled its big data research plans this week with a new initiative: bigdata@csail. CSAIL is the university's Computer Science and Artificial Intelligence Laboratory. According to the initiative's website, the project will "identify and develop the technologies needed to solve the next generation data challenges which require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide."

The research will be funded in part by Intel, which will contribute $2.5 million per year for up to five years. As part of the announcement, Massachusetts Governor Deval Patrick added that his state was forming a Massachusetts Big Data initiative that would provide matching grants for big data research, something he hopes will make the state "well-known for big data research."

Cisco's predictions for the Internet

Cisco released its annual forecast for Internet networking. Not surprisingly, Cisco projects massive growth in networking, with annual global IP traffic reaching 1.3 zettabytes by 2016. "The projected increase of global IP traffic between 2015 and 2016 alone is more than 330 exabytes," according to the company's press release, "which is almost equal to the total amount of global IP traffic generated in 2011 (369 exabytes)."

Cisco points to a number of factors contributing to the explosion, including more Internet-connected devices, more users, faster Internet speeds, and more video.

Open data startup Junar raises funding

The Chilean data startup Junar announced this week that it had raised a seed round of funding. The startup is an open data platform with the goal of making it easy for anyone to collect, analyze, and publish data. GigaOm's Barb Darrow writes:

"Junar's Open Data Platform promises to make it easier for users to find the right data (regardless of its underlying format); enhance it with analytics; publish it; enable interaction with comments and annotation; and generate reports. Throughout the process it also lets user manage the workflow and track who has accessed and downloaded what, determine which data sets are getting the most traction etc."

Junar joins a number of open data startups and marketplaces that offer similar or related services, including Socrata and DataMarket.

Have data news to share?

Feel free to email me.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR

May 24 2012

Strata Week: Visualizing a better life

Here are a few of the data stories that caught my attention this week:

Visualizing a better life

How do you compare the quality of life in different countries? As The Guardian's Simon Rogers points out, GDP has commonly been the indicator used to show a country's economic strength, but it's insufficient for comparing the quality of life and happiness of people.

To help build a better picture of what quality of life means to people, the Organization for Economic Cooperation and Development (OECD) built the Your Better Life Index. The index lets people select the things that matter to them: housing, income, jobs, community, education, environment, governance, health, life satisfaction, safety and work-life balance. The OECD launched the tool last year and offered an update this week, adding data on gender and inequality.

Screenshot from OECD's Your Better Life Index.

"It's counted as a major success by the OECD," writes Rogers, "particularly as users consistently rank quality of life indicators such as education, environment, governance, health, life satisfaction, safety and work-life balance above more traditional ones. Designed by Moritz Stefaner and Raureif, it's also rather beautiful."

The countries that come out on top most often based on users' rankings: "Denmark (life satisfaction and work-life balance), Switzerland (health and jobs), Finland (education), Japan (safety), Sweden (environment), and the USA (income)."
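For a concrete sense of how a user-weighted index like this works, here is a minimal Python sketch that combines normalized indicator scores with user-chosen weights. It illustrates the general idea only; the scores below are made up, and this is not the OECD's actual methodology.

    # Sketch of a user-weighted well-being index. Indicator scores are assumed
    # to be normalized to a 0-10 scale; the numbers below are invented, and
    # this is not the OECD's actual methodology.

    def better_life_score(indicator_scores, user_weights):
        """Weighted average over the topics the user chose to weight."""
        total_weight = sum(user_weights.values())
        return sum(indicator_scores[topic] * weight
                   for topic, weight in user_weights.items()) / total_weight

    # Illustrative indicator scores for one country:
    country = {"housing": 6.5, "income": 4.0, "education": 8.2,
               "environment": 7.1, "work_life_balance": 8.8}

    # A user who cares most about education and work-life balance:
    weights = {"education": 5, "work_life_balance": 5, "income": 1}

    print(round(better_life_score(country, weights), 2))  # 8.09

Because each user supplies their own weights, the same underlying country data can produce very different rankings from one visitor to the next, which is exactly how the OECD describes people using the tool.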

Researchers' access to data

The New York Times' John Markoff examines social science research and the growing problem of datasets that are not made available to other scholars. Opening data helps make sure that research results can be verified. But Markoff suggests that in many cases, data is being kept private and proprietary.

Much of the data he's talking about here is:

"... gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers' privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud."

"The debate will only intensify as large companies with deep pockets do more research about their users," Markoff predicts.

Updates to Hadoop

Apache has released the alpha version of Hadoop 2.0.0. We should stress "alpha" here; as Hortonworks' Arun Murthy notes, it's "not ready to run in production." However, he adds that the update "is still an important step forward, as it represents the very first release that delivers new and important capabilities," including HDFS HA (manual failover) and next-generation MapReduce.

In other Hadoop news, MapR has unveiled a series of new features and initiatives for its Hadoop distribution, including release of a fully compliant ODBC 3.52 driver, support for the Linux Pluggable Authentication Modules (PAM), and the availability of the source code for several of its components.

Have data news to share?

Feel free to email me.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR


Related:


May 17 2012

Strata Week: Google unveils its Knowledge Graph

Here's what caught my attention in the data space this week.

Google's Knowledge Graph

Google Knowledge Graph"Google does the semantic Web," says O'Reilly's Edd Dumbill, "except they call it the Knowledge Graph." That Knowledge Graph is part of an update to search that Google unveiled this week.

"We've always believed that the perfect search engine should understand exactly what you mean and give you back exactly what you want," writes Amit Singhal, Senior VP of Engineering, in the company's official blog post.

That post makes no mention of the semantic web, although as ReadWriteWeb's Jon Mitchell notes, the Knowledge Graph certainly relies on it, following on and developing from Google's acquisition of the semantic database Freebase in 2010.

Mitchell describes the enhanced search features:

"Most of Google users' queries are ambiguous. In the old Google, when you searched for "kings," Google didn't know whether you meant actual monarchs, the hockey team, the basketball team or the TV series, so it did its best to show you web results for all of them.

"In the new Google, with the Knowledge Graph online, a new box will come up. You'll still get the Google results you're used to, including the box scores for the team Google thinks you're looking for, but on the right side, a box called "See results about" will show brief descriptions for the Los Angeles Kings, the Sacramento Kings, and the TV series, Kings. If you need to clarify, click the one you're looking for, and Google will refine your search query for you."

Yahoo's fumbles

The news from Yahoo hasn't been good for a long time now, with the most recent troubles involving the departure of newly appointed CEO Scott Thompson over the weekend and a scathing blog post this week by Gizmodo's Mathew Honan titled "How Yahoo Killed Flickr and Lost the Internet." Ouch.

Over on GigaOm, Derrick Harris wonders if Yahoo "sowed the seeds of its own demise with Hadoop." While Hadoop has long been pointed to as a shining innovation from Yahoo, Harris argues that:

"The big problem for Yahoo is that, increasingly, users and advertisers want to be everywhere on the web but at Yahoo. Maybe that's because everyone else that's benefiting from Hadoop, either directly or indirectly, is able to provide a better experience for consumers and advertisers alike."

De-funding data gathering

The appropriations bill that recently passed the U.S. House of Representatives axes funding for the Economic Census and the American Community Survey. The former gathers data about 25 million businesses and 1,100 industries in the U.S., while the latter collects data from three million American households every year.

Census Bureau director Robert Groves writes that the bill "devastates the nation's statistical information about the status of the economy and the larger society." BusinessWeek chimes in that the end to these surveys "blinds business," noting that businesses rely "heavily on it to do such things as decide where to build new stores, hire new employees, and get valuable insights on consumer spending habits."

Got data news to share?

Feel free to email me.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR


Related:


May 10 2012

Strata Week: Big data boom and big data gaps

Here are a few of the data stories that caught my attention this week.

Big data booming

The call for speakers for Strata New York has closed, but as Edd Dumbill notes, the number of proposals is a solid indication of the booming interest in big data. The first Strata conference, held in California in 2011, elicited 255 proposals. The following event in New York elicited 230. The most recent Strata, held in California again, had 415 proposals. And the number received for Strata's fall event in New York? That came in at 635.

Edd writes:

"That's some pretty amazing growth. I can thus expect two things from Strata New York. My job in putting the schedule together is going to be hard. And we're going to have the very best content around."

The increased popularity of the Strata conference is just one data point from the week that highlights a big data boom. Here's another: According to a recent report by IDC, the "worldwide ecosystem for Hadoop-MapReduce software is expected to grow at a compound annual rate of 60.2 percent, from $77 million in revenue in 2011 to $812.8 million in 2016."

"Hadoop and MapReduce are taking the software world by storm," says IDC's Carl Olofson. Or as GigaOm's Derrick Harris puts it: "All aboard the Hadoop money train."

A big data gap?

Another report released this week reins in some of the exuberance about big data. This report comes from the government IT network MeriTalk, and it points to a "big data gap" in the government — that is, a gap between the promise of big data and the federal government's capability to make use of it. That finding is particularly notable in light of the Obama administration's recent $200 million commitment to a multi-agency federal big data initiative.

Among the MeriTalk report's findings: 60% of government IT professionals say their agency is analyzing the data it collects and less than half (40%) are using data to make strategic decisions. Those responding to the survey said they felt as though it would take, on average, three years before their agencies were ready to fully take advantage of big data.

Prismatic and data-mining the news

The largest-ever healthcare fraud scheme was uncovered this past week. Arrests were made in seven cities — some 107 doctors, nurses and social workers were charged, with fraudulent Medicare claims totaling about $452 million. The discoveries about the fraudulent behavior were made thanks in part to data-mining — looking for anomalies in the Medicare filings made by various health care providers.

Prismatic penned a post in which it makes the case for more open data so that there's "less friction" in accessing the sort of information that led to this sting operation.

"Both the recent sting and the Prime case show that you need real journalists and investigators working with technology and data to achieve good results. The challenge now is to scale this recipe and force transparency on a larger scale.

"We need to get more technically sophisticated and start analysing the data sets up front to discover the right questions to ask, not just the answer the questions we already know to ask based on up-front human investigation. If we have to discover each fraud ring or singleton abuse as a one-off case, we'll never be able to wipe out fraud on a large enough scale to matter."

Indeed, despite this being the largest bust ever, it's really just a fraction of the estimated $20 to $100 billion a year in Medicare fraud.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:

May 03 2012

Strata Week: Google offers big data analytics

Here are the data stories that caught my attention this week.

BigQuery for everyone

Google has released its big data analytics service BigQuery to the public. Initially made available to a small number of developers late last year, the service is now open to anyone who signs up. A free account lets you query up to 100 GB of data per month, with the option to pay for additional queries and/or storage.

"Google's aim may be to sell data storage in the cloud, as much as it is to sell analytic software," says The New York Times' Quentin Hardy. "A company using BigQuery has to have data stored in the cloud data system, which costs 12 cents a gigabyte a month, for up to two terabytes, or 2,000 gigabytes. Above that, prices are negotiated with Google. BigQuery analysis costs 3.5 cents a gigabyte of data processed."

The interface for BigQuery is meant to lower the bar for these sorts of analytics — there's a UI and a REST interface. In the Times article, Google project manager Ju-kay Kwek says Google is hoping developers build tools that encourage widespread use of the product by executives and other non-developers.

If folks are looking for something to cut their teeth on with BigQuery, GitHub's public timeline is now a publicly available dataset. The data is being synced regularly, so you can query things like popular languages and popular repos. To that end, GitHub is running a data visualization contest.
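For a sense of what such a query might look like, here is a sketch using the present-day google-cloud-bigquery Python client; the table reference is illustrative rather than the exact 2012 dataset path, and the original announcement predates this client library.

    # Sketch: counting popular languages in a public GitHub dataset on BigQuery.
    # The table reference below is illustrative, and this uses the modern
    # google-cloud-bigquery client, not the interface available in 2012.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes application default credentials

    sql = """
        SELECT repository_language AS language, COUNT(*) AS events
        FROM `bigquery-public-data.samples.github_timeline`  -- illustrative table
        WHERE repository_language IS NOT NULL
        GROUP BY language
        ORDER BY events DESC
        LIMIT 10
    """

    for row in client.query(sql).result():
        print(row.language, row.events)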

The Data Journalism Handbook

The Data Journalism Handbook had its release this week at the 2012 International Journalism Festival in Italy. The book, which is freely available and openly licensed, was a joint effort of the European Journalism Centre and the Open Knowledge Foundation. It's meant to serve as a reference for those interested in the field of data journalism.

In the introduction, "Deutsche Welle's" Mirko Lorenz writes:

"Today, news stories are flowing in as they happen, from multiple sources, eye-witnesses, blogs, and what has happened is filtered through a vast network of social connections, being ranked, commented and more often than not, ignored. This is why data journalism is so important. Gathering, filtering and visualizing what is happening beyond what the eye can see has a growing value."


Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.



Save 20% on registration with the code RADAR20

Open data is a joke?

Tom Slee fired a shot across the bow of the open data movement with a post this week arguing that "the open data movement is a joke." Moreover, it's not a movement at all, he contends. Slee turns a critical eye to the Canadian government's open data efforts in particular, noting that: "The Harper government's actions around 'open government,' and the lack of any significant consequences for those actions, show just how empty the word 'open' has become."

Slee is also critical of open data efforts outside the government, calling the open data movement "a phrase dragged out by media-oriented personalities to cloak a private-sector initiative in the mantle of progressive politics."

Open data activist David Eaves responded strongly to Slee's post with one of his own, acknowledging his own frustrations with "one of the most — if not the most — closed and controlling [governments] in Canada's history." But Eaves takes exception to the way Slee characterizes the open data movement. He contends that many of the corporations involved with the open data movement — something Slee charges has corrupted open data — are U.S. corporations (and points out that in Canada, "most companies don't even know what open data is"). Eaves adds, too, that many of these corporations are led by geeks.

Eaves writes:

"Just as an authoritarian regime can run on open-source software, so too might it engage in open data. Open data is not the solution for Open Government (I don't believe there is a single solution, or that Open Government is an achievable state of being — just a goal to pursue consistently), and I don't believe anyone has made the case that it is. I know I haven't. But I do believe open data can help. Like many others, I believe access to government information can lead to better informed public policy debates and hopefully some improved services for citizens (such as access to transit information). I'm not deluded into thinking that open data is going to provide a steady stream of obvious 'gotcha moments' where government malfeasance is discovered, but I am hopeful that government data can arm citizens with information that the government is using to inform its decisions so that they can better challenge, and ultimately help hold accountable, said government."

Got data news?

Feel free to email me.

Related:

April 12 2012

Strata Week: Add structured data, lose local flavor?

Here are a few of the data stories that caught my attention this week:

A possible downside to Wikidata

Screenshot from the Wikidata Data Model page.

The Wikimedia Foundation — the good folks behind Wikipedia — recently proposed a Wikidata initiative. It's a new project that would build out a free secondary database to collect structured data that could provide support in turn for Wikipedia and other Wikimedia projects. According to the proposal:

"Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata, you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today."

But in The Atlantic this week, Mark Graham, a research fellow at the Oxford Internet Institute, takes a look at the proposal, warning of "changes that have worrying connotations for the diversity of knowledge in the world's sixth most popular website." Graham points to the different language editions of Wikipedia, noting that the encyclopedic knowledge contained therein is highly diverse. "Not only does each language edition include different sets of topics, but when several editions do cover the same topic, they often put their own, unique spin on the topic. In particular, the ability of each language edition to exist independently has allowed each language community to contextualize knowledge for its audience."

Graham fears that emphasizing a standardized, machine-readable, semantic-oriented Wikipedia will lose this local flavor:

"The reason that Wikidata marks such a significant moment in Wikipedia's history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is that fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups."

His arguments raise questions about the perceived universality of data, when what we might find instead is knowledge that is deeply nuanced and localized, particularly when that data is contributed by humans who are distributed globally.

The intricacies of Netflix personalization

Netflix's recommendation engine is often cited as a premier example of how user data can be mined and analyzed to build a better service. This week, Netflix's Xavier Amatriain and Justin Basilico penned a blog post offering insights into the challenges that the company — and thanks to the Netflix Prize, the data mining and machine learning communities — have faced in improving the accuracy of movie recommendation engines.

The Netflix post raises some interesting questions about how the means of content delivery have changed recommendations. In other words, when Netflix refocused on its streaming product, viewing interests changed (and not just because the selection changed). The same holds true for the multitude of ways in which we can now watch movies via Netflix (there are hundreds of different device options for accessing and viewing content from the service).

Amatriain and Basilico write:

"Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:


April 05 2012

Strata Week: New life for an old census

Here are a few of the data stories that caught my attention this week

Now available in digital form: The 1940 census

The National Archives released the 1940 U.S. Census records on Monday, after a mandatory 72-year waiting period. The release marks the single largest collection of digital information ever made available online by the agency.

Screenshot from the digital version of the 1940 Census, available through Archives.com.

The 1940 Census, conducted as a door-to-door survey, included questions about age, race, occupation, employment status, income, and participation in New Deal programs — all important (and intriguing) following the previous decade's Great Depression. One data point: in 1940, there were 5.1 million farmers. According to the 2010 American Community Survey (not the census, mind you), there were just 613,000.

The ability to glean these sorts of insights proved to be far more compelling than the National Archives anticipated, and the website hosting the data, Archives.com, was temporarily brought down by the traffic load. The site is now up, so anyone can investigate the records of approximately 132 million Americans. The records are searchable by map — or rather, "the appropriate enumeration district" — but not by name.

A federal plan for big data

The Obama administration unveiled its "Big Data Research and Development Initiative" late last week, with more than $200 million in financial commitments. Among the White House's goals: to "advance state-of-the-art core technologies needed to collect, store, preserve, manage, analyze, and share huge quantities of data."

The new big data initiative was announced with a number of departments and agencies already on board with specific plans, including grant opportunities from the Department of Defense and the National Science Foundation, new spending on DARPA's XDATA program to build new computational tools, and open data initiatives such as the 1000 Genomes Project.

"In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use big data for scientific discovery, environmental and biomedical research, education, and national security," said Dr. John P. Holdren, assistant to the President and director of the White House Office of Science and Technology Policy in the official press release (PDF).

Personal data and social context

When the Girls Around Me app was released, using data from Foursquare and Facebook to notify users when there were females nearby, many commentators called it creepy. "Girls Around Me is the perfect complement to any pick-up strategy," the app's website once touted. "And with millions of chicks checking in daily, there's never been a better time to be on the hunt."

"Hunt" is an interesting choice of words here, and the Cult of Mac, among other blogs, asked if the app was encouraging stalking. Outcry about the app prompted Foursquare to yank the app's API access, and the app's developers later pulled the app voluntarily from the App Store.

Many of the responses to the app raised issues about privacy and user data, and questioned whether women in particular should be extra cautious about sharing their information with social networks. But as Amit Runchal writes in TechCrunch, this response blames the victims:

"You may argue, the women signed up to be a part of this when they signed up to be on Facebook. No. What they signed up for was to be on Facebook. Our identities change depending on our context, no matter what permissions we have given to the Big Blue Eye. Denying us the right to this creates victims who then get blamed for it. 'Well,' they say, 'you shouldn't have been on Facebook if you didn't want to ...' No. Please recognize them as a person. Please recognize what that means."

Writing here at Radar, Mike Loukides expands on some of these issues, noting that the questions are always about data and social context:

"It's useful to imagine the same software with a slightly different configuration. Girls Around Me has undeniably crossed a line. But what if, instead of finding women, the app was Hackers Around Me? That might be borderline creepy, but most people could live with it, and it might even lead to some wonderful impromptu hackathons. EMTs Around Me could save lives. I doubt that you'd need to change a single line of code to implement either of these apps, just some search strings. The problem isn't the software itself, nor is it the victims, but what happens when you move data from one context into another. Moving data about EMTs into context where EMTs are needed is socially acceptable; moving data into a context that facilitates stalking isn't acceptable, and shouldn't be."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:

March 29 2012

Strata Week: The allure of a data haven

Here are a few of the data stories that caught my attention this week:

Sealand's siren song

Ars Technica's James Grimmelmann examines the recent history of the Principality of Sealand, a World War II anti-aircraft platform located six miles off the coast of England. Some reports claim Wikileaks is looking to relocate its servers there, ostensibly out of reach of legal threats and government interference. Why Sealand? It claims it's an independent nation, and as such it "sounds perfect for WikiLeaks: a friendly, legally unassailable host with an anything-goes attitude," writes Grimmelmann.

But as Grimmelmann notes, Sealand's history isn't exactly the "cryptographers' paradise" one might expect. In the early 2000s, a company called HavenCo set up shop there with a "no-questions-asked colocation" facility. Dandy in theory, but not in practice. The endeavor was never remotely successful, and the company spiraled downward, eventually becoming "nationalized" by Sealand. "HavenCo no longer had real technical experts or the competitive advantage of being willing to host legally risky content," Grimmelmann writes. "What it did have was an absurdly inefficient cost structure. Every single piece of equipment, drop of fuel, and scrap of food had to be brought in by boat or helicopter. By 2006, 'Sealand' hosting was in a London data center. By 2008, even the HavenCo website was offline."

It's a fascinating story about the promises of data havens and the long arm of the law. It's also a cautionary tale for Wikileaks, suggests Grimmelmann. "Sealand isn't going to save WikiLeaks any more than putting the site's servers in a former nuclear bunker would. The legal system figured out a long time ago that throwing the account owner in jail works just as well as seizing the server."

ThinkUp reboots

ThinkUp, one of the flagship products from the non-profit Expert Labs, will get a reboot as a for-profit company, write founders Gina Trapani and Anil Dash. The ThinkUp app is an open source tool that allows users to store, search and analyze all their social media activity (posts to Facebook, Twitter, Google+, etc.).

It's a simple tool, says Dash:

"But what ThinkUp represents is a lot of important concepts: Owning your actions and words on the web. Encouraging more positive and fruitful conversations on social networks. Gaining insights into ourselves and our friends based on what we say and share. And the possibility of discovering important information or different perspectives if we can return the web back to its natural state of not being beholden to any one company or proprietary network."

ThinkUp will remain open source but it will evolve to include an "easy-to-use product with mainstream appeal," says Trapani. Expert Labs will be winding down, but the new company that has grown out of it will share many parts of the organization's original mission.

Factual's Gil Elbaz profiled in The New York Times

With the headline "Just the Facts, Yes All of Them," The New York Times profiles Gil Elbaz, the founder of the data startup Factual. "The world is one big data problem," Elbaz tells journalist Quentin Hardy.

"Data has always been seen as just a side effect in computing, something you look up while you are doing work," Elbaz says in the Times piece. "We see it as a whole separate layer that everyone is going to have to tap into, data you want to solve a problem, but that you might not have yourself, and completely reliable."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Principality of Sealand coat of arms via Wikimedia Commons.

Got data news?

Feel free to email me.

Related:

March 22 2012

Strata Week: Machine learning vs domain expertise

Here are a few of the data stories that caught my attention this week:

Debating the future of subject area expertise

The "Data Science Debate" panel at Strata California 2012. Watch the debate.

The Oxford-style debate at Strata continues to be one of the most-talked-about events from the conference. This week, it's O'Reilly's Mike Loukides who weighs in with his thoughts on the debate, which had the motion "In data science, domain expertise is more important than machine learning skill." (For those that weren't there, the machine learning side "won." See Mike Driscoll's summary and full video from the debate.)

Loukides moves from the unreasonable effectiveness of data to examine the "unreasonable necessity of subject experts." He writes that:

"Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes 'unreasonably effective' through the conversation that takes place after the numbers have been crunched ... We can only take our inexplicable results at face value if we're just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they're based. And that's the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems."

Microsoft hires former Yahoo chief scientist

Microsoft has hired Raghu Ramakrishnan as a technical fellow for its Server and Tools Business (STB), reports ZDNet's Mary Jo Foley. According to his new company bio, Ramakrishnan's work will involve "big data and integration between STB's cloud offerings and the Online Services Division's platform assets."

Ramakrishnan comes to Microsoft from Yahoo, where he's been the chief scientist for three divisions — Audience, Cloud Platforms and Search. As Foley notes, Ramakrishnan's move is another indication that Microsoft is serious about "playing up its big data assets." Strata chair Edd Dumbill examined Microsoft's big data strategy earlier this year, noting in particular its work on a Hadoop distribution for Windows server and Azure.

Analyzing the value of social media data

How much is your data worth? The Atlantic's Alexis Madrigal does a little napkin math based on figures from the Internet Advertising Bureau to come up with a broad and ambiguous range between half a cent and $1,200 — depending on how you decide to make the calculation, of course.

In an effort to make those measurements easier and more useful, Google unveiled some additional reports as part of its Analytics product this week. It's a move Google says will help marketers:

"... identify the full value of traffic coming from social sites and measure how they lead to direct conversions or assist in future conversions; understand social activities happening both on and off of your site to help you optimize user engagement and increase social key performance indicators (KPIs); and make better, more efficient data-driven decisions in your social media marketing programs."

Engagement and conversion metrics for each social network will now be trackable through Google Analytics. Partners for the new Social Data Hub include Disqus, Echo, Reddit, Diigo, and Digg, among others.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:

March 15 2012

Strata Week: Infographics for all

Here are some of the data stories that caught my attention this week.

More infographics incoming, thanks to Visual.ly Create

The visualization site Visual.ly launched a new tool this week that helps users create their own infographics. Aptly called Visual.ly Create, the new feature lets people take publicly available datasets (such as information from a Twitter hashtag), select a template, and publish their own infographics.

Segment from a Visual.ly Create infographic of the #strataconf hashtag.

As GigaOm's Derrick Harris observes, it's fairly easy to spot the limitations with this service — in the data you can use, in the templates that are available, and in the visualizations that are created. But after talking to Visual.ly's co-founder and Chief Content Officer Lee Sherman about some "serious customization options" that are in the works, Harris wonders if a tool like this could be something to spawn interest in data science:

"The problem is that we need more people with math skills to meet growing employer demand for data scientists and data analysts. But how do you get started caring about data in the first place when the barriers are so high? Really working with data requires a deep understanding of both math and statistics, and Excel isn't exactly a barrel of monkeys (nor are the charts it creates)."

Could Visual.ly be an on-ramp for more folks to start caring about and playing with data?

San Francisco upgrades its open data initiative

Late last week, San Francisco Mayor Ed Lee unveiled the new data.SFgov.org, a cloud-based open data website that will replace DataSF.org, one of the earliest examples of civic open data initiatives.


"By making City data more accessible to the public secures San Francisco's future as the world's first 2.0 City," said Lee in an announcement. "It's only natural that we move our Open Data platform to the cloud and adopt modern open interface to facilitate that flow and access to information and develop better tools to enhance City services."

The city's Chief Innovation Officer Jay Nath told TechCrunch that the update to the website expands access to information while saving the city money.

The new site contains some 175 datasets, including map-based crime data, active business listings, and various financial datasets. It's powered by the Seattle-based data startup Socrata.
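Datasets on Socrata-backed portals like this one are typically exposed over Socrata's SODA API as JSON endpoints. The sketch below shows the general pattern; the dataset identifier is hypothetical, and real identifiers come from the portal's catalog.

    # Sketch: pulling a few rows from a Socrata-backed portal over the SODA API.
    # The dataset identifier "abcd-1234" is hypothetical; browse the portal's
    # catalog for real ones.
    import requests

    url = "https://data.sfgov.org/resource/abcd-1234.json"
    rows = requests.get(url, params={"$limit": 5}, timeout=30).json()

    for row in rows:
        print(row)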

The personal analytics of Stephen Wolfram

"One day I'm sure everyone will routinely collect all sorts of data about themselves," writes Mathematica and Wolfram Alpha creator Stephen Wolfram. "But because I've been interested in data for a very long time, I started doing this long ago. I actually assumed lots of other people were doing it too, but apparently they were not. And so now I have what is probably one of the world's largest collections of personal data."

And what a fascinating collection of data it is, including emails received and sent, phone calls made, calendar events planned, keystrokes made, and steps taken. Through this, you can see Wolfram's sleep, social, and work patterns, and even how various chapters of his book and Mathematica projects took shape.

"The overall pattern is fairly clear," Wolfram writes. "It's meetings and collaborative work during the day, a dinnertime break, more meetings and collaborative work, and then in the later evening more work on my own. I have to say that looking at all this data, I am struck by how shockingly regular many aspects of it are. But in general, I am happy to see it. For my consistent experience has been that the more routine I can make the basic practical aspects of my life, the more I am able to be energetic — and spontaneous — about intellectual and other things."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.

Related:

March 08 2012

Strata Week: Profiling data journalists

Here are a few of the data stories that caught my attention this week.

Profiling data journalists

Over the past week, O'Reilly's Alex Howard has profiled a number of practicing data journalists, following up on the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. Howard argues that data journalism has enormous importance, but "given the reality that those practicing data journalism remain a tiny percentage of the world's media, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact."

Howard's profiles include:

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

Surveying data marketplaces

Edd Dumbill takes a look at data marketplaces, the online platforms that host data from various publishers and offer it for sale to consumers. Dumbill compares four of the most mature data marketplaces — Infochimps, Factual, Windows Azure Data Marketplace, and DataMarket — and examines their different approaches and offerings.

Dumbill says marketplaces like these are useful in three ways:

"First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume."

Analyzing sports stats

The Atlantic's Dashiell Bennett examines the MIT Sloan Sports Analytics Conference, a "festival of sports statistics" that has grown over the past six years from 175 attendees to more than 2,200.

Bennett writes:

"For a sports conference, the event is noticeably athlete-free. While a couple of token pros do occasionally appear as panel guests, this is about the people behind the scenes — those who are trying to figure out how to pick those athletes for their team, how to use them on the field, and how much to pay them without looking like a fool. General managers and team owners are the stars of this show ... The difference between them and the CEOs of most companies is that the sports guys have better data about their employees ... and a lot of their customers have it memorized."

Got data news?

Feel free to email me.

Related:

March 01 2012

Strata Week: DataSift lets you mine two years of Twitter data

Here are a few of the data stories that caught my attention this week.

Twitter's historical archives, via Datasift

DataSift, one of the two companies with official access to the Twitter firehose (the other being Gnip), announced its new Historics service this week, giving customers access to up to two years' worth of historical Tweets. (By comparison, Gnip offers 30 days of Twitter data, and other developers and users have access to roughly a week's worth of Tweets.)

GigaOm's Barb Darrow responded to those who might be skeptical about the relevance of this sort of historic Twitter data in a service that emphasizes real-time. Darrow noted that DataSift CEO Rob Bailey said companies planning new products, promotions, or price changes would do well to study the impact of their past actions before proceeding, and that Twitter is the perfect venue for that.

Another indication of the desirability of this new Twitter data: the waiting list for Historics already includes a number of Fortune 500 companies. The service will get its official launch in April.

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

Building a school of data

Although there are plenty of ways to receive formal training in math, statistics and engineering, there aren't a lot of options when it comes to an education specifically in data science.

To that end, the Open Knowledge Foundation and Peer to Peer University (P2PU) have proposed a School of Data, arguing that:

"It will be years before data specialist degree paths become broadly available and accepted, and even then, time-intensive degree courses may not be the right option for journalists, activists, or computer programmers who just need to add data skills to their existing expertise. What is needed are flexible, on-demand, shorter learning options for people who are actively working in areas that benefit from data skills, particularly those who may have already left formal education programmes."

The organizations are seeking volunteers to help develop the project, whether that's in the form of educational materials, learning challenges, mentorship, or a potential student body.

Strata in California

The Strata Conference wraps up today in Santa Clara, Calif. If you missed Strata this year and weren't able to catch the livestream of the conference, look for excerpts and videos posted here on Radar and through the O'Reilly YouTube channel in the coming weeks.

And be sure to make plans for Strata New York, being held October 23-25. That event will mark the merger with Hadoop World. The call for speaker proposals for Strata NY is now open.

Got data news?

Feel free to email me.

Related:

February 23 2012

Strata Week: Infochimps makes a platform play

Here are a few of the data stories that caught my attention this week.

Infochimps makes its big data expertise available in a platform

The big data marketplace Infochimps announced this week that it will begin offering the platform it built for itself to other companies — as both a platform-as-a-service and an on-premise solution. "The technical needs for Infochimps are pretty substantial," says CEO Joe Kelly, and the company now plans to help others get up to speed with implementing a big data infrastructure.

Infochimps has offered datasets for download or via API for a number of years (see my May 2011 interview with the company here), but the startup is now making the transition to offer its infrastructure to others. Likening its big data marketplace to an "iTunes for data," Infochimps says it's clear that we still need a lot more "iPods" in production before most companies are able to handle the big data deluge.

Infochimps will now offer its in-house expertise to others. That includes a number of tools that one might expect: AWS, Hadoop, and Pig. But it also includes Ironfan, Infochimps' management tool built on top of Chef.

Infochimps isn't abandoning the big data marketplace piece of its business. However, its move to support companies with their big data efforts is an indication that there's still quite a bit of work to do before everyone is ready to "do stuff" with the big data we're accumulating.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.



Save 20% on registration with the code RADAR20

How do you anonymize online publications?

A fascinating piece of research is set to appear at IEEE S&P on the subject of Internet-scale authorship identification based on "stylometry," an analysis of writing style. The paper was co-authored by Arvind Narayanan, Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song. The researchers have been able to correctly identify writers 20% of the time based on what those writers have published online before. It's a finding with serious implications for online anonymity and free speech, the team notes.

"The good news for authors who would like to protect themselves against de-anonymization is it appears that manually changing one's style is enough to throw off these attacks," says Narayanan.

Open data for the public good

O'Reilly Media has just published a report on "Data for the Public Good." In the report, Alex Howard makes the argument for a systemic approach to thinking about open data and the public sector, examining the case for a "public good" around public data as well as around governmental, journalistic, healthcare, and crisis situations (to name but a few scenarios and applications).

Howard notes that the success of recent open data initiatives "won't depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered." Although many municipalities have made the case for open data initiatives, there's more to the puzzle, Howard argues, including recognizing the importance of personal data and making the case for a "hybridized public-private data."

The "Data for the Public Good" report is available for free as a PDF, ePUB, or MOBI download.

Got data news?

Feel free to email me.

Related:

February 16 2012

Strata Week: The data behind Yahoo's front page

Here are a few of the data stories that caught my attention this week.

Data and personalization drive Yahoo's front page

Yahoo offered a peek behind the scenes of its front page with the release of the Yahoo C.O.R.E. Data Visualization. The visualization provides a way to view some of the demographic details behind what Yahoo visitors are clicking on.

The C.O.R.E. (Content Optimization and Relevance Engine) technology was created by Yahoo Labs. The tech is used by Yahoo News and its Today module to personalize results for its visitors — resulting in some 13,000,000 unique story combinations per day. According to Yahoo:

"C.O.R.E. determines how stories should be ordered, dependent on each user. Similarly, C.O.R.E. figures out which story categories (i.e. technology, health, finance, or entertainment) should be displayed prominently on the page to help deepen engagement for each viewer."

Screenshot from Yahoo's C.O.R.E. data visualization. See the full visualization here.

Scaling Tumblr

Over on the High Scalability blog, Todd Hoff examines how the blogging site Tumblr was able to scale its infrastructure, something that Hoff describes as more challenging than the scaling that was necessary at Twitter.

To give some idea of the scope of the problem, Hoff cites these figures:

"Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers."

Hoff interviews Blake Matheny, distributed systems engineer at Tumblr, for a look at the architecture of both "old" and "new" Tumblr. When the startup began, it was hosted on Rackspace where "it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate."

The article also describes the Tumblr firehose, noting again its differences from Twitter's. "A challenge is to distribute so much data in real-time," Hoff writes. "[Tumblr] wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed." Although Tumblr initially used Scribe/Hadoop, "this model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second."
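For a rough sense of what those quoted figures imply, a little back-of-the-envelope arithmetic helps (page views and requests aren't strictly comparable, since one page view can trigger several requests, so treat the ratio as indicative only):

    # Back-of-the-envelope arithmetic on the figures quoted above.
    page_views_per_day = 500_000_000
    peak_requests_per_sec = 40_000
    new_data_per_day_tb = 3
    servers = 1000

    avg_page_views_per_sec = page_views_per_day / 86_400           # roughly 5,800/s
    peak_to_avg = peak_requests_per_sec / avg_page_views_per_sec   # roughly 7x
    new_data_per_server_gb = new_data_per_day_tb * 1024 / servers  # roughly 3 GB/day

    print(f"~{avg_page_views_per_sec:,.0f} page views/s on average")
    print(f"peak load is roughly {peak_to_avg:.0f}x the average page-view rate")
    print(f"~{new_data_per_server_gb:.1f} GB of new data lands per server per day")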

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


Visualization creation

Data scientist Pete Warden offers his own lessons learned about building visualizations this week in a story here on Radar. His first tip: "Play with your data." That is, before you decide what problem you want to solve or what visualization you want to create, take the time to get to know the data you're working with.

Warden writes:

"The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization."

Warden explains how he was able to create a visualization for his new travel startup, Jetpac, that showed where American Facebook users go on vacation. Warden's tips aren't simply about the tools he used; he also walks through the conceptualization of the project as well as the crunching of the data.
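In that spirit, the smallest useful version of "play with your data" is often just a quick interactive pass with pandas before any charting begins. The file name and column below are placeholders, not Warden's actual Jetpac data.

    # A minimal "play with your data" pass before designing a visualization.
    # The file and column names are placeholders.
    import pandas as pd

    df = pd.read_csv("checkins.csv")              # placeholder dataset

    print(df.shape)                               # how much data is there?
    print(df.dtypes)                              # what types are the columns?
    print(df.head())                              # what do a few rows look like?
    print(df.describe(include="all"))             # ranges, gaps, oddities
    print(df["destination"].value_counts().head(10))  # placeholder column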

Got data news?

Feel free to email me.

Related:

February 09 2012

Strata Week: Your personal automated data scientist

Here are a few of the data stories that caught my attention this week:

Wolfram|Alpha Pro: An on-call data scientist

The computational knowledge engine Wolfram|Alpha unveiled a pro version this week. For $4.99 per month ($2.99 for students), Wolfram|Alpha Pro offers access to more of the computational power "under the hood" of the site, in part by allowing users to upload their own datasets, which Wolfram|Alpha will in turn analyze.

This includes:

  • Text files — Wolfram|Alpha will respond with the character and word count, provide an estimate on how long it would take to read aloud, and reveal the most common word, average sentence length and more.
  • Spreadsheets — It will crunch the numbers and return a variety of statistics and graphs.
  • Image files — It will analyze the image's dimensions, size, and colors, and let you apply several different filters.

Wolfram|Alpha Pro subscribers can upload and analyze their own datasets.

There's also a new extended keyboard that contains the Greek alphabet and other special characters for manually entering data. Data and analysis from these entries and any queries can also be downloaded.

"In a sense," writes Wolfram's founder Stephen Wolfram, "the concept is to imagine what a good data scientist would do if confronted with your data, then just immediately and automatically do that — and show you the results."


Crisis-mapping and data protection standards

Ushahidi's Patrick Meier takes a look at the recently released Data Protection Manual issued by the International Organization for Migration (IOM). According to the IOM, the manual is meant to serve as a guide to help:

" ... protect the personal data of the migrants in its care. It follows concerns about the general increase in data theft and loss and the recognition that hackers are finding ever more sophisticated ways of breaking into personal files. The IOM Data Protection Manual aims to protect the integrity and confidentiality of personal data and to prevent inappropriate disclosure."

Meier describes the manual as "required reading" but notes that there is no mention of social media in the 150-page document. "This is perfectly understandable given IOM's work," he writes, "but there is no denying that disaster-affected communities are becoming more digitally-enabled — and thus, increasingly the source of important, user-generated information."

Meier moves through the Data Protection Manual's principles, highlighting the ones that may be challenged when it comes to user-generated, crowdsourced data and raising important questions about consent, privacy, and security.

Doubting the dating industry's algorithms

Many online dating websites claim that their algorithms are able to help match singles with their perfect mate. But a forthcoming article in "Psychological Science in the Public Interest," a journal of the Association for Psychological Science, casts some doubt on the data science of dating.

According to the article's lead author Eli Finkel, associate professor of social psychology at Northwestern University, "there is no compelling evidence that any online dating matching algorithm actually works." Finkel argues that dating sites' algorithms do not "adhere to the standards of science," and adds that "it is unlikely that their algorithms can work, even in principle, given the limitations of the sorts of matching procedures that these sites use."

It's "relationship science" versus the intake questions that most dating sites ask to build user profiles and suggest matches. Finkel and his coauthors note that some of the strongest predictors of good relationships — such as how couples interact under pressure — aren't assessed by dating sites.

The paper calls for the creation of a panel to grade the scientific credibility of each online dating site.

Got data news?

Feel free to email me.


February 02 2012

Strata Week: The Megaupload seizure and user data

Here are a few of the data stories that caught my attention this week.

Megaupload's seizure and questions about controlling user data

When the file-storage and sharing site Megaupload had its domain name seized, assets frozen and website shut down in mid-January, the U.S. Justice Department contended that the owners were operating a site dedicated to copyright infringement. But that posed a huge problem for those who were using Megaupload for the legitimate and legal storage of their files. As the EFF noted, these users weren't given any notice of the seizure, nor were they given an opportunity to retrieve their data.

Moreover, it seemed this week that those users would have all their data deleted, as Megaupload would no longer be able to pay its server fees.

While it appears that users have won a two-week reprieve before any deletion actually occurs, the incident does raise a number of questions about users' data rights and control in the cloud. Specifically: What happens to user data when a file hosting / cloud provider goes under? And how much time and notice should users have to reclaim their data?

Megaupload seizure notice: this is what you see when you visit Megaupload.com.

Bloomberg opens its market data distribution technology

The financial news and information company Bloomberg opened its market data distribution interface this week. The BLPAPI is available under a free-use license at open.bloomberg.com. According to the press release, some 100,000 people already use the BLPAPI, but with this week's announcement, the interface will be more broadly available.

The company introduced its Bloomberg Open Symbology back in 2009, a move to provide an alternative to some of the proprietary systems for identifying securities (particularly those services offered by Bloomberg's competitor Thomson Reuters). This week's opening of the BLPAPI is a similar gesture, one that the company says is part of its "Open Market Data Initiative, an ongoing effort to embrace and promote open solutions for the financial services industry."

The BLPAPI works with a range of programming languages, including Java, C, C++, .NET, COM and Perl. But while the interface itself is free to use, the content is not.



Pentaho moves Kettle to the Apache 2.0 license

Pentaho's extract-transform-load technology Pentaho Kettle is being moved to the Apache License, Version 2.0. Kettle was previously available under the GNU Lesser General Public License (LGPL).

By moving to the Apache license, Pentaho says it will be more in line with the licensing of Hadoop, HBase, and a number of NoSQL projects.

Kettle downloads and documentation are available at the Pentaho Big Data Community Home.

Oscar screeners and movie piracy data

Andy Baio took a look at some of the data surrounding piracy and the Oscar screening process. There has long been concern that the review copies of movies distributed to members of the Academy of Motion Picture Arts and Sciences were making their way online. Baio observed that while a record number of films have been nominated for Oscars this year (37), just eight of the "screeners" have been leaked online, "a record low that continues the downward trend from last year."

However, while the number of screeners available online has diminished, almost all of the nominated films (34) had already been leaked online. "If the goal of blocking leaks is to keep the films off the Internet, then the MPAA [Motion Picture Association of America] still has a long way to go," Baio wrote.

Baio has a number of additional observations about these leaks (and he also made the full data dump available for others to examine). But as the MPAA [Motion Picture Association of America] and others are making arguments (and helping pen related legislation) to crack down on Internet piracy, a good look at piracy trends seems particularly important.

Got data news?

Feel free to email me.


January 26 2012

Strata Week: Genome research kicks up a lot of data

Here are a few of the data stories that caught my attention this week.

Genomics data and the cloud

GigaOm's Derrick Harris explores some of the big data obstacles and opportunities surrounding genome research. He notes that:

When the Human Genome Project successfully concluded in 2003, it had taken 13 years to complete its goal of fully sequencing the human genome. Earlier this month, two firms — Life Technologies and Illumina — announced instruments that can do the same thing in a day, one for only $1,000. That's likely going to mean a lot of data.

But as Harris observes, the promise of quick and cheap genomics is leading to other problems, particularly as the data reaches a heady scale. A fully sequenced human genome is about 100GB of raw data. But citing DNAnexus founder Andreas Sundquist, Harris says that:

... volume increases to about 1TB by the time the genome has been analyzed. He [Sundquist] also says we're on pace to have 1 million genomes sequenced within the next two years. If that holds true, there will be approximately 1 million terabytes (or 1,000 petabytes, or 1 exabyte) of genome data floating around by 2014.

That makes the promise of a $1,000 genome sequencing service challenging when it comes to storing and processing petabytes of data. Harris posits that cloud computing will come to the rescue here, providing the necessary infrastructure to handle all that data.
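
The arithmetic behind that exabyte figure is easy to check with the article's own numbers (roughly 1 TB per analyzed genome and 1 million genomes projected within two years); a minimal sketch:

```python
# Back-of-the-envelope check of the storage estimate cited above.
TB_PER_ANALYZED_GENOME = 1         # ~1 TB per genome after analysis (per Sundquist)
GENOMES_SEQUENCED = 1_000_000      # projected within the next two years

total_tb = TB_PER_ANALYZED_GENOME * GENOMES_SEQUENCED
print(total_tb, "TB")              # 1,000,000 TB
print(total_tb / 1_000, "PB")      # 1,000 PB
print(total_tb / 1_000_000, "EB")  # roughly 1 exabyte
```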


Stanley Fish versus the digital humanities

Literary critic and New York Times opinionator Stanley Fish has been on a bit of a rampage in recent weeks, taking on the growing field of the "digital humanities." Prior to the annual Modern Language Association meeting, Fish cautioned that alongside the traditional panels and papers on Ezra Pound and William Shakespeare and the like, there were going to be a flood of sessions devoted to:

...'the digital humanities,' an umbrella term for new and fast-moving developments across a range of topics: the organization and administration of libraries, the rethinking of peer review, the study of social networks, the expansion of digital archives, the refining of search engines, the production of scholarly editions, the restructuring of undergraduate instruction, the transformation of scholarly publishing, the re-conception of the doctoral dissertation, the teaching of foreign languages, the proliferation of online journals, the redefinition of what it means to be a text, the changing face of tenure — in short, everything.

That "everything" was narrowed down substantially in Fish's editorial this week, in which he blasted the digital humanities for what he sees as its fixation "with matters of statistical frequency and pattern." In other words: data and computational analysis.

According to Fish, the problem with digital humanities is that this new scholarship relies heavily on the machine — and not the literary critic — for interpretation. Fish contends that digital humanities scholars are all teams of statisticians and positivists, busily digitizing texts so they can data-mine them and systematically and programmatically uncover something of interest — something worthy of interpretation.

University of Illinois, Urbana-Champaign English professor Ted Underwood argues that Fish not only mischaracterizes what digital humanities scholars do, but also misrepresents how his own interpretive tradition works:

... by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account, we normally begin with a hypothesis (which seems to have sprung, like Sin, fully-formed from our head), and test it against a single sentence.

One of the most interesting responses to Fish's recent rants about the humanities' digital turn comes from University of North Carolina English professor Daniel Anderson, who demonstrates in a video response a far fuller picture of what "digital" "data" — creation and interpretation — looks like.

Hadoop World merges with O'Reilly's Strata New York conference

Two big data events announced this week that they're merging: Hadoop World will now be part of the Strata Conference in New York this fall.

[Disclosure: The Strata events are run by O'Reilly Media.]

Cloudera first started Hadoop World back in 2009, and as Hadoop itself has seen increasing adoption, Hadoop World, too, has become more popular. Strata is a newer event — its first conference was held in Santa Clara, Calif., in February 2011, and it expanded to New York in September 2011.

With the merger, Hadoop World will be a featured program at Strata New York 2012 (Oct. 23-25).

In other Hadoop-related news this week, Strata chair Edd Dumbill took a close look at Microsoft's Hadoop strategy. Although it might be surprising that Microsoft has opted to adopt an open source technology as the core of its big data plans, Dumbill argues that:

Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly-growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.

Also, Cloudera data scientist Josh Wills takes a closer look at one aspect of that ecosystem: the work of scientists whose research falls outside of statistics and machine learning. His blog post specifically addresses one use case for Hadoop — seismology, for which there is now Seismic Hadoop — but it also provides a broad look at what constitutes the practice of data science.

Got data news?

Feel free to email me.

Photo: Bootstrap DNA by Charles Jencks, 2003 by mira66, on Flickr


January 19 2012

Strata Week: A home for negative and null results

Here are a few of the data stories that caught my attention this week:

Figshare sees the upside of negative results

Science data-sharing site Figshare relaunched its website this week, adding several new features. Figshare lets researchers publish all of their data online, including negative and null results.

Using the site, researchers can now upload and publish all file formats, including videos and datasets that are often deemed "supplemental materials" or excluded from current publishing models. This is part of a larger "open science" effort. According to Figshare:

"... by opening up the peer review process, researchers can easily publish null results, avoiding the file drawer effect and helping to make scientific research more efficient. Figshare uses creative commons licensing to allow frictionless sharing of research data whilst allowing users to maintain their ownership."

As the startup argues: "Unless we as scientists publish all of our data, we will never achieve access to the sum of all scientific knowledge."


Accel's $100 million data fund makes its first ($52.5 million) investment

Late last year, the investment firm Accel Partners announced a new $100 Million Big Data Fund, with a promise to invest in big data startups. This year, the first investment from that fund was revealed, with a whopping $52.5 million going to Code 42.

Founded in 2001, Code 42 is the creator of the backup software CrashPlan, and the company describes itself as building "high-performance hardware and easy-to-use software solutions that protect the world's data."

Describing the investment, GigaOm's Stacey Higginbotham writes:

"With the growth in mobile devices and the data stored on corporate and consumer networks that is moving not only from device to server, but device to device, [CEO Matthew] Dornquast realized Code 42's software could become more than just a backup and sharing service, but a way for corporations to understand what data and how data was moving between employees and the devices they use."

Higginbotham also cites Accel Partners' Ping Li, who notes that further investments from its Big Data Fund are unlikely to be so sizable.

LinkedIn open sources DataFu

LinkedIn has been a heavy user of Apache Pig for performing analysis with Hadoop on projects such as its People You May Know feature. For more advanced tasks like these, Pig supports User Defined Functions (UDFs), which let developers integrate custom code into Pig scripts, as illustrated in the sketch below.
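
For readers who haven't used Pig, the sketch below illustrates the UDF idea using Pig's Jython (Python) support. The function, file names, and Pig Latin shown in the comments are hypothetical examples; DataFu itself ships its UDFs as Java classes in a jar that Pig scripts register in much the same way.

```python
# A hypothetical Pig UDF written in Python, usable via Pig's Jython
# support (Pig 0.9+). In a real Jython UDF you would usually add Pig's
# @outputSchema decorator so Pig knows the return schema; it is omitted
# here so the file also runs standalone.
#
# Example Pig Latin (hypothetical):
#   REGISTER 'tag_udfs.py' USING jython AS tag_udfs;
#   posts  = LOAD 'posts.tsv' AS (title:chararray, tags:chararray);
#   tagged = FOREACH posts GENERATE title, tag_udfs.normalize_tags(tags);

def normalize_tags(raw_tags):
    """Split a comma-separated tag string, then trim and de-duplicate it."""
    if raw_tags is None:
        return []
    seen = []
    for tag in raw_tags.split(","):
        tag = tag.strip().lower()
        if tag and tag not in seen:
            seen.append(tag)
    return seen

if __name__ == "__main__":
    # Quick local check outside of Pig.
    print(normalize_tags("Python, data,  python , Hadoop"))
```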

This week, LinkedIn announced the release of DataFu, the consolidation of its UDFs into a single, general-purpose library. DataFu enables users to "run PageRank on a large number of independent graphs, perform set operations such as