January 12 2012

Strata Week: A .data TLD?

Here are some of the data stories that caught my attention this week.

Should there be a .data TLD?

ICANN is ready to open top-level domains (TLD) to the highest bidder, and as such, Wolfram Alpha's Stephen Wolfram posits it's time for a .data TLD. In a blog post on the Wolfram site, he argues that the new top-level domains provide an opportunity for the creation of a .data domain that could create a "parallel construct to the ordinary web, but oriented toward structured data intended for computational use. The notion is that alongside a website like, there'd be"

Wolfram continues:

If a human went to, there'd be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it'd find just what it needs to ingest the data, and begin computing with it.

So how would a .data TLD change the way humans and computers interact with data? Or would it change anything? If you've got ideas of how .data could be put to use, please share them in the comments.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Cloudera addresses what Apache Hadoop 1.0 means to its customers

Last week, the Apache Software Foundation (ASF) announced that Hadoop had reached version 1.0. This week, Cloudera took to its blog to explain what that milestone means to its customers.

The post, in part, explains how Hadoop has branched from its trunk, noting that all of this has caused some confusion for Cloudera customers:

More than a year after Apache Hadoop 0.20 branched, significant feature development continued on just that branch and not on trunk. Two major features were added to branches off 0.20.2. One feature was authentication, enabling strong security for core Hadoop. The other major feature was append, enabling users to run Apache HBase without risk of data loss. The security branch was later released as 0.20.203. These branches and their subsequent release have been the largest source of confusion for users because since that time, releases off of the 0.20 branches had features that releases off of trunk did not have and vice versa.

Cloudera explains to its customers that it's offered the equivalent for "approximately a year now" and compares the Apache Hadoop efforts to its own offerings. The post is an interesting insight into not just how the ASF operates, but how companies that offer services around those projects have to iterate and adapt.

Disqus says that pseudonymous commenters are best

Debates over blog comments have resurfaced recently, with a back and forth about whether they're good, bad, evil, or irrelevant. Adding some fuel to the fire (or data to the discussion, at least), Disqus has published its own research based on its commenting service.

According to the Disqus research, commenters using pseudonyms actually are "the most valuable contributors to communities," as their comments are both the most numerous and the highest in quality. Those findings run counter to the idea that those who comment online without using their real names lessen rather than enhance the quality of conversations.

Disqus' data indicates that pseudonymity might engender a more engaged and more engaging community. That notion stands in contrast to arguments that anonymity leads to more trollish and unruly behavior.

Got data news?

Feel free to email me.


January 05 2012

Strata Week: Unfortunately for some, Uber's dynamic pricing worked

Here are a few of the data stories that caught my attention this week.

Uber's dynamic pricing

Many passengers using the luxury car service Uber on New Year's Eve suffered from sticker shock when they saw that a hefty surcharge had been added to their bills — a charge ranging from 3 to more than 6 times the regular cost of an Uber fare. Some patrons took to Twitter to complain about the pricing, and Uber responded with several blog posts and Quora answers, trying to explain the startup's usage of "dynamic pricing."

The idea, writes Uber engineer Dom Anthony Narducci, is that:

... when our utilization is approaching too high of levels to continue to provide low ETA's and good dispatches, we raise prices to reduce demand and increase supply. On New Year's Eve (and just after midnight), this system worked perfectly; demand was too high, so the price bumped up. Over and over and over and over again.

In other words, in order to maintain the service that Uber is known for — reliability — the company adjusted prices based on the supply and demand for transportation. And on New Year's Eve, says Narducci, "As for how the prices got that high, at a super simplistic level, it was because things went right."
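To make that feedback loop concrete, here is a minimal sketch in Python. It is not Uber's algorithm: the utilization target, the 25% price bump, the cap, and the assumed demand and supply responses are all invented for illustration. It only shows how repeated bumps can compound into a large multiplier when demand stays high.

```python
def simulate_surge(requests: float, cars: float,
                   target: float = 0.8,       # hypothetical utilization ceiling
                   bump: float = 1.25,        # hypothetical price step per round
                   demand_drop: float = 0.9,  # assumed: each bump sheds 10% of requests
                   supply_gain: float = 1.05, # assumed: each bump adds 5% more cars
                   cap: float = 7.0) -> float:
    """Return the price multiplier after bumping prices until utilization
    (requests per available car) falls to the target or the cap is reached."""
    if cars <= 0:
        return cap  # no supply at all: charge the maximum allowed
    multiplier = 1.0
    while requests / cars > target and multiplier * bump <= cap:
        multiplier *= bump       # raise the price ...
        requests *= demand_drop  # ... which reduces demand ...
        cars *= supply_gain      # ... and draws more drivers onto the road
    return multiplier


quiet_night = simulate_surge(requests=50, cars=100)
new_years_eve = simulate_surge(requests=400, cars=100)
print(quiet_night, new_years_eve)
```

On a quiet night the loop never runs and the multiplier stays at 1.0. When requests far outstrip cars, the price is bumped "over and over" until either utilization recovers or the cap stops it.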

TechCrunch contributor Semil Shah points to other examples of dynamic pricing, such as for airfares and hotels, and argues that we might see more of this in the future. "Starting now, consumers should also prepare to experience the underbelly of this phenomenon, a world where prices for goods and services that are in demand, either in quantity or at a certain time, aren't the same price for each of us."

But Reuters' Felix Salmon argues that this sort of algorithmic and dynamic pricing might not work well for most customers. It isn't simply that the prices for Uber car rides are high (they are always higher than a taxi anyway). He contends that the human brain really can't — or perhaps doesn't want to — handle this sort of complicated cost/benefit analysis for a decision like "should I take a cab or call Uber or just walk home." As such, he calls Uber:

... a car service for computers, who always do their sums every time they have to make a calculation. Humans don't work that way. And the way that Uber is currently priced, it's always going to find itself in a cognitive zone of discomfort as far as its passengers are concerned.

Apache Hadoop reaches v1.0

The Apache Software Foundation announced that Apache Hadoop has reached v1.0, an indication that the big data tool has achieved a certain level of stability and enterprise-readiness.

V1.0 "reflects six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists, and systems engineers, bringing a highly stable, enterprise-ready release of the fastest-growing big data platform," said the ASF in its announcement.

The designation by the Apache Software Foundation reaffirms the interest in and development of Hadoop, a major trend in 2011 and likely to be such again in 2012.

Proposed bill would repeal open access for federally funded research

What's the future for open data, open science, and open access in 2012? Hopefully, a bill introduced late last month isn't a harbinger of what's to come.

The Research Works Act (HR 3699) is a proposed piece of legislation that would repeal the open-access policy at the National Institutes of Health (NIH) and prohibit similar policies from being introduced at other federal agencies. HR 3699 has been referred to the Committee on Oversight and Government Reform.

The main section of the bill is quite short:

"No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that

  • causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or
  • requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work."

The bill would prohibit the NIH and other federal agencies from requiring that grant recipients publish in open-access journals.


December 15 2011

Strata Week: A new Internet data transfer speed record

Here are a few of the data stories that caught my attention this week:

New world record for data transfer speed

Scientists announced this week that they had broken the world record for Internet speed by transferring data at 186 Gbps.

Researchers built an optical fiber network between the University of Victoria Computing Centre in Victoria, British Columbia, and the Washington State Convention Center in Seattle, Wash. According to a Caltech press release, "with a simultaneous data rate of 88 Gbps in the opposite direction, the team reached a sustained two-way data rate of 186 Gbps between two data centers, breaking the team's previous peak-rate record of 119 Gbps set in 2009."

The new record-breaking speed is fast enough to transfer roughly 100,000 Blu-ray disks a day. The research on faster Internet speeds is underway to better handle the data coming from the Large Hadron Collider at CERN. "More than 100 petabytes (more than four million Blu-ray disks) of data have been processed, distributed, and analyzed using a global grid of 300 computing and storage facilities located at laboratories and universities around the world," according to Caltech, "and the data volume is expected to rise a thousand-fold as physicists crank up the collision rates and energies at the LHC." Faster data transfer will hopefully make it possible for more researchers to be able to work with the petabyte-scale data from CERN.
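As a back-of-the-envelope check on those figures, the short Python sketch below converts a sustained data rate into disks per day. The 25 GB disk capacity is an assumption (it is the size that makes 100 petabytes come out to four million disks); the function name is made up for this example.

```python
SECONDS_PER_DAY = 86_400
BLU_RAY_GB = 25  # assumed single-layer capacity, in decimal gigabytes


def disks_per_day(rate_gbps: float, disk_gb: float = BLU_RAY_GB) -> float:
    """Number of disks' worth of data moved in one day at a sustained rate."""
    bytes_per_second = rate_gbps * 1e9 / 8           # gigabits/s -> bytes/s
    bytes_per_day = bytes_per_second * SECONDS_PER_DAY
    return bytes_per_day / (disk_gb * 1e9)


def disks_for_petabytes(petabytes: float, disk_gb: float = BLU_RAY_GB) -> float:
    """Number of disks needed to hold the given number of petabytes."""
    return petabytes * 1e15 / (disk_gb * 1e9)


print(disks_per_day(186))        # two-way record rate
print(disks_for_petabytes(100))  # LHC data processed so far
```

At 186 Gbps this works out to about 2 petabytes, or roughly 80,000 disks of 25 GB, per day, which is the same order of magnitude as the "roughly 100,000" figure quoted above.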


Data predictions for 2012

This was a "coming out" year for big data and data science, according to O'Reilly's Edd Dumbill, who posted his 2012 data predictions this week. Dumbill has identified five areas in which he thinks we'll see more development in the next year:

  • More powerful and expressive tools for analysis. Specifically, better programming language support.
  • Development of data science workflows and tools. In other words, there will be clearer processes for how data teams work.
  • Rise of data marketplaces — the "directory" and the "delivery."
  • Streaming data processing, as opposed to batch processing.
  • Increased understanding of and demand for visualization. "If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills," Dumbill writes.


December 01 2011

Strata Week: New open-data initiatives in Canada and the UK

Here are a few of the data stories that caught my attention this week.

Open data from StatsCan

Embassy Magazine broke the news this week that all of Statistics Canada's online data will be made available to the public for free, released under the Government of Canada's Open Data License Agreement beginning in February 2012. Statistics Canada is the federal agency commissioned with producing statistics to help understand the Canadian economy, culture, resources, and population. (It runs the Canadian census every five years.)

The decision to make the data freely and openly available "has been in the works for years," according to Statistics Canada spokesperson Peter Frayne. The Canadian government did launch an open-data initiative earlier this year, and the move on the part of StatsCan dovetails philosophically with that. Frayne said that the decision to make the data free was not a response to the controversial decision last summer when the agency dropped its mandatory long-form census.

Open government activist David Eaves responds with a long list of "winners" from the decision, including all of the consumers of StatsCan's data:

Indirectly, this includes all of us, since provincial and local governments are big consumers of StatsCan data and so now — assuming it is structured in such a manner — they will have easier (and cheaper) access to it. This is also true of large companies and non-profits which have used StatsCan data to locate stores, target services and generally allocate resources more efficiently. The opportunity now opens for smaller players to also benefit.

Eaves continues, stressing the importance of these smaller players:

Indeed, this is the real hope. That a whole new category of winners emerges. That the barrier to use for software developers, entrepreneurs, students, academics, smaller companies and non-profits will be lowered in a manner that will enable a larger community to make use of the data and therefore create economic or social goods.

Moving to Big Data: Free Strata Online Conference — In this free online event, being held Dec. 7, 2011, at 9AM Pacific, we'll look at how big data stacks and analytical approaches are gradually finding their way into organizations as well as the roadblocks that can thwart efforts to become more data driven. (This Strata Online Conference is sponsored by Microsoft.)

Register to attend this free Strata Online Conference

Open data from Whitehall

The British government also announced the availability of new open datasets this week. The Guardian reports that personal health records, transportation data, housing prices, and weather data will be included "in what promises to be the most dramatic release of public data since the 2010 election."

The government will also form an Open Data Institute (ODI), led by Sir Tim Berners-Lee. The ODI will involve both businesses and academic institutions, and will focus on helping transform the data for commercial benefit for U.K. companies as well as for the government. The ODI will also work on the development of web standards to support the government's open-data agenda.

The Guardian notes that the health data that's to be released will be the largest of its kind outside of U.S. veterans' medical records. The paper cites the move as something recommended by the Wellcome Trust earlier this year: "Integrated databases ... would make England unique, globally, for such research." Both medical researchers and pharmaceutical companies will be able to access the data for free.

Dell open sources its Hadoop deployment tool

Hadoop adoption and investment have been among the big data trends of 2011, with stories about Hadoop appearing in almost every edition of Strata Week. GigaOm's Derrick Harris contends that Hadoop's good fortunes will only continue in 2012, listing six reasons why next year may actually go down as "The Year of Hadoop."

This week's Hadoop-related news involves the release of the source code to Crowbar, Dell's Hadoop deployment tool. Silicon Angle's Klint Finley writes that:

Crowbar is an open-source deployment tool developed by Dell originally as part of its Dell OpenStack Cloud service. It started as a tool for installing Open Stack, but can deploy other software through the use of plug-in modules called 'barclamps' ... The goal of the Hadoop barclamp is to reduce Hadoop deployment time from weeks to a single day.

Finley notes that Crowbar isn't competition to Cloudera's line of Hadoop management tools.

What Muncie read

"People don't read anymore," Steve Jobs once told The New York Times. It's a fairly common complaint, one that certainly predates the computer age — television was to blame, then video games. But our knowledge about reading habits of the past is actually quite slight. That's what makes the database based on ledgers from the Muncie, Ind., public library so marvelous.

The ledgers, which were discovered by historian Frank Felsenstein, chronicle every book checked out of the library, along with the name of the patron who checked it out, between November 1891 and December 1902. That information is now available in the What Middletown Read database.

In a New York Times story on the database, Anne Trubek notes that even at the turn of the 20th century, most library patrons were not reading "the classics":

What do these records tell us Americans were reading? Mostly fluff, it's true. Women read romances, kids read pulp and white-collar workers read mass-market titles. Horatio Alger was by far the most popular author: 5 percent of all books checked out were by him, despite librarians who frowned when boys and girls sought his rags-to-riches novels (some libraries refused to circulate Alger's distressingly individualist books). Louisa May Alcott is the only author who remains both popular and literary today (though her popularity is far less). "Little Women" was widely read, but its sequel "Little Men" even more so, perhaps because it was checked out by boys, too.


November 22 2011

Strata Week: 4.74 degrees of Kevin Bacon

Here are some of the data stories that caught my attention this week:

There are fewer than six degrees between you and Kevin Bacon

You know the game: there are no more than six degrees of separation between actor Kevin Bacon and anyone working in Hollywood. You can start with Sir Alec Guinness, or you can start with Sasha Grey — there are, at most, six links that connect that performer to Kevin Bacon. The game is built on an older notion of social connections, one dating back as far as the late 1920s when Hungarian author Frigyes Karinthy argued there were no more than five acquaintances that separated people.

It might not be all that surprising that the Internet affords different — closer? — relationships than those of early 20th-century Hungary. Indeed, according to Facebook's data team, the connections it now affords are actually much closer than the "six degrees of separation" maxim suggests. Between any two Facebook users, there are on average only 4.74 "hops," as the team calls them.

Facebook's Data team writes:

Thus, when considering even the most distant Facebook user in the Siberian tundra or the Peruvian rainforest, a friend of your friend probably knows a friend of their friend. When we limit our analysis to a single country, be it the US, Sweden, Italy, or any other, we find that the world gets even smaller, and most pairs of people are only separated by three degrees (four hops). It is important to note that while [Stanley] Milgram was motivated by the same question (how many individuals separate any two people), these numbers are not directly comparable; his subjects only had limited knowledge of the social network, while we have a nearly complete representation of the entire thing. Our measurements essentially describe the shortest possible routes that his subjects could have found.
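For readers who want to see what counting "hops" means in practice, here is a small Python sketch that measures the shortest route between every pair of people in a toy friendship graph using breadth-first search. It is an illustration only, not the method Facebook's team used; the graph and function names are made up, and an exact all-pairs search like this would not be practical on hundreds of millions of users.

```python
from collections import deque


def hops_from(graph: dict, start: str) -> dict:
    """Breadth-first search: shortest number of hops from `start`
    to every person reachable through friend links."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in dist:  # first visit is the shortest route
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def average_hops(graph: dict) -> float:
    """Mean shortest-path length over all connected pairs of distinct people."""
    total = pairs = 0
    for person in graph:
        for other, d in hops_from(graph, person).items():
            if other != person:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical four-person chain: ann - bob - cat - dan
toy_graph = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}
print(hops_from(toy_graph, "ann")["dan"])
print(average_hops(toy_graph))
```

Here a hop is one friend link, so ann and dan are three hops apart with two people between them. That matches the convention in the quote above, where four hops correspond to three degrees.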

The Facebook study involves a sizable dataset — some 721 million Facebook users. That's the largest, by far, of any study of its kind. But as The New York Times points out, the Facebook study still raises questions about what exactly we mean when we talk about "friends" and the relationships and connections we've created between people online.

Stock market data meets social media data

Social media data aggregator Gnip announced this week that it was expanding into the financial services market with the launch of a new product aimed at hedge funds and stock traders. Gnip MarketStream will provide real-time data from both Twitter and StockTwits.

Gnip is one of two companies licensed to handle the Twitter firehose (the other, which I wrote about last week, is DataSift). The company says that it already includes in its customer base a number of hedge funds, and now in partnership with StockTwits — a financial data platform that created the $(TICKER) tag on Twitter — Gnip says it will provide real-time social data to trading companies.
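As a rough illustration of what a $(TICKER) tag makes possible, the Python sketch below pulls ticker symbols out of message text. The pattern is an assumption made for this example (a dollar sign followed by one to five capital letters); it is not a description of how Gnip or StockTwits actually parse messages.

```python
import re

# Assumed tag shape: "$" followed by 1-5 uppercase letters, not glued to
# a preceding word character or another "$".
CASHTAG = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Z]{1,5})\b")


def extract_tickers(text: str) -> list:
    """Return the ticker symbols tagged in a message, in order of appearance."""
    return CASHTAG.findall(text)


message = "Long $AAPL and $GOOG, lunch cost $20"
print(extract_tickers(message))
```

Dollar amounts such as "$20" are skipped because the pattern only accepts letters after the dollar sign.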

Hunch acquired by eBay

EBay has acquired the recommendation engine Hunch. The price tag was around $80 million, according to Mike Arrington. Founded in 2007 by Chris Dixon and Caterina Fake, Hunch built a "taste graph," a website and API that provided insights into users' affinities with different people, services, brands, and websites.

In an interview with Betabeat, Dixon pointed to precisely this sort of insight as the rationale for the startup's acquisition: "eBay is a very unique retailer," Dixon told Betabeat. "When grandma posts a sweater for sale, it doesn't have metadata to help sort and identify it. In working to understand users' tastes on the open web, this is the challenge we have been solving at Hunch." Betabeat says that, "Something like 70% of items on eBay don't have traditional metadata like product IDs." No metadata on eBay transactions means little opportunity for eBay itself to build a sophisticated recommendation engine, and clearly the acquisition of Hunch is meant to address that.

But there's a flip side to this equation, too, as Betabeat describes it: "eBay's 97 million users and 200 million active listings add up to 9 petabytes of data across two billion daily page views. 'There is only so much you can teach your system with the big academic data sets that are publicly available,' Dixon said. 'With eBay's data behind us, expect Hunch to get much, much better'."

Faster than the speed of light?

Earlier this fall, scientists at CERN said they'd clocked neutrinos traveling faster than the speed of light. Since this changes everything, there's been an open call to other scientists to replicate the findings.

There've been a couple of new salvos this week: The Italian Institute for Nuclear Physics (INFN), which runs the Gran Sasso lab, has just confirmed the test results, The Economist reported. Then, another Italian site reported different findings. The ICARUS experiment said that, no, the original research hadn't adequately accounted for the neutrino's energy upon arrival. Recalculating the data, they argued that the neutrinos were still traveling at the speed of light — no faster.

So is the Theory of Relativity still intact? Stay tuned ...


November 17 2011

Strata Week: Why ThinkUp matters

Here are a few of the data stories that caught my attention this week.

ThinkUp hits 1.0

ThinkUp, a tool out of Expert Labs, enables users to archive, search and export their Twitter, Facebook and Google+ history — both posts and post replies. It also allows users to see their network activity, including new followers, and to map that information. Originally created by Gina Trapani, ThinkUp is free and open source, and will run on a user's own web server.

That's crucial, says Expert Labs' founder Anil Dash, who describes ThinkUp's launch as "software that matters." He writes that "ThinkUp's launch matters to me because of what it represents: The web we were promised we would have. The web that I fell in love with, and that has given me so much. A web that we can hack, and tweak, and own." Imagine everything you've ever written on Twitter, every status update on Facebook, every message on Google+ and every response you've had to those posts — imagine them wiped out by the companies that control those social networks.

Why would I ascribe such awful behavior to the nice people who run these social networks? Because history shows us that it happens. Over and over and over. The clips uploaded to Google Videos, the sites published to Geocities, the entire relationships that began and ended on Friendster: They're all gone. Some kind-hearted folks are trying to archive those things for the record, and that's wonderful. But what about the record for your life, a private version that's not for sharing with the world, but that preserves the information or ideas or moments that you care about?

It's in light of this, no doubt, that ReadWriteWeb's Jon Mitchell calls ThinkUp "the social media management tool that matters most." Indeed, as we pour more of our lives into these social sites, tools like ThinkUp, along with endeavors like the Locker Project, mark important efforts to help people own, control and utilize their own data.

DataSift opens up its Twitter firehose

DataSift, one of only two companies licensed by Twitter to syndicate its firehose (the other being Gnip), officially opened to the public this week. That means that those using DataSift can in turn mine all the social data that comes from Twitter — data that comes at a rate of some 250 million tweets per day. DataSift's customers can analyze this data for more than just keyword searches and can apply various filters, including demographic information, sentiment, gender, and even Klout score. The company also offers data from MySpace and plans to add Google+ and Facebook data soon.

DataSift, which was founded by Tweetmeme's Nick Halstead and raised $6 million earlier this year, is available under a pay-as-you-go subscription model.

Google's BigQuery service opens to more developers

Google announced this week that it is opening the pilot of BigQuery, its big data analytics service, to more companies. The tool was initially developed for internal use at Google, and it was opened to a limited number of developers and companies at Google I/O earlier this year. Now, Google is allowing a few more companies into the fold, offering them the service for free — with the promise to notify them in 30 days if it plans to charge — as well as adding some user interface improvements.

In addition to a GUI for the web-based version, Google has improved the REST API for BigQuery as well. The new API offers granular control over permissions and lets you run multiple jobs in the background.

BigQuery is based on the Google tool formerly known as Dremel, which the company discussed in a research paper published last year:

[Dremel] is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data.

In the blog post announcing the changes to BigQuery, Google cites Michael J. Franklin, Professor of Computer Science at UC Berkeley, who calls BigQuery's ability to process big data "jaw-dropping."


November 10 2011

Strata Week: The social graph that isn't

Here are a few of the data stories that caught my attention this week:

Not social. Not a graph.

It's hardly surprising that the founder of a "bookmarking site for introverts" would have something to say about the "social graph." But what Pinboard's Maciej Ceglowski has penned in a blog post titled "The Social Graph Is Neither" is arguably the must-read article of the week.

The social graph is neither a graph, nor is it social, Ceglowski posits. He argues that today's social networks have failed to capture the complexities and intricacies of our social relationships (there's no graph) and have become something that's at best contrived and at worst icky (actually, that's not the "worst," but it's the adjective Ceglowski uses).

From his post:

Imagine the U.S. Census as conducted by direct marketers — that's the social graph. Social networks exist to sell you crap. The icky feeling you get when your friend starts to talk to you about Amway or when you spot someone passing out business cards at a birthday party, is the entire driving force behind a site like Facebook. Because their collection methods are kind of primitive, these sites have to coax you into doing as much of your social interaction as possible while logged in, so they can see it.

But if today's social networks are troublesome, they're also doomed, Ceglowski contends, much as the CompuServes and the Prodigys of an earlier era were undone. Those services weren't so much out-innovated as out-democratized. As the global network spread, mass marketing gave way to grassroots efforts.

"My hope," Ceglowski writes, "is that whatever replaces Facebook and Google+ will look equally inevitable and that our kids will think we were complete rubes for ever having thrown a sheep or clicked a +1 button. It's just a matter of waiting things out and leaving ourselves enough freedom to find some interesting, organic, and human ways to bring our social lives online."

Cloudera raises $40 million

The Hadoop-based startup Cloudera announced this week that it has raised another $40 million in funding, led by Ignition Partners, Greylock, Accel, Meritech Capital Partners, and In-Q-Tel. This brings the total investment in the company to some $76 million, a solid endorsement of not just Cloudera but of the Hadoop big data solution.

Hadoop is a trend that we've covered almost weekly here as part of the Strata Week news roundup. And GigaOm's Derrick Harris has run some estimates on the numbers of the Hadoop ecosystem at large, finding that: "Hadoop-based startups have raised $104.5 million since May. The same set of companies has raised $159.7 million since 2009 when Cloudera closed its first round."

While it's easy to label Hadoop as one of the buzzwords of 2011, the amount of investor interest, as well as the amount of adoption, is an indication that many people see this as a cornerstone of a big data strategy as well as a good source of revenue for the coming years.

Kaggle raises $11 million to crowdsource big data

It's a much smaller round of investment than Cloudera's, to be sure, but Kaggle's $11 million Series A round announced this week is still noteworthy. Kaggle provides a platform for running big data competitions. "We're making data science a sport," so its tagline reads.

But it's more than that. There remains a gulf between data scientists and those who have data problems to solve. Kaggle helps bridge this gap by letting companies outsource their big data problems to third-party data scientists and software developers, with prizes going to the best solutions. Kaggle claims it has a community of more than 17,000 PhD-level data scientists, ready to take on and resolve companies' data problems.

Kaggle has thus far enabled several important breakthroughs, including a competition that helped identify new ways to map dark matter in the universe. That's a project that had been worked on for several decades by traditional methods, but those in the Kaggle community tackled it in a couple of weeks.

The Supreme Court looks at GPS data tracking

The U.S. Supreme Court heard oral arguments this week in United States v. Jones, a case that could have major implications for mobile data, GPS and privacy. At issue is whether police need a warrant in order to attach a tracking device to a car to monitor a suspect's movements.

Surveillance via technology is clearly much easier and more efficient than traditional surveillance methods. Why follow a suspect around all day, for example, when you can attach a device to his or her car and just watch the data transmission? But the data you get from a GPS device is far more detailed than what human surveillance yields, which raises all sorts of questions about what constitutes a reasonable search. And while you needn't get a warrant to shadow someone's car, attaching that GPS tracking device might just violate the Fourth Amendment and the protection against unreasonable search and seizure.

But what's at stake is much larger than just sticking a tracking device to the underbelly of a criminal suspect's vehicle. After all, every cell phone owner gives off an incredible amount of mobile location data, something that the government could conceivably tap into and monitor.

During oral arguments, Supreme Court justices seemed skeptical about the government's power to use technology in this way.

Got data news?

Feel free to email me.

Photo: Graph Paper by Calsidyrose, on Flickr


November 03 2011

Strata Week: Cloudera founder has a new data product

Here are a few of the data stories that caught my attention this week:

Odiago: Cloudera founder Christophe Bisciglia's next big data project

Cloudera founder Christophe Bisciglia unveiled his new data startup this week: Odiago. The company's product, WibiData (say it out loud), uses Apache Hadoop and HBase to analyze consumer web data. Database industry analyst Curt Monash describes WibiData on his DBMS2 blog:

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop, is a data management and analytic execution layer. That's where the secret sauce resides.

GigaOm's Derrick Harris posits that Odiago points to "the future of Hadoop-based products." Rather than having to "roll your own" Hadoop solutions, future Hadoop users will be able to build their apps to tap into other products that do the "heavy lifting."

Hortonworks launches its data platform

Hadoop company Hortonworks, which spun out of Yahoo earlier this year, officially announced its products and services this week. The Hortonworks Data Platform is an open source distribution powered by Apache Hadoop. It includes the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase and Zookeeper, as well as HCatalog and open APIs for integration. The Hortonworks Data Platform also includes Ambari, another Apache project, that will serve as the Hadoop installation and management system.

It's possible Hortonworks' efforts will pick up the pace of the Hadoop release cycle and address what ReadWriteWeb's Scott Fulton sees as the "degree of fragmentation and confusion." But as GigaOm's Derrick Harris points out, there is still "so much Hadoop in so many places," with multiple companies offering their own Hadoop solutions.

Big education content meets big education data

A couple of weeks ago, the adaptive learning startup Knewton announced that it had raised an additional $33 million. This latest round was led by Pearson, the largest education company in the world. As such, the announcement this week that Knewton and Pearson are partnering is hardly surprising.

But this partnership does mark an important development for big data, textbook publishing, and higher education.

Knewton's adaptive learning platform will be integrated with Pearson's digital courseware, giving students individualized content as they move through the materials. To begin with, Knewton will work with just a few of the subjects within Pearson's MyLab and Mastering catalog. There are more than 750 courses in that catalog, and the adaptive learning platform will be integrated with more of them soon. The companies also say they plan to "jointly develop a line of custom, next-generation digital course solutions, and will explore new products in the K12 and international markets."

The data from Pearson's vast student customer base — some 9 million higher ed students use Pearson materials — will certainly help Knewton refine its learning algorithms. In turn, the promise of adaptive learning systems means that students and teachers will be able to glean insights from the learning process — what students understand, what they don't — in real time. It also means that teachers can provide remediation aimed at students' unique strengths and weaknesses.


October 27 2011

Strata Week: IBM puts Hadoop in the cloud

Here are a few of the data stories that caught my attention this week.

IBM's cloud-based Hadoop offering looks to make data analytics easier

At its conference in Las Vegas this week, IBM made a number of major big-data announcements, including making its Hadoop-based product InfoSphere BigInsights available immediately via the company's SmartCloud platform. InfoSphere BigInsights was unveiled earlier this year, and it is hardly the first offering that Big Blue is making to help its customers handle big data. The last few weeks have seen other major players also move toward Hadoop offerings — namely Oracle and Microsoft — but IBM is offering its service in the cloud, something that those other companies aren't yet doing. (For its part, Microsoft does say that a Hadoop service will come to Azure by the end of the year.)

IBM joins Amazon Web Services as the only other company currently offering Hadoop in the cloud, notes GigaOm's Derrick Harris. "Big data — and Hadoop, in particular — has largely been relegated to on-premise deployments because of the sheer amount of data involved," he writes, "but the cloud will be a more natural home for those workloads as companies begin analyzing more data that originates on the web."

Harris also points out that IBM's Hadoop offering is "fairly unique" insofar as it targets businesses rather than programmers. IBM itself contends that "bringing big data analytics to the cloud means clients can capture and analyze any data without the need for Hadoop skills, or having to install, run, or maintain hardware and software."

Cleaning up location data with Factual Resolve

The data platform Factual launched a new API for developers this week that tackles one of the more frustrating problems with location data: incomplete records. Called Factual Resolve, the new offering is, according to a company blog post, an "entity resolution API that can complete partial records, match one entity against another, and aid in de-duping and normalizing datasets."

Developers using Resolve tell it what they know about an entity (say, a venue name) and the API can return the rest of the information that Factual knows based on its database of U.S. places — address, category, latitude and longitude, and so on.
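To make the idea of completing a partial record concrete, here is a minimal Python sketch of entity resolution against a tiny in-memory table. It is not Factual's API: the `PLACES` records, the `resolve` function, and every field value are hypothetical and made up for illustration, and a real service would use fuzzy matching over a far larger database.

```python
# Hypothetical, made-up records standing in for a places database.
PLACES = [
    {"name": "Sample Cafe", "address": "1 Main St", "city": "Springfield",
     "category": "Cafe", "latitude": 10.0, "longitude": 20.0},
    {"name": "Sample Cafe", "address": "9 Oak Ave", "city": "Shelbyville",
     "category": "Cafe", "latitude": 11.0, "longitude": 21.0},
    {"name": "Example Diner", "address": "5 Elm Rd", "city": "Springfield",
     "category": "Restaurant", "latitude": 10.5, "longitude": 20.5},
]


def normalize(value):
    """Lowercase and collapse whitespace so 'Sample  Cafe' matches 'sample cafe'."""
    return " ".join(str(value).lower().split())


def resolve(partial, places=PLACES):
    """Return the one record that agrees with every known field.

    Returns None when nothing matches or when the partial record is
    ambiguous (more than one candidate).
    """
    matches = [
        place for place in places
        if all(normalize(place.get(key, "")) == normalize(value)
               for key, value in partial.items())
    ]
    return matches[0] if len(matches) == 1 else None


full = resolve({"name": "sample cafe", "city": "Springfield"})
ambiguous = resolve({"name": "Sample Cafe"})
```

Here the name alone is ambiguous because two records share it, so `resolve` declines to guess; adding the city narrows the match to a single record whose remaining fields can then be filled in.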

Tyler Bell, Factual's director of product, discussed the intersection of location and big data in an interview at this year's Where 2.0 conference.

Google and governments' data requests

As part of its efforts toward better transparency, Google has updated its Government Requests tool this week with information about the number of requests the company has received for user data since the beginning of 2011.

This is the first time that Google is disclosing not just the number of requests, but the number of user accounts specified as well. It's also made the raw data available so that interested developers and researchers can study and visualize the information.

According to Google, requests from U.S. government officials for content removal were up 70% in this reporting period (January-June 2011) versus the previous six months, and the number of user data requests was up 29%. Google also says it received requests from local law enforcement agencies to take down various YouTube videos — one on police brutality, one that was allegedly defamatory — but that it did not comply. Of the 5,950 user data requests (impacting some 11,000 user accounts) submitted between January and June 2011, however, Google says that it complied with 93%, either fully or partially.

The U.S. was hardly the only government making an increased number of requests to Google. Spain, South Korea, and the U.K., for example, also made more requests. Several countries, including Sri Lanka and the Cook Islands, made their first requests.


October 20 2011

Strata Week: A step toward personal data control

Here are a few of the data stories that caught my attention this week.

Your data in your locker

Earlier this month, John Battelle wrote a post on his blog where he wished for a service to counter the ways in which all our personal data is scattered across so many applications and devices. He was looking for a tool that would pull together the data from these various places into something that "queries all my various social actions and curates them into one publicly addressable instance independent of any larger platform like AOL, Facebook, Apple, or Google ... I'm pretty sure this is what Singly and the Locker Project will make theoretically possible."

Battelle and Singly's Jason Cavnar discussed the Locker Project in more detail in another post on Battelle's blog this week.

As Cavnar argued:

Data doesn't do us justice. This is about LIFE. Our lives. Or as our colleague Lindsay (@lschutte) says — 'your story.' Not data. Data is just a manifestation of the actual life we are leading. Our data (story) should be ours to own, remember, re-use, discover with and share.

If that sounds appealing then there's good news ahead. Singly 1.0 begins its roll-out to developers this week, as ReadWriteWeb's Marshall Kirkpatrick reports. Developers will be able to build apps that "search, sort and visualize contacts, links and photos that have been published by their own accounts on various social networks but also by all the accounts they are subscribed to there." For now, the apps will live and deploy on GitHub. There are also several restrictions as far as using other people's apps: for example, you can only do so to visualize your own data.

Even with limitations, Singly is a first step in a much-anticipated and hugely important move toward personal data control.

Bad graphics and good data journalism

Last week, New York Times senior software architect Jacob Harris issued a challenge to the growing number of data journalists. Want to visualize your work? Avoid word clouds.

Word clouds are, he argued, much like tag clouds before them: "the mullets of the Internet." That is, taking a particular dataset and merely visualizing the frequency of words therein with tools such as Wordle is simply "filler visualization." (And Harris said it's also personally painful to the NYT data science team.)

Harris pointed to the numerous problems with utilizing word clouds as the sole form of textual analysis. At the very least, they only take advantage of word frequency, which doesn't necessarily tell you that much:

For starters, word clouds support only the crudest sorts of textual analysis, much like figuring out a protein by getting a count only of its amino acids. This can be wildly misleading; I created a word cloud of Tea Party feelings about Obama, and the two largest words were implausibly "like" and "policy," mainly because the importuned word "don't" was automatically excluded. (Fair enough: Such stopwords would otherwise dominate the word clouds.) A phrase or thematic analysis would reach more accurate conclusions. When looking at the word cloud of the War Logs, does the equal sizing of the words "car" and "blast" indicate a large number of reports about car bombs or just many reports about cars or explosions? How do I compare the relative frequency of lesser-used words? Also, doesn't focusing on the occurrence of specific words instead of concepts or themes miss the fact that different reports about truck bombs might be use the words "truck," "vehicle," or even "bongo" (since the Kia Bongo is very popular in Iraq)?
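The stopword problem Harris describes can be reproduced in a few lines. The Python sketch below is illustrative only: the `COMMENTS` sentences and the short `STOPWORDS` set are invented for the example and are not Harris's data or any tool's actual stopword list. It compares single-word counts, which is all a word cloud sees, with counts of adjacent word pairs, a very crude stand-in for the phrase analysis he recommends.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real tools ship much longer ones.
STOPWORDS = {"i", "the", "his", "a", "and", "don't"}

# Invented sample sentences, most of them negative.
COMMENTS = [
    "I don't like his policy",
    "I don't like the policy",
    "I like the policy",
]


def tokens(text):
    """Split text into lowercase words, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(texts):
    """Count single words after dropping stopwords (what a word cloud shows)."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokens(text) if t not in STOPWORDS)
    return counts


def bigram_counts(texts):
    """Count adjacent word pairs, keeping stopwords so negation survives."""
    counts = Counter()
    for text in texts:
        words = tokens(text)
        counts.update(zip(words, words[1:]))
    return counts


singles = word_counts(COMMENTS)
pairs = bigram_counts(COMMENTS)
```

In `singles`, "like" and "policy" are tied for the top spot and "don't" has vanished, which would suggest approval. In `pairs`, the pair ("don't", "like") appears twice, more often than ("i", "like"), so the negation that the single-word view discarded is visible again.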

The Guardian's Simon Rogers responded to Harris. Rogers acknowledged there are plenty of poor visualizations out there, but he added an important point:

Calling for better graphics is also like calling for more sunshine and free chocolate — who's going to disagree with that? What they do is ignore why people produce their own graphics. We often use free tools because they are quick and tell the story simply. But, when we have the time, nothing beats having a good designer create something beautiful — and the Guardian graphics team produces lovely visualisation for the Datablog all the time — such as this one. What is the alternative online for those who don't have access to a team of trained designers?

That last question is crucial, particularly as not everyone has access to designers or software to be able to do much more with their data than create simple visualizations (e.g., word clouds). Rogers said that it's probably fine to have a lot of less-than-useful graphics, because, if nothing else, it "shows that data analysis is part of all our lives now, not just the preserve of a few trained experts handing out pearls of wisdom."

Mary Meeker examines the global growth of mobile data

Among the most-anticipated speakers at Web 2.0 Summit this week was Mary Meeker. The former Morgan Stanley analyst and now partner at Kleiner Perkins gave her annual "Internet Trends" presentation, which is always chock full of data.

Meeker noted that 81% of users of the top 10 global Internet properties come from outside the U.S. Furthermore, in the last three years alone, China has added more Internet users than there are in all of the United States (246 million new Chinese users online versus 244 million total U.S. users online). Although companies like Apple, Amazon, and Google continue to dominate, Meeker pointed out that some of the largest and fastest-growing Internet companies are also based outside the U.S. — Chinese companies like Baidu and Tencent, for example, and Russian companies like And beyond just market value, she pointed to global innovations, such as Sweden's Spotify and Israel's Waze.

The growth in Internet usage continues to be in mobile. Meeker highlighted the global scale and spread of mobile growth, noting that it's in countries like Turkey, India, Brazil and China where we are seeing the largest year-over-year expansion in mobile subscribers.

Suggesting that it may be time to reevaluate Maslow's hierarchy of needs, Meeker posited that Internet access is rapidly becoming a crucial need that sits at the top of a new hierarchy.

Apache Cassandra reaches 1.0

The Apache Software Foundation announced this week the release of Cassandra v1.0.

Cassandra, originally developed by Facebook to power its Inbox Search, was open sourced in 2008. Although it's been a top-level Apache project for more than a year now, the 1.0 release marks Cassandra's maturity and readiness for more widespread implementation. The technology has been adopted beyond Facebook by companies like Cisco, Cloudkick, Digg, Reddit, Twitter and Walmart Labs.

Of course, Cassandra is just one of many non-relational databases on the market, with the most recent addition coming from Oracle. But Jonathan Ellis, the vice president of the Apache Cassandra project, explained to PCWorld why Cassandra remains competitive:

[Its] architecture is suited for multi-data center environments, because it does not rely on a leader node to coordinate activities of the database. Data can be written to a local node, thereby eliminating the additional network communications needed to coordinate with a sometimes geographically distant master node. Also, because Cassandra is a column-based storage engine, it can store richer data sets than the typical key-value storage engine.


October 13 2011

Strata Week: Simplifying MapReduce through Java

Here are a few of the data stories that caught my attention this week:

Crunch looks to make MapReduce easier

Despite the growing popularity of MapReduce and other data technologies, there's still a steep learning curve associated with these tools. Some have even wondered if they're worth introducing to programming students.

All of this makes the introduction of Crunch particularly good news. Crunch is a new Java library from Cloudera that's aimed at simplifying the writing, testing, and running of MapReduce pipelines. In other words, developers won't need to write a lot of custom code or libraries, which as Cloudera data scientist Josh Willis points out, "is a serious drain on developer productivity."

He adds that:

Crunch shares a core philosophical belief with Google's FlumeJava: novelty is the enemy of adoption. For developers, learning a Java library requires much less up-front investment than learning a new programming language. Crunch provides full access to the power of Java for writing functions, managing pipeline execution, and dynamically constructing new pipelines, obviating the need to switch back and forth between a data flow language and a real programming language.
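For readers new to the model that Crunch wraps, the following is a minimal, single-process Python sketch of the map, shuffle, and reduce steps. It is not Crunch's API (Crunch is a Java library) and it does not run on Hadoop; `run_pipeline`, `word_mapper`, and `sum_reducer` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def run_pipeline(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass in memory.

    mapper(record) yields (key, value) pairs; the shuffle step groups
    values by key; reducer(key, values) folds each group to one result.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up the counts collected for one word."""
    return sum(values)


counts = run_pipeline(["big data", "big pipelines"], word_mapper, sum_reducer)
```

The point of a library in this style is that the mapper and reducer are ordinary functions in the host language, so they can be tested and composed without switching to a separate data flow language.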

The Crunch library has been released under the Apache license, and its code is available for download.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Querying the web with Datafiniti

Datafiniti launched this week into public beta, calling itself the "first search engine for data." That might just sound like a nifty startup slogan, but when you look at what Datafiniti queries and how it works, the engine begins to look profoundly ambitious and important.

Datafiniti enables its users to enter a search query (or make an API call) against the web. Or, that's the goal at least. As it stands, Datafiniti lets users make calls about location, products, news, real estate, and social identity. But that's a substantial number of datasets, using information that's publicly available on the web.

Although Datafiniti demands you enter SQL parameters, it tries to make the process of doing so fairly easy, with a guide that pops up beneath the search box to help you phrase things properly. That interface is just one of the indications that Datafiniti is making a move to help democratize big data search.

The company grew out of a previous startup named 80Legs. As Shion Deysarker, founder of Datafiniti, told me, it was clear that the web-crawling services provided by 80Legs were really just being utilized to ask specific queries. Things like, what's the average listing price for a home in Houston? How many times has a brand name been mentioned on Twitter or Facebook over the last few months? And so on.
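A question like the Houston one maps naturally onto a SQL aggregate. The Python sketch below uses the standard library's `sqlite3` module with an in-memory table to show the shape of such a query; the `listings` table, its columns, and the prices are hypothetical and made up for the example, and none of this reflects Datafiniti's actual schema or query syntax.

```python
import sqlite3

# In-memory database with a hypothetical listings table and invented prices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (city TEXT, price REAL)")
conn.executemany(
    "INSERT INTO listings (city, price) VALUES (?, ?)",
    [("Houston", 200000.0), ("Houston", 300000.0), ("Austin", 400000.0)],
)


def average_listing_price(connection, city):
    """Return the mean listing price for a city, or None if it has no rows."""
    row = connection.execute(
        "SELECT AVG(price) FROM listings WHERE city = ?", (city,)
    ).fetchone()
    return row[0]


houston_average = average_listing_price(conn, "Houston")
```

With the invented rows above, `houston_average` is the mean of the two Houston prices; a city with no listings yields `None`, because `AVG` over zero rows returns NULL.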

Deysarker frames Datafiniti in terms of data access, arguing that until now a few providers have controlled the data. The startup wants to help developers and companies overcome both access and expense issues associated with gathering, processing, curating and accessing datasets. It plans to offer both subscription-based and unit-based pricing.

Keep tabs on the Large Hadron Collider from your smartphone

New apps don't often make it into my data news roundup, but it's hard to ignore this one: LHSee is an Android app from the University of Oxford that delivers data directly from the ATLAS experiment at CERN. The app lets you see data from collisions at the Large Hadron Collider.

The ATLAS experiment describes itself as an effort to learn about "the basic forces that have shaped our Universe since the beginning of time and that will determine its fate. Among the possible unknowns are the origin of mass, extra dimensions of space, unification of fundamental forces, and evidence for dark matter candidates in the Universe."

The LHSee app provides detailed information on how CERN and the Large Hadron Collider work. It also offers a "Hunt the Higgs Boson" game as well as opportunities to watch 3-D collisions streamed live from CERN. The app is available for free through the Android Market.


October 06 2011

Strata Week: Oracle's big data play

Here are the data stories that caught my attention this week:

Oracle's big data week

Eyes have been on Oracle this week as it holds its OpenWorld event in San Francisco. The company has made a number of major announcements, including unveiling its strategy for handling big data. This includes its Big Data Appliance, which will use a new Oracle NoSQL database as well as an open-source distribution of Hadoop and R.

Edd Dumbill examined the Oracle news, arguing that "it couldn't be a plainer validation of what's important in big data right now or where the battle for technology dominance lies." He notes that whether one is an Oracle customer or not, the company's announcement "moves the big data world forward," pointing out that there is now a de facto agreement that Hadoop and R are core pieces of infrastructure.

GigaOm's Derrick Harris reached out to some of the startups who also offer these core pieces, including Norman Nie, the CEO of Revolution Analytics, and Mike Olson, CEO of Cloudera. Not surprisingly perhaps, the startups are "keeping brave faces, but the consensus is that Oracle's forays into their respective spaces just validate the work they've been doing, and they welcome the competition."

Oracle's entry as a big data player also brings competition to others in the space, such as IBM and EMC, as all the major enterprise providers wrestle to claim supremacy over whose capabilities are the biggest and fastest. And the claim that "we're faster" was repeated over and over by Oracle CEO Larry Ellison as he made his pitch to the crowd at OpenWorld.

Who wrote Hadoop?

As ReadWriteWeb's Joe Brockmeier notes, ascertaining the contributions to open-source projects is sometimes easier said than done. Who gets credit — companies or individuals — can be both unclear and contentious. Such is the case with a recent back-and-forth between Cloudera's Mike Olson and Hortonworks' Owen O'Malley over who's responsible for the contributions to Hadoop.

O'Malley wrote a blog post titled "The Yahoo! Effect," which, as the name suggests, describes Yahoo's legacy and its continuing contributions to the Hadoop core. O'Malley argues that "from its inception until this past June, Yahoo! contributed more than 84% of the lines of code still in Apache Hadoop trunk." (Editor's note: The link to "trunk" was inserted for clarity.) O'Malley adds that so far this year, the biggest contributors to Hadoop are Yahoo! and Hortonworks.

Lines of code contributed to Apache Hadoop trunk (from Owen O'Malley's post, "The Yahoo! Effect")

That may not be a surprising argument to hear from Hortonworks, the company that was spun out of Yahoo! earlier this year to focus on the commercialization and development of Hadoop.

But Cloudera's Mike Olson challenges that argument — again, not a surprise, as Cloudera has long positioned itself as a major contributor to Hadoop, a leader in the space, and of course now the employer of former Yahoo! engineer Doug Cutting, the originator of the technology. Olson takes issue with O'Malley's calculations and in a blog post of his own, contends that these calculations don't accurately take into account the companies that people now work for:

Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today — at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop — the story is much more interesting.

Olson also argues that it isn't simply a matter of who's contributing to the Apache Hadoop core, but rather who is working on:

... the broader ecosystem of projects. That ecosystem has exploded in recent years, and most of the innovation around Hadoop is now happening in new projects. That's not surprising — as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use.


September 29 2011

Strata Week: Facebook builds a new look for old data

Here are a few of the data stories that caught my attention this week.

Facebook data and the "story of your life"

Last week at its F8 developer conference, Facebook announced several important changes, including updates to its Open Graph that enable what it calls "frictionless sharing" as well as a more visual version of users' profiles — the "Timeline." As is always the case with a Facebook update, particularly one that involves a new UI, there have been a number of vocal responses. And not surprisingly, given Facebook's history, there have been a slew of questions raised about how the changes will impact users' privacy.

Some of those concerns stem from the fact that now, with the Timeline, a person's entire (Facebook) history can be accessed and viewed far more easily. On stage at F8, CEO Mark Zuckerberg described the Timeline as a way to "curate the story of your life." But whether or not you view the new visual presentation of your Facebook profile with such grand, sweeping terms, it's clear that the new profile is a way for Facebook to re-present user data. Some of this data may be things that would have otherwise been forgotten — not just banal status updates, but progress in games, friendships made, relationships broken and so on. As Facebook describes it:

The way your profile works today, 99 percent of the stories you share vanish. The only way to find the posts that matter is to click 'Older Posts' at the bottom of the page. Again. And again. Imagine if there was an easy way to rediscover the things you shared, and collect all your best moments in a single place.

That new way was helped, in part, by Facebook's hiring earlier this year of data visualization experts Nicholas Felton and Ryan Case. But turning old Facebook data into new user profiles has caused some consternation, including the insistence by Silicon Filter's Frederic Lardinois that "sorry, Facebook, but the stuff I share on your site is not the story of my life."

But it wasn't an announcement from the stage at F8 that raised the most questions about Facebook data this week. Rather, it was a post by Nik Cubrilovic arguing that "logging out of Facebook is not enough." Cubrilovic discovered that even if you log out of Facebook, its cookies persist. "With my browser logged out of Facebook," he wrote, "whenever I visit any page with a Facebook like button, or share button, or any other widget, the information, including my account ID, is still being sent to Facebook. The only solution to Facebook not knowing who you are is to delete all Facebook cookies."

Facebook responded to Cubrilovic and addressed the issue so that upon logout the cookie containing a user's ID is destroyed.

Faster than the speed of light?

Just as some tech pundits were busy debating whether the latest changes to Facebook had "changed everything," an observation by the physicists at CERN also prompted many to say the same thing — "this changes everything" — in terms of what we know about physics and relativity.

But not so fast. The scientists at the particle accelerator in Switzerland have been measuring how neutrinos change their properties as they travel. According to their measurements (some three years' worth of data and 15,000 calculations), the neutrinos appeared to have traveled from Geneva to Gran Sasso, Italy, faster than the speed of light. According to Einstein, that's not possible: nothing can exceed the speed of light.

CERN researchers released the data in hopes that other scientists can help shed some light on the findings. The scientists at Fermilab are among those who will pore over the information.

For those who need a little brushing up on their physics, an article at CNET has a good illustration of the experiment. High school physics teacher John Burk also has a great explanation of the discovery and the calculations behind it as well as insights into why the discovery and the discussions demonstrate good science (but in some cases, lousy journalism).

Data and baseball (and Brad Pitt)

The film "Moneyball," based on the 2003 bestseller by Michael Lewis, was released this past week. It's the story of Oakland Athletics general manager Billy Beane — played in the film by Brad Pitt — who helped revive the franchise by using data analysis to assemble his team.

Of course, stats and baseball have long gone hand in hand, but Beane argued that the on-base percentage (OBP) was a better indicator of a player's strengths than just batting average. By looking at the numbers that other teams weren't, Beane was able to assemble a team made up of players that other teams viewed as less valuable. (And by extension, of course, this helped Beane assemble that team at a much lower price.)
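The difference between the two statistics is easy to see with the standard formulas: batting average is hits divided by at-bats, while on-base percentage also credits walks and hit-by-pitches and counts sacrifice flies in the denominator. The Python sketch below applies both to two hypothetical players whose stat lines are invented for illustration and do not describe any real player.

```python
def batting_average(hits, at_bats):
    """Hits divided by at-bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """(H + BB + HBP) / (AB + BB + HBP + SF)."""
    reached_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached_base / opportunities


# Invented season lines for two hypothetical players.
player_a = {"hits": 150, "walks": 20, "hit_by_pitch": 0,
            "at_bats": 500, "sacrifice_flies": 5}
player_b = {"hits": 130, "walks": 90, "hit_by_pitch": 5,
            "at_bats": 500, "sacrifice_flies": 5}

avg_a = batting_average(player_a["hits"], player_a["at_bats"])
avg_b = batting_average(player_b["hits"], player_b["at_bats"])
obp_a = on_base_percentage(**player_a)
obp_b = on_base_percentage(**player_b)
```

Player A has the higher batting average (.300 against .260), but player B draws far more walks and so reaches base more often (an on-base percentage of .375 against roughly .324). A team that looks only at batting average would undervalue player B.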

Thanks, in part, to Pitt's star power, a story about data making a difference is a Hollywood hit, and the movie's release has spurred others to ask if that sort of strategy can work elsewhere. In a post called "Moneyball for Startups," SplatF's Dan Frommer asked if it would be applicable to tech investing, a question that received a number of interesting follow-up discussions from around the web.

In a recent webcast, "Codermetrics" author Jonathan Alexander examined software teams through a "Moneyball" lens.


September 22 2011

Strata Week: Crowdsourcing and gaming spur a scientific breakthrough

Here are a few of the data stories that caught my attention this week.

Crowdsourcing and gaming help in the fight against HIV

Players of the online protein-folding game Foldit have solved, in three weeks' time, a scientific problem that has stumped researchers for more than a decade. Scientists have been trying to figure out the structure of a protein-cutting enzyme from an AIDS-like virus but, failing to do so, turned the information over to Foldit players, challenging them to see if they could produce an accurate model.

"We wanted to see if human intuition could succeed where automated methods had failed," Dr. Firas Khatib of the University of Washington Department of Biochemistry told Science Daily. And indeed, it did.

The goal was to work out the three-dimensional structure of different proteins. Players, most of whom were not trained scientists, competed with one another and were scored based on the stability of what they built. But they could also work together on solving the various puzzles. And, in this case, by playing, the gamers generated models that were good enough for the researchers to determine the enzyme's actual structure. This included elements that could be targeted by drugs that could take on the enzyme.

Twitter open sources Storm and acquires Julpan

As Twitter indicated it would do last month, the company has open sourced Storm, its Hadoop-like, real-time data processing tool. Storm was developed by BackType, which Twitter acquired earlier this year. Twitter engineer Nathan Marz, formerly the lead engineer at BackType, made the open source release official at the Strange Loop developer conference. Along with the code, there's extensive documentation of the project, as well as other resources Marz lists on a Hacker News thread about the project.

The open sourcing of Storm wasn't the only data news from Twitter this week. The company has also acquired Julpan, a New York City-based startup that analyzes real-time data collected from the social web.

The acquisition is the latest in a series of moves by Twitter to build out its own analytics capabilities — moves that include the acquisition of BackType — to analyze the more than 200 million Tweets that are now posted per day.

Julpan is headed by former Google data scientist Ori Allon. Allon built "Orion," a search algorithm that became a key part of Google's search relevancy efforts when the company acquired the rights to it in 2006. Allon left Google in 2010 to found Julpan.

The politics of search

Google Chairman Eric Schmidt testified before the Senate Judiciary Subcommittee on Antitrust, Competition Policy and Consumer Rights yesterday — a hearing that GigaOm's Stacey Higginbotham said demonstrated a "fundamental conflict of cultures" between Silicon Valley and Washington DC.

The purpose of the Senate hearing is to investigate Google's search practices and to ascertain whether Google's dominance over search and search advertising warrants an antitrust response from the government. As Senator Patrick Leahy put it, the hearings are meant to see whether "Google is in a position to determine who will succeed and fail on the Internet." Many of the questions from the senators involved how Google handles search ranking. Senator Mike Lee accused the company of cooking search results, an accusation that Schmidt denied.

"First, we built search for users, not websites," Schmidt testified. "And no matter what we do, there will always be some websites unhappy with where they rank. Search is subjective, and there's no 'correct' set of search results. Our scientific process is designed to provide the answers that consumers will find most useful."

GigaOm's Higginbotham describes what she sees as a clash of cultures between the Senators and Google — and between politics and algorithms — thusly:

Schmidt, like any computer scientist, tried to argue that the algorithms do what they are supposed to do. From a computer science view, if an algorithm is fair, then changing to protect a certain class of those affected by it makes it fundamentally unfair to others (something Congress routinely does with exceptions and carve outs when it's making legislation). In fact, the biggest elephant in the room was a clash of cultures between the Silicon Valley culture of the free market — and using technology to create a better consumer experience — and Washington D.C.'s inherent cynicism and pandering to constituents.

Strata Conference in New York

O'Reilly's Strata Conference has been going on this week in New York City, with Strata Jumpstart and Strata Summit kicking off the week.

Video and speaker slides from earlier in the week are available online, and you can watch streaming video from the rest of the week's events.

Got data news?

Feel free to email me.


September 15 2011

Strata Week: Investors circle big data

This was a busy week for data stories. Here are a few that caught my attention:

Big money for big data

There's recently been a steady stream of funding news for big data, database, and data mining companies. Last Thursday, Hadoop-based data analytics startup Platfora raised $5.7 million from Andreessen Horowitz. On Monday, 10gen announced it had raised $20 million for MongoDB, its open-source, NoSQL database. On Tuesday, Xignite said it had raised $10 million to build big data repositories for financial organizations; data storage provider Zetta announced a $9 million round; and Walmart announced it had acquired the ad targeting and data mining startup OneRiot (the terms of the deal were not disclosed). Finally, yesterday, big data analytics company Opera Solutions announced that it had raised a whopping $84 million in its first round of funding.

GigaOm's Derrick Harris offers the story behind Opera Solutions' massive round of funding, noting that the company was already growing fast and doing more than $100 million per year in revenue. He also points to the company's penchant for hiring PhDs (90 so far), "something that makes it more akin to blue-chipper IBM than to many of today's big data startups pushing Hadoop or NoSQL technologies." Harris also notes that at a half-billion-dollar valuation and with 600-plus employees, Opera Solutions isn't a great acquisition target for other big companies, even those wanting to beef up their analytics offerings. He contends this could allow Opera Solutions to remain independent and perhaps make some acquisitions of its own.

Ushahidi and Wikipedia team up for WikiSweeper

The crisis-mapping platform Ushahidi unveiled a new tool this week to help Wikipedia editors track changes and verify sources on articles. The project, called WikiSweeper, is aimed at those highly and rapidly edited articles that are associated with major events.

As Ushahidi writes on its blog:

When a globally-relevant news story breaks, relevant Wikipedia pages are the subject of hundreds of edits as events unfold. As each editor looks to editing and maintaining the quality and credibility of the page, they need to manually track the news cycle, each using their own spheres of reference. The decisions that are made to accept one source while rejecting others remains opaque, as are the strategies that editors develop to alert and keep track of the latest information coming in from a variety of different sources.

WikiSweeper is based on Ushahidi's own open-source Sweeper tool, and its application to Wikipedia will help Ushahidi in turn build out its own project. After all, during major events, information comes in from multiple sources at a breakneck pace, and in crisis response, the accuracy and trustworthiness of the sources need to be quickly and transparently identified. As Ushahidi points out, this makes it a "win-win" for both organizations as they gain better tools for dealing with real-time news and social data.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

Angry Birds take down pigs and the economy

Invoking the seasonal declarations come March about the amount of time Americans waste at work watching the NCAA college basketball tournament, The Atlantic's Alexis Madrigal has pointed to a far more insidious and year-round problem: the number of hours American workers lose by playing Angry Birds.

Drawing on data about the number of minutes people spend playing Angry Birds per day — 200 million — Madrigal has calculated the resulting lost hours and lost wages. He estimates about 43,333,333 on-the-clock hours are spent playing Angry Birds each year, accounting for $1.5 billion in lost wages per year.

Obviously there are some really big assumptions in this calculation. The first is that five percent of the total Angry Bird hours are played by Americans at work ... we don't know the international breakdown, nor do we know how often people play at work. But, five percent seemed like a reasonable assumption. Second, the Pew income data for smartphone ownership is not that precise, particularly on the upper ($75,000+) and lower (less than $30,000) ends. I had to pick numbers, so I basically split Americans up into four categories: people earning $30,000, $50,000, $75,000, and $100,000, then I calculated simple hourly wages for those groups (income/52/40) and did a weighted average based on smartphone adoption in those categories. The $35 per hour number I used is comparable with the $38 that Challenger, Gray, and Christmas used for fantasy sports players. But this is certainly a rough approximation. Put it this way: I bet this estimate is right to the order of magnitude, if not in the details.
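The headline figures can be roughly reproduced with a few lines of arithmetic. The sketch below is not Madrigal's actual worksheet. The 200 million minutes, the five percent share, the $35 hourly wage, and the income/52/40 conversion come from the text above; the 260 working days per year is an assumption, chosen because it reproduces the quoted 43,333,333 hours, and the function names are made up for illustration.

```python
# Inputs quoted in the article.
MINUTES_PER_DAY = 200_000_000   # total Angry Birds play per day
AT_WORK_SHARE = 0.05            # share assumed to be Americans at work
HOURLY_WAGE = 35                # weighted-average wage, in dollars

# Assumption (not stated in the article): 52 weeks of 5 working days.
WORKDAYS_PER_YEAR = 260


def lost_hours_per_year():
    """Estimate on-the-clock hours spent playing per year."""
    hours_per_day = MINUTES_PER_DAY / 60
    return hours_per_day * AT_WORK_SHARE * WORKDAYS_PER_YEAR


def lost_wages_per_year():
    """Convert the lost hours into dollars at the average wage."""
    return lost_hours_per_year() * HOURLY_WAGE


def simple_hourly_wage(annual_income):
    """The article's income/52/40 conversion to an hourly wage."""
    return annual_income / 52 / 40


if __name__ == "__main__":
    print(f"{lost_hours_per_year():,.0f} hours per year")
    print(f"${lost_wages_per_year() / 1e9:.2f} billion per year")
    for income in (30_000, 50_000, 75_000, 100_000):
        print(income, round(simple_hourly_wage(income), 2))
```

The weighted average across the four income groups is left out because the smartphone adoption weights are not given here.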

Take that, Gladwell

Malcolm Gladwell raised the ire of many social-media-savvy activists last year by claiming that "the revolution will not be tweeted." Writing in The New Yorker, Gladwell dismissed social media as a tool for change. He argued that bonds formed online are "weak" and unable to withstand the sorts of demands necessary for social change.

Gladwell's assertions have been countered in many places, and a new article analyzing social media's role in the Arab Spring takes the rebuttals to a new level.

"After analyzing over 3 million tweets, gigabytes of YouTube content and thousands of blog posts, a new study finds that social media played a central role in shaping political debates in the Arab Spring. Conversations about revolution often preceded major events on the ground, and social media carried inspiring stories of protest across international borders," the authors write.

The authors describe their research methodology for extracting and analyzing the texts from blogs and tweets, but they also lament some of the problems they faced, particularly with access to the Twitter archive.

Got data news?

Feel free to email me.


September 08 2011

Strata Week: MapReduce gets its arms around a million songs

Here are some of the data stories that caught my attention this week.

A million songs and MapReduce

Earlier this year, Echo Nest and LabROSA at Columbia University released the Million Song Dataset, a freely available collection of audio and metadata for a million contemporary popular music tracks. The purpose of the dataset, among other things, was to help encourage research on music algorithms. But as Paul Lamere, director of Echo Nest's Developer Platform, makes clear, getting started with the dataset can be daunting.

In a post on his Music Machinery blog, Lamere explains how to use Amazon's Elastic MapReduce to process the data. In fact, Echo Nest has loaded the entire Million Song Dataset onto a single S3 bucket. The bucket contains approximately 300 files, each with data on about 3,000 tracks. Lamere also points to a small subset of the data (just 20 tracks) available in a file on GitHub, and he has created helper code to parse track data and return a dictionary containing all of it.
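To give a feel for the approach, here is a minimal sketch of the map and reduce steps run in a single process. It is not Lamere's code and does not call Elastic MapReduce. The tab-separated field layout, the sample rows, and the function names are all hypothetical, since the real file format is not described here.

```python
from collections import defaultdict

# Hypothetical field layout; the dataset's real columns are not given above.
FIELDS = ("track_id", "artist_name", "title", "duration")


def parse_track(line):
    """Parse one tab-separated line into a dictionary of track fields."""
    track = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    track["duration"] = float(track["duration"])
    return track


def mapper(line):
    """Map step: emit an (artist, duration) pair for one track."""
    track = parse_track(line)
    yield track["artist_name"], track["duration"]


def reducer(artist, durations):
    """Reduce step: average the durations collected for one artist."""
    return artist, sum(durations) / len(durations)


def run_job(lines):
    """Run map, group-by-key, and reduce locally as a stand-in for a cluster."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up rows for illustration only.
SAMPLE = [
    "TR001\tArtist A\tSong One\t200.0",
    "TR002\tArtist A\tSong Two\t100.0",
    "TR003\tArtist B\tSong Three\t180.0",
]

if __name__ == "__main__":
    print(run_job(SAMPLE))
```

On a real cluster the grouping step is handled by the framework, and each of the roughly 300 files can be fed to a separate mapper.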

GPS steps in where memory fails

After decades of cycling without incident, The New York Times science writer John Markoff experienced what every cyclist dreads: a major crash, one that resulted in a broken nose, a deep gash on his knee, and road rash aplenty. He was knocked unconscious by the crash, unable to remember what had happened to cause it. In a recent piece in the NYT, he chronicled the steps he took to reconstruct the accident.

He did so by turning to the GPS data tracked by the Garmin 305 on his bicycle. Typically, devices like this are used to track the distance and location of rides as well as a cyclist's pedaling and heart rates. But as Markoff investigated his own crash, he found that the data stored in these types of devices can be used to ascertain what happens in cycling accidents.

In investigating his own memory-less crash, Markoff was able to piece together data about his trip:

My Garmin was unharmed, and when I uploaded the data I could see that in the roughly eight seconds before I crashed, my speed went from 30 to 10 miles per hour — and then 0 — while my heart rate stayed a constant 126. By entering the GPS data into Google Maps, I could see just where I crashed. I realized I did have several disconnected memories. One was of my hands being thrown off the handlebars violently, but I had no sense of where I was when it happened. With a friend, Bill Duvall, who many years ago also raced for the local bike club Pedali Alpini, I went back to the spot. La Honda Road cuts a steep and curving path through the redwoods. Just above where the GPS data said I crashed, we could see a long, thin, deep pothole. (It was even visible in Google's street view.) If my tire hit that, it could easily have taken me down. I also had a fleeting recollection of my mangled dark glasses, and on the side of the road, I stooped and picked up one of the lenses, which was deeply scratched. From the swift deceleration, I deduced that when my hands were thrown from the handlebars, I must have managed to reach my brakes again in time to slow down before I fell. My right hand was pinned under the brake lever when I hit the ground, causing the nasty road rash.

It's one thing for a rider to reconstruct his own accident, but Markoff says insurance companies are also starting to pay attention to this sort of data. As one lawyer notes in the Times article, "Frankly, it's probably going to be a booming new industry for experts."

Crowdsourcing and crisis mapping from WWI

The explosion of mobile, mapping, and web technologies has facilitated the rise of crowdsourcing during crisis situations, giving citizens and NGOs, among others, the ability to contribute to and coordinate emergency responses. But as Patrick Meier, director of crisis mapping and partnerships at Ushahidi, has found, there are examples of crisis mapping that pre-date our Internet age.

Meier highlights maps he discovered from World War I at the National Air and Space Museum, pointing to the government's request for citizens to help with the mapping process:

In the event of a hostile aircraft being seen in country districts, the nearest Naval, Military or Police Authorities should, if possible, be advised immediately by Telephone of the time of appearance, the direction of flight, and whether the aircraft is an Airship or an Aeroplane.

And he asks a number of very interesting questions: How often were these maps updated? What sources were used? And "would public opinion at the time have differed had live crowdsourced crisis maps existed?"

Got data news?

Feel free to email me.


September 01 2011

Strata Week: What happens when 200,000 hard drives work together?

Here are a few of the data stories that caught my attention this week.

IBM's record-breaking data storage array

IBM Research is building a new data storage array that's almost 10 times larger than anything that's been built before. The data array is made up of 200,000 hard drives working together, with a storage capacity of 120 petabytes (that's 120 million gigabytes). To give you some idea of the capacity of the new "drive," writes MIT Technology Review, "a 120-petabyte drive could hold 24 billion typical five-megabyte MP3 files or comfortably swallow 60 copies of the biggest backup of the Web, the 150 billion pages that make up the Internet Archive's WayBack Machine."
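These capacity figures are easy to sanity-check. The sketch below assumes decimal units (1 petabyte = 10^15 bytes), which is what makes the quoted numbers line up. The per-drive average and the implied size of one Web backup are derived here and are not stated in the source.

```python
# Decimal storage units (an assumption; binary units give other results).
MB = 10**6
GB = 10**9
PB = 10**15

CAPACITY_BYTES = 120 * PB   # total capacity of the array
DRIVE_COUNT = 200_000       # number of hard drives in the array
MP3_BYTES = 5 * MB          # a "typical five-megabyte MP3"
WEB_BACKUP_COPIES = 60      # copies of the Web backup the array could hold

capacity_gb = CAPACITY_BYTES // GB
mp3_files = CAPACITY_BYTES // MP3_BYTES
average_gb_per_drive = CAPACITY_BYTES // DRIVE_COUNT // GB
implied_backup_pb = CAPACITY_BYTES // WEB_BACKUP_COPIES // PB

if __name__ == "__main__":
    print(f"{capacity_gb:,} GB in total")
    print(f"{mp3_files:,} MP3 files")
    print(f"{average_gb_per_drive} GB per drive on average")
    print(f"{implied_backup_pb} PB per Web backup copy")
```

The per-drive figure is only an average of usable capacity over drive count; it says nothing about how much raw disk space goes to redundancy.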

Data storage at that scale creates a number of challenges, including — no surprise — cooling such a massive system. But other problems include handling failure, backups and indexing. The new storage array will benefit from other research that IBM has been doing to help boost supercomputers' data access. Its General Parallel File System was designed with this massive volume in mind. The GPFS spreads files across multiple disks so that many parts of a file can be read or written at once. This system already demonstrated that it can perform when it set a new scanning speed record last month by indexing 10 billion files in just 43 minutes.
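The idea of spreading a file across disks can be illustrated with simple round-robin striping. This is a generic sketch and not how GPFS is actually implemented; the function names, the block size, and the in-memory lists standing in for disks are all assumptions made for illustration.

```python
def stripe(data, num_disks, block_size):
    """Split data into fixed-size blocks and deal them out round-robin."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_number = offset // block_size
        disks[block_number % num_disks].append(data[offset:offset + block_size])
    return disks


def reassemble(disks):
    """Rebuild the original bytes by reading one block per disk in turn."""
    blocks = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


if __name__ == "__main__":
    striped = stripe(b"abcdefghij", num_disks=3, block_size=2)
    print(striped)
    print(reassemble(striped))
```

Because consecutive blocks land on different disks, a reader can fetch several parts of the same file at once instead of waiting on a single drive.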

IBM's new 120-petabyte drive was built at the request of an unnamed client that needed a new supercomputer for "detailed simulations of real-world phenomena."

Infochimps' new Geo API

The data marketplace Infochimps released a new Geo API this week, giving developers access to a number of disparate location-related datasets via one API with a unified schema.

According to Infochimps, the API addresses several pain points that those working with geodata face:

  1. Difficulty in integrating several different APIs into one unified app
  2. Lack of ability to display all results when zoomed out to a large radius
  3. Limitation of only being able to use lat/long

To address these issues, Infochimps has created a new simple schema to help make data consistent and unified when drawn from multiple sources. The company has also created a "summarizer" to intelligently cluster and better display data. And finally, it has also enabled the API to handle queries other than just those traditionally associated with geodata, namely latitude and longitude.
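As a rough illustration of the first two ideas, the sketch below maps records from two differently shaped sources onto one schema and then counts places per grid cell for a zoomed-out view. None of this is the Infochimps API: the source names, field names, sample places, and the grid-based clustering are all hypothetical stand-ins.

```python
import math
from collections import Counter


def normalize(record, source):
    """Convert a source-specific record into one shared schema."""
    if source == "source_a":
        return {"name": record["title"],
                "lat": record["latitude"],
                "lon": record["longitude"]}
    if source == "source_b":
        lat, lon = record["coords"]
        return {"name": record["label"], "lat": lat, "lon": lon}
    raise ValueError(f"unknown source: {source}")


def summarize(places, cell_degrees):
    """Count places per grid cell so a wide view shows clusters, not pins."""
    counts = Counter()
    for place in places:
        cell = (math.floor(place["lat"] / cell_degrees),
                math.floor(place["lon"] / cell_degrees))
        counts[cell] += 1
    return counts


# Made-up records for illustration only.
PLACES = [
    normalize({"title": "Cafe", "latitude": 40.71, "longitude": -73.99}, "source_a"),
    normalize({"label": "Museum", "coords": (40.76, -73.97)}, "source_b"),
    normalize({"label": "Pier", "coords": (34.05, -118.24)}, "source_b"),
]

if __name__ == "__main__":
    print(summarize(PLACES, cell_degrees=1.0))
```

Once every record shares the same keys, the summarizing step does not need to know which dataset a place came from.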

As we seek to pull together and analyze all types of data from multiple sources, this move toward a unified schema will become increasingly important.

Hurricane Irene and weather data

The arrival of Hurricane Irene last week reiterated the importance not only of emergency preparedness but of access to real-time data — weather data, transportation data, government data, mobile data, and so on.

Screenshot from the New York Times' interactive Hurricane Irene tracking map.

As Alex Howard noted here on Radar, crisis data is becoming increasingly social:

We've been through hurricanes before. What's different about this one is the unprecedented levels of connectivity that now exist up and down the East Coast. According to the most recent numbers from the Pew Internet and Life Project, for the first time, more than 50% of American adults use social networks. 35% of American adults have smartphones. 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. The growth of an Internet of things is an important evolution. What we're seeing this weekend is the importance of an Internet of people.

Got data news?

Feel free to email me.

Hard drive photo: Hard Drive by walknboston, on Flickr


August 25 2011

Strata Week: Green pigs and data

Here are a few of the data stories that caught my attention this week:

Predicting Angry Birds

Angry Birds maker Rovio will begin using predictive analytics technology from the Seattle-based company Medio to help improve game play for its popular pig-smashing game.

According to the press release announcing the partnership, Angry Birds has been downloaded more than 300 million times and is on course to reach 1 billion downloads. But it isn't merely downloaded a lot; it's played a lot, too. The game, which sees up to 1.4 billion minutes of game play per week, generates an incredible amount of data: user demographics, location, and device information are just a few of the data points.

Users' data has always been important in gaming, as game developers must refine their games to maximize the amount of time players spend as well as track their willingness to spend money on extras or to click on related ads. As casual gaming becomes a bigger and more competitive industry, game makers like Rovio will rely on analytics to keep their customers engaged.

As GigaOm's Derrick Harris notes, quoting Zynga's recent S-1 filing, this is already a crucial part of that gaming giant's business:

The extensive engagement of our players provides over 15 terabytes of game data per day that we use to enhance our games by designing, testing and releasing new features on an ongoing basis. We believe that combining data analytics with creative game design enables us to create a superior player experience.

By enlisting the help of Medio for predictive analytics, it's clear that Rovio is adopting that same tactic to improve the Angry Birds experience.

Unstructured data and HP's next chapter

HP made a number of big announcements last week as it revealed plans for an overhaul. These plans include ending production of its tablet and smartphones, putting the development of WebOS on hold, and spending some $10 billion to acquire the British enterprise software company Autonomy.

The New York Times described the shift in HP as a move to "refocus the company on business products and services," and the acquisition of Autonomy could help drive that via its big data analytics. HP's president and CEO Léo Apotheker said in a statement: "Autonomy presents an opportunity to accelerate our strategic vision to decisively and profitably lead a large and growing space ... Together with Autonomy, we plan to reinvent how both unstructured and structured data is processed, analyzed, optimized, automated and protected."

As MIT Technology Review's Tom Simonite puts it, HP wants Autonomy for its "math skills" and the acquisition will position HP to take advantage of the big data trend.

Founded in 1996, Autonomy has a lengthy history of analyzing data, with an emphasis on unstructured data. Citing an earlier Technology Review interview, Simonite quotes Autonomy founder Mike Lynch's estimate that about 85% of the information inside a business is unstructured. "[W]e are human beings, and unstructured information is at the core of everything we do," Lynch said. "Most business is done using this kind of human-friendly information."

Simonite argues that by acquiring Autonomy, HP could "take a much more dominant position in the growing market for what Autonomy's Lynch dubs 'meaning-based computing.'"

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

Using data to uncover stories for the Daily Dot

After several months of invitation-only testing, the web got its own official daily newspaper this week with the launch of The Daily Dot. CEO Nick White and founding editor Owen Thomas said the publication will focus on the news from various online communities and social networks.

GigaOm's Mathew Ingram gave The Daily Dot a mixed review, calling its focus on web communities "an interesting idea," but he questioned if the "home town newspaper" metaphor really makes sense. The number of kitten stories on the Daily Dot's front page aside, ReadWriteWeb's Marshall Kirkpatrick sees The Daily Dot as part of the larger trend toward data journalism, and he highlighted some of the technology that the publication is using to uncover the Web world's news, including Hadoop and assistance from Ravel Data.

"It's one thing to crawl, it's another to understand the community," Daily Dot CEO White told Kirkpatrick. "What we really offer is thinking about how the community ticks. The gestures and modalities on Reddit are very different from Youtube; it's sociological, not just math."

Got data news?

Feel free to email me.


August 18 2011

August 04 2011

Strata Week: Hadoop adds security to its skill set

Here are a few of the data stories that caught my eye this week.

Where big data and security collide

Could security be the next killer app for Hadoop? That's what GigaOm's Derrick Harris suggests: "The open-source, data-processing tool is already popular for search engines, social-media analysis, targeted marketing and other applications that can benefit from clusters of machines churning through unstructured data — now it's turning its attention to security data." Noting the universality of security concerns, Harris suggests that "targeted applications" using Hadoop might be a solid starting point for mainstream businesses to adopt the technology.

Juniper Networks' Chris Hoff has also analyzed the connections between big data and security in a couple of recent posts on his Rational Survivability blog. Hoff contends that while we've had the capabilities to analyze security-related data for some time, that's traditionally happened with specialized security tools, meaning that insights are "often disconnected from the transaction and value of the asset from which they emanate."

Hoff continues:

Even when we do start to be able to integrate and correlate event, configuration, vulnerability or logging data, it's very IT-centric. It's very INFRASTRUCTURE-centric. It doesn't really include much value about the actual information in use/transit or the implication of how it's being consumed or related to.

But as both Harris and Hoff argue, Hadoop might help address this, as it can handle all of an organization's unstructured data and can enable security analysis that isn't "disconnected." And both Harris and Hoff point to Zettaset as an example of a company that is tackling big data and security analysis by using Hadoop.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

What's your most important personal data?

Concerns about data security also occur at the personal level. To that end, The Locker Project, Singly's open source project to help people collect and control their personal data, recently surveyed people about the data they see as most important.

The survey asked people to choose from the following: contacts, messages, events, check-ins, links, photos, music, movies, or browser history. The results are in, and no surprise: photos were listed as the most important, with 37% of respondents (67 out of 179) selecting that option. Forty-six people listed their contacts, and 23 said their messages were most important.
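The reported counts translate into shares of the 179 respondents as follows. This is a small check using only the numbers quoted above; the other six categories are not broken out in the text, so they appear only as a combined remainder.

```python
TOTAL_RESPONDENTS = 179

# Counts quoted above; the remaining categories are not itemized.
COUNTS = {"photos": 67, "contacts": 46, "messages": 23}


def share_percent(count, total=TOTAL_RESPONDENTS):
    """Return a count as a whole-number percentage of all respondents."""
    return round(100 * count / total)


SHARES = {name: share_percent(count) for name, count in COUNTS.items()}
REMAINING = TOTAL_RESPONDENTS - sum(COUNTS.values())

if __name__ == "__main__":
    for name, percent in SHARES.items():
        print(f"{name}: {percent}%")
    print(f"all other categories combined: {REMAINING} respondents")
```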

Interestingly, browser history, events, and check-ins were rated the lowest. As Singly's Tom Longson ponders:

Do people not care about where they went? Is this data considered stale to most people, and therefore irrelevant? I personally believe I can create a lot of value from Browser History and Check-ins. For example, what websites are my friends going to that I'm not? Also, what places should I be going that I'm not? These are just a couple of ideas.

But just as revealing as the ranking of data were the reasons that people gave for why certain types were most important, as you can see in the word cloud created from their responses.

Word cloud from Singly's data survey. See Singly's associated analysis of this data.

House panel moves forward on data retention law

The U.S. Congress is in recess now, but among the last-minute things it accomplished before vacation was passage by the House Judiciary Committee of "The Protecting Children from Internet Pornographers Act of 2011." Ostensibly aimed at helping track pedophiles and pornographers online, the bill has raised a number of concerns about Internet data and surveillance. If passed, the law would require, among other things, that Internet companies collect and retain the IP addresses of all users for at least one year.

Representative Zoe Lofgren was one of the opponents of the legislation in committee, trying unsuccessfully to introduce amendments that would curb its data retention requirements. She also tried to have the name of the law changed to the "Keep Every American's Digital Data for Submission to the Federal Government Without a Warrant Act of 2011."

In addition to concerns over government surveillance, TechDirt's Mike Masnick and the Cato Institute's Julian Sanchez have also pointed to the potential security issues that could arise from lengthy data retention requirements. Sanchez writes:

If I started storing big piles of gold bullion and precious gems in my home, my previously highly secure apartment would suddenly become laughably insecure, without my changing my security measures at all. If a company significantly increases the amount of sensitive or valuable information stored in its systems — because, for example, a government mandate requires them to keep more extensive logs — then the returns to a single successful intrusion (as measured by the amount of data that can be exfiltrated before the breach is detected and sealed) increase as well. The costs of data retention need to be measured not just in terms of terabytes, or man hours spent reconfiguring routers. The cost of detecting and repelling a higher volume of more sophisticated attacks has to be counted as well.

New data from a very old map

And in more pleasant "storing old data" news: the Gough Map, the oldest surviving map of Great Britain, dating back to the 14th century, has now been digitized and made available online.

The project to digitize the map, which now resides in Oxford University's Bodleian Library, took 15 months to complete. According to the Bodleian, the project explored the map's "'linguistic geographies,' that is the writing used on the map by the scribes who created it, with the aim of offering a re-interpretation of the Gough Map's origins, provenance, purpose and creation of which so little is known."

Among the insights gleaned is the revelation that the text on the Gough Map is the work of at least two different scribes: one from the 14th century and a later one, from the 15th century, who revised some pieces. Furthermore, it was also discovered that the map was made closer to 1375 than 1360, the date often given to it.

Got data news?

Feel free to email me.

