February 02 2012

Strata Week: The Megaupload seizure and user data

Here are a few of the data stories that caught my attention this week.

Megaupload's seizure and questions about controlling user data

When the file-storage and sharing site Megaupload had its domain name seized, assets frozen and website shut down in mid-January, the U.S. Justice Department contended that the owners were operating a site dedicated to copyright infringement. But that posed a huge problem for those who were using Megaupload for the legitimate and legal storage of their files. As the EFF noted, these users weren't given any notice of the seizure, nor were they given an opportunity to retrieve their data.

Moreover, it seemed this week that those users would have all their data deleted, as Megaupload would no longer be able to pay its server fees.

While it appears that users have won a two-week reprieve before any deletion actually occurs, the incident does raise a number of questions about users' data rights and control in the cloud. Specifically: What happens to user data when a file hosting / cloud provider goes under? And how much time and notice should users have to reclaim their data?

Megaupload seizure notice
This is what you see when you visit

Bloomberg opens its market data distribution technology

The financial news and information company Bloomberg opened its market data distribution interface this week. The BLPAPI is available under a free-use license. According to the press release, some 100,000 people already use the BLPAPI, but with this week's announcement, the interface will be more broadly available.

The company introduced its Bloomberg Open Symbology back in 2009, a move to provide an alternative to some of the proprietary systems for identifying securities (particularly those services offered by Bloomberg's competitor Thomson Reuters). This week's opening of the BLPAPI is a similar gesture, one that the company says is part of its "Open Market Data Initiative, an ongoing effort to embrace and promote open solutions for the financial services industry."

The BLPAPI works with a range of programming languages, including Java, C, C++, .NET, COM and Perl. But while the interface itself is free to use, the content is not.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Pentaho moves Kettle to the Apache 2.0 license

Pentaho's extract-transform-load technology Pentaho Kettle is being moved to the Apache License, Version 2.0. Kettle was previously available under the GNU Lesser General Public License (LGPL).

By moving to the Apache license, Pentaho says it will be more in line with the licensing of Hadoop, HBase, and a number of NoSQL projects.

Kettle downloads and documentation are available at the Pentaho Big Data Community Home.

Oscar screeners and movie piracy data

Andy Baio took a look at some of the data surrounding piracy and the Oscar screening process. There has long been concern that the review copies of movies distributed to members of the Academy of Motion Picture Arts and Sciences were making their way online. Baio observed that while a record number of films have been nominated for Oscars this year (37), just eight of the "screeners" have been leaked online, "a record low that continues the downward trend from last year."

However, while the number of screeners available online has diminished, almost all of the nominated films (34) had already been leaked online. "If the goal of blocking leaks is to keep the films off the Internet, then the MPAA [Motion Picture Association of America] still has a long way to go," Baio wrote.

Baio has a number of additional observations about these leaks (and he also made the full data dump available for others to examine). But as the MPAA and others are making arguments (and helping pen related legislation) to crack down on Internet piracy, a good look at piracy trends seems particularly important.

Got data news?

Feel free to email me.


November 17 2011

Strata Week: Why ThinkUp matters

Here are a few of the data stories that caught my attention this week.

ThinkUp hits 1.0

ThinkUp, a tool out of Expert Labs, enables users to archive, search and export their Twitter, Facebook and Google+ history (both posts and post replies). It also allows users to see their network activity, including new followers, and to map that information. Originally created by Gina Trapani, ThinkUp is free and open source, and will run on a user's own web server.

That's crucial, says Expert Labs' founder Anil Dash, who describes ThinkUp's launch as "software that matters." He writes that "ThinkUp's launch matters to me because of what it represents: The web we were promised we would have. The web that I fell in love with, and that has given me so much. A web that we can hack, and tweak, and own." Imagine everything you've ever written on Twitter, every status update on Facebook, every message on Google+ and every response you've had to those posts — imagine them wiped out by the companies that control those social networks.

Why would I ascribe such awful behavior to the nice people who run these social networks? Because history shows us that it happens. Over and over and over. The clips uploaded to Google Videos, the sites published to Geocities, the entire relationships that began and ended on Friendster: They're all gone. Some kind-hearted folks are trying to archive those things for the record, and that's wonderful. But what about the record for your life, a private version that's not for sharing with the world, but that preserves the information or ideas or moments that you care about?

It's in light of this, no doubt, that ReadWriteWeb's Jon Mitchell calls ThinkUp "the social media management tool that matters most." Indeed, as we pour more of our lives into these social sites, tools like ThinkUp, along with endeavors like the Locker Project, mark important efforts to help people own, control and utilize their own data.

DataSift opens up its Twitter firehose

DataSift, one of only two companies licensed by Twitter to syndicate its firehose (the other being Gnip), officially opened to the public this week. That means that those using DataSift can in turn mine all the social data that comes from Twitter, at a rate of some 250 million tweets per day. DataSift's customers can analyze this data for more than just keyword searches and can apply various filters, including demographic information, sentiment, gender, and even Klout score. The company also offers data from MySpace and plans to add Google+ and Facebook data soon.

DataSift, which was founded by Tweetmeme's Nick Halstead and raised $6 million earlier this year, is available on a pay-as-you-go subscription model.

Google's BigQuery service opens to more developers

Google announced this week that it is giving more companies access to the pilot of BigQuery, its big data analytics service. The tool was initially developed for internal use at Google, and it was opened to a limited number of developers and companies at Google I/O earlier this year. Now, Google is allowing a few more companies into the fold (you can indicate your interest here), offering them the service for free, with the promise to notify them in 30 days if it plans to charge, as well as adding some user interface improvements.

In addition to a GUI for the web-based version, Google has improved the REST API for BigQuery as well. The new API offers granular control over permissions and lets you run multiple jobs in the background.

BigQuery is based on the Google tool formerly known as Dremel, which the company discussed in a research paper published last year:

[Dremel] is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data.
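The two ideas named in that description, columnar layout and multi-level execution trees, can be illustrated with a toy sketch. This is not Google's implementation: it runs in a single process, ignores nested data, and the records, field names, and the `tree_sum` helper are all invented for illustration.

```python
# Hypothetical records; the field names are invented for this sketch.
ROWS = [
    {"country": "US", "clicks": 3, "url": "a.example"},
    {"country": "CN", "clicks": 5, "url": "b.example"},
    {"country": "US", "clicks": 2, "url": "c.example"},
    {"country": "BR", "clicks": 7, "url": "d.example"},
]


def to_columns(rows):
    """Turn row-oriented records into one list per field (columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def leaf_sum(values):
    """Leaf of the tree: partial aggregate over one shard of a column."""
    return sum(values)


def tree_sum(column, shard_size=2, fanout=2):
    """Sum a column by combining partial results level by level.

    fanout must be at least 2 so that each level shrinks the list.
    """
    partials = [leaf_sum(column[i:i + shard_size])
                for i in range(0, len(column), shard_size)]
    while len(partials) > 1:
        partials = [sum(partials[i:i + fanout])
                    for i in range(0, len(partials), fanout)]
    return partials[0] if partials else 0


columns = to_columns(ROWS)
total_clicks = tree_sum(columns["clicks"])  # reads only the "clicks" column
print(total_clicks)
```

The aggregation never touches the `country` or `url` values, which is the appeal of a columnar layout for queries that read only a few fields. In a distributed system each leaf would run on a different machine; here the leaves are simply slices of one list.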

In the blog post announcing the changes to BigQuery, Google cites Michael J. Franklin, Professor of Computer Science at UC Berkeley, who calls BigQuery's ability to process big data "jaw-dropping."


November 01 2011

If your data practices were made public, would you be nervous?

The practice of data mining often elicits a knee-jerk reaction from consumers, with some viewing it as a violation of their privacy. In a recent interview, Solon Barocas (@s010n), a doctoral student at New York University, discussed the perceptions of data mining and how companies can address data mining's reputation.

Highlights from the interview (below) included:

  • What do consumers think data mining entails? "Data mining almost intuitively for most consumers implies scavenging through the data, trying to find secrets that you don't necessarily want people to know," Barocas said. "It's really difficult to explain what data mining actually is. I think of it, in a sense, to be a particular form of machine learning. And these are complicated things — very, very complicated. A challenge for people in the industry, regulators, and anyone else interested in these issues, is to figure out a way to communicate these technical things to a lay audience." [Discussed at the 0:41 mark.]
  • Do we need a different phrase in lieu of "data-mining"? Barocas argued: "[We should] try to push back against the misuses of the term, re-appropriate the term data mining, and explain it's not 'data-dredging.' It's not this case of running through everyone's data. We need to instead explain data mining is a kind of analysis that lets us discover interesting and important new trends. I think there's an enormous amount of value in data mining and being able to explain precisely what that value is without making it seem like it's just snooping." [Discussed at 1:12.]
  • What "ethical red flags" should companies and data scientists be aware of? "There are potential problems all along the line," said Barocas; after all, it can be difficult for companies performing analysis to know what to collect and what not to collect. "The rule of thumb: If your practice was made public — widely public — would you be nervous?" Barocas said he realizes that's "not a very sophisticated rule," but it's one that might guide responsibility in the data mining space. [Discussed at 2:50.]

The full interview is available in the video below:

Some quotes from this interview were edited and condensed for clarity.


October 20 2011

Strata Week: A step toward personal data control

Here are a few of the data stories that caught my attention this week.

Your data in your locker

Earlier this month, John Battelle wrote a post on his blog where he wished for a service to counter the ways in which all our personal data is scattered across so many applications and devices. He was looking for a tool that would pull together the data from these various places into something that "queries all my various social actions and curates them into one publicly addressable instance independent of any larger platform like AOL, Facebook, Apple, or Google ... I'm pretty sure this is what Singly and the Locker Project will make theoretically possible."

Battelle and Singly's Jason Cavnar discussed the Locker Project in more detail in another post on Battelle's blog this week.

As Cavnar argued:

Data doesn't do us justice. This is about LIFE. Our lives. Or as our colleague Lindsay (@lschutte) says — 'your story.' Not data. Data is just a manifestation of the actual life we are leading. Our data (story) should be ours to own, remember, re-use, discover with and share.

If that sounds appealing, then there's good news ahead. Singly 1.0 begins its roll-out to developers this week, as ReadWriteWeb's Marshall Kirkpatrick reports. Developers will be able to build apps that "search, sort and visualize contacts, links and photos that have been published by their own accounts on various social networks but also by all the accounts they are subscribed to there." The apps will live on GitHub and will deploy on GitHub for now. There are also several restrictions on using other people's apps; for example, you can only do so to visualize your own data.

Even with limitations, Singly is a first step in what will be a much-anticipated and hugely important move for personal data control.

Bad graphics and good data journalism

Last week, New York Times senior software architect Jacob Harris issued a challenge to the growing number of data journalists. Want to visualize your work? Avoid word clouds.

Word clouds are, he argued, much like tag clouds before them: "the mullets of the Internet." That is, taking a particular dataset and merely visualizing the frequency of words therein via tools such as Wordle is simply "filler visualization." (And Harris said it's also personally painful to the NYT data science team.)

Harris pointed to the numerous problems with utilizing word clouds as the sole form of textual analysis. At the very least, they only take advantage of word frequency, which doesn't necessarily tell you that much:

For starters, word clouds support only the crudest sorts of textual analysis, much like figuring out a protein by getting a count only of its amino acids. This can be wildly misleading; I created a word cloud of Tea Party feelings about Obama, and the two largest words were implausibly "like" and "policy," mainly because the important word "don't" was automatically excluded. (Fair enough: Such stopwords would otherwise dominate the word clouds.) A phrase or thematic analysis would reach more accurate conclusions. When looking at the word cloud of the War Logs, does the equal sizing of the words "car" and "blast" indicate a large number of reports about car bombs or just many reports about cars or explosions? How do I compare the relative frequency of lesser-used words? Also, doesn't focusing on the occurrence of specific words instead of concepts or themes miss the fact that different reports about truck bombs might use the words "truck," "vehicle," or even "bongo" (since the Kia Bongo is very popular in Iraq)?
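Harris's stopword example is easy to reproduce in miniature. The sketch below is only an illustration: the three sentences and the stopword list are made up, not drawn from his data. It counts single words the way a basic word cloud would, then counts adjacent word pairs as a very rough stand-in for the phrase analysis he recommends.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword list, invented for this illustration.
SAMPLE = [
    "I don't like the policy",
    "We don't like this policy at all",
    "They don't like it",
]
STOPWORDS = {"i", "we", "they", "the", "this", "it", "at", "all", "don't"}


def word_counts(sentences, stopwords):
    """Count single-word frequency, dropping stopwords, as a word cloud would."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


def bigram_counts(sentences):
    """Count adjacent word pairs, which keeps negations such as "don't like"."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        counts.update(zip(words, words[1:]))
    return counts


top_words = word_counts(SAMPLE, STOPWORDS).most_common(2)
top_pair = bigram_counts(SAMPLE).most_common(1)[0]
print(top_words)  # "like" and "policy" lead once "don't" is filtered out
print(top_pair)   # the most common pair keeps the negation
```

With the negation filtered out, the single-word counts suggest approval, while the most common pair shows the opposite sentiment.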

The Guardian's Simon Rogers responded to Harris. Rogers acknowledged there are plenty of poor visualizations out there, but he added an important point:

Calling for better graphics is also like calling for more sunshine and free chocolate — who's going to disagree with that? What they do is ignore why people produce their own graphics. We often use free tools because they are quick and tell the story simply. But, when we have the time, nothing beats having a good designer create something beautiful — and the Guardian graphics team produces lovely visualisation for the Datablog all the time — such as this one. What is the alternative online for those who don't have access to a team of trained designers?

That last question is crucial, particularly as not everyone has access to designers or software to be able to do much more with their data than create simple visualizations (e.g., word clouds). Rogers said that it's probably fine to have a lot of less-than-useful graphics, because, if nothing else, it "shows that data analysis is part of all our lives now, not just the preserve of a few trained experts handing out pearls of wisdom."

Mary Meeker examines the global growth of mobile data

Among the most-anticipated speakers at Web 2.0 Summit this week was Mary Meeker. The former Morgan Stanley analyst and now partner at Kleiner Perkins gave her annual "Internet Trends" presentation, which is always chock full of data.

Meeker's full Web 2.0 Summit presentation is available in the following video:

Meeker noted that 81% of users of the top 10 global Internet properties come from outside the U.S. Furthermore, in the last three years alone, China has added more Internet users than there are in all of the United States (246 million new Chinese users online versus 244 million total U.S. users online). Although companies like Apple, Amazon, and Google continue to dominate, Meeker pointed out that some of the largest and fastest-growing Internet companies are also based outside the U.S.: Chinese companies like Baidu and Tencent, for example, along with Russian companies. Beyond just market value, she pointed to global innovations, such as Sweden's Spotify and Israel's Waze.

The growth in Internet usage continues to be in mobile. Meeker highlighted the global scale and spread of mobile growth, noting that it's in countries like Turkey, India, Brazil and China where we are seeing the largest year-over-year expansion in mobile subscribers.

Suggesting that it may be time to reevaluate Maslow's hierarchy of needs, Meeker posited that Internet access is rapidly becoming a crucial need that sits at the top of a new hierarchy.

Apache Cassandra reaches 1.0

The Apache Software Foundation announced this week the release of Cassandra v1.0.

Cassandra, originally developed by Facebook to power its Inbox Search, was open sourced in 2008. Although it's been a top-level Apache project for more than a year now, the 1.0 release marks Cassandra's maturity and readiness for more widespread implementation. The technology has been adopted beyond Facebook by companies like Cisco, Cloudkick, Digg, Reddit, Twitter and Walmart Labs.

Of course, Cassandra is just one of many non-relational databases on the market, with the most recent addition coming from Oracle. But Jonathan Ellis, the vice president of the Apache Cassandra project, explained to PCWorld why Cassandra remains competitive:

[Its] architecture is suited for multi-data center environments, because it does not rely on a leader node to coordinate activities of the database. Data can be written to a local node, thereby eliminating the additional network communications needed to coordinate with a sometimes geographically distant master node. Also, because Cassandra is a column-based storage engine, it can store richer data sets than the typical key-value storage engine.

