April 12 2012

Strata Week: Add structured data, lose local flavor?

Here are a few of the data stories that caught my attention this week:

A possible downside to Wikidata

Figure: Wikidata data model diagram (screenshot from the Wikidata Data Model page).

The Wikimedia Foundation — the good folks behind Wikipedia — recently proposed a Wikidata initiative. It's a new project that would build out a free secondary database to collect structured data that could provide support in turn for Wikipedia and other Wikimedia projects. According to the proposal:

"Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata, you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today."

But in The Atlantic this week, Mark Graham, a research fellow at the Oxford Internet Institute, takes a look at the proposal, calling these "changes that have worrying connotations for the diversity of knowledge in the world's sixth most popular website." Graham points to the different language editions of Wikipedia, noting that the encyclopedic knowledge they contain is highly diverse. "Not only does each language edition include different sets of topics, but when several editions do cover the same topic, they often put their own, unique spin on the topic. In particular, the ability of each language edition to exist independently has allowed each language community to contextualize knowledge for its audience."

Graham fears that a standardized, machine-readable, semantically oriented Wikipedia will lose this local flavor:

"The reason that Wikidata marks such a significant moment in Wikipedia's history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is the fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups."

His arguments raise questions about the perceived universality of data. What we find instead may be highly nuanced and localized, particularly when the data is contributed by people distributed around the globe.
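The proposal's core idea, entering a fact once in machine-readable form and then rendering it in several languages, can be sketched in a few lines of Python. This is only an illustration: the `Statement` class, the label tables, the fictional item `example_land`, and its placeholder population are all assumptions made for this sketch, not Wikidata's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One machine-readable fact: an item, a property, and a value (hypothetical shape)."""
    item: str
    prop: str
    value: int


# Hypothetical label tables: each language edition supplies its own wording.
ITEM_LABELS = {"example_land": {"en": "Example Land", "de": "Beispielland"}}
PROPERTY_LABELS = {"population": {"en": "population", "de": "Einwohnerzahl"}}


def render(stmt: Statement, lang: str) -> str:
    """Render the same stored fact with labels in the requested language."""
    item = ITEM_LABELS[stmt.item][lang]
    prop = PROPERTY_LABELS[stmt.prop][lang]
    return f"{item}: {prop} = {stmt.value}"


# Placeholder value for a fictional country, stored exactly once.
fact = Statement(item="example_land", prop="population", value=1000000)

print(render(fact, "en"))  # Example Land: population = 1000000
print(render(fact, "de"))  # Beispielland: Einwohnerzahl = 1000000
```

Note what the sketch makes visible: the labels vary by language, but the value itself is shared by every edition. That single shared value is the point at which Graham's concern about lost local context applies.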

The intricacies of Netflix personalization

Netflix's recommendation engine is often cited as a premier example of how user data can be mined and analyzed to build a better service. This week, Netflix's Xavier Amatriain and Justin Basilico penned a blog post offering insights into the challenges that the company (and, thanks to the Netflix Prize, the data mining and machine learning communities) has faced in improving the accuracy of movie recommendation engines.

The Netflix post raises some interesting questions about how the means of content delivery have changed recommendations. In other words, when Netflix refocused on its streaming product, viewing interests changed (and not just because the selection changed). The same holds true for the multitude of ways in which we can now watch movies via Netflix (there are hundreds of different device options for accessing and viewing content from the service).

Amatriain and Basilico write:

"Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts."

Got data news?

Feel free to email me.

November 22 2011

Strata Week: 4.74 degrees of Kevin Bacon

Here are some of the data stories that caught my attention this week:

There are fewer than six degrees between you and Kevin Bacon

You know the game: there are no more than six degrees of separation between actor Kevin Bacon and anyone working in Hollywood. You can start with Sir Alec Guinness, or you can start with Sasha Grey — there are, at most, six links that connect that performer to Kevin Bacon. The game is built on an older notion of social connections, one dating back as far as the late 1920s when Hungarian author Frigyes Karinthy argued there were no more than five acquaintances that separated people.

It might not be all that surprising that the Internet affords different (closer?) relationships than those of early 20th-century Hungary. Indeed, according to Facebook's data team, the connections it now affords are actually much closer than the "six degrees of separation" maxim suggests. The average distance between any two Facebook users is only 4.74 "hops," as the team calls them.

Facebook's data team writes:

Thus, when considering even the most distant Facebook user in the Siberian tundra or the Peruvian rainforest, a friend of your friend probably knows a friend of their friend. When we limit our analysis to a single country, be it the US, Sweden, Italy, or any other, we find that the world gets even smaller, and most pairs of people are only separated by three degrees (four hops). It is important to note that while [Stanley] Milgram was motivated by the same question (how many individuals separate any two people), these numbers are not directly comparable; his subjects only had limited knowledge of the social network, while we have a nearly complete representation of the entire thing. Our measurements essentially describe the shortest possible routes that his subjects could have found.
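The "shortest possible routes" the team describes are shortest paths in a friendship graph. As a rough sketch (not Facebook's method, which had to cope with hundreds of millions of users), a breadth-first search counts the hops between two people. The `hop_distance` function and the toy `friends` graph below are hypothetical names and data invented for illustration.

```python
from collections import deque


def hop_distance(graph, start, goal):
    """Return the number of hops (edges) on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None


# Hypothetical toy friendship graph: a simple chain of five people.
friends = {
    "ana": ["ben"],
    "ben": ["ana", "caro"],
    "caro": ["ben", "dev"],
    "dev": ["caro", "eli"],
    "eli": ["dev"],
}

hops = hop_distance(friends, "ana", "eli")
print(hops)      # 4 hops
print(hops - 1)  # 3 intermediaries, i.e. three degrees of separation
```

In this chain, ana reaches eli in four hops through three intermediaries, which matches the quote's "three degrees (four hops)" usage. An average such as 4.74 would come from computing this distance over many pairs of users and taking the mean.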

The Facebook study involves a sizable dataset: some 721 million Facebook users. That's the largest, by far, of any study of its kind. But as The New York Times points out, the Facebook study still raises questions about what exactly we mean when we talk about "friends" and about the relationships and connections we've created between people online.

Stock market data meets social media data

Social media data aggregator Gnip announced this week that it was expanding into the financial services market with the launch of a new product aimed at hedge funds and stock traders. Gnip MarketStream will provide real-time data from both Twitter and StockTwits.

Gnip is one of two companies licensed to handle the Twitter firehose (the other, which I wrote about last week, is DataSift). The company says its customer base already includes a number of hedge funds. Now, in partnership with StockTwits (a financial data platform that created the $(TICKER) tag on Twitter), Gnip says it will provide real-time social data to trading companies.

Hunch acquired by eBay

eBay has acquired the recommendation engine Hunch. The price tag was around $80 million, according to Mike Arrington. Founded in 2007 by Chris Dixon and Caterina Fake, Hunch built a "taste graph," a website and API that provided insights into users' affinities with different people, services, brands, and websites.

In an interview with Betabeat, Dixon pointed to precisely this sort of insight as the rationale for the startup's acquisition: "eBay is a very unique retailer," Dixon told Betabeat. "When grandma posts a sweater for sale, it doesn't have metadata to help sort and identify it. In working to understand users' tastes on the open web, this is the challenge we have been solving at Hunch." Betabeat says that, "Something like 70% of items on eBay don't have traditional metadata like product IDs." No metadata on eBay transactions means little opportunity for eBay itself to build a sophisticated recommendation engine, and clearly the acquisition of Hunch is meant to address that.

But there's a flip side to this equation, too, as Betabeat describes it: "eBay's 97 million users and 200 million active listings add up to 9 petabytes of data across two billion daily page views. 'There is only so much you can teach your system with the big academic data sets that are publicly available,' Dixon said. 'With eBay's data behind us, expect Hunch to get much, much better'."

Faster than the speed of light?

Earlier this fall, scientists at CERN said they'd clocked neutrinos traveling faster than the speed of light. Since such a result would contradict Einstein's special theory of relativity, there's been an open call to other scientists to replicate the findings.

There've been a couple of new salvos this week: The Italian Institute for Nuclear Physics (INFN), which runs the Gran Sasso lab, has just confirmed the test results, The Economist reported. Then, another Italian site reported different findings. The ICARUS experiment said that, no, the original research hadn't adequately accounted for the neutrino's energy upon arrival. Recalculating the data, they argued that the neutrinos were still traveling at the speed of light — no faster.

So is the Theory of Relativity still intact? Stay tuned ...

Got data news?

Feel free to email me.
