
November 22 2011

Strata Week: 4.74 degrees of Kevin Bacon

Here are some of the data stories that caught my attention this week:

There are fewer than six degrees between you and Kevin Bacon

You know the game: there are no more than six degrees of separation between actor Kevin Bacon and anyone working in Hollywood. You can start with Sir Alec Guinness, or you can start with Sasha Grey — there are, at most, six links that connect that performer to Kevin Bacon. The game is built on an older notion of social connections, one dating back as far as the late 1920s when Hungarian author Frigyes Karinthy argued there were no more than five acquaintances that separated people.

It might not be all that surprising that the Internet affords different, perhaps closer, relationships than those of early 20th-century Hungary. Indeed, according to Facebook's data team, the connections it now affords are actually much closer than the "six degrees of separation" maxim suggests: any two Facebook users are separated by an average of only 4.74 "hops," as the team calls them.

Facebook's Data team writes:

Thus, when considering even the most distant Facebook user in the Siberian tundra or the Peruvian rainforest, a friend of your friend probably knows a friend of their friend. When we limit our analysis to a single country, be it the US, Sweden, Italy, or any other, we find that the world gets even smaller, and most pairs of people are only separated by three degrees (four hops). It is important to note that while [Stanley] Milgram was motivated by the same question (how many individuals separate any two people), these numbers are not directly comparable; his subjects only had limited knowledge of the social network, while we have a nearly complete representation of the entire thing. Our measurements essentially describe the shortest possible routes that his subjects could have found.
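
To make the "hops" figure concrete, here is a minimal sketch of how an average hop count can be computed on a small friendship graph using breadth-first search. The toy graph and the function names are assumptions made up for illustration; this is not the Facebook team's method, and running an exact search from every user would not be practical on a network of hundreds of millions of people.

```python
from collections import deque


def hop_counts(graph, source):
    """Breadth-first search: fewest hops from source to every reachable person."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def average_hops(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected people."""
    total = 0
    pairs = 0
    for source in graph:
        for target, hops in hop_counts(graph, source).items():
            if target != source:
                total += hops
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network (undirected: every friendship is listed both ways).
toy_graph = {
    "ana": ["ben"],
    "ben": ["ana", "cho"],
    "cho": ["ben", "dev"],
    "dev": ["cho"],
}

if __name__ == "__main__":
    print(hop_counts(toy_graph, "ana"))
    print(average_hops(toy_graph))
```

In the quote's terms, a pair separated by four hops has three intermediaries between them, which is why "three degrees" and "four hops" describe the same distance.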

The Facebook study involves a sizable dataset of some 721 million Facebook users. That's the largest, by far, of any study of its kind. But as The New York Times points out, the Facebook study still raises questions about what exactly we mean when we talk about "friends" and about the relationships and connections we've created between people online.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Stock market data meets social media data

Social media data aggregator Gnip announced this week that it was expanding into the financial services market with the launch of a new product aimed at hedge funds and stock traders. Gnip MarketStream will provide real-time data from both Twitter and StockTwits.

Gnip is one of two companies licensed to handle the Twitter firehose (the other, which I wrote about last week, is DataSift). The company says that it already includes in its customer base a number of hedge funds, and now in partnership with StockTwits — a financial data platform that created the $(TICKER) tag on Twitter — Gnip says it will provide real-time social data to trading companies.
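
As a rough illustration of what the $(TICKER) convention gives a data consumer, the sketch below pulls ticker tags out of a message with a regular expression. The pattern (a dollar sign followed by one to six letters) and the function name are assumptions for this example only; they are not StockTwits' or Gnip's actual matching rules.

```python
import re

# Assumed pattern: "$" followed by 1-6 letters, not glued to a preceding
# word character or another "$". Real ticker rules may differ.
TICKER_TAG = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,6})\b")


def extract_tickers(message):
    """Return the ticker symbols tagged in a message, upper-cased, in order."""
    return [symbol.upper() for symbol in TICKER_TAG.findall(message)]


if __name__ == "__main__":
    print(extract_tickers("Long $AAPL and $goog, paid $100 for lunch"))
```

Dollar amounts such as "$100" are skipped because the pattern only accepts letters after the dollar sign.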

Hunch acquired by eBay

eBay has acquired the recommendation engine Hunch. The price tag was around $80 million, according to Mike Arrington. Founded in 2007 by Chris Dixon and Caterina Fake, Hunch built a "taste graph," a website and API that provided insights into users' affinities with different people, services, brands, and websites.

In an interview with Betabeat, Dixon pointed to precisely this sort of insight as the rationale for the startup's acquisition: "eBay is a very unique retailer," Dixon told Betabeat. "When grandma posts a sweater for sale, it doesn't have metadata to help sort and identify it. In working to understand users' tastes on the open web, this is the challenge we have been solving at Hunch." Betabeat says that, "Something like 70% of items on eBay don't have traditional metadata like product IDs." No metadata on eBay transactions means little opportunity for eBay itself to build a sophisticated recommendation engine, and clearly the acquisition of Hunch is meant to address that.

But there's a flip side to this equation, too, as Betabeat describes it: "eBay's 97 million users and 200 million active listings add up to 9 petabytes of data across two billion daily page views. 'There is only so much you can teach your system with the big academic data sets that are publicly available,' Dixon said. 'With eBay's data behind us, expect Hunch to get much, much better'."

Faster than the speed of light?

Earlier this fall, scientists at CERN said they'd clocked neutrinos traveling faster than the speed of light. Since such a result would upend the Theory of Relativity, there's been an open call to other scientists to replicate the findings.

There've been a couple of new salvos this week: The Italian Institute for Nuclear Physics (INFN), which runs the Gran Sasso lab, has just confirmed the test results, The Economist reported. Then, another Italian site reported different findings. The ICARUS experiment said that, no, the original research hadn't adequately accounted for the neutrino's energy upon arrival. Recalculating the data, they argued that the neutrinos were still traveling at the speed of light — no faster.

So is the Theory of Relativity still intact? Stay tuned ...

Got data news?

Feel free to email me.


January 26 2011

Data markets aren't coming. They're already here

Jud Valeski (@jvaleski) is cofounder and CEO of Gnip, a social media data provider that aggregates feeds from sites like Twitter, Facebook, Flickr, delicious, and others into one API.

Jud will be speaking at Strata next week on a panel titled "What's Mine is Yours: the Ethics of Big Data Ownership."

If you're attending Strata, you can also find out more about the growing business of data marketplaces at a "Data Marketplaces" panel with Ian White of Urban Mapping, Peter Marney of Thomson Reuters, Moe Khosravy of Microsoft, and Dennis Yang of Infochimps.

My interview with Jud follows.

Why is social media data important? What can we do with it or learn from it?

Jud Valeski: Social media today is the first time a reasonably large population has communicated digitally in relative public. The ability to programmatically analyze collective conversation has never really existed. Being able to analyze the collective human consciousness has been the dream of researchers and analysts since day one.

The data itself is important because it can be analyzed to assist in disaster detection and relief. It can be analyzed for profit in an industry that has always struggled to pinpoint how and where to spend money. It can be analyzed to determine financial market viability (stock trading, for example). It can be analyzed to understand community sentiment, which has political ramifications; we all want our voices heard in order to shape public policy.

What are some of the most common or surprising queries run through Gnip?

Jud Valeski: We don't look at the queries our customers use. One pattern we have seen, however, is that there are some people who try to use the software to siphon as much data as possible out of a given publisher. "More data, more data, more data." We hear that all the time. But how our customers configure the Gnip software is up to them.

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD

With Gnip, customers can choose the data sources they want not just by site but also by category within the site. Can you tell me more about the options for Twitter, which include Decahose, Halfhose, and Spritzer?

Jud Valeski: We tend to categorize social media sources into three buckets: Volume, Coverage, or Both. Volume streams provide a consumer with a sampled rate of volume (Decahose is 10%, for example, while a full firehose is 100% of some service's activities). Statisticians and analysts like the Volume stuff.

Coverage streams exist to provide full coverage of a certain set of things (e.g., keywords, or the User Mention Stream for Twitter). Advertisers like Coverage streams because their interests are very targeted. There are some products that fall into both categories, but Volume and Coverage tend to describe the overall view.

For Twitter in particular, we use their algorithm as described on their dev pages, adjusted for each particular volume rate desired.
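
(For readers who want a picture of what a Volume stream does, here is a minimal sketch of percentage-based sampling. It is an editorial illustration, not Twitter's or Gnip's actual algorithm: the hash-bucket approach, the "id" field, and the function names are all assumptions. Each activity is hashed into one of 100 buckets, and a stream at a given rate keeps the activities that land in the lowest buckets.)

```python
import hashlib


def in_sample(activity_id, rate_percent):
    """Decide deterministically whether an activity belongs to a sample.

    The ID is hashed into a bucket from 0 to 99; the activity is kept when
    its bucket falls below the requested rate (10 for a 10% stream).
    """
    digest = hashlib.sha256(str(activity_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


def sample_stream(activities, rate_percent):
    """Yield only the activities that fall inside the requested sample rate."""
    for activity in activities:
        if in_sample(activity["id"], rate_percent):
            yield activity


if __name__ == "__main__":
    # Hypothetical activities; a real stream would carry full payloads.
    firehose = [{"id": n, "text": f"message {n}"} for n in range(10_000)]
    ten_percent = list(sample_stream(firehose, 10))
    print(len(ten_percent), "of", len(firehose), "activities kept")
```

Because the decision depends only on the ID, the same activity is always in or out of a given sample, and under this scheme a 10% stream is a subset of a 50% stream.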

Gnip is currently the only licensed reseller of the full Twitter firehose. Are there other partnerships coming up?

Jud Valeski: "Currently" is the operative word here. While we're enjoying the implied exclusivity of the current conditions, we fully expect Twitter to grow its VAR tier to ensure a more competitive marketplace.

From my perspective, Twitter enabling VARs allows them to focus on what is near and dear to their hearts — developer use cases, promoted Tweets, end users, and the display ecosystem — while enabling firms focused on the data-delivery business to distribute underlying data for non-display use. Gnip provides stream enrichments for all of the data that flows through our software. Those enrichments include format and protocol normalization, as well as stream augmentation features such as global URL unwinding. Those value-adds make social media API integration and data leverage much easier than doing a bunch of one-off integrations yourself.

We're certainly working on other partnerships of this level of significance, but we have nothing to announce at this time.

What do you wish more people understood about data markets and/or the way large datasets can be used?

Jud Valeski: First, data is not free, and there's always someone out there who wants to buy it. As an end-user, educate yourself about how the content you create using someone else's service could ultimately be used by the service-provider.

Second, black markets are a real problem, and just because "everyone else is doing it" doesn't mean it's okay. As an example, botnet-like distributed IP address polling infrastructure is commonly used to extract more data from a publisher's service than their API usage terms allow. While perhaps fun to build and run (sometimes), these approaches clearly result in aggregated pools of publisher data that the publisher never intended to promote. Once collected, the aggregated pools of data are sold to data-hungry analytics firms. This results in end-user frustration, in that the content they produced was used in a manner that flagrantly violated the terms under which they signed up. These databases are frequently called out as infringing on privacy.

Everyone loves a good Robin Hood story, and that's how I'd characterize the overall state of data collection today.

How has real-time data changed the field of customer relationship management (CRM)?

Jud Valeski: CRM firms have a new level of awareness. They no longer rely exclusively on dated user studies. A customer service rep may know about your social life through their dashboard the moment you are connected to them over the phone.

I ultimately see the power of understanding collective consciousness in responding to customer service issues. We haven't even scratched the surface here. Imagine if Company X reached out to you directly every time you had a problem with their product or service. Proactivity can pay huge dividends. Companies haven't tapped even 10% of the potential here, and part of that is because they're not spending enough money in the area yet.

Today, "social" is a checkbox that CRM tools attempt to check off just to keep the boss happy. Tomorrow, social data and metaphors will define the tools outright.

Have you learned anything as a social media user yourself from working on Gnip? Is there anything social media users should be more aware of?

Jud Valeski: Read the terms of service for social media services you're using before you complain about privacy policies or how and where your data is being used. Unless you are on a private network, your data is treated as public for all to use, see, sell, or buy. Don't kid yourself. Of course, this brings us all the way back around to black markets. Black markets — and publishers' generally lackadaisical response to them — cloud these waters.

If you can't make it to Strata, you can learn more about the architectural challenges of distributing social and location data across the web in real time, and how Gnip has evolved to address those challenges, in Jud's contribution to "Beautiful Data."

December 12 2010

Strata Gems: The emerging marketplace for social data

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Let it snow.

If you want to analyze social media data in significant volumes, it can be inconvenient and costly to aggregate it yourself. Aggregator and reseller Gnip is at the forefront of the growing marketplace for social data. As well as providing a unified API to aggregated public data from social web sites, Gnip is the first authorized reseller for the entirety of Twitter's public output.

Who uses this Twitter data, and for what? The ultimate end users of most aggregated social data are corporations. The data is used either for brand management and monitoring purposes, or as part of workflow systems that help them address customer issues online. However, the raw feeds themselves are most often provided to suppliers of social monitoring systems, not the end user.

Right now, you can't just install a "Twitter sink" and point the Twitter firehose into it. Gnip's licensees are mostly vendors of social CRM systems. According to Gnip's estimate there are over 500 social monitoring products available in the US alone, but the bigger market is in integration with existing CRMs. Rather than using separate systems, it makes more sense to introduce social media data into the existing CRM investments of large enterprises.

By reselling its public data, Twitter is the first social service to monetize its raw data in this way. Over the course of the next twelve months this trend is likely to accelerate. With that acceleration, we'll also see attendant issues over how aware people are of the ultimate destination of their social media scribblings. Writing something in public for your friends is one thing, but people may be surprised to know their every utterance is being carefully watched by their favorite brands.

Gnip's CEO Jud Valeski will be participating in the Strata panel What's Mine is Yours: the Ethics of Big Data Ownership.
