
March 01 2012

Strata Week: DataSift lets you mine two years of Twitter data

Here are a few of the data stories that caught my attention this week.

Twitter's historical archives, via DataSift

DataSift, one of the two companies that have official access to the Twitter firehose (the other being Gnip), announced its new Historics service this week, giving customers access to up to two years' worth of historical Tweets. (By comparison, Gnip offers 30 days of Twitter data, and other developers and users have access to roughly a week's worth of Tweets.)

GigaOm's Barb Darrow responded to those who might be skeptical about the relevance of this sort of historical Twitter data in a service that emphasizes real-time. Darrow noted that DataSift CEO Rob Bailey said companies planning new products, promotions or price changes would do well to study the impact of their past actions before proceeding, and that Twitter is the perfect venue for that.

Another indication of the desirability of this new Twitter data: the waiting list for Historics already includes a number of Fortune 500 companies. The service will get its official launch in April.

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

Building a school of data

Although there are plenty of ways to receive formal training in math, statistics and engineering, there aren't a lot of options when it comes to an education specifically in data science.

To that end, the Open Knowledge Foundation and Peer to Peer University (P2PU) have proposed a School of Data, arguing that:

"It will be years before data specialist degree paths become broadly available and accepted, and even then, time-intensive degree courses may not be the right option for journalists, activists, or computer programmers who just need to add data skills to their existing expertise. What is needed are flexible, on-demand, shorter learning options for people who are actively working in areas that benefit from data skills, particularly those who may have already left formal education programmes."

The organizations are seeking volunteers to help develop the project, whether that's in the form of educational materials, learning challenges, mentorship, or a potential student body.

Strata in California

The Strata Conference wraps up today in Santa Clara, Calif. If you missed Strata this year and weren't able to catch the livestream of the conference, look for excerpts and videos posted here on Radar and through the O'Reilly YouTube channel in the coming weeks.

And be sure to make plans for Strata New York, being held October 23-25. That event will mark the merger with Hadoop World. The call for speaker proposals for Strata NY is now open.

Got data news?

Feel free to email me.


November 17 2011

Strata Week: Why ThinkUp matters

Here are a few of the data stories that caught my attention this week.

ThinkUp hits 1.0

ThinkUp, a tool out of Expert Labs, enables users to archive, search and export their Twitter, Facebook and Google+ history, including both posts and post replies. It also allows users to see their network activity, including new followers, and to map that information. Originally created by Gina Trapani, ThinkUp is free and open source, and will run on a user's own web server.

That's crucial, says Expert Labs' founder Anil Dash, who describes ThinkUp's launch as "software that matters." He writes that "ThinkUp's launch matters to me because of what it represents: The web we were promised we would have. The web that I fell in love with, and that has given me so much. A web that we can hack, and tweak, and own." He continues:

"Imagine everything you've ever written on Twitter, every status update on Facebook, every message on Google+ and every response you've had to those posts, imagine them wiped out by the companies that control those social networks.

"Why would I ascribe such awful behavior to the nice people who run these social networks? Because history shows us that it happens. Over and over and over. The clips uploaded to Google Videos, the sites published to Geocities, the entire relationships that began and ended on Friendster: They're all gone. Some kind-hearted folks are trying to archive those things for the record, and that's wonderful. But what about the record for your life, a private version that's not for sharing with the world, but that preserves the information or ideas or moments that you care about?"

It's in light of this, no doubt, that ReadWriteWeb's Jon Mitchell calls ThinkUp "the social media management tool that matters most." Indeed, as we pour more of our lives into these social sites, tools like ThinkUp, along with endeavors like the Locker Project, mark important efforts to help people own, control and utilize their own data.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

DataSift opens up its Twitter firehose

DataSift, one of only two companies licensed by Twitter to syndicate its firehose (the other being Gnip), officially opened to the public this week. That means that those using DataSift can in turn mine all the social data that comes from Twitter, which arrives at a rate of some 250 million tweets per day. DataSift's customers can analyze this data for more than just keyword searches and can apply various filters, including demographic information, sentiment, gender, and even Klout score. The company also offers data from MySpace and plans to add Google+ and Facebook data soon.

DataSift, which was founded by Tweetmeme's Nick Halstead and raised $6 million earlier this year, is offered on a pay-as-you-go subscription model.

Google's BigQuery service opens to more developers

Google announced this week that it is letting more companies into the pilot of BigQuery, its big data analytics service. The tool was initially developed for internal use at Google, and it was opened to a limited number of developers and companies at Google I/O earlier this year. Now, Google is allowing a few more companies into the fold (you can indicate your interest here) and offering them the service for free, with the promise to notify them in 30 days if it plans to charge. It has also added some user interface improvements.

In addition to a GUI for the web-based version, Google has improved the REST API for BigQuery as well. The new API offers granular control over permissions and lets you run multiple jobs in the background.

BigQuery is based on the Google tool formerly known as Dremel, which the company discussed in a research paper published last year:

"[Dremel] is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data."
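To make the "columnar data layout" idea a little more concrete, here is a minimal Python sketch. It is not Dremel or BigQuery code: the records, field names and helper functions are all hypothetical, and it ignores nested data and multi-level execution trees entirely. It only illustrates why storing each field as its own column lets an aggregation query read just the columns it needs.

```python
from collections import defaultdict

# Hypothetical flat records; the field names are illustrative only.
rows = [
    {"country": "US", "clicks": 3, "url": "a"},
    {"country": "DE", "clicks": 5, "url": "b"},
    {"country": "US", "clicks": 4, "url": "c"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (a columnar layout)."""
    columns = defaultdict(list)
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return dict(columns)


def sum_by(columns, key_field, value_field):
    """Sum value_field grouped by key_field, touching only those two columns."""
    totals = defaultdict(int)
    for key, value in zip(columns[key_field], columns[value_field]):
        totals[key] += value
    return dict(totals)


columns = to_columns(rows)
clicks_by_country = sum_by(columns, "country", "clicks")
print(clicks_by_country)
```

Note that the aggregation never reads the "url" column. In a row-oriented layout every record would be scanned whole, which is the cost a columnar store is designed to avoid.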

In the blog post announcing the changes to BigQuery, Google cites Michael J. Franklin, Professor of Computer Science at UC Berkeley, who calls BigQuery's ability to process big data "jaw-dropping."


Got data news?

Feel free to email me.


July 14 2011

Strata Week: There's money in data sifting

Here are a few of the data stories that caught my attention this week.

Big bucks for DataSift and for data from Twitter's firehose

The social media data mining platform DataSift, one of the two companies that have the rights to re-syndicate the data from Twitter's firehose, announced this week that it has raised $6 million in a Series A round. (The other company with those rights is Gnip, whose handling of the firehose we recently covered.) DataSift aggregates data from other social media streams as well as Twitter, including Facebook, WordPress, and Digg. While providing the tools to "sift" this content and layering it with other metadata makes DataSift compelling, it's the company's connection to Twitter that may have piqued the most interest.

DataSift grew out of the company MediaSift, the same business that created Tweetmeme, a tool that fell into disfavor when Twitter launched its own sharing button. That move on the part of Twitter to take over functions that third-party developers once provided has had some negative implications for those in the Twitter ecosystem. At this stage, it seems like Twitter is willing to leave some of the big data processing to other companies.

Investor Mark Suster of GRP Partners, whose firm was one of the leaders in this round of DataSift's investment, made the announcement that he was "doubling down on the Twitter ecosystem." For its part, DataSift "has a product that will turn the stream into a lake," says Suster. In other words, "The Twitter stream like most others is ephemeral. If you don't bottle it as it passes by you it's gone. DataSift has a product that builds a permanent database for you of just the information you want to capture."
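The "stream into a lake" idea can be sketched in a few lines of Python. This is only an illustration of the general pattern of filtering an ephemeral stream and persisting what matches; it does not use DataSift's API, and the message fields, the `bottle` function and the `captured` table are all hypothetical.

```python
import sqlite3


def bottle(stream, keep, conn):
    """Persist only the messages for which keep(message) is true.

    Returns the number of messages stored. Anything that does not match
    is simply dropped, as it would be in an ephemeral stream.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS captured (author TEXT, text TEXT)")
    stored = 0
    for message in stream:
        if keep(message):
            conn.execute(
                "INSERT INTO captured VALUES (?, ?)",
                (message["author"], message["text"]),
            )
            stored += 1
    conn.commit()
    return stored


# Hypothetical messages standing in for a live stream.
sample_stream = iter([
    {"author": "alice", "text": "Heading to Strata next week"},
    {"author": "bob", "text": "Lunch was great"},
    {"author": "carol", "text": "Strata keynote videos are up"},
])

conn = sqlite3.connect(":memory:")
stored = bottle(sample_stream, lambda m: "strata" in m["text"].lower(), conn)
captured_rows = conn.execute("SELECT author, text FROM captured").fetchall()
print(stored, captured_rows)
```

An in-memory SQLite database keeps the sketch self-contained; a real deployment would choose storage suited to the volume involved.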

But Suster's announcement also reiterates the importance of Twitter, something that seems particularly relevant in light of the new Google Plus. Suster describes Twitter as real-time, open, asymmetric, social, viral, location-aware, a referral network, explicit, and implicit. But as the buzz over Google Plus continues, it's not clear that Twitter really has the market cornered on all of these characteristics any longer.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

What's under the Google Plus hood?

Speaking of Google Plus, there's been lots of commentary and speculation about how successful the launch of the new social network has been, gauged in part by how quickly that network is growing. According to Ancestry.com founder Paul Allen, Google Plus was set to break the 10 million user mark on July 12, just two weeks after its launch. UberMedia's Bill Gross went so far as to predict that Google Plus would become the fastest growing social network in history.

So how does Google do it (in terms of the technology)? According to Joseph Smarr, the project's technical lead and an OSCON speaker:

"Our stack is pretty standard fare for Google apps these days: we use Java servlets for our server code and JavaScript for the browser-side of the UI, largely built with the (open-source) Closure framework, including Closure's JavaScript compiler and template system. A couple nifty tricks we do: we use the HTML5 History API to maintain pretty-looking URLs even though it's an AJAX app (falling back on hash-fragments for older browsers); and we often render our Closure templates server-side so the page renders before any JavaScript is loaded, then the JavaScript finds the right DOM nodes and hooks up event handlers, etc. to make it responsive (as a result, if you're on a slow connection and you click on stuff really fast, you may notice a lag before it does anything, but luckily most people don't run into this in practice). Our backends are built mostly on top of BigTable and Colossus/GFS, and we use a lot of other common Google technologies such as MapReduce (again, like many other Google apps do)."


(Google's Joseph Smarr, a member of the Google+ team, will discuss the future of the social web at OSCON. Save 20% on registration with the code OS11RAD.)

Data products for education

The charitable giving site DonorsChoose has been running a contest called Hacking Education, and the contest's finalists have just been announced. DonorsChoose lets people make charitable contributions to public schools, supporting teachers' projects with a Kickstarter-like site for education. For the contest, DonorsChoose opened up its data to developers: more than 300,000 classroom projects that have inspired some $80 million in charitable giving.

The finalists were chosen from over 50 apps and analyses and included a visualization of the kinds of projects teachers proposed and the kinds donors supported, a .NET Factbook, and an automatic press release system so that local journalists could be notified about projects. The grand prize winner has yet to be chosen, but that project will receive a trophy — and a big thumbs up — from Stephen Colbert.

Got data news?

Feel free to email me.




