
March 14 2012

Now available: "Planning for Big Data"

Earlier this month, more than 2,500 people came together for the O'Reilly Strata Conference in Santa Clara, Calif. Though representing diverse fields, from insurance to media and high-tech to healthcare, attendees buzzed with a new-found common identity: they are data scientists. Entrepreneurial and resourceful, combining programming skills with math, data scientists have emerged as a new profession leading the march toward data-driven business.

This new profession rides on the wave of big data. Our businesses are creating ever more data, and as consumers we are sources of massive streams of information, thanks to social networks and smartphones. In this raw material lies much of value: insight about businesses and markets, and the scope to create new kinds of hyper-personalized products and services.

Five years ago, only big business could afford to profit from big data: Walmart and Google, specialized financial traders. Today, thanks to an open source project called Hadoop, commodity Linux hardware and cloud computing, this power is in reach for everyone. A data revolution is sweeping business, government and science, with consequences as far reaching and long lasting as the web itself.

Where to start?

Every revolution has to start somewhere, and the question for many is "how can data science and big data help my organization?" After years of data processing choices being straightforward, there's now a diverse landscape to negotiate. What's more, to become data driven, you must grapple with changes that are cultural as well as technological.

Our aim with Strata is to help you understand what big data is, why it matters, and where to get started. In the wake of the recent conference, we're delighted to announce the publication of our "Planning for Big Data" book. Available as a free download, the book contains the best insights from O'Reilly Radar authors over the past three months, including myself, Alistair Croll, Julie Steele and Mike Loukides.

"Planning for Big Data" is for anybody looking to get a concise overview of the opportunity and technologies associated with big data. If you're already working with big data, hand this book to your colleagues or executives to help them better appreciate the issues and possibilities.



March 02 2012

Visualization of the Week: Visualizing the Strata Conference

The Strata Conference wrapped up yesterday. Who was there and where did they come from? That's what The Guardian and Information Lab looked to discover in the following visualization.

The Guardian's visualization of the Strata Conference

You can view the entire visualization here.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.


March 01 2012

Strata Week: DataSift lets you mine two years of Twitter data

Here are a few of the data stories that caught my attention this week.

Twitter's historical archives, via DataSift

DataSift, one of the two companies that have official access to the Twitter firehose (the other being Gnip), announced its new Historics service this week, giving customers access to up to two years' worth of historical Tweets. (By comparison, Gnip offers 30 days of Twitter data, and other developers and users have access to roughly a week's worth of Tweets.)

GigaOm's Barb Darrow responded to those who might be skeptical about the relevance of this sort of historic Twitter data in a service that emphasizes real-time. Darrow noted that DataSift CEO Rob Bailey said companies planning new products, promotions or price changes would do well to study the impact of their past actions before proceeding and that Twitter is the perfect venue for that.

Another indication of the desirability of this new Twitter data: the waiting list for Historics already includes a number of Fortune 500 companies. The service will get its official launch in April.


Building a school of data

Although there are plenty of ways to receive formal training in math, statistics and engineering, there aren't a lot of options when it comes to an education specifically in data science.

To that end, the Open Knowledge Foundation and Peer to Peer University (P2PU) have proposed a School of Data, arguing that:

"It will be years before data specialist degree paths become broadly available and accepted, and even then, time-intensive degree courses may not be the right option for journalists, activists, or computer programmers who just need to add data skills to their existing expertise. What is needed are flexible, on-demand, shorter learning options for people who are actively working in areas that benefit from data skills, particularly those who may have already left formal education programmes."

The organizations are seeking volunteers to help develop the project, whether that's in the form of educational materials, learning challenges, mentorship, or a potential student body.

Strata in California

The Strata Conference wraps up today in Santa Clara, Calif. If you missed Strata this year and weren't able to catch the livestream of the conference, look for excerpts and videos posted here on Radar and through the O'Reilly YouTube channel in the coming weeks.

And be sure to make plans for Strata New York, being held October 23-25. That event will mark the merger with Hadoop World. The call for speaker proposals for Strata NY is now open.

Got data news?

Feel free to email me.


February 09 2012

Strata Week: Your personal automated data scientist

Here are a few of the data stories that caught my attention this week:

Wolfram|Alpha Pro: An on-call data scientist

The computational knowledge engine Wolfram|Alpha unveiled a pro version this week. For $4.99 per month ($2.99 for students), Wolfram|Alpha Pro offers access to more of the computational power "under the hood" of the site, in part by allowing users to upload their own datasets, which Wolfram|Alpha will in turn analyze.

This includes:

  • Text files — Wolfram|Alpha will respond with the character and word count, provide an estimate on how long it would take to read aloud, and reveal the most common word, average sentence length and more.
  • Spreadsheets — It will crunch the numbers and return a variety of statistics and graphs.
  • Image files — It will analyze the image's dimensions, size, and colors, and let you apply several different filters.
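To make the text-file case concrete, here is a minimal sketch of how statistics like these could be computed. It is not Wolfram|Alpha's method: the function name `text_stats`, the tokenizing rules, and the default speaking rate of 150 words per minute are all assumptions made for illustration.

```python
import re
from collections import Counter


def text_stats(text, words_per_minute=150):
    """Return simple statistics for a block of plain text.

    words_per_minute is an assumed speaking rate, not a measured one.
    """
    # Treat runs of letters, digits and apostrophes as words.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    most_common = Counter(words).most_common(1)
    return {
        "characters": len(text),
        "words": len(words),
        "most_common_word": most_common[0][0] if most_common else None,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "read_aloud_minutes": len(words) / words_per_minute,
    }


if __name__ == "__main__":
    print(text_stats("The cat sat. The cat ran fast!"))
```

A real service would need smarter sentence splitting (abbreviations, decimals) and language-aware tokenizing; the sketch only shows the shape of the computation.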

Wolfram|Alpha Pro subscribers can upload and analyze their own datasets.

There's also a new extended keyboard that contains the Greek alphabet and other special characters for manually entering data. Data and analysis from these entries and any queries can also be downloaded.

"In a sense," writes Wolfram's founder Stephen Wolfram, "the concept is to imagine what a good data scientist would do if confronted with your data, then just immediately and automatically do that — and show you the results."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Crisis-mapping and data protection standards

Ushahidi's Patrick Meier takes a look at the recently released Data Protection Manual issued by the International Organization for Migration (IOM). According to the IOM, the manual is meant to serve as a guide to help:

" ... protect the personal data of the migrants in its care. It follows concerns about the general increase in data theft and loss and the recognition that hackers are finding ever more sophisticated ways of breaking into personal files. The IOM Data Protection Manual aims to protect the integrity and confidentiality of personal data and to prevent inappropriate disclosure."

Meier describes the manual as "required reading" but notes that there is no mention of social media in the 150-page document. "This is perfectly understandable given IOM's work," he writes, "but there is no denying that disaster-affected communities are becoming more digitally-enabled — and thus, increasingly the source of important, user-generated information."

Meier moves through the Data Protection Manual's principles, highlighting the ones that may be challenged when it comes to user-generated, crowdsourced data and raising important questions about consent, privacy, and security.

Doubting the dating industry's algorithms

Many online dating websites claim that their algorithms are able to help match singles with their perfect mate. But a forthcoming article in "Psychological Science in the Public Interest," a journal of the Association for Psychological Science, casts some doubt on the data science of dating.

According to the article's lead author Eli Finkel, associate professor of social psychology at Northwestern University, "there is no compelling evidence that any online dating matching algorithm actually works." Finkel argues that dating sites' algorithms do not "adhere to the standards of science," and adds that "it is unlikely that their algorithms can work, even in principle, given the limitations of the sorts of matching procedures that these sites use."

It's "relationship science" versus the intake questions that most dating sites ask in order to help users create their profiles and suggest matches. Finkel and his coauthors note that some of the strongest predictors for good relationships, such as how couples interact under pressure, aren't assessed by dating sites.

The paper calls for the creation of a panel to grade the scientific credibility of each online dating site.

Got data news?

Feel free to email me.


January 31 2012

Embracing the chaos of data

A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he'll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.

What do you mean by asking data scientists to embrace the chaos of data?

Pete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.

Are we wasting time trying to make unstructured data structured?

Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can't get it. Most structured data is the result of years of effort, so it is only available with a lot of strings, either financial or through usage restrictions.

The first advantage of unstructured data is that it's widely available because the producers don't see much value in it. The second advantage is that because there's no "structuring" work required, there's usually a lot more of it, so you get much broader coverage.

A good comparison is Yahoo's highly-structured web directory versus Google's search index built on unstructured HTML soup. If you were looking for something that was covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google's broad coverage made it more useful. For example, I hear that 30% of search queries are "once in history" events — unique combinations of terms that never occur again.

Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it's harder to get started, but the potential rewards are much greater.

How do you see data tools developing over the next few years? Will they become more accessible to more people?

Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.
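As a rough illustration of the address example, the sketch below normalizes a messy street address and fuzzily matches it against a lookup table. It is not taken from any of the open-source projects Warden alludes to: the `GAZETTEER` entries are made-up addresses with placeholder coordinates, and the names `normalize`, `geocode` and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical gazetteer: invented addresses with placeholder coordinates.
GAZETTEER = {
    "1 example street springfield": (10.0, 20.0),
    "22 sample avenue springfield": (10.5, 20.5),
}

# A few common abbreviations to expand before matching (illustrative only).
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(address):
    """Lowercase, strip punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def geocode(address, cutoff=0.8):
    """Return (lat, lon) for the closest gazetteer entry, or None."""
    key = normalize(address)
    matches = difflib.get_close_matches(key, list(GAZETTEER), n=1, cutoff=cutoff)
    return GAZETTEER[matches[0]] if matches else None


if __name__ == "__main__":
    # A misspelled, abbreviated input still resolves to the first entry.
    print(geocode("1 Example St., Springfeild"))
```

The burden Warden describes falls on the consumer here: the code, not the publisher, absorbs the typos and abbreviations, and it accepts that some inputs will match wrongly or not at all.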


Associated photo on home and category pages: "mess with graphviz" by Toms Bauģis, on Flickr


December 08 2011

Strata Week: The looming data science talent shortage

Here are a few of the big-data stories that caught my attention this week.

Data scientists in demand

This week, EMC released (pdf) the findings of its recent survey of the data science community. Calling it the largest ever survey of its kind, the EMC Data Science Study included responses from more than 500 data scientists, information analysts, and data specialists from the U.S., U.K., France, Germany, India and China.

The majority of respondents (83%) said they believed that new technologies would increase the need for data scientists. But 64% also felt as though this new demand for data scientists would outstrip the supply (31% said demand would "significantly outpace" supply). Just 12% felt as though future data science jobs would be filled by current business intelligence professionals.

The source for future talent? College students, not surprisingly: 34% said future data science jobs would go to computer science grads; 24% said these jobs would go to those from other disciplines. And in the case of data scientists, those may well be college students with master's degrees or PhDs: some 40% of data scientists have an advanced degree, and nearly one in 10 have a doctorate. In comparison, less than 1% of business intelligence professionals have a PhD.

But the problems that the data science community faces aren't simply a future talent shortage. Just a third of respondents said they were confident in their company's ability to make data-driven business decisions. Again, respondents pointed to a shortage of employees with the right training or skills (32%). Budget shortages were also an issue (32%).

Another problem uncovered by the survey: data accessibility. Just 12% of business intelligence analysts and 22% of data scientists say they "strongly believe" that employees have the access they need to run experiments on data.


Carrier IQ and big data

The mobile intelligence company Carrier IQ has gone from obscurity to infamy following the discovery by Android developer Trevor Eckhart that Carrier IQ's rootkit software could record all sorts of user data — texts, web browsing, keystrokes, and even phone calls.

The software is on an estimated 100 million phones — Android and iOS alike — and the news of it has prompted calls for an FTC investigation, questions from a Senator, and class-action lawsuits.

Carrier IQ issued a statement, explaining that "Our software makes your phone better by delivering intelligence on the performance of mobile devices and networks to help the operators provide optimal service efficiency."

But at GigaOm, Kevin Fitchard called Carrier IQ's relationships to handset makers and carriers a "bizarre big-data triangle":

This is big data for the mobile world — massive databases of consumer behavior delving into when, how and in what manner we use our devices. By Carrier IQ's own admission, its software is embedded in more than 150 million handsets. There are plenty of companies that would find that information enormously useful. The problem is Carrier IQ never got permission from all these smartphone users to collect that data, never told them it was gathering it, and never provided a way of opting out.

DataSift will soon offer access to historical tweets

It was April of last year when Twitter announced it was donating its entire archive to the Library of Congress, and since then, researchers have been waiting to get their hands on this older Twitter data.

As it currently stands, you can only search Twitter back as far as a week. And while you can get access to the Twitter firehose, that's little help in looking at the historical record.

But starting soon, developers and researchers will have access to a bit more of that record when DataSift begins offering historical data. DataSift's alpha version will offer access to 60 days' worth of the Twitter feed, and when the service formally launches next year, DataSift promises more data.

It's not quite the Library of Congress, which, as we noted earlier this year, is working on the technology infrastructure to make the historical Tweets indexable and accessible. The Library of Congress does have access to the Twitter firehose (via the other stream provider, Gnip), so it looks like that's where the complete record will, for now at least, reside.

Got data news?

Feel free to email me.

