Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

December 14 2011

Five big data predictions for 2012

As the "coming out" year for big data and data science draws to a close, what can we expect over the next 12 months?

More powerful and expressive tools for analysis

HadoopThis year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won't be long before that's dull, yet necessary, infrastructure.

Looking up the stack, there's already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there's a way to go in making big data more powerful: that is, to decrease the cost of creating experiments.

Here are two ways in which big data can be made more powerful.

  1. Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idiom that lets us focus on the data, rather than abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we're doing with the data. These abstractions will in turn lend themselves to the creation of better tools for non-programmers.
  2. We require better support for interactivity. If Hadoop has any weakness, it's in the batch-oriented nature of computation it fosters. The agile nature of data science will favor any tool that permits more interactivity.

Streaming data processing

Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.

Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.

For some applications, there just isn't enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data or make decisions about discarding it without having to go through the store-compute loop of map/reduce.

Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.

Rise of data marketplaces

Your own data can become that much more potent when mixed with other datasets. For instance, add in weather conditions to your customer data, and discover if there are weather related patterns to your customers' purchasing patterns. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft's direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.

Development of data science workflows and tools

As data science teams become a recognized part of companies, we'll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company's business operations, as opposed to being a sidecar analysis team.

EMC ChorusSoftware developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners. Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum's Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.

Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the ground-breaking work newspaper data teams are doing at news organizations such as The Guardian and New York Times: given short timescales these teams take data from raw form to a finished product, working hand-in-hand with the journalist.

Increased understanding of and demand for visualization

Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.

If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.

Throughout a year dominated by business' constant demand for data scientists, I've repeatedly heard from data scientists about what they want most: people who know how to create visualizations.

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.

Related:

September 16 2011

Top Stories: September 12-16, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Building data science teams
A data science team needs people with the right skills and perspectives, and it also requires strong tools, processes, and interaction between the team and the rest of the company.

The evolution of data products
The real changes in our lives will come from products that have the richness of data without calling attention to the data.

The work of data journalism: Find, clean, analyze, create ... repeat
Simon Rogers discusses the grunt work and tools behind The Guardian's data stories.

Social data: A better way to track TV
PeopleBrowsr CEO Jodee Rich says social data offers a better way to see what TV audiences watch and what they care about.


When media rebooted, it brought marketing with it
In this TOC podcast, Twist Image president Mitch Joel talks about some of the common challenges facing the music, magazine and book publishing sectors.




Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. Save 30% on registration with the code ORM30.

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!

Schweinderl