
February 22 2012

Big data in the cloud

Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both "cloud" and "big data" have broad definitions, obscured by considerable hype. This article breaks down the landscape as simply as possible, highlighting what's practical, and what's to come.

IaaS and private clouds



What is often called "cloud" amounts to virtualized servers: computing
resource that presents itself as a regular server, rentable per
consumption. This is generally called infrastructure as a service
(IaaS), and is offered by platforms such as Rackspace Cloud or Amazon
EC2. You buy time on these services, and install and configure your
own software, such as a Hadoop cluster or NoSQL database. Most of the
solutions I described in my Big Data Market Survey can be deployed on
IaaS services.



Using IaaS clouds doesn't mean you must handle all deployment
manually: good news for the clusters of machines big data
requires. You can use orchestration frameworks, which handle the
management of resources, and automated infrastructure tools, which
handle server installation and configuration. RightScale offers a
commercial multi-cloud management platform that mitigates some of the
problems of managing servers in the cloud.



Frameworks such as OpenStack and Eucalyptus aim to present a uniform
interface to both private data centers and the public
cloud. Attracting strong cross-industry support, OpenStack currently
addresses computing resources (akin to Amazon's EC2) and storage
(paralleling Amazon's S3).



The race is on to make private clouds and IaaS services more usable:
over the next two years using clouds should become much more
straightforward as vendors adopt the nascent standards. There'll be a
uniform interface, whether you're using public or private cloud
facilities, or a hybrid of the two.



Particular to big data, several configuration tools already target
Hadoop explicitly: among them Dell's Crowbar, which aims to make
deploying and configuring clusters simple, and Apache Whirr, which is
specialized for running Hadoop services and other clustered data processing systems.
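
To give a flavor of what Whirr automates, here is a minimal sketch of a cluster definition, loosely based on the Hadoop recipe that ships with Whirr. The cluster name, node counts and hardware size are illustrative, and property names should be checked against the Whirr version you are using.

    # hadoop.properties: an illustrative Whirr recipe for a small Hadoop cluster
    whirr.cluster-name=radar-hadoop
    # One master running the namenode and jobtracker, three worker nodes
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.hardware-id=m1.large
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

Running "whirr launch-cluster --config hadoop.properties" then provisions the machines and configures Hadoop on them; "whirr destroy-cluster --config hadoop.properties" tears everything down when you are done.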



Today, using IaaS gives you a broad choice of cloud supplier, the
option of using a private cloud, and complete control: but you'll be
responsible for deploying, managing and maintaining your clusters.


Platform solutions

Using IaaS only brings you so far with big data applications: IaaS services handle the creation of computing and storage resources, but don't address anything at a higher level. The setup of Hadoop and Hive, or a similar solution, is down to you.

Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.

The general PaaS market is burgeoning, with major players including VMware (Cloud Foundry) and Salesforce (Heroku, force.com). As big data and machine learning requirements percolate through the industry, these players are likely to add their own big-data-specific services. For the purposes of this article, though, I will stick to the vendors that have already implemented big data solutions.

Today's primary providers of such big data platform services are Amazon, Google and Microsoft. You can see their offerings summarized in the table toward the end of this article. Both Amazon Web Services and Microsoft's Azure blur the lines between infrastructure as a service and platform: you can mix and match. By contrast, Google's philosophy is to skip the notion of a server altogether, and focus only on the concept of the application. Among these, only Amazon can lay claim to extensive experience with their product.

Amazon Web Services

Amazon has significant experience in hosting big data processing. Use of Amazon EC2 for Hadoop was a popular and natural move for many early adopters of big data, thanks to Amazon's expandable supply of compute power. Building on this, Amazon launched Elastic Map Reduce in 2009, providing a hosted, scalable Hadoop service.

Applications on Amazon's platform can pick from the best of both the IaaS and PaaS worlds. General purpose EC2 servers host applications that can then access the appropriate special purpose managed solutions provided by Amazon.

As well as Elastic Map Reduce, Amazon offers several other services relevant to big data, such as the Simple Queue Service for coordinating distributed computing, and a hosted relational database service. At the specialist end of big data, Amazon's High Performance Computing solutions are tuned for low-latency cluster computing, of the sort required by scientific and engineering applications.
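
As a small illustration of the coordination role the Simple Queue Service can play, the sketch below uses boto, a widely used Python library for AWS, to pass work items between a producer and a worker. The queue name, message body and timeout are illustrative placeholders, not part of any particular Amazon example.

    # Sketch: passing work items through Amazon SQS with boto.
    # Assumes AWS credentials are available in the environment or boto config.
    import boto
    from boto.sqs.message import Message

    conn = boto.connect_sqs()
    queue = conn.create_queue('work-items')    # returns the existing queue if already created

    # Producer: enqueue a pointer to some data needing processing.
    msg = Message()
    msg.set_body('s3://example-bucket/logs/2012-02-22.gz')
    queue.write(msg)

    # Worker: fetch a message, process it, then delete it so it is not redelivered.
    incoming = queue.read(visibility_timeout=60)
    if incoming is not None:
        print(incoming.get_body())             # hand the item to your processing code here
        queue.delete_message(incoming)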


Elastic Map Reduce

Elastic Map Reduce (EMR) can be programmed in the usual Hadoop ways, through Pig, Hive or another programming language, and uses Amazon's S3 storage service to get data in and out.

Access to Elastic Map Reduce is through Amazon's SDKs and tools, or through analytical GUI and IDE products such as those offered by Karmasphere. In conjunction with these tools, EMR represents a strong option for experimental and analytical work. Amazon's pricing also makes EMR a much more attractive option than configuring EC2 instances yourself to run Hadoop.
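
To illustrate the SDK route, the sketch below submits a Hadoop streaming job to EMR using boto, the Python AWS library. The bucket paths, scripts and instance settings are illustrative assumptions rather than a recommended configuration.

    # Sketch: submitting a Hadoop streaming job flow to Elastic Map Reduce with boto.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()   # AWS credentials taken from the environment or boto config

    step = StreamingStep(
        name='Word count',
        mapper='s3n://example-bucket/scripts/wordcount-mapper.py',
        reducer='aggregate',   # Hadoop streaming's built-in aggregate reducer
        input='s3n://example-bucket/input/',
        output='s3n://example-bucket/output/wordcount/')

    jobflow_id = conn.run_jobflow(
        name='EMR word count example',
        log_uri='s3n://example-bucket/logs/',
        steps=[step],
        num_instances=3,
        master_instance_type='m1.small',
        slave_instance_type='m1.small')

    # Poll the job flow's state: STARTING, RUNNING, COMPLETED, FAILED and so on.
    print(conn.describe_jobflow(jobflow_id).state)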

When integrating Hadoop with applications generating structured data, using S3 as the main data source can be unwieldy. This is because, similar to Hadoop's HDFS, S3 works at the level of storing blobs of opaque data. Hadoop's answer to this is HBase, a NoSQL database that integrates with the rest of the Hadoop stack. Unfortunately, Amazon does not currently offer HBase with Elastic Map Reduce.

DynamoDB

Instead of HBase, Amazon provides DynamoDB, its own managed, scalable NoSQL database. As this is a managed solution, it represents a better choice than running your own database on top of EC2, in terms of both performance and economy.
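
A brief sketch of what working with DynamoDB looks like from Python, using the DynamoDB support added to boto in early 2012; the table name, keys and throughput figures are illustrative.

    # Sketch: creating a DynamoDB table and storing an item with boto.
    import boto

    conn = boto.connect_dynamodb()

    # Throughput is provisioned up front and can be adjusted later.
    schema = conn.create_schema(hash_key_name='user_id', hash_key_proto_value=str,
                                range_key_name='timestamp', range_key_proto_value=int)
    table = conn.create_table(name='events', schema=schema,
                              read_units=10, write_units=5)
    # (In practice, wait for the new table to become ACTIVE before writing to it.)

    # Items are schemaless beyond the key definition.
    item = table.new_item(hash_key='user-1234', range_key=1329868800,
                          attrs={'action': 'page_view', 'url': '/index.html'})
    item.put()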

DynamoDB data can be exported to and imported from S3, providing interoperability with EMR.

Google

Google's cloud platform stands out as distinct from its competitors. Rather than offering virtualization, it provides an application container with defined APIs and services. Developers do not need to concern themselves with the concept of machines: applications execute in the cloud, getting access to as much processing power as they need, within defined resource usage limits.

To use Google's platform, you must work within the constraints of its APIs. However, if that fits, you can reap the benefits of the security, tuning and performance improvements inherent to the way Google develops all its services.

AppEngine, Google's cloud application hosting service, offers a MapReduce facility for parallel computation over data, but this is more a feature for use within complex applications than a tool for analytical purposes. Instead, BigQuery and the Prediction API form the core of Google's big data offering, providing analysis and machine learning facilities respectively. Both these services are available exclusively via REST APIs, consistent with Google's vision for web-based computing.

BigQuery

BigQuery is an analytical database, suitable for interactive analysis over datasets of the order of 1TB. It works best on a small number of tables with a large number of rows. BigQuery offers a familiar SQL interface to its data. In that respect, it is comparable to Apache Hive, but its typical performance is faster, making BigQuery a good choice for exploratory data analysis.
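
The sketch below shows roughly what a query looks like through the REST API, using the Google API client library for Python. It assumes an httplib2.Http object already authorized via OAuth 2.0, and the project, dataset and query are illustrative; since the service is still in beta, details may change.

    # Sketch: running a synchronous query against BigQuery.
    from apiclient.discovery import build

    def run_query(http, project_id, sql):
        # http must be an httplib2.Http instance carrying OAuth 2.0 credentials.
        bigquery = build('bigquery', 'v2', http=http)
        response = bigquery.jobs().query(projectId=project_id,
                                         body={'query': sql}).execute()
        # Each result row is a dictionary whose 'f' entry lists the field values.
        for row in response.get('rows', []):
            print([field['v'] for field in row['f']])

    # Hypothetical usage:
    # run_query(http, 'my-project',
    #           'SELECT word, COUNT(*) AS n FROM [mydataset.words] GROUP BY word LIMIT 10')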

Getting data into BigQuery is a matter of directly uploading it, or importing it from Google's Cloud Storage system. This is the aspect of BigQuery with the biggest room for improvement. Whereas Amazon's S3 lets you mail in disks for import, Google doesn't currently have this facility. Streaming data into BigQuery isn't viable either, so regular imports are required for constantly updating data. Finally, as BigQuery only accepts data formatted as comma-separated value (CSV) files, you will need to use external methods to clean up the data beforehand.

Rather than provide end-user interfaces itself, Google wants an ecosystem to grow around BigQuery, with vendors incorporating it into their products, in the same way Elastic Map Reduce has acquired tool integration. BigQuery is currently in a beta test, to which anybody can apply, and is expected to become publicly available during 2012.

Prediction API

Many uses of machine learning are well defined, such as classification, sentiment analysis, or recommendation generation. To meet these needs, Google offers its Prediction API product.

Applications using the Prediction API work by creating and training a model hosted within Google's system. Once trained, this model can be used to make predictions, such as spam detection. Google is working on allowing these models to be shared, optionally with a fee. This will let you take advantage of previously trained models, which in many cases will save you the time and expertise required for training.
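
As a rough sketch of that workflow, the API client library for Python exposes training and prediction as two calls. The model ID, the training file (a labeled CSV already uploaded to Google Cloud Storage) and the example input are assumptions for illustration, and the exact API version and request fields should be checked against Google's current documentation.

    # Sketch: training and querying a classification model with the Prediction API.
    from apiclient.discovery import build

    def train_and_predict(http):
        # http must be an httplib2.Http instance carrying OAuth 2.0 credentials.
        prediction = build('prediction', 'v1.5', http=http)

        # Train a model from a CSV whose first column is the label (e.g. 'spam' or 'ham').
        prediction.trainedmodels().insert(body={
            'id': 'spam-filter',
            'storageDataLocation': 'example-bucket/training.csv'}).execute()

        # Training is asynchronous; once the model reports DONE, ask for predictions.
        result = prediction.trainedmodels().predict(
            id='spam-filter',
            body={'input': {'csvInstance': ['Win a free cruise, click now']}}).execute()
        print(result['outputLabel'])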

Though promising, Google's offerings are in their early days. Further integration between its services is required, as well as time for ecosystem development to make their tools more approachable.

Microsoft

I have written in some detail about Microsoft's big data strategy in Microsoft's plan for Hadoop and big data. By offering its data platforms on Windows Azure in addition to Windows Server, Microsoft aims to make on-premise and cloud-based deployments equally viable with its technology. Azure parallels Amazon's web service offerings in many ways, mixing IaaS services with managed applications such as SQL Server.

Hadoop is the central pillar of Microsoft's big data approach, surrounded by the ecosystem of its own database and business intelligence tools. For organizations already invested in the Microsoft platform, Azure will represent the smoothest route for integrating big data into the operation. Azure itself is pragmatic about language choice, supporting technologies such as Java, PHP and Node.js in addition to Microsoft's own.

As with Google's BigQuery, Microsoft's Hadoop solution is currently in closed beta test, and is expected to be generally available sometime in the middle of 2012.

Big data cloud platforms compared

The following table summarizes the data storage and analysis capabilities of Amazon, Google and Microsoft's cloud platforms. Intentionally excluded are IaaS solutions without dedicated big data offerings.

Capability | Amazon | Google | Microsoft
---------- | ------ | ------ | ---------
Product(s) | Amazon Web Services | Google Cloud Services | Windows Azure
Big data storage | S3 | Cloud Storage | HDFS on Azure
Working storage | Elastic Block Store | AppEngine (Datastore, Blobstore) | Blob, table, queues
NoSQL database | DynamoDB [1] | AppEngine Datastore | Table storage
Relational database | Relational Database Service (MySQL or Oracle) | Cloud SQL (MySQL compatible) | SQL Azure
Application hosting | EC2 | AppEngine | Azure Compute
Map/Reduce service | Elastic MapReduce (Hadoop) | AppEngine (limited capacity) | Hadoop on Azure [2]
Big data analytics | Elastic MapReduce (Hadoop interface [3]) | BigQuery [2] (TB-scale, SQL interface) | Hadoop on Azure (Hadoop interface [3])
Machine learning | Via Hadoop + Mahout on EMR or EC2 | Prediction API | Mahout with Hadoop
Streaming processing | Nothing prepackaged: use custom solution on EC2 | Prospective Search API [4] | StreamInsight [2] ("Project Austin")
Data import | Network, physically ship drives | Network | Network
Data sources | Public Data Sets | A few sample datasets | Windows Azure Marketplace
Availability | Public production | Some services in private beta | Some services in private beta

Conclusion

Cloud-based big data services offer considerable advantages in removing the overhead of configuring and tuning your own clusters, and in ensuring you pay only for what you use. The biggest issue is always going to be data locality, as it is slow and expensive to ship data. The most effective big data cloud solutions will be the ones where the data is also collected in the cloud. This is an incentive to investigate EC2, Azure or AppEngine as a primary application platform, and an indicator that PaaS competitors such as Cloud Foundry and Heroku will have to address big data as a priority.

It is early days yet for big data in the cloud, with only Amazon offering battle-tested solutions at this point. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next two years.

However, the twin advantages of not having to worry about infrastructure and economies of scale mean it is well worth investigating cloud services for your big data needs, especially for an experimental or green-field project. Looking to the future, there's no doubt that big data analytical capability will form an essential component of utility computing solutions.

Notes:

[1] In public beta.

[2] In controlled beta test.

[3] Hive and Pig compatible.

[4] Experimental status.


December 15 2011

Where is the OkCupid for elections?

To date, we've generally been more adept at collecting and storing data than making sense of it. The companies, individuals and governments that become the most adept at data analysis are doing more than finding the signal in the noise: They are creating a strategic capability. Sometimes, the data comes from unexpected directions. For instance, OkCupid's approach to dating with data has earned it millions of users. In the process, OkCupid has gained great insight into the dynamics of dating in the 21st century, which it then shared on its blog.

Based upon their success, I wondered aloud at this year's Newsfoo whether a similar data-driven web app could be built to help citizens match themselves up with candidates.

After Tim tweeted the observation, I quickly learned two things:

  1. Albert Sun, Daniel Bachhuber, Ashwin Shandilya and Jay Zalowitz had built exactly that app at the 2011 Times Open Hack Day on the day I posed the question. OkCandidate is a web app that matches up a citizen with a Republican presidential candidate. (There's no comparable matching engine for Barack Obama, perhaps given that Democrats expect that the current incumbent of the White House will be the Democratic Party's nominee in 2012.) OkCandidate presents a straightforward series of questions about a wide range of core foreign and domestic issues with ratings to allow the user to rank the importance of agreeing with a given candidate. The app is open source, so if you want to try to improve the code, click on over to OkCandidate on GitHub.
  2. ElectNext, a Philadelphia-based startup, has focused on solving this problem. The "eHarmony for voters," as TechCrunch describes it, aims to match you to your candidate. I also learned that ElectNext won the Judges' Choice Award at the 2011 Web 2.0 Expo/NY Startup Showcase. In the video below, Joanne Wilson and Mo Koyfman discuss the startup from a venture capitalist's perspective.

The politics of big data

Creating a better issue-matching engine for voters and candidates is a genuinely useful civic function. The not-so-hidden opportunity here, however, may be to gather a rich dataset from those choices in precisely the same way that OkCupid has done for dating. That's clearly part of the mindset here: "The data on individual users we don't share with anyone," ElectNext founder Keya Danenbaum told Fast Company. "But the way we foresee using all this information we're collecting is ... eventually to aggregate that and say something really interesting in a poll type of report."

How news organizations and campaigns alike collect, store and analyze data is going to matter much more. Close watchers of the intersection of politics and technology already think the Obama campaign's data crunching may help the president win re-election. As Personal Democracy Media co-founder Micah Sifry put it back in April, "it's the data, stupid."

Big data is "powering the race for the White House," wrote Patrick Ruffini, president of Engage, an interactive agency in D.C.:

The hottest job in today's Presidential campaigns is the Data Mining Scientist — whose job it is to sort through terabytes of data and billions of behaviors tracked in voter files, consumer databases, and site logs. They'll use the numbers to uncover hidden patterns that predict how you'll vote, if you'll pony up with a donation, and if you'll influence your friends to support a candidate.

Alistair Croll, the co-chair of the Strata Conference, thinks it's a strategic capability. "After Eisenhower, you couldn't win an election without radio," he told me at Strata, Calif., in February. "After JFK, you couldn't win an election without television. After Obama, you couldn't win an election without social networking. I predict that in 2012, you won't be able to win an election without big data."


December 09 2011

Publishing News: Agency pricing, out of the pan and into the fire

Here's a look at the publishing stories that caught my attention this week.

Antitrust investigations focus on Apple and publishers

On Tuesday, the European Commission opened an antitrust investigation into pricing deals struck between Apple and five international publishers: Hachette Livre, HarperCollins, Simon & Schuster, Penguin and Holtzbrinck (the publishing houses were raided back in March). The Bookseller reports:

The Commission said it would investigate whether publishers and Apple had engaged in illegal agreements or practices that would restrict competition, and would also examine "the character and terms of the agency agreements entered into by the above named five publishers and retailers for the sale of e-books," with "concerns that these practices may breach EU antitrust rules that prohibit cartels and restrictive business practices."

On Wednesday, the U.S. Justice Department confirmed it, too, was investigating.

Reuters provided the background for these investigations:

Publishers adopted the agency model last year when Apple launched the iPad, allowing publishers to set the price of the sale of e-books. In turn, they would share revenue with the retailer. In the past, publishers would sell ebooks on a wholesale model for 50% of the retail price ... In the traditional "wholesale model," publishers set a recommended retail price, but the seller is free to offer deep discounts.

Bloomberg reports that "Publishers' deals with retailers are also under scrutiny."

Publishers need to get a grip on their data and take control of their advertising

David Soloff at Advertising Age took a look this week at declining advertising revenues for newspapers and magazines and placed the blame squarely on the publishers. Soloff writes:

Publishers have not generated much of the almost infinite supply of channel-choking inventory, but they have also done next to nothing to preserve what is good and proprietary and "premium" about their own inventory. In some cases, they have chosen lowest common denominator ad networks, exchanges and supply side platforms to do the hard work of selling.

He says publishers need to regain control of their advertising inventories and that "big data tools can dig them out of the undifferentiated, over-supplied, machine-driven nightmare of the sell side." His take on how to put the "premium" back in premium content is well worth the read.

Publishers may want to get a grip on their data and take control of their advertising sooner rather than later. Google's retail push against Amazon may very well have consequences for online ad revenues, particularly in the retail space. Ken Doctor over at the Nieman Lab took a look at Google's plans to enter the retail/shipping business and its possible implications. He points out that "[r]etailers don't want to advertise; they want to sell stuff," and he says there's no loyalty in advertising:

Give [retailers] new routes to sell stuff, and deliver it more cheaply than they could before, and they'll migrate their ad/marketing/lead generation dollars. So, if Google can really make it easier to personalize, routinize and make more efficient the selling process, it will place itself between the seller and the buyer. As it does that, it replaces the newspaper as middleman, further reducing much of the revenue that is keeping newsrooms staffed, even if many of them are now half-staffed at best.

Read it Later identifies the most-read authors on the web

Read it Later recently passed 4 million users. Earlier this year, the service used data gathered from its users to look at online reading behavior and how it's affected by the "time shifting" of content afforded by mobile technologies. This week, the company released a new study identifying the most-read authors on the web. The study looked at data gathered between May and October 2011, based on more than 47 million saves, according to the report.

Who came out on top? Have a look:

Read It Later's most-read web authors

The study also looked at longevity and loyalty — the authors with the best return rates, or those with stories readers returned to in some way. The report points out that "[t]he most interesting thing isn't just that we found different authors for the top 'return rate,' but also different categories of content and types of publishers."

Author return rate

The full report can be found here.

