
February 24 2012

Top stories: February 20-24, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Data for the public good
The explosion of big data, open data and social data offers new opportunities to address humanity's biggest challenges. The open question is no longer if data can be used for the public good, but how.

Building the health information infrastructure for the modern epatient
The National Coordinator for Health IT, Dr. Farzad Mostashari, discusses patient empowerment, data access and ownership, and other important trends in healthcare.

Big data in the cloud
Big data and cloud technology go hand-in-hand, but it's comparatively early days. Strata conference chair Edd Dumbill explains the cloud landscape and compares the offerings of Amazon, Google and Microsoft.

Everyone has a big data problem
MetaLayer's Jonathan Gosier talks about the need to democratize data tools because everyone has a big data problem.

Three reasons why direct billing is ready for its close-up
David Sims looks at the state of direct billing and explains why it's poised to catch on beyond online games and media.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

Cloud photo: Big cloud by jojo nicdao, on Flickr

February 22 2012

Big data in the cloud

Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both "cloud" and "big data" have broad definitions, obscured by considerable hype. This article breaks down the landscape as simply as possible, highlighting what's practical, and what's to come.

IaaS and private clouds



What is often called "cloud" amounts to virtualized servers: computing resources that present themselves as regular servers, rented according to consumption. This is generally called infrastructure as a service (IaaS), and is offered by platforms such as Rackspace Cloud or Amazon EC2. You buy time on these services, and install and configure your own software, such as a Hadoop cluster or NoSQL database. Most of the solutions I described in my Big Data Market Survey can be deployed on IaaS services.

Using IaaS clouds doesn't mean you must handle all deployment manually: good news for the clusters of machines big data requires. You can use orchestration frameworks, which handle the management of resources, and automated infrastructure tools, which handle server installation and configuration. RightScale offers a commercial multi-cloud management platform that mitigates some of the problems of managing servers in the cloud.

Frameworks such as OpenStack and Eucalyptus aim to present a uniform interface to both private data centers and the public cloud. Attracting a strong flow of cross-industry support, OpenStack currently addresses computing resources (akin to Amazon's EC2) and storage (paralleling Amazon's S3).

The race is on to make private clouds and IaaS services more usable: over the next two years, using clouds should become much more straightforward as vendors adopt the nascent standards. There'll be a uniform interface, whether you're using public or private cloud facilities, or a hybrid of the two.

Particular to big data, several configuration tools already target Hadoop explicitly: among them Dell's Crowbar, which aims to make deploying and configuring clusters simple, and Apache Whirr, which is specialized for running Hadoop services and other clustered data processing systems.

Today, using IaaS gives you a broad choice of cloud supplier, the option of using a private cloud, and complete control, but you'll be responsible for deploying, managing and maintaining your clusters.
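
As a concrete sketch of what that control and responsibility look like, here is roughly how you might rent a server on EC2 with the boto Python library and then hand it off to your own configuration tooling. The AMI ID and instance type below are placeholders, not recommendations:

    # Sketch: renting a virtual server on Amazon EC2 with boto.
    # Credentials come from the AWS_ACCESS_KEY_ID and
    # AWS_SECRET_ACCESS_KEY environment variables; the AMI ID is
    # a placeholder.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    reservation = conn.run_instances("ami-12345678", instance_type="m1.large")
    instance = reservation.instances[0]
    print("launched %s, state %s" % (instance.id, instance.state))
    # From here, an orchestration tool (or plain SSH) would install
    # Hadoop, a NoSQL database, or whatever your stack requires.
    # You pay per hour until you call instance.terminate().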

Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.

Platform solutions

Using IaaS only takes you so far with big data applications: it handles the creation of computing and storage resources, but doesn't address anything at a higher level. Setting up Hadoop and Hive, or a similar solution, is down to you.

Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.

The general PaaS market is burgeoning, with major players including VMware (Cloud Foundry) and Salesforce (Heroku, force.com). As big data and machine learning requirements percolate through the industry, these players are likely to add their own big-data-specific services. For the purposes of this article, though, I will stick to the vendors who have already implemented big data solutions.

Today's primary providers of such big data platform services are Amazon, Google and Microsoft. You can see their offerings summarized in the table toward the end of this article. Both Amazon Web Services and Microsoft's Azure blur the lines between infrastructure as a service and platform: you can mix and match. By contrast, Google's philosophy is to skip the notion of a server altogether, and focus only on the concept of the application. Among these, only Amazon can lay claim to extensive experience with their product.

Amazon Web Services

Amazon has significant experience in hosting big data processing. Use of Amazon EC2 for Hadoop was a popular and natural move for many early adopters of big data, thanks to Amazon's expandable supply of compute power. Building on this, Amazon launched Elastic Map Reduce in 2009, providing a hosted, scalable Hadoop service.

Applications on Amazon's platform can pick from the best of both the IaaS and PaaS worlds. General purpose EC2 servers host applications that can then access the appropriate special purpose managed solutions provided by Amazon.

As well as Elastic Map Reduce, Amazon offers several other services relevant to big data, such as the Simple Queue Service for coordinating distributed computing, and a hosted relational database service. At the specialist end of big data, Amazon's High Performance Computing solutions are tuned for low-latency cluster computing, of the sort required by scientific and engineering applications.
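
As an illustration of the queue service in that mix, here is a minimal sketch of coordinating distributed workers through SQS with the boto Python library; the queue name and S3 path are invented:

    # Sketch: coordinating distributed workers with the Simple Queue
    # Service via boto. The queue name and S3 path are invented.
    import boto.sqs
    from boto.sqs.message import Message

    conn = boto.sqs.connect_to_region("us-east-1")
    queue = conn.create_queue("wordcount-tasks")

    # Producer side: enqueue a pointer to a chunk of work held in S3.
    m = Message()
    m.set_body("s3://example-bucket/input/chunk-0001")
    queue.write(m)

    # Worker side: claim a task, process it, then delete it so no other
    # worker repeats it. If the worker dies, the message reappears
    # after the visibility timeout.
    task = queue.read(visibility_timeout=60)
    if task is not None:
        print(task.get_body())  # a real worker would process the chunk here
        queue.delete_message(task)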


Elastic Map Reduce

Elastic Map Reduce (EMR) can be programmed in the usual Hadoop ways, through Pig, Hive or another programming language, and uses Amazon's S3 storage service to get data in and out.
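
For instance, submitting a minimal job flow with the boto Python library might look like the following; the bucket names, scripts and cluster size are placeholders:

    # Sketch: launching a Hadoop Streaming job flow on Elastic Map
    # Reduce with boto. Bucket names, scripts and sizes are placeholders.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()  # credentials come from the environment
    step = StreamingStep(
        name="Word count",
        mapper="s3://example-bucket/scripts/mapper.py",
        reducer="s3://example-bucket/scripts/reducer.py",
        input="s3://example-bucket/input/",
        output="s3://example-bucket/output/",
    )
    jobflow_id = conn.run_jobflow(
        name="EMR sketch",
        log_uri="s3://example-bucket/logs/",
        steps=[step],
        num_instances=4,
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
    )
    print(jobflow_id)  # poll conn.describe_jobflow() for progress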

Access to Elastic Map Reduce is through Amazon's SDKs and tools, or with GUI analytical and IDE products such as those offered by Karmasphere. In conjunction with these tools, EMR represents a strong option for experimental and analytical work. Amazon's pricing also makes EMR a much more attractive option than configuring EC2 instances yourself to run Hadoop.

When integrating Hadoop with applications generating structured data, using S3 as the main data source can be unwieldy. This is because, similar to Hadoop's HDFS, S3 works at the level of storing blobs of opaque data. Hadoop's answer to this is HBase, a NoSQL database that integrates with the rest of the Hadoop stack. Unfortunately, Amazon does not currently offer HBase with Elastic Map Reduce.

DynamoDB

Instead of HBase, Amazon provides DynamoDB, its own managed, scalable NoSQL database. As this is a managed solution, it represents a better choice than running your own database on top of EC2, in terms of both performance and economy.

DynamoDB data can be exported to and imported from S3, providing interoperability with EMR.
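
As a sketch of what working with that managed database looks like, here is a minimal example using boto's DynamoDB interface; the table, key and attribute names are invented for illustration:

    # Sketch: a managed NoSQL table via boto's DynamoDB interface.
    # Table, key and attribute names are invented.
    import boto

    conn = boto.connect_dynamodb()
    schema = conn.create_schema(hash_key_name="user_id",
                                hash_key_proto_value=str)
    table = conn.create_table(name="events", schema=schema,
                              read_units=10, write_units=5)

    # Writes and reads scale with the provisioned throughput above;
    # Amazon handles partitioning and replication behind the scenes.
    item = table.new_item(hash_key="alice",
                          attrs={"last_seen": "2012-02-22", "visits": 42})
    item.put()
    print(table.get_item(hash_key="alice"))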

Google

Google's cloud platform stands out as distinct from its competitors. Rather than offering virtualization, it provides an application container with defined APIs and services. Developers do not need to concern themselves with the concept of machines: applications execute in the cloud, getting access to as much processing power as they need, within defined resource usage limits.

To use Google's platform, you must work within the constraints of its APIs. However, if that fits, you can reap the benefits of the security, tuning and performance improvements inherent to the way Google develops all its services.

AppEngine, Google's cloud application hosting service, offers a MapReduce facility for parallel computation over data, but this is more a feature for use within complex applications than a tool for analytical purposes. Instead, BigQuery and the Prediction API form the core of Google's big data offering, providing analysis and machine learning facilities respectively. Both services are available exclusively via REST APIs, consistent with Google's vision for web-based computing.

BigQuery

BigQuery is an analytical database, suitable for interactive analysis over datasets of the order of 1TB. It works best on a small number of tables with a large number of rows. BigQuery offers a familiar SQL interface to its data. In that, it is comparable to Apache Hive, but the typical performance is faster, making BigQuery a good choice for exploratory data analysis.
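
As a sketch of that SQL interface, a query issued through the REST API with the google-api-python-client library might look roughly like this; the project and table names are invented, and the exact API surface may shift while the service is in beta:

    # Sketch: querying BigQuery over its REST API with
    # google-api-python-client. Project and table names are invented.
    from apiclient.discovery import build

    # 'authorized_http' stands in for an httplib2.Http object wrapped
    # with OAuth 2.0 credentials via oauth2client.
    service = build("bigquery", "v2", http=authorized_http)

    result = service.jobs().query(
        projectId="example-project",
        body={"query":
              "SELECT word, COUNT(*) AS n "
              "FROM example_dataset.words "
              "GROUP BY word ORDER BY n DESC LIMIT 10"},
    ).execute()

    for row in result["rows"]:
        # Each row holds its cell values under the "f" key.
        print([cell["v"] for cell in row["f"]])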

Getting data into BigQuery is a matter of directly uploading it, or importing it from Google's Cloud Storage system. This is the aspect of BigQuery with the biggest room for improvement. Whereas Amazon's S3 lets you mail in disks for import, Google doesn't currently have this facility. Streaming data into BigQuery isn't viable either, so regular imports are required for constantly updating data. Finally, as BigQuery only accepts data formatted as comma-separated value (CSV) files, you will need to use external methods to clean up the data beforehand.
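
For example, a quick pre-import cleanup pass with Python's standard csv module, converting tab-separated records with a made-up three-field layout into CSV, might look like this:

    # Sketch: normalizing messy tab-separated records into clean CSV
    # before uploading to BigQuery. The three-field layout is invented.
    import csv

    with open("raw.txt") as src, open("clean.csv", "w") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:   # skip malformed records
                continue
            writer.writerow([f.strip() for f in fields])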

Rather than provide end-user interfaces itself, Google wants an ecosystem to grow around BigQuery, with vendors incorporating it into their products, in the same way Elastic Map Reduce has acquired tool integration. BigQuery is currently in a beta test, to which anybody can apply, and is expected to become publicly available during 2012.

Prediction API

Many uses of machine learning are well defined, such as classification, sentiment analysis, or recommendation generation. To meet these needs, Google offers its Prediction API product.

Applications using the Prediction API work by creating and training a model hosted within Google's system. Once trained, this model can be used to make predictions, such as spam detection. Google is working on allowing these models to be shared, optionally for a fee. This will let you take advantage of previously trained models, which in many cases will save you the time and expertise needed to train them yourself.
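
As a rough sketch, training and querying such a model through the google-api-python-client library might look like this; the bucket, model ID and example input are invented, and the API version may differ from what is shown:

    # Sketch: training and querying a classifier with the Prediction
    # API via google-api-python-client. Bucket, model ID and input are
    # invented; 'authorized_http' again stands in for an OAuth 2.0-
    # authorized httplib2.Http object.
    from apiclient.discovery import build

    service = build("prediction", "v1.5", http=authorized_http)

    # Train: point the service at labeled CSV data in Cloud Storage,
    # where the first column of each row is the label.
    service.trainedmodels().insert(body={
        "id": "spam-detector",
        "storageDataLocation": "example-bucket/training.csv",
    }).execute()

    # Predict: once training has finished, classify a new example.
    result = service.trainedmodels().predict(
        id="spam-detector",
        body={"input": {"csvInstance": ["Buy cheap watches now!!!"]}},
    ).execute()
    print(result["outputLabel"])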

Though promising, Google's offerings are in their early days. Further integration between its services is required, as well as time for ecosystem development to make their tools more approachable.

Microsoft

I have written in some detail about Microsoft's big data strategy in Microsoft's plan for Hadoop and big data. By offering its data platforms on Windows Azure in addition to Windows Server, Microsoft's aim is to make either on-premise or cloud-based deployments equally viable with its technology. Azure parallels Amazon's web service offerings in many ways, offering a mix of IaaS services with managed applications such as SQL Server.

Hadoop is the central pillar of Microsoft's big data approach, surrounded by the ecosystem of its own database and business intelligence tools. For organizations already invested in the Microsoft platform, Azure will represent the smoothest route for integrating big data into the operation. Azure itself is pragmatic about language choice, supporting technologies such as Java, PHP and Node.js in addition to Microsoft's own.

As with Google's BigQuery, Microsoft's Hadoop solution is currently in closed beta test, and is expected to be generally available sometime in the middle of 2012.

Big data cloud platforms compared

The following table summarizes the data storage and analysis capabilities of Amazon, Google and Microsoft's cloud platforms. Intentionally excluded are IaaS solutions without dedicated big data offerings.

                     | Amazon                                          | Google                                   | Microsoft
Product(s)           | Amazon Web Services                             | Google Cloud Services                    | Windows Azure
Big data storage     | S3                                              | Cloud Storage                            | HDFS on Azure
Working storage      | Elastic Block Store                             | AppEngine (Datastore, Blobstore)         | Blob, table, queues
NoSQL database       | DynamoDB [1]                                    | AppEngine Datastore                      | Table storage
Relational database  | Relational Database Service (MySQL or Oracle)   | Cloud SQL (MySQL compatible)             | SQL Azure
Application hosting  | EC2                                             | AppEngine                                | Azure Compute
Map/Reduce service   | Elastic MapReduce (Hadoop)                      | AppEngine (limited capacity)             | Hadoop on Azure [2]
Big data analytics   | Elastic MapReduce (Hadoop interface [3])        | BigQuery [2] (TB-scale, SQL interface)   | Hadoop on Azure (Hadoop interface [3])
Machine learning     | Via Hadoop + Mahout on EMR or EC2               | Prediction API                           | Mahout with Hadoop
Streaming processing | Nothing prepackaged: use custom solution on EC2 | Prospective Search API [4]               | StreamInsight [2] ("Project Austin")
Data import          | Network, physically ship drives                 | Network                                  | Network
Data sources         | Public Data Sets                                | A few sample datasets                    | Windows Azure Marketplace
Availability         | Public production                               | Some services in private beta            | Some services in private beta

Conclusion

Cloud-based big data services offer considerable advantages in removing the overhead of configuring and tuning your own clusters, and in ensuring you pay only for what you use. The biggest issue is always going to be data locality, as it is slow and expensive to ship data. The most effective big data cloud solutions will be the ones where the data is also collected in the cloud. This is an incentive to investigate EC2, Azure or AppEngine as a primary application platform, and an indicator that PaaS competitors such as Cloud Foundry and Heroku will have to address big data as a priority.

It is early days yet for big data in the cloud, with only Amazon offering battle-tested solutions at this point. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next two years.

However, the twin advantages of not having to worry about infrastructure and economies of scale mean it is well worth investigating cloud services for your big data needs, especially for an experimental or green-field project. Looking to the future, there's no doubt that big data analytical capability will form an essential component of utility computing solutions.

Notes:

[1] In public beta.

[2] In controlled beta test.

[3] Hive and Pig compatible.

[4] Experimental status.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


December 14 2011

Four short links: 14 December 2011

  1. The HipHop Virtual Machine (Facebook) -- inside the new virtual machine for PHP from Facebook.
  2. PHP Fog's Free Thinkup Hosting (Expert Labs) -- ThinkUp archives your tweets and other social media activity for you to search, visualize, and analyze. PHPFog hosts PHP apps scalably, and I'm delighted to be an advisor. Andy's made a video showing how to get up and running with ThinkUp in 3m. (This is impressive given how long I squinted at ThinkUp and swore trying to get it going on my colo box just a year ago)
  3. The Secret Lives of Links (Luke Wroblewski) -- notes on a talk by Jared Spool. On the Walgreen’s site, 21% of people go to photos, 16% go to search, 11% go to prescriptions, 6% go to pharmacy link, 5% go to find stores. Total traffic is 59% for these five links. The total amount of page used for these 5 links is ~4% of page space. The most important stuff on the page occupies less than 1/20th of the page. This violates Fitts's Law. Makes me think of the motor and sensory homunculi.
  4. VC Memes -- the success kid is my favourite, I think.

December 05 2011

Why cloud services are a tempting target for attackers

The largest cloud providers today are Google, Microsoft, and Amazon, each offering multiple services and platforms to their respective customers. For example, Microsoft Azure, Google Apps, and Amazon EC2 are all hosting and development platforms. Google Docs, Acrobat.com, and Microsoft Office 365 all provide basic word processing, spreadsheets and other applications for individuals to use via the web instead of on their individual desktops. Then, of course, there are social networks, online gaming, and video and music sharing services, all of which rely on a hosted environment that can accommodate millions of users interacting from anywhere on earth, yet all connected somewhere in cyberspace. While the benefits are many, both to individuals and to corporations, there are three distinct disadvantages from an individual and national security perspective:

  • The cloud provider is not responsible for securing its customers' data.
  • Attacking a cloud-based service provides an economy of scale to the attacker.
  • Mining the cloud provides a treasure trove of information for domestic and foreign intelligence services.

No security provisions

A Ponemon Institute study (pdf) on cloud security revealed that 69% of cloud users surveyed said that the providers are responsible, and the providers seemed to agree. However, when you review the terms of service for the world's largest cloud providers, responsibility for a breach of customer data lies exclusively with the customer.

For example:

  • From Amazon: "Amazon has no liability for .... (D) any unauthorized access to, alteration of, or the deletion, destruction, damage, loss or failure to store any of your content or other data."
  • From Google: "Customer will indemnify, defend, and hold harmless Google from and against all liabilities, damages, and costs (including settlement costs and reasonable attorneys' fees) arising out of a third-party claim: (i) regarding Customer Data..."
  • From Microsoft: "Microsoft will not be liable for any loss that you may incur as a result of someone else using your password or account, either with or without your knowledge. However, you could be held liable for losses incurred by Microsoft or another party due to someone else using your account or password."

Not only do none of the three top cloud providers assume any responsibility for data security, but Microsoft goes a step further and places a legal burden upon its customers that it refuses to accept for itself.

An economy of scale

NASDAQ's Directors Desk is an electronic boardroom cloud service that stores critical information for more than 10,000 board members of several hundred Fortune 500 corporations. In February 2011, an unnamed federal official revealed to the Wall Street Journal's Devlin Barrett that the system had been breached for more than a year. It's unknown how much information was compromised, as well as how or when it will be used.

From an adversary's perspective, this type of breach offers an economy of scale that has never been seen before. In the past, several hundred Fortune 500 companies would have to be attacked, one company at a time, which costs the adversary time and money — not to mention risk. Now, one attack can yield the same amount of valuable data with a significant reduction in resources expended as well as risk of exposure.

An intelligence goldmine

China's national champion firm Huawei is moving from selling telecommunications network equipment toward developing the Infrastructure-as-a-Service (IaaS) software needed to provide a highly scalable public cloud like Microsoft's Azure or Amazon's EC2. If it sells IaaS with the same strategy it uses in selling routers and switches, Amazon, Google, and Microsoft can expect to begin losing a lot of enterprise business to Huawei, which will cut pricing by 15% or more against its nearest competitor. Cloud customers can expect their data to reside in giant state-of-the-art server farms located in Beijing's "Cloud Valley", a dedicated 7,800-square-meter industrial area that is home to 10 companies focusing on various aspects of cloud technology, such as distributed data centers, cloud servers, thin terminals, cloud storage, cloud operating systems, intelligent knowledge bases, data mining systems, and cloud system integration.

Cloud computing has been designated a strategic technology by the People's Republic of China's State Council in its 12th Five-Year Plan and placed under the control of the Ministry of Industry and Information Technology (MIIT). MIIT will be funding research and development for SaaS (Software as a Service), PaaS (Platform as a Service), and IaaS (Infrastructure as a Service) models as well as virtualization technology, distributed storage technology, massive data management technology, and other unidentified core technologies. Orient Securities LLC has predicted that by 2015, cloud computing in China will be a 1 trillion yuan market.

According to the U.S.-China Council website, MIIT was created in 2008 and absorbed some functions from other departments, including the Commission of Science, Technology, and Industry for National Defense (COSTIND):

From COSTIND, MIIT will inherit functions relating to the management of the defense industry, with a scope that covers the national defense department, the China National Space Administration, and certain administrative responsibilities of other major defense-oriented state companies, such as the China North Industries Co. and China State Shipbuilding Corp. MIIT will also control weapons research and production in both military establishments and dual-role corporations as well as R&D and production relating to "defense conversion" — the conversion of military facilities to non-military use.

Clearly, the PRC has made a serious commitment to cloud computing for the long term. This doesn't portend well for today's private cloud service providers like NetApp or public cloud providers like Amazon, Google, and Microsoft — especially if buying decisions are based on price.

What to consider

The move to the cloud is both inevitable and filled with risk for high-value government employees, corporate executives, and companies engaged in key market sectors like energy, banking, defense, nanotechnology, advanced aircraft design, and mobile wireless communications, among others.

To make matters more complicated, cloud providers may move data to different server farms around the world rather than keep it in the same country as the corporation or individual that owns it. That could potentially put the customer's data at risk of being legally compromised under foreign laws that would apply to the host company doing business there. For example, Microsoft UK's managing director Gordon Frazer was recently asked at the Office 365 launch, "Can Microsoft guarantee that EU-stored data, held in EU-based datacenters, will not leave the European Economic Area under any circumstances — even under a request by the Patriot Act?" Frazer replied, "Microsoft cannot provide those guarantees. Neither can any other company."

The best advice for individuals and companies at this time is to insist that cloud providers build a measurably secure infrastructure while providing legal guarantees and without the use of foreign data farms. Until that occurs, and it's highly unlikely to happen without strong consumer pressure, there are significant and escalating risks in hosting valuable data with any cloud provider.

Inside Cyber Warfare, 2nd Edition — Jeffrey Carr's second edition of "Inside Cyber Warfare" goes beyond the headlines of attention-grabbing DDoS attacks and takes a deep look inside recent cyber-conflicts, including the use of Stuxnet.

Associated photo on home and category pages: Dark Cloud, Blue Sky 2 by shouldbecleaning, on Flickr.
