October 25 2012

Four short links: 25 October 2012

  1. Big Data: the Big Picture (Vimeo) — Jim Stogdill’s excellent talk: Big Data may be presented as part of the Gartner Hype Cycle, but it’s an epoch of the Information Age that will have significant effects on the structure of corporations and the economy.
  2. Impala (github) — Cloudera’s open source (Apache-licensed) implementation of Google’s F1 (PDF), for real-time queries across clusters. Impala differs from Hive and Pig in that it uses its own daemons, spread across the cluster, to handle queries. Furthermore, Impala does not rely on MapReduce, which allows it to return results in real time. (via Wired)
  3. druid (github) — an open source (GPLv2) distributed, column-oriented analytical datastore. It was originally created to resolve the query latency issues encountered when trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing.
  4. Supersonic (Google Code) — an ultra-fast, column-oriented query engine library written in C++. It provides a set of data transformation primitives that make heavy use of cache-aware algorithms, SIMD instructions, and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper-pipelined CPUs. It is designed to work in a single process. Apache-licensed.

June 07 2012

Strata Week: Data prospecting with Kaggle

Here are a few of the data stories that caught my attention this week:

Prospecting for data

The data science competition site Kaggle is extending its features with a new service called Prospect. Prospect allows companies to submit a data sample to the site without having a pre-ordained plan for a contest. In turn, the data scientists using Kaggle can suggest ways in which machine learning could best uncover new insights and answer less obvious questions — and what sorts of data competitions could be based on the data.

As GigaOm's Derrick Harris describes it: "It's part of a natural evolution of Kaggle from a plucky startup to an IT company with legs, but it's actually more like a prequel to Kaggle's flagship predictive modeling competitions than it is a sequel." It's certainly a good way for companies to get their feet wet with predictive modeling.

Practice Fusion, a web-based electronic health records system for physicians, has launched the inaugural Kaggle Prospect challenge.

HP's big data plans

Last year, Hewlett-Packard made a move away from the personal computing business and toward enterprise software and information management, a move marked in part by the $10 billion it paid to acquire Autonomy. Now we know a bit more about HP's big data plans for its Information Optimization Portfolio, which has been built around Autonomy's Intelligent Data Operating Layer (IDOL).

ReadWriteWeb's Scott M. Fulton takes a closer look at HP's big data plans.

The latest from Cloudera

Cloudera released a number of new products this week: Cloudera Manager 3.7.6; Hue 2.0.1; and of course CDH 4.0, its Hadoop distribution.

CDH 4.0 includes:

"... high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations."

Social data platform DataSift also announced this week that it was powering its Hadoop clusters with CDH to perform the "Big Data heavy lifting to help deliver DataSift's Historics, a cloud-computing platform that enables entrepreneurs and enterprises to extract business insights from historical public Tweets."

Have data news to share?

Feel free to email us.

OSCON 2012 Data Track — Today's system architectures embrace many flavors of data: relational, NoSQL, big data and streaming. Learn more in the Data track at OSCON 2012, being held July 16-20 in Portland, Oregon.

Save 20% on registration with the code RADAR


January 12 2012

Strata Week: A .data TLD?

Here are some of the data stories that caught my attention this week.

Should there be a .data TLD?

ICANN is ready to open top-level domains (TLDs) to the highest bidder, and as such, Wolfram Alpha's Stephen Wolfram posits that it's time for a .data TLD. In a blog post on the Wolfram site, he argues that the new top-level domains provide an opportunity to create a .data domain that could form a "parallel construct to the ordinary web, but oriented toward structured data intended for computational use. The notion is that alongside a website like wolfram.com, there'd be wolfram.data."

Wolfram continues:

If a human went to wolfram.data, there'd be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it'd find just what it needs to ingest the data, and begin computing with it.

So how would a .data TLD change the way humans and computers interact with data? Or would it change anything? If you've got ideas of how .data could be put to use, please share them in the comments.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Cloudera addresses what Apache Hadoop 1.0 means to its customers

Last week, the Apache Software Foundation (ASF) announced that Hadoop had reached version 1.0. This week, Cloudera took to its blog to explain what that milestone means to its customers.

The post, in part, explains how Hadoop has branched from its trunk, noting that all of this has caused some confusion for Cloudera customers:

More than a year after Apache Hadoop 0.20 branched, significant feature development continued on just that branch and not on trunk. Two major features were added to branches off 0.20.2. One feature was authentication, enabling strong security for core Hadoop. The other major feature was append, enabling users to run Apache HBase without risk of data loss. The security branch was later released as 0.20.203. These branches and their subsequent release have been the largest source of confusion for users because since that time, releases off of the 0.20 branches had features that releases off of trunk did not have and vice versa.

Cloudera explains to its customers that it's offered the equivalent for "approximately a year now" and compares the Apache Hadoop efforts to its own offerings. The post is an interesting insight into not just how the ASF operates, but how companies that offer services around those projects have to iterate and adapt.

Disqus says that pseudonymous commenters are best

Debates over blog comments have resurfaced recently, with a back and forth about whether or not they're good, bad, evil, or irrelevant. Adding some fuel to the fire (or data to the discussion, at least) comes Disqus with its own research based on its commenting service.

According to the Disqus research, commenters using pseudonyms actually are "the most valuable contributors to communities," as their comments are of both the highest quantity and quality. Those findings run counter to the idea that people who comment online without using their real names lessen rather than enhance the quality of conversations.

Disqus' data indicates that pseudonymity might engender a more engaged and more engaging community. That notion stands in contrast to arguments that anonymity leads to more trollish and unruly behavior.

Got data news?

Feel free to email me.


October 13 2011

Strata Week: Simplifying MapReduce through Java

Here are a few of the data stories that caught my attention this week:

Crunch looks to make MapReduce easier

Despite the growing popularity of MapReduce and other data technologies, there's still a steep learning curve associated with these tools. Some have even wondered if they're worth introducing to programming students.

All of this makes the introduction of Crunch particularly good news. Crunch is a new Java library from Cloudera that's aimed at simplifying the writing, testing, and running of MapReduce pipelines. In other words, developers won't need to write a lot of custom code or libraries, which, as Cloudera data scientist Josh Wills points out, "is a serious drain on developer productivity."

He adds that:

Crunch shares a core philosophical belief with Google's FlumeJava: novelty is the enemy of adoption. For developers, learning a Java library requires much less up-front investment than learning a new programming language. Crunch provides full access to the power of Java for writing functions, managing pipeline execution, and dynamically constructing new pipelines, obviating the need to switch back and forth between a data flow language and a real programming language.

The Crunch library has been released under the Apache license, and the code can be downloaded here.
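For a sense of what that looks like in practice, here's a minimal word-count pipeline sketched against Crunch's API. Consider it illustrative rather than definitive: the package names below follow the project's later Apache incarnation (org.apache.crunch), whereas the initial Cloudera release shipped under com.cloudera.crunch, and details may vary between versions.

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        // A Pipeline manages the construction and execution of MapReduce jobs.
        Pipeline pipeline = new MRPipeline(WordCount.class);

        // Read a text file into a PCollection, Crunch's distributed-collection type.
        PCollection<String> lines = pipeline.readTextFile(args[0]);

        // Split each line into words with an in-line DoFn; no Mapper class needed.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

        // Count occurrences of each word and write the results out as text.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, args[1]);
        pipeline.done();
      }
    }

The point to notice is that the tokenizing logic is an ordinary in-line Java function rather than a hand-written Mapper and Reducer pair, which is exactly the up-front investment Crunch is trying to eliminate.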

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Querying the web with Datafiniti

Datafiniti launched this week into public beta, calling itself the "first search engine for data." That might just sound like a nifty startup slogan, but when you look at what Datafiniti queries and how it works, the engine begins to look profoundly ambitious and important.

Datafiniti enables its users to enter a search query (or make an API call) against the web. Or at least, that's the goal. As it stands, Datafiniti lets users make calls about location, products, news, real estate, and social identity. But that's a substantial number of datasets, built from information that's publicly available on the web.

Although Datafiniti requires you to enter SQL parameters, it tries to make the process fairly easy, with a guide that pops up beneath the search box to help you phrase things properly. That interface is just one indication that Datafiniti is making a move to help democratize big data search.

The company grew out of a previous startup named 80Legs. As Shion Deysarker, founder of Datafiniti, told me, it became clear that the web-crawling services provided by 80Legs were really just being used to answer specific queries: What's the average listing price for a home in Houston? How many times has a brand name been mentioned on Twitter or Facebook over the last few months? And so on.

Deysarker frames Datafiniti in terms of data access, arguing that until now a few providers have controlled the data. The startup wants to help developers and companies overcome both access and expense issues associated with gathering, processing, curating and accessing datasets. It plans to offer both subscription-based and unit-based pricing.

Keep tabs on the Large Hadron Collider from your smartphone

New apps don't often make it into my data news roundup, but it's hard to ignore this one: LHSee is an Android app from the University of Oxford that delivers data directly from the ATLAS experiment at CERN. The app lets you see data from collisions at the Large Hadron Collider.

The ATLAS experiment describes itself as an effort to learn about "the basic forces that have shaped our Universe since the beginning of time and that will determine its fate. Among the possible unknowns are the origin of mass, extra dimensions of space, unification of fundamental forces, and evidence for dark matter candidates in the Universe."

The LHSee app provides detailed information into how CERN and the Large Hadron Collider work. It also offers a "Hunt the Higgs Boson" game as well as opportunities to watch 3-D collisions streamed live from CERN. The app is available for free through the Android Market.

Got data news?

Feel free to email me.


October 06 2011

Strata Week: Oracle's big data play

Here are the data stories that caught my attention this week:

Oracle's big data week

Eyes have been on Oracle this week as it holds its OpenWorld event in San Francisco. The company has made a number of major announcements, including unveiling its strategy for handling big data. This includes its Big Data Appliance, which will use a new Oracle NoSQL database as well as an open-source distribution of Hadoop and R.

Edd Dumbill examined the Oracle news, arguing that "it couldn't be a plainer validation of what's important in big data right now or where the battle for technology dominance lies." He notes that whether one is an Oracle customer or not, the company's announcement "moves the big data world forward," pointing out that there is now a de facto agreement that Hadoop and R are core pieces of infrastructure.

GigaOm's Derrick Harris reached out to some of the startups who also offer these core pieces, including Norman Nie, the CEO of Revolution Analytics, and Mike Olson, CEO of Cloudera. Not surprisingly perhaps, the startups are "keeping brave faces, but the consensus is that Oracle's forays into their respective spaces just validate the work they've been doing, and they welcome the competition."

Oracle's entry as a big data player also brings competition to others in the space, such as IBM and EMC, as all the major enterprise providers wrestle to claim supremacy over whose capabilities are the biggest and fastest. And the claim that "we're faster" was repeated over and over by Oracle CEO Larry Ellison as he made his pitch to the crowd at OpenWorld.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR

Who wrote Hadoop?

As ReadWriteWeb's Joe Brockmeier notes, ascertaining the contributions to open-source projects is sometimes easier said than done. Who gets credit — companies or individuals — can be both unclear and contentious. Such is the case with a recent back-and-forth between Cloudera's Mike Olson and Hortonworks' Owen O'Malley over who's responsible for the contributions to Hadoop.

O'Malley wrote a blog post titled "The Yahoo! Effect," which, as the name suggests, describes Yahoo's legacy and its continuing contributions to the Hadoop core. O'Malley argues that "from its inception until this past June, Yahoo! contributed more than 84% of the lines of code still in Apache Hadoop trunk." (Editor's note: The link to "trunk" was inserted for clarity.) O'Malley adds that so far this year, the biggest contributors to Hadoop are Yahoo! and Hortonworks.

Lines of code contributed to Apache Hadoop trunk (from Owen O'Malley's post, "The Yahoo! Effect")

That may not be a surprising argument to hear from Hortonworks, the company that was spun out of Yahoo! earlier this year to focus on the commercialization and development of Hadoop.

But Cloudera's Mike Olson challenges that argument — again, not a surprise, as Cloudera has long positioned itself as a major contributor to Hadoop, a leader in the space, and of course now the employer of former Yahoo! engineer Doug Cutting, the originator of the technology. Olson takes issue with O'Malley's calculations and, in a blog post of his own, contends that those calculations don't accurately account for the companies that contributors now work for:

Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today — at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop — the story is much more interesting.

Olson also argues that it isn't simply a matter of who's contributing to the Apache Hadoop core, but rather who is working on:

... the broader ecosystem of projects. That ecosystem has exploded in recent years, and most of the innovation around Hadoop is now happening in new projects. That's not surprising — as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use.

Got data news?

Feel free to email me.


December 22 2010

Strata Gems: Whirr makes Hadoop and Cassandra a snap

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: DIY personal sensing and automation.

The cloud makes clusters easy, but for rapid prototyping purposes, bringing up clusters still involves quite a bit of effort. It's getting easier by the day, though, as a variety of tools emerge to simplify the commissioning and management of cloud resources.

Whirr is one such tool: a simple utility and a Java API for running cloud services. It presents a uniform interface to cloud providers, so you don't have to know each service's API in order to negotiate their peculiarities. Furthermore, Whirr abstracts away the repetitive bits of setting up services such as Hadoop or Cassandra.

Whirr's command-line tool can be used to bring up clusters in the cloud. Bringing up a Hadoop cluster is as easy as this one-liner:

whirr launch-cluster \
    --service-name=hadoop \
    --cluster-name=myhadoopcluster \
    --instance-templates='1 jt+nn,1 dn+tt' \
    --provider=ec2 \
    --identity=$AWS_ACCESS_KEY_ID \
    --credential=$AWS_SECRET_ACCESS_KEY \
    --private-key-file=~/.ssh/id_rsa

When the cluster has launched, a script (~/.whirr/myhadoopcluster/hadoop-proxy.sh) is created, which will set up a secure tunnel to the remote cluster, letting the user execute regular Hadoop commands from their own machine.

Whirr's service-name and instance-templates parameters are the key to running different services. The instance templates are a concise notation for specifying the contents of a cluster, and are defined on a per-service basis. The Hadoop example above, 1 jt+nn,1 dn+tt, specifies one node acting as both jobtracker (jt) and namenode (nn), and one node acting as both datanode (dn) and tasktracker (tt).

Services currently supported by Whirr include:

  • Hadoop (both Apache and Cloudera Distribution for Hadoop)
  • Cassandra
  • ZooKeeper

Adding new services involves providing initialization scripts, and implementing a small amount of Java code. Whirr is open source, currently hosted as an Apache Incubator project, and development is being led by Cloudera engineers.
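Because Whirr exposes that Java API as well as the command-line tool, the same cluster can be launched programmatically. The sketch below is modeled on the pattern in Whirr's getting-started documentation, with the caveat that class names such as ClusterController come from slightly later releases (early versions used a Service/ServiceFactory pair instead), so treat it as an approximation and check the docs for your version. The "hadoop.properties" recipe file is a hypothetical stand-in for the same settings passed as CLI flags above.

    import org.apache.commons.configuration.CompositeConfiguration;
    import org.apache.commons.configuration.PropertiesConfiguration;
    import org.apache.whirr.Cluster;
    import org.apache.whirr.ClusterController;
    import org.apache.whirr.ClusterSpec;

    public class LaunchHadoopCluster {
      public static void main(String[] args) throws Exception {
        // Hypothetical properties file holding the same settings as the CLI
        // flags above: service name, cluster name, instance templates,
        // provider, identity, and credential.
        CompositeConfiguration config = new CompositeConfiguration();
        config.addConfiguration(new PropertiesConfiguration("hadoop.properties"));
        ClusterSpec spec = new ClusterSpec(config);

        // Launch the cluster and block until the nodes are provisioned.
        ClusterController controller = new ClusterController();
        Cluster cluster = controller.launchCluster(spec);

        System.out.println("Instances: " + cluster.getInstances());

        // Tear everything down when finished to avoid paying for idle nodes.
        controller.destroyCluster(spec);
      }
    }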

  • For in-person instruction on getting started with Hadoop or Cassandra, check out the Strata 2011 Tutorials.

July 14 2010

Four short links: 14 July 2010

  1. Flume -- Cloudera open source project to solve the problem of how to get data into cloud apps, from collection to processing to storage. Flume is a distributed service that makes it very easy to collect and aggregate your data into a persistent store such as HDFS. Flume can read data from almost any source - log files, Syslog packets, the standard output of any Unix process - and can deliver it to a batch processing system like Hadoop or a real-time data store like HBase. All this can be configured dynamically from a single, central location - no more tedious configuration file editing and process restarting. Flume will collect the data from wherever existing applications are storing it, and whisk it away for further analysis and processing. (via mikeolson on Twitter)
  2. How Microbes Defend and Define Us (NYTimes) -- there's been a lot of talk about the microbiome at Sci Foo in the last few years, now it's bubbling out into the world. Turns out that "bacteria bad, megafauna good" is as simplistic and inaccurate as "Muslim bad, Christian good". Fancy that. (via Jim Stogdill)
  3. Startup Model Patently Flawed (Nature) -- "There is a lot of stuff that academics are realizing isn't patentable but they can commercialize for themselves by starting a company," says Scott Shane, an economist at Case Western Reserve University in Cleveland, Ohio, and a co-author of the study. Because surveys of entrepreneurial activity — including government assessments — typically focus on patent activity, they may be significantly underestimating academics' efforts, he notes. (via pkedrosky on Twitter)
  4. Open Data on Russian Government Spending (OKFN) -- a group outside government is adding analytics to the data that departments are required to release.
