
December 25 2010

December 24 2010

Strata Gems: CouchDB in the browser

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Whirr makes Hadoop and Cassandra a snap.

Strata 2011 An earlier Strata Gem explained how to try MongoDB without installing any software. You can get a similar experience for CouchDB thanks to JSCouch, which also provides a great introduction to writing MapReduce functions.

CouchDB was the first document-store database to become popular under the "NoSQL" label, and offers some unique and interesting features.

  • RESTful JSON API — rather than requiring programming language-specific bindings, CouchDB can be accessed over HTTP, with JSON used as the data format
  • MapReduce — CouchDB provides query and view functionality via MapReduce operations, which are specified in JavaScript
  • Replication — with incremental replication and bi-directional conflict resolution, CouchDB is suitable for connecting replicated databases with intermittent connectivity. The Ubuntu One cloud service uses this feature

To demonstrate the basics of CouchDB, Mu Dynamics and Jan Lehnardt created a partial implementation of CouchDB that runs in the browser, JSCouch.

The sample database in the demonstration contains information about photos and their metadata. A few such documents are shown below.

{"name":"fish.jpg","created_at":"Fri, 24 Dec 2010 06:35:03 GMT",

{"name":"trees.jpg","created_at":"Fri, 24 Dec 2010 06:33:45 GMT",

If you click on the "map/reduce" tab, you will find a drop-down for selecting pre-written JavaScript functions for performing certain operations. Let's take a look at a query to compute the sum of all the image sizes. Here's the map function:

function(doc) {

And the reduce function:

function(keys, values, rereduce) {
	return sum(values);
}
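
To make the shape of this computation concrete, here is a minimal Python sketch of the same map-then-reduce flow, run outside CouchDB. It is not how JSCouch or CouchDB implement views. The `size` field, its values, and the helper names `map_doc`, `reduce_values` and `run_view` are assumptions made for illustration, since the sample documents above are shown only in part.

```python
# Hypothetical photo documents: the "size" field and its values are
# assumed for illustration and are not taken from the JSCouch sample data.
docs = [
    {"name": "fish.jpg", "size": 92360},
    {"name": "trees.jpg", "size": 140228},
]


def map_doc(doc):
    """Map step: emit one (key, value) pair per document."""
    yield ("size", doc["size"])


def reduce_values(keys, values, rereduce=False):
    """Reduce step: add up the emitted values, like sum(values) above."""
    return sum(values)


def run_view(documents):
    """Apply the map step to every document, then reduce the results."""
    pairs = [pair for doc in documents for pair in map_doc(doc)]
    keys = [key for key, _ in pairs]
    values = [value for _, value in pairs]
    return reduce_values(keys, values)


total_size = run_view(docs)
print(total_size)
```

The map step decides what each document contributes, and the reduce step folds those contributions into a single answer. Keeping the two apart is what lets a database run the map step over documents independently.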

With a variety of NoSQL databases to choose from, and developers the unwitting kingmakers, easy demonstration interfaces such as JSCouch can help smooth the path to technology adoption.


December 22 2010

Strata Gems: Whirr makes Hadoop and Cassandra a snap

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: DIY personal sensing and automation.

Strata 2011 The cloud makes clusters easy, but for rapid prototyping purposes, bringing up clusters still involves quite a bit of effort. It's getting easier by the day though, as a variety of tools emerge to simplify the commissioning and management of cloud resources.

Whirr is one such tool: a simple utility and a Java API for running cloud services. It presents a uniform interface to cloud providers, so you don't have to know each service's API in order to negotiate their peculiarities. Furthermore, Whirr abstracts away the repetitive bits of setting up services such as Hadoop or Cassandra.

Whirr's command-line tool can be used to bring up clusters in the cloud. Bringing up a Hadoop cluster is as easy as this one-liner:

whirr launch-cluster \
    --service-name=hadoop \
    --cluster-name=myhadoopcluster \
    --instance-templates='1 jt+nn,1 dn+tt' \
    --provider=ec2 \
    --identity=$AWS_ACCESS_KEY_ID \
    --credential=$AWS_SECRET_ACCESS_KEY

When the cluster has launched, a script (under ~/.whirr/myhadoopcluster/) is created, which will set up a secure tunnel to the remote cluster, letting the user execute regular Hadoop commands from their own machine.

Whirr's service-name and instance-templates parameters are the key to running different services. The instance templates are a concise notation for specifying the contents of a cluster, and are defined on a per-service basis. The Hadoop example above, 1 jt+nn,1 dn+tt, specifies one node with the roles of "job tracker" and "name node", and one node with the roles of "data node" and "task tracker".
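
The notation is simple enough to take apart by hand. The following Python sketch is not Whirr's own parser: `parse_instance_templates` and `ROLE_NAMES` are hypothetical names, and the code only illustrates how a string such as the one above breaks down into node counts and roles.

```python
def parse_instance_templates(spec):
    """Split a spec such as '1 jt+nn,1 dn+tt' into (count, roles) pairs."""
    templates = []
    for group in spec.split(","):
        # Each group is "<number of nodes> <role>+<role>+..."
        count, roles = group.strip().split(None, 1)
        templates.append((int(count), roles.split("+")))
    return templates


# Expansions of the Hadoop role abbreviations used in the example above.
ROLE_NAMES = {
    "jt": "job tracker",
    "nn": "name node",
    "dn": "data node",
    "tt": "task tracker",
}

cluster = parse_instance_templates("1 jt+nn,1 dn+tt")
for count, roles in cluster:
    print(count, "node(s):", ", ".join(ROLE_NAMES[role] for role in roles))
```

Scaling the worker side of the cluster is then a matter of changing one number, for example '1 jt+nn,5 dn+tt'.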

Services currently supported by Whirr include:

  • Hadoop (both Apache and Cloudera Distribution for Hadoop)
  • Cassandra
  • Zookeeper

Adding new services involves providing initialization scripts, and implementing a small amount of Java code. Whirr is open source, currently hosted as an Apache Incubator project, and development is being led by Cloudera engineers.

  • For in-person instruction on getting started with Hadoop or Cassandra, check out the Strata 2011 Tutorials.

December 21 2010

Strata Gems: DIY personal sensing and automation

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Turn MySQL into blazing fast NoSQL.

Strata 2011 Tomorrow's augmented reality is being built today on mobile devices. The Tasker application for Android is a fun platform for prototyping personal automation and sensing applications. Described modestly as an application which "performs tasks based on contexts," it gives non-programmers access to the sensing and system features of the phone.

Following a simple rule metaphor of taking action upon matched conditions, Tasker can respond to states such as:

  • Presence of a certain wifi network or Bluetooth device
  • GPS location, time or date
  • Phone event, such as incoming text messages or calls

The resulting actions include altering the phone's attributes, such as adjusting volume or brightness and playing music, and, more interestingly, taking photos or performing network actions such as HTTP GET or POST requests and sending email.


Screenshot, taken from the Tasker home page

All of these conditional rules can be set up from the application's interface on the phone itself.

While Tasker's initial appeal is in having your phone respond to its environment, with its network capabilities Tasker provides the ability to create sensing applications that can connect simple web applications with your personal attributes, responding to your location or proximity to others.

An example application might be tracking and recording your mileage expenses for work by having the phone log your GPS trails whenever it senses it's connected to the car's Bluetooth system.
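
Tasker rules are built in the phone's interface rather than written as code, but the underlying condition-and-action pattern can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `Rule` class, the context keys, and the "car-stereo" device name are assumptions, not part of Tasker.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]


@dataclass
class Rule:
    """A named pairing of a condition with the action it triggers."""
    name: str
    condition: Callable[[Context], bool]
    action: Callable[[Context], None]


gps_log: List[tuple] = []


def connected_to_car(ctx):
    # "car-stereo" is a made-up Bluetooth device name.
    return ctx.get("bluetooth_device") == "car-stereo"


def log_position(ctx):
    gps_log.append((ctx["time"], ctx["lat"], ctx["lon"]))


rules = [Rule("mileage log", connected_to_car, log_position)]


def evaluate(rules, ctx):
    """Run the action of every rule whose condition matches the context."""
    for rule in rules:
        if rule.condition(ctx):
            rule.action(ctx)


# Two simulated sensor readings: in the car, then out of it.
evaluate(rules, {"bluetooth_device": "car-stereo", "time": "09:00", "lat": 51.5, "lon": -0.12})
evaluate(rules, {"bluetooth_device": None, "time": "09:05", "lat": 51.6, "lon": -0.10})
print(gps_log)
```

Only the first reading is logged, because the second arrives when the phone is no longer connected to the car.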

A plugin interface offers developers the ability to add modules that detect more sophisticated conditions or take custom actions.

December 20 2010

Strata Gems: Turn MySQL into blazing fast NoSQL

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: What your inbox knows.

Strata 2011 The trend for NoSQL stores such as memcache for fast key-value storage should give us pause for thought: what have regular database vendors been doing all this time? An important new project, HandlerSocket, seeks to leverage MySQL's raw speed for key-value storage.

NoSQL databases offer fast key-value storage for use in backing web applications, but the years of work put into regular relational databases have hardly ignored performance. The main performance hit with regular databases is in interpreting queries.

HandlerSocket is a MySQL server plugin that interfaces directly with the InnoDB storage engine. Yoshinori Matsunobu, one of HandlerSocket's creators at Japanese internet and gaming company DeNA, reports performance of over 750,000 queries per second on commodity server hardware, compared with 420,000 using memcache and 105,000 using regular SQL access to MySQL. Furthermore, since the underlying InnoDB storage is used, HandlerSocket offers a NoSQL-type interface that doesn't have to trade away ACID compliance.

With the additional benefits of being able to use the mature MySQL tool ecosystem for monitoring, replication and administration, HandlerSocket presents a compelling case for using a single database system. As the HandlerSocket protocol can be used on the same database and tables used for regular SQL access, the problems of inconsistency and replication created by multiple tiers of databases can be mitigated.

HandlerSocket has now been integrated into Percona's XtraDB, an enhanced version of the InnoDB storage engine for MySQL. You can also compile and install HandlerSocket yourself alongside MySQL.

December 19 2010

Strata Gems: What your inbox knows

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: A sense of self.

Strata 2011 One of our themes at Strata is data in the dirt: mining the data exhaust from our lives to find meaning and value. In every organization, the trails left by email offer one of those repositories of hidden meaning.

Trampoline Systems' SONAR CRM takes an innovative approach to customer relationship management by mining the social networks created within and between companies. Through its integration with email logs, existing CRM systems and social networks, SONAR expands the scope of traditional CRM to give a fuller view of a company's relationships.

There is often more truth to be found in mining implicit data trails than by relying on explicitly logged information. Trampoline estimate that only 25% of actual contacts are recorded in CRM systems. By analyzing email flows, their system lets organizations understand who is talking to whom.

At O'Reilly, we specialize in connecting people across and within technical community "tribes". We've been experimenting with SONAR for some months. In my experience, it certainly contains the same knowledge about our contacts that I would otherwise have to obtain by asking around.

Email contact visualization
A SONAR visualization of some of O'Reilly's internal relationships

The more information you feed a system such as SONAR, the better results you can get. Not all prodigious communicators are at the same level of influence: customer service personnel talk to as many people as business development, for instance, but the relationships they develop are of a more fleeting nature.

  • For a personal view on email analytics, Xobni offer an Outlook plugin that augments your email with information from social networks and analytical capabilities.

December 18 2010

Strata Gems: A sense of self

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Clojure is a language for data.

Strata 2011 The data revolution isn't just about big data. The smallest data can be the most important to us, and nothing more so than tracking our own activities and fitness. While standalone pedometers and other tracking devices are nothing new, today's devices are network-connected and social.

Android phones and iPhones are fitted with accelerometers, and work well as pedometers and activity monitors. RunKeeper is one application that takes advantage of this, along with GPS tracking, to log runs and rides. FitFu takes things a step further, mixing monitoring with fitness instruction and social interaction.

Phones, however ubiquitous, are still awkward to use for full-time fitness tracking. With a much smaller form factor, the Fitbit is a sensor you clip to your clothes. Throughout the day it records your movement, and at night it can sense whether you wake. With a long battery life and wireless syncing, it's the least intrusive device currently available for measuring your activity.

Fitbit data
An extract from the author's Fitbit data

Fitbit are working on delivering an API for access to your own data, but in the meantime there's an unofficial API available.

Withings produce a Wi-Fi-enabled scale that records weight and body mass index, uploading the data to their web site and making it available for tracking on the web or a smartphone.

The next step for these services is to move towards an API and interoperation. Right now, Fitbit requires you to manually enter your own weight, and diet plans such as WeightWatchers aren't able to import your weight or activity from the other services.

For much more on recording and analyzing your own data, check out Quantified Self.

December 17 2010

Strata Gems: Clojure is a language for data

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Who needs disks anyway?

Strata 2011

The Clojure programming language has been rising in popularity in recent months. A Lisp-like language, it brings functional programming to the Java virtual machine (JVM) platform. One of Clojure's distinctive features is that data is expressed in the same way as code, making it ideal for writing powerful and concise domain-specific languages.

Clojure's inventor, Rich Hickey, has ensured that its integration with the world of Java is as painless as possible. And for those who fear Lisp-like languages, Clojure also bends a little to be friendlier. The result is that Clojure joins two worlds previously estranged: powerful functional programming with widespread and mature APIs.

In the world of big data, this means that Clojure can be used with Cascading, an API for programmatically creating Hadoop processing pipelines. Nathan Marz of Backtype used Clojure's power to create an entire query language for Hadoop, Cascalog.

Particular features of Clojure make it suitable for parallel data processing: immutable data types and built-in constructs for concurrency.

Cloud and big data go hand-in-hand. For working with the cloud, the jclouds project provides Java with a unified API to multiple cloud vendors, including Azure, Amazon and Rackspace. The jclouds API is often used with Clojure, exemplified by the Pallet project. Implemented in Clojure, Pallet automates the provisioning and control of cloud machine instances.

If you're looking to learn a new programming language and expand the way you think about coding, give Clojure a whirl. The excellent Java support means you won't be left isolated.

December 16 2010

Strata Gems: Who needs disks anyway?

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Kinect democratizes augmented reality.

Strata 2011 Today's databases are designed for the spinning platter of the hard disk. They take into account that the slowest part of reading data is seeking: physically getting the read head to the part of the disk it needs to be in. But the emergence of cost effective solid state drives (SSD) is changing all those assumptions.

Over the course of 2010, systems designers have been realizing the benefits of using SSDs in data centers, with major IT vendors and companies adopting them. Drivers for SSD adoption include lower power consumption and greater physical robustness. The robustness is a key factor when creating container-based modular data centers.

That still leaves the problem of software optimized for spinning disks. Enter RethinkDB, a project to create a storage engine for the MySQL database that is optimized for SSDs.

As well as taking advantage of the extra speed SSDs can offer, RethinkDB also majors on consistency, achieved by using append-only writes. Additionally, they are writing their storage engine with modern web application access patterns in mind: many concurrent reads and few writes.
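
The idea behind append-only writes can be shown with a toy key-value store in Python. This is a generic illustration of the technique, not RethinkDB's storage engine: the `AppendOnlyStore` class, its JSON record format and its in-memory index are all assumptions made for the sketch.

```python
import io
import json


class AppendOnlyStore:
    """Toy key-value store that only ever appends records to its log."""

    def __init__(self, log):
        self.log = log    # any seekable binary file-like object
        self.index = {}   # key -> byte offset of the latest record

    def put(self, key, value):
        # New data always goes at the end: existing bytes are never rewritten.
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        record = json.dumps({"key": key, "value": value}) + "\n"
        self.log.write(record.encode("utf-8"))
        self.index[key] = offset

    def get(self, key):
        # Follow the index to the most recent record for this key.
        self.log.seek(self.index[key])
        return json.loads(self.log.readline().decode("utf-8"))["value"]


store = AppendOnlyStore(io.BytesIO())
store.put("user:1", {"name": "Ada"})
store.put("user:1", {"name": "Ada Lovelace"})  # supersedes, never overwrites
print(store.get("user:1"))
```

Because an update leaves the earlier record untouched, an interrupted write cannot corrupt data that was already stored, which is one reason the approach lends itself to consistency.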

The smartest aspect of what RethinkDB are doing, however, is creating their product as a MySQL storage engine, minimizing barriers to adoption. RethinkDB is currently in rapid development, and binary-only downloads are available from their web site. It is definitely a project to watch as it matures over the course of the next year.

December 15 2010

Strata Gems: Kinect democratizes augmented reality

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Manage clusters with Mesos.

Strata 2011 The combination of augmented reality with data and analytics will bring radical change to our lives over the next few years. You're probably carrying a location-sensitive personal AR device right now, kitted out with motion sensors, audio-visual capabilities and network connectivity.

Microsoft's Kinect technology has now made the Xbox 360 gaming console a perfect experimentation platform for augmented reality. Marketed right now as a camera-based controller for video games, the Kinect's longer term impact might well be in the creation of augmented reality experiences. And of course, the world of advertising will be first in line to exploit this.

Dustin O'Connor has been busy hacking with the Kinect and producing a number of very cool demos. In the video below, he demonstrates the ability to manipulate a 3D object using multiple touches.

Video: "kinect augmented reality multi touching" from dustin o'connor on Vimeo.

Between technologies such as Kinect and the widespread availability of smartphones, the means to create augmented reality experiences is now highly democratized, awaiting exploitation and experimentation from hackers and innovators.

December 14 2010

Strata Gems: Manage clusters with Mesos

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Use GPUs to speed up calculation. Early-bird pricing on Strata closes today, December 14: don't forget to register!

Strata 2011 Innovation in big data processing architectures is far from over. While Hadoop was there first, a second generation of systems is emerging, typified by Google's Caffeine. The drive to real-time analysis in 2011 will only accelerate change.

Change isn't easy for everyone. For administrators running processing clusters, a key problem is scheduling and managing the workload of big data systems. Current solutions for this aren't really optimized for the scenario of evolving and heterogeneous frameworks.

Enter Mesos, a key piece of cloud infrastructure to watch in 2011. Put simply, Mesos allows a collection of distributed applications to share a compute cluster, in the same way Linux allows multiple applications to share a single computer.

Mesos architecture
Mesos architectural overview, from Mesos presentation given to Bay Area Hadoop Users Group

Deploying processing frameworks on top of Mesos gives you exceptional flexibility: if a new version of Hadoop comes out, you no longer have the expense and worry of running a parallel cluster and switching. You simply deploy the new version inside the cluster and can phase out the old when you want, or easily roll back. If a whole new architecture comes out, you don't have to invest in a separate cluster to run it: you can use the same cluster.

Mesos offers other benefits, including the ability to isolate frameworks using Linux containers, and data locality for frameworks that require it, such as Hadoop.

Currently at an alpha stage of maturity, Mesos has recently been proposed to the Apache Incubator. It has been developing rapidly for nearly two years now, and is set to become a major part of big data infrastructure in the coming 24 months. In 2011, Mesos may well be the new 'M' in the SMAQ stack.

December 13 2010

Strata Gems: Use GPUs to speed up calculation

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: The emerging marketplace for social data. Early-bird pricing on Strata closes December 14: don't forget to register!

Strata 2011 The release in November of Amazon Web Services' Cluster GPU instances highlights the move to the mainstream of Graphics Processing Units (GPUs) for general purpose calculation. Graphical applications require very fast matrix transformations, for which GPUs are optimized. Boards such as the NVIDIA Tesla offer hundreds of processor cores all able to work in parallel.

While debate is ongoing about the exact range of performance boost available by using GPUs, reports indicate that speedups over CPUs of 2.5x to 15x can be obtained for calculation-heavy applications.

NVIDIA has led the trend for general purpose computing on GPUs with the Compute Unified Device Architecture (CUDA). By using extensions to the C programming language, developers can write code that executes on the GPU, mixed in with code running on the CPU.

NVIDIA's Tesla M2050 GPU Computing Module

While CUDA is NVIDIA-only, OpenCL (Open Computing Language) is a standard for cross-platform general parallel programming. Originated by Apple and AMD, it is now developed with cross-industry participation. ATI and NVIDIA are among those offering OpenCL support for their products.

Now with Amazon's support for GPU clusters, it's easier than ever to start accessing the power of GPUs for data analysis.
OpenCL and CUDA bindings exist for many popular programming languages, including Java, Python and C++, and
the R+GPU project gives GPU access for the R statistical package.

To get a quick impression of what GPU code looks like, check out this example from the Python OpenCL bindings. The code that executes on the GPU is the OpenCL kernel passed as a string to cl.Program.

import pyopencl as cl
import numpy
import numpy.linalg as la

# Two random input vectors in single precision
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# Pick an OpenCL device and create a command queue for it
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Copy the inputs to device memory and allocate the output buffer
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

# The kernel source below is the code that runs on the GPU
prg = cl.Program(ctx, """
__kernel void sum(__global const float *a,
__global const float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
""").build()

# Launch one work-item per vector element
prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

# Copy the result back to the host
a_plus_b = numpy.empty_like(a)
cl.enqueue_copy(queue, a_plus_b, dest_buf).wait()

# Compare against the same sum computed on the CPU
print(la.norm(a_plus_b - (a + b)))

Amazon's Werner Vogels will be among the keynote speakers at Strata.

December 12 2010

Strata Gems: The emerging marketplace for social data

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Let it snow.

If you want to analyze social media data in significant volumes, it can be inconvenient and costly to aggregate it yourself. Aggregator and reseller Gnip is at the forefront of the growing marketplace for social data. As well as providing a unified API to aggregated public data from social web sites, Gnip is the first authorized reseller for the entirety of Twitter's public output.

Who uses this Twitter data, and for what? The ultimate end users of most aggregated social data are corporations. The data is used either for brand management and monitoring purposes, or as part of workflow systems that help them address customer issues online. However, the raw feeds themselves are most often provided to suppliers of social monitoring systems, not the end user.

Right now, you can't just install a "Twitter sink" and point the Twitter firehose into it. Gnip's licensees are mostly vendors of social CRM systems. According to Gnip's estimate there are over 500 social monitoring products available in the US alone, but the bigger market is in integration with existing CRMs. Rather than using separate systems, it makes more sense to introduce social media data into the existing CRM investments of large enterprises.

By reselling their public data, Twitter is the first social service to monetize its raw data in this way. Over the course of the next twelve months this trend is likely to accelerate. With that acceleration, we'll also see attendant issues over public awareness of the ultimate destination of their social media scribblings. Writing something in public for your friends is one thing, but people may be surprised to know their every utterance is being carefully watched by their favorite brands.

Gnip's CEO Jud Valeski will be participating in the Strata panel What's Mine is Yours: the Ethics of Big Data Ownership.

December 11 2010

Strata Gems: Let it snow

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Ushahidi enables crowdsourced journalism and intelligence.

Weather in the UK is so unpredictable and varied that it's the primary topic of national conversation. Each year brings just enough snow to make it a remarkable surprise every time the white stuff comes down.

Snow in the UK is generally a source of delight, but inevitably brings school closures and travel disruption. So, a simple idea to use Twitter as a way of gathering snow reports from around the country has proved remarkably popular. Built by developer Ben Marsh (@benmarsh), the #uksnow Map aggregates the reports from Twitter users on a map. With the recent UK snowfalls, the project has had as much traffic as it did in the entirety of the 2009 winter.

UK snow screenshot
The #uksnow Map

The idea is to tweet using the hashtag #uksnow with a snow rating on a scale of 1-10, along with the first half of your postal code. If you attach a picture to the tweet, then that can be placed on the map too.
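
A report in this style is easy to pick apart with regular expressions. The Python sketch below is not the #uksnow Map's own code. It assumes the rating is written as "N/10" and the postcode district as a token like "CV4", and the name `parse_snow_report` is hypothetical.

```python
import re

HASHTAG = re.compile(r"#uksnow\b", re.IGNORECASE)
# Assumed rating syntax: a number from 1 to 10 followed by "/10".
RATING = re.compile(r"\b(10|[1-9])/10\b")
# First half of a UK postcode, e.g. "CV4", "M1" or "SW1A".
OUTWARD_CODE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\b", re.IGNORECASE)


def parse_snow_report(tweet):
    """Return {'area': ..., 'rating': ...} for a snow report, else None."""
    if not HASHTAG.search(tweet):
        return None
    rating = RATING.search(tweet)
    if not rating:
        return None
    # Blank out the hashtag and rating so they can't be mistaken for a postcode.
    rest = RATING.sub(" ", HASHTAG.sub(" ", tweet))
    area = OUTWARD_CODE.search(rest)
    if not area:
        return None
    return {"area": area.group(1).upper(), "rating": int(rating.group(1))}


print(parse_snow_report("#uksnow CV4 3/10 light flurries"))
print(parse_snow_report("Snowing hard in Leeds"))
```

Tweets that lack the hashtag, the rating or the postcode are simply skipped, which is the price of asking people to follow a convention by hand.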

Getting people to carefully craft tweets can be troublesome, but if it's fun enough and the payoff useful, it's a great lightweight way of aggregating citizen reports. The Ushahidi project is a generalization of this concept into a platform.

December 10 2010

Strata Gems: Ushahidi enables crowdsourced journalism and intelligence

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Make beautiful graphs of your Twitter network.

Strata 2011 Ushahidi is a software platform built to crowdsource information over multiple channels such as text messages, email and Twitter. Originally built in 2008 to map reports of post-election violence in Kenya, Ushahidi has evolved into a non-profit company with a suite of tools that enables crowdsourced information aggregation, with applications ranging from citizen journalism and crisis management to the more commercial side of brand monitoring.

You can use the Ushahidi tools in two ways: by downloading the source code and running it yourself, or by taking advantage of their hosted platforms SwiftRiver and Crowdmap.

Crowdmap is a minimum-fuss way to use the Ushahidi tools to collect and visualize geographical data. Though built for emergency use, it has many applications for representation of local knowledge.

A free-to-use web hosted service, Crowdmap has been used to help with reporting floods in Pakistan, and to aggregate reports from citizen journalists about the Pope's visit to the UK.

UK Tube Strike Map
Crowdmap of the UK Tube strikes, created by the BBC

SwiftRiver is a media aggregation and filtering tool. It aggregates sources such as Twitter, blogs, email and SMS, and provides features that help identify relationships and trends in the incoming data sets. Through semantic analysis, incoming content can be automatically categorized for review.

As well as data stream management and curation, SwiftRiver places an emphasis on adding
context and history to online research, enabling the location and reputation of data sources to
be taken into account.

You can use SwiftRiver as a hosted service, including a free individual plan, or download and run it yourself.

December 09 2010

Strata Gems: Make beautiful graphs of your Twitter network

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Explore and visualize graphs with Gephi.

Strata 2011 Where better to start analyzing social networks than with your own? Using the graphing tool Gephi and a little bit of Python script, you can analyze your own Twitter network, revealing the inherent structure among those you follow. It's also a fun way to learn more about network analysis.

Inspired by the LinkedIn Gephi graphs, I analyzed my Twitter friend network. I took everybody that I followed on Twitter, and found out who among them followed each other. I've shared the Python code I used to do this on

To use the script, you need to create a Twitter application and use command-line OAuth authentication to get the tokens to plug into the script. Writing about that is a bit gnarly for this post, but the easiest way I've found to authenticate a script with OAuth is by using the oauth command-line tool that ships with the Ruby OAuth gem.

The output of my Twitter-reading tool is a graph, in GraphML, suitable for import into Gephi. The graph has a node for each person, and an edge for each "follows" relationship. On initial load into Gephi, the graph looks a bit like a pile of spider webs, not showing much information.
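
For readers who have not met GraphML, the following Python sketch shows roughly what such a file contains. It is not the script described above: the `follows_to_graphml` function and the account names are hypothetical, and only the standard library is used.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def follows_to_graphml(follows):
    """Build a GraphML string from a {follower: [followed, ...]} mapping."""
    # Setting xmlns as a plain attribute puts every element in the
    # GraphML namespace when the document is parsed.
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(root, "graph", id="twitter", edgedefault="directed")

    # One node per person, whether they appear as follower or followed.
    people = set(follows)
    for targets in follows.values():
        people.update(targets)
    for person in sorted(people):
        ET.SubElement(graph, "node", id=person)

    # One directed edge per "follows" relationship.
    for follower in sorted(follows):
        for followed in follows[follower]:
            ET.SubElement(graph, "edge", source=follower, target=followed)

    return ET.tostring(root, encoding="unicode")


# Made-up accounts, for illustration only.
xml_text = follows_to_graphml({"alice": ["bob"], "bob": ["alice", "carol"]})
print(xml_text)
```

Saved with a .graphml extension, output of this shape is the kind of file Gephi's GraphML import is designed to read.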

I wanted to show a couple of things in the graph: clusters of closely related people, and the well-connected people. To find related groups of people, you can use Gephi to analyze the modularity of the network, and then color nodes according to the discovered communities. To find the well-connected people, run the "Degree Power Law" statistic in Gephi, which will calculate the betweenness centrality for each person, essentially a measure of how much of a hub they are.
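
Betweenness centrality counts how often a node sits on the shortest paths between other nodes. The Python sketch below computes it with Brandes' algorithm for a small unweighted directed graph. It illustrates the measure itself and makes no claim about how Gephi computes it. The function name and the sample graph are assumptions.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted directed graph.

    graph maps each node to a list of the nodes it points to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    centrality = {node: 0.0 for node in nodes}

    for source in nodes:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        predecessors = {node: [] for node in nodes}
        paths = {node: 0 for node in nodes}
        distance = {node: -1 for node in nodes}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in nodes}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    return centrality


# Made-up network: "hub" lies on every path from a or b to c or d.
scores = betweenness_centrality({"a": ["hub"], "b": ["hub"], "hub": ["c", "d"]})
print(sorted(scores.items()))
```

In this sample network only "hub" gets a non-zero score, because every route between the other four nodes passes through it.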

These steps are neatly laid out in a great slide deck from Sociomantic Labs on analyzing Facebook social networks. Follow the tips there and you'll end up with a beautiful graph of your network that you can export to PDF from Gephi.

Social graph
Overview of my social graph: click to view the full PDF version

The final result for my network is shown above. If you download the full PDF, you'll notice there are several communities, which I'll explain for interest. The mass of pink is predominantly my O'Reilly contacts, dark green shows the Strata and data community, the lime green the Mono and GNOME worlds, mustard shows the XML and open source communities. The balance of purple is assorted technologist friends.

Finally my sporting interests are revealed: the light blue are cricket fans and commentators, the red Formula 1 motor racing. Unsurprisingly, Tim O'Reilly, Stephen Fry and Miguel de Icaza are big hubs in my network. Your own graphs will reveal similar clusters of people and interests.

If this has whetted your appetite, you can discover more about mining social networks at Matthew Russell's Strata session, Unleashing Twitter Data For Fun And Insight.

December 08 2010

Strata Gems: Explore and visualize graphs with Gephi

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Five data blogs you should read.

If you need to explore your data as a graph, then Gephi is a great place to start. An open source project, Gephi is the ideal tool for exploring data and analyzing networks.

Gephi is available for Windows, Linux and OS X. You can get started by downloading and installing Gephi, and playing with one of the example data sets.

Gephi is a sophisticated tool. A "Photoshop for data", it offers a rich palette of features, including those specialized for social network analysis.

Gephi screenshot

Graphs can be loaded and created using many common graph file formats, and explored interactively. Hierarchical graphs such as social networks can be clustered in order to extract meaning. Gephi's layout algorithms automatically give shape to a graph to help exploration, and you can tinker with the colors and layout parameters to improve communication and appearance.

Following the Photoshop metaphor, one of the most powerful aspects of Gephi is that it is extensible through plugins. Though the plugin ecosystem is just getting started, existing plugins let you export a graph for publication on the web and experiment with additional layouts. The AlchemyAPI plugin uses natural language processing to identify real world entities from graph data, and shows the promise of connecting Gephi to web services.

Earlier this year, DJ Patil from LinkedIn brought Gephi-generated graphs of LinkedIn social networks to O'Reilly's Foo Camp. Aside from importing the data, very little manipulation was needed inside Gephi. In this video he explains the social networks of several participants.

December 07 2010

Strata Gems: Five data blogs you should read

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: The timeless utility of sed and awk.

Whether your interest in data is professional or casual, commercial or political, there's a blog out there for you. Feel free to add your own suggestions to the comments at the bottom.

Measuring Measures @bradfordcross

Eclectic, thoughtful and forthright, Bradford Cross spends his time making research work in practice: from hedge funds to data-driven startup FlightCaster. His blog covers topics ranging from venture capital and startups to coding in Clojure.

Dataists @vsbuffalo @hmason

Unapologetically geeky, and subtitled "Fresher than seeing your model doesn't have heteroscedastic errors", Dataists is a group blog featuring contributions from writers in the New York data scene, such as Hilary Mason and Drew Conway. Dataists includes an insightful mix of instruction and opinion.

Flowing Data @flowingdata

Consistently excellent, Nathan Yau's Flowing Data blog is a frequently updated stream of articles on visualization, statistics and data. Always pretty to look at, the blog often includes commentary and coverage of topical data stories.

Flowing Data Blog

Guardian Data Blog @datastore

Subtitled "Facts are sacred", and part of the UK Guardian's pioneering approach to online content, this blog uncovers the stories behind public data. Edited by Strata keynoter Simon Rogers.

Pete Warden @petewarden

Founder of OpenHeatMap, Pete Warden keeps a personal blog with a strong component of data and visualization topics, as well as commentary on the emerging data industry: most recently, "Data is snake oil".

And plenty more...

While these are some of my favorites, the question and answer web site Quora has a more exhaustive list of data blogs.

December 06 2010

Strata Gems: The timeless utility of sed and awk

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Where to find data.

Edison famously said that genius is 1% inspiration and 99% perspiration. Much the same can be said for data analysis. The business of obtaining, cleaning and loading the data often takes the lion's share of the effort.

Now over 30 years old, the UNIX command line utilities sed and awk are useful tools for cleaning up and manipulating data. In their Taxonomy of Data Science, Hilary Mason and Chris Wiggins note that when cleaning data, "Sed, awk, grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest." A little aptitude with command line tools can go a long way.

sed is a stream editor: it operates on data in a serial fashion as it reads it. You can think of sed as a way to batch up a bunch of search and replace operations that you might perform in a text editor. For instance, this command will replace all instances of "foo" with "bar" in the contents of a file, printing the result to standard output and leaving the file itself unchanged:

sed -e 's/foo/bar/g' myfile.txt
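To make the "batching" idea concrete, here is a minimal sketch that chains two substitutions in one pass. The file contents and the second pattern (baz to qux) are invented purely for illustration.

```shell
# Create a small sample file (contents are made up for this example).
printf 'foo one\nfoo baz two\n' > myfile.txt

# Each -e adds one editing command; sed applies them in order to every line.
# The result goes to standard output; myfile.txt is not modified.
sed -e 's/foo/bar/g' -e 's/baz/qux/g' myfile.txt
```

Redirect the output to a new file if you want to keep the cleaned version.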

Anybody who has used regular expressions within a text editor or programming language will find sed easy to grasp. Awk takes a little more getting used to. A record-oriented tool, awk is the right tool to use when your data contains delimited fields that you want to manipulate.

Consider this list of names, which we'll imagine lives in the file presidents.txt.

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe

To extract just the first names, we can use the following command:

$ awk '{ print $1 }' presidents.txt

Or, to just find those records with "James" as the first name:

$ awk '$1 ~ /James/ { print }' presidents.txt
James Madison
James Monroe
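The same field matching works when the fields are separated by something other than whitespace. As a rough sketch, assuming a hypothetical comma-separated file presidents.csv (last name, first name, year took office) created just for this example:

```shell
# Hypothetical sample data, invented for this illustration.
printf 'Washington,George,1789\nAdams,John,1797\nJefferson,Thomas,1801\n' > presidents.csv

# -F, sets the field separator to a comma.
# Print "first last" for records whose third field is greater than 1790.
awk -F, '$3 > 1790 { print $2, $1 }' presidents.csv
```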

Awk can do a lot more, and features programming concepts such as variables, conditionals and loops. But just a basic grasp of how to match and extract fields will get you far.
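As a small taste of those programming features, the sketch below counts how often each first name appears and reports only the repeated ones. It recreates presidents.txt from the list above so that it can be run on its own; the variable names are arbitrary.

```shell
# Recreate the sample file from the list above.
cat > presidents.txt <<'EOF'
George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
EOF

# count is an associative array keyed by first name (a variable);
# the END block loops over it and uses a conditional to filter.
awk '{ count[$1]++ }
     END { for (name in count) if (count[name] > 1) print name, count[name] }' presidents.txt
```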

For more information, attend the Strata Data Bootcamp, where Hilary Mason is an instructor, or read sed & awk.

December 05 2010

Strata Gems: Where to find data

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Quick starts for charts.

With the growth of both the open data movement and data marketplaces, there's now a wealth of public data - some free, some for sale - that you can use in your analyses and applications. It's not just about data dumps: increasingly you can get data through APIs, or even run your own code on servers provided by the data host.


Freebase

An icon of the open data movement, Freebase is a graph database full of "people, places and things". The data is contributed and edited by community volunteers. Freebase was recently acquired by Google.

Freebase both names real world entities, and stores structured data about the attributes of and relations between those things. For example, see the page for the movie Harry Potter and the Deathly Hallows: Part I. It looks a bit like a Wikipedia page, but you can edit and retrieve the structured data for every page.

Developers have access to a variety of Freebase services, including dumps of the entire database, and API access to the data. Of particular interest is "Acre", a hosted platform that lets you implement an application on Freebase servers, close to the data you need.

Freebase screenshot
Screenshot from Freebase, showing activity in the most popular data sets

Amazon Public Data Sets

As a public service, Amazon Web Services hosts a variety of Public Data Sets available to users writing applications on its cloud services. By putting the data on servers next to its cloud computing platform EC2, Amazon helps avoid the difficulty of locating, cleaning and downloading data. The data never needs to travel: only your code. This is obviously valuable when data sets get particularly large, or are updated frequently.

Amazon's public data sets include annotated human genome data, a variety of data sets from the US Census Bureau, and dumps from services such as Freebase and Wikipedia.

Windows Azure Data Market

Launched publicly by Microsoft this year, the Azure Data Market offers a variety of data sets and sources, accessible by the OData protocol. OData offers uniform access to data, along with a standardized query interface. By using data from the market, a user can reduce the friction of parsing and importing data. Unsurprisingly, Microsoft's own tools such as Excel allow importing of data directly from the marketplace's OData endpoints.

Azure Data Market contains both free and for-pay data sets, offering a route to monetization for data publishers. Free data sets include government and international agency data. As an example of for-pay data, a sports data provider offers MLB game-by-game statistics through the marketplace.

The emergence of data marketplaces offers developers a legitimate route to data previously only obtainable at high cost, or through illicit web scraping.

Yahoo! Query Language (YQL)

YQL is a technology that presents web services in a way that lets familiar SQL-like queries be executed against them. SELECT, INSERT and DELETE operations can be performed against services such as Flickr.

In essence, YQL offers a technology similar to OData, providing an adapter layer that gives data consumers a uniform interface to data. Data providers must expose their data as an Open Data Table, or third parties can contribute adapter definitions, such as those for Foursquare, Github, and Google. The most limiting aspect of YQL is that queries must run through Yahoo's own servers.


Infochimps

Infochimps is another data marketplace and commons, founded by Strata speaker Flip Kromer.

Infochimps makes its data available either as downloadable data sets, or accessible via an API. For an example of commercial data available on Infochimps, check out the Twitter Census Conversation Metrics, which counts the occurrences of URLs, hashtags and smileys used over a year on Twitter.


DBpedia

A previous Strata Gem covered the use of Wikipedia as training data, but there's more than just free text content inside Wikipedia: many articles contain structured information. DBpedia is a community-led project to extract this structured information and make it available on the web.

DBpedia offers a variety of data sets, covering entities such as cities, countries, politicians, films and books. The data is available as dumps, can be queried online, or can be crawled as linked data in RDF format.
