
September 08 2011

Strata Week: MapReduce gets its arms around a million songs

Here are some of the data stories that caught my attention this week.

A million songs and MapReduce

Earlier this year, Echo Nest and LabROSA at Columbia University released the Million Song Dataset, a freely available collection of audio features and metadata for a million contemporary popular music tracks. The purpose of the dataset, among other things, was to encourage research on music algorithms. But as Paul Lamere, director of Echo Nest's Developer Platform, makes clear, getting started with the dataset can be daunting.

In a post on his Music Machinery blog, Lamere explains how to use Amazon's Elastic MapReduce to process the data. In fact, Echo Nest has loaded the entire Million Song Dataset into a single S3 bucket, available at http://tbmmsd.s3.amazonaws.com/. The bucket contains approximately 300 files, each with data on about 3,000 tracks. Lamere also points to a small subset of the data — just 20 tracks — available in a file on GitHub, and he has written track.py, a script that parses a track's data and returns a dictionary containing all of it.
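To give a flavor of what such a job looks like, here is a minimal sketch in Python using the mrjob library, which runs both locally and on Elastic MapReduce. The tab-delimited layout and column positions below are assumptions for illustration; a real job would parse each line with Lamere's track.py instead.

```python
# A hedged sketch: find the highest-tempo track per artist with mrjob,
# which can run locally or on Amazon Elastic MapReduce (-r emr).
# The split() below is a stand-in for track.py's real parser, and the
# column positions are assumed, not the dataset's actual layout.
from mrjob.job import MRJob


class MaxTempoPerArtist(MRJob):

    def mapper(self, _, line):
        fields = line.split('\t')   # assumed delimiter
        artist = fields[0]          # assumed artist column
        tempo = float(fields[1])    # assumed tempo column
        yield artist, tempo

    def reducer(self, artist, tempos):
        yield artist, max(tempos)


if __name__ == '__main__':
    MaxTempoPerArtist.run()
```

Run locally with `python max_tempo.py input.tsv`, or add `-r emr` to send the same job to Elastic MapReduce once AWS credentials are configured.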

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

GPS steps in where memory fails

After decades of cycling without incident, New York Times science writer John Markoff experienced what every cyclist dreads: a major crash, one that resulted in a broken nose, a deep gash on his knee, and road rash aplenty. He was knocked unconscious, unable to remember what had caused the crash. In a recent piece in the NYT, he chronicled the steps he took to reconstruct the accident.

He did so by turning to the GPS data tracked by the Garmin 305 on his bicycle. Typically, devices like this are used to track the distance and location of rides, as well as a cyclist's pedaling and heart rates. But as Markoff investigated his own crash, he found that the data stored in these devices can also be used to ascertain what happens in cycling accidents.

Investigating the crash he couldn't remember, Markoff was able to piece together what happened from the data:

My Garmin was unharmed, and when I uploaded the data I could see that in the roughly eight seconds before I crashed, my speed went from 30 to 10 miles per hour — and then 0 — while my heart rate stayed a constant 126. By entering the GPS data into Google Maps, I could see just where I crashed.

I realized I did have several disconnected memories. One was of my hands being thrown off the handlebars violently, but I had no sense of where I was when it happened.

With a friend, Bill Duvall, who many years ago also raced for the local bike club Pedali Alpini, I went back to the spot. La Honda Road cuts a steep and curving path through the redwoods. Just above where the GPS data said I crashed, we could see a long, thin, deep pothole. (It was even visible in Google's street view.) If my tire hit that, it could easily have taken me down.

I also had a fleeting recollection of my mangled dark glasses, and on the side of the road, I stooped and picked up one of the lenses, which was deeply scratched. From the swift deceleration, I deduced that when my hands were thrown from the handlebars, I must have managed to reach my brakes again in time to slow down before I fell. My right hand was pinned under the brake lever when I hit the ground, causing the nasty road rash.
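It's easy to see how a crash leaves that kind of signature in the data. As a hedged illustration — leaving aside the Garmin's actual export format — here's how one might scan a list of timestamped speed readings for the abrupt deceleration Markoff describes:

```python
# Hypothetical sketch: flag abrupt decelerations in a ride log.
# Each sample is (seconds_elapsed, speed_mph); real data would come
# from the device's exported ride file, not this toy list.
samples = [(0, 30.0), (2, 28.5), (4, 22.0), (6, 10.0), (8, 0.0)]

THRESHOLD_MPH_PER_S = 2.0  # assumed cutoff for "crash-like" braking

for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
    decel = (v0 - v1) / (t1 - t0)
    if decel >= THRESHOLD_MPH_PER_S:
        print(f"abrupt deceleration of {decel:.1f} mph/s at t={t1}s")
```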

It's one thing for a rider to reconstruct his own accident, but Markoff says insurance companies are also starting to pay attention to this sort of data. As one lawyer notes in the Times article, "Frankly, it's probably going to be a booming new industry for experts."

Crowdsourcing and crisis mapping from WWI

The explosion of mobile, mapping, and web technologies has facilitated the rise of crowdsourcing during crisis situations, giving citizens and NGOs — among others — the ability to contribute to and coordinate emergency responses. But as Patrick Meier, director of crisis mapping and partnerships at Ushahidi, has found, there are examples of crisis mapping that predate the Internet age.

Meier highlights World War I maps he discovered at the National Air and Space Museum, pointing to the government's request that citizens help with the mapping process:

In the event of a hostile aircraft being seen in country districts, the nearest Naval, Military or Police Authorities should, if possible, be advised immediately by Telephone of the time of appearance, the direction of flight, and whether the aircraft is an Airship or an Aeroplane.

And he asks a number of very interesting questions: How often were these maps updated? What sources were used? And "would public opinion at the time have differed had live crowdsourced crisis maps existed?"

Got data news?

Feel free to email me.


September 01 2011

Strata Week: What happens when 200,000 hard drives work together?

Here are a few of the data stories that caught my attention this week.

IBM's record-breaking data storage array

IBM Research is building a new data storage array that's almost 10 times larger than anything built before. The array comprises 200,000 hard drives working together, with a storage capacity of 120 petabytes — that's 120 million gigabytes. To give you some idea of the capacity of the new "drive," writes MIT Technology Review, "a 120-petabyte drive could hold 24 billion typical five-megabyte MP3 files or comfortably swallow 60 copies of the biggest backup of the Web, the 150 billion pages that make up the Internet Archive's WayBack Machine."

Data storage at that scale creates a number of challenges, including — no surprise — cooling such a massive system, as well as handling drive failures, backups, and indexing. The new storage array will benefit from other research IBM has done to boost supercomputers' data access. Its General Parallel File System (GPFS) was designed with massive volumes in mind: it spreads files across multiple disks so that many parts of a file can be read or written at once. GPFS demonstrated that speed last month, when it set a new scanning record by indexing 10 billion files in just 43 minutes.
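GPFS itself is a kernel-level filesystem, but the core striping idea is easy to illustrate: deal a file's blocks round-robin across many disks, then read them back in parallel. The sketch below is a conceptual toy in Python, using directories as stand-in "disks"; it is not IBM's implementation.

```python
# Conceptual sketch of striping, the idea behind GPFS's parallel reads:
# blocks of one file are dealt round-robin across several "disks"
# (plain directories here), so later reads can hit them concurrently.
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed stripe unit


def stripe(path, disks):
    """Write path's blocks round-robin across the given disk dirs."""
    locations = []
    with open(path, 'rb') as f:
        for i, block in enumerate(iter(lambda: f.read(BLOCK_SIZE), b'')):
            disk = disks[i % len(disks)]
            block_path = os.path.join(disk, f'block_{i:06d}')
            with open(block_path, 'wb') as out:
                out.write(block)
            locations.append(block_path)
    return locations


def read_striped(locations):
    """Read all blocks in parallel and reassemble the file."""
    def read_block(p):
        with open(p, 'rb') as f:
            return f.read()
    with ThreadPoolExecutor() as pool:
        return b''.join(pool.map(read_block, locations))
```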

IBM's new 120-petabyte drive was built at the request of an unnamed client that needed a new supercomputer for "detailed simulations of real-world phenomena."


Infochimps' new Geo API

The data marketplace Infochimps released a new Geo API this week, giving developers access to a number of disparate location-related datasets via one API with a unified schema.

According to Infochimps, the API addresses several pain points that those working with geodata face:

  1. Difficulty in integrating several different APIs into one unified app
  2. Lack of ability to display all results when zoomed out to a large radius
  3. Limitation of only being able to use lat/long

To address these issues, Infochimps has created a simple new schema to keep data consistent and unified when it's drawn from multiple sources. The company has also built a "summarizer" to intelligently cluster and better display data. Finally, it has enabled the API to handle queries beyond those traditionally associated with geodata, namely latitude and longitude.
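As a hedged sketch of what a unified geo query can look like, the snippet below uses Python's requests library against a placeholder endpoint; the URL, parameter names, and response shape are invented for illustration rather than taken from Infochimps' documentation:

```python
# Hypothetical sketch of querying a unified geo API. The endpoint,
# parameters, and API key are placeholders, not Infochimps' actual API.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

resp = requests.get(
    "https://api.example.com/geo/search",  # assumed endpoint
    params={
        "lat": 40.7484,    # latitude of interest
        "lng": -73.9857,   # longitude of interest
        "radius": 5000,    # meters; large radii return clustered summaries
        "apikey": API_KEY,
    },
)
resp.raise_for_status()
for place in resp.json().get("results", []):
    print(place.get("name"), place.get("category"))
```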

As we seek to pull together and analyze all types of data from multiple sources, this move toward a unified schema will become increasingly important.

Hurricane Irene and weather data

The arrival of Hurricane Irene last week underscored the importance not only of emergency preparedness but also of access to real-time data — weather data, transportation data, government data, mobile data, and so on.

Screenshot from The New York Times' interactive Hurricane Irene tracking map.

As Alex Howard noted here on Radar, crisis data is becoming increasingly social:

We've been through hurricanes before. What's different about this one is the unprecedented levels of connectivity that now exist up and down the East Coast. According to the most recent numbers from the Pew Internet & American Life Project, for the first time, more than 50% of American adults use social networks. 35% of American adults have smartphones. 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. The growth of an Internet of things is an important evolution. What we're seeing this weekend is the importance of an Internet of people.

Got data news?

Feel free to email me.

Hard drive photo: Hard Drive by walknboston, on Flickr
