Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

December 09 2013

How to analyze 100 million images for $624

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.

Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.

I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and only cost $12.48! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)

The Foundations

So, how do I do all that work so quickly and cheaply? The building blocks are the open source Hadoop, HIPI, and OpenCV projects. Unless you were frozen in an Arctic ice-cave for the last decade, you’ll have heard of Hadoop, but the other two are a bit less famous.

HIPI is a Java framework that lets you efficiently process images on a Hadoop cluster. It’s needed because HDFS can’t handle large numbers of files, so it provides a way of bundling images together into much bigger files, and unbundling them on the fly as you process them. It’s been growing in popularity in research and academia, but hasn’t had widespread commercial use yet. I had to fork it and add meta-data support so I could keep track of the files as they went through the system, along with some other fixes. It’s running like a champion now, though, and has enabled everything else we’re doing.

OpenCV is written in C++, but has a recently added Java wrapper. It supports a lot of the fundamental operations you need to implement image-processing algorithms, so I was able to write my advanced (and fairly cunning, if I do say so myself, especially for mustaches!) image analysis and object detection routines using OpenCV for the basic operations.


The first and most time-consuming step is getting your images downloaded. HIPI has a basic distributed image downloader as an example, but you’ll want to make sure the site you’re accessing won’t be overwhelmed. I was focused on large social networks with decent CDNs, so I felt comfortable running several servers in parallel. I did have to alter the HIPI downloader code to add a user agent string so the admins of the sites could contact me if the downloading was causing any problems.

If you have three different external services you’re pulling from, with four servers assigned to each service, you end up taking about 30 days to download 100 million photos. That’s $4,492.80 for the initial pull, which is not chump change, but not wildly outside a startup budget, especially if you plan ahead and use reserved instances to reduce the cost.

Object Recognition

Now you have 100 million images, you need to do something useful with them. Photos contain a massive amount of information, and OpenCV has a lot of the building blocks you’ll need to extract some of the parts you care about. Think about the properties you’re interested in — maybe you want to exclude pictures containing people if you’re trying to create slideshows about places — and then look around at the tools that the library offers; for this example you could search for faces, and identify photos that are faceless as less likely to contain people. Anything you could do on a single image, you can now do in parallel on millions of them.

Before you go ahead and do a distributed run, though, make sure it works. I like to write a standalone command-line version of my image processing step, so I can debug it easily and iterate fast on the algorithm. OpenCV has a lot of convenience functions that make it easy to load images from disk, or even capture them from your webcam, display them, and save them out at the end. Their Java support is quite new, and I hit a few hiccups, but overall, it works very well. This made it possible to write a wrapper that is handed the images from HIPI’s decoding engine, does some processing, and then writes the results out in a text format, one line per image, all within a natively-Java Hadoop job.

Once you’re happy with the way the algorithm is working on your test data, and you’ve wrapped it in a HIPI wrapper, you’re ready to apply it to all the images you’ve downloaded. Just spin up a Hadoop job with your jar file, pointing at the HDFS folders containing the output of the downloader. In our experience, we were able to process around two million 700×700-sized JPEG images on each server per day, so you can use that as a rule of thumb for the size/speed tradeoffs you want to make when you choose how many machines to put in your cluster. Surprisingly, the actual processing we did within our algorithm didn’t affect the speed very much, apparently the object recognition and image-processing work ran fast enough that it was swamped by the time spent on IO.

I hope I’ve left you excited and ready to tackle your own large-scale image processing challenges. Whether you’re a data person who’s interested in image processing or a computer-vision geek who wants to widen your horizons, there’s a lot of new ground to be explored, and it’s surprisingly easy once you put together the pieces.

June 17 2013

Four short links: 18 June 2013

  1. Our Backbone Stack (Pamela Fox) — fascinating glimpse into the tech used and why.
  2. Automating Card Games Using OpenCV and PythonMy vision for an automated version of the game was simple. Players sit across a table on which the cards are laid out. My program would take a picture of the cards and recognize them. It would then generate valid expression that yielded 24, and then project the answer on to the table.
  3. Ray Ozzie on PRISM — posted on Hacker News (!). In particular, in this world where “SaaS” and “software eats everything” and “cloud computing” and “big data” are inevitable and already pervasive, it pains me to see how 3rd Party Doctrine may now already be being leveraged to effectively gut the intent of U.S. citizens’ Fourth Amendment rights. Don’t we need a common-sense refresh to the wording of our laws and potentially our constitution as it pertains to how we now rely upon 3rd parties? It makes zero sense in a “services age” where granting third parties limited rights to our private information is so basic and fundamental to how we think, work, conduct and enjoy life. (via Alex Dong)
  4. Larry Brilliant’s Commencement Speech (HufPo) — speaking to med grads, he’s full of purpose and vision and meaning for their lives. His story is amazing. I wish more CS grads were inspired to work on stuff that matters, and cautioned about adding their great minds to the legion trying to solve the problem of connecting you with brands you love.

May 02 2011

Four short links: 2 May 2011

  1. Chinese Internet Cafes (Bryce Roberts) -- a good quick read. My note: people valued the same things in Internet cafes that they value in public libraries, and the uses are very similar. They pose a similar threat to the already-successful, which is why public libraries are threatened in many Western countries.
  2. SIFT -- the Scale Invariant Feature Transform library, built on OpenCV, is a method to detect distinctive, invariant image feature points, which easily can be matched between images to perform tasks such as object detection and recognition, or to compute geometrical transformations between images. The licensing seems dodgy--MIT code but lots of "this isn't a license to use the patent!" warnings in the LICENSE file. (via Joshua Schachter)
  3. The Secret Life of Libraries (Guardian) -- I like the idea of the most-stolen-books revealing something about a region; it's an aspect of data revealing truth. For a while, Terry Pratchett was the most-shoplifted author in England but newspapers rarely carried articles about him or mentioned his books (because they were genre fiction not "real" literature). (via Brian Flaherty)
  4. Sweble -- MediaWiki parser library. Until today, Wikitext had been poorly defined. There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. (via Dirk Riehle)

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.
No Soup for you

Don't be the product, buy the product!

YES, I want to SOUP ●UP for ...