
December 11 2013

Four short links: 11 December 2013

  1. Meet Jack, or What The Government Could Do With All That Location Data (ACLU) — sham slidedeck which helps laypeople see how our data exhaust can be used against us to keep us safe.
  2. PirateBay Moves Domains — different ccTLDs have different policies and operate in different jurisdictions, because ICANN gives them broad discretion to operate the country code domains. However, post-Snowden, governments are turning on the US’s stewardship of critical Internet bodies, so look for governments (i.e., law enforcement) to be meddling a lot more in DNS, IP addresses, routing, and other things which thus far have been (to good effect) fairly neutrally managed.
  3. 3D Printed Room (PopSci) — printed from sand, 11 tons, fully structural, full of the boggle. (via John Hagel)
  4. Things Real People Don’t Say About Advertising — awesome tumblr, great post. (via Keith Bolland)

June 12 2013

Four short links: 12 June 2013

  1. geogit — opengeo project exploring the use of distributed management of spatial data. [...] adapts [git's] core concepts to handle versioning of geospatial data. Shapefiles, PostGIS or SpatiaLite data stored in a change-tracking repository, with all the fun git features for branching history, merging, remote/local repos, etc. BSD-licensed. First sound attempt at open source data management.
  2. Introducing Loupe — Etsy’s monitoring stack. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.
  3. Bluetooth-Controlled Robotic Cockroach (Kickstarter) — ’nuff said. (via BoingBoing)
  4. Nature Sounds of New Zealand — if all the surveillance roboroach anomaly detection drone printing stories get to you, put this on headphones and recharge. (caution: contains nature)

December 05 2012

Four short links: 5 December 2012

  1. The Benefits of Poetry for Professionals (HBR) — Harman Industries founder Sidney Harman once told The New York Times, “I used to tell my senior staff to get me poets as managers. Poets are our original systems thinkers. They look at our most complex environments and they reduce the complexity to something they begin to understand.”
  2. First Few Milliseconds of an HTTPS Connection — far more than you ever wanted to know about how HTTPS connections are initiated.
  3. Google Earth Engine — Develop, access and run algorithms on the full Earth Engine data archive, all using Google’s parallel processing platform. (via Nelson Minar)
  4. 3D Printing Popup Store Opens in NYC (Makezine Blog) — MAKE has partnered with 3DEA, a pop up 3D printing emporium in New York City’s fashion district. The store will sell printers and 3D printed objects as well as offer a lineup of classes, workshops, and presentations from the likes of jewelry maker Kevin Wei, 3D printing artist Josh Harker, and Shapeways’ Duann Scott. This. is. awesome!

January 03 2012

The Transportation Security Administration's QR code flub

I recently read about a cyberpunk author focusing on fictional graffiti artists who use code stencils to overwrite existing QR codes. The author, Tim Maughan, didn't know about my hack showing that there's actually a generalizable method for making QR code stencils work. In Maughan's book, street artists do things like replace a Coca-Cola QR code advertisement with subversive virtual art. It's a cool concept, and the author deserves props for nailing the edge of current and future cyber-reality so well. But "replacing" QR codes in public places is a notion that others and I have been toying with in the non-fiction world.

"Toying" and "doing" are different things, of course. For example, I've toyed with the idea of covering some of the Transportation Security Administration's (TSA) QR codes with my own because it wouldn't be hard to do. You could create stickers for your TSA QR Code prank, and while waiting in line at the airport, you'd — theoretically — put your stickers over the QR codes on the TSA's posters. The TSA QR codes link to boring and bland websites about how much safer we all are because we have to buy $5 bottled water on the other side of the X-ray scanner. These aren't the most popular links, so it's unlikely anyone at the TSA would quickly notice that the QR codes have been replaced. This is a prank that could hang around for a very long time.

So, why haven't I started doing this? I have a strong aversion to jail time. I have seriously considered using Post-It notes or something that would clearly not count as defacement. Permanent stickers might technically be defacing federal property, and they could easily figure out who added the stickers through video recordings. So, while it might be hilarious and completely awesome, I am not going to try it. For the record, neither should you for all of the same reasons.

In any case, now you can understand why I scan the QR codes at the TSA lines. There's always the chance someone with more courage/foolishness than me had the same idea.

And then one day while traveling in Orlando, I scanned the following sign:

TSA poster with QR code.

I'm surprised that what happened next didn't result in a full pat-down for me. The QR code I scanned didn't go to a TSA site, so I started flipping out. I told my traveling companion that I would meet them on the other side of the scanners, and I just stood there in front of this sign trying to figure out if someone else had beaten me to my own "hack."

The QR code linked directly to someone's personal site. I rubbed the poster to see if I could detect a sticker. No sticker. The QR code was in the poster. Had someone replicated the whole poster and just changed the QR code? What a far more elaborate hack! How had they replaced the whole poster without anyone noticing? I took several minutes trying to get a decent photo, and the picture you see above is the best I got. You can still scan the QR code from the photo if you're patient, but trust me, it goes to that same site.

It took me a while to figure out what happened. Justin Watt, the owner of that site, had discovered QR codes relatively early, in 2007. He wrote about how his QR code blog post eventually earned the No. 2 spot in the Google image search for "QR code." The first spot belonged to the BBC, but they had put "BBC" in the center of the code, making his image the first "normal" one. You can see his code here.

Justin's QR code is identical to the code in the TSA poster. So, this wasn't a hack. What happened is that the designer of this poster put a "stock" QR code photo, pulled from Google's image search, into the poster as a placeholder. All of the placeholders in all of the posters were later replaced with Google short links to web pages. Except for this one. Apparently, no one bothered to check that the QR code links work. As far as I know, this poster is still sitting in the Orlando airport and pointing to the wrong website. (Note: I'm assuming that an image swap is what happened. It's really the only assumption that makes any sense. Plus, it's happened before.)

Could this flub get any better? Turns out, it can.

Like many people, Justin thinks the TSA is pretty silly. A quick site-search from Google reveals that Justin has very little patience for all of the mind-numbing things that the TSA regularly does. He even links to this article about Bruce Schneier that is every bit as juicy as the one that I was fantasizing about "hacking" into the TSA's posters.

So, the TSA accidentally linked its poster to a TSA critic. Awesome.

Why would anyone like me take the risk of making the TSA look ridiculous when they've done such a careful job themselves? They could not have done a better job here if they linked to the best way to support the Electronic Frontier Foundation. In fact, because he completely controls the domain, Justin can re-route the QR code to whatever he likes. I wonder what he'll do with his super power.

I will leave it to the readers to discuss the social implications of all of the English language QR code content working, while the Spanish language QR code poster was not checked before it went out. Suffice it to say, I think there are some implications there.

I also wonder how long it will take for this poster to be pulled from the TSA screening lines. So, let's do this: Post your sightings of the flubbed QR code poster on Twitter using the hashtag #tsaflub. I will try to create a collection of the "sightings" so we can see how quickly the TSA takes these down.


October 27 2011

Strata Week: IBM puts Hadoop in the cloud

Here are a few of the data stories that caught my attention this week.

IBM's cloud-based Hadoop offering looks to make data analytics easier

At its conference in Las Vegas this week, IBM made a number of major big-data announcements, including making its Hadoop-based product InfoSphere BigInsights available immediately via the company's SmartCloud platform. InfoSphere BigInsights was unveiled earlier this year, and it is hardly the first offering that Big Blue is making to help its customers handle big data. The last few weeks have seen other major players also move toward Hadoop offerings — namely Oracle and Microsoft — but IBM is offering its service in the cloud, something that those other companies aren't yet doing. (For its part, Microsoft does say that a Hadoop service will come to Azure by the end of the year.)

IBM joins Amazon Web Services as the only other company currently offering Hadoop in the cloud, notes GigaOm's Derrick Harris. "Big data — and Hadoop, in particular — has largely been relegated to on-premise deployments because of the sheer amount of data involved," he writes, "but the cloud will be a more natural home for those workloads as companies begin analyzing more data that originates on the web."

Harris also points out that IBM's Hadoop offering is "fairly unique" insofar as it targets businesses rather than programmers. IBM itself contends that "bringing big data analytics to the cloud means clients can capture and analyze any data without the need for Hadoop skills, or having to install, run, or maintain hardware and software."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Cleaning up location data with Factual Resolve

The data platform Factual launched a new API for developers this week that tackles one of the more frustrating problems with location data: incomplete records. Called Factual Resolve, the new offering is, according to a company blog post, an "entity resolution API that can complete partial records, match one entity against another, and aid in de-duping and normalizing datasets."

Developers using Resolve tell it what they know about an entity (say, a venue name) and the API can return the rest of the information that Factual knows based on its database of U.S. places — address, category, latitude and longitude, and so on.
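The idea of completing a partial record can be illustrated with a toy lookup. The sketch below is not Factual's API: the `PLACES` table, the `resolve` function, and the 0.8 similarity threshold are all hypothetical, the two records are invented, and the fuzzy matching comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Invented records standing in for a real places database.
PLACES = [
    {"name": "Example Cafe", "address": "123 Example St",
     "category": "Cafe", "latitude": 40.0, "longitude": -75.0},
    {"name": "Sample Hardware Store", "address": "456 Sample Ave",
     "category": "Hardware", "latitude": 41.0, "longitude": -80.0},
]

def similarity(a, b):
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(partial, places=PLACES, threshold=0.8):
    """Return the best-matching full record for a partial one, or None."""
    best, best_score = None, 0.0
    for place in places:
        # Average the similarity over the fields the caller supplied.
        scores = [similarity(str(value), str(place.get(key, "")))
                  for key, value in partial.items()]
        score = sum(scores) / len(scores) if scores else 0.0
        if score > best_score:
            best, best_score = place, score
    # Refuse to guess when nothing matches well enough.
    return best if best_score >= threshold else None

match = resolve({"name": "example cafe"})
print(match["address"] if match else "no confident match")
```

A production service would also have to weigh fields differently and handle abbreviations, but the shape is the same: the caller supplies what it knows, and the best-scoring record fills in the rest.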

Tyler Bell, Factual's director of product, discussed the intersection of location and big data at this year's Where 2.0 conference.

Google and governments' data requests

As part of its efforts toward better transparency, Google has updated its Government Requests tool this week with information about the number of requests the company has received for user data since the beginning of 2011.

This is the first time that Google is disclosing not just the number of requests, but the number of user accounts specified as well. It's also made the raw data available so that interested developers and researchers can study and visualize the information.

According to Google, requests from U.S. government officials for content removal were up 70% in this reporting period (January-June 2011) versus the previous six months. And the number of user data requests was up by 29% compared to the previous reporting period. Google also says it received requests from local law enforcement agencies to take down various YouTube videos — one on police brutality, one that was allegedly defamatory — but Google says that it did not comply. But of the 5,950 user data requests (impacting some 11,000 user accounts) submitted between January and June 2011, Google says that it has complied with 93%, either fully or partially.

The U.S. was hardly the only government making an increased number of requests to Google. Spain, South Korea, and the U.K., for example, also made more requests. Several countries, including Sri Lanka and the Cook Islands, made their first requests.

Got data news?

Feel free to email me.


October 21 2011

Top Stories: October 17-21, 2011

Here's a look at the top stories published across O'Reilly sites this week.

Visualization deconstructed: Why animated geospatial data works
When you plot geographic data onto the scenery of a map and then create a shifting window into that scene through the sequence of time, you create a deep, data-driven story.

Jason Huggins' Angry Birds-playing Selenium robot
Jason Huggins explains how his Angry Birds-playing robot relates to the larger problems of mobile application testing and cloud-based infrastructure.

Data journalism and "Don Draper moments"
The Guardian's Alastair Dant discusses the organization's interactive stories, including its World Cup Twitter replay, along with the steps his team takes when starting a new data project.

Building books for platforms, from the ground up
Open Air Publishing's Jon Feldman says publishers aren't truly embracing digital. They're simply pushing out flat electronic versions of print books.

Open Question: What needs to happen for tablets to replace laptops?
What will it take for tablets to equal — or surpass — their laptop cousins? See specific wish lists and weigh in with your own thoughts.

Velocity Europe, being held November 8-9 in Berlin, brings together performance and site reliability experts who share the unique experiences that can only be gained by operating at scale. Save 20% on registration with the code RADAR20.

October 19 2011

Visualization deconstructed: Why animated geospatial data works

In this, my first Visualization Deconstructed post, I'm expanding the scope to examine one of the most popular contemporary visualization techniques: animation of geospatial data over time.

The beauty of photo versus the wonder of film

In a previous post, Sebastien Pierre provided some excellent analysis about the illuminating visualization produced by Paul Butler, which examined the relationships between Facebook users around the world.

Here, we saw the intricate beauty that comes from a designer who finds the sweet spot of insightful effectiveness and aesthetic elegance. This accomplishment is all the more impressive when demonstrated through a static visualization.

Sebastien shared a great quote, attributed to Paul Butler, which read: "Visualizing data is like photography. Instead of starting with a blank canvas, you manipulate the lens used to present the data from a certain angle."

A static visualization is a single shot from this metaphorical camera: a carefully conceived, arranged and executed vision which, at its best, manages to portray the motion of a story without the deployment of movement.

If the static visualization is a photograph, an interactive visualization, by contrast, can be considered a movie. In today's technological environment, interactives expand the creative opportunities, enabling multi-talented designers to fully unleash the dynamism of their data.

One of the most powerful examples of interactive visualization is the animation of geospatial data. In its simplest state, this is geographical data with a timestamp, but when you plot this data onto the scenery of a map and then create a shifting window into the scene through the sequence of time, you create a data-driven story.
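The "shifting window" can be sketched in a few lines. This is a minimal illustration rather than how any of the visualizations discussed here were built: the `frames` function and its parameters are hypothetical, and the sample points are invented. Each frame holds only the points whose timestamps fall inside the current window, and a renderer would draw one frame after another.

```python
from datetime import date, timedelta

def frames(points, start, end, step_days=365, window_days=365):
    """Yield (frame_start, visible_points) pairs for a sliding time window.

    points: list of (latitude, longitude, date) tuples.
    """
    step = timedelta(days=step_days)
    window = timedelta(days=window_days)
    current = start
    while current <= end:
        # Keep only the points that fall inside this frame's window.
        visible = [(lat, lon) for lat, lon, when in points
                   if current <= when < current + window]
        yield current, visible
        current += step

# Invented sample data: three timestamped locations.
sample = [
    (40.0, -75.0, date(1800, 1, 1)),
    (41.0, -80.0, date(1800, 6, 1)),
    (39.0, -90.0, date(1801, 3, 1)),
]

for frame_start, visible in frames(sample, date(1800, 1, 1), date(1801, 1, 1)):
    print(frame_start, len(visible))
```

Making the window longer than the step leaves older points on screen for several frames, which is one way to get the accumulating effect seen in many of these maps.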

Examining the power of animated geospatial data

As the popularity and spread of data visualization practice expands, so too does the gallery of fantastic examples of animated geospatial data. We've been fortunate to see some great developments in recent times:

Visualizing US expansion through post offices, by Derek Watkins

Journalism's Voyage West — The growth of Newspapers Across the US: 1690-2011, by Stanford University


Personal Messages from Twitter, created by Twitter

So what are the design elements and characteristics that make these visualizations so powerful?

Data layer

In each of the examples above, we witness the compelling effect of data transformed from an abstract state to a physical representation on a map, instantly bringing it to life. This is the primary layer of the visualization. The challenge for the designer is to choose the right marker with which to represent the data point, with size and color being the most prominent considerations.

In some cases, we see the design of data markers being used to combine the representation of additional data variables beyond the geographical positioning. These might include a data category encoded through color (as seen in the presentation of depth in the "The Japanese Quake Map") or a quantitative measure revealed through size (as demonstrated in "The Geography of Job Losses").

Data points clearly need to be sized so they are visible without expanding beyond their specific locations or positions and cluttering the display. This is typically a problem associated with markers that double up in duty to represent quantitative values, as seen in "A Day in the Life of." Here, we see the radius of the circular shapes expanding far and wide across large areas, which can hide markers or bleed them into other data points.

Expanding circles, such as those in "A Day in the Life of," can hide other data points.

The challenge of displaying multiple data items around or on a similar geographical location is also critical, especially when working within the confines of such a small mapping design space. One of the most elegant solutions is seen in the "US expansion through post offices" visualization, which shows darker clusters of data points where there are clearly dense volumes.

Color and Background

The effectiveness of a visualization will be strongly influenced by how well the designer synthesizes the data layer with the background layer. The key influencing properties here are the color scheme and map choice.

Color choices should be influenced by the need to amplify the recognition and visibility of the data points. In the examples displayed, we see contrasting approaches to the deployment of color. In "Journalism's Voyage West," we see dark mapping shades working well as the backdrop to highlight the bright white data points as they emerge. The visualization of "US expansion through post offices," on the other hand, switches this approach with a very light mapping image and an unsaturated — and probably semitransparent — brown hue to represent data. These color properties help resolve the overlay of multiple data points, as mentioned above.

For the mapping imagery, the issue is whether to present a detailed terrain like "The Japanese Quake Map" or just to use the shape of the geographical regions, like in the "Personal Messages from Twitter" project. In the quake map, the data points are generally plotted out at sea; otherwise there would be quite a visual clash, making it difficult for the data points to stand out. Unless you really need the geographical details, shapes alone — perhaps with limited labeling — work very well and help keep the data center stage.

There is the option, of course, to use no mapping layer at all, as exhibited in the "Visualizing Facebook" projects. In this case, patterns formed by the locations of and relationships between Facebook users actually illustrate much of the world map. We then learn as much from the darkness of expected and absent regions as we do from the areas that have data points. However, this is only really appropriate if you have vast amounts of data to plot.

Animation and interaction

Central to the impact and effectiveness of these designs is the simple animation of the data over time. Some exist with just a play/pause button; others have more interactive options to control the speed, flow and progress of the timeline.

For the viewer, there is palpable excitement when anticipating how the patterns will evolve; when the data spread will increase or decline; when the data activity will speed up or slow down; and when it will pop up in new, previously uncharted territories.

One of the most interesting design features is how designers choose to manage the presentation of new data points as time progresses. The pulsing rain-drop effect used in Nathan Yau's "The Growth of Walmart and Sam's Club (1962-2010)" visualization is a wonderfully conceived approach. It really helps to briefly draw the viewer's attention to new and emerging data locations. Similarly, on the "US expansion through post offices" project, the designer employs a clever blurring effect, subtly relegating the prominence of each data point soon after it has appeared.

The rain-drop effect used in "The Growth of Walmart and Sam's Club (1962-2010)" momentarily draws the viewer's attention to emerging data locations.
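One simple way to relegate a marker's prominence after it appears is to tie its opacity to its age. The sketch below is an assumption-laden stand-in, not the technique either designer used: it fades opacity rather than blurring, and the `marker_opacity` function, the five-unit fade time, and the 0.3 floor are all hypothetical.

```python
def marker_opacity(age, fade_time=5.0, floor=0.3):
    """Opacity for a marker that appeared `age` time units ago.

    A new marker is fully opaque, then fades linearly to `floor`.
    """
    if age < 0:
        return 0.0  # the marker has not appeared yet
    if age >= fade_time:
        return floor  # old markers stay visible but subdued
    return 1.0 - (1.0 - floor) * (age / fade_time)

# Opacity of one marker over its first few animation steps.
print([round(marker_opacity(step), 2) for step in range(7)])
```

Keeping the floor above zero means older points remain on the map as context while the newest ones draw the eye.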

Other creative features can add extra usability to the interactive experience, such as panning across and zooming within the map to see more localized details of interest.


Annotation, the sometimes neglected inclusion of information that explains and facilitates the viewing experience, is a key visualization layer. It can elevate a visualization from a nice animation to something truly revealing. For example, milestone points along a horizontal timeline provide contextual understanding about key points in time.

While not included in the visualization itself, Derek Watkins' associated blog post does offer some fascinating narrative to go with his "US expansion through post offices" work. However, the best example of thorough annotation has to be the full site version of the "Journalism's Voyage West" project. Aside from detailed, explanatory milestones, this version also includes a thorough introduction, clear data legends and the opportunity to explore the dataset underlying each data point.


Finally, we look at the value these interactive designs offer. The best designs, including those presented above, provide a dual utility: on one hand, revealing whole stories over time and location, and on the other, allowing us to unearth our own narratives through exploration.

Whether it is observing the journey of the United States' population growth and expansion from east to west or the global relationships of those touched by the Japanese earthquakes, these animations reveal patterns and relationships we would have otherwise not seen when viewing the data alone. This encapsulates the very purpose of data visualization.



October 11 2011

Why indoor navigation is so hard

Remember the days before you could pull your smartphone out of your pocket and get instant directions from your current location to anywhere in the world? It's kind of foggy for me, too.

In fact, I'm so used to relying on my smartphone that I feel increasingly flustered when wandering the aisles of Costco, locating the elephant house at the zoo, or searching for decent food at the airport. Shouldn't my magical pocket computer help me with this, too?

The answer is "yes," of course. But there are challenges to implementing indoor navigation today.

User interface

The maps app on your smartphone has one primary concern: getting you from 106 Main Street to 301 Sunny Lane, or from work to home, or from home to Taco Bell. Why are you going to Taco Bell and what percentage of your taco beef will be meat filler? The app doesn't need to know. Thus, the typical interface for a smartphone maps app is a big map and a search box.

You might assume that an indoor navigation app for, say, the American Museum of Natural History has the same primary concern: getting you from the main entrance to the T-Rex. But why go to the T-Rex? How do I know there's a T-Rex here anyway? And what if my kids have 20 things they want to see and we only have two hours to see everything? And what's going on this week — are there special exhibits?

It turns out that creating a useful indoor navigation app requires more than navigation. So, an effective mobile UI should be more "smart guide" and less "paper maps" on your smartphone.

It's a design challenge, like any other mobile app. Help visitors decide where they need to go first, then direct them there.


Getting directions to the plumbing section of a store is certainly useful. But let's say you're looking for a particular Delta kitchen faucet. Wouldn't it be more useful to search in a retail app for "Delta faucet," check that it's in stock, then get directions right to that product? Who cares if it's in the plumbing section or the kitchen section?

To be truly useful, an app needs to integrate with dynamic data.

Similarly, a university campus app could offer to guide a student to "Kennedy Hall Room 203," but wouldn't it be better to search for "Econ 101" instead? Who cares where Econ 101 takes place today? Even better, just have students enter their name once, fetch their schedule, and automatically take them to whatever their next class is. Why make users do more work than they have to?

Current location

OK, so you decide you want directions to that Delta faucet I mentioned earlier. Ideally, the app will automatically start from your current location.

Now comes the great sadness: GPS, as you may know, does not work indoors. The satellite signals are just too weak to penetrate anything much thicker than the metal roof of your car.

However, all modern smartphones have Wi-Fi built in, and wireless networks are common enough in indoor spaces that an app could easily scan for known access points and calculate your position using trilateration.
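For readers curious about the math, here is a minimal trilateration sketch. It rests on several assumptions that are not from any particular product: signal strength is converted to distance with a log-distance path-loss model, the `tx_power_dbm` and `path_loss_exponent` defaults are illustrative guesses that would need calibrating per building, and the access point coordinates are made up.

```python
import math

def rssi_to_distance(rssi_dbm, tx_power_dbm=-40.0, path_loss_exponent=3.0):
    """Estimate distance in meters from signal strength (log-distance model).

    tx_power_dbm is the assumed RSSI measured one meter from the access point.
    """
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exponent))

def trilaterate(anchors):
    """Least-squares 2D position from [(x, y, distance), ...], three or more."""
    if len(anchors) < 3:
        raise ValueError("need at least three access points")
    x0, y0, d0 = anchors[0]
    # Subtracting the first circle equation from the others removes the
    # squared unknowns and leaves a linear system A @ [x, y] = b.
    rows = []
    for xi, yi, di in anchors[1:]:
        a1 = 2 * (xi - x0)
        a2 = 2 * (yi - y0)
        b = d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2
        rows.append((a1, a2, b))
    # Solve the 2x2 normal equations (A^T A) p = A^T b.
    s11 = sum(a1 * a1 for a1, _, _ in rows)
    s12 = sum(a1 * a2 for a1, a2, _ in rows)
    s22 = sum(a2 * a2 for _, a2, _ in rows)
    t1 = sum(a1 * b for a1, _, b in rows)
    t2 = sum(a2 * b for _, a2, b in rows)
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("access points are collinear")
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Made-up access point positions in meters; the phone is actually at (3, 4).
aps = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
readings = [(x, y, math.hypot(3 - x, 4 - y)) for x, y in aps]
print(trilaterate(readings))  # approximately (3.0, 4.0)
```

With perfect distances the answer is exact up to rounding. Real signal readings are noisy, so adding more access points and letting the least-squares step average out the error usually helps.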

Here's the catch, however: Unlike the wide open world of Android, developers on the iPhone side aren't allowed to perform these Wi-Fi "signal scans."

Fortunately, there are alternatives. One approach is to make the building do the work instead of the device. Some Wi-Fi installations, such as the Cisco MSE, can determine the location of any wireless device in the building. The access points themselves listen for the Wi-Fi signals created by your phone, then estimate its position via trilateration. This solution has been deployed successfully at a few locations, including at the American Museum of Natural History.

Designing for inaccuracy

One consequence of most indoor positioning systems is a lower degree of accuracy compared to GPS. For instance, indoor systems can usually guess which room you're in, and that's about it. Precision depends on signal fluctuations, which depend on factors like how many people are in the room, how you're holding your phone, and other vagaries.

An effective mobile app must design for this reality from the very beginning. One technique that will help users greatly is to point out quickly recognizable features of the environment.

The Meridian app, for example, uses a short text label to describe each direction step. (Disclosure: I'm the CTO and co-founder of Meridian.) Below, "Rose Room" is clearly marked in the "real world" space and easy to spot, as are the stairs headed down.

Meridian app
The Meridian app uses step-by-step text labels.

The best way to combat inaccuracy, however, is by making it as easy as possible for users to self-correct. In the Meridian app, the map can easily be dragged, rotated, zoomed in and out, and the turn-by-turn steps can be flipped through with ease. If the starting location isn't perfect, the user will instinctively drag around and figure it out.

Putting it all together

Building amazing indoor app experiences is not only possible, it's already happening. This year alone, many places — from stadiums and retailers to museums and corporate campuses — have launched apps that are used by hundreds of people every day for navigation and to access location-based content.

Indoor Wi-Fi positioning technology isn't a research project anymore; it's out there and works with the devices we all now carry. With the right user interfaces, it can be just as effective as GPS is outdoors.

It's time to spread the incredible experience of wandering around a place as enormously complex as the History Museum without ever feeling lost.


September 01 2011

Strata Week: What happens when 200,000 hard drives work together?

Here are a few of the data stories that caught my attention this week.

IBM's record-breaking data storage array

IBM Research is building a new data storage array that's almost 10 times larger than anything that's been built before. The data array comprises 200,000 hard drives working together, with a storage capacity of 120 petabytes — that's 120 million gigabytes. To give you some idea of the capacity of the new "drive," writes MIT Technology Review, "a 120-petabyte drive could hold 24 billion typical five-megabyte MP3 files or comfortably swallow 60 copies of the biggest backup of the Web, the 150 billion pages that make up the Internet Archive's WayBack Machine."
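The figures above are easy to check with a little arithmetic. The sketch below assumes decimal units (1 petabyte = 10^15 bytes). The per-drive and per-backup numbers are derived here, not stated in the source, and they ignore any capacity set aside for redundancy.

```python
PETABYTE = 10**15  # decimal units, as storage vendors usually count
GIGABYTE = 10**9
MEGABYTE = 10**6

capacity = 120 * PETABYTE
drives = 200_000

gigabytes = capacity // GIGABYTE
mp3_files = capacity // (5 * MEGABYTE)
gb_per_drive = capacity // drives // GIGABYTE
pb_per_backup = capacity // 60 // PETABYTE

print(gigabytes)      # 120000000, i.e. 120 million gigabytes
print(mp3_files)      # 24000000000, i.e. 24 billion five-megabyte files
print(gb_per_drive)   # 600 gigabytes per drive on average
print(pb_per_backup)  # 2 petabytes implied for each Web backup copy
```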

Data storage at that scale creates a number of challenges, including — no surprise — cooling such a massive system. But other problems include handling failure, backups and indexing. The new storage array will benefit from other research that IBM has been doing to help boost supercomputers' data access. Its General Parallel File System was designed with this massive volume in mind. The GPFS spreads files across multiple disks so that many parts of a file can be read or written at once. This system already demonstrated that it can perform when it set a new scanning speed record last month by indexing 10 billion files in just 43 minutes.
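The general idea of spreading a file across disks, often called striping, can be sketched as follows. This is a toy illustration of round-robin striping over in-memory lists, not GPFS's actual implementation. The `stripe` and `reassemble` functions are hypothetical names.

```python
def stripe(data, num_disks, block_size):
    """Split bytes into fixed-size blocks dealt round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for index in range(0, len(data), block_size):
        block_number = index // block_size
        disks[block_number % num_disks].append(data[index:index + block_size])
    return disks

def reassemble(disks):
    """Rebuild the original bytes by reading one block from each disk in turn."""
    out = []
    longest = max((len(d) for d in disks), default=0)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)

layout = stripe(b"abcdefghij", num_disks=3, block_size=2)
print(layout)
print(reassemble(layout))
```

Because consecutive blocks live on different disks, a real system can read or write several of them at the same time, which is where the speedup comes from.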

IBM's new 120-petabyte drive was built at the request of an unnamed client that needed a new supercomputer for "detailed simulations of real-world phenomena."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

Infochimps' new Geo API

The data marketplace Infochimps released a new Geo API this week, giving developers access to a number of disparate location-related datasets via one API with a unified schema.

According to Infochimps, the API addresses several pain points that those working with geodata face:

  1. Difficulty in integrating several different APIs into one unified app
  2. Lack of ability to display all results when zoomed out to a large radius
  3. Limitation of only being able to use lat/long

To address these issues, Infochimps has created a new simple schema to help make data consistent and unified when drawn from multiple sources. The company has also created a "summarizer" to intelligently cluster and better display data. And finally, it has also enabled the API to handle queries other than just those traditionally associated with geodata, namely latitude and longitude.
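As a rough illustration of those two ideas, the sketch below maps records from two made-up sources onto one shared shape and then groups them into grid cells, which is one simple way a "summarizer" could cluster results at wide zoom levels. The source names, field names, and grid approach are all hypothetical; none of this is taken from the Infochimps API.

```python
from collections import defaultdict


def to_unified(record: dict, source: str) -> dict:
    """Map a source-specific record onto one shared shape (hypothetical field names)."""
    if source == "source_a":
        return {"name": record["title"], "lat": record["lat"], "lon": record["lng"]}
    if source == "source_b":
        lat, lon = record["coordinates"]
        return {"name": record["label"], "lat": lat, "lon": lon}
    raise ValueError(f"unknown source: {source}")


def summarize(places: list[dict], cell_degrees: float) -> dict:
    """Group place names by the square grid cell that contains them."""
    cells = defaultdict(list)
    for place in places:
        key = (int(place["lat"] // cell_degrees), int(place["lon"] // cell_degrees))
        cells[key].append(place["name"])
    return dict(cells)


places = [
    to_unified({"title": "Cafe", "lat": 40.71, "lng": -74.0}, "source_a"),
    to_unified({"label": "Park", "coordinates": (40.72, -73.99)}, "source_b"),
    to_unified({"title": "Pier", "lat": 37.8, "lng": -122.4}, "source_a"),
]
print(summarize(places, cell_degrees=1.0))
```

A larger `cell_degrees` value yields fewer, bigger clusters, which is the kind of behavior a map needs when zoomed out to a large radius.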

As we seek to pull together and analyze all types of data from multiple sources, this move toward a unified schema will become increasingly important.

Hurricane Irene and weather data

The arrival of Hurricane Irene last week reiterated the importance not only of emergency preparedness but of access to real-time data — weather data, transportation data, government data, mobile data, and so on.

New York Times Hurricane Irene tracker
Screenshot from the New York Times' interactive Hurricane Irene tracking map. See the full version.

As Alex Howard noted here on Radar, crisis data is becoming increasingly social:

"We've been through hurricanes before. What's different about this one is the unprecedented levels of connectivity that now exist up and down the East Coast. According to the most recent numbers from the Pew Internet and Life Project, for the first time, more than 50% of American adults use social networks. 35% of American adults have smartphones. 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. The growth of an Internet of things is an important evolution. What we're seeing this weekend is the importance of an Internet of people."

Got data news?

Feel free to email me.

Hard drive photo: Hard Drive by walknboston, on Flickr


ePayments Week: Financial Times bets on its web app

Here's what caught my attention in the payment space this week.

Financial Times drops iOS app

There are at least two big issues involved in The Financial Times' decision to pull its iPad and iPhone apps from the iOS App Store this week: one is about data; the other is about money. The FT, along with other publishers, has complained that the key sticking point in Apple's new requirement that all purchases, including subscriptions, go through the App Store has been the question of who controls the relationship with the subscriber. The publishers see these as their readers, and they want to know everything about them. And when readers upgrade or renew their subscriptions, the publishers want to deal directly with them. The view from Cupertino is different: these readers appear to be iTunes subscribers making an in-app purchase. For delivering this consumer to the app maker (the FT in this case), Apple would like its 30% cut of revenue. That may have been a factor in the FT's decision, though it seems the amount of money it would have had to give up (Robert Andrews at PaidContent figured it at $1.63 million at the high end) would have been fairly insignificant to FT's parent company Pearson, and even more so to Apple with its billions in cash.

The FT's withdrawal comes as no surprise. Its online and print versions have been encouraging readers all summer to dump their iOS apps and switch to the FT's "web app," its HTML5 site that displays nicely on the iPhone and iPad. The Wall Street Journal reported that more than 550,000 users have the web app. PaidContent's Andrews speculated that the web app's adoption may have been spurred by a promotional offer earlier this summer granting full access to the site. (The FT's site is primarily a paid-subscription site, allowing only 10 free articles to registered users every 30 days.)

Lest we wonder if "the pink 'un" knows what it's doing in walking out on Apple and its 200 million store members, we should note that the FT has run successfully on its paid subscription model for more than 10 years, even during the days when most mainstream news publications believed they could never charge for online content. Some publishers have come around to the FT's model, most notably The New York Times, which resumed charging for full access to online content earlier this year.

What's more, the FT says it hasn't completely abandoned Apple and, according to a Reuters report, still plans to distribute future apps in its store, including one for its luxury weekend magazine, "How to Spend It." Apparently, those are subscribers that the FT doesn't mind sharing with Apple.

Android Open, being held October 9-11 in San Francisco, is a big-tent meeting ground for app and game developers, carriers, chip manufacturers, content creators, OEMs, researchers, entrepreneurs, VCs, and business leaders.

Save 20% on registration with the code AN11RAD

Flickr's geofencing: setting access based on location

Last week, I wrote about geofencing in the context of Placecast's service to announce deals and other offers when subscribers enter a virtually delineated space. This week, Flickr rolled out another interesting use of geofencing: automatically setting privacy restrictions on photos based on where they were taken. Flickr's blog explains the new feature, and creating a geofence and linking it to access preferences is a quick and easy process.
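The core of such a feature is a distance check between a photo's geotag and each fence. The following is a minimal sketch of that idea, assuming circular fences and a "first matching fence wins" rule; the field names, privacy labels, and coordinates are made up for illustration, and Flickr's actual implementation may differ.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def privacy_for_photo(lat, lon, fences, default="public"):
    """Return the privacy level of the first fence containing the photo, else the default."""
    for fence in fences:
        if haversine_m(lat, lon, fence["lat"], fence["lon"]) <= fence["radius_m"]:
            return fence["privacy"]
    return default


# Hypothetical fence around a home address (coordinates are illustrative only).
fences = [{"lat": 37.7599, "lon": -122.4148, "radius_m": 250, "privacy": "friends_and_family"}]
print(privacy_for_photo(37.7600, -122.4149, fences))
print(privacy_for_photo(40.0, -74.0, fences))
```

Checking fences in a fixed order keeps the rule predictable when two fences overlap, though other tie-breaking policies (for example, most restrictive wins) would be equally reasonable.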

Flickr geofence example

Flickr's geofencing is a mashup of two services that its members are already familiar with: geotagging photos and setting limits on who can see them. But in combining these two simple features, Flickr (and parent Yahoo) will offer many consumers the first glimpse of a new degree of control they will gain over the intersection of their digital and physical worlds: setting controls over what happens when they move from one location to another.

As a bonus, there's a nice post describing the details of the feature and the fun process the coders went through to pull it together: "We met at Nolan's house, ate a farmer's breakfast, and brainstormed."

Is daily deal fatigue getting you down?

Robert Hof has a compelling column on Forbes: "5 Reasons Daily Deals are Tanking — and 3 Reasons They're Not Dead Yet." Movements this week among the category's top players would seem to confirm the ambiguity of that headline. Facebook has said it will stop its four-month old Deals program and Yelp said it would scale its program back (CEO Jeremy Stoppelman said "it hasn't been all rainbows and unicorns"). Meanwhile, Google appeared to be ratcheting up its Offers program, even promoting an offer on its legendarily sparse home page ($5 tickets to New York's American Museum of Natural History). And Groupon continued to storm toward its anticipated IPO.

I tend to agree with one of Hof's main points: too many offers are for expensive, bucket-list or birthday-party events, like flying in a hot air balloon or learning to scuba dive. Google Offers appears to take a more budget-friendly approach, offering things that people really buy every day. Google launched its Offers in Portland in June with a $3 deal at Floyd's coffee, and it continues to promote cheap recession-friendly luxuries, like $7 worth of frozen yogurt. But even Google Offers suffers from an excess of kayak rental offers.

I have to wonder if all the wine-tasting and helicopter ride offers are part of the reason why Groupon has seen its web-based traffic drop by half since June, as reported by Experian Hitwise. It may be that while there is a continuous appetite for bargains on things we consume every day (like coffee and bread), it's more difficult to sustain interest in endless offers for boot camps and laser-based body slimming.

Got news?

News tips and suggestions are always welcome, so please send them along.

If you're interested in learning more about the payment development space, check out PayPal X DevZone, a collaboration between O'Reilly and PayPal.

Fence photo: Fence Friday by DayTripper (Tom), on Flickr


August 25 2011

ePayments Week: The rise of location-triggered offers

Here's what caught my attention in the payment space this week.

Geofencing: As long as you're here ...

One of the promises of mobile advertising, at least from the merchant's perspective, has been the potential to advertise to customers when they're near your store and can act immediately (and impulsively) on your offer. To make these location-triggered offers, merchants need to delineate a "geofence" around their retail outlets: a radius or polygonal area in which customers who have opted into a deal program can be notified on their mobiles that an offer is available nearby. Indeed, Groupon is working on adding such location-based deals to its daily offers, according to a letter sent from its general counsel David Schellhase to two U.S. Representatives who were asking about Groupon's privacy policies.

Placecast is one company that has been working on this issue. Its service allows merchants or event planners to delineate a virtual perimeter around their locations that marks their space. When customers who have opted in to receive alerts about their retail brand or event enter one of these locations, they get a text message (a "ShopAlert"), describing the offer or event. In an interview, Placecast CEO Alistair Goodman said the company has focused on text messages thus far because they're very effective. By some measures, 90% of all texts are opened within three minutes of receiving them.
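For a polygonal perimeter, the usual building block is a point-in-polygon test plus a check that the customer has just crossed from outside to inside. The sketch below uses the standard ray-casting technique and treats lat/long as flat coordinates, which is a reasonable simplification only for small fences. It is an illustration of the general approach, not Placecast's implementation, and the function names are hypothetical.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test: True if (lat, lon) lies inside the polygon's vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude
            crossing_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing_lon:
                inside = not inside
    return inside


def entered_fence(previous, current, polygon):
    """True only on the transition from outside to inside, so an alert fires once."""
    return not point_in_polygon(*previous, polygon) and point_in_polygon(*current, polygon)


square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(entered_fence((5, 15), (5, 5), square))  # moved from outside to inside
print(entered_fence((5, 5), (5, 6), square))   # already inside, no new alert
```

Triggering on the transition rather than on every position update is what keeps a subscriber from receiving the same message repeatedly while they remain inside the perimeter.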

This week the company expanded its service so that ShopAlerts can also work as notifications that are linked to apps. Just as with other notifications on iOS and Android, the relevant app doesn't need to be open to receive the notification, but clicking on the notification can trigger the app to open. The new notification capabilities would seem to go well with expected improvements in the way that iOS handles notifications.

Goodman said the key to success in mobile coupons is making the message relevant. "We're only sending a text if it's the right place and time." That's key since most of us are not very good coupon clippers; we're unlikely to retain, remember, and use an offer if we don't do so almost immediately. Goodman said that location-triggered delivery is highly effective with "exceedingly high" response rates: between 11% and 60% of users are likely to visit a store when pinged with an offer if they're nearby, and up to 46% are likely to make a purchase.

More than three million subscribers, mostly in the US and UK, are currently receiving offers from Placecast — though they don't see them as coming from Placecast, which operates as a "white brand" service to other businesses. Goodman emphasizes that subscribers have all opted in via their telecom carriers or a retail brand like North Face. With that much data, there's a back-end business for the company in aggregating and anonymizing the information so it can analyze it and feed data back to merchants on which offers are most effective and when. Indeed, the company's self-service tool with which clients can manage their offers online also includes some data tools for this type of analysis.

It remains to be seen how many customers will be comfortable with this level of interaction with stores — even if they are their favorite brands. On the up side, services like Placecast are merely sending out information based on location awareness; consumers aren't being asked to divulge any financial information. On the down side, some percentage of customers are always going to remain fairly uncomfortable broadcasting their locations in this way to businesses, even if doing so offers tangible rewards. The key to success will depend on how large that percentage is.

Survey: iPhone users keen on mobile payments

Results of a UK survey about customers' willingness to use mobile payments and banking apps suggest that iPhone users are somewhat more likely to embrace mobile payments than their Android- and Blackberry-carrying peers. According to a summary report on GigaOm, the survey found 46% of iPhone users said they would pay bills with their mobiles compared to only 21% of the total group surveyed. Not surprisingly, younger folks (18-24 years old) were also more comfortable with the idea than their older brothers and sisters.

YouGov's ongoing research has provided some other insights on platform differences among UK users. Loosely generalized, they paint a picture of Blackberry owners as more driven and responsible compared to iPhone users who are more likely to overdraw their bank accounts and spend the day on social networks. According to YouGov:

  • BlackBerry users are likely to earn more, with 10% earning over £50,000 a year compared to 7% of iPhone users and 5% of Android users.
  • iPhone users spend more time on their phones than users of any of the other top models, with 18% spending more than four hours a day on it compared to 4% apiece of Android and BlackBerry users.
  • 63% of iPhone users say social networking apps are among the three they spend the most time on compared to other types.

These results, combined with other research that has found iPhone users may be a more lucrative market for developers than Android users, suggest iPhone users are quicker to spend money on their phones. We can speculate on the reasons. Certainly the higher price point (in many cases) of an iPhone attracts a user who is willing to spend more on technology and its accoutrements. Another possible factor could be their familiarity with the Apple retail model: iPhone users are accustomed to a tightly controlled shop where they deal with a single company that they trust, the same company that made their phone and its software. The Android platform, by comparison, may require users to navigate a telecom interface, Android's operating system, a hardware maker's device, and perhaps a fourth-party app store. That could create a less-structured environment where users may be less comfortable spending money. McAfee's recent report on Android's greater susceptibility to malware may only compound this feeling.

Android phones are the new destination for crapware

And speaking of trust, are telecoms burning up the goodwill of their customers who choose Android handsets by loading them with crapware? Mike Jennings compares this trend to the same syndrome experienced on Windows-based desktops and laptops in recent years, where the excitement of discovering your new gadget is often dampened by splash screens with offers to sign up for security or media services.

Jennings notes that it's worse this time around since the mobile software, which can degrade performance, is more difficult if not impossible for average users to uninstall. He blames the network carriers, who load up the handsets to fulfill lucrative deals they've signed with software vendors. But there may be a limit to what customers will accept. Earlier this month, the same publication reported that Vodafone was backpedalling on an over-the-air upgrade that loaded up HTC Desire handsets because customers had complained of being tricked into installing the software. "We've listened to feedback from customers on a number of points around the recent 360 Android 2.1 update and made some changes to the rollout plan," Vodafone posted sheepishly on its own forums.

Got news?

News tips and suggestions are always welcome, so please send them along.

If you're interested in learning more about the payment development space, check out PayPal X DevZone, a collaboration between O'Reilly and PayPal.

Fence photo: Fence Friday by DayTripper (Tom), on Flickr


July 18 2011

Four short links: 18 July 2011

  1. Organisational Warfare (Simon Wardley) -- notes on the commoditisation of software, with interesting analyses of the positions of some large players. On closer inspection, Salesforce seems to be doing more than just commoditisation with an ILC pattern, as can be clearly seen from Radian's 6 acquisition. They also seem to be operating a tower and moat strategy, i.e. creating a tower of revenue (the service) around which is built a moat devoid of differential value with high barriers to entry. When their competitors finally wake up and realise that the future world of CRM is in this service space, they'll discover a new player dominating this space who has not only removed many of the opportunities to differentiate (e.g. social CRM, mobile CRM) but built a large ecosystem that creates high rates of new innovation. This should be a fairly fatal combination.
  2. Learning to Win by Reading Manuals in a Monte-Carlo Framework (MIT) -- starting with no prior knowledge of the game or its UI, the system learns how to play and to win by experimenting, and from parsed manual text. They used FreeCiv, and assessed the influence of parsing the manual shallowly and deeply. Trust MIT to turn RTFM into a paper. For human-readable explanation, see the press release.
  3. A Shapefile of the TZ Timezones of the World -- I have nothing but sympathy for the poor gentleman who compiled this. Political boundaries are notoriously arbitrary, and timezones are even worse because they don't need a war to change. (via Matt Biddulph)
  4. Microsoft Adventure -- 1979 Microsoft game for the TRS-80 has fascinating threads into the past and into what would become Microsoft's future.

April 14 2011

3 big challenges in location development

With the goal of indexing the entire web by location, Fwix founder and Where 2.0 speaker Darian Shirazi (@darian314) has had to dig in to a host of location-based development issues. In the following interview, he discusses the biggest challenges in location and how Fwix is addressing them.

What are the most challenging aspects of location development?

Darian Shirazi: There are three really big challenges. The first is probably the least difficult of the three, and that's getting accurate real-time information around locations. Crawling the web in real-time is difficult, especially if you're analyzing millions of pieces of data. It's even difficult for a company like Google just because there's so much data out there. And in local, it's even more important for crawling to be done in real-time, because you need to know what's happening near you right now. That requires a distributed crawling system, and that's tough to build.

The second problem, the most difficult we've had to solve, is entity extraction. That's the process of taking a news article and figuring out what locations it mentions or what locations it's about. If you see an article that mentions five of the best restaurants in the Mission District, being able to analyze that content and note, for example, that "Hog 'N Rocks" is a restaurant on 19th and Mission, is really tough. That requires us to linguistically understand an entity and what is a pronoun and what isn't a pronoun. Then you get into all of these edge conditions where a restaurant like Hog 'N Rocks might be called "Hogs & Rocks" or "Hogs and Rocks" or "H 'N Rocks." You want to catch those entities to be able to say, "This article is about these restaurants and these lat/longs."

The third problem we've had to tackle is building a places taxonomy that you can match against. If you use SimpleGeo's API or Google Places' API, you're not going to get the detailed taxonomy required to match the identified entities. You won't be able to get the different spellings of certain restaurants, and you won't necessarily know that, colloquially, "Dom and Vinnie's Pizza Shop" is just called "Dom's." Being able to identify those against the taxonomy is quite difficult and requires structuring the data in a certain way, so that matching against aliases is done quickly.
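A toy version of alias matching against a places taxonomy might look like the sketch below. It uses the place names and spellings mentioned in the interview, but the data structure, identifiers, and normalization rules are assumptions made for illustration; a production system like the one described would need far more than string normalization.

```python
import re


def normalize(name: str) -> str:
    """Lowercase, unify '&', standalone 'n and 'and', then drop punctuation."""
    text = name.lower()
    text = re.sub(r"\s*(&|(?<!\w)'n'?(?!\w)|\band\b)\s*", " and ", text)
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical taxonomy: each place has a canonical name plus known aliases.
PLACES = {
    "hog-n-rocks": {
        "name": "Hog 'N Rocks",
        "aliases": ["Hogs & Rocks", "Hogs and Rocks", "H 'N Rocks"],
    },
    "dom-and-vinnies": {
        "name": "Dom and Vinnie's Pizza Shop",
        "aliases": ["Dom's"],
    },
}


def build_alias_index(places: dict) -> dict:
    """Map every normalized name or alias to its place id for fast lookup."""
    index = {}
    for place_id, place in places.items():
        for alias in [place["name"], *place["aliases"]]:
            index[normalize(alias)] = place_id
    return index


def find_places(text: str, index: dict) -> list:
    """Return ids of places whose name or alias appears as whole words in the text."""
    padded = f" {normalize(text)} "
    return sorted({pid for alias, pid in index.items() if f" {alias} " in padded})


index = build_alias_index(PLACES)
print(find_places("We loved Hogs & Rocks, then grabbed a slice at Dom's.", index))
```

Precomputing the alias index is one way to read "structuring the data in a certain way, so that matching against aliases is done quickly": the expensive normalization happens once per alias rather than once per article.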

Identifying and extracting entities, like restaurants, is a challenge for location developers.

How are you dealing with those challenges?

Darian Shirazi: We have a bunch of different taggers that we put into the system that we've worked through over time to determine which are good at identifying certain entities. Some taggers are very good at tagging cities, some are better at tagging businesses, and some are really good at identifying the difference between a person, place, or thing. So we have these quorum taggers that are being applied to the data to determine the validity of the tag or whether a tag gets detected.
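One plausible reading of "quorum taggers" is a voting scheme in which a tag is accepted only when enough independent taggers agree. The sketch below shows that idea with three toy taggers; the taggers, labels, and threshold are hypothetical and are not Fwix's actual system.

```python
from collections import Counter


def city_tagger(token):
    """Toy tagger: recognizes a small fixed list of cities."""
    return "city" if token in {"San Francisco", "Oakland"} else None


def suffix_tagger(token):
    """Toy tagger: guesses 'business' from common business-name endings."""
    return "business" if token.endswith(("Cafe", "Pizza Shop")) else None


def title_case_tagger(token):
    """Toy tagger: guesses 'business' for multi-word title-cased names."""
    return "business" if token.istitle() and " " in token else None


def quorum_tag(token, taggers, quorum):
    """Return the label at least `quorum` taggers agree on, or None if none does."""
    votes = Counter()
    for tagger in taggers:
        label = tagger(token)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= quorum else None


taggers = [city_tagger, suffix_tagger, title_case_tagger]
print(quorum_tag("Mission Cafe", taggers, quorum=2))
print(quorum_tag("San Francisco", taggers, quorum=2))
```

In this toy setup the second call returns no tag because the taggers disagree, which mirrors the idea that a tag's validity depends on agreement rather than on any single tagger.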

The way that you test it is that you have a system that allows you to input hints, and you test the hint. The hints get put into a queue of other hints that we're testing. We run a regression test and then we see if that hint improved the tagging ability or made it worse. At this point, the process is really about moving the accuracy needle a quarter of a percent per week. That's just how this game goes. If you talk to the people at Google or Bing, they'll all say the same thing.
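The hint-testing loop described here amounts to measuring accuracy on a fixed labeled set before and after applying a hint. A minimal sketch of that regression check follows; the function names, the labeled examples, and the sample hint are all invented for illustration.

```python
def accuracy(tagger, labeled_examples):
    """Fraction of labeled examples the tagger gets right."""
    correct = sum(1 for text, expected in labeled_examples if tagger(text) == expected)
    return correct / len(labeled_examples)


def evaluate_hint(base_tagger, apply_hint, labeled_examples):
    """Keep a hint only if it strictly improves accuracy on the regression set."""
    before = accuracy(base_tagger, labeled_examples)
    after = accuracy(apply_hint(base_tagger), labeled_examples)
    return {"before": before, "after": after, "keep": after > before}


# Hypothetical regression set and baseline tagger.
examples = [
    ("Dom's", "business"),
    ("Oakland", "city"),
    ("Hog 'N Rocks", "business"),
    ("Mission", "city"),
]


def base_tagger(text):
    return "city"


def apostrophe_hint(tagger):
    """Hypothetical hint: names containing an apostrophe are probably businesses."""
    def hinted(text):
        return "business" if "'" in text else tagger(text)
    return hinted


print(evaluate_hint(base_tagger, apostrophe_hint, examples))
```

On a realistic regression set the gain from any one hint would be tiny, consistent with the "quarter of a percent per week" pace mentioned above.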

At Where 2.0 you'll be talking about an "open places database." What is that?

Darian Shirazi: A truly open database is a huge initiative, and something that we're working toward. I can't really give details as to exactly what it's going to be, but we're working with a few partners to come up with an open places database that is actually complete.

We think that an open places database is a lot more than just a list of places — it's a list of places and content, it's a list of places and the reviews associated with those businesses, it's the list of parks and the people that have checked in at those parks, etc. Additionally, an open places database, in our minds, is something that you can contribute to. We want users and developers and everyone to come back to us and say, "Dom and Vinnie's is really just called Dom's." We also want to be able to give information to people in any format. One of the things that we'll be allowing is if you contact us, we'll give you a data dump of our places database. We'll give you a full licensed copy of it, if you want it.

Where 2.0: 2011, being held April 19-21 in Santa Clara, Calif., will explore the intersection of location technologies and trends in software development, business strategies, and marketing.

Save 25% on registration with the code WHR11RAD

How do you see location technology evolving?

Darian Shirazi: Looking toward the future, I think augmented reality is going to be a big deal. I don't mean augmented reality in the sense of a game or in the sense of Yelp's Monocle, which is a small additive to their app to show reviews in the camera view. I think of augmented reality as you are at a location and you want to see the metadata about that location. I believe that when you have a phone or a location-enabled device, you should be able to get a sense for what's going on right there and a sense for the context. That's the key.

This interview was edited and condensed.


February 08 2011

Four short links: 8 February 2011

  1. Erase and Rewind -- the BBC are planning to close (delete) 172 websites on some kind of cost-cutting measure. i’m very saddened to see the BBC join the ranks of online services that don’t give a damn for posterity. As Simon Willison points out, the British Library will have archived some of the sites (and Internet Archive others, possibly).
  2. Announcing Farebot for Android -- dumps the information stored on transit cards using Android's NFC (near field communication, aka RFID) support. When demonstrating FareBot, many people are surprised to learn that much of the data on their ORCA card is not encrypted or protected. This fact is published by ORCA, but is not commonly known and may be of concern to some people who would rather not broadcast where they’ve been to anyone who can brush against the outside of their wallet. Transit agencies across the board should do a better job explaining to riders how the cards work and what the privacy implications are.
  3. Using Public Data to Fight a War (ReadWriteWeb) -- uncomfortable use of the data you put in public?
  4. CouchOne and Membase Merge -- consolidation in the commercial NoSQL arena. the merger not only results in the joining of two companies, but also combines CouchDB, memcached and Membase technologies. Together, the new company, Couchbase, will offer an end-to-end database solution that can be stored on a single server or spread across hundreds of servers.

January 27 2011

The "dying craft" of data on discs

To prepare for next week's Strata Conference, we're continuing our series of conversations with innovators working with big data and analytics. Today, we hear from Ian White, the CEO of Urban Mapping.

Mapfluence, one of Urban Mapping's products, is a spatial database platform that aggregates data from multiple sources to deliver geographic insights to clients. GIS services online are not a completely new idea, but White said the leading players haven't "risen to the occasion." That's left open some new opportunities, particularly at the lower end of the market. Whereas traditional GIS services still often deliver data by mailing out a CD-ROM or through proprietary client-server systems, Urban Mapping is one of several companies that have updated the model to work through the browser. Their key selling point, White said, is a wider range of licensing levels that allow it to support smaller clients as well as the larger ones.

Geographic data is increasingly free, but the value proposition for companies like Urban Mapping lies in the intelligence behind the data, and the organization that makes it accessible. "We're in a phase now where we're aggregating a lot of high-value data," White said. "The next phase is to offer tools to editorially say what you want."

Urban Mapping aims to provide the domain expertise on the demographic datasets it works with, freeing clients up to focus on the intelligence revealed by the data. "A developer might spend a lot of time looking through a data catalog to find a column name. If, for example, the developer is making an application for commercial real estate and they want demographic information, they might wonder which one of 1,500 different indicators they want." Delivering the right one is obviously of a higher value than delivering a list of all 1,500. "That saves an enormous amount of time."

To achieve those time savings, Urban Mapping considers the end users and their needs when they source data. As they design the architecture around it, they think about three layers: the design layer, the application layer, and the user interface layer atop that. "We look to understand the user's ultimate purpose and then work back from there," White said, as they organize tables, add metadata, and make sure data is accessible to technical and non-technical users efficiently.

"The notion of receiving a CD in the mail, opening it, reading the manual, it's kind of a dying craft," White said. "It's unfortunate that a lot of companies have built processes around having people on staff to do this kind of work. We can effectively allow those people to work in a higher-value area of the business."

You'll find the full interview in the following video:

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD


January 06 2011

Big data faster: A conversation with Bradford Stephens

To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with big data and analytics. Today, we hear from Bradford Stephens, founder of Drawn to Scale.

Drawn to Scale is a database platform that works with large data sets. Stephens describes its focus as slightly different from that of other big data tools: "Other tools out there concentrate on doing complex things with your data in seconds to minutes. We really concentrate on doing simple things with your data in milliseconds."

Stephens calls such speed "user time" and he credits Drawn to Scale's performance to its indexing system working in parallel with backend batch tools. Like other big data tools, Drawn to Scale uses MapReduce and Hadoop for batch processing on the back end. But on the front end, a series of secondary indices on top of the storage layer speed up retrieval. "We find that when you index data in the manner in which you wish to use it, it's basically one single call to the disk to access it," Stephens says. "So it can be extremely fast."
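The general idea of a secondary index can be shown with a small in-memory sketch: alongside the primary key-value store, keep a map from each indexed field value to the keys that hold it, so a lookup by that field avoids scanning every row. This is a generic illustration with hypothetical names, not Drawn to Scale's design, and Python dictionaries stand in for the storage layer.

```python
from collections import defaultdict


class IndexedStore:
    """Primary key-value store plus secondary indexes on chosen fields."""

    def __init__(self, indexed_fields):
        self.rows = {}
        self.indexes = {field: defaultdict(set) for field in indexed_fields}

    def put(self, key, row):
        """Insert or overwrite a row, keeping every secondary index in sync."""
        old = self.rows.get(key)
        if old is not None:
            for field, index in self.indexes.items():
                index[old.get(field)].discard(key)
        self.rows[key] = row
        for field, index in self.indexes.items():
            index[row.get(field)].add(key)

    def find(self, field, value):
        """Look up rows by an indexed field without scanning every row."""
        keys = self.indexes[field].get(value, ())
        return [self.rows[key] for key in sorted(keys)]


store = IndexedStore(indexed_fields=["city"])
store.put(1, {"user": "a", "city": "Seattle"})
store.put(2, {"user": "b", "city": "Portland"})
store.put(3, {"user": "c", "city": "Seattle"})
print(store.find("city", "Seattle"))
```

The trade-off is the usual one: writes do extra work to maintain each index, in exchange for reads that go straight to the matching keys.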

Big data tools and applications will be examined at the Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Drawn to Scale's customers include organizations working with analytics, in social media, in mobile ad targeting and delivery, and also organizations with large arrays of sensor networks. While he expects to see some consolidation on the commercial side ("I see a lot of vendors out there doing similar things"), on the open source side he expects to see a proliferation of tools available in areas such as geo data and managing time series. "People have some very specific requirements that they're going to cook up in open source."

You'll find the full interview in the following video:

June 17 2010

Four short links: 17 June 2010

  1. What is IBM's Watson? (NY Times) -- IBM joining the big data machine learning race, and hatching a Blue Gene system that can answer Jeopardy questions. Does good, not great, and is getting better.
  2. Google Lays Out its Mobile Strategy (InformationWeek) -- notable to me for Rechis said that Google breaks down mobile users into three behavior groups: A. "Repetitive now" B. "Bored now" C. "Urgent now", a useful way to look at it. (via Tim)
  3. BP GIS and the Mysteriously Vanishing Letter -- intrigue in the geodata world. This post makes it sound as though cleanup data is going into a box behind BP's firewall, and the folks who said "um, the government should be the depot, because it needs to know it has a guaranteed-untampered and guaranteed-able-to-access copy of this data" were fired. For more info, including on the data that is available, see the geowanking thread.
  4. Streamhacker -- a blog talking about text mining and other good things, with nltk code you can run. (via heraldxchaos on Delicious)

April 21 2010

Four short links: 21 April 2010

  1. Akihabara -- toolkit for writing 8-bit style games in Javascript using HTML5. (via waxy)
  2. Google Government Requests Tool -- moving services into the cloud loses you control and privacy (see my presentation on the subject), and one way is by making your mail/browser history/etc. easier for law enforcement to get their hands on. There's new moral ground here for service providers in what services they build, how they design their systems, and how they let people make informed choices. Google is one of the few companies around that are taking actions based on an analysis of what's right, and whether or not they fall short of your moral conclusions on the subject, you have to give them credit for responding to the moral challenge. Compare to Facebook whose moral response has been to reduce user control over the use of their data.
  3. World Bank Data -- the World Bank has released a huge amount of data about countries and economies, under an Open Knowledge Definition-compliant license. (via Open Knowledge Foundation)
  4. Moes Notes -- note-taking iPhone app that includes GPS reference, so you can associate a text/audio/photo/video note with time, date, or place. (via Rich Gibson)

April 08 2010

Brian Aker on post-Oracle MySQL

Brian Aker parted ways with the mainstream MySQL release, and with Sun Microsystems, when Sun was acquired by Oracle. These days, Aker is working on Drizzle, one of several MySQL offshoot projects. In time for next week's MySQL Conference & Expo, Aker discussed a number of topics with us, including Oracle's motivations for buying Sun and the rise of NoSQL.

The key to the Sun acquisition? Hardware:

Brian Aker: I have my opinions, and they're based on what I see happening in the market. IBM has been moving their P Series systems into datacenter after datacenter, replacing Sun-based hardware. I believe that Oracle saw this and asked themselves "What is the next thing that IBM is going to do?" That's easy. IBM is going to start pushing DB2 and the rest of their software stack into those environments. Now whether or not they'll be successful, I don't know. I suspect once Oracle reflected on their own need for hardware to scale up on, they saw a need to dive into the hardware business. I'm betting that they looked at Apple's margins on hardware, and saw potential in doing the same with Sun's hardware business. I'm sure everything else Sun owned looked nice and scrumptious, but Oracle bought Sun for the hardware.

The relationship between Oracle and the MySQL Community:

BA: I think Oracle is still figuring things out as far as what they've acquired and who they've got. All of the interfacing I've done with them so far has been pretty friendly. In the world of Drizzle, we still make use of the InnoDB plugin, though we are transitioning to the embedded version. Everything there has gone along swimmingly. In the MySQL ecosystem you have MariaDB and the other distributions. They're doing the same things that Ubuntu did for Debian, which is that they're taking something that's there and creating a different sort of product around it. Essentially though, it's still exactly the same product. I think some patches are flowing from MariaDB back into MySQL, or at least I've seen some notice of that. So for the moment it looks like everything's as friendly as it is going to be.

Is NoSQL a fad or the next big thing?

BA: There are the folks who say "just go use gdbm or Berkeley DB." What they don't fundamentally understand is that when you get into a certain data size, you're just going to be dealing with multiple computers. You can't scale up infinitely. Those answers come from an immaturity of understanding that when you get to a certain data size, everything's not going to fit on a single computer. When everything doesn't fit onto a computer, you have to be able to migrate data to multiple nodes. You need some sort of scaling solution there.

With Cassandra, and similar solutions, the only issues that come up are when they don't fit the data's usage pattern. Like for instance with data analytics. There is also still the "I need these predicates across a relational entity." That's the part where the key-value systems obviously fail. They have no knowledge of a relationship between two given items. So what happens then? Well, you can end up doing MapReduce. That's great if you've got an awful lot of computers and you don't really care about when the answer is going to be found. MapReduce works as a solution when your queries are operating over a lot of data; Google sizes of data. Few companies have Google-sized datasets though. The average sites you see, they're 10-20 gigs of data. Moving to a MapReduce solution for 20 gigs of data, or even for a terabyte or two of data, makes no sense. Using MapReduce with NoSQL solutions for small sites? This happens because people don't understand how to pick the right tools.
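
To make the point about key-value systems concrete, here is a minimal Python sketch. It is an illustration only, not anything from Aker or from Cassandra: the in-memory `store`, its key names, and the helper functions are all hypothetical. The store can only fetch a record by key, so answering a question that spans two kinds of records ("order totals per city") means visiting every record in a map step and combining the results in a reduce step.

```python
from collections import defaultdict

# Hypothetical key-value store: the only operation it offers is lookup by key.
store = {
    "user:1": {"name": "Ann", "city": "Austin"},
    "user:2": {"name": "Bob", "city": "Boston"},
    "order:100": {"user": "user:1", "total": 30},
    "order:101": {"user": "user:2", "total": 15},
    "order:102": {"user": "user:1", "total": 20},
}


def map_orders(key, value):
    """Map step: emit (city, total) for each order record."""
    if key.startswith("order:"):
        # The store knows nothing about the order -> user relationship,
        # so the application follows the reference itself.
        user = store[value["user"]]
        yield user["city"], value["total"]


def reduce_totals(pairs):
    """Reduce step: sum the emitted totals per city."""
    totals = defaultdict(int)
    for city, total in pairs:
        totals[city] += total
    return dict(totals)


def total_by_city():
    # Every record is visited; there is no index to narrow the scan.
    pairs = (
        pair
        for key, value in store.items()
        for pair in map_orders(key, value)
    )
    return reduce_totals(pairs)


print(total_by_city())
```

In a relational database the same question is a join plus a GROUP BY over indexed tables, which is why the full-scan approach only pays off at very large data sizes.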

MySQL and location data:

BA: SQL goes very well with temporal data. SQL does very well with range data. I would say that SQL works very poorly today with location-based data. Is it the best thing out there, though? Probably. I'm still waiting for someone to really spend some time thinking about the location data problem, and come up with a true location store. I don't believe that SQL databases are the solution for tomorrow's location-based data. Location services are going to require something a lot better than what we have today. Because all we have today is a set of cobbled-together hacks.

MySQL's future:

BA: There hasn't been a roadmap for MySQL for some time. Even before Sun acquired MySQL, it was languishing, and Sun's handling of MySQL just further eroded the canonical MySQL tree. I'm waiting to see what Oracle announces at the MySQL Conference. I expect Oracle to scrap the current 5.5 plan and come up with a viable roadmap. It won't be very innovative, but I am betting it will be a stable plan that users can look at.

I see a lot of excitement about multiple versions of MySQL. I'm hoping to see this push innovation as the different distributions differentiate themselves. I believe that the different MySQL distributions will all become forks eventually.

In the Drizzle world, the excitement is in the different sorts of plugins that have been written, and the opportunity for more. There has been a bunch of work around the replication system, and how it integrates with other systems. We have plugins now that allow Drizzle to replicate into things like RabbitMQ, Cassandra, Gearman, Voldemort, Memcached and other database systems. Having a replication system that was designed from day one to be pluggable is a game changer for some enterprises. Drizzle's future? Everything is open source, and we will see where the community wants to take it.

I would like to see more focus on data bus architectures, i.e. geographical replication. In the past, replication was a lot about how to scale out. That's dead and gone. Anybody who's doing scale-out with replication is creating a future headache for themselves. What I'd like to actually see is more attention to how we pass data between datacenters. I would also like to see more work done on shared-nothing storage systems. There have been a few attempts at that with MySQL, but thus far, the attempts have been failures. The reasons for this? Poor code quality and difficulty of use. I believe we'll see new shared-nothing solutions coming out that will work better than anything that's been written so far.

March 25 2010

Four short links: 25 March 2010

  1. Aren't You Being a Little Hasty in Making This Data Free? -- very nice deconstruction of a letter sent by ESRI and competitors to the British Government, alarmed at the announcement that various small- and mid-sized datasets would no longer be charged for. In short, companies that make money reselling datasets hate the idea of free datasets. The arguments against charging are that the cost of gating access exceeds revenue and that open access maximises economic gain. (via glynmoody on Twitter)
  2. User Assisted Audio Selection -- amazing movie of a tool that lets you sing or hum along with a piece of music to pull that part out of the background music. The researcher, Paris Smaragdis, has done a lot of other nifty audio work. (via waxpancake on Twitter)
  3. Cologne-based Libraries Release 5.4M Bibliographic Records to CC0 -- I see resonance here with the Cologne Archives disaster last year, where the building collapsed and 18km of shelves covering over 2000 years of municipal history were lost. When you have digital heritage, embrace the ease of copying and spread those bits as far and wide as you can. Hoarding bits comes with a risk of a digital Cologne disaster, where one calamity deletes your collection. (via glynmoody on Twitter)
  4. ThinkTank -- web app that lets you analyse your tweets, break down responses to queries, and archive your Twitter experience. Built by Expert Labs.
