
February 09 2012

Strata Week: Your personal automated data scientist

Here are a few of the data stories that caught my attention this week:

Wolfram|Alpha Pro: An on-call data scientist

The computational knowledge engine Wolfram|Alpha unveiled a pro version this week. For $4.99 per month ($2.99 for students), Wolfram|Alpha Pro offers access to more of the computational power "under the hood" of the site, in part by allowing users to upload their own datasets, which Wolfram|Alpha will in turn analyze.

This includes:

  • Text files — Wolfram|Alpha will respond with the character and word count, provide an estimate on how long it would take to read aloud, and reveal the most common word, average sentence length and more.
  • Spreadsheets — It will crunch the numbers and return a variety of statistics and graphs.
  • Image files — It will analyze the image's dimensions, size, and colors, and let you apply several different filters.
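For a rough sense of what the text-file analysis involves, here is a minimal Python sketch that computes the same kinds of figures. It is not Wolfram|Alpha's method: the function name `text_stats`, the tokenizing rules, and the 150-words-per-minute read-aloud rate are all assumptions made for illustration.

```python
import re
from collections import Counter


def text_stats(text, words_per_minute=150):
    """Return simple descriptive statistics for a block of text.

    The read-aloud rate of 150 words per minute is an assumed default,
    not a figure taken from Wolfram|Alpha.
    """
    # Treat runs of letters and apostrophes as words (a simplifying assumption).
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    most_common = Counter(words).most_common(1)[0][0] if words else None
    return {
        "characters": len(text),
        "words": len(words),
        "most_common_word": most_common,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "read_aloud_minutes": len(words) / words_per_minute,
    }


if __name__ == "__main__":
    print(text_stats("The cat sat. The dog ran fast."))
```

A real service would need far more careful tokenizing (abbreviations, numbers, non-English text), which is presumably where the paid product earns its keep.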

Wolfram Alpha Pro example
Wolfram|Alpha Pro subscribers can upload and analyze their own datasets.

There's also a new extended keyboard that contains the Greek alphabet and other special characters for manually entering data. Data and analysis from these entries and any queries can also be downloaded.

"In a sense," writes Wolfram's founder Stephen Wolfram, "the concept is to imagine what a good data scientist would do if confronted with your data, then just immediately and automatically do that — and show you the results."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Crisis-mapping and data protection standards

Ushahidi's Patrick Meier takes a look at the recently released Data Protection Manual issued by the International Organization for Migration (IOM). According to the IOM, the manual is meant to serve as a guide to help:

" ... protect the personal data of the migrants in its care. It follows concerns about the general increase in data theft and loss and the recognition that hackers are finding ever more sophisticated ways of breaking into personal files. The IOM Data Protection Manual aims to protect the integrity and confidentiality of personal data and to prevent inappropriate disclosure."

Meier describes the manual as "required reading" but notes that there is no mention of social media in the 150-page document. "This is perfectly understandable given IOM's work," he writes, "but there is no denying that disaster-affected communities are becoming more digitally-enabled — and thus, increasingly the source of important, user-generated information."

Meier moves through the Data Protection Manual's principles, highlighting the ones that may be challenged when it comes to user-generated, crowdsourced data and raising important questions about consent, privacy, and security.

Doubting the dating industry's algorithms

Many online dating websites claim that their algorithms are able to help match singles with their perfect mate. But a forthcoming article in "Psychological Science in the Public Interest," a journal of the Association for Psychological Science, casts some doubt on the data science of dating.

According to the article's lead author Eli Finkel, associate professor of social psychology at Northwestern University, "there is no compelling evidence that any online dating matching algorithm actually works." Finkel argues that dating sites' algorithms do not "adhere to the standards of science," and adds that "it is unlikely that their algorithms can work, even in principle, given the limitations of the sorts of matching procedures that these sites use."

It's "relationship science" versus the intake questions that most dating sites ask in order to help users create their profiles and suggest matches. Finkel and his coauthors note that some of the strongest predictors for good relationships, such as how couples interact under pressure, aren't assessed by dating sites.

The paper calls for the creation of a panel to grade the scientific credibility of each online dating site.

Got data news?

Feel free to email me.


September 15 2011

Strata Week: Investors circle big data

This was a busy week for data stories. Here are a few that caught my attention:

Big money for big data

There's recently been a steady stream of funding news for big data, database, and data mining companies. Last Thursday, Hadoop-based data analytics startup Platfora raised $5.7 million from Andreessen Horowitz. On Monday, 10gen announced it had raised $20 million for MongoDB, its open-source, NoSQL database. On Tuesday, Xignite said it had raised $10 million to build big data repositories for financial organizations; data storage provider Zetta announced a $9 million round; and Walmart announced it had acquired the ad targeting and data mining startup OneRiot (the terms of the deal were not disclosed). Finally, yesterday, big data analytics company Opera Solutions announced that it had raised a whopping $84 million in its first round of funding.

GigaOm's Derrick Harris offers the story behind Opera Solutions' massive round of funding, noting that the company was already growing fast and doing more than $100 million per year in revenue. He also points to the company's penchant for hiring PhDs (90 so far), "something that makes it more akin to blue-chipper IBM than to many of today's big data startups pushing Hadoop or NoSQL technologies." Harris also notes that at a half-billion-dollar valuation and with 600-plus employees, Opera Solutions isn't a great acquisition target for other big companies, even those wanting to beef up their analytics offerings. He contends this could allow Opera Solutions to remain independent and perhaps make some acquisitions of its own.

Ushahidi and Wikipedia team up for WikiSweeper

The crisis-mapping platform Ushahidi unveiled a new tool this week to help Wikipedia editors track changes and verify sources on articles. The project, called WikiSweeper, is aimed at those highly- and rapidly-edited articles that are associated with major events.

As Ushahidi writes on its blog:

When a globally-relevant news story breaks, relevant Wikipedia pages are the subject of hundreds of edits as events unfold. As each editor looks to editing and maintaining the quality and credibility of the page, they need to manually track the news cycle, each using their own spheres of reference. The decisions that are made to accept one source while rejecting others remains opaque, as are the strategies that editors develop to alert and keep track of the latest information coming in from a variety of different sources.

WikiSweeper is based on Ushahidi's own open-source Sweeper tool, and its application to Wikipedia will help Ushahidi in turn build out its own project. After all, during major events, information comes in from multiple sources at a breakneck pace, and in crisis response, the accuracy and trustworthiness of the sources need to be quickly and transparently identified. As Ushahidi points out, this makes it a "win-win" for both organizations as they gain better tools for dealing with real-time news and social data.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

Angry Birds take down pigs and the economy

Invoking the seasonal declarations come March about the amount of time Americans waste at work watching the NCAA college basketball tournament, The Atlantic's Alexis Madrigal has pointed to a far more insidious and year-round problem: the number of hours American workers lose by playing Angry Birds.

Drawing on data about the number of minutes people spend playing Angry Birds per day — 200 million — Madrigal has calculated the resulting lost hours and lost wages. He estimates about 43,333,333 on-the-clock hours are spent playing Angry Birds each year, accounting for $1.5 billion in lost wages per year.

Obviously there are some really big assumptions in this calculation. The first is that five percent of the total Angry Bird hours are played by Americans at work ... we don't know the international breakdown, nor do we know how often people play at work. But, five percent seemed like a reasonable assumption. Second, the Pew income data for smartphone ownership is not that precise, particularly on the upper ($75,000+) and lower (less than $30,000) ends. I had to pick numbers, so I basically split Americans up into four categories: people earning $30,000, $50,000, $75,000, and $100,000, then I calculated simple hourly wages for those groups (income/52/40) and did a weighted average based on smartphone adoption in those categories. The $35 per hour number I used is comparable with the $38 that Challenger, Gray, and Christmas used for fantasy sports players. But this is certainly a rough approximation. Put it this way: I bet this estimate is right to the order of magnitude, if not in the details.
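Madrigal's back-of-the-envelope math can be retraced in a few lines of Python. The 200 million minutes, the five percent share and the $35 hourly wage come from the figures quoted above. The 260 working days per year (52 weeks of 5 days) is an assumption on my part; it is not stated in the excerpt, but it reproduces the 43,333,333-hour figure. The function names are illustrative only.

```python
MINUTES_PLAYED_PER_DAY = 200_000_000  # worldwide figure quoted above
SHARE_PLAYED_AT_WORK = 0.05           # Madrigal's stated assumption
HOURLY_WAGE = 35                      # dollars, Madrigal's weighted average
WORKDAYS_PER_YEAR = 260               # assumption: 52 weeks * 5 days


def simple_hourly_wage(annual_income):
    """Hourly wage as described in the quote: income / 52 weeks / 40 hours."""
    return annual_income / 52 / 40


def lost_hours_per_year(minutes_per_day=MINUTES_PLAYED_PER_DAY,
                        work_share=SHARE_PLAYED_AT_WORK,
                        workdays=WORKDAYS_PER_YEAR):
    """On-the-clock hours spent playing per year."""
    hours_per_day = minutes_per_day / 60
    return hours_per_day * work_share * workdays


def lost_wages_per_year(hourly_wage=HOURLY_WAGE):
    """Dollar value of the lost hours at the given hourly wage."""
    return lost_hours_per_year() * hourly_wage


if __name__ == "__main__":
    print(f"Lost hours per year: {lost_hours_per_year():,.0f}")
    print(f"Lost wages per year: ${lost_wages_per_year():,.0f}")
```

The weighted average behind the $35 figure is not reproduced here, because the smartphone adoption weights for each income group are not given in the excerpt.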

Take that, Gladwell

Malcolm Gladwell raised the ire of many social-media-savvy activists last year by claiming that "the revolution will not be tweeted." Writing in The New Yorker, Gladwell dismissed social media as a tool for change. He argued that bonds formed online are "weak" and unable to withstand the sorts of demands necessary for social change.

Gladwell's assertions have been countered in many places, and a new article analyzing social media's role in the Arab Spring takes the rebuttals to a new level.

"After analyzing over 3 million tweets, gigabytes of YouTube content and thousands of blog posts, a new study finds that social media played a central role in shaping political debates in the Arab Spring. Conversations about revolution often preceded major events on the ground, and social media carried inspiring stories of protest across international borders," the authors write.

The authors describe their research methodology for extracting and analyzing the texts from blogs and tweets, but also lament some of the problems they faced, particularly with access to the Twitter archive.

Got data news?

Feel free to email me.


August 26 2011

Social, mapping and mobile data tell the story of Hurricane Irene

As Hurricane Irene bears down on the East Coast, millions of people are bracing for the impact of what could be a multi-billion dollar disaster.

We've been through hurricanes before. What's different about this one is the unprecedented levels of connectivity that now exist up and down the East Coast. According to the most recent numbers from the Pew Internet and Life Project, for the first time, more than 50% of American adults use social networks. 35% of American adults have smartphones. 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. The growth of an Internet of things is an important evolution. What we're seeing this weekend is the importance of an Internet of people.

As citizens look for hurricane information online, government websites are under high demand. In this information ecosystem, media, government and citizens alike will play a critical role in sharing information about what's happening and providing help to one another. The federal government is providing information on Hurricane Irene at and sharing news and advisories in real-time on the radio, television, mobile devices and online using social media channels like @fema. As the storm comes in, FEMA recommends for desktops.

Over the next 72 hours a networked public can share its effects in real-time, providing city, state and federal officials unprecedented insight into what's happening. Citizens will be acting as sensors in the midst of the storm, creating an ad hoc system of networked accountability through data. There are already efforts underway to organize and collect the crisis data that citizens are generating, along with putting to use the open data that city and state governments have released.

Following are just a few examples of how data is playing a role in hurricane response and reporting.

Open data in the Big Apple

The city of New York is squarely in the path of Hurricane Irene and has initiated mandatory evacuations from low-lying areas. The NYC Mayor's Office has been providing frequent updates to New Yorkers as the hurricane approaches, including links to an evacuation map, embedded below:

NYC Hurricane Evacuation Map

The city provides public hurricane evacuation data on the NYC DataMine, including geographic data for NYC Hurricane Evacuation Zones and Hurricane Evacuation Centers. To find and use this open data, search for “Data by Agency” and select “Office of Emergency Management (OEM).” Developers can also download Google Earth KMZ files for the Hurricane Evacuation Zones. If you have any trouble accessing these files, civic technologist Philip Ashlock is mirroring NYC Irene data and links on Amazon Web Services (AWS).
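For developers who want to work with the KMZ files directly, here is a minimal Python sketch. A KMZ file is a zip archive with a KML (XML) document inside, so the standard library is enough to list what it contains. The function name `placemark_names` is illustrative, and the sketch assumes the file uses the OGC KML 2.2 namespace; files exported with an older Google Earth namespace would need that constant changed. I have not checked it against the city's actual files.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the KML inside the archive uses the OGC KML 2.2 namespace.
KML_URI = "http://www.opengis.net/kml/2.2"
KML_NS = {"kml": KML_URI}


def placemark_names(kmz_file):
    """Return the <name> of every Placemark in the first .kml inside a KMZ.

    kmz_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(kmz_file) as archive:
        # Pick the first member of the archive that looks like a KML document.
        kml_member = next(
            name for name in archive.namelist() if name.lower().endswith(".kml")
        )
        root = ET.fromstring(archive.read(kml_member))
    return [
        placemark.findtext("kml:name", default="", namespaces=KML_NS)
        for placemark in root.iter(f"{{{KML_URI}}}Placemark")
    ]
```

Calling `placemark_names("zones.kmz")` on a downloaded file (the filename here is hypothetical) would list the named features, which is a quick way to see what a file covers before loading it into a mapping tool.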

"This data is already being used to power a range of hurricane evacuation zone maps completely independent of the City of New York, including at and the New York Times," said Rachel Sterne, chief digital officer of New York City. "As always, we support and encourage developers to develop civic applications using public data."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

Partnering with citizens in Maryland

"We're partnering with SeeClickFix to collect reports from citizens about the effects from Irene to help first responders," said Bryan Sivak, Maryland's chief innovation officer, in a phone interview. The state has invited its citizens to share and view hurricane data throughout the state.

"This is interesting from a state perspective because there are very few things that we are responsible for or have the ability to fix. Any tree branches or wires that go down will be fixed by a local town or a utility. The whole purpose is to give our first responders another channel. We're operating under the perspective that more information is better information. By having more eyes and ears out there reporting data, we can make better informed decisions from an emergency management perspective. We just want to stress that this is a channel for communication, as opposed to a way to get something fixed. If this channel is useful in terms of managing the situation, we'll work with local governments in the future to see if it can help them. "

SeeClickFix has been working on enabling government to use citizens as public sensors since its founding. We'll see if they can help Maryland with Hurricane Irene this weekend.

[Disclosure: O'Reilly AlphaTech Ventures is an investor in SeeClickFix.]

The best hurricane tracker ever?

In the face of the storm, the New York Times has given New Yorkers one of the best examples of data journalism I've seen to date, a hurricane tracker that puts open data from the National Weather Service to beautiful use.

If you want a virtuoso human curation of the storm, New York Times reporter Brian Stelter is down in the Carolinas and reporting live via Twitter.

Crisismapping the hurricane

A Hurricane Irene crisis map is already online, where volunteers have stood up an instance of Ushahidi:

Mashing up social and geospatial data

ESRI has also posted a mashup that combines video and tweets onto an interactive map, embedded below:

The Florida Division of Emergency Management is maintaining, with support from DHS Science and Technology, mashing up curated Twitter accounts. You can download live shape files of tweeters and KML files to use if you wish.

Google adds data layers

There is also a wealth of GIS and weather data feeds powering Google's Hurricane Season mashup:


If you have more data stories or sources from Hurricane Irene, please let me know on Twitter at @digiphile. If you're safe, dry and connected, you can also help Crisis Commons by contributing to the Hurricane Irene wiki.

April 21 2011

Data News: Week in Review

The Where 2.0 Conference was held April 19 - 21 in Santa Clara, Calif., so it's no surprise there were plenty of location-based developments to talk about this week in the data space. Here are a few of the data stories — place-based and otherwise — that caught my eye.

Your iPhone tracks your location

On Wednesday, Pete Warden and Alasdair Allan made headlines with the story of their discovery of an iPhone file that tracks its owner's location. The iPhone appears to use cell-tower triangulation to periodically record its user's latitude and longitude, storing the data in a file that lives on the iPhone and is transferred to a user's computer when the device is synced.

According to their research, the file appears to be part of the iOS 4 update, as that's the point from which the recordings start. The existence of the file raises some questions (what are Apple's plans for this data?), but more disconcerting may be that the file is unencrypted, leaving this trove of location data stored locally but unprotected. Apple doesn't transmit the data, it appears, but no other device seems to have a comparable file, according to Warden and Allan.

While there are questions about privacy and security here, the data is quite compelling, thanks in no small part to the iPhone Tracker tool Warden and Allan have built that will read this file on a user's computer and visualize their movements. Your phone has surreptitiously been tracking you, but the maps replay a fascinating and fairly accurate record of where you've travelled since June 2010.

Crowdsourced data versus "real statistics"

Ushahidi co-founder Erik Hersman wrote a strong defense of crowdsourced data this week in his post, "The Immediacy of the Crowd." His blog post served as a response to one that appeared last month on the social enterprise organization Benetech's blog. The title of the latter post, "Crowdsourced data is not a substitute for real statistics," probably demonstrates immediately why Ushahidi would object.

The Benetech post (along with a subsequent Fast Company article) suggests that crowdsourced data from mobile phones and SMS can "lead rescue teams in the wrong direction" and that that data might not be good for statistical analysis or modeling.

On one hand, this is an interesting and important academic debate. Which is better, crowdsourced data or statistical patterns? Are there patterns in crowdsourced data that we can use, in aggregate or as predictions in real time?

But the back and forth between the blogs, as Hersman observes in his post, overlooks an important element: Crisis response is messy and hardly a "clinical environment where we all get to sit back, sift data and take our time to make a decision."

U.S. Senate finally releases its financial data ... in PDF

It's been almost two years since the U.S. Senate agreed to make the official record of its expenditures publicly available online. This week the Senate finally revealed its plan to release the information. According to the Sunlight Foundation, the Senate will begin to release records in November. This will cover the period from April to September.

But the data will be in PDF format. As the Sunlight Foundation notes with dismay:

The legislation was rather clearly intended to create the release of actual data, not data in the difficult-to-reuse form of a paper document. Unfortunately, PDF documents can meet the standard of searchable (as long as the text is exposed), and itemized (if the items are listed), so the Senate is getting by on a technicality, and reaching for the lowest common denominator.

How do we demand more accessible, structured datasets? Or, how do we challenge the PDF?

Got data news?

Feel free to email me.

