June 16 2011

Strata Week: The effort to digitize Palin's email archive

Here are a few of the data stories that caught my attention this week:

Sarah Palin's Inbox

Last Friday, in response to a years-old public records request, the state of Alaska finally released some 24,000 pages of emails sent by former governor Sarah Palin. And "pages" really is the operative word here. Palin's emails were all printed out — about 250 pounds of paper all told — at a printing cost of $725 per set. At least initially, the documents were only available to those who picked them up in Juneau — or to those willing to pay the high cost of having the six boxes mailed elsewhere.

Various organizations worked quickly to digitize the documents, but the task was so daunting that many news agencies, including The New York Times, called for crowdsourcing the review of the emails.

The Sunlight Foundation, an open government advocacy group, unveiled Sarah's Inbox this week, a site that makes it easier for people to search and examine Palin's emails.

The project echoes a similar one the Sunlight Foundation undertook last year, when the group built a searchable interface for the emails of then-Supreme Court nominee Elena Kagan.

One of Sarah Palin's many email messages archived at Sarah's Inbox.

As the Sunlight Foundation notes:

Like Elena's Inbox, Sarah's Inbox faced staggering issues of data quality because government officials continue to release digital files as hideous printouts requiring a laborious and error-ridden optical character recognition (OCR) pass over. You will notice that many of the emails are garbled, incomplete or contain odd characters — please keep in mind that we did the best with what we had and are not responsible for the content. Due to the programmatic nature of the tools used to build this site, we recommend checking any research effort against the source files.

Legal limits on location data

Roughly two months after the iOS location story broke here on Radar, U.S. lawmakers have taken steps to limit how both the government and private companies can use location data.

Two bills were introduced this week — one in the House and one in the Senate. The latter was proposed by Senators Al Franken and Richard Blumenthal and would require companies to obtain users' consent before sharing information about the location of a mobile device. The other bill, proposed by Representative Jason Chaffetz and Senator Ron Wyden, would require law enforcement agencies to obtain a warrant in order to track someone's location via their mobile phone.

The proposals are part of a larger effort to update digital privacy laws, as legislators seem to grow increasingly concerned about consumer protections and data security.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

LexisNexis open sources its Hadoop alternative

Research company LexisNexis announced this week that it will open source its big data processing tools. LexisNexis is positioning its High Performance Computing Cluster (HPCC) Systems as an alternative to Hadoop, boasting that it can "process, analyze, and find links and associations in high volumes of complex data significantly faster and more accurately than current technology systems."

LexisNexis has a long history of working with big datasets and it began developing HPCC Systems internally in its Risk Solutions unit a decade ago. Risk Solutions CEO James Peck says the company has opted to open source HPCC in order to leverage the "innovation of the open source community to further the development of the platform for the benefit of our customers and the community."

HPCC Systems consists of a data-centric programming language and two processing platforms: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster.

We've been watching the Hadoop competition heat up over the last few months, and the entry by LexisNexis makes the development of big data technologies and the big data market even more interesting.

Got data news?

Feel free to email me.


March 09 2011

Why location data is a mess, and what can be done about it

Between identifying relevant and accurate data sources, harmonizing data from multiple sources, and finding new ways to store and manipulate that data, location technology can be messy, says SimpleGeo's Chris Hutchins (@hutchins). But there are ways to clean it up. Hutchins explains how in the following interview.

What makes location data messy?

Chris Hutchins: The primary reasons are:

  • The ever-complicated restrictions, licenses, and use rights that come with different datasets. These can include requirements to use a company's map tiles, to share back all derivative works, and to display sponsored listings or advertisements alongside the data.
  • Conflating records that represent the same location/business/place between multiple datasets is an incredibly arduous process.
  • With small datasets, spatial queries are quite simple. However, as datasets grow exponentially in size, indexing that data to enable fast queries becomes difficult.
  • Location is usually an opinion, not a fact. For example, there are very strong views about where neighborhoods start and end.
  • The nature of location-based information requires all technology to handle real-time requests against datasets that are always changing.
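
The indexing point in the list above can be made concrete with a small sketch. It is only an illustration and not how SimpleGeo or any particular product works: it assumes places are plain (name, latitude, longitude) tuples, buckets them into a fixed grid, and answers a "what is near me" query by scanning only the cells around the query point instead of every record. The names `build_index` and `nearby` and the 0.01-degree cell size are hypothetical choices, and the sketch ignores the poles and the antimeridian.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01          # grid cell size in degrees (arbitrary for this sketch)
KM_PER_DEG_LAT = 111.0   # rough figure, slightly low so the search box errs large
EARTH_RADIUS_KM = 6371.0


def cell_of(lat, lon):
    """Map a coordinate to the integer (row, col) of its grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(places):
    """places: iterable of (name, lat, lon). Returns {cell: [place, ...]}."""
    index = defaultdict(list)
    for place in places:
        _, lat, lon = place
        index[cell_of(lat, lon)].append(place)
    return index


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(index, lat, lon, radius_km):
    """Return [(name, distance_km)] within radius_km, nearest first."""
    lat_span = radius_km / KM_PER_DEG_LAT
    # Longitude degrees shrink toward the poles, so size the box at the
    # latitude in the search area that is farthest from the equator.
    widest_lat = min(abs(lat) + lat_span, 89.0)
    lon_span = radius_km / (KM_PER_DEG_LAT * math.cos(math.radians(widest_lat)))

    row_lo, col_lo = cell_of(lat - lat_span, lon - lon_span)
    row_hi, col_hi = cell_of(lat + lat_span, lon + lon_span)

    hits = []
    for row in range(row_lo, row_hi + 1):
        for col in range(col_lo, col_hi + 1):
            # Only candidates in nearby cells get the exact distance check.
            for name, plat, plon in index.get((row, col), ()):
                d = haversine_km(lat, lon, plat, plon)
                if d <= radius_km:
                    hits.append((name, d))
    return sorted(hits, key=lambda hit: hit[1])
```

With this layout, a 2 km query touches a few dozen cells no matter how many places the index holds, which is the property that keeps queries fast as a dataset grows. Production systems typically rely on more sophisticated structures such as R-trees or geohash-style keys, and they still have to handle the constant updates mentioned in the last bullet.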

What can be done to clean up location data?

Chris Hutchins: Part of cleaning up is understanding the situation. By being aware of the limitations of certain databases or of the restrictions that some datasets require, you can better understand your capabilities.

Specifically related to data, ensuring that your data source is providing clean and up-to-date data means you won't be sending end users to the wrong location or giving them false information. Also, as more companies understand what their core competency is — and what it isn't — they learn to trust other companies to handle the things that require a more niche expertise. Understanding that this technology is new and learning to embrace tools and services in their infancy will certainly give you an edge with location data.

Where 2.0: 2011, being held April 19-21 in Santa Clara, Calif., will explore the intersection of location technologies and trends in software development, business strategies, and marketing.

Save 25% on registration with the code WHR11RAD

What are the most challenging aspects of location-aware development?

Chris Hutchins: The primary challenges we hear about are a lack of fast and accurate tools for storing, manipulating and querying spatial data, and the fact that most data is expensive and comes with restrictive terms of use. Today's geospatial infrastructure platforms are antiquated, so building the back-end infrastructure for applications takes a long time and requires some very niche skills.

How is SimpleGeo Places being used?

Chris Hutchins: SimpleGeo Places is a free database of business listings and points of interest (POI), which is being used by applications to get an up-to-date view of local businesses without having to manage a large and changing spatial database in-house. Most current POI databases have restrictive terms of use and are expensive. We believe that this has impeded innovation in the development of location-aware services and applications, so SimpleGeo provides a certain amount of usage of our Places data at no cost to developers, and the data will always be free of restrictive licensing.

What future developments do you see for location technology?

Chris Hutchins: The future of location is context, where apps will be better at giving you relevant information based on real-time information about where you are and what's around you. I'm really looking forward to a world where, by knowing where I've been in the past, the things my friends like, the weather, and more, applications will be able to pinpoint where I might be interested in going and what I might be interested in doing, as well as get me there.

This interview was edited and condensed.

