
April 05 2012

Strata Week: New life for an old census

Here are a few of the data stories that caught my attention this week.

Now available in digital form: The 1940 census

The National Archives released the 1940 U.S. Census records on Monday, after a mandatory 72-year waiting period. The release marks the single largest collection of digital information ever made available online by the agency.

Screenshot from the digital version of the 1940 Census, available through Archives.com.

The 1940 Census, conducted as a door-to-door survey, included questions about age, race, occupation, employment status, income, and participation in New Deal programs — all important (and intriguing) following the previous decade's Great Depression. One data point: in 1940, there were 5.1 million farmers. According to the 2010 American Community Survey (not the census, mind you), there were just 613,000.

The ability to glean these sorts of insights proved to be far more compelling than the National Archives anticipated, and the website hosting the data, Archives.com, was temporarily brought down by the traffic load. The site is now up, so anyone can investigate the records of approximately 132 million Americans. The records are searchable by map — or rather, "the appropriate enumeration district" — but not by name.

A federal plan for big data

The Obama administration unveiled its "Big Data Research and Development Initiative" late last week, with more than $200 million in financial commitments. Among the White House's goals: to "advance state-of-the-art core technologies needed to collect, store, preserve, manage, analyze, and share huge quantities of data."

The new big data initiative was announced with a number of departments and agencies already on board with specific plans, including grant opportunities from the Department of Defense and the National Science Foundation, new spending by DARPA on an XDATA program to build new computational tools, and open data initiatives such as the 1000 Genomes Project.

"In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use big data for scientific discovery, environmental and biomedical research, education, and national security," said Dr. John P. Holdren, assistant to the President and director of the White House Office of Science and Technology Policy in the official press release (PDF).

Personal data and social context

When the Girls Around Me app was released, using data from Foursquare and Facebook to notify users when there were females nearby, many commentators called it creepy. "Girls Around Me is the perfect complement to any pick-up strategy," the app's website once touted. "And with millions of chicks checking in daily, there's never been a better time to be on the hunt."

"Hunt" is an interesting choice of words here, and the Cult of Mac, among other blogs, asked if the app was encouraging stalking. Outcry about the app prompted Foursquare to yank the app's API access, and the app's developers later pulled the app voluntarily from the App Store.

Many of the responses to the app raised issues about privacy and user data, and questioned whether women in particular should be extra cautious about sharing their information with social networks. But as Amit Runchal writes in TechCrunch, this response blames the victims:

"You may argue, the women signed up to be a part of this when they signed up to be on Facebook. No. What they signed up for was to be on Facebook. Our identities change depending on our context, no matter what permissions we have given to the Big Blue Eye. Denying us the right to this creates victims who then get blamed for it. 'Well,' they say, 'you shouldn't have been on Facebook if you didn't want to ...' No. Please recognize them as a person. Please recognize what that means."

Writing here at Radar, Mike Loukides expands on some of these issues, noting that the questions are always about data and social context:

"It's useful to imagine the same software with a slightly different configuration. Girls Around Me has undeniably crossed a line. But what if, instead of finding women, the app was Hackers Around Me? That might be borderline creepy, but most people could live with it, and it might even lead to some wonderful impromptu hackathons. EMTs Around Me could save lives. I doubt that you'd need to change a single line of code to implement either of these apps, just some search strings. The problem isn't the software itself, nor is it the victims, but what happens when you move data from one context into another. Moving data about EMTs into context where EMTs are needed is socially acceptable; moving data into a context that facilitates stalking isn't acceptable, and shouldn't be."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.


March 17 2012

Profile of the Data Journalist: The Homicide Watch

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context and clarity and, perhaps most important, to find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Chris Amico (@eyeseast) is a journalist and web developer based in Washington, DC, where he works on NPR's State Impact project, building a platform for local reporters covering issues in their states. Laura Norton Amico (@LauraNorton) is the editor of Homicide Watch (@HomicideWatch), an online community news platform in Washington, D.C. that aspires to cover every homicide in the District of Columbia. And yes, the similar names aren't a coincidence: the Amicos were married in 2010.

Since Homicide Watch launched in 2009, it's been earning praise and interest from around the digital world, including a profile by the Nieman Lab at Harvard University that asked whether a local blog "could fill the gaps of DC's homicide coverage." Notably, Homicide Watch has turned up a number of unreported murders.

In the process, the site has also highlighted an important emerging data source that other digital editors should consider: inbound search engine analytics for reporting. As Steve Myers reported for the Poynter Institute, Homicide Watch used clues in site search queries to ID a homicide victim. We'll see if the Knight Foundation thinks this idea has legs: the husband-and-wife team have applied for a Knight News Challenge grant to build a toolkit for real-time investigative reporting from site analytics.
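
To make that idea concrete, here is a minimal sketch of how a newsroom might mine its own site-search logs for potential tips. It is only an illustration of the approach, not the Amicos' actual toolkit: the log file, its columns, and the known_names list are all hypothetical.

    import csv
    from collections import Counter

    # Names already covered in the site's case database (hypothetical list).
    known_names = {"john doe", "jane roe"}

    # Hypothetical export of site-search queries with columns: query, count.
    tip_candidates = Counter()
    with open("site_search_queries.csv", newline="") as f:
        for row in csv.DictReader(f):
            query = row["query"].strip().lower()
            # A two-word query that looks like a personal name but isn't in
            # the case database may be a reader searching for an unreported case.
            if len(query.split()) == 2 and query not in known_names:
                tip_candidates[query] += int(row["count"])

    # Surface the most-searched unknown names for a reporter to check.
    for name, searches in tip_candidates.most_common(10):
        print(f"{searches:5d}  {name}")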

The Amicos' success with the site, which saw big growth in 2011, offers an important case study in why organizing beats may hold an importance similar to investigative projects. It will also be a case study in sustainability and business models for the "new news," as Homicide Watch looks to license its platform to news outlets across the country.

Below, I've embedded a presentation on Homicide Watch from the January 2012 meeting of the Online News Association. Our interview follows.

Watch live streaming video from onlinenewsassociation at livestream.com

Where do you work now? What is a day in your life like?

Laura: I work full time right now for Homicide Watch, a database-driven beat publishing platform for covering homicides. Our flagship site is in DC, and I’m the editor and primary reporter on that site as well as running business operations for the brand.

My typical days start with reporting. First, news checks, and maybe posting some quick items on anything that’s happened overnight. After that, it’s usually off to court to attend hearings and trials, get documents, and do other reporting work. I usually have a to-do list for the day that includes business meetings, scheduling freelancers, mapping out long-term projects, doing interviews about the site, managing our accounting, dealing with awards applications, blogging about the start-up data journalism life on my personal blog and for ONA at journalists.org, guest teaching the occasional journalism class, and meeting deadlines for freelance stories. The work day never really ends; I’m online keeping an eye on things until I go to bed.

Chris: I work for NPR, on the State Impact project, where I build news apps and tools for journalists. With Homicide Watch, I work in short bursts, usually an hour before dinner and a few hours after. I’m a night owl, so if I let myself, I’ll work until 1 or 2 a.m., just hacking at small bugs on the site. I keep a long list of little things I can fix, so I can dip into the codebase, fix something and deploy it, then do something else. Big features, like tracking case outcomes, tend to come from weekend code sprints.

How did you get started in data journalism? Did you get any special degrees or certificates?

Laura: Homicide Watch DC was my first data project. I’ve learned everything I know now from conceiving of the site, managing it as Chris built it, and from working on it. Homicide Watch DC started as a spreadsheet. Our start-up kit for newsrooms starting Homicide Watch sites still includes filling out a spreadsheet. The best lesson I learned when I was starting out was to find out what all the pieces are and learn how to manage them in the simplest way possible.

Chris: My first job was covering local schools in southern California, and data kept creeping into my beat. I liked having firm answers to tough questions, so I made sure I knew, for example, how many graduates at a given high school met the minimum requirements for college. California just has this wealth of education data available, and when I started asking questions of the data, I got stories that were way more interesting.

I lived in Dalian, China for a while. I helped start a local news site with two other expats (Alex Bowman and Rick Martin). We put everything we knew about the city -- restaurant reviews, blog posts, photos from Flickr -- into one big database and mapped it all. It was this awakening moment when suddenly we had this resource where all the information we had was interlinked. When I came back to California, I sat down with a book on Python and Django and started teaching myself to code. I spent a year freelancing in the Bay Area, writing for newspapers by day, learning Python by night. Then the NewsHour hired me.

Did you have any mentors? Who? What were the most important resources they shared with you?

Laura: Chris really coached me through the complexities of data journalism when we were creating the site. He taught me that data questions are editorial questions. When I realized that data could be discussed as an editorial approach, it opened the crime beat up. I learned to ask questions of the information I was gathering in a new way.

Chris: My education has been really informal. I worked with a great reporter at my first job, Bob Wilson, who is a great interviewer of both people and spreadsheets. At NewsHour, I worked with Dante Chinni on Patchwork Nation, who taught me about reporting around a central organizing principle. Since I’ve started coding, I’ve ended up in this great little community of programmer-journalists where people bounce ideas around and help each other out.

What does your personal data journalism "stack" look like? What tools could you not live without?

Laura: The site itself and its database, which I report to and from, WordPress, WordPress analytics, Google Analytics, Google Calendar, Twitter, Facebook, Storify, DocumentCloud, VINELink, and DC Superior Court’s online case lookup.

Chris: Since I write more Python than prose these days, I spend most of my time in a text editor (usually TextMate) on a MacBook Pro. I try not to do anything without git.

What data journalism project are you the most proud of working on or creating?

Laura: Homicide Watch is the best thing I’ve ever done. It’s not just about the data, and it’s not just about the journalism; it’s about meeting a community need in an innovative way. I started thinking about a Homicide Watch-type site when I was trying to follow a few local cases shortly after moving to DC. It was nearly impossible to find news sources for the information. I did find that family and friends of victims and suspects were posting newsy updates in unusual places -- online obituaries and Facebook memorial pages, for example. I thought a lot about how a news product could fit the expressed need for news, information, and a way for the community to stay in touch about cases.

The data part developed very naturally out of that. The earliest description of the site was “everything a reporter would have in their notebook or on their desk while covering a murder case from start to finish.” That’s still one of the guiding principles of the site, but it’s also meant that organizing that information is super important. What good is making court dates public, for example, if you’re not putting them on a calendar?

We started, like I said, with a spreadsheet that listed everything we knew: victim name, age, race, gender, method of death, place of death, link to obituary, photo, suspect name, age, race, gender, case status, incarceration status, detective name, age, race, gender, phone number, judge assigned to case, attorneys connected to the case, co-defendants, connections to other murder cases.
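
As a rough sketch of how the core of that spreadsheet could be expressed as a data structure, here is one possible model in Python. The field names follow Laura's list above, but the classes themselves are illustrative; they are not the actual Homicide Watch schema.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Person:
        # The attributes Laura lists for victims, suspects, and detectives.
        name: str
        age: Optional[int] = None
        race: Optional[str] = None
        gender: Optional[str] = None

    @dataclass
    class HomicideCase:
        victim: Person
        method_of_death: Optional[str] = None
        place_of_death: Optional[str] = None
        obituary_link: Optional[str] = None
        photo: Optional[str] = None
        suspects: List[Person] = field(default_factory=list)
        case_status: Optional[str] = None           # e.g. "open", "closed"
        incarceration_status: Optional[str] = None
        detective: Optional[Person] = None
        detective_phone: Optional[str] = None
        judge: Optional[str] = None
        attorneys: List[str] = field(default_factory=list)
        co_defendants: List[str] = field(default_factory=list)
        related_cases: List[str] = field(default_factory=list)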

And those are just the basics. Any reporter covering a murder case, crime to conviction, should have that information. What Homicide Watch does is organize it, make as much of it public as we can, and then report from it. It’s led to some pretty cool work, from developing a method to discover news tips in analytics, to simply building news packages that accomplish more than anyone else can.

Chris: Homicide Watch is really the project I wanted to build for years. It’s data-driven beat reporting, where the platform and the editorial direction are tightly coupled. In a lot of ways, it’s what I had in mind when I was writing about frameworks for reporting.

The site is built to be a crime reporter’s toolkit. It’s built around the way Laura works, based on our conversations over the dinner table for the first six months of the site’s existence. Building it meant understanding the legal system, doing reporting and modeling reality in ways I hadn’t done before, and that was a challenge on both the technical and editorial side.

Where do you turn to keep your skills updated or learn new things?

Laura: Assigning myself new projects and tasks is the best way for me to learn; it forces me to find solutions for what I want to do. I’m not great at seeking out resources on my own, but I keep a close eye on Twitter for what others are doing, saying about it, and reading.

Chris: Part of my usual morning news reading is a run through a bunch of programming blogs. I try to get exposed to technologies that have no immediate use to me, just so it keeps me thinking about other ways to approach a problem and to see what other problems people are trying to solve.

I spend a lot of time trying to reverse-engineer other people’s projects, too. Whenever someone launches a new news app, I’ll try to find the data behind it, take a dive through the source code if it’s available and generally see if I can reconstruct how it came together.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Laura: Working on Homicide Watch has taught me that news is about so much more than “stories.” If you think about a typical crime brief, for example, there’s a lot of information in there, starting with the "who-what-where-when." Once that brief is filed and published, though, all of that information disappears.

Working with news apps gives us the ability to harness that information and reuse/repackage it. It’s about slicing our reporting in as many ways as possible in order to make the most of it. On Homicide Watch, that means maintaining a database and creating features like victims’ and suspects’ pages. Those features help regroup, refocus, and curate the reporting into evergreen resources that benefit both reporters and the community.

Chris: Spend some time with your site analytics. You’ll find that there’s no one thing your audience wants. There isn’t even really one audience. Lots of people want lots of different things at different times, or at least different views of the information you have.

One of our design goals with Homicide Watch is “never hit a dead end.” A user may come in looking for information about a certain case, then decide she’s curious about a related issue, then wonder which cases are closed. We want users to be able to explore what we’ve gathered and to be able to answer their own questions. Stories are part of that, but stories are data, too.

March 07 2012

Data markets survey


The sale of data is a venerable business, and has existed since the middle of the 19th century, when Paul Reuter began providing telegraphed stock exchange prices between Paris and London, and New York newspapers founded the Associated Press.

The web has facilitated a blossoming of information providers. As the ability to discover and exchange data improves, the need to rely on aggregators such as Bloomberg or Thomson Reuters is declining. This is a good thing: the business models of large aggregators do not readily scale to web startups, or casual use of data in analytics.

Instead, data is increasingly offered through online marketplaces: platforms that host data from publishers and offer it to consumers. This article provides an overview of the most mature data markets, and contrasts their different approaches and facilities.

What do marketplaces do?

Most of the consumers of data from today's marketplaces are developers. By adding another dataset to your own business data, you can create insight. To take an example from web analytics: by mixing an IP address database with the logs from your website, you can understand where your customers are coming from; if you then add demographic data to the mix, you have some idea of their socio-economic bracket and spending ability.
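
As a rough sketch of that web analytics example, the snippet below joins web server log entries against a tiny, made-up IP-to-location table. A real IP intelligence dataset would map address ranges rather than exact addresses, and the file name and sample values here are hypothetical.

    from collections import Counter

    # Tiny stand-in for a licensed IP intelligence dataset (hypothetical values).
    ip_locations = {
        "203.0.113.7": ("GB", "London"),
        "198.51.100.23": ("US", "New York"),
    }

    visits_by_city = Counter()
    with open("access.log") as log:
        for line in log:
            # Common Log Format lines start with the client IP address.
            ip = line.split(" ", 1)[0]
            country, city = ip_locations.get(ip, ("unknown", "unknown"))
            visits_by_city[(country, city)] += 1

    for (country, city), count in visits_by_city.most_common():
        print(f"{count:6d}  {city}, {country}")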

Such insight isn't limited to analytics; you can also use it to provide value back to a customer, for instance by recommending restaurants near the location of a lunchtime appointment in their calendar. While many datasets are useful, few are as potent as location data in the way they provide context to activity.
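
As a toy illustration of that kind of contextual recommendation, the sketch below filters a hypothetical restaurant dataset down to places within walking distance of a calendar appointment. The coordinates, place names, and distance cutoff are invented for the example.

    import math

    def distance_km(lat1, lon1, lat2, lon2):
        # Haversine formula: great-circle distance between two points, in km.
        r = 6371.0
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = (math.sin(dlat / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
             * math.sin(dlon / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    # Hypothetical lunchtime appointment in downtown Washington, D.C.
    appointment = {"lat": 38.9072, "lon": -77.0369}

    # Tiny stand-in for a licensed places dataset.
    restaurants = [
        {"name": "Example Bistro", "lat": 38.9081, "lon": -77.0350},
        {"name": "Sample Diner", "lat": 38.8800, "lon": -77.1000},
    ]

    nearby = [r for r in restaurants
              if distance_km(appointment["lat"], appointment["lon"],
                             r["lat"], r["lon"]) < 1.0]
    print([r["name"] for r in nearby])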

Marketplaces are useful in three major ways. First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume.

In general, one of the important barriers to the development of the data marketplace economy is the ability of enterprises to store and make use of the data. A principle of big data is that it's often easier to move your computation to the data, rather than the reverse. Because of this, we're seeing increasing integration between cloud computing facilities and data markets: Microsoft's data market is tied to its Azure cloud, and Infochimps offers hosted compute facilities. In the short term, it's probably easier to export data from your business systems to a cloud platform than to try to expand internal systems to integrate external sources.

While cloud solutions offer a route forward, some marketplaces also make the effort to target end-users. Microsoft's data marketplace can be accessed directly through Excel, and DataMarket provides online visualization and exploration tools.

The four most established data marketplaces are Infochimps, Factual, Microsoft Windows Azure Data Marketplace, and DataMarket. A table comparing these providers is presented at the end of this article, and a brief discussion of each marketplace follows.

Infochimps

According to founder Flip Kromer, Infochimps was created to give data life in the same way that code hosting projects such as SourceForge or GitHub give life to code. You can improve code and share it: Kromer wanted the same for data. The driving goal behind Infochimps is to connect every public and commercially available database in the world to a common platform.

Infochimps realized that there's an important network effect of "data with the data," that the best way to build a data commons and a data marketplace is to put them together in the same place. The proximity of other data makes all the data more valuable, because of the ease with which it can be found and combined.

The biggest challenge in the two years Infochimps has been operating is that of bootstrapping: a data market needs both supply and demand. Infochimps' approach is to go for a broad horizontal range of data, rather than specialize. According to Kromer, this is because they view data's value as being in the context it provides: in giving users more insight about their own data. To join up data points into a context, common identities are required (for example, a web page view can be given a geographical location by joining up the IP address of the page request with that from the IP address in an IP intelligence database). The benefit of common identities and data integration is where hosting data together really shines, as Infochimps only needs to integrate the data once for customers to reap continued benefit: Infochimps sells datasets which are pre-cleaned and integrated mash-ups of those from their providers.

By launching a big data cloud hosting platform alongside its marketplace, Infochimps is seeking to build on the importance of data locality.


Factual

Factual was envisioned by founder and CEO Gil Elbaz as an open data platform, with tools that could be leveraged by community contributors to improve data quality. The vision is very similar to that of Infochimps, but in late 2010 Factual elected to concentrate on one area of the market: geographical and place data. Rather than pursue a broad strategy, the idea is to become a proven and trusted supplier in one vertical, then expand. With customers such as Facebook, Factual's strategy is paying off.

According to Elbaz, Factual will look to expand into verticals other than local information in 2012. It is moving one vertical at a time due to the marketing effort required in building quality community and relationships around the data.

Unlike the other main data markets, Factual does not offer reselling facilities for data publishers. Elbaz hasn't found that the cash on offer is attractive enough for many organizations to want to share their data. Instead, he believes that the best way to get data you want is to trade other data, which could provide business value far beyond the returns of publishing data in exchange for cash. Factual offers incentives to its customers to share data back, improving the quality of the data for everybody.

Windows Azure Data Marketplace

Launched in 2010, Microsoft's Windows Azure Data Marketplace sits alongside the company's Applications marketplace as part of the Azure cloud platform. Microsoft's data market is positioned with a very strong integration story, both at the cloud level and with end-user tooling.

Through use of a standard data protocol, OData, Microsoft offers a well-defined web interface for data access, including queries. As a result, programs such as Excel and PowerPivot can directly access marketplace data: giving Microsoft a strong capability to integrate external data into the existing tooling of the enterprise. In addition, OData support is available for a broad array of programming languages.
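
To give a feel for what that looks like in practice, here is a minimal sketch of an OData query issued over HTTP from Python. The service URL, dataset name, and fields are placeholders, and a real Azure DataMarket request would also need the account-key authentication the service requires; the query options such as $filter and $top are standard OData conventions, though.

    import json
    import urllib.parse
    import urllib.request

    # Placeholder OData service root; a real marketplace feed has its own URL.
    service_root = "https://example.com/odata/DemographicsService/"

    # OData encodes the query in the URL: $filter restricts rows, $top limits them.
    query = urllib.parse.urlencode({
        "$filter": "Country eq 'US'",
        "$top": "10",
        "$format": "json",
    })
    url = service_root + "Population?" + query

    with urllib.request.urlopen(url) as response:
        payload = json.load(response)

    # OData v2 JSON responses wrap rows in a "d" object with a "results" list.
    for row in payload.get("d", {}).get("results", []):
        print(row)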

Azure Data Marketplace has a strong emphasis on connecting data consumers to publishers, and most closely approximates the popular concept of an "iTunes for Data." Big name data suppliers such as Dun & Bradstreet and ESRI can be found among the publishers. The marketplace contains a good range of data across many commercial use cases, and tends to be limited to one provider per dataset — Microsoft has maintained a strong filter on the reliability and reputation of its suppliers.

DataMarket

Where the other three main data marketplaces put a strong focus on the developer and IT customers, DataMarket caters to the end-user as well. Realizing that interacting with bland tables wasn't engaging users, founder Hjalmar Gislason worked to add interactive visualization to his platform.

The result is a data marketplace that is immediately useful for researchers and analysts. The range of DataMarket's data follows this audience too, with a strong emphasis on country data and economic indicators. Much of the data is available for free, with premium data paid at the point of use.

DataMarket has recently made a significant play for data publishers, with the emphasis on publishing, not just selling data. Through a variety of plans, customers can use DataMarket's platform to publish and sell their data, and embed charts in their own pages. At the enterprise end of their packages, DataMarket offers an interactive branded data portal integrated with the publisher's own web site and user authentication system. Initial customers of this plan include Yankee Group and Lux Research.


Data markets compared

Data sources
  Azure: broad range
  DataMarket: range, with a focus on country and industry stats
  Factual: geo-specialized, some other datasets
  Infochimps: range, with a focus on geo, social and web sources

Free data
  Azure: yes
  DataMarket: yes
  Factual: no
  Infochimps: yes

Free trials of paid data
  Azure: yes
  DataMarket: no
  Factual: yes, limited free use of APIs
  Infochimps: no

Delivery
  Azure: OData API
  DataMarket: API, downloads
  Factual: API, downloads for heavy users
  Infochimps: API, downloads

Application hosting
  Azure: Windows Azure
  DataMarket: none
  Factual: none
  Infochimps: Infochimps Platform

Previewing
  Azure: Service Explorer
  DataMarket: interactive visualization
  Factual: interactive search
  Infochimps: none

Tool integration
  Azure: Excel, PowerPivot, Tableau and other OData consumers
  DataMarket: none
  Factual: developer tool integrations
  Infochimps: none

Data publishing
  Azure: via database connection or web service
  DataMarket: upload or web/database connection
  Factual: via upload or web service
  Infochimps: upload

Data reselling
  Azure: yes, 20% commission on non-free datasets
  DataMarket: yes; fees and commissions vary; ability to create a branded data market
  Factual: no
  Infochimps: yes, 30% commission on non-free datasets

Launched
  Azure: 2010
  DataMarket: 2010
  Factual: 2007
  Infochimps: 2009


Other data suppliers

While this article has focused on the more general purpose marketplaces, several other data suppliers are worthy of note.

Social data — Gnip and DataSift specialize in offering social media data streams, in particular Twitter.

Linked data — Kasabi, currently in beta, is a marketplace that is distinctive for hosting all its data as Linked Data, accessible via web standards such as SPARQL and RDF.

Wolfram Alpha — Perhaps the most prolific integrator of diverse databases, Wolfram Alpha recently added a Pro subscription level that permits the end user to download the data resulting from a computation.


February 24 2012

Top stories: February 20-24, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Data for the public good
The explosion of big data, open data and social data offers new opportunities to address humanity's biggest challenges. The open question is no longer if data can be used for the public good, but how.

Building the health information infrastructure for the modern epatient
The National Coordinator for Health IT, Dr. Farzad Mostashari, discusses patient empowerment, data access and ownership, and other important trends in healthcare.

Big data in the cloud
Big data and cloud technology go hand-in-hand, but it's comparatively early days. Strata conference chair Edd Dumbill explains the cloud landscape and compares the offerings of Amazon, Google and Microsoft.

Everyone has a big data problem
MetaLayer's Jonathan Gosier talks about the need to democratize data tools because everyone has a big data problem.

Three reasons why direct billing is ready for its close-up
David Sims looks at the state of direct billing and explains why it's poised to catch on beyond online games and media.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

Cloud photo: Big cloud by jojo nicdao, on Flickr

December 26 2011

The year in big data and data science

Big data and data science have both been with us for a while. According to McKinsey & Company's May 2011 report on big data, back in 2009 "nearly all sectors in the U.S. economy had at least an average of 200 terabytes of stored data ... per company with more than 1,000 employees." And on the data-science front, Amazon's John Rauser used his presentation at Strata New York (below) to trace the profession of data scientist all the way back to 18th-century German astronomer Tobias Mayer.

Of course, novelty and growth are separate things, and in 2011, there were a number of new technologies and companies developed to address big data's issues of storage, transfer, and analysis. Important questions were also raised about how the growing ranks of data scientists should be trained and how data science teams should be constructed.

With that as a backdrop, below I take a look at three evolving data trends that played an important role over the last year.

The ubiquity of Hadoop

It was a big year for investment in Apache Hadoop-based companies. Hortonworks, which was spun out of Yahoo this summer, raised $20 million upon its launch. And when Cloudera announced it had raised $40 million this fall, GigaOm's Derrick Harris calculated that, all told, Hadoop-based startups had raised $104.5 million between May and November of 2011. (Other startups raising investment for their Hadoop software included Platfora, Hadapt and MapR.)

But it wasn't just startups that got in on the Hadoop action this year: IBM announced this fall that it would offer Hadoop in the cloud; Oracle unveiled its own Hadoop distribution running on its new Big Data appliance; EMC signed a licensing agreement with MapR; and Microsoft opted to put its own big data processing system, Dryad, on hold, signing a deal instead with Hortonworks to handle Hadoop on Azure.

The growing number of Hadoop providers and adopters has spurred more solutions for managing and supporting Hadoop. This will become increasingly important in 2012 as Hadoop moves beyond the purview of data scientists to become a tool more businesses and analysts utilize.

More data, more privacy and security concerns

Despite all the promise that better tools for handling and analyzing data hold, there were numerous concerns this year about the privacy and security implications of big data, stemming in part from a series of high-profile data thefts and scandals.

In April, a security breach at Sony led to the theft of the personal data of 77 million users. The intrusion into the PlayStation Network prompted Sony to pull it offline, but Sony failed to notify its users about the issue for a full week (later admitting that it stored usernames and passwords unencrypted). Estimates of the cost of the security breach to Sony range from $170 million to $24 billion.

That's a wide range of estimates for the damage done to the company, but the point is clear nonetheless: not only do these sorts of data breaches cost companies millions, but the value of consumers' personal data is also increasing — for both legitimate and illegitimate purposes.

Sony was hardly the only company with security and privacy concerns on its hands. In April, Alasdair Allan and Pete Warden uncovered a file in Apple's iOS software that logged users' latitude-longitude coordinates along with a timestamp. Apple responded, insisting that the company "is not tracking the location of your iPhone. Apple has never done so and has no plans to ever do so." Apple fixed what it said was a "bug."

Late this year, almost all handset makers and carriers were implicated by another mobile concern when Android developer Trevor Eckhart reported that the mobile intelligence company Carrier IQ's rootkit software could record all sorts of user data — texts, web browsing, keystrokes, and even phone calls.

That the data from mobile technology was at the heart of these two controversies reflects in some ways our changing data usage patterns. But whether it's mobile or not, as we do more online — shop, browse, chat, check in, "like" — it's clear that we're leaving behind an immense trail of data about ourselves. This year saw the arrival of several open-source efforts, such as the Locker Project and ThinkUp, that strive to give users better control over their personal social data.

And while better control and safeguards can offer some level of protection, it's clear that technology can always be cracked and the goals of data aggregators can shift. So, if digital data is and always will be a moving target, how does that shape our expectations for privacy? In Privacy and Big Data, published this year, co-authors Terence Craig and Mary Ludloff argued that we might be paying too much attention to concerns about "intrusions of privacy" and that instead we need to be thinking about better transparency with how governments and companies are using our data.

Open data's inflection point

Screenshot from the Open Knowledge Foundation's Open Government Data Map.

When it comes to better transparency, 2011 has been a good year for open data, with strong growth in the number of open data efforts. Canada, the U.K., France, the U.S., and Kenya were a few of the countries unveiling open data initiatives.

There were still plenty of open data challenges: budget cuts, for example, threatened the U.S. Data.gov initiative. And in his "state of open data 2011" talk, open data activist David Eaves pointed to the challenges of having different schemas and few standards, making it difficult for some datasets to be used across systems and jurisdictions.

Even with a number of open data "wins" at the government level, a recent survey of the data science community by EMC named the lack of open data as one of the obstacles that data scientists and business intelligence analysts said they faced. Just 22% of the former and 12% of the latter said that they "strongly believed" that the employees at their companies have the access they need to run experiments on data. Arguably, more open data efforts have spawned more interest and better understanding of what this can mean.

The demand for more open data has also spawned a demand for more tools. Importantly, these tools are beginning to be open to more than just data scientists or programmers. They include things like the visualization creator Visual.ly, the scraping tool ScraperWiki, and the data-sharing site BuzzData.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

