
September 01 2013

The World Top Incomes Database

Jean Abbiateci flagged on Facebook Piketty's enormous database on inequality and incomes by bracket, going back more than a century.

There has been a marked revival of interest in the study of the distribution of top incomes using tax data. Beginning with the research by Thomas Piketty (2001, 2003) on the long-run distribution of top incomes in France, a succession of studies has constructed top income share time series over the long run for more than twenty countries to date. These projects have generated a large volume of data, which are intended as a research resource for further analysis.

#statistiques #bases_de_donnée #database #inégalités #revenus #piketty

July 05 2013

Four short links: 5 July 2013

  1. Quantitative Analysis of the Full Bitcoin Transaction Graph (PDF) — We analyzed all these large transactions by following in detail the way these sums were accumulated and the way they were dispersed, and realized that almost all these large transactions were descendants of a single transaction which was carried out in November 2010. Finally, we noted that the subgraph which contains these large transactions along with their neighborhood has many strange looking structures which could be an attempt to conceal the existence and relationship between these transactions, but such an attempt can be foiled by following the money trail in a sufficiently persistent way. (via Alex Dong)
  2. Majority of Gamers Today Can’t Finish Level 1 of Super Mario Bros — Nintendo ran tests, and the president of Nintendo said in a talk: We watched the replay videos of how the gamers performed and saw that many did not understand simple concepts like bottomless pits. Around 70 percent died to the first Goomba. Another 50 percent died twice. Many thought the coins were enemies and tried to avoid them. Also, most of them did not use the run button. There were many other depressing things we noted but I can not remember them at the moment. (via Beta Knowledge)
  3. Bloat-Aware Design for Big Data Applications (PDF) — (1) merging and organizing related small data record objects into few large objects (e.g., byte buffers) instead of representing them explicitly as one-object-per-record, and (2) manipulating data by directly accessing buffers (e.g., at the byte chunk level as opposed to the object level). The central goal of this design paradigm is to bound the number of objects in the application, instead of making it grow proportionally with the cardinality of the input data. (via Ben Lorica)
  4. Poderopedia (Github) — originally designed for investigative journalists, the open src software allows you to create and manage entity profile pages that include: short bio or summary, sheet of connections, long newsworthy profiles, maps of connections of an entity, documents related to the entity, sources of all the information and news river with external news about the entity. See the announcement and website.
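The bloat-aware pattern in item 3 (pack many small records into one buffer and access fields by byte offset, so the object count stays constant) can be sketched in a few lines of Python; the record layout here is made up for illustration:

```python
import struct

# One-object-per-record means per-object header overhead and GC pressure.
# The bloat-aware alternative: pack fixed-width records into a single
# bytearray and read fields by offset, bounding the number of objects.

RECORD = struct.Struct("<i f")  # hypothetical record: (int id, float score)

def pack_records(records):
    """Pack an iterable of (id, score) tuples into one buffer."""
    buf = bytearray(RECORD.size * len(records))
    for i, (rid, score) in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, rid, score)
    return buf

def get_record(buf, i):
    """Read record i directly from the buffer, at the byte-chunk level."""
    return RECORD.unpack_from(buf, i * RECORD.size)

buf = pack_records([(1, 0.5), (2, 0.25), (3, 0.125)])
print(get_record(buf, 1))  # (2, 0.25)
```

However many records you load, the application holds one buffer object instead of one object per record, which is the paper's central goal.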

April 19 2013

Four short links: 19 April 2013

  1. Bruce Sterling on Disruption — If more computation, and more networking, was going to make the world prosperous, we’d be living in a prosperous world. And we’re not. Obviously we’re living in a Depression. Slow first 25% but then it takes fire and burns with the heat of a thousand Sun Microsystems flaming out. You must read this now.
  2. The Matasano Crypto Challenges (Maciej Ceglowski) — To my delight, though, I was able to get through the entire sequence. It took diligence, coffee, and a lot of graph paper, but the problems were tractable. And having completed them, I’ve become convinced that anyone whose job it is to run a production website should try them, particularly if you have no experience with application security. Since the challenges aren’t really documented anywhere, I wanted to describe what they’re like in the hopes of persuading busy people to take the plunge.
  3. Tachyon — a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. Berkeley-licensed open source.
  4. Jammit (GitHub) — an industrial strength asset packaging library for Rails, providing both the CSS and JavaScript concatenation and compression that you’d expect, as well as YUI Compressor, Closure Compiler, and UglifyJS compatibility, ahead-of-time gzipping, built-in JavaScript template support, and optional Data-URI / MHTML image and font embedding. (via Joseph Misiti)

November 23 2012

Four short links: 23 November 2012

  1. Trap Island — island on most maps doesn’t exist.
  2. Why I Work on Non-Partisan Tech (MySociety) — excellent essay. Obama won using big technology, but imagine if that effort, money, and technique were used to make things that were useful to the country. Political technology is not gov2.0.
  3. 3D Printing Patent Suits (MSNBC) — notable not just for incumbents keeping out low-cost competitors with patents, but also (as BoingBoing observed) Many of the key patents in 3D printing start expiring in 2013, and will continue to lapse through ’14 and ’15. Expect a big bang of 3D printer innovation, and massive price-drops, in the years to come. (via BoingBoing)
  4. GraphChi — can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google’s Pregel. GraphChi runs vertex-centric programs asynchronously (i.e., changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph.
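A toy sketch of the vertex-centric model item 4 describes, using the classic connected-components example (each vertex repeatedly adopts the smallest label among itself and its neighbors). The function names are mine; real GraphChi processes graph shards from disk rather than a Python dict:

```python
# Toy vertex-centric computation: label-propagation connected components.
# All names are hypothetical; this is the programming model, not GraphChi.

def vertex_centric_components(edges, num_vertices):
    neighbors = {v: set() for v in range(num_vertices)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    labels = list(range(num_vertices))  # each vertex starts as its own component
    changed = True
    while changed:
        changed = False
        for v in range(num_vertices):
            best = min([labels[v]] + [labels[u] for u in neighbors[v]])
            if best < labels[v]:
                labels[v] = best  # asynchronous: visible to later vertices this pass
                changed = True
    return labels

print(vertex_centric_components([(0, 1), (1, 2), (3, 4)], 5))
# two components: {0, 1, 2} labeled 0 and {3, 4} labeled 3
```

The asynchronous update (writing labels in place, mid-pass) is exactly the property the summary mentions: changes are immediately visible to subsequent computation, which typically speeds convergence.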

February 10 2012

Top stories: February 6-10, 2012

Here's a look at the top stories published across O'Reilly sites this week.

The NoSQL movement
A relational database is no longer the default choice. Mike Loukides charts the rise of the NoSQL movement and explains how to choose the right database for your application.

Jury to Eolas: Nobody owns the interactive web
A Texas jury has struck down a company's claim to ownership of the interactive web. Eolas, which has been suing technology companies for more than a decade, now faces the prospect of losing the patents.

It's time for a unified ebook format and the end of DRM
The music industry has shown that you need to offer consumers a universal format and content without rights restrictions. So when will publishers pay attention?

Business-government ties complicate cyber security
Is an attack on a U.S. business' network an attack on the U.S. itself? "Inside Cyber Warfare" author Jeffrey Carr discusses the intermingling of corporate and government interests in this interview.

Unstructured data is worth the effort when you've got the right tools
Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.

Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

Photo used with "Unstructured data" story: mess with graphviz.

November 15 2011

Helping educators find the right stuff

Education innovation will require scalable, national, open, interoperable systems that support data feedback loops. At the recent State Educational Technology Directors Association's (SETDA) Leadership Summit, the United States Department of Education launched the Learning Registry, a powerful step toward creating the ecosystem infrastructure that will enable such systems.

The Learning Registry addresses the problem of discoverability of education resources. There are countless repositories of fantastic educational content, from user-generated and curated sites to Open Education Resources to private sector publisher sites. Yet, with all this high-quality content available to teachers, it is still nearly impossible to find content to use with a particular lesson plan for a particular grade aligned to particular standards. Regrettably, it is often easier for a teacher to develop his own content than to find just the right thing on the Internet.

Schools, states, individuals, and professional communities have historically addressed this challenge by curating lists of content; rating and reviewing sites; and sharing their finds via websites, Twitter and other social media platforms. With aggregated sites to peruse, a teacher might increase his odds of finding that "just right" content, but it is still often a losing proposition. As an alternative, most educators will resort to Google, but as Secretary of Education Arne Duncan told the SETDA members, "Today's search engines do many things well, but they aren't designed to directly support teaching and learning. The Learning Registry aims to fix this problem." Aneesh Chopra, United States CTO, called the project the flagship open-government initiative for the Department of Education.

The Department of Education and the Department of Defense set out to solve the problem of discoverability, each contributing $1.3 million to the registry project. Steve Midgley, Deputy Director for the Office of Educational Technology pointed out, "We didn't build another portal — that would not be the proper role of the federal government." Instead, the proper role as Midgley envisioned it was to create infrastructure that would enable all stakeholders to share valuable information and resources in a non-centralized, open way.

In short, the Learning Registry has created open application programming interfaces (APIs) that allow publishers and others to quickly publish metadata and paradata about their content. For instance, the Smithsonian could assert digitally that a certain piece of video is intended for ages 5-7 in natural science, aligned with specific state standards. Software developers could include algorithms in lesson-planning software systems that extract, sign, and send information, such as: "A third grade teacher used this video in a lesson plan on the bridges of Portland." Browser developers could write code to include this data in search results and to increase result relevance based on ratings and reputations from trusted sources. In fact, Midgley showed the SETDA audience a prototype browser plug-in that did just that.
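For illustration only, a paradata assertion of the kind described above might be assembled like this; the field names are hypothetical and are not the Learning Registry's actual envelope format:

```python
import json

# Hypothetical shape only: the real Learning Registry defines its own
# envelope format. This just illustrates the kind of paradata assertion
# described above ("a third grade teacher used this video in a lesson plan").

def make_paradata(actor, verb, resource_url, context):
    return {
        "actor": actor,          # who did it, e.g. a third grade teacher
        "verb": verb,            # what they did, e.g. used in a lesson plan
        "object": resource_url,  # the resource the assertion is about
        "context": context,      # grade level, topic, standards alignment
    }

assertion = make_paradata(
    "third grade teacher",
    "used in lesson plan",
    "http://example.org/videos/bridges-of-portland",
    {"grade": 3, "topic": "bridges of Portland"},
)
print(json.dumps(assertion, indent=2))
```

The point of the open APIs is that software can emit records like this as a side effect of normal work, and other software can aggregate them to improve search relevance.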

The virtue of this system comes from the platform thinking behind its design — an open communication system versus a portal — and from the value it provides to users from the very beginning. In the early days, improved discoverability of relevant content is a boon to both the teacher who discovers it and the content owner who publishes it. The APIs are structured in such a way that well-implemented code will collect valuable information about how the content is used as a side effect of educators, parents, and others simply doing their daily work. Over time, a body of metadata and paradata will emerge that identifies educational content; detailed data about how it has been used and interacted with; as well as rating, reputation and other information that can feed interesting new analytics, visualizations, and meaningful presentation of information to teachers, parents, researchers, administrators and developers.

Midgley called for innovative developers and entrepreneurs to take advantage of this enabling system for data collection in the education market. As the simple uses begin to drive use cases that shed increasingly rich data, there will be new opportunities to build businesses based on analytics and the meaningful presentation of rich new data to teachers, parents, students, and others who have an interest in teaching and learning.

I am delighted and intrigued to see the Department of Education leading with infrastructure over point solutions. As Richard Culatta, Education Fellow in Senator Patty Murray's office, said to the audience, "When common frameworks are put in place, it allows smart people to do really creative things."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


October 06 2011

Oracle's NoSQL

Oracle's turn-about announcement of a NoSQL product wasn't really surprising. When Oracle spends time and effort putting down a technology, you can bet that it's secretly impressed, and trying to re-implement it in its back room. So Oracle's paper "Debunking the NoSQL Hype" should really have been read as a backhanded product announcement. (By the way, don't click that link; the paper appears to have been taken down. Surprise.)

I have to agree with DataStax and other developers in the NoSQL movement: Oracle's announcement is a validation, more than anything else. It's certainly a validation of NoSQL, and it's worth thinking about exactly what that means. It's long been clear that NoSQL isn't about any particular architecture. When databases as fundamentally different as MongoDB, Cassandra, and Neo4J can all be legitimately characterized as "NoSQL," it's clear that NoSQL isn't a "thing." We've become accustomed to talking about the NoSQL "movement," but what does that mean?

As Justin Sheehy, CTO of Basho Technologies, said, the NoSQL movement isn't about any particular architecture, but about architectural choice. For as long as I can remember, application developers have debated software architecture choices with gusto. There were many choices for the front end; many choices for middleware; and careers rose and fell based on those choices. Somewhere along the way, "Software Architect" even became a job title. But for the backend, for the past 20 years there has really been only one choice: a relational database that looks a lot like Oracle (or MySQL, if you'd prefer). And choosing between Oracle, MySQL, PostgreSQL, or some other relational database just isn't that big a choice.

Did we really believe that one size fits all for database problems? If we ever did, the last three years have made it clear that the model was broken. I've got nothing against SQL (well, actually, I do, but that's purely personal), and I'm willing to admit that relational databases solve many, maybe even most, of the database problems out there. But just as it's clear that the universe is a more complicated place than physicists thought it was in 1990, it's also clear that there are data problems that don't fit 20-year-old models. NoSQL doesn't use any particular model for storing data; it represents the ability to think about and choose your data architecture. It's important to see Oracle recognize this. The company's announcement isn't just a validation of key-value stores, but of the entire discussion of database architecture.

Of course, there's more to the announcement than NoSQL. Oracle is selling a big data appliance: an integrated package including Hadoop and R. The software is available standalone, though Oracle clearly hopes that the package will be running on its Exadata Database hardware (or equivalent), which is an impressive monster of a database machine (though I agree with Mike Driscoll that machines like these are on the wrong side of history). There are other bits and pieces to solve ETL and other integration problems. And it's fair to say that Oracle's announcement validates more than just NoSQL; it validates the "startup stack" or "data stack" that we've seen in many of the most exciting new businesses that we watch. Hadoop plus a non-relational database (often MongoDB, HBase, or Cassandra), with R as an analytics platform, is a powerful combination. If nothing else, Oracle has given more conservative (and well-funded) enterprises permission to make the architectural decisions that the startups have been making all along, and to work with data that goes beyond what traditional data warehouses and BI technologies allow. That's a good move, and it grows the pie for everyone.

I don't think many young companies will be tempted to invest millions in Oracle products. Some larger enterprises should, and will, question whether investing in Oracle products is wise when there are much less expensive solutions. And I am sure that Oracle will take its share of the well-funded enterprise business. It's a win all around.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR


June 29 2011

What CouchDB can do for HTML5, web apps and mobile

CouchApps are JavaScript and HTML5 applications served directly from the document-oriented database CouchDB. In the following interview, Found Line co-founder and OSCON speaker Bradley Holt (@BradleyHolt) talks about the utility of CouchApps, what CouchDB offers web developers, and how the database works with HTML5.

How do CouchApps work?

Bradley Holt: CouchApps are web applications built using CouchDB, JavaScript, and HTML5. They skip the middle tier and allow a web application to talk directly to the database — the CouchDB database could even be running on the end-user's machine or Android / iOS device.

What are the benefits of building CouchApps?

Bradley Holt: Streamlining of your codebase (no middle tier), replication, the ability to deploy/replicate an application along with its data, and the side benefits that come with going "with the grain" of how the web works are some of the benefits of building CouchApps.

To be perfectly honest though, I don't think CouchApps are quite ready for widespread developer adoption yet. The biggest impediment is tooling. The current set of development tools need refinement, and the process of building a CouchApp can be a bit difficult at times. The term "CouchApp" can also have many different meanings. That said, the benefits of CouchApps are compelling and the tools will catch up soon.

OSCON JavaScript and HTML5 Track — Discover the new power offered by HTML5, and understand JavaScript's imminent colonization of server-side technology.

Save 20% on registration with the code OS11RAD

HTML5 addresses a lot of storage issues. Where does CouchDB fit in?

Bradley Holt: The HTML5 Web Storage specification describes an API for persistent storage of key/value pairs locally within a user's web browser. Unlike previous attempts at browser local storage specifications, the HTML5 storage specification has achieved significant cross-browser support.

One thing that the HTML5 Web Storage API lacks, however, is a means of querying for values by anything other than a specific key. You can't query across a set of keys or values. IndexedDB addresses this and allows for indexed database queries, but IndexedDB is not currently part of the HTML5 specification and is only implemented in a limited number of browsers.

If you need more than just key/value storage, then you have to look outside of the HTML5 specification. Like HTML5 Web Storage, CouchDB stores key/value pairs. In CouchDB, the key part of the key/value pair is a document ID and the value is a JSON object representing a single document. Unlike HTML5 Web Storage, CouchDB provides a means of indexing and querying data using MapReduce "views." Since CouchDB is accessed using a RESTful HTTP API and stores documents as JSON objects, it is easy to work with CouchDB directly from an HTML5/JavaScript web application.
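CouchDB map functions are actually written in JavaScript, but the indexing idea Holt describes can be sketched in Python: a map function emits (key, value) rows per document, and the sorted rows become a queryable view. All document contents below are invented:

```python
# Sketch of a CouchDB-style map view (real views are JavaScript): each
# document yields (key, value) rows, and the sorted rows form an index
# you can query by key instead of scanning every document.

docs = {
    "doc1": {"tag": "nosql", "title": "Oracle's NoSQL"},
    "doc2": {"tag": "hadoop", "title": "Hadoop explained"},
    "doc3": {"tag": "nosql", "title": "CouchApps"},
}

def map_by_tag(doc_id, doc):
    """Emit one (tag, title) row per document, like a map function's emit()."""
    yield (doc["tag"], doc["title"])

def build_view(docs, map_fn):
    rows = []
    for doc_id, doc in docs.items():
        for key, value in map_fn(doc_id, doc):
            rows.append({"id": doc_id, "key": key, "value": value})
    return sorted(rows, key=lambda r: r["key"])  # views are sorted by key

view = build_view(docs, map_by_tag)
nosql_titles = [r["value"] for r in view if r["key"] == "nosql"]
print(nosql_titles)
```

This is the querying ability Web Storage lacks: once the view exists, fetching every row for a given key range is cheap.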

How does CouchDB's replication feature work with HTML5?

Bradley Holt: Again, CouchDB is not directly related to the HTML5 specification, but CouchDB's replication feature creates unique opportunities for CouchApps built using JavaScript and HTML5 (or any application built using CouchDB, for that matter).

I've heard J. Chris Anderson use the term "ground computing" as a counterpoint to "cloud computing." The idea is to store a user's data as close to that user as possible — and you can't get any closer than a user's own computer or mobile device! CouchDB's replication feature makes this possible. Data that is relevant to a particular user can be copied to and from that user's own computer or mobile device using CouchDB's incremental replication. This allows for faster access for the user (since his or her application is hitting a local database), offline access, data portability, and potentially more control over his or her own data.

Now that CouchDB runs on mobile devices, how do you see it shaping mobile app development?

Bradley Holt: While Android is a great platform, the biggest channel for mobile applications is Apple's iOS. CouchDB has been available on Android for a while now, but it is relatively new to iOS. Now that CouchDB can be used to build iPhone/iPad applications, we will most certainly see many more mobile applications built using CouchDB in order to take advantage of CouchDB's unique features — especially replication.

The big question is, will these applications be built as native applications or will they be built as CouchApps? I don't know the answer, but I'd like to see more of these applications built on the CouchApps side. With CouchApps, developers can more easily port their applications across platforms, and they can use existing HTML5, JavaScript, and CSS skill sets.

This interview was edited and condensed.


March 14 2011

Four short links: 14 March 2011

  1. A History of the Future in 100 Objects (Kickstarter) -- blog+podcast+video+book project, to have future historians tell the story of our century in 100 objects. The BBC show that inspired it was brilliant, and I rather suspect this will be too. It's a clever way to tell a story of the future (his hardest problem will be creating a single coherent narrative for the 21st century). What are the 100 objects that future historians will use to sum up our century? 'Smart drugs' that change the way we think? A fragment from a suitcase nuke detonated in Shanghai? A wedding ring between a human and an AI? The world's most expensive glass of water, returned from a private mission to an asteroid? (via RIG London weekly notes)
  2. Entrepreneurs Who Create Value vs Entrepreneurs Who Lock Up Value (Andy Kessler) -- distinguishes between "political entrepreneurs" who leverage their political power to own something and then overcharge or tax the crap out of the rest of us to use it vs "market entrepreneurs" who recognize the price-to-value gap and jump in. Ignoring legislation, they innovate, disintermediate, compete, stay up all night coding, and offer something better and cheaper until the market starts to shift. My attention was particularly caught by for every stroke of the pen, for every piece of legislation, for every paid-off congressman, there now exists a price umbrella that overvalues what he or any political entrepreneur is doing. (via Bryce Roberts)
  3. HarperCollins Caps eBook Loans -- The publisher wants to sell libraries DRMed ebooks that will self-destruct after 26 loans. Public libraries have always served and continue to serve those people who can't access information on the purchase market. Jackass moves like these prevent libraries from serving those people in the future that we hope will come soon: the future where digital is default and print is premium. That premium may well be "the tentacles of soulless bottom-dwelling coprocephalic publishers can't digitally destroy your purchase". It's worth noting that O'Reilly offers DRM-free PDFs of the books they publish, including mine. Own what you buy lest it own you. (via BoingBoing and many astonished library sources)
  4. MAD Lib -- BSD-licensed open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. (via Ted Leung)

January 12 2011

Hadoop: What it is, how it works, and what it can do

Hadoop gets a lot of buzz these days in database and content management circles, but many people in the industry still don't really know what it is or how it can best be applied.

Cloudera CEO and Strata speaker Mike Olson, whose company offers an enterprise distribution of Hadoop and contributes to the project, discusses Hadoop's background and its applications in the following interview.

Where did Hadoop come from?

Mike Olson: The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.

What problems can Hadoop solve?

Mike Olson: The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD

How is Hadoop architected?

Mike Olson: Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies of each piece are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.

In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
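That map-then-reduce flow can be sketched in a single process. Here is a word count, with each "chunk" standing in for one server's piece of the data:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce flow described above: "map"
# runs on each server's piece of the data, then the emitted pairs are
# gathered and "reduced" into a single result set.

def map_phase(chunk):
    """Each server emits (word, 1) pairs for its own piece of the data."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Combine the emitted pairs into one result set."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big", "data works"]  # one chunk per server
emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(emitted))  # {'big': 2, 'data': 2, 'works': 1}
```

In a real cluster the map calls run on different machines against local data, and the shuffle that routes pairs to reducers happens over the network; the logic is the same.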

Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.

At this point, do companies need to develop their own Hadoop applications?

Mike Olson: It's fair to say that a current Hadoop adopter must be more sophisticated than a relational database adopter. There are not that many "shrink wrapped" applications today that you can get right out of the box and run on your Hadoop processor. It's similar to the early '80s when Ingres and IBM were selling their database engines and people often had to write applications locally to operate on the data.

That said, you can develop applications in a lot of different languages that run on the Hadoop framework. The developer tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools so that they're able to talk to data stored in a Hadoop cluster using Hadoop APIs. There are specialist vendors that are up and coming, and there are also a couple of general process query tools: a version of SQL that lets you interact with data stored on a Hadoop cluster, and Pig, a language developed by Yahoo that allows for data flow and data transformation operations on a Hadoop cluster.

Hadoop's deployment is a bit tricky at this stage, but the vendors are moving quickly to create applications that solve these problems. I expect to see more of the shrink-wrapped apps appearing over the next couple of years.

Where do you stand in the SQL vs NoSQL debate?

Mike Olson: I'm a deep believer in relational databases and in SQL. I think the language is awesome and the products are incredible.

I hate the term "NoSQL." It was invented to create cachet around a bunch of different projects, each of which has different properties and behaves in different ways. The real question is, what problems are you solving? That's what matters to users.


January 06 2011

Big data faster: A conversation with Bradford Stephens

To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with big data and analytics. Today, we hear from Bradford Stephens, founder of Drawn to Scale.

Drawn to Scale is a database platform that works with large data sets. Stephens describes its focus as slightly different from that of other big data tools: "Other tools out there concentrate on doing complex things with your data in seconds to minutes. We really concentrate on doing simple things with your data in milliseconds."

Stephens calls such speed "user time" and he credits Drawn to Scale's performance to its indexing system working in parallel with backend batch tools. Like other big data tools, Drawn to Scale uses MapReduce and Hadoop for batch processing on the back end. But on the front end, a series of secondary indices on top of the storage layer speed up retrieval. "We find that when you index data in the manner in which you wish to use it, it's basically one single call to the disk to access it," Stephens says. "So it can be extremely fast."
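The "index data in the manner in which you wish to use it" point can be sketched with a plain secondary index: build the index at write time, and a query becomes one lookup instead of a scan. Data and field names below are invented:

```python
# Sketch of a secondary index: pay the indexing cost up front so that
# query time is a single dictionary lookup, not a scan over all rows.

rows = [
    {"id": 1, "user": "ada", "event": "click"},
    {"id": 2, "user": "bob", "event": "view"},
    {"id": 3, "user": "ada", "event": "view"},
]

# Build the index once, at write time.
by_user = {}
for row in rows:
    by_user.setdefault(row["user"], []).append(row["id"])

# Query time: one lookup, analogous to "one single call to the disk".
print(by_user["ada"])  # [1, 3]
```

A system like the one described maintains such indexes on top of its storage layer, so the millisecond-scale "user time" queries never touch the batch machinery.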

Big data tools and applications will be examined at the Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Drawn to Scale's customers include organizations working with analytics, in social media, in mobile ad targeting and delivery, and also organizations with large arrays of sensor networks. While he expects to see some consolidation on the commercial side ("I see a lot of vendors out there doing similar things"), on the open source side he expects to see a proliferation of tools available in areas such as geo data and managing time series. "People have some very specific requirements that they're going to cook up in open source."

You'll find the full interview in the following video:

December 21 2010

The growing importance of data journalism

One of the themes from News Foo that continues to resonate with me is the importance of data journalism. That skillset has received renewed attention this winter after Tim Berners-Lee called analyzing data the future of journalism.

When you look at data journalism and the big picture, as USA Today's Anthony DeBarros did at his blog in November, it's clear the recent suite of technologies is part of a continuum of technologically enhanced storytelling that traces back to computer-assisted reporting (CAR).

As DeBarros pointed out, the message of CAR "was about finding stories and using simple tools to do it: spreadsheets, databases, maps, stats," like Microsoft Access, Excel, SPSS, and SQL Server. That's just as true today, even if data journalists now have powerful new options: scraping data from the web with tools like ScraperWiki and Needlebase, scripting with Perl, Python, or Ruby, and building with MySQL and Django.
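For a taste of what scraping involves, here is a minimal standard-library Python sketch that pulls rows out of an HTML table. The page is an invented toy example, not ScraperWiki or Needlebase themselves:

```python
from html.parser import HTMLParser

# A toy fragment standing in for a fetched government web page.
HTML = """
<table>
  <tr><td>Agency</td><td>Budget</td></tr>
  <tr><td>Parks</td><td>1200</td></tr>
  <tr><td>Roads</td><td>3400</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects the text of each <td> into per-<tr> rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(HTML)
print(parser.rows)  # [['Agency', 'Budget'], ['Parks', '1200'], ['Roads', '3400']]
```

Real scraping adds fetching, pagination, and messier markup, but turning page structure into rows and columns is the core move.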

Understanding the history of computer-assisted reporting is key to putting new tools in the proper context. "We use these tools to find and tell stories," DeBarros wrote. "We use them like we use a telephone. The story is still the thing."

The data journalism session at News Foo took place on the same day civic developers were participating in a global open data hackathon and the New York Times hosted its Times Open Hack Day. Many developers at contests like these are interested in working with open data, but the conversation at News Foo showed how much further government entities need to go to deliver on the promise open data holds for the future of journalism.

The issues that came up are significant. Government data is often "dirty," with missing metadata or incorrect fields. Journalists have to validate and clean up datasets with tools like Google Refine. ProPublica's Recovery Tracker for stimulus data and projects is one of the best examples of the practice in action.
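The kind of cleanup a tool like Google Refine automates can be illustrated with a short Python sketch. The records and normalization rules here are hypothetical, purely for flavor:

```python
# Invented messy records of the sort found in raw government datasets:
# inconsistent agency names, mixed currency formatting, missing values.
records = [
    {"agency": "Dept. of Transportation ", "amount": "$1,200"},
    {"agency": "dept of transportation", "amount": "1200.00"},
    {"agency": "Dept. of Transportation", "amount": "N/A"},
]

def clean(rec):
    # Normalize the agency name so variants collapse to one spelling.
    name = (rec["agency"].strip().lower()
            .replace(".", "")
            .replace("dept", "department"))
    # Normalize the amount; flag unparseable values instead of dropping them.
    raw = rec["amount"].strip().lstrip("$").replace(",", "")
    try:
        value = float(raw)
    except ValueError:
        value = None  # mark for manual review
    return {"agency": name, "amount": value}

cleaned = [clean(r) for r in records]
```

After cleaning, all three rows share one canonical agency name and the unusable amount is explicitly flagged rather than silently miscounted.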

A recent gold standard for data journalism is the Pulitzer-Prize winning Toxic Waters project from the New York Times. The scale of that project makes it a difficult act to follow, though Times developers are working hard with nifty projects like Inside Congress.

You can see a visualization of the Toxic Waters project and other examples of data journalism in this Ignite presentation from News Foo.

At ProPublica, the data journalism team is conscious of deep linking into news applications, with the perspective that the visualizations produced from such apps are themselves a form of narrative journalism. With great data visualizations, readers can find their own way and interrogate the data themselves. Moreover, distinctions between a news "story" and a news "app" are dissolving as readers increasingly consume media on mobile devices and tablets.

One approach to providing useful context is ProPublica's "Ion" format, where a project like "Eye on the Stimulus" is a hybrid between a blog and an application. On one side of the web page there's a news river; on the other, there are entry points into the data itself. The challenge with this approach is that a media outlet needs alignment between staff and story: a reporter has to be filing every day on a running story that's data sensitive.


The data journalism News Foo session featured a virtual component, bringing CityCamp founder Kevin Curry, Data.gov evangelist Jeanne Holm, and Reynolds fellow David Herzog together with News Foo participants to talk about the value propositions for open government data and data journalism.

As the recent open data report showed, developers are not finding the government data they need or want. If other entrepreneurs are to follow the lead of BrightScope, open government datasets will need to be more relevant to business. The feedback for Data.gov and other government data repositories was clear: more data, better data, and cleaner data, please.

Improving media access to data at the county- or state-level of government has structural barriers because of growing budget crises in statehouses around the United States. As Jeanne Holm observed during the News Foo session, open government initiatives will likely be done in a zero-sum budget environment in 2011. Officials have to make them sustainable and affordable.

There are some areas where the federal government can help. Holm said Data.gov has created cloud hosting that can be shared with state, local, or tribal governments. Data.gov is also rolling out a set of tools that will help with data conversion, optical character recognition, and, down the road, better tools for structured data.

Those resources could make government data more readily available and accessible to the media. Kevin Curry said that data catalogs are popping up everywhere. He pointed to CivicApps in Portland, Ore., where Max Ogden's work on coding the middleware for open government led to translating government data into more useful forms for developers.

Data journalists also run into government's cultural challenges. It can be hard to find public information officers willing or able to address substantive questions about data. Holm said Data.gov may post more contact information online and create discussions around each dataset. That kind of information is a good start for addressing data concerns at the federal level, but fostering useful connections between journalists and data will still require improvement and effort.


December 20 2010

Strata Gems: Turn MySQL into blazing fast NoSQL

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: What your inbox knows.

The trend for NoSQL stores such as memcache for fast key-value storage should give us pause for thought: what have regular database vendors been doing all this time? An important new project, HandlerSocket, seeks to leverage MySQL's raw speed for key-value storage.

NoSQL databases offer fast key-value storage for use in backing web applications, but years of work on regular relational databases has hardly ignored performance. The main performance hit with regular databases is in interpreting queries.
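That query-interpretation overhead is easy to see even in miniature. The following Python sketch uses SQLite and an in-memory dict as rough stand-ins (not MySQL or HandlerSocket themselves) to contrast a lookup that goes through the SQL layer with a raw key-value read; absolute numbers are machine-dependent, only the gap matters:

```python
import sqlite3
import timeit

# A small table of key-value pairs behind a SQL interface.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [(i, f"value-{i}") for i in range(10_000)])

# The same data as a direct key-value structure.
store = {i: f"value-{i}" for i in range(10_000)}

sql_time = timeit.timeit(
    lambda: conn.execute("SELECT v FROM kv WHERE k = ?", (4242,)).fetchone(),
    number=10_000)
kv_time = timeit.timeit(lambda: store[4242], number=10_000)

print(f"SQL path: {sql_time:.3f}s, key-value path: {kv_time:.3f}s")
```

Both paths return the same answer; the difference is what happens between the request and the storage, which is exactly the layer HandlerSocket cuts out.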

HandlerSocket is a MySQL server plugin that interfaces directly with the InnoDB storage engine. Yoshinori Matsunobu, one of HandlerSocket's creators at Japanese internet and gaming company DeNA, reports over 750,000 queries per second on commodity server hardware: compared with 420,000 using memcache, and 105,000 using regular SQL access to MySQL. Furthermore, since the underlying InnoDB storage is used, HandlerSocket offers a NoSQL-type interface that doesn't have to trade away ACID compliance.

With the additional benefits of being able to use the mature MySQL tool ecosystem for monitoring, replication and administration, HandlerSocket presents a compelling case for using a single database system. As the HandlerSocket protocol can be used on the same database and tables used for regular SQL access, the problems of inconsistency and replication created by multiple tiers of databases can be mitigated.

HandlerSocket has now been integrated into Percona's XtraDB, an enhanced version of the InnoDB storage engine for MySQL. You can also compile and install HandlerSocket yourself alongside MySQL.

December 16 2010

Strata Gems: Who needs disks anyway?

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Kinect democratizes augmented reality.

Today's databases are designed for the spinning platter of the hard disk. They take into account that the slowest part of reading data is seeking: physically getting the read head to the part of the disk it needs to be in. But the emergence of cost effective solid state drives (SSD) is changing all those assumptions.

Over the course of 2010, systems designers have been realizing the benefits of using SSDs in data centers, with major IT vendors and companies adopting them. Drivers for SSD adoption include lower power consumption and greater physical robustness. The robustness is a key factor when creating container-based modular data centers.

That still leaves the problem of software optimized for spinning disks. Enter RethinkDB, a project to create a storage engine for the MySQL database that is optimized for SSDs.

As well as taking advantage of the extra speed SSDs can offer, RethinkDB emphasizes consistency, achieved by using append-only writes. Additionally, the team is writing its storage engine with modern web application access patterns in mind: many concurrent reads and few writes.
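The append-only idea can be sketched in a few lines of Python. This is a toy log format invented here to show the principle, not RethinkDB's actual on-disk layout:

```python
import io
import json

class AppendOnlyStore:
    """Writes never overwrite: every put appends a record to the log,
    and the latest record for a key wins when the log is replayed."""

    def __init__(self, fileobj):
        self.log = fileobj
        self.index = {}  # key -> latest value

    def put(self, key, value):
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.index[key] = value

    def get(self, key):
        return self.index.get(key)

    @classmethod
    def replay(cls, fileobj):
        # Crash recovery: rebuild the in-memory index from the log.
        store = cls(fileobj)
        fileobj.seek(0)
        for line in fileobj:
            entry = json.loads(line)
            store.index[entry["k"]] = entry["v"]
        fileobj.seek(0, io.SEEK_END)  # resume appending at the end
        return store

store = AppendOnlyStore(io.StringIO())
store.put("a", 1)
store.put("a", 2)          # appends a new record; the old one stays in the log
recovered = AppendOnlyStore.replay(store.log)
print(recovered.get("a"))  # 2
```

Because nothing is modified in place, a crash mid-write can only lose the last record, never corrupt earlier ones, and sequential appends suit both spinning disks and SSDs.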

The smartest aspect of what RethinkDB is doing, however, is packaging the product as a MySQL storage engine, minimizing barriers to adoption. The project is in rapid development, with binary-only downloads available from its website. Definitely a project to watch as it matures over the course of the next year.

December 15 2010

Four short links: 15 December 2010

  1. Dremel (PDF) -- paper on the Dremel distributed nested column-store database developed at Google. Interesting beyond the technology is the list of uses, which includes tracking install data for applications on Android Market; crash reporting from Google products; OCR results from Google Books; spam analysis; debugging map tiles. (via Greg Linden)
  2. Conversational UI: A Short Reading List -- it can be difficult to build a text user interface to a bot because there's not a great body of useful literature around textual UIs the way there is around GUIs. This great list of pointers goes a long way to solving that problem.
  3. Sustainable Education (YouTube) -- Watch this clip from the New Zealand Open Source Awards. Mark Osborne, Deputy Principal from Albany Senior High School, talks about the software choices at their school not because it's right for technology but because it's right for the students. Very powerful.
  4. What Font Should I Use? -- design life support for the terminally tasteless like myself. (via Hacker News)

December 14 2010

Big data, but with a familiar face

To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with big data and analytics. Today, we have a brief chat with Martin Hall, co-founder, president, and CEO of Karmasphere.

Karmasphere is one of several companies shipping commercial tools that make Big Data more accessible to developers and analysts. Hall says the company's products focus on making the data accessible by integrating with tools and languages familiar to developers — like SQL.

"We're focused on providing a new kind of software for working with Big Data stored in Hadoop clusters. In particular, tools for developers and analysts, and doing it in such a way that they get familiar tools and familiar environments and can quickly be very productive analyzing and transforming data stored in Hadoop clusters."

Integrating big data into business will be discussed at the Executive Summit at the upcoming Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Karmasphere Studio is the company's main product for developers. It's a graphical interface for programming and debugging MapReduce jobs, and it integrates with IDEs like Eclipse and NetBeans. The company recently announced Karmasphere Analyst, which offers a familiar SQL interface for querying Hadoop clusters.
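For context, the classic word-count job gives a feel for the kind of MapReduce program a tool like Karmasphere Studio is built to develop and debug. Here it is sketched as a local, single-process Python stand-in, not actual Hadoop code:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in a line of input.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the counts for one word.
    return word, sum(counts)

def run_job(lines):
    # Shuffle phase: group mapper output by key.
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    # Apply the reducer to each group.
    return dict(reducer(k, v) for k, v in shuffled.items())

print(run_job(["big data big tools", "data everywhere"]))
# {'big': 2, 'data': 2, 'tools': 1, 'everywhere': 1}
```

On a real cluster the map, shuffle, and reduce phases run distributed across machines, which is precisely where graphical debugging of job progress earns its keep.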

Hall says businesses typically dip their toe into Big Data with a small R&D cluster. "Once they see success with that, they deploy it into production. Once they have it in production, they're looking to connect it with other data sources."

Over the past 18 months, customers who've reached that threshold have been asking Karmasphere for more and better visualization tools, not only at the front end where decision-makers need insights, but for developers who need "more ability to see what's going on in the cluster, to see the progress of their jobs, to analyze and debug what's going on." Hall says they're working on more hook-ins with existing visualization packages.

"We don't expect people who are embracing Hadoop to have to sweep away everything they've invested in, in terms of skill sets, hardware, or software," Hall says. "It's an integration story."

You'll find the full interview in the following video:

October 21 2010

Four short links: 21 October 2010

  1. Using MySQL as NoSQL -- 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients.
  2. Making an SLR Camera from Scratch -- amazing piece of hardware devotion.
  3. Mac App Store Guidelines -- Apple announced an app store for the Macintosh, similar to its app store for iPhones and iPads. "Mac App" no longer means a generic "program"; it has a new and specific meaning: a program that must be installed through the App Store and which has limited functionality (only one can run at a time, it's full-screen, etc.). The list of guidelines for what kinds of programs you can't sell through the App Store is interesting. Many have good reasons to be there, but the ban on any app that "creates a store inside itself for selling or distributing other software (i.e., an audio plug-in store in an audio app)" is pure greed. Some are afeared that the next step is to make the App Store the only way to install apps on a Mac, a move that would drive me away. It would be a sad day for Mac-lovers if Microsoft were to be the more open solution than Apple. cf the Owner's Manifesto.
  4. Privacy Aspects of Data Mining -- CFP for an IEEE workshop in December. (via jschneider on Twitter)

July 20 2010

Four short links: 20 July 2010

  1. Dangerous Prototypes -- "a new open source hardware project every month". Sample project: Flash Destroyer, which writes and verifies EEPROM chips until they blow out.
  2. Wabit -- GPLv3 reporting tool.
  3. Because No Respectable MBA Programme Would Admit Me (Mike Shaver) -- excellent book recommendations.
  4. The Most Prescient Footnote Ever (David Pennock) -- In footnote 14 of Chapter 5 (p. 228) of Graham's classic Hackers and Painters, published in 2004, Graham asks "If the Mac was so great, why did it lose?" His explanation ends with this caveat, in parentheses: "And it hasn't lost yet. If Apple were to grow the iPod into a cell phone with a web browser, Microsoft would be in big trouble."
