
March 08 2012

Strata Week: Profiling data journalists

Here are a few of the data stories that caught my attention this week.

Profiling data journalists

Over the past week, O'Reilly's Alex Howard has profiled a number of practicing data journalists, following up on the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. Howard argues that data journalism has enormous importance, but "given the reality that those practicing data journalism remain a tiny percentage of the world's media, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact."

Howard's profiles include:

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.

Surveying data marketplaces

Edd Dumbill takes a look at data marketplaces, the online platforms that host data from various publishers and offer it for sale to consumers. Dumbill compares four of the most mature data marketplaces — Infochimps, Factual, Windows Azure Data Marketplace, and DataMarket — and examines their different approaches and offerings.

Dumbill says marketplaces like these are useful in three ways:

"First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume."

Analyzing sports stats

The Atlantic's Dashiell Bennett examines the MIT Sloan Sports Analytics Conference, a "festival of sports statistics" that has grown over the past six years from 175 attendees to more than 2,200.

Bennett writes:

"For a sports conference, the event is noticeably athlete-free. While a couple of token pros do occasionally appear as panel guests, this is about the people behind the scenes — those who are trying to figure out how to pick those athletes for their team, how to use them on the field, and how much to pay them without looking like a fool. General managers and team owners are the stars of this show ... The difference between them and the CEOs of most companies is that the sports guys have better data about their employees ... and a lot of their customers have it memorized."

Got data news?

Feel free to email me.


November 30 2011

November 28 2011

Four short links: 28 November 2011

  1. Twine (Kickstarter) -- modular sensors with connectivity, programmable in If This Then That style. (via TechCrunch)
  2. Small Sample Sizes Lead to High Margins of Error -- a reminder that all the stats in the world won't help you when you don't have enough data to meaningfully analyse.
  3. Yahoo! Cocktails -- somehow I missed this announcement of a JavaScript front-and-back-end dev environment from Yahoo!, which they say will be open sourced 1Q2012. Until then it's PRware, but I like that people are continuing to find new ways to improve the experience of building web applications. A Jobsian sense of elegance, ease, and perfection does not underlie the current web development experience.
  4. UK Govt To Help Businesses Fight Cybercrime (Guardian) -- I view this as a good thing, even though the conspiracy nut in me says that it's a step along the path that ends with the spy agency committing cybercrime to assist businesses.
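
To make the small-sample point in link 2 concrete, here is a minimal Python sketch. It assumes a simple random sample and the usual normal approximation for a proportion; the function name, the worst-case proportion of 0.5, and the 95% z-value of 1.96 are illustrative choices, not taken from the linked article.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n).
    z=1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# The margin shrinks with the square root of the sample size,
# so ten times the data buys only about a 3x tighter estimate.
for n in (10, 100, 1000, 10000):
    print(f"n={n:>6}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

The square root is the catch: halving the margin of error takes four times as many observations.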

October 26 2011

Four short links: 26 October 2011

  1. CPAN Turns 0x10 -- sixteenth anniversary of the creation of the Comprehensive Perl Archive Network. Now holds 480k objects.
  2. Subtext -- social bookreading by adding chat, links, etc. to a book. I haven't tried the implementation yet but I've wanted this for years. (Just haven't wanted to jump into the cesspool of rights negotiations enough to actually build it :-) (via David Eagleman)
  3. Questions to Ask about Election Polls -- information to help you critically consume data analysis. (via Rachel Cunliffe)
  4. Technologies, Potential, and Implications of Additive Manufacturing (PDF) -- AM is a group of emerging technologies that create objects from the bottom-up by adding material one cross-sectional layer at a time. [...] Ultimately, AM has the potential to be as disruptive as the personal computer and the internet. The digitization of physical artifacts allows for global sharing and distribution of designed solutions. It enables crowd-sourced design (and individual fabrication) of physical hardware. It lowers the barriers to manufacturing, and allows everyone to become an entrepreneur. (via Bruce Sterling)

September 07 2011

Four short links: 7 September 2011

  1. Comparing Link Attention (Bitly) -- Twitter, Facebook, and direct (email/IM/etc) have remarkably similar patterns of decay of interest. (via Hilary Mason)
  2. Three Ages of Google -- from batch, to scaling through datacenters, and finally now to techniques for real-time scaling. Of interest to everyone interested in low-latency high-throughput transactions. Datacenters have the diameter of a microsecond, yet we are still using entire stacks designed for WANs. Real-time requires low and bounded latencies and our stacks can't provide low latency at scale. We need to fix this problem and towards this end Luiz sets out a research agenda, targeting problems that need to be solved. (via Tim O'Reilly)
  3. eReaders and eBooks (Luke Wroblewski) -- many eye-opening facts. In 2010 Amazon sold 115 Kindle books for every 100 paperback books. 65% of eReader owners use them in bed, in fact 37% of device usage is in bed.
  4. VT220 on a Mac -- dead sexy look. Impressive how many adapters you need to be able to hook a dingy old serial cable up to your shiny new computer.

August 11 2011

Four short links: 11 August 2011

  1. Why Restaurant Web Sites Are So Bad -- The rest of the Web long ago did away with auto-playing music, Flash buttons and menus, and elaborate intro pages, but restaurant sites seem stuck in 1999.
  2. North Korean Government Partly Funded by Gold Farming (Gamasutra) -- alleges a special group of hackers built automation software for MMOs and sent part of their profits back home.
  3. Pleasanton Protects Bicyclists with Microwave (Mercury News) -- no, not by pre-emptive cooking. The device monitors the intersection and can differentiate between vehicles and bicyclists crossing the road and either extends or triggers the light if a cyclist is detected.
  4. jStat -- a JavaScript statistical library.

June 02 2011

Strata Week: Hadoop competition heats up

Here are a few of the data stories that caught my eye this week.

Hadoop competition heats up

As the number of Hadoop vendors increases, companies are looking for ways to differentiate themselves. A couple of announcements this past week point to the angles vendors are taking.

Infrastructure company Rainstor announced that its latest data retention technology can be deployed using Cloudera's Hadoop distribution. Rainstor says it will improve the Hadoop Distributed File System with better compression and de-duplication, and it promises a physical footprint that is at least 97% smaller.
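
As a rough way to read that figure, a footprint that is 97% smaller works out to a reduction of about 33 to 1. The short Python sketch below shows the arithmetic; the function name is hypothetical, and it assumes the percentage is measured against the original, unreduced data size, which the announcement does not spell out.

```python
def reduction_ratio(percent_smaller):
    """Return how many times smaller the stored footprint is than the original.

    Assumes percent_smaller is measured against the original data size.
    """
    if not 0 <= percent_smaller < 100:
        raise ValueError("percent_smaller must be in [0, 100)")
    remaining_fraction = 1 - percent_smaller / 100
    return 1 / remaining_fraction


# A 97% smaller footprint leaves 3% of the original size.
print(f"{reduction_ratio(97):.1f}x smaller")
```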

In other Hadoop news, MapR revealed that it will serve as the storage component for EMC's recently announced Greenplum HD Enterprise Edition Hadoop distribution. EMC's Hadoop distribution is not based on the official Apache Software Foundation version of the code, but is instead based on Facebook's optimized version.

In an interesting twist, MapR also became an official contributor to the Apache Hadoop project this week. As GigaOm's Derrick Harris observes:

"More contributors [to Hadoop] means more (presumably) great ideas to choose from and, ideally, more voices deciding what changes to adopt and which ones to leave alone. For individual companies, getting officially involved with Apache means that perhaps Hadoop will evolve in ways that actually benefit their products that are based upon or seeking to improve Hadoop."

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

Visualizing Facebook's PHP codebase

Facebook's Greg Schechter offered an explanation this week of how and why Facebook built a visualization project in order to better grasp some of the interdependencies among the more than 10,000 modules that comprise Facebook's front-end code.

Facebook has been normalizing its PHP usage, particularly as it relates to managing modules' dependencies. With its new system, when a module is written or modified, other modules that are directly dependent are fully determinable. This makes sure that circular dependencies are avoided.

But graphing this with a classic "arc-and-node" graph visualization won't work at Facebook's scale, so at a recent hackathon, the company came up with a better visualization method.

Screen from Facebook PHP codebase visualization. See more here.

This method divides the information into layers, where each row represents a layer and a layer's modules are dependent only on modules in the rows below it, and are depended upon only by modules in the rows above it. The visualization also colors modules more darkly if they have more dependencies.
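
The layering rule described above can be sketched in a few lines of Python. This is not Facebook's implementation, which is internal; it is a minimal illustration that assumes the dependency graph is given as a dictionary from module name to its direct dependencies. The module names and the `assign_layers` function are hypothetical. Each module is placed one row above its highest dependency, so it only depends on rows below it, and a circular dependency is reported as an error.

```python
def assign_layers(dependencies):
    """Map each module to a row number.

    Modules with no dependencies sit in row 0; every other module sits
    one row above the highest row among its direct dependencies.
    """
    layers = {}
    in_progress = set()

    def layer_of(module):
        if module in layers:
            return layers[module]
        if module in in_progress:
            raise ValueError(f"circular dependency involving {module!r}")
        in_progress.add(module)
        deps = dependencies.get(module, ())
        # default=-1 puts dependency-free modules in row 0
        layers[module] = 1 + max((layer_of(d) for d in deps), default=-1)
        in_progress.discard(module)
        return layers[module]

    for module in dependencies:
        layer_of(module)
    return layers


# Hypothetical module graph, for illustration only.
deps = {
    "page": ["feed", "chrome"],
    "feed": ["story", "util"],
    "chrome": ["util"],
    "story": ["util"],
    "util": [],
}

layers = assign_layers(deps)

# Group modules into rows and print from the top row down, with the
# direct dependency count that a renderer could map to a darker color.
rows = {}
for module, layer in layers.items():
    rows.setdefault(layer, []).append(module)
for layer in sorted(rows, reverse=True):
    labels = [f"{m} ({len(deps.get(m, ()))} deps)" for m in sorted(rows[layer])]
    print(f"row {layer}: " + ", ".join(labels))
```

The recursion is fine for a small example; a graph with thousands of deeply chained modules would call for an iterative traversal instead.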

A few screens showing the visualization are available here. Unfortunately, the full tool is only available internally for the Facebook engineering team.

Visualizing Shaquille O'Neal's data

In honor of the end of Shaquille O'Neal's 19-year NBA career (an announcement he tweeted yesterday), data journalist Matt Stiles has created an interactive visualization of the star's stats.

The visualization was created using data from and the Many Eyes data visualization tool. The Atlantic's Alexis Madrigal used the tool to take a look at Shaq's shoddy free-throw record.

While Shaq's career — and now his retirement — provide ample data for off-hand curiosity, the merging of sports stats and visualizations also opens the door to broader opportunities and new kinds of data products.

Because few things are funnier than a center lofting three pointers, this graph matches Shaquille O'Neal's age against his three-point attempts. He hit a high-water mark (the big dot) at 23 when he attempted two three pointers and hit the only three of his career.

Got data news?

Feel free to email me.

