March 05 2012

OpenCorporates opens up new database of corporate directors and officers

In an age of technology-fueled transparency, corporations are subject to the same powerful disruption as governments. In that context, data journalism has profound importance for society. If a researcher needs data for business journalism, OpenCorporates is a bona fide resource.

Today, OpenCorporates is making a new open database of corporate officers and directors available to the world.

"It's pretty cool, and useful for journalists, to be able to search not just all the companies with directors for a given name in a given state, but across multiple states," said Chris Taggart, founder of OpenCorporates, in an email interview. "Not surprisingly, loads of people, from journalists to corruption investigators, are very interested in this."

OpenCorporates is the largest open database of companies and corporate data in the world. The service now contains public data from around the world, from health and safety violations in the United Kingdom to official public notices in Spain to a register of federal contractors. The database has been built by the open data community, under a bounty scheme in conjunction with ScraperWiki. The site also has a useful Google Refine reconciliation function that matches legal entities to company names. Taggart's presentation on OpenCorporates from the 2012 NICAR conference provides an overview.

The OpenCorporates open application programming interface can be used with or without a key, although an API key does increase usage limits. The open data site's business model comes with an interesting hook: while OpenCorporates makes its data both free and open under a Share-Alike Attribution Open Database License, users who wish to import the data into a proprietary database or use it without attribution must pay to do so.

"The critical thing about our Directors import, and *all* the other data in OpenCorporates, is that we give the provenance, both where and when we got the information," said Taggart. "This is in contrast to the proprietary databases who never give this, because they don't want you to go straight to the source, which also means it's problematic in tracing the source of errors. We've had several instances of the data being wrong at the source, like U.K. health and safety violations."

Taggart offered more perspective on the source of OpenCorporates director data, corporate data availability and the landscape around a universal business ID in the rest of our interview:

Where does the officer and director data come from? How is it validated and cleaned?

It's all from the official company registers. Most are scraped (we've scraped millions of pages); a couple (e.g. Vermont) are from downloads that the registries provide. We just need to make sure we're scraping and importing properly. We do some cleaning up (e.g. removing some of the '**NO DIRECTOR**' entries), but to a degree this has to be done post import, as you often don't know these till they're imported (which is why there are still a few in there).

By the way, in case you were wondering, the reason there are so many more directors than in the filters to the right is that there are about 3 million and counting Florida directors.

Was this data available anywhere before? If no, why not?

As far as I'm aware, only in proprietary databases. Proprietary databases have dominated company data. The result is massive duplication of effort, databases that have opaque errors in them, because they don't have many eyes on them, and lack of access to the public, small businesses, and as you will have heard from NICAR, journalists. I'm tempted to offer a bottle of champagne to the first journalist who finds a story in the directors data.

Who else is working on the universal business ID issue? I heard Beth Noveck propose something along these lines, for instance.

Several organizations have been working on this, mostly from a semi-proprietary point of view, or at least trying to generate a monopoly ID. In other words, it might be open, but in order to get anything on the company, you have to use their site as a lookup table.

OpenCorporates is different in that if you know the URI you know the jurisdiction and identity issued by the company register and vice versa. This means you don't need to ask OpenCorporates what the company ID is, as it's there in the ID. It also works with the EU/W3C's Business Vocabulary, which has just been published.

ISO has been working on one, but it's got exactly this problem. Also, their database won't contain the company number, meaning it doesn't link to the legal entity. Bloomberg have been working on one, as have Thomson Reuters, as they need an alternative to the DUNS number, but from the conversations I had in D.C., nobody's terribly interested in this.

I don't really know the status of Beth's project. They were intending to create a new ID too. From speaking to Jim Hendler, it didn't seem to be connected to the legal entity but instead to represent a search of the name (actually a hash of a SPARQL query). You can see a demo site at http://tw.rpi.edu/orgpedia/companies. I have severe doubts regarding this.

Finally, there's the Financial Stability Board's (part of the G20) work on a global legal entity identifier -- we're on the advisory board for this. This also would be a new number, and be voluntary, but on the other hand will be openly licensed.

I don't think it's a solution to the problem, as it won't be complete and for other reasons, but it may surface more information. We'd definitely provide an entity resolution service to it.
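
Taggart's point above, that an OpenCorporates URI already carries the jurisdiction and the register-issued company number, can be illustrated with a short Python sketch. This is not code from OpenCorporates or from the interview: the URI pattern, the `CompanyRef` type, the helper names, and the example company number are all assumptions made for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlparse

# Assumed URI pattern (not quoted in the interview):
#   https://opencorporates.com/companies/<jurisdiction>/<company number>
BASE_URI = "https://opencorporates.com/companies"


class CompanyRef(NamedTuple):
    """A company as identified by its official register (hypothetical type)."""
    jurisdiction: str    # e.g. "gb" or "us_fl" (illustrative codes)
    company_number: str  # number issued by that jurisdiction's register


def to_uri(ref: CompanyRef) -> str:
    """Build the URI directly from the register's own identifiers."""
    return f"{BASE_URI}/{ref.jurisdiction}/{ref.company_number}"


def from_uri(uri: str) -> CompanyRef:
    """Recover jurisdiction and company number from the URI alone."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "companies":
        raise ValueError(f"not a company URI: {uri!r}")
    return CompanyRef(jurisdiction=parts[1], company_number=parts[2])


# Hypothetical company number, used only to show the round trip.
example = CompanyRef("gb", "01234567")
uri = to_uri(example)
assert from_uri(uri) == example  # no lookup table needed in either direction
```

Under these assumptions, the identifier can be built or decoded offline, which is the contrast Taggart draws with schemes that require a central lookup service.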

September 15 2011

Strata Week: Investors circle big data

This was a busy week for data stories. Here are a few that caught my attention:

Big money for big data

There's recently been a steady stream of funding news for big data, database, and data mining companies. Last Thursday, Hadoop-based data analytics startup Platfora raised $5.7 million from Andreessen Horowitz. On Monday, 10gen announced it had raised $20 million for MongoDB, its open-source, NoSQL database. On Tuesday, Xignite said it had raised $10 million to build big data repositories for financial organizations; data storage provider Zetta announced a $9 million round; and Walmart announced it had acquired the ad targeting and data mining startup OneRiot (the terms of the deal were not disclosed). Finally, yesterday, big data analytics company Opera Solutions announced that it had raised a whopping $84 million in its first round of funding.

GigaOm's Derrick Harris offers the story behind Opera Solutions' massive round of funding, noting that the company was already growing fast and doing more than $100 million per year in revenue. He also points to the company's penchant for hiring PhDs (90 so far), "something that makes it more akin to blue-chipper IBM than to many of today's big data startups pushing Hadoop or NoSQL technologies." Harris also notes that at a half-billion-dollar valuation and with 600-plus employees, Opera Solutions isn't a great acquisition target for other big companies, even those wanting to beef up their analytics offerings. He contends this could allow Opera Solutions to remain independent and perhaps make some acquisitions of its own.

Ushahidi and Wikipedia team up for WikiSweeper

The crisis-mapping platform Ushahidi unveiled a new tool this week to help Wikipedia editors track changes and verify sources on articles. The project, called WikiSweeper, is aimed at the heavily and rapidly edited articles associated with major events.

As Ushahidi writes on its blog:

When a globally-relevant news story breaks, relevant Wikipedia pages are the subject of hundreds of edits as events unfold. As each editor looks to editing and maintaining the quality and credibility of the page, they need to manually track the news cycle, each using their own spheres of reference. The decisions that are made to accept one source while rejecting others remains opaque, as are the strategies that editors develop to alert and keep track of the latest information coming in from a variety of different sources.

WikiSweeper is based on Ushahidi's own open-source Sweeper tool, and its application to Wikipedia will help Ushahidi in turn build out its own project. After all, during major events, information comes in from multiple sources at a breakneck pace, and in crisis response, the accuracy and trustworthiness of the sources need to be quickly and transparently identified. As Ushahidi points out, this makes it a "win-win" for both organizations as they gain better tools for dealing with real-time news and social data.

Angry Birds take down pigs and the economy

Invoking the complaints that surface every March about how much time Americans waste at work watching the NCAA college basketball tournament, The Atlantic's Alexis Madrigal has pointed to a far more insidious, year-round problem: the number of hours American workers lose playing Angry Birds.

Drawing on data about the number of minutes people spend playing Angry Birds per day (200 million), Madrigal has calculated the resulting lost hours and lost wages. He estimates that about 43,333,333 on-the-clock hours are spent playing Angry Birds each year, accounting for $1.5 billion in lost wages per year.

As Madrigal writes:

Obviously there are some really big assumptions in this calculation. The first is that five percent of the total Angry Bird hours are played by Americans at work ... we don't know the international breakdown, nor do we know how often people play at work. But, five percent seemed like a reasonable assumption. Second, the Pew income data for smartphone ownership is not that precise, particularly on the upper ($75,000+) and lower (less than $30,000) ends. I had to pick numbers, so I basically split Americans up into four categories: people earning $30,000, $50,000, $75,000, and $100,000, then I calculated simple hourly wages for those groups (income/52/40) and did a weighted average based on smartphone adoption in those categories. The $35 per hour number I used is comparable with the $38 that Challenger, Gray, and Christmas used for fantasy sports players. But this is certainly a rough approximation. Put it this way: I bet this estimate is right to the order of magnitude, if not in the details.
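
The arithmetic behind these figures can be retraced in a few lines of Python. This is a reconstruction, not Madrigal's own worksheet: the 200 million minutes per day, the five percent share, the $35 hourly figure, and the income/52/40 wage formula come from the text above, while the 260 working days per year (52 weeks of 5 days) is an assumption chosen here because it reproduces the 43,333,333-hour figure. The smartphone-adoption weights behind the $35 average are not given, so they are not recomputed.

```python
# Inputs stated in the article.
MINUTES_PER_DAY = 200_000_000   # Angry Birds minutes played per day, worldwide
SHARE_AT_WORK_US = 0.05         # share assumed to be Americans playing at work
HOURLY_WAGE = 35                # dollars per hour, the weighted average used

# Assumption (not stated in the article): 52 weeks x 5 working days.
WORKDAYS_PER_YEAR = 260


def simple_hourly_wage(annual_income: float) -> float:
    """Hourly wage as described in the quote: income / 52 weeks / 40 hours."""
    return annual_income / 52 / 40


def lost_hours_per_year() -> float:
    """On-the-clock hours spent playing per year under the inputs above."""
    hours_per_day = MINUTES_PER_DAY / 60
    return hours_per_day * WORKDAYS_PER_YEAR * SHARE_AT_WORK_US


def lost_wages_per_year() -> float:
    """Lost hours valued at the weighted-average hourly wage."""
    return lost_hours_per_year() * HOURLY_WAGE


if __name__ == "__main__":
    print(f"Lost hours per year: {lost_hours_per_year():,.0f}")
    print(f"Lost wages per year: ${lost_wages_per_year():,.0f}")
    # Per-bracket hourly wages for the four income groups named in the quote.
    for income in (30_000, 50_000, 75_000, 100_000):
        print(f"${income:,} per year is ${simple_hourly_wage(income):.2f} per hour")
```

With a 365-day year instead of 260 working days, the same inputs would give a noticeably larger total, which shows how sensitive the estimate is to each assumption.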

Take that, Gladwell

Malcolm Gladwell raised the ire of many social-media-savvy activists last year by claiming that "the revolution will not be tweeted." Writing in The New Yorker, Gladwell dismissed social media as a tool for change. He argued that bonds formed online are "weak" and unable to withstand the sorts of demands necessary for social change.

Gladwell's assertions have been countered in many places, and a new article analyzing social media's role in the Arab Spring takes the rebuttals to a new level.

"After analyzing over 3 million tweets, gigabytes of YouTube content and thousands of blog posts, a new study finds that social media played a central role in shaping political debates in the Arab Spring. Conversations about revolution often preceded major events on the ground, and social media carried inspiring stories of protest across international borders," the authors write.

The authors describe their research methodology for extracting and analyzing the texts from blogs and tweets, but also lament some of the problems they faced, particularly with access to the Twitter archive.

Got data news?

Feel free to email me.
