March 05 2012

OpenCorporates opens up new database of corporate directors and officers

In an age of technology-fueled transparency, corporations are subject to the same powerful disruption as governments. In that context, data journalism has profound importance for society. If a researcher needs data for business journalism, OpenCorporates is a bona fide resource.

Today, OpenCorporates is making a new open database of corporate officers and directors available to the world.

"It's pretty cool, and useful for journalists, to be able to search not just all the companies with directors for a given name in a given state, but across multiple states," said Chris Taggart, founder of OpenCorporates, in an email interview. "Not surprisingly, loads of people, from journalists to corruption investigators, are very interested in this."

OpenCorporates is the largest open database of companies and corporate data in the world. The service now contains public data from around the world, from health and safety violations in the United Kingdom to official public notices in Spain to a register of federal contractors. The database has been built by the open data community, under a bounty scheme in conjunction with ScraperWiki. The site also has a useful Google Refine reconciliation function that matches legal entities to company names. Taggart's presentation on OpenCorporates from the 2012 NICAR conference provides an overview.

The OpenCorporates open application programming interface can be used with or without a key, although an API key does increase usage limits. The open data site's business model comes with an interesting hook: while OpenCorporates makes its data both free and open under a Share-Alike Attribution Open Database License, users who wish to import the data into a proprietary database or use it without attribution must pay to do so.

"The critical thing about our Directors import, and *all* the other data in OpenCorporates, is that we give the provenance, both where and when we got the information," said Taggart. "This is in contrast to the proprietary databases who never give this, because they don't want you to go straight to the source, which also means it's problematic in tracing the source of errors. We've had several instances of the data being wrong at the source, like U.K. health and safety violations."

Taggart offered more perspective on the source of OpenCorporates director data, corporate data availability and the landscape around a universal business ID in the rest of our interview:

Where does the officer and director data come from? How is it validated and cleaned?

It's all from the official company registers. Most are scraped (we've scraped millions of pages), a couple (e.g. Vermont) are from downloads that the registries provide. We just need to make sure we're scraping and importing properly. We do some cleaning up (e.g. removing some of the '**NO DIRECTOR**' entries), but to a degree this has to be done post-import, as you often don't know about these until they're imported (which is why there are still a few in there).

By the way, in case you were wondering, the reason there are so many more directors than in the filters to the right is that there are about 3 million and counting Florida directors.

Was this data available anywhere before? If no, why not?

As far as I'm aware, only in proprietary databases. Proprietary databases have dominated company data. The result is massive duplication of effort, databases that have opaque errors in them, because they don't have many eyes on them, and lack of access to the public, small businesses, and as you will have heard from NICAR, journalists. I'm tempted to offer a bottle of champagne to the first journalist who finds a story in the directors data.

Who else is working on the universal business ID issue? I heard Beth Noveck propose something along these lines, for instance.

Several organizations have been working on this, mostly from a semi-proprietary point of view, or at least trying to generate a monopoly ID. In other words, it might be open, but in order to get anything on the company, you have to use their site as a lookup table.

OpenCorporates is different in that if you know the URI you know the jurisdiction and identity issued by the company register and vice versa. This means you don't need to ask OpenCorporates what the company ID is, as it's there in the ID. It also works with the EU/W3C's Business Vocabulary, which has just been published.
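The two-way mapping Taggart describes can be sketched in a few lines of Python. This is only an illustration, not OpenCorporates' own code: it assumes company URIs take the form https://opencorporates.com/companies/<jurisdiction code>/<company number>, and the function names and the sample company number are hypothetical.

```python
from urllib.parse import urlparse

# Assumed URI pattern: <BASE>/<jurisdiction code>/<company number>
BASE = "https://opencorporates.com/companies"


def company_uri(jurisdiction_code: str, company_number: str) -> str:
    """Build the URI from the register's own identifiers (no lookup needed)."""
    return f"{BASE}/{jurisdiction_code.lower()}/{company_number}"


def parse_company_uri(uri: str) -> tuple[str, str]:
    """Recover (jurisdiction code, company number) from a company URI."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "companies":
        raise ValueError(f"not a company URI: {uri!r}")
    return parts[1], parts[2]


if __name__ == "__main__":
    uri = company_uri("GB", "00000001")  # hypothetical company number
    print(uri)
    print(parse_company_uri(uri))
```

Because the identifier is carried in the URI itself, both directions are plain string operations; no request to a central lookup table is involved.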

ISO has been working on one, but it's got exactly this problem. Also, their database won't contain the company number, meaning it doesn't link to the legal entity. Bloomberg have been working on one, as have Thomson Reuters, as they need an alternative to the DUNS number, but from the conversations I had in D.C., nobody's terribly interested in this.

I don't really know the status of Beth's project. They were intending to create a new ID too. From speaking to Jim Hendler, it didn't seem to be connected to the legal entity but instead to represent a search of the name (actually a hash of a SPARQL query). You can see a demo site at I have severe doubts regarding this.

Finally, there's the Financial Stability Board's (part of the G20) work on a global legal entity identifier -- we're on the advisory board for this. This also would be a new number, and be voluntary, but on the other hand will be openly licensed.

I don't think it's a solution to the problem, as it won't be complete and for other reasons, but it may surface more information. We'd definitely provide an entity resolution service to it.

December 01 2011

Strata Week: New open-data initiatives in Canada and the UK

Here are a few of the data stories that caught my attention this week.

Open data from StatsCan

Embassy Magazine broke the news this week that all of Statistics Canada's online data will be made available to the public for free, released under the Government of Canada's Open Data License Agreement beginning in February 2012. Statistics Canada is the federal agency commissioned with producing statistics to help understand the Canadian economy, culture, resources, and population. (It runs the Canadian census every five years.)

The decision to make the data freely and openly available "has been in the works for years," according to Statistics Canada spokesperson Peter Frayne. The Canadian government did launch an open-data initiative earlier this year, and the move on the part of StatsCan dovetails philosophically with that. Frayne said that the decision to make the data free was not a response to the controversial decision last summer when the agency dropped its mandatory long-form census.

Open government activist David Eaves responds with a long list of "winners" from the decision, including all of the consumers of StatsCan's data:

Indirectly, this includes all of us, since provincial and local governments are big consumers of StatsCan data and so now — assuming it is structured in such a manner — they will have easier (and cheaper) access to it. This is also true of large companies and non-profits which have used StatsCan data to locate stores, target services and generally allocate resources more efficiently. The opportunity now opens for smaller players to also benefit.

Eaves continues, stressing the importance of these smaller players:

Indeed, this is the real hope. That a whole new category of winners emerges. That the barrier to use for software developers, entrepreneurs, students, academics, smaller companies and non-profits will be lowered in a manner that will enable a larger community to make use of the data and therefore create economic or social goods.

Moving to Big Data: Free Strata Online Conference — In this free online event, being held Dec. 7, 2011, at 9AM Pacific, we'll look at how big data stacks and analytical approaches are gradually finding their way into organizations as well as the roadblocks that can thwart efforts to become more data driven. (This Strata Online Conference is sponsored by Microsoft.)

Register to attend this free Strata Online Conference

Open data from Whitehall

The British government also announced the availability of new open datasets this week. The Guardian reports that personal health records, transportation data, housing prices, and weather data will be included "in what promises to be the most dramatic release of public data since the 2010 election."

The government will also form an Open Data Institute (ODI), led by Sir Tim Berners-Lee. The ODI will involve both businesses and academic institutions, and will focus on helping transform the data for commercial benefit for U.K. companies as well as for the government. The ODI will also work on the development of web standards to support the government's open-data agenda.

The Guardian notes that the health data that's to be released will be the largest of its kind outside of U.S. veterans' medical records. The paper cites the move as something recommended by the Wellcome Trust earlier this year: "Integrated databases ... would make England unique, globally, for such research." Both medical researchers and pharmaceutical companies will be able to access the data for free.

Dell open sources its Hadoop deployment tool

Hadoop adoption and investment has been one of the big data trends of 2011, with stories about Hadoop appearing in almost every edition of Strata Week. GigaOm's Derrick Harris contends that Hadoop's good fortunes will only continue in 2012, listing six reasons why next year may actually go down as "The Year of Hadoop."

This week's Hadoop-related news involves the release of the source code to Crowbar, Dell's Hadoop deployment tool. Silicon Angle's Klint Finley writes that:

Crowbar is an open-source deployment tool developed by Dell originally as part of its Dell OpenStack Cloud service. It started as a tool for installing Open Stack, but can deploy other software through the use of plug-in modules called 'barclamps' ... The goal of the Hadoop barclamp is to reduce Hadoop deployment time from weeks to a single day.

Finley notes that Crowbar isn't competition to Cloudera's line of Hadoop management tools.

What Muncie read

"People don't read anymore," Steve Jobs once told The New York Times. It's a fairly common complaint, one that certainly predates the computer age — television was to blame, then video games. But our knowledge about reading habits of the past is actually quite slight. That's what makes the database based on ledgers from the Muncie, Ind., public library so marvelous.

The ledgers, which were discovered by historian Frank Felsenstein, chronicle every book checked out of the library, along with the name of the patron who checked it out, between November 1891 and December 1902. That information is now available in the What Middletown Read database.

In a New York Times story on the database, Anne Trubek notes that even at the turn of the 20th century, most library patrons were not reading "the classics":

What do these records tell us Americans were reading? Mostly fluff, it's true. Women read romances, kids read pulp and white-collar workers read mass-market titles. Horatio Alger was by far the most popular author: 5 percent of all books checked out were by him, despite librarians who frowned when boys and girls sought his rags-to-riches novels (some libraries refused to circulate Alger's distressingly individualist books). Louisa May Alcott is the only author who remains both popular and literary today (though her popularity is far less). "Little Women" was widely read, but its sequel "Little Men" even more so, perhaps because it was checked out by boys, too.

Got data news?

Feel free to email me.


September 06 2011

The new guy wants to hack the city's data

Christopher Groskopf (@onyxfish) is a news app developer at The Chicago Tribune. It's a job he can do remotely, which is convenient since he announced last spring that he'd be leaving Chicago and relocating to Tyler, Texas.

The move is a long story and a very personal one. Groskopf and his wife are getting divorced, and she's moving to Tyler with their son. He's decided to follow them and, as he says on his blog, "I've opted to make this good." While there are things to like about Tyler — a cheaper cost of living than Chicago, for example, and no lengthy commute — Groskopf has decided to "improve the things I don't like, either through application of will or technology, or both."

And so the "Hack Tyler" project was born: a plan to open, share, and analyze city and county data. Groskopf has been updating his blog with his progress (and his entries, in turn, have been re-published in The Atlantic). In the interview below, Groskopf discusses the state of the project and what others can learn from his experiences.

What have you hacked in Tyler so far?

Christopher Groskopf: I've experimented with local geographic and census data, and built out the foundations of a local transit app. I've also written some prototype web scrapers for a variety of local datasets, which haven't yet been incorporated into anything visible. Mostly, I've spent a great deal of time aggregating data sources and plotting out what sorts of projects might be interesting and useful to the local community. I've got a list of a dozen relatively broad ideas.

What have been the biggest surprises?

Christopher Groskopf: My biggest surprise has been the amount of data that was available and who it was available from. I've learned since starting the project that Texas has a history of transparency projects that I was unaware of. Of particular note was a dramatic reversal in who provides data in raw formats. Chicago had accustomed me to thinking of transit as being the most forward-thinking in terms of providing data and law enforcement the least, but the situation in Tyler is inverted. The police department seems to be remarkably progressive in getting data out there — although the formats are terrible — and I've had great difficulty getting any useful information from transit, despite my highly visible efforts to craft something useful for the community.

I've also had local residents go head over heels for the idea behind the project, and I've had others tell me in no uncertain terms that my attitude was unwelcome. I've been heartened by the former and depressed by the latter, but surprised by neither.

In terms of local data, what's the biggest obstacle you're facing in Tyler? And is this obstacle unique to Tyler?

Christopher Groskopf: The biggest challenge is in convincing the local government it's worth their time to care. For small governments, everything must be evaluated in terms of the value it generates. It's hard for them to believe that this ethos I'm espousing can produce and, more importantly, sustain valuable projects for the community. This problem is exacerbated by my status as an "army of one" on this project. Tyler doesn't have a vast local community of developers to support the effort. Instead, I must appeal to broader examples and trends to underscore that every city can benefit from making data more open. I might be the first person to cause a fuss about data in Tyler, but I won't be the last — and software and data-oriented decision making will only continue to grow in importance.

I don't think this problem is peculiar to Tyler. I do believe it's endemic to municipalities with populations under 500,000 people. Big cities are just starting to understand that by becoming transparent platforms for their citizenry, they can lower costs, generate new value, and engage with citizens in a new and fundamentally democratic way. Smaller cities and towns — with notable exceptions — haven't gotten the memo yet. Not being repositories of cutting-edge technology, smaller locales are frequently left to the caprices of unscrupulous software vendors and mediocre consultants.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

What can city governments and developers learn from your work in Tyler?

Christopher Groskopf: For want of a proper desk and computer (both of which are waiting for me in storage in Tyler), I've been confining my projects to the very small scale over the summer — I'm building maps and doing analysis, not to mention reading everything I can about the place. I bring this up to point out the nimbleness of the project. I have no money, so I host things on an EC2 micro instance and pay a paltry $15 a month. I have no laptop, so I hack on a netbook. I use open source. I examine the work of my peers. I copy good ideas.

Governments and developers should learn the same thing: agility. This work is not intrinsically expensive. It need not be exasperating or difficult. If it's costing you a million dollars, you're doing it wrong. If the project is scheduled to take a year, it's the wrong project. Start small with modest goals, then exceed them. And make no excuses. Anybody with the requisite programming skills can copy what I'm doing. And those requisite skills can be learned — for free! All Hack Tyler projects are open source and, more importantly, the projects that made them possible are open source. And of ultimate importance, the knowledge needed to create them is itself open.

This interview was edited and condensed.


August 30 2011

How to create sustainable open data projects with purpose

There has been much hand-wringing of late about whether the explosion of government-run app contests over the last couple of years has generated any real value for the public. With only one of the Apps for Democracy projects still running, it's easy to see the entire movement being written off as an overly optimistic fad.

The organisation that I'm lucky enough to lead — mySociety — didn't come from the world of app contests, but it does build the kind of open-source, open-data-grounded civic apps that such contests are supposed to produce. I believe that mySociety's story shows that it's possible to build meaningful, impactful civic and democratic web apps, to grow them to a scale where they're unambiguously a good use of time and money, then sustain them for years at a time. Right now we're launching a new site, FixMyTransport, that is trying to raise the bar for the ambition and scale of civic apps, so this seems a good moment to share some thoughts about what it takes to build good services and get them to last more than a few months.

You have to be just as focused on user needs as any company (and perhaps more so)

People have needs. Sometimes they need to eat, sometimes they need to sleep. And sometimes they need to send an urgent message to a local politician, or get a dangerous hanging branch cleared off of a road.

What people never, ever do is wake up thinking, "Today I need to do something civic," or, "Today I will explore some interesting data via an attractive visualisation." MySociety has always been unashamed about packaging civic services in a way that appeals directly to real people with real, everyday needs. I gleefully delete the two or three emails a year that land in our inbox suggesting that FixMyStreet should be renamed to FixOurStreet. No, dude, when I'm pissed it's definitely my street, which is why people have borrowed the name around the world.

We learned this lesson most vividly from Pledgebank, a sputtering site with occasional amazing successes and lots and lots of "meh." The reason it never took off was because, unlike the later (and brilliant) Kickstarter, we didn't make it specific enough. We didn't say "use this site to raise money for your first album," or "use this site to organise a march." We said it was a platform for "getting things done," and the users walked away in confusion. That's why our new site is called FixMyTransport, even though it's actually the first instance of a general civic-problem-fixing platform that could handle nearly any kind of local campaigning.

Being focused on user needs means not starting things you think you probably can't finish

In mySociety's history we have run four calls for proposals, asking the whole world what we should build next. Like most idea-gathering processes, there are about 100 bad ideas for every good one, but the bad ideas have value in that they reveal a habitual digital-era trait — being insanely optimistic about the effort required to build things to a high standard.

Now, clearly, I'm not saying it is impossible to hack brilliant things without piles of VC gold. But if you are going to hack something really, genuinely valuable in just a couple of weeks, and you want it to thrive and survive in the real Internet, you need to have an idea that is as simple as it is brilliant. Matthew Somerville's accessible Traintimes fits into this category, as does, and But ideas like this are super rare — they're so simple and powerful that really polished sites can be built and sustained on volunteer-level time contributions. I salute the geniuses who gave us the four sites I just mentioned. They make me feel small and stupid.

If your civic hack idea is more complicated than this, then you should really go hunting for funding before you set about coding. Because the Internet is a savagely competitive place, and if your site isn't pretty spanking, nobody is going to come except the robots and spammers.

To be clear — FixMyTransport is not an example of a super-simple genius idea. I wish it were. Rather it's our response to the questions "What's missing in the civic web?" and "What's still too hard to get done online?" But we didn't start building it until we knew we had the money, and we didn't try to fit it into evenings and weekends. It was painful to wait and not rush with it, but it was the right thing to do to build something up to the expectations of an Internet-using public habituated to websites with billion-dollar budgets. And we are emotionally and financially prepared for the six months of rapid iteration that will follow once the public arrives.

Data is your servant, not your master

I love open data. I love structured data. I love data, full stop. But my love of data is not the same as respecting our users' needs. There are more than 300,000 bus stops, train stations, ferry routes and so on in the FixMyTransport back end, munged together over months of hard work from dirty, dirty public data sources. Can you see any sign of this on the homepage? No sir, because users want to fix transport problems, not revel in our mastery of databases.

Demand fewer, larger grants from government and funders

MySociety got lucky. It was born into a period of high public spending, 2003/4, and its second ever grant was for 0.02% of a government funding pot worth more than a billion dollars — about a quarter of a million dollars. It was amazing luck for a small organisation with no track record, possible only because so much money was being thrown around. Those days are gone on both sides of the Pond, but governments everywhere should note that funding of this scale got us right through our first couple of years, until sites like WriteToThem were mature and had proved their public value (and picked up an award or two).

In the subsequent few years, we saw the "thousand flowers bloom" mentality really take over the world of public-good digital funding, and we saw it go way beyond what was sensible. Time and again, we'd see two good ideas get funding and eight bad ones at the same time because of the sense that it was necessary to spread the money around. It would be great if someone could make the case to public grant funders that good tech ideas — and the teams that can implement them — are vanishingly rare. There is nothing to be ashamed about dividing the pot up two or three ways if there are only a few ideas or proposals or hacks that justify the money. The larger amounts this would produce wouldn't mean champagne parties for grantees, it would mean the best ideas surviving long enough to grow meaningful traffic and learn how to make money other ways.

After a long road supported by public grant funding, mySociety is now 50% commercially funded and 50% private-grant funded, but we'd never have arrived there without being 100% public-grant funded for the first couple of years. Now our key donors are philanthropic, with Indigo Trust in particular covering most of the core development cost for FixMyTransport.

Respect the geeks

All great technology projects have one or more über geeks at the heart of them. If you find the right über geeks, they'll understand politics, society and users just as much as they understand their code. If you find someone as ferociously multi-talented as, say, Louise Crow, who built FixMyTransport almost single-handedly, listen to them and change your plans when they say "no." Luckily, she said "yes" to building this project, and I hope those of you who care about civic tech give her the props appropriate to building something on this scale. Respect her, and respect the geeks like her, and you'll be one step closer to civic app success.

