
October 15 2011

International Open Government Data Camp looks to build community

There's a growing international movement to open up government data and make something useful with it. Civic apps based on open data are emerging that genuinely serve citizens in beneficial ways that officials may not have been able to deliver, particularly without significant time or increased expense.

For every civic app, however, there's a backstory that often involves a broad number of stakeholders. Governments have to commit to opening themselves up, but in many cases they will need external expertise or even funding to do so. Citizens, industry and developers have to use the data, demonstrating that there's not only demand but skill outside of government to put open data to work in the service of accountability, citizen utility and economic opportunity. Galvanizing the co-creation of civic services, policies or apps isn't easy, but the potential of the civic surplus has attracted the attention of governments around the world.

The approach will not be a silver bullet for all of society's ills, given high unemployment, economic uncertainty, and rising healthcare and energy costs -- but an increasing number of states are standing up platforms and stimulating an app economy. Given the promise of leaner, smarter government that focuses on providing open data to fuel economic activity, tough, results-oriented mayors like Rahm Emanuel and Mike Bloomberg are committing to open data in Chicago and New York City.

A key ingredient in successful open government data initiatives is community. It's not enough to simply release data and hope that venture capitalists and developers magically become aware of the opportunity to put it to work. Marketing open government data is what has brought federal CTO Aneesh Chopra and HHS CTO Todd Park repeatedly out to Silicon Valley, New York City and other business and tech hubs. The civic developer and startup community is participating in creating a new distributed ecosystem, from BuzzData to Socrata to new efforts like Max Ogden's DataCouch.

As with other open source movements, people interested in open data are self-organizing and, in many cases, are using the unconference model to do so. Over the past decade, camps have sprung up all around the U.S. and, increasingly, internationally, from Asia to Europe, Africa and South America. Whether they're called techcamps, barcamps, citycamps or govcamps, these forums give advocates, activists, civic media, citizens and public officials a place to meet and exchange ideas, code and expertise.

Next week, the second International Open Government Data Camp will pull together all of those constituencies in Warsaw, Poland to talk about the future of open data. Attendees will be able to learn from plenary keynotes from open data leaders and tracks full of sessions with advocates, activists and technologists. Satellite events around OGD Camp will also offer unstructured time for people to meet, mix, connect and create. You can watch a short film about open government data from the Open Knowledge Foundation below:

To learn more about what attendees should expect, I conducted an email interview with Jonathan Gray, the community coordinator for the Open Knowledge Foundation. For specific details about the camp, consult the camp's FAQ. Gray offered more context on open government data at the Guardian this past week:

It's been over five years since the Guardian launched its influential Free Our Data campaign. Nearly four years ago Rufus Pollock coined the phrase "Raw Data Now" which web inventor Sir Tim Berners-Lee later transformed into the slogan for a global movement. And that same year a group of 30 open government advocates met in Sebastopol, California and drafted a succinct text on open government data which has subsequently been echoed and encoded in official policy and legislative documents around the world.

In under half a decade, open data has found its way into digital policy packages and transparency initiatives all over the place - from city administrations in Berlin, Paris and New York, to the corridors of supranational institutions like the European Commission or the World Bank. In the past few years we've seen a veritable abundance of portals and principles, handbooks and hackdays, promises and prizes.

But despite this enthusiastic and energetic reception, open data has not been without its setbacks, and there are still huge challenges ahead. Earlier this year there were reports that Data.gov would have its funding slashed. In the UK there are concerns that the ominously titled "Public Data Corporation" may mean that an increasing amount of data is locked down and sold to those who can afford to pay for it. And in most countries around the world, most documents and datasets are still published under ambiguous or restrictive legal conditions that inhibit reuse. Public sector spending cuts and austerity measures in many countries will make it harder for open data to rise up priority lists.

Participants at this year's camp will swap notes on how to overcome some of these obstacles, as well as learning how to set up and run an open data initiative (from the people behind national data catalogues), how to get the legal and technical details right, how to engage with data users, how to run events, hackdays and competitions, and lots more.

What will this camp change?

We want to build a stronger international community of people interested in open data - so people can swap expertise, anecdotes and bits of code. In particular we want to get public servants talking to each other about how to set up an open data initiative, and to make sure that developers, journalists, NGOs and others are included in the process.

What did the last camp change?

Many of the participants from the 2010 camp came away enthused with ideas, contacts and energy that have catalysed and informed the development of open data around the world. For example, groups of citizens booted up grassroots open data meetups in several places, public servants set up official initiatives on the back of advice and discussions from the camp, developers started local versions of projects they liked, and so on.

Why does this matter to the tech community?

Public data is a fertile soil out of which the next generation of digital services and applications will grow. It may take a while for technologies and processes to get there, but eventually we hope open data will be ubiquitous and routine.

Why does it matter to the art, design, music, business or nonprofit community?

Journalists need to be able to navigate public information sources, from official documents and transcripts to information on the environment or the economy. Rather than relying on press releases and policy reports, they should have some grasp of the raw information sources upon which these things depend - so they can make up their own minds, and do their own analysis and evaluation. There's a dedicated satellite event on data journalism at the camp, looking at where EU spending goes.

Similarly, NGOs, think tanks, and community groups should be able to utilise public data to improve their research, advocacy or outreach. Being more literate about data sources, and knowing how to use them in combination with existing free tools and services, can be a very powerful way to put arguments into context, or to communicate issues they care about more effectively. This will be a big theme in this year's camp.

Why does it matter to people who have never heard of open data?

Our lives are increasingly governed by data. Having basic literacy about how to use the information around us is important for all sorts of things, from dealing with major global problems to making everyday decisions. In response to things like climate change, the financial crisis, or disease outbreaks, governments must share information with each other and with the public to respond effectively and to keep citizens informed. We depend on up-to-date information to plan our journeys, locate nearby public facilities and see how our taxes are spent.

What are the outcomes that matter from such an event?

We are hoping to build consensus around a set of legal principles for open data, so that key stakeholders around the world come to a more explicit and formal agreement about the terms under which open data should be published (as liberal as possible!). And we'll be working on a comprehensive directory of open data catalogues from around the world, curated for and by the open data community.

We also hope that some key open data projects will be ported and transplanted to different countries. Perhaps most importantly, we hope that (like last year) the discussions and workshops that take place will give a big boost to open data around the world, and people will continue to collaborate online after the camp.

How is OGD Camp going to be different from other events?

It looks like it will be the biggest open data event to date. We have representation from dozens and dozens of countries around the world. There will be a strong focus on getting things done. We're really excited!

March 15 2011

Data integration services combine storage and analysis tools

There has been a lot of movement on the data front in recent months, with a strong focus on integrations between data warehousing and analytics tools. In that spirit, yesterday IBM announced its Netezza data warehouse partnership with Revolution R Enterprise, bringing the R statistics language and predictive analytics to the big data warehouse table.

Microsoft and HP have jumped in as well. Microsoft launched the beta of Dryad, DSC, and DryadLINQ in December, and HP bought Vertica in February. HP plans to integrate Vertica into its new-and-improved overall business model, nicely outlined here by Klint Finley for ReadWrite Enterprise.

These sorts of data integration environments will likely become common as data storage and analysis emerge as mainstream business requirements. Derrick Harris touched on this in a post about the IBM-R partnership and the growing focus on integration:

It's not just about storing lots of data, but also about getting the best insights from it and doing so efficiently, and having large silos for each type of data and each group of business stakeholders doesn't really advance either goal.


January 27 2011

Will data warehousing survive the advent of big data?

For more than 25 years, data warehousing has been the accepted architecture for providing information to support decision makers. Despite numerous implementation approaches, it is founded on sound information management principles, most particularly that of integrating information according to a business-directed and predefined model before allowing use by decision makers. Big data, however one defines it, challenges some of the underlying principles behind data warehousing, causing some analysts to question if the data warehouse will survive.

In this article, I address this question directly and propose that data warehousing, and indeed information management as a whole, must evolve in a radically new direction if we are to manage big data properly and solve the key issue of finding implicit meaning in data.

Back in the 1980s I worked for IBM in Ireland, defining the first published data warehouse architecture (Devlin & Murphy, 1988). At that time, the primary driver for data warehousing was to reconcile data from multiple operational systems and to provide a single, easily-understood source of consistent information to decision makers. The architecture defined the "Business Data Warehouse (BDW) ... [as] the single logical storehouse of all the information used to report on the business ... In relational terms, the end user is presented with a view / number of views that contain the accessed data ..." Note the phrase "single logical storehouse" — I'll return to it later.

Big data (or what was big data then — a few hundred MB in many cases!) and the poor performance of early relational databases proved a challenge to the physical implementation of this model. Within a couple of years, the layered model emerged. Shown in Figure 1 (below), this has a central enterprise data warehouse (EDW) as a point of consolidation and reconciliation, and multiple user-access data marts fed from it. This implementation model has stood the test of time. But it does say that all data must (or should) flow through the EDW, the implications of which I'll discuss later.

Figure 1: The Traditional Data Warehouse Architecture.

The current hype around "big data" has caused some analysts and vendors to declare the death of data warehousing, and in some cases, the demise even of the relational database.

A prerequisite to discussing these claims is to understand and clearly define the term "big data." However, it's a fairly nebulous concept. Wikipedia's definition, as of December 2010, is vague and pliable:

Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.

So, it's as big as you want and getting ever larger.

A taxonomy for data — mind over matter

To get a better understanding, we need to look at the different types of data involved and, rather than focus on the actual data volumes, look to the scale and variety of processing required to extract implicit meaning from the raw data.

Figure 2 (below) introduces a novel and unique view of data, its categories and its relationship to meaning, which I call somewhat cheekily "Mind over Matter."

Figure 2: Mind over Matter and the Heart of Meaning.

Broadly speaking, the bottom pyramid represents data gleaned primarily from the physical world, the world of matter. At the lowest level, we have measurement data, sourced from a variety of sensors connected to computers and the Internet. Such physical event data includes location, velocity, flow rate, event count, G-force, chemical signal, and many more. Such measurements are widely used in science and engineering applications, and have grown to enormous volumes in areas such as particle physics, genomics and performance monitoring of complex equipment. This type of big data has been recognized by the scientific and engineering community for many years and is the basis for much modern research and development. When such basic data is combined in meaningful ways, it becomes interesting in the commercial world.

Atomic data is thus comprised of physical events, meaningfully combined in the context of some human interaction. For example, a combined set of location, velocity and G-force measurements in a specific pattern and time from an automobile monitoring box may indicate an accident. A magnetic card reading of account details, followed by a count of bills issued at an ATM, is clearly a cash withdrawal transaction. More sophisticated combinations include call detail records (CDRs) in telecom systems, web log records, e-commerce transactions and so on. There's nothing new in this type of big data. Telcos, financial institutions and web retailers have statistically analyzed it extensively since the early days of data warehousing for insight into customer behavior and as a basis for advertising campaigns or offers aimed at influencing it.

Derived data, created through mathematical manipulation of atomic data, is generally used to present a more meaningful view of business information to humans. For example, banking transactions can be accumulated and combined to create account status and balance information. Transaction data can be summarized into averages or sampled. Some of these processes result in a loss of detailed data. This data type and the two below it in the lower pyramid comprise hard information: largely numerical and keyword data, well structured for use by computers and amenable to standard statistical processing.

As we move to the top pyramid, we enter the realm of the mind — information originating from the way we as humans perceive the world and interact socially within it. We also call this soft information: less well structured and requiring more specialized statistical and analytical processing. The top layer is multiplex data: image, video and audio information, often in smaller numbers of very large files and very much part of the big data scene. Very specialized processing is required to extract context and meaning from such data, and extensive research is ongoing to create the necessary tools. The layer below — textual data — is more suited to statistical analysis, and text analytics tools are widely used against big data of this type.

The final layer in our double pyramid is compound data, a combination of hard and soft information, typically containing the structural, syntactic and model information that adds context and meaning to hard information and bridges the gap between the two categories. Metadata is a very significant subset of compound data. It is part of the data/information continuum; not something to push out to one side of the information architecture as a separate box — as often seen in data warehousing architectures.

Compound data is probably the category of most current interest in big data. It contains much social media information — a combination of hard web log data and soft textual and multimedia data from sources such as Twitter and Facebook.

The width of each layer in the pyramids corresponds loosely to data volumes and numbers of records in each category. The outer color bands in Figure 2 place data warehousing and big data in context. The two concepts overlap significantly in the world of matter. The major difference is that big data includes and even focuses on the world of mind at the detailed, high volume level.

More importantly, the underlying reason we do data warehousing (more correctly, business intelligence, for which data warehousing is the architectural foundation) and analyze big data is essentially the same: we are searching for meaning in the data universe. And meaning resides at the conjoined apexes of the two pyramids.

Both data warehousing and big data begin with highly detailed data, and approach its meaning by moving toward very specific insights that are represented by small data sets that the human mind can grasp. The old nugget, now demoted to urban legend, of "men who buy diapers on Friday evenings are also likely to buy beer" is a case in point. Business intelligence works more from prior hypotheses, whereas big data uses statistics to extract hypotheses.

Now that we understand the different types of data and how big data and data warehousing relate, we can address the key question: does big data spell the end of data warehousing?


Reports of my death are greatly exaggerated

Data warehousing, as we currently do it — and that's a key phrase — is usually rather difficult to implement and maintain. The ultimate reason is that data warehousing seeks to ensure that enterprise-wide decision making is consistent and trusted. This was and is a valid and worthy objective, but it's also challenging. Furthermore, it has driven two architectural aims:

  1. To define, create and maintain a reconciled, integrated set of enterprise data for decision making.
  2. That this set should be the single source for all decision-making needs, be they immediate or long-term, one-off or ongoing, throw-away or permanent.

The first of these aims makes sense: there are many decisions which should be based on reconciled and integrated information for commercial, legal or regulatory reasons. The second aim was always questionable — as shown, for example, by the pervasive use of spreadsheets — and becomes much more so as data volumes and types grow. Big data offers new, easier and powerful ways to interactively explore even larger data sets, most of which have never seen the inside of a data warehouse and likely never will.

Current data warehousing practices also encourage and, in many ways, drive the creation of multiple copies of data. Data is duplicated across the three layers of the architecture in Figure 1, and further duplicated in the functional silos of the data marts. What's more, the practice of building independent data marts fed directly from the operational environment, bypassing the EDW entirely, is lamentably common. The advent of big data, with its large and growing data volumes, argues strongly against duplication of data. I've explored these issues and more in a series of articles on B-eye-Network (Devlin, 2010), concluding that a new inclusive architecture — Business Integrated Insight (BI2) — is required to extend existing data warehousing approaches.

Big data will give (re)birth to the data warehouse

As promised, it is time to return to the "single logical storehouse" of information required by the business. Back in the 1980s, that information was very limited in comparison to what business needs today, and its uses were similarly circumscribed. Today's business needs both a far broader information environment and a much more integrated processing approach. A single logical storehouse is required with both a well-defined, consistent and integrated physical core, and a loose federation of data whose diversity, timeliness and even inconsistency is valued. In order to discuss this sensibly, we need some new terminology that minimizes confusion and contention between the advocates of the various different technologies and approaches.

The first term is "Business Information Resource" (BIR), introduced in a Teradata-sponsored white paper (Devlin, 2009), and defined as a single logical view of the entire information foundation of the business that aims to differentiate between different data uses and to reduce the tendency to duplicate data multiple times. Within a unified information space, the BIR has a conceptual structure allowing reasonable boundaries of business interest and implementation viability to be drawn (Devlin, 2010a). With such a broad scope, the BIR is clearly instantiated in a number of technologies, of which relational and XML databases, and distributed file and content stores such as Hadoop are key. Thus, the relational database technology of the data warehouse is focused on the creation and maintenance of a set of information that can support common and consistent decision making. Hadoop, MapReduce and similar technologies are directed to their areas of strength such as temporary, throw away data, fast turnaround reports where speed trumps accuracy, text analysis, graphs, large-scale quantitative analytical sand boxes, and web farm reporting. Furthermore, these stores are linked through virtual access technology that presents the separate physical stores to the business user as a single entity as and when required.

The second term, "Core Business Information" (CBI), from an Attivio-sponsored white paper (Devlin, 2010b), is the set of information that ensures the long-term quality and consistency of the BIR. This information needs to be modeled and defined at an early stage of the design and its content and structure subject to rigorous change management. While other information may undergo changes in definition or relationships over time, the CBI must remain very stable.

While space doesn't permit a more detailed description here of these two concepts, the above-mentioned papers make clear that the CBI contains the information at the heart of a traditional enterprise data warehouse (and, indeed, of modern Master Data Management). The Business Information Resource, on the other hand, is a return to the conceptual basis of the data warehouse — a logical single storehouse of all the information required by the business, which, by definition, encompasses big data in all its glory.


While announcing the death of data warehousing and relational databases makes for attention-grabbing headlines, reality is more complex. Big data is actually a superset of the information and processes that have characterized data warehousing since its inception, with big data focusing on large-scale and often short-term analysis. With the advent of big data, data warehousing itself can return to its roots — the creation of consistency and trust in enterprise information. In truth, there exists a substantial overlap between the two areas; the precepts and methods of both are highly complementary and the two will be mandatory for all forward-looking businesses.


Devlin, B. A. and Murphy, P. T., "An architecture for a business and information system," IBM Systems Journal, Volume 27, Number 1, Page 60 (1988)

Devlin, B., "Business Integrated Insight (BI2) — Reinventing enterprise information management," White Paper, (2009)

Devlin, B., "From Business Intelligence to Enterprise IT Architecture," B-eye-Network, (2010)

Devlin, B., "Beyond Business Intelligence," Business Intelligence Journal, Volume 15, Number 2, Page 7, (2010a)

Devlin, B., "Beyond the Data Warehouse: A Unified Information Store for Data and Content," White Paper, (2010b)

