
March 05 2012

OpenCorporates opens up new database of corporate directors and officers

In an age of technology-fueled transparency, corporations are subject to the same powerful disruption as governments. In that context, data journalism has profound importance for society. If a researcher needs data for business journalism, OpenCorporates is a bona fide resource.

Today, OpenCorporates is making a new open database of corporate officers and directors available to the world.

"It's pretty cool, and useful for journalists, to be able to search not just all the companies with directors for a given name in a given state, but across multiple states," said Chris Taggart, founder of Open Corporates, in an email interview. "Not surprisingly, loads of people, from journalists to corruption investigators, are very interested in this."

OpenCorporates is the largest open database of companies and corporate data in the world. The service now contains public data from around the world, from health and safety violations in the United Kingdom to official public notices in Spain to a register of federal contractors. The database has been built by the open data community, under a bounty scheme in conjunction with ScraperWiki. The site also has a useful Google Refine reconciliation function that matches legal entities to company names. Taggart's presentation on OpenCorporates from the 2012 NICAR conference, which provides an overview, is embedded below:

The OpenCorporates open application programming interface can be used with or without a key, although an API key does increase usage limits. The open data site's business model comes with an interesting hook: while OpenCorporates makes its data both free and open under a Share-Alike Attribution Open Database License, users who wish to import the data into a proprietary database or use it without attribution must pay to do so.
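For readers who want to experiment with the API, here is a minimal Python sketch of an officer search by name within a jurisdiction. The endpoint path, parameter names and response structure shown here are assumptions for illustration rather than documented details, so check the OpenCorporates API documentation before relying on them.

    # Hypothetical sketch of an officer search against the OpenCorporates API.
    # The endpoint path, parameters and response shape are assumptions, not
    # documented facts; an api_token is optional but raises the usage limits.
    import requests

    BASE_URL = "https://api.opencorporates.com/v0.2"

    def search_officers(name, jurisdiction=None, api_token=None):
        params = {"q": name}
        if jurisdiction:
            params["jurisdiction_code"] = jurisdiction  # e.g. "us_fl" for Florida
        if api_token:
            params["api_token"] = api_token
        response = requests.get(BASE_URL + "/officers/search", params=params, timeout=30)
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        data = search_officers("John Smith", jurisdiction="us_fl")
        for item in data.get("results", {}).get("officers", []):
            officer = item.get("officer", {})
            print(officer.get("name"), "|", officer.get("position"))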

"The critical thing about our Directors import, and *all* the other data in OpenCorporates, is that we give the provenance, both where and when we got the information," said Taggart. "This is in contrast to the proprietary databases who never give this, because they don't want you to go straight to the source, which also means it's problematic in tracing the source of errors. We've had several instances of the data being wrong at the source, like U.K. health and safety violations."

Taggart offered more perspective on the source of OpenCorporates director data, corporate data availability and the landscape around a universal business ID in the rest of our interview:

Where does the officer and director data come from? How is it validated and cleaned?

It's all from the official company registers. Most are scraped (we've scraped millions of pages); a couple (e.g. Vermont) are from downloads that the registries provide. We just need to make sure we're scraping and importing properly. We do some cleaning up (e.g. removing some of the '**NO DIRECTOR**' entries), but to a degree this has to be done post-import, as you often don't know these till they're imported (which is why there are still a few in there).

By the way, in case you were wondering, the reason there are so many more directors than in the filters to the right is that there are about 3 million and counting Florida directors.

Was this data available anywhere before? If no, why not?

As far as I'm aware, only in proprietary databases. Proprietary databases have dominated company data. The result is massive duplication of effort; databases with opaque errors in them, because they don't have many eyes on them; and a lack of access for the public, small businesses and, as you will have heard at NICAR, journalists. I'm tempted to offer a bottle of champagne to the first journalist who finds a story in the directors data.

Who else is working on the universal business ID issue? I heard Beth Noveck propose something along these lines, for instance.

Several organizations have been working on this, mostly from a semi-proprietary point of view, or at least trying to generate a monopoly ID. In other words, it might be open, but in order to get anything on the company, you have to use their site as a lookup table.

OpenCorporates is different in that if you know the URI, you know the jurisdiction and the identifier issued by the company register, and vice versa. This means you don't need to ask OpenCorporates what the company ID is, as it's there in the ID. It also works with the EU/W3C's Business Vocabulary, which has just been published.

ISO has been working on one, but it's got exactly this problem. Also, their database won't contain the company number, meaning it doesn't link to the legal entity. Bloomberg have been working on one, as have Thomson Reuters, as they need an alternative to the DUNS number, but from the conversations I had in D.C., nobody's terribly interested in this.

I don't really know the status of Beth's project. They were intending to create a new ID too. From speaking to Jim Hendler, it didn't seem to be connected to the legal entity but instead to represent a search of the name (actually a hash of a SPARQL query). You can see a demo site at http://tw.rpi.edu/orgpedia/companies. I have severe doubts regarding this.

Finally, there's the Financial Stability Board's (part of the G20) work on a global legal entity identifier -- we're on the advisory board for this. This also would be a new number, and be voluntary, but on the other hand will be openly licensed.

I don't think it's a solution to the problem, as it won't be complete and for other reasons, but it may surface more information. We'd definitely provide an entity resolution service to it.
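To make the company URI pattern Taggart describes concrete: because the jurisdiction code and the register-issued company number are embedded in the URI itself, either can be recovered without asking OpenCorporates for a lookup. The short sketch below, using an invented example URI that follows that pattern, simply splits a URI back into those two parts.

    # Sketch: an OpenCorporates-style company URI encodes the jurisdiction and
    # the register-issued company number, so both can be read straight from it.
    # The example URI below is illustrative, not a verified record.
    from urllib.parse import urlparse

    def parse_company_uri(uri):
        parts = urlparse(uri).path.strip("/").split("/")
        if len(parts) != 3 or parts[0] != "companies":
            raise ValueError("unexpected URI shape: " + uri)
        jurisdiction_code, company_number = parts[1], parts[2]
        return jurisdiction_code, company_number

    print(parse_company_uri("https://opencorporates.com/companies/gb/00102498"))
    # -> ('gb', '00102498')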

February 22 2012

Data for the public good

Can data save the world? Not on its own. As an age of technology-fueled transparency, open innovation and big data dawns around the world, the success of new policy won't depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered.

Advocates, watchdogs and government officials now have new tools for data journalism and open government. Globally, there's a wave of transparency that will wash over every industry and government, from finance to healthcare to crime.

In that context, open government is about much more than open data — just look at the issues that flow around the #opengov hashtag on Twitter, including identity, privacy, security, procurement, culture, cloud computing, civic engagement, participatory democracy, corruption, civic entrepreneurship and transparency.

If we accept the premise that Gov 2.0 is a potent combination of open government, mobile, open data, social media, collective intelligence and connectivity, the lessons of the past year suggest that a tidal wave of technology-fueled change is still building worldwide.

The Economist's support for open government data remains salient today:

"Public access to government figures is certain to release economic value and encourage entrepreneurship. That has already happened with weather data and with America's GPS satellite-navigation system that was opened for full commercial use a decade ago. And many firms make a good living out of searching for or repackaging patent filings."

As Clive Thompson reported at Wired last year, public sector data can help fuel jobs, and "shoving more public data into the commons could kick-start billions in economic activity." In the transportation sector, for instance, transit data is open government fuel for economic growth.

There is a tremendous amount of work ahead in building upon the foundations that civil society has constructed over decades. If you want a deep look at what the work of digitizing data really looks like, read Carl Malamud's interview with Slashdot on opening government data.

Data for the public good, however, goes far beyond government's own actions. In many cases, it will happen despite government action — or, often, inaction — as civic developers, data scientists and clinicians pioneer better analysis, visualization and feedback loops.

For every civic startup or regulation, there's a backstory that often involves a broad number of stakeholders. Governments have to commit to opening up but will, in many cases, need external expertise or even funding to do so. Citizens, industry and developers have to show up to use the data, demonstrating that there's not only demand, but also skill outside of government to put open data to work in service of accountability, citizen utility and economic opportunity. Galvanizing the co-creation of civic services, policies or apps isn't easy, but tapping the potential of the civic surplus has attracted the attention of governments around the world.

There are many challenges for that vision to come to pass. For one, data quality and access remain poor. Socrata's open data study identified progress, but also pointed to a clear need for improvement: only 30% of developers surveyed said that government data was available, and of that, 50% of the data was unusable.

Open data will not be a silver bullet to all of society's ills, but an increasing number of states are assembling platforms and stimulating an app economy.

Results-oriented mayors like Rahm Emanuel and Mike Bloomberg are committing to opening up government data in Chicago and New York City, respectively.

Following are examples of where data for the public good is already having an impact upon the world we live in, along with some ideas about what lies ahead.

Financial good

Anyone looking for civic entrepreneurship will be hard pressed to find a better recent example than BrightScope. The efforts of Mike and Ryan Alfred are in line with traditional entrepreneurship: identifying an opportunity in a market that no one else has created value around, building a team to capitalize on it, and then investing years of hard work to execute on that vision. In the process, BrightScope has made government data about the financial industry more usable, searchable and open to the public.

Due to the efforts of these two entrepreneurs and their California-based startup, anyone who wants to learn more about financial advisers before tapping one to manage their assets can do so online.

Prior to BrightScope, the adviser data was locked up at the Securities and Exchange Commission (SEC) and the Financial Industry Regulatory Authority (FINRA).

"Ryan and I knew this data was there because we were advisers," said BrightScope co-founder Mike Alfred in a 2011 interview. "We knew data had been filed, but it wasn't clear what was being done with it. We'd never seen it liberated from the government databases."

While they knew the public data existed and had their idea years ago, Alfred said it didn't happen because they "weren't in the mindset of being data entrepreneurs" yet. "By going after 401(k) first, we could build the capacity to process large amounts of data," Alfred said. "We could take that data and present it on the web in a way that would be usable to the consumer."

Notably, the government data that BrightScope has gathered on financial advisers goes further than a given profile page. Over time, as search engines like Google and Bing index the information, the data has become searchable in places consumers are actually looking for it. That's aligned with one of the laws for open data that Tim O'Reilly has been sharing for years: Don't make people find data. Make data find the people.

As agencies adapt to new business relationships, consumers are starting to see increased access to government data. Now, more data that the nation's regulatory agencies collected on behalf of the public can be searched and understood by the public. Open data can improve lives, not least through adding more transparency into a financial sector that desperately needs more of it. This kind of data transparency will give the best financial advisers the advantage they deserve and make it much harder for your Aunt Betty to choose someone with a history of financial malpractice.

The next phase of financial data for good will use big data analysis and algorithmic consumer advice tools, or "choice engines," to make better decisions. The vast majority of consumers are unlikely to ever look directly at raw datasets themselves. Instead, they'll use mobile applications, search engines and social recommendations to make smarter choices.

There are already early examples of such services emerging. Billshrink, for example, lets consumers get personalized recommendations for a cheaper cell phone plan based on calling histories. Mint makes specific recommendations on how a citizen can save money based upon data analysis of the accounts added. Moreover, much of the innovation in this area is enabled by the ability of entrepreneurs and developers to go directly to data aggregation intermediaries like Yodlee or CashEdge to license the data.


Transit data as economic fuel

Transit data continues to be one of the richest and most dynamic areas for co-creation of services. Around the United States and beyond, there has been a blossoming of innovation in the city transit sector, driven by the passion of citizens and fueled by the release of real-time transit data by city governments.

Francisca Rojas, research director at the Harvard Kennedy School's Transparency Policy Project, has investigated the dynamics behind the disclosure of data by transit agencies in the United States, which she calls one of the most successful implementations of open government. "In just a few years, a rich community has developed around this data, with visionary champions for disclosure inside transit agencies collaborating with eager software developers to deliver multiple ways for riders to access real-time information about transit," wrote Rojas.

The Massachusetts Bay Transportation Authority (MBTA) learned from Portland, Oregon's TriMet that open data is better. "This was the best thing the MBTA had done in its history," said Laurel Ruma, O'Reilly's director of talent and a long-time resident of greater Boston, in her 2010 Ignite talk on real-time transit data. The MBTA's move to make real-time data available and support it has spawned a new ecosystem of mobile applications, many of which are featured at MBTA.com.

There are now 44 different consumer-facing applications for the TriMet system. Chicago, Washington and New York City also have a growing ecosystem of applications.

As more sensors go online in smarter cities, tracking traffic patterns will enable public administrators to optimize routes, schedules and capacity, driving efficiency and a better allocation of resources.

Transparency and civic goods

As John Wonderlich, policy director at the Sunlight Foundation, observed last year, access to legislative data brings citizens closer to their representatives. "When developers and programmers have better access to the data of Congress, they can better build the databases and tools that let the rest of us connect with the legislature."

That's the promise of the Sunlight Foundation's work, in general: Technology-fueled transparency will help fight corruption, fraud and reveal the influence behind policies. That work is guided by data, generated, scraped and aggregated from government and regulatory bodies. The Sunlight Foundation has been focused on opening up Congress through technology since the organization was founded. Some of its efforts culminated recently with the publication of a live XML feed for the House floor and a transparency portal for House legislative documents.

There are other horizons for transparency through open government data, which broadly refers to public sector records that have been made available to citizens. For a canonical resource on what makes such releases truly "open," consult the "8 Principles of Open Government Data."

For instance, while gerrymandering has been part of American civic life since the birth of the republic, one of the best policy innovations of 2011 may offer hope for improving the redistricting process. DistrictBuilder, an open-source tool created by the Public Mapping Project, allows anyone to easily create legal districts.

"During the last year, thousands of members of the public have participated in online redistricting and have created hundreds of valid public plans," said Micah Altman, senior research scientist at Harvard University Institute for Quantitative Social Science, via an email last year.

"In substantial part, this is due to the project's effort and software. This year represents a huge increase in participation compared to previous rounds of redistricting — for example, the number of plans produced and shared by members of the public this year is roughly 100 times the number of plans submitted by the public in the last round of redistricting 10 years ago," Altman said. "Furthermore, the extensive news coverage has helped make a whole new set of people aware of the issue and has re framed it as a problem that citizens can actively participate in to solve, rather than simply complain about."

Principles for data in the public good

As a result of digital technology, our collective public memory can now be shared and expanded upon daily. In a recent lecture on public data for public good at Code for America, Michal Migurski of Stamen Design made the point that part of the global financial crisis came through a crisis in public knowledge, citing "The Destruction of Economic Facts," by Hernando de Soto.

To arrive at virtuous feedback loops that amplify the signals citizens, regulators, executives and elected leaders need to make better decisions amid a flood of information, data providers and infomediaries will need to embrace the key principles Migurski's lecture outlined.

First, "data drives demand," wrote Tim O'Reilly, who attended the lecture and distilled Migurski's insights. "When Stamen launched crimespotting.org, it made people aware that the data existed. It was there, but until they put visualization front and center, it might as well not have been."

Second, "public demand drives better data," wrote O'Reilly. "Crimespotting led Oakland to improve their data publishing practices. The stability of the data and publishing on the web made it possible to have this data addressable with public links. There's an 'official version,' and that version is public, rather than hidden."

Third, "version control adds dimension to data," wrote O'Reilly. "Part of what matters so much when open source, the web, and open data meet government is that practices that developers take for granted become part of the way the public gets access to data. Rather than static snapshots, there's a sense that you can expect to move through time with the data."

The case for open data

Accountability and transparency are important civic goods, but persuading a city chief financial officer to support open data initiatives requires grounded arguments. When it comes to making a business case for open data, John Tolva, the chief technology officer for Chicago, identified four areas that support the investment in open government:

  1. Trust — "Open data can build or rebuild trust in the people we serve," Tolva said. "That pays dividends over time."
  2. Accountability of the work force — "We've built a performance dashboard with KPIs [key performance indicators] that track where the city directly touches a resident."
  3. Business building — "Weather apps, transit apps ... that's the easy stuff," he said. "Companies built on reading vital signs of the human body could be reading the vital signs of the city."
  4. Urban analytics — "Brett [Goldstein] established probability curves for violent crime. Now we're trying to do that elsewhere, uncovering cost savings, intervention points, and efficiencies."

New York City is also using data internally. The city is doing things like applying predictive analytics to building code violations and housing data to try to understand where potential fire risks might exist.

"The thing that's really exciting to me, better than internal data, of course, is open data," said New York City chief digital officer Rachel Sterne during her talk at Strata New York 2011. "This, I think, is where we really start to reach the potential of New York City becoming a platform like some of the bigger commercial platforms and open data platforms. How can New York City, with the enormous amount of data and resources we have, think of itself the same way Facebook has an API ecosystem or Twitter does? This can enable us to produce a more user-centric experience of government. It democratizes the exchange of information and services. If someone wants to do a better job than we are in communicating something, it's all out there. It empowers citizens to collaboratively create solutions. It's not just the consumption but the co-production of government services and democracy."

The promise of data journalism

The ascendance of data journalism in media and government will continue to gather force in the years ahead.

Journalists and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news. Developments often break first on social networks, and that information is then curated by a combination of professionals and amateurs. News is then analyzed and synthesized into contextualized journalism.

Data is being scraped by journalists, generated from citizen reporting, or gleaned from massive information dumps — such as with the Guardian's formidable data journalism, as detailed in a recent ebook. ScraperWiki, a favorite tool of civic coders at Code for America and elsewhere, enables anyone to collect, store and publish public data. As we grapple with the consumption challenges presented by this deluge of data, new publishing platforms are also empowering us to gather, refine, analyze and share data ourselves, turning it into information.

There are a growing number of data journalism efforts around the world, from New York Times interactive features to the award-winning investigative work of ProPublica. Here are just a few promising examples:

  • Spending Stories, from the Open Knowledge Foundation, is designed to add context to news stories based upon government data by connecting stories to the data used.
  • Poderopedia is trying to bring more transparency to Chile, using data visualizations that draw upon a database of editorial and crowdsourced data.
  • The State Decoded is working to make the law more user-friendly.
  • Public Laboratory is a tool kit and online community for grassroots data gathering and research that builds upon the success of Grassroots Mapping.
  • Internews and its local partner Nai Mediawatch launched a new website that shows incidents of violence against journalists in Afghanistan.

Open aid and development

The World Bank has been taking unprecedented steps to make its data more open and usable to everyone. The data.worldbank.org website that launched in September 2010 was designed to make the bank's open data easier to use. In the months since, more than 100 applications have been built using the data.

"Up until very recently, there was almost no way to figure out where a development project was," said Aleem Walji, practice manager for innovation and technology at the World Bank Institute, in an interview last year. "That was true for all donors, including us. You could go into a data bank, find a project ID, download a 100-page document, and somewhere it might mention it. To look at it all on a country level was impossible. That's exactly the kind of organization-centric search that's possible now with extracted information on a map, mashed up with indicators. All of sudden, donors and recipients can both look at relationships."

Open data efforts are not limited to development. More data-driven transparency in aid spending is also going online. Last year, the United States Agency for International Development (USAID) launched a public engagement effort to raise awareness about the devastating famine in the Horn of Africa. The FWD campaign includes a combination of open data, mapping and citizen engagement.

"Frankly, it's the first foray the agency is taking into open government, open data, and citizen engagement online," said Haley Van Dyck, director of digital strategy at USAID, in an interview last year.

"We recognize there is a lot more to do on this front, but are happy to start moving the ball forward. This campaign is different than anything USAID has done in the past. It is based on informing, engaging, and connecting with the American people to partner with us on these dire but solvable problems. We want to change not only the way USAID communicates with the American public, but also the way we share information."

USAID built and embedded interactive maps on the FWD site. The agency created the maps with open source mapping tools and published the datasets it used to make these maps on data.gov. All are available to the public and media to download and embed as well.

Publishing maps and the open data that drives them online simultaneously represents a significant step forward for any government agency, and it sets a worthy bar for future efforts to meet. USAID accomplished this by migrating its data to an open, machine-readable format.

"In the past, we released our data in inaccessible formats — mostly PDFs — that are often unable to be used effectively," said Van Dyck. "USAID is one of the premiere data collectors in the international development space. We want to start making that data open, making that data sharable, and using that data to tell stories about the crisis and the work we are doing on the ground in an interactive way."

Crisis data and emergency response

Unprecedented levels of connectivity now exist around the world. According to a 2011 survey from the Pew Internet & American Life Project, more than 50% of American adults use social networks, 35% of American adults have smartphones, and 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. Networked publics can now share the effects of disasters in real time, providing officials with unprecedented insight into what's happening. Citizens act as sensors in the midst of the storm, creating an ad hoc system of networked accountability through data.

The growth of an Internet of Things is an important evolution. What we saw during Hurricane Irene in 2011 was the increasing importance of an Internet of people, where citizens act as sensors during an emergency. Emergency management practitioners and first responders have woken up to the potential of using social data for enhanced situational awareness and resource allocation.

An historic emergency social data summit in Washington in 2010 highlighted how relevant this area has become. And last year's hearing in the United States Senate on the role of social media in emergency management was "a turning point in Gov 2.0," said Brian Humphrey of the Los Angeles Fire Department.

The Red Cross has been at the forefront of using social data in a time of need. That's not entirely by choice, given that news of disasters has consistently broken first on Twitter. The challenge is for the men and women entrusted with coordinating response to identify signals in the noise.

First responders and crisis managers are using a growing suite of tools for gathering information and sharing crucial messages internally and with the public. Structured social data and geospatial mapping suggest one direction where these tools are evolving in the field.

A web application from ESRI deployed during historic floods in Australia demonstrated how crowdsourced social intelligence provided by Ushahidi can enable emergency social data to be integrated into crisis response in a meaningful way.

The Australian flooding web app includes the ability to toggle layers from OpenStreetMap, satellite imagery, and topography, and then filter by time or report type. By adding structured social data, the web app provides geospatial information system (GIS) operators with valuable situational awareness that goes beyond standard reporting, including the locations of property damage, roads affected, hazards, evacuations and power outages.
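As a toy illustration of that kind of structured filtering (not the actual ESRI or Ushahidi implementation, and with a report schema invented for the example), the Python sketch below filters crowdsourced crisis reports by report type and time window.

    # Toy sketch: filter structured crisis reports by type and time window.
    # The report fields and values are invented for illustration only.
    from datetime import datetime, timedelta

    reports = [
        {"type": "road_affected", "lat": -27.47, "lon": 153.02,
         "time": datetime(2011, 1, 12, 8, 30)},
        {"type": "power_outage", "lat": -27.50, "lon": 153.01,
         "time": datetime(2011, 1, 12, 14, 0)},
        {"type": "evacuation", "lat": -27.45, "lon": 153.03,
         "time": datetime(2011, 1, 13, 9, 15)},
    ]

    def filter_reports(items, report_type=None, since=None, until=None):
        selected = []
        for report in items:
            if report_type and report["type"] != report_type:
                continue
            if since and report["time"] < since:
                continue
            if until and report["time"] > until:
                continue
            selected.append(report)
        return selected

    # Example: power outages reported during January 12, 2011.
    day = datetime(2011, 1, 12)
    print(filter_reports(reports, "power_outage", since=day, until=day + timedelta(days=1)))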

Long before the floods or the Red Cross joined Twitter, however, Brian Humphrey of the Los Angeles Fire Department (LAFD) was already online, listening. "The biggest gap directly involves response agencies and the Red Cross," said Humphrey, who currently serves as the LAFD's public affairs officer. "Through social media, we're trying to narrow that gap between response and recovery to offer real-time relief."

After the devastating 2010 earthquake in Haiti, the evolution of volunteers working collaboratively online also offered a glimpse into the potential of citizen-generated data. Crisis Commons has acted as a sort of "geeks without borders." Around the world, developers, GIS engineers, online media professionals and volunteers collaborated on information technology projects to support disaster relief for post-earthquake Haiti, mapping streets on OpenStreetMap and collecting crisis data on Ushahidi.

Healthcare

What happens when patients find out how good their doctors really are? That was the question that Harvard Medical School professor Dr. Atul Gawande asked in the New Yorker, nearly a decade ago.

The narrative he told in that essay makes the history of quality improvement in medicine compelling, connecting it to the creation of a data registry at the Cystic Fibrosis Foundation in the 1950s. As Gawande detailed, that data was privately held. After it became open, life expectancy for cystic fibrosis patients tripled.

In 2012, the new hope is in big data, where techniques for finding meaning in the huge amounts of unstructured data generated by healthcare diagnostics offer immense promise.

The trouble, say medical experts, is that data availability and quality remain significant pain points that are holding back existing programs.

There are, literally, bright spots that suggest what's possible. Dr. Gawande's 2011 essay, which considered whether "hotspotting" using health data could help lower medical costs by giving the neediest patients better care, offered another perspective on the issue. Early outcomes made the approach look compelling. As Dr. Gawande detailed, when a Medicare demonstration program offered medical institutions payments that financed the coordination of care for their most chronically expensive beneficiaries, hospital stays and trips to the emergency room dropped more than 15% over the course of three years. A test program adopting a similar approach in Atlantic City saw a 25% drop in costs.

Through sharing data and knowledge, and then creating a system to convert ideas into practice, clinicians in the ImproveCareNow network were able to improve the remission rate for Crohn's disease from 49% to 67% without the introduction of new drugs.

In Britain, researchers found that the outcomes for adult cardiac patients improved after the publication of information on death rates. With the release of meaningful new open government data about performance and outcomes from the British national healthcare system, similar improvements may be on the way.

"I do believe we are at the beginning of a revolutionary moment in health care, when patients and clinicians collect and share data, working together to create more effective health care systems," said Susannah Fox, associate director for digital strategy at the Pew Internet and Life Project, in an interview in January. Fox's research has documented the social life of health information, the concept of peer-to-peer healthcare, and the role of the Internet among people living with chronic disease.

In the past few years, entrepreneurs, developers and government agencies have been collaboratively exploring the power of open data to improve health. In the United States, the open data story in healthcare is evolving quickly, from new mobile apps that lead to better health decisions to data spurring changes in care at the U.S. Department of Veterans Affairs.

Since he entered public service, Todd Park, the first chief technology officer of the U.S. Department of Health and Human Services (HHS), has focused on unleashing the power of open data to improve health. If you aren't familiar with this story, read the Atlantic's feature article that explores Park's efforts to revolutionize the healthcare industry through better use of data.

Park has focused on releasing data at Health.Data.Gov. In a speech to a Hacks and Hackers meetup in New York City in 2011, Park emphasized that HHS wasn't just releasing new data: "[We're] also making existing data truly accessible or usable," he said, taking "stuff that's in a book or on a website and turning it into machine-readable data or an API."

Park said it's still quite early in the project and that the work isn't just about data — it's about how and where it's used. "Data by itself isn't useful. You don't go and download data and slather data on yourself and get healed," he said. "Data is useful when it's integrated with other stuff that does useful jobs for doctors, patients and consumers."

What lies ahead

There are four trends that warrant special attention as we look to the future of data for the public good: civic network effects, smart disclosure, personal data assets and hybridized public-private data.

Civic network effects

Community is a key ingredient in successful open government data initiatives. It's not enough to simply release data and hope that venture capitalists and developers magically become aware of the opportunity to put it to work. Marketing open government data is what repeatedly brought federal Chief Technology Officer Aneesh Chopra and Park out to Silicon Valley, New York City and other business and tech hubs.

Despite the addition of topical communities to Data.gov, conferences and new media efforts, government's attempts to act as an "impatient convener" can only go so far. Civic developer and startup communities are creating a new distributed ecosystem that will help create that community, from BuzzData to Socrata to new efforts like Max Ogden's DataCouch.

Smart disclosure

There are enormous opportunities for economic and civic good in the "smart disclosure" of personal data, whereby a private company or government institution provides a person with access to his or her own data in open formats. Smart disclosure is defined by Cass Sunstein, Administrator of the White House Office of Information and Regulatory Affairs, as a process that "refers to the timely release of complex information and data in standardized, machine-readable formats in ways that enable consumers to make informed decisions."

For instance, the quarterly financial statements of the top public companies in the world are now available online through the Securities and Exchange Commission.

Why does it matter? The interactions of citizens with companies or government entities generate a huge amount of economically valuable data. If consumers and regulators had access to that data, they could tap it to make better choices about everything from finance to healthcare to real estate, much in the same way that web applications like Hipmunk and Zillow let consumers make more informed decisions.

Personal data assets

When a trend makes it to the World Economic Forum (WEF) in Davos, it's generally evidence that the trend is gathering steam. A report titled "Personal Data: The Emergence of a New Asset Class" suggests that 2012 will be the year when citizens start thinking more about data ownership, whether that data is generated by private companies or the public sector.

"Increasing the control that individuals have over the manner in which their personal data is collected, managed and shared will spur a host of new services and applications," wrote the paper's authors. "As some put it, personal data will be the new 'oil' — a valuable resource of the 21st century. It will emerge as a new asset class touching all aspects of society."

The idea of data as a currency is still in its infancy, as Strata Conference chair Edd Dumbill has emphasized. The Locker Project, which provides people with the ability to move their own data around, is one of many approaches.

The growth of the Quantified Self movement and online communities like PatientsLikeMe and 23andMe validates the strength of the movement. In the U.S. federal government, the Blue Button initiative, which enables veterans to download personal health data, has now spread to all federal employees and earned adoption at Aetna and Kaiser Permanente.

In early 2012, a Green Button was launched to unleash energy data in the same way. Venture capitalist Fred Wilson called the Green Button an "OAuth for energy data."

Wilson wrote:

"It is a simple standard that the utilities can implement on one side and web/mobile developers can implement on the other side. And the result is a ton of information sharing about energy consumption and, in all likelihood, energy savings that result from more informed consumers."

Hybridized public-private data

Free or low-cost online tools are empowering citizens to do more than donate money or blood: Now, they can donate time or expertise, or even act as sensors. In the United States, we saw a leading edge of this phenomenon in the Gulf of Mexico, where Oil Reporter, an open source oil spill reporting app, provided a prototype for data collection via smartphone. In Japan, an analogous effort called Safecast grew and matured in the wake of the nuclear disaster that resulted from a massive earthquake and subsequent tsunami in 2011.

Open source software and citizens acting as sensors have steadily been integrated into journalism over the past few years, most dramatically in the videos and pictures uploaded after the 2009 Iran election and during 2011's Arab Spring.

Citizen science looks like the next frontier. Safecast is combining open data collected by citizen science with academic, NGO and open government data (where available), and then making it widely available. It's similar to other projects, where public data and experimental data are percolating.

Public data is a public good

Despite the myriad challenges presented by legitimate concerns about privacy, security, intellectual property and liability, the promise of more informed citizens is significant. McKinsey's 2011 report dubbed big data the next frontier for innovation, with billions of dollars of economic value yet to be created. When that innovation is applied on behalf of the public good, whether it's in city planning, transit, healthcare, government accountability or situational awareness, those effects will be extended.

We're entering the feedback economy, where dynamic feedback loops between customers and corporations, partners and providers, citizens and governments, or regulators and companies can drive both efficiencies and leaner, smarter governments.

The exabyte age will bring with it the twin challenges of information overload and overconsumption, both of which will require organizations of all sizes to use the emerging toolboxes for filtering, analysis and action. To create public good from public goods — the public sector data that governments collect, the private sector data that is being collected and the social data that we generate ourselves — we will need to collectively forge new compacts that honor existing laws and visionary agreements that enable the new data science to put the data to work.

Photo: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr


January 27 2012

Top stories: January 23-27, 2012

Here's a look at the top stories published across O'Reilly sites this week.

On pirates and piracy
Mike Loukides: "I'm not willing to have the next Bach, Beethoven, or Shakespeare post their work online, only to have it taken down because they haven't paid off a bunch of executives who think they own creativity."

Microsoft's plan for Hadoop and big data
Strata conference chair Edd Dumbill takes a look at Microsoft's plans for big data. By embracing Hadoop, the company aims to keep Windows and Azure as a standards-friendly option for data developers.

Coming soon to a location near you: The Amazon Store?
Jason Calacanis says an Amazon retail presence isn't out of the question and that AmazonBasics is a preview of what's to come.

Survey results: How businesses are adopting and dealing with data
Feedback from a recent Strata Online Conference suggests there's a large demand for clear information on what big data is and how it will change business.

Why the fuss about iBooks Author?
Apple doesn't have an objective to move the publishing industry forward. With iBooks Author, the company sees an opportunity to reinvent this industry within its own closed ecosystem.


Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

November 03 2011

Strata Week: Cloudera founder has a new data product

Here are a few of the data stories that caught my attention this week:

Odiago: Cloudera founder Christophe Bisciglia's next big data project

Cloudera founder Christophe Bisciglia unveiled his new data startup this week: Odiago. The company's product, WibiData (say it out loud), uses Apache Hadoop and HBase to analyze consumer web data. Database industry analyst Curt Monash describes WibiData on his DBMS2 blog:

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop, is a data management and analytic execution layer. That's where the secret sauce resides.

GigaOm's Derrick Harris posits that Odiago points to "the future of Hadoop-based products." Rather than having to "roll your own" Hadoop solutions, future Hadoop users will be able to build their apps to tap into other products that do the "heavy lifting."

Hortonworks launches its data platform

Hadoop company Hortonworks, which spun out of Yahoo earlier this year, officially announced its products and services this week. The Hortonworks Data Platform is an open source distribution powered by Apache Hadoop. It includes the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase and ZooKeeper, as well as HCatalog and open APIs for integration. The Hortonworks Data Platform also includes Ambari, another Apache project, which will serve as the Hadoop installation and management system.

It's possible Hortonworks' efforts will pick up the pace of the Hadoop release cycle and address what ReadWriteWeb's Scott Fulton sees as the "degree of fragmentation and confusion." But as GigaOm's Derrick Harris points out, there is still "so much Hadoop in so many places, with multiple companies offering their own Hadoop solutions."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on registration with the code RADAR20.

Big education content meets big education data

A couple of weeks ago, the adaptive learning startup Knewton announced that it had raised an additional $33 million. This latest round was led by Pearson, the largest education company in the world. As such, the announcement this week that Knewton and Pearson are partnering is hardly surprising.

But this partnership does mark an important development for big data, textbook publishing, and higher education.

Knewton's adaptive learning platform will be integrated with Pearson's digital courseware, giving students individualized content as they move through the materials. To begin with, Knewton will work with just a few of the subjects within Pearson's MyLab and Mastering catalog. There are more than 750 courses in that catalog, and the adaptive learning platform will be integrated with more of them soon. The companies also say they plan to "jointly develop a line of custom, next-generation digital course solutions, and will explore new products in the K12 and international markets."

The data from Pearson's vast student customer base — some 9 million higher ed students use Pearson materials — will certainly help Knewton refine its learning algorithms. In turn, the promise of adaptive learning systems means that students and teachers will be able to glean insights from the learning process — what students understand, what they don't — in real time. It also means that teachers can provide remediation aimed at students' unique strengths and weaknesses.

Got data news?

Feel free to email me.


September 20 2011

BuzzData: Come for the data, stay for the community

As the data deluge created by the activities of global industries accelerates, the need for decision makers to find a signal in the noise will only grow more important. Therein lies the promise of data science, from data visualizations to dashboards to predictive algorithms that filter the exaflood and produce meaning for those who need it most. Data consumers and data producers, however, are both challenged by "dirty data" and limited access to the expertise and insight they need. To put it another way, as Alistair Croll has observed here at Radar, if you can't derive value, there's no such thing as big data.

BuzzData, based in Toronto, Canada, is one of several startups looking to help bridge that gap. BuzzData launched this spring with a combination of online community and social networking that is reminiscent of what GitHub provides for code. The thinking here is that every dataset will have a community of interest around the topic it describes, no matter how niche it might be. Once uploaded, each dataset has tabs for tracking versions, visualizations, related articles, attachments and comments. BuzzData users can "follow" datasets, just as they would a user on Twitter or a page on Facebook.

"User experience is key to building a community around data, and that's what BuzzData seems to be set on doing," said Marshall Kirkpatrick, lead writer at ReadWriteWeb, in an interview. "Right now it's a little rough around the edges to use, but it's very pretty, and that's going to open a lot of doors. Hopefully a lot of creative minds will walk through those doors and do things with the data they find there that no single person would have thought of or been capable of doing on their own."

The value proposition that BuzzData offers will depend upon many more users showing up and engaging with one another and, most importantly, the data itself. For now, the site remains in limited beta with hundreds of users, including at least one government entity, the City of Vancouver.

"Right now, people email an Excel spreadsheet around or spend time clobbering a shared file on a network," said Mark Opauszky, the startup's CEO, in an interview late this summer. "Our behind-the-scenes energy is focused on interfaces so that you can talk through BuzzData instead. We're working to bring the same powerful tools that programmers have for source code into the world of data. Ultimately, you're not adding and removing lines of code — you're adding and removing columns of data."

Opauszky said that BuzzData is actively talking with data publishers about the potential of the platform: "What BuzzData will ultimately offer when we move beyond a minimum viable product is for organizations to have their own territory in that data. There is a 'brandability' to that option. We've found it very easy to make this case to corporations, as they're already spending dollars, usually on social networks, to try to understand this."

That corporate constituency may well be where BuzzData finds its business model, though the executive team was careful to caution that they're remaining flexible. It's "absolutely a freemium model," said Opauszky. "It's a fundamentally free system, but people can pay a nominal fee on an individual basis for some enhanced features — primarily the ability to privatize data projects, which by default are open. Once in a while, people will find that they're on to something and want a smaller context. They may want to share files, commercialize a data product, or want to designate where data is stored geographically."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30


Open data communities

"We're starting to see analysis happen, where people tell 'data stories' that are evolving in ways they didn't necessarily expect when they posted data on BuzzData," said Opauszky. "Once data is uploaded, we see people use it, fork it, and evolve data stories in all sorts of directions that the original data publishers didn't perceive."

For instance, a dataset of open data hubs worldwide has attracted a community that improved the original upload considerably. BuzzData featured the work of James McKinney, a civic hacker from Montreal, Canada, in making it so. A Google Map mashing up locations is embedded below:


The hope is that communities of developers, policy wonks, media, and designers will self-aggregate around datasets on the site and collectively improve them. Hints of that future are already present, as open government advocate David Eaves highlighted in his post on open source data journalism at BuzzData. As Eaves pointed out, it isn't just media companies that should be paying attention to the trends around open data journalism:

For years I argued that governments — and especially politicians — interested in open data have an unhealthy appetite for applications. They like the idea of sexy apps on smart phones enabling citizens to do cool things. To be clear, I think apps are cool, too. I hope in cities and jurisdictions with open data we see more of them. But open data isn't just about apps. It's about the analysis.

Imagine a city's budget up on BuzzData. Imagine the flow rates of the water or sewage system. Or the inventory of trees. Think of how a community of interested and engaged "followers" could supplement that data, analyze it, and visualize it. Maybe they would be able to explain it to others better, to find savings or potential problems, or develop new forms of risk assessment.

Open data journalism

"It's an interesting service that's cutting down barriers to open data crunching," said Craig Saila, director of digital products at the Globe and Mail, Canada's national newspaper, in an interview. He said that the Globe and Mail has started to open up the data that it's collecting, like forest fire data, at the Globe and Mail BuzzData account.

"We're a traditional paper with a strong digital component that will be a huge driver in the future," said Saila. "We're putting data out there and letting our audiences play with it. The licensing provides us with a neutral source that we can use to share data. We're working with data suppliers to release the data that we have or are collecting, exposing the Globe's journalism to more people. In a lot of ways, it's beneficial to the Globe to share census information, press releases and statistics."

The Globe and Mail is not, however, hosting any information there that's sensitive. "In terms of confidential information, I'm not sure if we're ready as a news organization to put that in the cloud," said Saila. "We're just starting to explore open data as a thing to share, following the Guardian model."

Saila said that he's found the private collaboration model useful. "We're working on a big data project where we need to combine all of the sources, and we're trying to munge them all together in a safe place," he said. "It's a great space for journalists to connect and normalize public data."

The BuzzData team emphasized that they're not trying to be another data marketplace, like Infochimps, or replace Excel. "We made an early decision not to reinvent the wheel," said Opauszky, "but instead to try to be a water cooler, in the same way that people go to Vimeo to share their work. People don't go to Flickr to edit photos or YouTube to edit videos. The value is to be the connective tissue of what's happening."

If that question about "what's happening?" sounds familiar to Twitter users, it's because that kind of stream is part of BuzzData's vision for the future of open data communities.

"One of the things that will become more apparent is that everything in the interface is real time," said Opauszky. "We think that topics will ultimately become one of the most popular features on the site. People will come from the Guardian or the Economist for the data and stay for the conversation. Those topics are hives for peers and collaborators. We think that BuzzData can provide an even 'closer to the feed' source of information for people's interests, similar to the way that journalists monitor feeds in Tweetdeck."


August 24 2011

Inside Google+: The virtuous circle of data and doing right by users

Tim O'Reilly and Google+ VP of Product Bradley Horowitz took a deep dive into Google+ and a host of adjacent topics during a webcast yesterday. A full recording of their conversation is embedded at the end of this post — and it's well worth watching — but I thought it would be useful to extract and amplify a couple of key points that were made. Horowitz will expand on some of these ideas during his session at next month's Strata Summit.

Data lock-in and the virtuous circle

Data has supplanted source code as the key to lock-in. This shift was the focus of an interesting exchange between O'Reilly and Horowitz (it begins at the 13:37 mark).

"Clearly, developers and users are betting on Google with the integrity of their data," Horowitz said. "We're trying to do right by that opportunity." Horowitz pointed to the Data Liberation Front as an important part of Google's approach to data. "We have to allow people with the pull of a handle to up and leave and take that data."

O'Reilly noted that a "virtuous circle" forms when data runs through certain stacks — Apple devices tend to work better with Apple services, Google services operate well with other Google services, and these positive experiences keep users contained within an ecosystem. "The question of consolidation is there," O'Reilly said. "In that world of consolidated stacks, user freedom may not be about having your own software source code. It's about being able to get your data somewhere else. That's the shift we're in the middle of, with people starting to understand that lock-in really comes from your data more than your source code."

"That's the huge leap of faith that we take with the Data Liberation effort," Horowitz responded. "This is exactly contrary to services that are trying to build roach motels and ant farms. We're trying to give users choice. They can leave and come back. We want people to use this because we're offering the best service in the market at any given instant, and not because they're trapped at Google."

Strata Summit New York 2011, being held Sept. 20-21, is for executives, entrepreneurs, and decision-makers looking to harness data. Hear from the pioneers who are succeeding with data-driven strategies, and discover the data opportunities that lie ahead.

Save 30% on registration with the code SS11RAD

Getting it right

The importance of building Google+ correctly popped up throughout the discussion, with Horowitz applying it to the controversy around Google+ pseudonyms (26:27 mark), the need to expand Google+ to enterprise users and other audiences (22:13 mark), and the eventual — but unannounced — release of Google+ APIs (18:34 mark). This same hyper-focus on creating thoughtful and well-constructed user experiences was also evident during Alex Howard's recent interview with Google+ team member Joseph Smarr.

Google+ and the competition

Facebook announced a number of changes to its sharing tools yesterday — some of which resemble functionality available on Google+ — so the topic naturally came up during the discussion (at the 39:55 mark).

"I think what they did was familiar and good for users," Horowitz said when asked about Facebook's changes. "That's another impact that Google+ can have on the world: raising the bar of what the expectations and standards around something like privacy should be."

Other subjects

A host of additional topics were addressed during the webcast, including:

  • The deep thinking behind the speed of Google+ (5:35)
  • The "noisy stream" problem (8:56)
  • Will aspects of Google+ be open sourced? (12:06)
  • Horowitz (aka "elatable") on his own experiences with pseudonyms (22:13)
  • "Listen to what people say and watch what they do." (29:14)
  • What Google hopes to get out of the Google+ "limited field trial" (33:35)
  • The possibility of auto-generated "implicit" Circles (51:00)

Check out the full conversation in the following video:





August 10 2011

T-Mobile challenges churn with data

For T-Mobile USA, Inc., big data is federated and multi-dimensional. The company has overcome challenges from a disparate IT infrastructure to enable regional marketing campaigns, more advanced churn management, and an integrated single-screen "Quick View" for customer care. Using its data integration architecture, T-Mobile USA can begin to manage "data zones" that are virtualized from the physical storage and network infrastructure.

With 33.63 million customers at the end of the first quarter of 2011 and US$4.63 billion in service revenues that quarter, T-Mobile USA manages a complex data architecture that has been cobbled together through the combination of VoiceStream Wireless (created in 1994), Omnipoint Communications (acquired in 2000) and Powertel (merged with VoiceStream Wireless in 2001 by new parent company Deutsche Telekom AG).

The recently announced AT&T agreement to acquire T-Mobile USA kicked off a regulatory review process that is expected to last approximately 12 months. If completed, the acquisition would create the largest wireless carrier in the United States, with nearly 130 million customers. Until then, AT&T and T-Mobile USA remain separate companies and continue to operate independently.

Information management architecture

As T-Mobile USA awaits the next stage of its corporate history, integration architecture manager Sean Hickey and his colleagues manage data flows across a federated, disparate infrastructure. To enable T-Mobile's more than 33 million U.S. customers to "stick together," as the company says in its marketing tagline, a lot of subscriber and network data has to come together among multiple databases and source systems.

[Figure: T-Mobile Information Management Architecture and Source Systems]

Previously, many IT systems were point-specific, stove-piped and not scalable. Some began as start-up projects and are still running seven or eight years later, long after they stopped meeting a reasonable return on investment (ROI) standard. The staff who knew the original data models and schemas no longer work there.

To integrate data across its disparate federated architecture, T-Mobile USA uses Informatica PowerCenter. (Disclosure: Informatica is a client of my company, Zettaforce.) T-Mobile runs PowerCenter version 8.6.1, is a 9.1 beta customer, and plans to upgrade to version 9.1 in the fourth quarter of this year. Data modeling tools include CA ERwin and Embarcadero ER/Studio. To identify data relationships in its complex IT environment, T-Mobile USA uses Informatica PowerCenter Data Profiling and IBM Exeros Discovery Data Architecture (now part of IBM InfoSphere Discovery).

This data integration layer powers multiple key business drivers, including regional marketing campaigns, churn management and customer care. Longer term projects — such as adoption of self-service BI and automatically provisioned virtual data marts for business analysts — are on hold pending the acquisition.

Virtual data zones

Backed by this data integration layer, the T-Mobile USA architecture team introduced the concept of virtual "data zones". Each data zone comprises a set of data subjects and is tied to one or more business objectives. These zones virtualize data applications away from the physical storage and network infrastructure. From a data architecture perspective, the data zone approach helps pinpoint where there are complex systems to maintain, shadow IT, redundant feeds, differences in data definitions or incompatible data. It also highlights where business rules are embedded all over the place, leading to duplicate or inconsistent rules, rather than managed centrally.
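To make the idea concrete, here is a minimal Python sketch of how a data-zone catalog might be modeled. The zone names, subjects and feeds below are purely illustrative assumptions, not T-Mobile's actual definitions; flagging subjects that flow in through more than one zone is one simple way to surface the redundant feeds and inconsistent definitions described above.

    from dataclasses import dataclass, field
    from collections import Counter

    @dataclass
    class DataZone:
        """A virtual data zone: a set of data subjects tied to business objectives."""
        name: str
        subjects: set = field(default_factory=set)
        objectives: set = field(default_factory=set)
        feeds: list = field(default_factory=list)   # (source_system, subject) pairs

    # Illustrative zones only -- not T-Mobile's actual catalog.
    zones = [
        DataZone("customer", {"subscriber", "billing"}, {"churn management"},
                 feeds=[("Amdocs", "billing"), ("Teradata EDW", "subscriber")]),
        DataZone("network", {"call detail", "dropped calls"}, {"churn management"},
                 feeds=[("cell towers", "call detail"), ("Netezza", "call detail")]),
        DataZone("marketing", {"subscriber", "campaign response"}, {"regional campaigns"},
                 feeds=[("SAS MA", "campaign response"), ("Teradata EDW", "subscriber")]),
    ]

    # Data subjects fed through more than one zone are candidates for
    # redundant feeds or inconsistent definitions.
    subject_counts = Counter(s for z in zones for s in z.subjects)
    print("Subjects appearing in multiple zones:",
          [s for s, n in subject_counts.items() if n > 1])

A real catalog would live in a metadata repository rather than in code, but even a toy model like this makes the zone-to-objective mapping easy to query.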

[Figure: T-Mobile Data Zones]

T-Mobile USA adopted SAP BusinessObjects Strategic Workforce Planning, the first SAP application to use SAP HANA in-memory computing to provide real-time insights and simulation capabilities. According to Sean Hickey, T-Mobile USA has been very pleased so far with pilot tests of the HANA-enabled in-database analytics.

Legacy systems do present constraints on managing specific data subjects. For example, T-Mobile USA would like to archive historical subscriber records more than seven years old, the cut-off for regulatory-required retention. However, because the company's data architecture grew bottom-up, it is difficult to carve out old data; the call date was not necessarily part of the partition key. Given how the data is segmented, T-Mobile USA continues to store subscriber records and other information dating back to 1999.

[Figure: T-Mobile Data Subjects]

Regional marketing campaigns

Each data zone is associated with one or more strategic business objectives. On the marketing side, T-Mobile went through a fairly aggressive U.S. reorganization a couple of years ago to become a more regionally oriented company. It used to run national U.S. marketing campaigns but has moved to a decentralized model that uses geography, demographics and call usage patterns to run cross-sell and upsell campaigns by region, with assistance from third-party marketing partners for outsourced analytics. T-Mobile now has more than 20 regional districts across the United States, each with a local head responsible for sales, marketing and operations in that district.

Northern California VP and GM Rich Garwood added about 30 staff in new regional roles to take over functions previously handled by T-Mobile USA headquarters in Bellevue, Wash., and will for the first time make a concerted effort to market to small business owners in Northern California. "It's exciting for us as employees. We really have local ownership of what the results are," Garwood told the San Francisco Business Times.

[Figure: T-Mobile Business Objectives Associated with Each Data Zone]

SAS Marketing Automation gathers 300 attributes, including campaigns, take rates and dispositions. Before, T-Mobile did national campaigns, with a kind of "shoot and see what sticks" approach. Now, T-Mobile's regions can run targeted campaigns specific to customer demographics and customer segmentation. This requires pulling in more than 20 different sources of data. Deep data mining operations cover billions of rows a day.
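As a rough illustration of the join-and-filter work behind a regional campaign, the pandas sketch below merges two hypothetical source extracts and pulls a target list for one region. The column names, segments and thresholds are my assumptions for the example, not T-Mobile's actual schema.

    import pandas as pd

    # Hypothetical extracts from two of the 20+ source systems.
    subscribers = pd.DataFrame({
        "subscriber_id": [1, 2, 3, 4],
        "region": ["Northern California", "Northern California", "Texas", "Texas"],
        "segment": ["small_business", "consumer", "small_business", "consumer"],
    })
    usage = pd.DataFrame({
        "subscriber_id": [1, 2, 3, 4],
        "avg_minutes": [1200, 300, 900, 150],
        "data_gb": [4.0, 0.5, 2.5, 0.2],
    })

    # Join the sources, then pull a region-specific target list: heavy-usage
    # small-business subscribers in Northern California for an upsell offer.
    merged = subscribers.merge(usage, on="subscriber_id")
    target = merged[
        (merged["region"] == "Northern California")
        & (merged["segment"] == "small_business")
        & (merged["avg_minutes"] > 1000)
    ]
    print(target[["subscriber_id", "segment", "avg_minutes"]])

At T-Mobile's scale this happens inside SAS and the warehouse rather than in pandas, but the shape of the query is the same.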

For analytics reporting, T-Mobile USA uses SAP Business Objects including Crystal Reports. Finance and accounting department staff still tend to download data into Excel spreadsheets. As part of the company's data security enforcement, every employee and contractor is required to use a T-Mobile-supplied computer with hard drive encryption. Power users can access the Teradata system directly with Teradata SQL for data mining.

Churn management

T-Mobile USA has begun using a "tribe" calling circle model — with multi-graphs akin to social network analysis — to predict propensity of churns and mitigate the potential impact of "tribe leaders" who have high influence in large, well-connected groups of fellow subscribers. An influential tribe "leader" who switches to a competitor's service can kick off "contagious churn," where that leader's friends, family or co-workers also switch.

In the past, wireless service providers calculated net present value (NPV) by estimating a subscriber's lifetime spend on services and products. Now, part of the NPV calculation measures the level of influence and size of a subscriber's tribe.
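A minimal sketch of that idea is shown below, using the networkx library with degree centrality as a stand-in for "tribe leader" influence. The subscribers, spend figures and influence weighting are hypothetical; T-Mobile's actual model is certainly more sophisticated.

    import networkx as nx

    # Call graph: an edge means two subscribers call each other regularly.
    calls = [("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
             ("bob", "carol"), ("eve", "frank")]
    G = nx.Graph(calls)

    # Lifetime-spend estimates (illustrative numbers only).
    base_npv = {"alice": 900, "bob": 600, "carol": 550, "dave": 500,
                "eve": 450, "frank": 400}

    # Degree centrality as a simple proxy for influence: well-connected
    # subscribers score higher.
    influence = nx.degree_centrality(G)

    def adjusted_npv(sub, influence_weight=0.5):
        """Lifetime spend plus a share of the tribe the subscriber could take along."""
        tribe_value = sum(base_npv[n] for n in G.neighbors(sub))
        return base_npv[sub] + influence_weight * influence[sub] * tribe_value

    for sub in sorted(base_npv, key=adjusted_npv, reverse=True):
        print(f"{sub:6s} influence={influence[sub]:.2f} "
              f"adjusted_npv={adjusted_npv(sub):,.0f}")

Even this toy version ranks the hub of a calling circle well above subscribers with identical billing history, which is the point of folding tribe influence into NPV.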

As noted by Ken King, marketing manager for the communications industry at SAS: "In North America, we increasingly work with service providers that are keen to examine not only segmentation, churn and customer lifetime value but new things like social networking impact on their brands, or the relationships between customers so they can recognize group leaders and their influence on others in terms of buying products or switching to competitors."

Churn management at T-Mobile USA begins with an Amdocs subscriber billing system and financial data stored in a Teradata enterprise data warehouse (EDW). "The heart of the company is the billing system," said Sean Hickey.

However, some key data for churn management is not captured in the billing system. Non-billable events can be very important for marketing. Raw call data gathered from cell towers and switches, supplied by Ericsson and other system vendors, can show the number of dropped calls for each subscriber and the percent of a subscriber's total calls that drop. T-Mobile USA loads call data into IBM Netezza systems from a series of flat files.

T-Mobile engineering uses this data for drop call analysis. They can look at drop calls for specific phone numbers. For example, if a T-Mobile customer moves to a new home in a location where cell towers provide only limited coverage, T-Mobile marketing can proactively offer the subscriber a new cell phone that could improve reception, or a free femtocell that connects to the subscriber's home broadband network. Customer demographic data, however, is not stored in the Netezza systems — that's stored by T-Mobile IT in its Teradata enterprise data warehouse (TED).
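The per-subscriber drop-rate calculation itself is straightforward once the call records are loaded. Here is a small pandas sketch with a made-up extract and an arbitrary 30% threshold for flagging candidates for a proactive handset or femtocell offer; the real flat files from the switch vendors will have a different layout.

    import pandas as pd

    # Illustrative call-detail extract (one row per call).
    calls = pd.DataFrame({
        "subscriber_id": [101, 101, 101, 102, 102, 103],
        "dropped":       [True, False, True, False, False, True],
    })

    # Per-subscriber totals and drop rate.
    per_sub = calls.groupby("subscriber_id")["dropped"].agg(
        total_calls="count", dropped_calls="sum"
    )
    per_sub["drop_rate"] = per_sub["dropped_calls"] / per_sub["total_calls"]

    # Subscribers above a (hypothetical) 30% drop-rate threshold become
    # candidates for a proactive coverage offer.
    candidates = per_sub[per_sub["drop_rate"] > 0.3]
    print(candidates)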

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD

The Teradata EDW sends out extracts to T-Mobile USA's SQL Server and Oracle servers. The Teradata, Oracle and Microsoft SQL Server databases are fed by dozens of source systems, including Siebel, the Billing Portal, Epiphany, Sales Operations and Cash Applications. "Shadow IT" data warehouses include revenue assurance, cost benefits analysis (CBA), business operations and sales credits.

The T-Mobile USA information management team has targeted multiple data marts and shadow IT warehouses to incorporate into the Teradata enterprise data warehouse, pending funded projects to add those to the EDW. In this respect, T-Mobile USA is similar to many other Fortune 500 organizations, which balance an EDW vision with the constraints of budgeting, legacy systems and acquisition integration, and therefore manage a hybrid information management architecture combining an EDW and data federation.

Data delivery speeds vary widely. It takes a day for information to be batch loaded from retail stores and web sales, and analysis used to take another day on top of that. The combination of Informatica PowerCenter and SAP BusinessObjects Explorer now enables the T-Mobile USA channel management team to run reports within seconds rather than an hour or a day. "It's a pretty cool platform," said Hickey. Future steps may target speeding up data acquisition.

T-Mobile USA continues to innovate for churn management. To better identify the multi-faceted reasons behind customer turnover, T-Mobile USA ran a proof of concept (PoC) with EMC Greenplum, with a storage capacity of roughly 1 petabyte, including data from cell towers, call records, clickstreams and social networks. Following the PoC, T-Mobile USA decided to work with an outsourced service provider, which uses Apache Hadoop to store and process multi-dimensional data. Sentiment analysis predicts triggers and indicators of what customer actions are going to be, which helps T-Mobile proactively respond.

Informatica's newly announced PowerCenter version 9.1 includes connectivity for Hadoop Distributed File System (HDFS), to load or extract data, as explained by Informatica solution evangelist Julianna DeLua. Customers can use Informatica data quality and other transformation tools either pre- or post-writing the data into HDFS.
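Informatica's own connectors aren't shown here, but the general pre-write pattern looks something like the sketch below: cleanse a local extract (pandas stands in for the data quality tooling), then push the file into HDFS with the stock hdfs dfs -put command. The file names and target path are hypothetical, and the sketch assumes a Hadoop client already configured for the cluster.

    import subprocess
    import pandas as pd

    # Hypothetical pre-load cleansing: drop rows missing a subscriber id and
    # strip non-digits from a phone-number column before the file lands in HDFS.
    df = pd.read_csv("call_events.csv")          # illustrative local extract
    df = df.dropna(subset=["subscriber_id"])
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
    df.to_csv("call_events_clean.csv", index=False)

    # Standard Hadoop CLI upload; -f overwrites any existing file.
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "call_events_clean.csv",
         "/data/landing/call_events/"],
        check=True,
    )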

Single-screen Quick View for customer care

Backed by this data integration architecture, T-Mobile USA just rolled out Quick View as part of an upgrade of its customer care system. With Quick View, agents and retail store associates can view multiple key indicators including the customer segmentation value on one screen. Before, call center agents and retail store associates had to look at multiple screens, which is problematic while talking live with a customer.

Quick View pops up with offers specific to that customer, such as a new phone or a new service plan. Subscribers with a high value may be sent automatically to care agents specially trained on handling high-value customers. T-Mobile USA plans to extend Quick View to third-party retailer partners such as Best Buy that sell T-Mobile phones and services in their retail stores.
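A routing rule of that sort can be expressed very simply. The sketch below is an illustrative stand-in, with a made-up segmentation scale, queue names and offer logic rather than T-Mobile's actual rules.

    # Hypothetical routing rule for inbound care calls: the segmentation value
    # shown on the Quick View screen decides which agent queue gets the call.
    HIGH_VALUE_THRESHOLD = 80  # illustrative score, not T-Mobile's actual scale

    def route_call(customer):
        """Return the agent queue and any proactive offer to surface on screen."""
        queue = ("premium_care" if customer["segment_value"] >= HIGH_VALUE_THRESHOLD
                 else "standard_care")
        offer = None
        if customer.get("drop_rate", 0) > 0.3:
            offer = "femtocell"          # poor coverage at home
        elif customer.get("handset_age_months", 0) > 24:
            offer = "handset_upgrade"
        return queue, offer

    print(route_call({"segment_value": 92, "handset_age_months": 30}))
    # -> ('premium_care', 'handset_upgrade')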

More integration

In addition to empowering innovations in regional marketing campaigns, churn management and customer care, data integration will take on even more significance if the AT&T acquisition of T-Mobile USA is approved next year. An approved acquisition would kick off a host of new integration initiatives between the two companies.





June 14 2011

3 ideas you should steal from HubSpot

I've been following Dharmesh Shah's OnStartups blog for years and I remember when he announced HubSpot, the company he was starting. I've been fascinated to watch it grow and grow, so I was excited when I got to visit their offices a few months ago. Just after my visit they closed a Series D funding round for $32 million from Sequoia, Google Ventures and Salesforce.com, but despite its success almost nobody in the technology world has heard of HubSpot. I blame the combination of a location in Boston and a mainstream customer base of small business owners for the lack of recognition. It's a shame because there's a lot to learn from their technology and process — they've solved some hard problems in thought-provoking ways.

People are fascinated by mirrors

There's a good chance you've used their Twitter Grader tool, and its popularity shows one of the secrets to HubSpot's success. The inspiration for the company came when Dharmesh realized that his own blog was driving a lot of traffic, and the startups he was helping out were all struggling to get anywhere near the same number of visitors. He built HubSpot by applying what he'd learned from blogging, and one of the key lessons was that people crave new information about their own lives and projects. If you can create a service that gives people interesting data about themselves and their organizations, they'll spend time exploring it and they'll share it with their friends.

With Twitter Grader, Dharmesh didn't just create a source of free advertising for his company; the tool is also implicitly targeted at people who want to improve their presence on the social network. Many of those people are the small business owners in his target market. Even better, by offering the statistics as a gift to users, he created a small sense of reciprocal obligation that makes them more likely to purchase his services. The approach started with the original Website Grader service, but it proved so powerful that HubSpot now has a whole range of similar free tools for analyzing everything from your Facebook page to your blog.

The lesson for me is that giving people data and visualizations about things they truly care about can be a powerful tool for drawing them into your service. Do some creative thinking about your customers' problems, and see if there's something you can offer them as a reward for their attention.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

You should kill unicorns and rainbows with science

One of the most enjoyable conversations I had at HubSpot was with Dan Zarrella, who describes himself as a social media scientist. I can already hear some physics PhDs grinding their teeth, but Dan has earned that title by applying a lot of much-needed rigor to the fluffy world of social media measurements. He's crusading against "unicorns and rainbows" metrics that have no connection to the goals you want to achieve. Many businesses have focused on building up easy-to-measure numbers like fan or follower counts, but to use Eric Ries' term, those are just vanity metrics. You can gain a million friends without it leading to a penny in revenue.

Dan's antidote is the relentless application of logic and analysis, working backwards from the business goals to evaluate everything you're doing as objectively as possible. A fantastic example of this is his study looking at how minor content details, like punctuation, make a retweet more or less likely. It's possible to argue with particular conclusions he draws, but he's transparently laid out the methods by which he arrived at them. Anybody with some technical knowledge and access to a decent chunk of Twitter data can try to reproduce and refine his results. This makes the report so much more useful than the opinions or impressions that dominate most discussions of social media, since we can actually have an evidence-based argument about it.
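The basic move in that kind of study is easy to reproduce at small scale. The sketch below compares mean retweet counts for tweets with and without a single punctuation feature; the data is made up, and a serious replication would need a large corpus, a significance test and controls for follower count.

    import pandas as pd

    # Tiny made-up sample; a real replication would use a large tweet corpus.
    tweets = pd.DataFrame({
        "text": ["Big news!", "new blog post", "Read this!", "quiet thoughts",
                 "We shipped!", "some link"],
        "retweets": [12, 2, 9, 1, 15, 3],
    })

    # Does the presence of an exclamation mark move the average retweet count?
    tweets["has_exclamation"] = tweets["text"].str.contains("!", regex=False)
    summary = tweets.groupby("has_exclamation")["retweets"].agg(["count", "mean"])
    print(summary)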

I came away from talking with Dan with a new appreciation of how powerful the scientific method can be in even the most unlikely situations. I'll be taking a fresh look at some of the painful problems my projects are hitting, and seeing if there's some way I can gather the right data to gain insights, even if they seem hopelessly qualitative at first glance.

User education is painful but powerful

HubSpot focuses on the sort of people who used to buy ads in the Yellow Pages to promote their businesses. These people know they now need to use the Internet to reach customers, but they aren't sure how. To succeed, HubSpot has to help those people build useful websites and channels. Templates and other automated tools help, but a lot still comes down to people creating the right content for their own businesses and responding appropriately when customers get in touch through Twitter, Facebook or email. The only way to achieve that is to teach people how to do it, and so a lot of the company's resources are put into education.

On a simple level, tools like HubSpot's graders offer simple suggestions for improving websites and other content. Users of the service are sent regular emails that remind them of steps and actions they need to take, such as updating their blogs. HubSpot hosts a popular video cast that covers all sorts of tips and horror stories from the last week in social media. All of these efforts really seem to help the company, judging from how enthusiastically users respond to all the material. On a deeper level, it also seems to help build a long-term relationship between the company and its customers, driving real loyalty.

One of the unwritten rules of the consumer technology world is that anything that requires educating users is a losing proposition. Anybody who has looked at their customer acquisition funnel knows how even minor usability problems can drive away vast swaths of people. What's different about HubSpot is that their customers are a lot more motivated than your average consumer on the web. They're using the service in the hope of actually making more money, so they're willing to invest some time. It left me wondering if I should spend more time creating training material for my own projects, rather than always prioritizing interface work to make them easier to use. The people who use them to create content are already investing their own time, so is that perhaps another situation where education would pay off?


HubSpot is a smart, practical company that's very focused on using the data it gathers to understand what its customers really need. Maybe that's precisely because the team isn't in the Valley, where every shiny new idea is a distraction? Whatever the cause, I'm grateful that they spent the time to show me what they'd learned, and I'm looking forward to applying these ideas to my own work.




