May 17 2012

Strata Week: Google unveils its Knowledge Graph

Here's what caught my attention in the data space this week.

Google's Knowledge Graph

"Google does the semantic Web," says O'Reilly's Edd Dumbill, "except they call it the Knowledge Graph." That Knowledge Graph is part of an update to search that Google unveiled this week.

"We've always believed that the perfect search engine should understand exactly what you mean and give you back exactly what you want," writes Amit Singhal, Senior VP of Engineering, in the company's official blog post.

That post makes no mention of the semantic web, although as ReadWriteWeb's Jon Mitchell notes, the Knowledge Graph certainly relies on it, following on and developing from Google's acquisition of the semantic database Freebase in 2010.

Mitchell describes the enhanced search features:

"Most of Google users' queries are ambiguous. In the old Google, when you searched for "kings," Google didn't know whether you meant actual monarchs, the hockey team, the basketball team or the TV series, so it did its best to show you web results for all of them.

"In the new Google, with the Knowledge Graph online, a new box will come up. You'll still get the Google results you're used to, including the box scores for the team Google thinks you're looking for, but on the right side, a box called "See results about" will show brief descriptions for the Los Angeles Kings, the Sacramento Kings, and the TV series, Kings. If you need to clarify, click the one you're looking for, and Google will refine your search query for you."

Yahoo's fumbles

The news from Yahoo hasn't been good for a long time now, with the most recent troubles involving the departure of newly appointed CEO Scott Thompson over the weekend and a scathing blog post this week by Gizmodo's Mathew Honan titled "How Yahoo Killed Flickr and Lost the Internet." Ouch.

Over on GigaOm, Derrick Harris wonders if Yahoo "sowed the seeds of its own demise with Hadoop." While Hadoop has long been pointed to as a shining innovation from Yahoo, Harris argues that:

"The big problem for Yahoo is that, increasingly, users and advertisers want to be everywhere on the web but at Yahoo. Maybe that's because everyone else that's benefiting from Hadoop, either directly or indirectly, is able to provide a better experience for consumers and advertisers alike."

De-funding data gathering

The appropriations bill that recently passed the U.S. House of Representatives axes funding for the Economic Census and the American Community Survey. The former gathers data about 25 million businesses and 1,100 industries in the U.S., while the latter collects data from three million American households every year.

Census Bureau director Robert Groves writes that the bill "devastates the nation's statistical information about the status of the economy and the larger society." BusinessWeek chimes in that the end to these surveys "blinds business," noting that businesses rely "heavily on it to do such things as decide where to build new stores, hire new employees, and get valuable insights on consumer spending habits."

Got data news to share?

Feel free to email me.

OSCON 2012 — Join the world's open source pioneers, builders, and innovators July 16-20 in Portland, Oregon. Learn about open development, challenge your assumptions, and fire up your brain.

Save 20% on registration with the code RADAR


June 17 2011

Radar's top stories: June 13-17, 2011

Here's a look at the top stories published on Radar this week.

Big data and the semantic web

Big data is poised to light the fire beneath the long-held dreams of the semantic web, and the semantic web will enable data scientists to describe, organize and reason about their results.

Choosing the right license for open data
OpenStreetMap founder Steve Coast explains the long and tricky shift from a Creative Commons license to the more data-friendly Open Database License.
3 ideas you should steal from HubSpot
HubSpot's location (near Boston) and its target market (small businesses) may keep it under the radar of Silicon Valley, but the company's approach to data products and customer empowerment is worthy of attention.
The blurring line between speech and text
A third category of speech has emerged: Internet-based updates that marry the ephemeral nature of the spoken word and the archival permanence of text.
A fresh look at your business connections
With its InMaps tool, LinkedIn is using social data and visualizations to reveal connections, clusters and outliers. LinkedIn senior data scientist Ali Imam discusses the nuts and bolts of InMaps in this interview.

OSCON Java 2011, being held July 25-27 in Portland, Ore., is focused on open source technologies that make up the Java ecosystem. Save 20% on registration with the code OS11RAD

June 14 2011

Big data and the semantic web

On Quora, Gerald McCollum asked if big data and the semantic web were indifferent to each other, as there was little discussion of the semantic web topic at Strata this February.

My answer in brief is: big data's going to give the semantic web the massive amounts of metadata it needs to really get traction.

As the chair of the Strata conference, I see a vital link between big data and the semantic web, and have my own roots in the semantic web world. Earlier this year, however, the interaction was not yet of sufficient utility to make a strong connection in the conference agenda.

Google and the semantic web

A good example of the development of the relationship between big data and the semantic web is Google. Early on, Google search eschewed explicit use of semantics, preferring to infer a variety of signals in order to generate results. They used big data to create signals such as PageRank.

Now, as the search algorithms mature, Google's mission is to make their results ever more useful to users. To achieve this, their software must start to understand more about the actual world. Who's an author? What's a recipe? What do my friends find useful? So the connections between entities become more important. To achieve this, Google are using data from initiatives such as RDFa and microformats.

Google do not use these semantic web techniques to replace their search, but rather to augment it and make it more useful. To get all fancypants about it: Google are starting to promote the information they gather towards being knowledge. They even renamed their search group as "Knowledge".

Metadata is hard: big data can help

Conventionally, semantic web systems generate metadata and identified entities explicitly, i.e., by hand or as the output of database values. But as anybody who's tried to get users to do it will tell you, generating metadata is hard. This is part of why the full semantic web dream isn't yet realized. Analytical approaches work differently: they surface and classify the metadata from analysis of the actual content and data itself. (Freely exposing metadata is also controversial and risky, as open data advocates will attest.)

Once big data techniques have been successfully applied, you have identified entities and the connections between them. If you want to join that information up to the rest of the web, or to concepts outside of your system, you need a language in which to do that. You need to organize, exchange and reason about those entities. It's this framework that has been steadily built up over the last 15 years with the semantic web project.

To give an already widespread example: many data scientists use Wikipedia to help with entity resolution and disambiguation, using Wikipedia URLs to identify entities. This is a classic use of the most fundamental of semantic web technologies: the URI.
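As a rough illustration of the Wikipedia-URL technique just described, here is a minimal sketch in Python. The alias table and the function names `resolve` and `same_entity` are assumptions made for this example, not part of any tool mentioned in the post; a real system would build the table from a much larger source.

```python
# Hypothetical alias table: surface forms mapped to Wikipedia URLs.
# The entries are illustrative only; a production system might derive
# them from Wikipedia redirects or link anchor text.
ALIASES = {
    "big apple": "https://en.wikipedia.org/wiki/New_York_City",
    "new york city": "https://en.wikipedia.org/wiki/New_York_City",
    "nyc": "https://en.wikipedia.org/wiki/New_York_City",
    "sacramento kings": "https://en.wikipedia.org/wiki/Sacramento_Kings",
    "los angeles kings": "https://en.wikipedia.org/wiki/Los_Angeles_Kings",
}


def resolve(mention):
    """Return the Wikipedia URL used to identify a mention, or None."""
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(mention.lower().split())
    return ALIASES.get(key)


def same_entity(a, b):
    """Two mentions match when both resolve to the same URL."""
    url_a, url_b = resolve(a), resolve(b)
    return url_a is not None and url_a == url_b
```

The point of the design is that the URL, not the text, is the identifier: "NYC" and "Big Apple" compare equal because they share a URI, while two different teams called "Kings" do not.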

For Strata, as our New York series of conferences approaches, we will be starting to include a little more semantic web, but with a strict emphasis on utility.

Strata itself is not so much beholden to big data as it is about being data-driven, and the ongoing consequences that has for technology, business and society.

March 01 2011

November 15 2010

Where the semantic web stumbled, linked data will succeed

In the same way that the Holy Roman Empire was neither holy nor Roman, Facebook's OpenGraph Protocol is neither open nor a protocol. It is, however, an extremely straightforward and applicable standard for document metadata. From a strictly semantic viewpoint, OpenGraph is considered hardly worthy of comment: it is a frankenstandard, a mishmash of microformats and loosely-typed entities, lobbed casually into the semantic web world with hardly a backward glance.

But this is not important. While OpenGraph avoids, or outright ignores, many of the problematic issues surrounding semantic annotation (see Alex Iskold's excellent commentary on OpenGraph here on Radar), criticism focusing only on its technical purity is missing half of the equation. Facebook gets it right where other initiatives have failed. While OpenGraph is incomplete and imperfect, it is immediately usable and sympathetic with extant approaches. Most importantly, OpenGraph is one component in a wider ecosystem. Its deployment benefits are apparent to the consumer and the developer: add the metatags, get the "likes," know your customers.

Such consumer causality is critical to the adoption of any semantic mark-up. We've seen it before with microformats, whose eventual popularity was driven by their ability to improve how a page is represented in search engine listings, and not by an abstract desire to structure the unstructured. Successful adoption will often entail sacrificing standardization and semantic purity for pragmatic ease-of-use; this is where the semantic web appears to have stumbled, and where linked data will most likely succeed.

Linked data intends to make the Web more interconnected and data-oriented. Beyond this outcome, the term is less rigidly defined. I would argue that linked data is more of an ethos than a standard, focused on providing context, assisting in disambiguation, and increasing serendipity within the user experience. This idea of linked data can be delivered by a number of existing components that work together on the data, platform, and application levels:

  • Entity provision: Defining the who, what, where and when of the Internet, entities encapsulate meaning and provide context by type. In its most basic sense, an entity is one row in a list of things organized by type -- such as people, places, or products -- each with a unique identifier. Organizations that realize the benefits of linked data are releasing entities like never before, including the publication of 10,000 subject headings by the New York Times, admin regions and postcodes from the UK's Ordnance Survey, placenames from Yahoo GeoPlanet, and the data infrastructures being created by Factual [disclosure: I've just signed on with Factual].
  • Entity annotation: There are numerous formats for annotating entities when they exist in unstructured content, such as a web page or blog post. Facebook's OpenGraph is a form of entity annotation, as are HTML5 microdata, RDFa, and microformats such as hcard. Microdata is the shiny, new player in the game, but see Evan Prodromou's great post on RDFa v. microformats for a breakdown of these two more established approaches.
  • Endpoints and Introspection: Entities contribute best to a linked data ecosystem when each is associated with a Uniform Resource Identifier (URI), an Internet-accessible, machine-readable endpoint. These endpoints should provide introspection, the means to obtain the properties of that entity, including its relationship to others. For example, the Ordnance Survey publishes a URI for the "City of Southampton," and its properties can be retrieved in machine-readable format (RDF/XML, Turtle and JSON) by appending an "rdf," "ttl," or "json" extension to that URI. To be properly open, URIs must be accessible outside a formal API and authentication mechanism, exposed to semantically-aware web crawlers and search tools such as Yahoo BOSS. Under this definition, local business URLs, for example, can serve in part as URIs -- 'view source' to see the semi-structured data in these listings from Yelp (using hcard and OpenGraph), and Foursquare (using microdata and OpenGraph).
  • Entity extraction: Some linked data enthusiasts long for the day when all content is annotated so that it can be understood equally well by machines and humans. Until we get to that happy place, we will continue to rely on entity extraction technologies that parse unstructured content for recognizable entities, and make contextually intelligent identifications of their type and identifier. Named entity recognition (NER) is one approach that employs the above entity lists, which may also be combined with heuristic approaches designed to recognize entities that lie outside of a known entity list. Yahoo, Google and Microsoft are all hugely interested in this area, and we'll see an increasing number of startups like Semantinet emerge with ever-improving precision and recall. If you want to see how entity extraction works first-hand, check out Reuters-owned Open Calais and experiment with their form-based tool.
  • Entity concordance and crosswalking: The multitude of place namespaces illustrates how a single entity, such as a local business, will reside in multiple lists. Because the "unique" (U) in a URI is unique only to a given namespace, a world driven by linked data requires systems that explicitly match a single entity across namespaces. Examples of crosswalking services include: Placecast's Match API, which returns the Placecast IDs of any place when supplied with an hcard equivalent; Yahoo's Concordance, which returns the Where on Earth Identifier (WOEID) of a place using as input the place ID of one of fourteen external resources, including OpenStreetMap and Geonames; and the Guardian Content API, which allows users to search Guardian content using non-Guardian identifiers. These systems are the unsung heroes of the linked data world, facilitating interoperability by establishing links between identical entities across namespaces. Huge, unrealized value exists within these applications, and we need more of them.
  • Relationships: Entities are only part of the story. The real power of the semantic web is realized in knowing how entities of different types relate to each other: actors to movies, employees to companies, politicians to donors, restaurants to neighborhoods, or brands to stores. The power of all graphs -- these networks of entities -- is not in the entities themselves (the nodes), but how they relate together (the edges). However, I may be alone in believing that we need to nail the problem of multiple instances of the same entity, via concordance and crosswalking, before we can tap properly into the rich vein that entity relationships offer.
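The concordance and crosswalking idea in the list above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `Concordance` class, its method names, and every namespace and identifier used with it are hypothetical, and it does not reflect how Placecast, Yahoo or the Guardian implement their services.

```python
from collections import defaultdict


class Concordance:
    """Toy crosswalk table: groups identifiers that name the same entity.

    Identifiers are (namespace, id) tuples. For simplicity, each cluster
    keeps at most one id per namespace (an assumption of this sketch).
    """

    def __init__(self):
        self._cluster_of = {}               # (namespace, id) -> cluster number
        self._members = defaultdict(dict)   # cluster number -> {namespace: id}
        self._next = 0

    def link(self, a, b):
        """Record that identifiers a and b refer to the same entity."""
        ca, cb = self._cluster_of.get(a), self._cluster_of.get(b)
        if ca is None and cb is None:
            # Neither identifier is known yet: start a new cluster.
            ca = self._next
            self._next += 1
        elif ca is None:
            # Only b is known: a joins b's cluster.
            ca = cb
        elif cb is not None and cb != ca:
            # Both are known but in different clusters: merge b's into a's.
            for ns, ident in self._members.pop(cb).items():
                self._members[ca][ns] = ident
                self._cluster_of[(ns, ident)] = ca
        for ns, ident in (a, b):
            self._cluster_of[(ns, ident)] = ca
            self._members[ca][ns] = ident

    def crosswalk(self, identifier, target_namespace):
        """Return the matching id in target_namespace, or None."""
        cluster = self._cluster_of.get(identifier)
        if cluster is None:
            return None
        return self._members[cluster].get(target_namespace)
```

A usage example with made-up identifiers:

```python
table = Concordance()
table.link(("yelp", "joes-cafe"), ("foursquare", "4b1234"))
table.link(("placecast", "pc-998"), ("foursquare", "4b1234"))
print(table.crosswalk(("yelp", "joes-cafe"), "placecast"))
```

The hard part in practice is deciding that two records are the same place at all; this sketch only covers storing and querying the matches once that decision has been made.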

The approaches outlined above combine to help publishers and application developers provide intelligent, deep and serendipitous consumer experiences. Examples include the semantic handset from Aro Mobile, the BBC's World Cup experience, and aggregating references on your Facebook news feed.

Linked data will triumph in this space because efforts to date focus less on the how and more on the why. RDF, SPARQL, OWL, and triple stores are onerous. URIs, micro-formats, RDFa, and JSON, less so. Why invest in difficult technologies if consumer outcomes can be realized with extant tools and knowledge? We have the means to realize linked data now -- the pieces of the puzzle are there and we (just) need to put them together.

Linked data is, at last, bringing the discussion around to the user. The consumer "end" trumps the semantic "means."


November 05 2010

Four short links: 5 November 2010

  1. S4 -- S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Open-sourced (Apache license) by Yahoo!.
  2. RDF and Semantic Web: Can We Reach Escape Velocity? (PDF) -- spot-on presentation from the linked data advisor. It nails, clearly and in only 12 slides, why there's still resistance to linked data uptake and what should happen to change this. Amen! (via Simon St Laurent)
  3. Pew Internet Report on Location-based Services -- 10% of online Hispanics use these services - significantly more than online whites (3%) or online blacks (5%).
  4. Slate -- Python library for extracting text from PDFs easily.

August 18 2010

Linked data is opening 800 years of UK legal info

Earlier this month, the National Archives of the United Kingdom launched a website to provide public access to a primary source of legal information for citizens. The site covers more than 800 years of legal history in England, Scotland, Wales and Northern Ireland.

When I heard about the new site, I dialed up John Sheridan (@johnlsheridan), head of e-services and strategy at the UK's National Archives, to talk about its implications for open government. Our conversation is embedded as a podcast in this post.

For more on the technologies, inspirations and aspirations for the project, read Sheridan's comprehensive post at the Cornell Law School.

As Sheridan writes there:

We aimed to make a source of open data from the outset.  The importance of open legal data is made powerfully by people like Carl Malamud and the Law.Gov campaign. Our desire to make the statute book available as open data motivated a number of technology choices we made. For example, the website is built on top of an open Application Programming Interface (API). The same API is available for others to use to access the raw data.

For more on the growing semantic web of data, watch the video of Tim Berners-Lee embedded below:


Open data is on the agenda for the Gov 2.0 Summit, being held September 7-8 in Washington D.C. Request an invitation.

August 04 2010

Four short links: 4 August 2010

  1. FuXi -- Python-based, bi-directional logical reasoning system for the semantic web from the folks at the Open Knowledge Foundation. (via About Inferencing)
  2. Harness the Power of Being an Internet -- I learn by trying to build something, there's no other way I can discover the devils-in-the-details. Unfortunately that's an incredibly inefficient way to gain knowledge. I basically wander around stepping on every rake in the grass, while the A Students memorize someone else's route and carefully pick their way across the lawn without incident. My only saving graces are that every now and again I discover a better path, and faced with a completely new lawn I have an instinct for where the rakes are.
  3. Stack Overflow's Curated Folksonomy -- community-driven tag synonym system to reduce the chaos of different names for the same thing. (via Skud)
  4. Image Deblurring using Inertial Measurement Sensors (Microsoft Research) -- using Arduino to correct motion blur. (via Jon Oxer)

August 03 2010

Four short links: 3 August 2010

  1. OpenStructs -- an education and distribution site dedicated to open source software for converting, managing, viewing and manipulating structured data.
  2. TinkerPop -- many (often open source) tools for graph data.
  3. Polaroid a Day -- a moving human story told in photographs.
  4. Prizes (PDF) -- White House memorandum to government agencies explaining how prizes are to be used. The first part, the why and how of contests and prizes, is something to add to your "here, read this" arsenal.

July 21 2010

Four short links: 21 July 2010

  1. The Men Who Stare at Screens (NY Times) -- What was unexpected was that many of the men who sat long hours and developed heart problems also exercised. Quite a few of them said they did so regularly and led active lifestyles. The men worked out, then sat in cars and in front of televisions for hours, and their risk of heart disease soared, despite the exercise. Their workouts did not counteract the ill effects of sitting. (via Andy Baio)
  2. Caring with Cash -- describes a study where "pay however much you want" had high response rate but low average price, "half goes to charity" barely changed from the control (fixed price) response rate, but "half goes to charity and you can pay what you like" earned more money than either strategy.
  3. Behavioural Economics a Political Placebo? (NY Times) -- As policymakers use it to devise programs, it’s becoming clear that behavioral economics is being asked to solve problems it wasn’t meant to address. Indeed, it seems in some cases that behavioral economics is being used as a political expedient, allowing policymakers to avoid painful but more effective solutions rooted in traditional economics. (via Mind Hacks)
  4. Protege -- open source ontology editor and knowledge-base framework.

May 25 2010

Facebook Open Graph: A new take on semantic web

A few weeks ago, Facebook announced an Open Graph initiative -- a move considered to be a turning point not just for the social networking giant, but for the web at large. The company's new vision is no longer to just connect people. Facebook now wants to connect people around and across the web through concepts they are interested in.

This vision of the web isn't really new. Its origins go back to the person who invented the web, Sir Tim Berners-Lee. This vision has been passionately shared and debated by the tech community over the last decade. What Facebook has announced as Open Graph has been envisioned by many as the semantic web.

The web of people and things

At the heart of this vision is the idea that different web pages contain the same objects. Whether someone is reading about a book on Barnes and Noble, on O'Reilly or on a book review blog doesn't matter. What matters is that the reader is interested in this particular book. And so it makes sense to connect her to friends and other readers who are interested in the same book -- regardless of when and where they encountered it.

The same is true about many everyday entities that we find on the web -- movies, albums, stars, restaurants, wine, musicians, events, articles, politicians, etc. -- the same entity is referenced in many different pages. Our brains draw the connections instantly and effortlessly, but computers can't deduce that an "Avatar" review on one site is talking about the same movie described on a page on another.

The reason it is important for things to be linked is so that people can be connected around their interests and not around websites they visit. It does not matter to me where my friends are reading about "Avatar", what matters is which of my friends liked the movie and what they had to say. Without interlinking objects across different sites, the global taste graph is too sparse and uninteresting. By re-imagining the web as the graph of things we are interested in, a new dimension, a new set of connections gets unlocked -- everything and everyone connects in a whole new way.

A brief history of semantic markups

The problem of building the web of people and things boils down to describing what is on the page and linking it to other pages. In Tim Berners-Lee's original vision, the entities and relationships between them would be described using RDF. This mathematical language was designed to capture the essence of objects and relationships in a precise way. While it's true that RDF annotation would be the most complete, it also turns out to be quite complicated.

It is this complexity that the community has attempted to address over the years. A simpler approach called Microformats was developed by Tantek Celik, Chris Messina and others. Unlike RDF, Microformats rely on existing XHTML standards and leverage CSS classes to mark up the content. Critically, Microformats don't add any additional information to the page, but just annotate the data that is already on the page.

Microformats enjoyed support and wider adoption because of their relative simplicity and focus on marking up the existing content. But there are still issues. First, the number of supported entities is limited. The focus has been on marking up organizations, people and events, and then reviews, but there is no way to mark up, for example, a movie or a book or a song. Second, Microformats are somewhat cryptic and hard to read. There is cleverness involved in figuring out how to do the markup, which isn't necessarily a good thing.

In 2005, inspired by Microformats, Ian Davis, now CTO of Talis, developed eRDF -- a syntax within HTML for expressing a simplified version of RDF. His approach married the canonical concepts of RDF and the idea from Microformats that the data is already on the page. An iteration of Ian's work, called RDFa, has been adopted as a W3C standard. All the signs point in the direction of RDFa being the solution of choice for describing entities inside HTML pages.

Until recently, despite the progress in the markups, adoption was hindered by the fact that publishers lacked the incentive to annotate the pages. What is the point if there are no applications that can take advantage of it? Luckily, in 2009 both Yahoo and Google put their muscle behind marking up pages.

First Yahoo developed an elegant search application called Search Monkey. This app encouraged and enabled sites to take control over how Yahoo's search engine presented the results. The solution was based on both markup on the page and a developer plugin, which gave the publishers control over presenting the results to the user. Later, Google announced rich snippets. This supported both Microformats and RDFa markup and enabled webmasters to control how their search results are presented.

Still missing from all this work was a simple common vocabulary for describing everyday things. In 2008-2009, with help from Peter Mika from Yahoo research, I developed a markup called abmeta. This extensible, RDFa-based markup provided a vocabulary for describing everyday entities like movies, albums, books, restaurants, wines, etc. Designed with simplicity in mind, abmeta supports declaring single and multiple entities on the page, using both meta headers and also using RDFa markup inside the page.

Facebook Open Graph protocol

The markup announced by Facebook can be thought of as a subset of abmeta because it supports the declaration of entities using meta tags. The great thing about this format is simplicity. It is literally readable in English.

The markup defines several essential attributes -- type, title, URL, image and description. The protocol comes with a reasonably rich taxonomy of types, supporting entertainment, news, location, articles and general web pages. Facebook hopes that publishers will use the protocol to describe the entities on pages. When users press the LIKE button, Facebook will get not just a link, but a specific object of the specific type.
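To make the attribute list concrete, here is a minimal sketch of how a consumer might read those five attributes out of a page. It assumes the published Open Graph convention of `<meta property="og:..." content="...">` tags and uses only Python's standard `html.parser`. The sample page, its example.com URLs, and the names `OpenGraphParser` and `parse_open_graph` are invented for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects og:* meta tags from an HTML document into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            # Strip the "og:" prefix so keys read "type", "title", and so on.
            self.properties[prop[3:]] = attrs["content"]


def parse_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# Hypothetical page describing a single movie entity.
SAMPLE = """
<html><head>
<meta property="og:type" content="movie" />
<meta property="og:title" content="Avatar" />
<meta property="og:url" content="https://example.com/movies/avatar" />
<meta property="og:image" content="https://example.com/img/avatar.jpg" />
<meta property="og:description" content="A review of the film." />
</head><body></body></html>
"""

if __name__ == "__main__":
    for key, value in parse_open_graph(SAMPLE).items():
        print(key, "=", value)
```

Because each property name maps to a single value in this sketch, a page can describe only one object, which mirrors the single-entity limitation discussed later in the article.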

If all of this computes correctly, Facebook should be able to display a rich collection of entities on user profiles, and should be able to show you friends who liked the same thing around the web, regardless of the site. So by publishing this protocol and asking websites to embrace it, Facebook clearly declares its foray into the web of people and things -- aka, the semantic web.

Technical issues with Facebook's protocol

As I've previously pointed out in my post on ReadWriteWeb, there are several issues with the markup that Facebook proposed.

1. There is no way to disambiguate things. This is quite a miss on Facebook's part, which is already resulting in bogus data on user profiles. The ambiguity is because the protocol is lacking secondary attributes for some data types. For example, it is not possible to distinguish the movie from its remake. Typically, such disambiguation would be done by using either a director or a year property, but Facebook's protocol does not define these attributes. This leads to duplicates and dirty data.

2. There is no way to define multiple objects on the page. This is another rather surprising limitation, since previous markups, like Microformats and abmeta, support this use case. Of course if Facebook only cares about getting people to LIKE pages so that they can do better ad targeting, then having multiple objects inside the page is not necessary. But Facebook claimed and marketed this offering as semantic web, so it is surprising that there is no way to declare multiple entities on a single page. Surely a comprehensive solution ought to do that.

3. Open protocol can't be closed. Finally, Facebook has done this without collaborating with anyone. For something to be rightfully called an Open Graph Protocol, it should be developed in an open collaboration with the web. Surely, Google, Yahoo!, W3C and even small startups playing in the semantic web space would have good things to contribute here.
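The disambiguation problem in the first point can be shown with a small sketch. The records below are invented, and the `year` field is exactly the kind of secondary attribute the article says the protocol lacks; it is included here only to show what difference it would make. The function name `group_likes` and the site names are assumptions for this example.

```python
def group_likes(likes, key_fields):
    """Group liked objects by the chosen identifying fields."""
    groups = {}
    for like in likes:
        key = tuple(like.get(field) for field in key_fields)
        groups.setdefault(key, []).append(like["site"])
    return groups


# Hypothetical likes gathered from three sites: one for the 1984 film,
# two for its 2010 remake.
LIKES = [
    {"type": "movie", "title": "The Karate Kid", "year": 1984, "site": "site-a.example"},
    {"type": "movie", "title": "The Karate Kid", "year": 2010, "site": "site-b.example"},
    {"type": "movie", "title": "The Karate Kid", "year": 2010, "site": "site-c.example"},
]

if __name__ == "__main__":
    # With only type and title, the original and the remake collapse together.
    print(len(group_likes(LIKES, ("type", "title"))))
    # Adding a secondary attribute keeps them apart.
    print(len(group_likes(LIKES, ("type", "title", "year"))))
```

With type and title alone the three likes fall into one bucket; adding the year yields two, one per film.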

It sadly appears that getting the semantic web elements correct was not the highest priority for Facebook. Instead, the announcement seems to be a competitive move against Twitter, Google and others with the goal to lock-in publishers by giving them a simple way to recycle traffic.

Where to next?

Despite the drawbacks, there is no doubt that Facebook's announcement is a net positive for the web at large. When one of the top companies takes a 180-degree turn and embraces a vision that's been discussed for a decade, everyone stops and listens. The web of people and things is now both very important and a step closer. The questions are: What is the right way? And how do we get there?

For starters, it would be good to fill in some holes in Facebook Open Graph. Whether it is the right way overall or not, at least we need to make it complete. It is important to add support for the secondary attributes necessary for disambiguation, and also for multiple entities inside the page (even if there is only one LIKE button on the whole page). Both of these are already addressed by Microformats and abmeta, so it should be easy to fix.

Beyond technical issues, Facebook should open up this protocol and make it owned by the community, instead of being driven by one company's business agenda. A true roundtable with major web companies, publishers, and small startups would result in a correct, comprehensive and open protocol. We want to believe that Facebook will do the right thing and will collaborate with the rest of the web on what has been important work spanning years for many of us. The prospects are exciting, because we just made a giant leap. We just need to make sure we land in the right place.

May 21 2010

Four short links: 21 May 2010

  1. Infrastructures (xkcd) -- absolutely spot-on.
  2. The Michel Thomas App: Behind the Scenes (BERG) -- not interesting to me because it's iPhone, but for the insight into the design process. The main goal here was for me to do just enough to describe the idea, so that Nick could take it and iterate it in code. He’d then show me what he’d built; I’d do drawings or further animations on top of it, and so on and so on. It’s a fantastic way of working. Before long, you start finishing each others’ sentences. Both of us were able to forget about distinguishing between design and code, and just get on with thinking through making together. It’s brilliant when that happens.
  3. Open Government and the World Wide Web -- Tim Berners-Lee offered his "Five-Star" plan for open data. He said public information should be awarded a star rating based on the following criteria: one star for making the information public; a second is awarded if the information is machine-readable; a third star if the data is offered in a non-proprietary format; a fourth is given if it is in Linked Data format; a fifth if it has actually been linked. Not only a good rating system, but a clear example of the significantly better communication by semantic web advocates. Three years ago we'd have had a wiki specifying a ratings ontology with a union of evaluation universes reconciled through distributed trust metrics and URI-linked identity delivered through a web-services accessible RDF store, a prototype of one component of which was running on a devotee's desktop machine at a university in Bristol, written in an old version of Python. (via scilib on Twitter)
  4. Data Access, Data Ownership, and Sharecropping -- With Flickr you can get out, via the API, every single piece of information you put into the system. Every photo, in every size, plus the completely untouched original. (which we store for you indefinitely, whether or not you pay us) Every tag, every comment, every note, every people tag, every fave. Also your stats, view counts, and referers. Not the most recent N, not a subset of the data. All of it. It’s your data, and you’ve granted us a limited license to use it. Additionally we provide a moderately competently built API that allows you to access your data at rates roughly 500x faster than the rate that will get you banned from Twitter. Asking people to accept anything else is sharecropping. It’s a bad deal. (via Marc Hedlund)
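
As a rough illustration of the "Five-Star" plan described in item 3, here is a minimal Python sketch. The `Dataset` class, its field names, and the `star_rating` function are hypothetical names invented for this example, and the rule that each star requires all the earlier ones is an assumption about how the rating is meant to accumulate, not something the summary above states outright.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published data set (illustrative only)."""
    public: bool = False                  # star 1: the information is public
    machine_readable: bool = False        # star 2: machine-readable
    non_proprietary_format: bool = False  # star 3: non-proprietary format
    linked_data_format: bool = False      # star 4: in Linked Data format
    actually_linked: bool = False         # star 5: actually linked to other data


def star_rating(ds: Dataset) -> int:
    """Count stars in order, stopping at the first criterion that is not met.

    Assumption: the stars are cumulative, so a later criterion does not
    count unless every earlier one is satisfied.
    """
    criteria = [
        ds.public,
        ds.machine_readable,
        ds.non_proprietary_format,
        ds.linked_data_format,
        ds.actually_linked,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars
```

Under these assumptions, a data set that is public and machine-readable but offered only in a proprietary format would rate two stars.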

April 09 2010

April 01 2010

Imagine a world that has moved entirely to cloud computing

For April Fools' Day I'm offering a short story about a future world
that has moved entirely to cloud computing: Hardware

The cloud still scares as many IT managers as it attracts. But the
advantages of cloud computing for maintenance, power consumption, and
other things suggest it will dominate computing in a decade or so.

Meanwhile, other changes are affecting the way we use data
every day. Movements such as NoSQL, big data, and the Semantic Web all
come at data from different angles, but indicate a shift from
retrieving individual facts we want to looking at relationships among
huge conglomerations of data. I've explored all these things in blogs
on this site, along with some other trends such as shrinking computer
devices, so now I decided to combine them in a bit of a whacky tale.

March 23 2010

Four short links: 23 March 2010

  1. British Prime Minister's Speech -- a huge amount of the speech is given to digital issues, including the funding and founding of an "Institute for Web Science" headed by Sir Tim Berners-Lee. (via Rchards on Twitter)
  2. Periodic Table of Science Bloggers -- a great way to explore the universe of science blogging. (via sciblogs)
  3. For All The Tea in China -- a tale of industrial espionage from the 1800s. The man behind the theft was Robert Fortune, a Scottish-born botanist who donned mandarin garb, shaved the top of his head and attached a long braid as part of a disguise that allowed him to pass as Chinese so he could go to areas of the country that were off-limits to foreigners. He forged a token and stole IP; in some ways it's like the reverse of the Google-China break-in. (via danjite on Twitter)
  4. Nature by Numbers -- relating numbers, geometry, and nature. Beautiful and educational. (via BoingBoing)

January 07 2010

Pew Research asks questions about the Internet in 2020

Pew Research, which seems to be interested in just about everything,
conducts a "future of the Internet" survey every few years in which
they throw outrageously open-ended and provocative questions at a
chosen collection of observers in the areas of technology and
society. Pew makes participation fun by finding questions so pointed
that they make you choke a bit. You start by wondering, "Could I
actually answer that?" and then think, "Hey, the whole concept is so
absurd that I could say anything without repercussions!" So I
participated in their 2006 survey and did it again this week. The Pew
report will
aggregate the yes/no responses from the people they asked to
participate, but I took the exercise as a chance to hammer home my own
choices of issues.

(If you'd like to take the survey, you can currently visit;

and enter PIN 2000.)

Will Google make us stupid?

This first question is not about a technical or policy issue on the
Internet or even how people use the Internet, but a purported risk to
human intelligence and methods of inquiry. Usually, questions about
how technology affects our learning or practice really concern our
values and how we choose technologies, not the technology itself. And
that's the basis on which I address such questions. I am not saying
technology is neutral, but that it is created, adopted, and developed
over time in a dialog with people's desires.

I respect the questions posed by Nicholas Carr in his Atlantic
article--although it's hard to take such worries seriously when he
suggests that even the typewriter could impoverish writing--and would
like to allay his concerns. The question is all about people's
choices. If we value introspection as a road to insight, if we
believe that long experience with issues contributes to good judgment
on those issues, if we (in short) want knowledge that search engines
don't give us, we'll maintain our depth of thinking and Google will
only enhance it.

There is a trend, of course, toward instant analysis and knee-jerk
responses to events that degrades a lot of writing and discussion. We
can't blame search engines for that. The urge to scoop our contacts
intersects with the starvation of funds for investigative journalism
to reduce the value of the reports we receive about things that are
important for us. Google is not responsible for that either (unless
you blame it for draining advertising revenue from newspapers and
magazines, which I don't). In any case, social and business trends
like these are the immediate influences on our ability to process
information, and searching has nothing to do with them.

What search engines do is provide more information, which we can use
either to become dilettantes (Carr's worry) or to bolster our
knowledge around the edges and do fact-checking while we rely mostly
on information we've gained in more robust ways for our core analyses.
Google frees the time we used to spend pulling together the last 10%
of facts we need to complete our research. I read Carr's article when
The Atlantic first published it, but I used a web search to pull it
back up and review it before writing this response. Google is my

Will we live in the cloud or the desktop?

Our computer usage will certainly move more and more to an environment
of small devices (probably in our hands rather than on our desks)
communicating with large data sets and applications in the cloud.
This dual trend, bifurcating our computer resources between the tiny
and the truly gargantuan, has many consequences that other people
have explored in depth: privacy concerns, the risk that application
providers will gather enough data to preclude competition, the
consequent slowdown in innovation that could result, questions about
data quality, worries about services becoming unavailable (like
Twitter's fail whale, which I saw as recently as this morning), and

One worry I have is that netbooks, tablets, and cell phones will
become so dominant that meaty desktop systems will rise in cost
till they are within the reach only of institutions and professionals.
That will discourage innovation by the wider populace and reduce us to
software consumers. Innovation has benefited a great deal from the
ability of ordinary computer users to bulk up their computers with a
lot of software and interact with it at high speeds using high quality
keyboards and large monitors. That kind of grassroots innovation may
go away along with the systems that provide those generous resources.

So I suggest that cloud application providers recognize the value of
grassroots innovation--following Eric von Hippel's findings--and
solicit changes in their services from their visitors. Make their code
open source--but even more than that, set up test environments where
visitors can hack on the code without having to download much
software. Then anyone with a comfortable keyboard can become part of
the development team.

We'll know that software services are on a firm foundation for future
success when each one offers a "Develop and share your plugin here"

Will social relations get better?

Like the question about Google, this one is more about our choices
than our technology. I don't worry about people losing touch with
friends and family. I think we'll continue to honor the human needs
that have been hard-wired into us over the millions of years of
evolution. I do think technologies ranging from email to social
networks can help us make new friends and collaborate over long

I do worry, though, that social norms aren't keeping up with
technology. For instance, it's hard to turn down a "friend" request
on a social network, particularly from someone you know, and even
harder to "unfriend" someone. We've got to learn that these things are
OK to do. And we have to be able to partition our groups of contacts
as we do in real life (work, church, etc.). More sophisticated social
networks will probably evolve to reflect our real relationships more
closely, but people have to take the lead and refuse to let technical
options determine how they conduct their relationships.

Will the state of reading and writing be improved?

Our idea of writing changes over time. The Middle Ages left us lots of
horribly written documents. The few people who learned to read and
write often learned their Latin (or other language for writing) rather
minimally. It took a long time for academies to impose canonical
rules for rhetoric on the population. I doubt that a cover letter and
resume from Shakespeare would meet the writing standards of a human
resources department; he lived in an age before standardization and
followed his ear more than rules.

So I can't talk about "improving" reading and writing without
addressing the question of norms. I'll write a bit about formalities
and then about the more important question of whether we'll be able to
communicate with each other (and enjoy what we read).

In many cultures, writing and speech have diverged so greatly that
they're almost separate languages. And English in Jamaica is very
different from English in the US, although I imagine Jamaicans try
hard to speak and write in US style when they're communicating with
us. In other words, people do recognize norms, but usage depends on
the context.

Increasingly, nowadays, the context for writing is a very short form
utterance, with constant interaction. I worry that people will lose
the ability to state a thesis in unambiguous terms and a clear logical
progression. But because they'll be in instantaneous contact with
their audience, they can restate their ideas as needed until
ambiguities are cleared up and their reasoning is unveiled. And
they'll be learning from others along the way. Making an elegant and
persuasive initial statement won't be so important because that
statement will be only the first step of many.

Let's admit that dialog is emerging as our generation's way to develop
and share knowledge. The notion driving Ibsen's Hedda Gabler--that an
independent philosopher such as Ejlert Løvborg could write a
masterpiece that would in itself change the world--is passé. A
modern Løvborg would release his insights in a series of blogs
to which others would make thoughtful replies. If this eviscerated
Løvborg's originality and prevented him from reaching the
heights of inspiration--well, that would be Løvborg's fault for
giving in to pressure from more conventional thinkers.

If the Romantic ideal of the solitary genius is fading, what model for
information exchange do we have? Check Plato's Symposium. Thinkers
were expected to engage with each other (and to have fun while doing
so). Socrates denigrated reading, because one could not interrogate
the author. To him, dialog was more fertile and more conducive to

The ancient Jewish scholars also preferred debate to reading. They
certainly had some received texts, but the vast majority of their
teachings were generated through conversation and were not written
down at all until the scholars realized they had to in order to avoid
losing them.

So as far as formal writing goes, I do believe we'll lose the subtle
inflections and wordplay that come from a widespread knowledge of
formal rules. I don't know how many people nowadays can appreciate all
the ways Dickens sculpted language, for instance, but I think there
will be fewer in the future than there were when Dickens rolled out
his novels.

But let's not get stuck on the aesthetics of any one period. Dickens
drew on a writing style that was popular in his day. In the next
century, Toni Morrison, John Updike, and Vladimir Nabokov wrote in a
much less formal manner, but each is considered a beautiful stylist in
his or her own way. Human inventiveness is infinite and language is a
core skill in which we all take pleasure, so we'll find new ways to
play with language that are appropriate to our age.

I believe there will always remain standards for grammar and
expression that will prove valuable in certain contexts, and people
who take the trouble to learn and practice those standards. As an
editor, I encounter lots of authors with wonderful insights and
delightful turns of phrase, but with deficits in vocabulary, grammar,
and other skills and resources that would enable them to write better.
I work with these authors to bring them up to industry-recognized

Will those in GenY share as much information about themselves as they age?

I really can't offer anything but baseless speculation in answer to
this question, but my guess is that people will continue to share as
much as they do now. After all, once they've put so much about
themselves up on their sites, what good would it do to stop? In for a
penny, in for a pound.

Social norms will evolve to accept more candor. After all, Ronald
Reagan got elected President despite having gone through a divorce,
and Bill Clinton got elected despite having smoked marijuana.
Society's expectations evolve.

Will our relationship to key institutions change?

I'm sure the survey designers picked this question knowing that its
breadth makes it hard to answer, but in consequence it's something of
a joy to explore.

The widespread sharing of information and ideas will definitely change
the relative power relationships of institutions and the masses, but
they could move in two very different directions.

In one scenario offered by many commentators, the ease of
whistleblowing and of promulgating news about institutions will
combine with the ability of individuals to associate over social
networking to create movements for change that hold institutions more
accountable and make them more responsive to the public.

In the other scenario, large institutions exploit high-speed
communications and large data stores to enforce even greater
centralized control, and use surveillance to crush opposition.

I don't know which way things will go. Experts continually urge
governments and businesses to open up and accept public input, and
those institutions resist doing so despite all the benefits. So I have
to admit that in this area I tend toward pessimism.

Will online anonymity still be prevalent?

Yes, I believe people have many reasons to participate in groups and
look for information without revealing who they are. Luckily, most new
systems (such as U.S. government forums) are evolving in ways that
build in privacy and anonymity. Businesses are more eager to attach
our online behavior to our identities for marketing purposes, but
perhaps we can find a compromise where someone can maintain a
pseudonym associated with marketing information but not have it
attached to his or her person.

Unfortunately, most people don't appreciate the dangers of being
identified. But those who do can take steps to be anonymous or
pseudonymous. As for state repression, there is something of an
escalating war between individuals doing illegal things and
institutions that want to uncover those individuals. So far, anonymity
seems to be holding on, thanks to a lot of effort by those who care.

Will the Semantic Web have an impact?

As organizations and news sites put more and more information online,
they're learning the value of organizing and cross-linking
information. I think the Semantic Web is taking off in a small way on
site after site: a better breakdown of terms on one medical site, a
taxonomy on a Drupal-powered blog, etc.

But Berners-Lee had a much grander vision of the Semantic Web than
better information retrieval on individual sites. He's gunning for
content providers and Web designers the world around to pull together
and provide easy navigation from one site to another, despite wide
differences in their contributors, topics, styles, and viewpoints.

This may happen someday, just as artificial intelligence is looking
more feasible than it was ten years ago, but the chasm between the
present and the future is enormous. To make the big vision work, we'll
all have to use the same (or overlapping) ontologies, with standards
for extending and varying the ontologies. We'll need to disambiguate
things like webbed feet from the World Wide Web. I'm sure tools to
help us do this will get smarter, but they need to get a whole lot

Even with tools and protocols in place, it will be hard to get
billions of web sites to join the project. Here the cloud may be of
help. If Google can perform the statistical analysis and create the
relevant links, I don't have to do it on my own site. But I bet
results would be much better if I had input.

Are the next takeoff technologies evident now?

Yes, I don't believe there's much doubt about the technologies that
companies will commercialize and make widespread over the next five
years. Many people have listed these technologies: more powerful
mobile devices, ever-cheaper netbooks, virtualization and cloud
computing, reputation systems for social networking and group
collaboration, sensors and other small systems reporting limited
amounts of information, do-it-yourself embedded systems, robots,
sophisticated algorithms for slurping up data and performing
statistical analysis, visualization tools to report the results of
that analysis, affective technologies, personalized and location-aware
services, excellent facial and voice recognition, electronic paper,
anomaly-based security monitoring, self-healing systems--that's a
reasonable list to get started with.

Beyond five years, everything is wide open. One thing I'd like to see
is a really good visual programming language, or something along those
lines that is more closely matched to human strengths than our current
languages. An easy high-level programming language would immensely
increase productivity, reduce errors (and security flaws), and bring
in more people to create a better Internet.

Will the internet still be dominated by the end-to-end principle?

I'll pick up here on the paragraph in my answer about takeoff
technologies. The end-to-end principle is central to the Internet. I
think everybody would like to change some things about the current
essential Internet protocols, but they don't agree on what those things
should be. So I have no expectation of a top-to-bottom redesign of the
Internet at any point in our viewfinder. Furthermore, the inertia
created by millions of systems running current protocols would be hard
to overcome. So the end-to-end principle is enshrined for the
foreseeable future.

Mobile firms and ISPs may put up barriers, but anyone in an area of
modern technology who tries to shut the spigot on outside
contributions eventually becomes last year's big splash. So unless
there's a coordinated assault by central institutions like
governments, the inertia of current systems will combine with the
momentum of innovation and public demand for new services to keep
chokepoints from being serious problems.
