
July 14 2011

Why files need to die

Files are an outdated concept. As we go about our daily lives, we don't open up a file for each of our friends or create folders full of detailed records about our shopping trips. Create, watch, socialize, share, and plan — these are the new verbs of the Internet age — not open, save, close and trash.

Clinging to outdated concepts stifles innovation. Consider the QWERTY keyboard. It was designed 133 years ago to slow down typists who were causing typewriter hammers to jam. The last typewriter factory in the world closed last month, and yet even the shiny new iPad 2 still uses the same layout. Creative alternatives like Dvorak and more recently Swype still struggle to compete with this deeply ingrained idea of how a keyboard should look.

Today we use computers for everything from booking travel to editing snapshots, and we accumulate many thousands of files. As a result, we've become digital librarians, devising naming schemes and folder systems just to cope with the mountains of digital "stuff" in our lives.

The file folder metaphor makes no sense in today's world. Gone are the smoky 1970s offices where secretaries bustled around fetching armfuls of paperwork for their bosses, archiving cardboard files in dusty cabinets. Our lives have gone digital and our data zips around the world in seconds as we buy goods online or chat with distant relatives.

A file is a snapshot of a moment in time. If I email you a document, I'm freezing it and making an identical copy. If either of us wants to change it, we have to keep our two separate versions in sync.

So it's no wonder that as we try to force this dated way of thinking onto today's digital landscape, we are virtually guaranteed the pains of lost data, version conflicts and failed uploads.

It's time for a new way to store data – a new mental model that reflects the way we use computers today.


Flogging a dead horse

Microsoft, Apple and Linux have all failed to provide intuitive ways to work with our data. Many new products have emerged to try to ease our pain, such as Dropbox and Infovark, but they're limited by the tired model of files and folders.

The emergence of Web 2.0 offered new hope, with much brouhaha over folksonomies. The idea was to harness "people power" by getting us to tag pictures or websites with meaningful labels, removing the need for folders. But Flickr and Delicious, poster boys of the tagging revolution, have fallen from favor as the tools have stagnated and enthusiasm for tagging has dwindled.

Clearly, human knowledge is needed for computers to make sense of our data – but relying on human effort to digitize that knowledge by labeling files or entering data can only take us so far. Even Wikipedia has vast gaps in its coverage.

Instead, we need computers to interpret and organize data for us automatically. This means they'll store not only our data, but also information about that data and what it means – metadata. We need them to really understand our digital information as something more than a set of text documents and binary streams. Only then will we be freed from our filing frustrations.

I am not a machine, don't make me think like one

In all our efforts to interact with computers, we're forced to think like a machine: What device should I access? What format is that file? What application should I launch to read it? But that's not how the brain works. We form associations between related things, and that's how we access our memories:

Associative recall in the brain

Wouldn't it be nice if we could navigate digital data in this way? Isn't it about time that computers learned to express the world in our terms, not theirs?

It might seem like a far-off dream, but it's achievable. To do this, computers will need to know what our data relates to. They can learn this by capturing information automatically and using it to annotate our data at the point it is first stored — saving us from tedious data entry and filing later.

For example, camera manufacturers have realized that adding GPS to cameras provides valuable metadata for each photograph. Back at your PC, your geo-tagged images will be automatically grouped by time and location with zero effort.
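The grouping the article describes can be sketched in a few lines. This is a toy illustration, not any camera vendor's actual algorithm: the photo records, the six-hour gap, and the coordinate radius are all invented assumptions standing in for real EXIF metadata.

```python
from datetime import datetime, timedelta

# Hypothetical photo records: (timestamp, latitude, longitude), as a
# GPS-equipped camera might embed in each image's EXIF metadata.
photos = [
    (datetime(2011, 7, 2, 14, 5), 45.52, -122.68),
    (datetime(2011, 7, 2, 14, 30), 45.52, -122.68),
    (datetime(2011, 7, 9, 9, 10), 47.61, -122.33),
]

def group_by_trip(photos, gap=timedelta(hours=6), radius=0.05):
    """Cluster photos into events: start a new group whenever the
    capture time jumps by more than `gap` or the location moves by
    more than `radius` degrees."""
    groups = []
    for ts, lat, lon in sorted(photos):
        if groups:
            last_ts, last_lat, last_lon = groups[-1][-1]
            if (ts - last_ts) <= gap and abs(lat - last_lat) <= radius \
                    and abs(lon - last_lon) <= radius:
                groups[-1].append((ts, lat, lon))
                continue
        groups.append([(ts, lat, lon)])
    return groups

trips = group_by_trip(photos)
print(len(trips))  # the two separate days/places form 2 groups
```

The point is that the grouping requires zero effort from the user: the metadata was captured automatically at the moment the data was created.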

Our digital lives are full of signals and sensors that can be similarly harnessed:

  • ReQall uses your calendar and to-do list activity to help deliver information at the right time.
  • RescueTime tracks the websites and programs you use to understand your working habits.
  • Lifelogging projects like MyLifeBits go further still, recording audio and video of your life to provide a permanent record.
  • A research project at Ryerson University demonstrates the idea of context-aware computing — combining live, local data and user information to deliver highly relevant, customized content.

Semantics: Teaching computers to understand human language

Metadata annotation via sensors and semantic annotation

As this diagram shows, hardware and software sensors can only tell half the story. Where computers stand to learn the most is by analyzing the meanings behind the 1s and 0s. Once computers understand our language, our documents and correspondence are no longer just isolated files. They become source material, full of facts and ready to be harvested.

This is the science of semantics — programs that can extract meaning from the written word.


Today, most semantic research is done by enterprises that can afford to spend time and money on enterprise content management (ECM) and content analytics systems to make sense of their vast digital troves. But soon consumers will reap the benefits of semantic technology too, as these applications show:

  • While surfing the web, we can chat and interact around particular movies, books or activities using the browser plug-in GetGlue, which scans the text in the web pages you visit to identify recognized social objects.
  • We will soon have our own intelligent agents, the first of which is Siri, an iPhone app that can book movie tickets or make restaurant reservations without us having to fill in laborious online forms.
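At its simplest, recognizing "social objects" in a page comes down to entity extraction. Here is a toy, dictionary-based sketch — not GetGlue's or any vendor's actual method, and the entity catalog is invented; real systems use far larger entity lists plus statistical named entity recognition.

```python
import re

# A toy entity list: in a real system this would come from large
# curated catalogs of movies, books, and places.
KNOWN_ENTITIES = {
    "The Social Network": "movie",
    "Moby-Dick": "book",
    "Portland": "place",
}

def extract_entities(text):
    """Naive dictionary matching: return (entity, type) for every
    known label that appears in the text."""
    found = []
    for label, etype in KNOWN_ENTITIES.items():
        if re.search(re.escape(label), text, re.IGNORECASE):
            found.append((label, etype))
    return found

page = "We watched The Social Network last night in Portland."
print(extract_entities(page))
```

Even this crude matching turns an opaque blob of text into something a program can relate to other data — the essence of the semantic approach described above.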

This ability for computers to understand our content is critical as we move toward file-less computing. A new era of information-based applications is beginning, but its success requires a world where information isn't fragmented across different files.

Time for a new view of data

Let's use your summer vacation as an example: All the digital information relating to your vacation is scattered across hundreds of files, emails and transactions, often locked into different applications, services and formats.

No matter how many fancy applications you have for "seamlessly syncing" all these files, any talk of interoperability is meaningless until you have a basic fabric for viewing and interacting with your data at a higher level.

If not files, then what? The answer is surprisingly simple.

What is the one thing all your data has in common?

Time.

Almost all data can be thought of as a stream, changing over time:

The streams of my digital life

Already we generate vast streams of data as we go about our lives: credit card purchases, web history, photographs, file edits. We never get to see them on screen like that though. Combining these streams into a single timeline — a personal life stream — brings everything together in a way that makes sense:

A personal life stream


Asking the computer "Show me everything I was doing at 3 p.m. yesterday" or "Where are Thursday's figures?" is something we can't easily do today. Products such as AllOfMe are beginning to experiment in this space.
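Because each source stream is already time-ordered, building the combined life stream is just a merge, and the "what was I doing at 3 p.m.?" query becomes a slice of the timeline. The streams and events below are invented examples; this is a sketch of the idea, not a product.

```python
import heapq
from datetime import datetime

# Hypothetical per-source streams, each already ordered by time.
purchases = [(datetime(2011, 7, 13, 9, 15), "purchase", "coffee, $3.50")]
web_history = [
    (datetime(2011, 7, 13, 9, 40), "web", "radar.oreilly.com"),
    (datetime(2011, 7, 13, 15, 0), "web", "flickr.com"),
]
file_edits = [(datetime(2011, 7, 13, 14, 55), "edit", "figures-thursday.xls")]

# Merge the ordered streams into one personal life stream.
life_stream = list(heapq.merge(purchases, web_history, file_edits))

def at_time(stream, when, window_sec):
    """Answer 'what was I doing around `when`?' by slicing the timeline."""
    return [e for e in stream
            if abs((e[0] - when).total_seconds()) <= window_sec]

# Everything within ten minutes of 3 p.m. on July 13
print(at_time(life_stream, datetime(2011, 7, 13, 15, 0), 600))
```

Note that no stream needs to know about any other; the timeline view falls out of the shared timestamp alone.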

We can go further — time itself can be used to help associate things. For example: Since I can only be in one place at one time, everything that happens there and then must be related:

All data at the same time is related

The computer can easily help me access the most relevant information — it just needs to track back along the streams to the last time I was at a certain place or with a specific person:

Related data can be found by finding previous occurrences on each stream
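The track-back idea can be sketched directly: walk backward along the timeline to the last matching occurrence, then collect everything that happened near it. The stream contents and the fifteen-minute window are invented for illustration.

```python
from datetime import datetime

# Hypothetical life stream: (timestamp, stream_name, value), time-ordered.
stream = [
    (datetime(2011, 7, 11, 10, 0), "place", "office"),
    (datetime(2011, 7, 11, 10, 5), "people", "Alice"),
    (datetime(2011, 7, 11, 10, 10), "edit", "budget.xls"),
    (datetime(2011, 7, 12, 9, 0), "place", "cafe"),
]

def last_occurrence(stream, name, value):
    """Walk back along the timeline to the most recent event matching
    (name, value)."""
    for i in range(len(stream) - 1, -1, -1):
        ts, n, v = stream[i]
        if n == name and v == value:
            return ts
    return None

def related(stream, name, value, window_sec=900):
    """Events within `window_sec` of the last matching occurrence —
    candidates for association, since they happened there and then."""
    ts = last_occurrence(stream, name, value)
    if ts is None:
        return []
    return [e for e in stream
            if e != (ts, name, value)
            and abs((e[0] - ts).total_seconds()) <= window_sec]

# "What's relevant to the office?" -> the person I met and the file I edited
print(related(stream, "place", "office"))
```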

The world — our lives — is interconnected, and data needs to be the same.

This timeline-based view of data is useful, but it becomes even more powerful when combined with the annotations and semantic metadata gathered earlier. With this much cross-linking between data, our information can now be associated with everything it relates to, automatically.

Finally, we can do away with files because we have a system that works like the brain does, giving us a new power: to traverse effortlessly from one related concept or entity to another until we reach the desired information:

Associative data navigation

In a system like this we navigate based on what the data means to us – not which file it is located in.

There will be technical challenges in maintaining data that resides on different devices and is held by different service providers, but cloud computing industry giants like Amazon and Google have already solved much more difficult problems.

A world without files

In the world of linked data and semantically indexed information, saving or losing data is not something we'll have to worry about. The stream is saved. Think about it: You'd never have to organize your emails or project plans because everything would be there, as connected as the thoughts in your head. Collaborating and sharing would simply mean giving other people access to read from or contribute to part of your stream.

We already see a glimpse of this world when we look at Facebook. It's no wonder that it's so successful; it lets us deal with people, events, messages and photos — the real fabric of our everyday lives — not artificial constructs like files, folders and programs.

Files are a relic of a bygone age. Often, we hang onto ideas long past their due date because it's what we've always done. But if we're willing to let go of the past, a fascinating world of true human-computer interaction and easy-to-find information awaits.

Moving beyond files to associative and stream-based models will have profound implications. Data will be traceable, creators will be able to retain control of their works, and copies will know they are copies. Piracy and copyright debates will be turned on their heads, as the focus shifts from copying to the real question of who can access what. Data traceability could also help counter the spread of viral rumors and inaccurate news reports.

Issues like anonymity, data security and personal privacy will require a radical rethink. But wouldn't it be empowering to control your own information and who can access it? There's no reason why big corporations should have control of our data. With the right general-purpose operating system that makes hosting a piece of data, recording its metadata and managing access to it as easy as sharing a photo on Facebook, we will all be empowered to embrace our digital futures like never before.

Photo: Filing Cabinet by Robin Kearney, on Flickr





July 13 2011

Four short links: 13 July 2011

  1. Freebase in Node.js (github) -- handy library for interacting with Freebase from node code. (via Rob McKinnon)
  2. Formalize -- CSS library to provide a standard style for form elements. (via Emma Jane Hogbin)
  3. Suggesting More Friends Using the Implicit Social Graph (PDF) -- Google paper on the algorithm behind Friend Suggest. Related: Katango. (via Big Data)
  4. Dyslexia -- a typeface for dyslexics. (via Richard Soderberg)

April 27 2011

Linked data creates a new lens for examining the U.S. Civil War

April 2011 marks the 150th anniversary of the first hostilities of the U.S. Civil War, and museums, municipalities, historic sites, and schools are making their preparations for the events and exhibits to commemorate it. While, no doubt, times are tough for funding cultural heritage projects, there's a lot of excitement around the sesquicentennial, making it a great opportunity for those exploring how technology can make history more interactive.

It's also a great opportunity to pursue linked data efforts across these museums and historic sites, in turn making this historical information more discoverable and interoperable. That's what the Civil War Data 150 project is undertaking, and I asked two of the project organizers — Scott Nesbitt, Civil War historian and associate director of the Digital Scholarship Lab at the University of Richmond, and Jon Voss, founder of LookBackMaps — about how the Civil War anniversary will help boost linked data and digital history efforts.

What opportunities does the sesquicentennial provide for museums, historical sites, data geeks, and developers?

Scott Nesbitt: We're in a time of remarkable collaboration across institutional barriers. Just as an example, a wide array of institutions ranging from the Slave Trail Commission to the Museum of the Confederacy came together recently in Richmond, Virginia to commemorate Civil War and Emancipation Day. More than 3,000 visitors braved the rain to tour these sites and hear presentations about new discoveries and data sources that are still emerging. In the same way, building technical links to data currently held by institutions within a single community and across the country is really quite an opportunity.

Jon Voss: The cultural heritage community has been preparing for the sesquicentennial for several years — it's a huge opportunity to engage and educate new audiences about the Civil War. Many archives and libraries have been bringing their Civil War collections to the web in new ways, and we've seen a host of new digitization efforts as well. And while there are incredible curated exhibits and events across the country, an increasing number of institutions are recognizing the power of direct discovery and are making raw data and metadata open and accessible to developers and the general public. Combined, there's a myriad of possibilities to discover and analyze the Civil War in new ways during the four-year commemoration.

How does linked data benefit the study of Civil War history?

Jon Voss: Perhaps the most exciting possibility of applying linked open data to Civil War history is to connect information and images across many standalone databases and view them together in any number of applications. One element of this is discovery — finding images associated with one regiment in multiple institutions, for instance. But more important is the ability to combine that information in an entirely new way. Just the ability to search across historical collections is a radical development, as search engines typically aren't able to crawl databases. Part of what linked data does is expose metadata that's been pretty much hidden up until now.

What are some of the new things we can learn thanks to this sort of approach?

Jon Voss: Already we're learning about the variety and sheer amount of data out there. More than 3 million Americans fought in the Civil War, and there is an enormous amount of paper left behind, including muster rolls, medical records, food shipments, pensions, photographs, correspondence, and first-hand accounts.

Scott Nesbitt: With new links between data sources, historians will be able to imagine new questions to ask that would have been discarded as nearly impossible to answer before. We don't know, for example, whether Union regiments made up of working-class men or farmers were more likely to go out of their way to set enslaved men and women free in the South, or whether units made up primarily of Republican or Democratic men were more likely to confiscate food from devastated southern farms. So the possibilities for historians are exciting.

What are the most interesting data projects you see happening in conjunction with the anniversary?

Scott Nesbitt: Linked data presents an exciting challenge: How do we begin to make sense of the patterns within large datasets and between disparate kinds of data? At the University of Richmond, we have been building "Hidden Patterns of the Civil War," a suite of projects devoted to exploring these possibilities.

Hidden Patterns of the Civil War
"Hidden Patterns of the Civil War" collects a number of interrelated data projects.

Jon Voss: We've already seen some great data visualizations from media outlets like the History Channel, Washington Post, and The New York Times, and I expect to see a lot more of that as we go. But what's on the horizon are augmented reality and location-based apps that really bring the harsh reality of the Civil War to life — there are a few of these projects already in the works. There will also be lots of opportunities for people to transcribe and map documents and photos. That will help us make the links between disparate datasets.

What's really exciting is that all of the data we work with on this project will be permanently open and free to use.

Associated photo used on home and category pages: General John P. Hatch by The U.S. National Archives, on Flickr

This interview was edited and condensed.





November 15 2010

Where the semantic web stumbled, linked data will succeed

In the same way that the Holy Roman Empire was neither holy nor Roman, Facebook's OpenGraph Protocol is neither open nor a protocol. It is, however, an extremely straightforward and applicable standard for document metadata. From a strictly semantic viewpoint, OpenGraph is considered hardly worthy of comment: it is a frankenstandard, a mishmash of microformats and loosely-typed entities, lobbed casually into the semantic web world with hardly a backward glance.

But this is not important. While OpenGraph avoids, or outright ignores, many of the problematic issues surrounding semantic annotation (see Alex Iskold's excellent commentary on OpenGraph here on Radar), criticism focusing only on its technical purity is missing half of the equation. Facebook gets it right where other initiatives have failed. While OpenGraph is incomplete and imperfect, it is immediately usable and sympathetic with extant approaches. Most importantly, OpenGraph is one component in a wider ecosystem. Its deployment benefits are apparent to the consumer and the developer: add the metatags, get the "likes," know your customers.

Such consumer causality is critical to the adoption of any semantic mark-up. We've seen it before with microformats, whose eventual popularity was driven by their ability to improve how a page is represented in search engine listings, and not by an abstract desire to structure the unstructured. Successful adoption will often entail sacrificing standardization and semantic purity for pragmatic ease-of-use; this is where the semantic web appears to have stumbled, and where linked data will most likely succeed.

Linked data intends to make the Web more interconnected and data-oriented. Beyond this outcome, the term is less rigidly defined. I would argue that linked data is more of an ethos than a standard, focused on providing context, assisting in disambiguation, and increasing serendipity within the user experience. This idea of linked data can be delivered by a number of existing components that work together on the data, platform, and application levels:

  • Entity provision: Defining the who, what, where and when of the Internet, entities encapsulate meaning and provide context by type. In its most basic sense, an entity is one row in a list of things organized by type -- such as people, places, or products -- each with a unique identifier. Organizations that realize the benefits of linked data are releasing entities like never before, including the publication of 10,000 subject headings by the New York Times, admin regions and postcodes from the UK's Ordnance Survey, placenames from Yahoo GeoPlanet, and the data infrastructures being created by Factual [disclosure: I've just signed on with Factual].
  • Entity annotation: There are numerous formats for annotating entities when they exist in unstructured content, such as a web page or blog post. Facebook's OpenGraph is a form of entity annotation, as are HTML5 microdata, RDFa, and microformats such as hcard. Microdata is the shiny, new player in the game, but see Evan Prodromou's great post on RDFa v. microformats for a breakdown of these two more established approaches.
  • Endpoints and Introspection: Entities contribute best to a linked data ecosystem when each is associated with a Uniform Resource Identifier (URI), an Internet-accessible, machine-readable endpoint. These endpoints should provide introspection, the means to obtain the properties of that entity, including its relationship to others. For example, the Ordnance Survey URI for the "City of Southampton" is http://data.ordnancesurvey.co.uk/id/7000000000037256. Its properties can be retrieved in machine-readable format (RDF/XML, Turtle, and JSON) by appending an "rdf," "ttl," or "json" extension to the above. To be properly open, URIs must be accessible outside a formal API and authentication mechanism, exposed to semantically-aware web crawlers and search tools such as Yahoo BOSS. Under this definition, local business URLs, for example, can serve in part as URIs -- 'view source' to see the semi-structured data in these listings from Yelp (using hcard and OpenGraph), and Foursquare (using microdata and OpenGraph).
  • Entity extraction: Some linked data enthusiasts long for the day when all content is annotated so that it can be understood equally well by machines and humans. Until we get to that happy place, we will continue to rely on entity extraction technologies that parse unstructured content for recognizable entities, and make contextually intelligent identifications of their type and identifier. Named entity recognition (NER) is one approach that employs the above entity lists, which may also be combined with heuristic approaches designed to recognize entities that lie outside of a known entity list. Yahoo, Google and Microsoft are all hugely interested in this area, and we'll see an increasing number of startups like Semantinet emerge with ever-improving precision and recall. If you want to see how entity extraction works first-hand, check out Reuters-owned Open Calais and experiment with their form-based tool.
  • Entity concordance and crosswalking: The multitude of place namespaces illustrates how a single entity, such as a local business, will reside in multiple lists. Because the "unique" (U) in a URI is unique only to a given namespace, a world driven by linked data requires systems that explicitly match a single entity across namespaces. Examples of crosswalking services include: Placecast's Match API, which returns the Placecast IDs of any place when supplied with an hcard equivalent; Yahoo's Concordance, which returns the Where on Earth Identifier (WOEID) of a place using as input the place ID of one of fourteen external resources, including OpenStreetMap and Geonames; and the Guardian Content API, which allows users to search Guardian content using non-Guardian identifiers. These systems are the unsung heroes of the linked data world, facilitating interoperability by establishing links between identical entities across namespaces. Huge, unrealized value exists within these applications, and we need more of them.
  • Relationships: Entities are only part of the story. The real power of the semantic web is realized in knowing how entities of different types relate to each other: actors to movies, employees to companies, politicians to donors, restaurants to neighborhoods, or brands to stores. The power of all graphs -- these networks of entities -- is not in the entities themselves (the nodes), but how they relate together (the edges). However, I may be alone in believing that we need to nail the problem of multiple instances of the same entity, via concordance and crosswalking, before we can tap properly into the rich vein that entity relationships offer.
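The endpoint-and-introspection convention above is mechanical enough to sketch. The helper below only constructs the format-specific URLs described in the text; actually dereferencing them requires network access to a live endpoint, so that step is left as a comment.

```python
# The Ordnance Survey URI for the City of Southampton, from the text.
URI = "http://data.ordnancesurvey.co.uk/id/7000000000037256"

def endpoint(uri, fmt):
    """Machine-readable endpoint for an entity URI, following the
    convention of appending a format extension ('rdf', 'ttl', 'json')."""
    return uri + "." + fmt

for fmt in ("rdf", "ttl", "json"):
    print(endpoint(URI, fmt))

# An HTTP GET on the ".json" endpoint would return the entity's
# properties — including links to related entities — as JSON,
# e.g. with urllib.request.urlopen(endpoint(URI, "json")).
```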

The approaches outlined above combine to help publishers and application developers provide intelligent, deep and serendipitous consumer experiences. Examples include the semantic handset from Aro Mobile, the BBC's World Cup experience, and aggregating references on your Facebook news feed.

Linked data will triumph in this space because efforts to date focus less on the how and more on the why. RDF, SPARQL, OWL, and triple stores are onerous. URIs, micro-formats, RDFa, and JSON, less so. Why invest in difficult technologies if consumer outcomes can be realized with extant tools and knowledge? We have the means to realize linked data now -- the pieces of the puzzle are there and we (just) need to put them together.

Linked data is, at last, bringing the discussion around to the user. The consumer "end" trumps the semantic "means."






May 21 2010

Four short links: 21 May 2010

  1. Infrastructures (xkcd) -- absolutely spot-on.
  2. The Michel Thomas App: Behind the Scenes (BERG) -- not interesting to me because it's iPhone, but for the insight into the design process. The main goal here was for me to do just enough to describe the idea, so that Nick could take it and iterate it in code. He’d then show me what he’d built; I’d do drawings or further animations on top of it, and so on and so on. It’s a fantastic way of working. Before long, you start finishing each others’ sentences. Both of us were able to forget about distinguishing between design and code, and just get on with thinking through making together. It’s brilliant when that happens.
  3. Open Government and the World Wide Web -- Tim Berners-Lee offered his "Five-Star" plan for open data. He said public information should be awarded a star rating based on the following criteria: one star for making the information public; a second is awarded if the information is machine-readable; a third star if the data is offered in a non-proprietary format; a fourth is given if it is in Linked Data format; a fifth if it has actually been linked. Not only a good rating system, but a clear example of the significantly better communication by semantic web advocates. Three years ago we'd have had a wiki specifying a ratings ontology with a union of evaluation universes reconciled through distributed trust metrics and URI-linked identity delivered through a web-services accessible RDF store, a prototype of one component of which was running on a devotee's desktop machine at a university in Bristol, written in an old version of Python. (via scilib on Twitter)
  4. Data Access, Data Ownership, and Sharecropping -- With Flickr you can get out, via the API, every single piece of information you put into the system. Every photo, in every size, plus the completely untouched original. (which we store for you indefinitely, whether or not you pay us) Every tag, every comment, every note, every people tag, every fave. Also your stats, view counts, and referers. Not the most recent N, not a subset of the data. All of it. It’s your data, and you’ve granted us a limited license to use it. Additionally we provide a moderately competently built API that allows you to access your data at rates roughly 500x faster then the rate that will get you banned from Twitter. Asking people to accept anything else is sharecropping. It’s a bad deal. (via Marc Hedlund)
