
July 28 2011

Visualizing structural change

My ongoing series about the elmcity project has shown a number of ways in which I invite participants to think like the web. One of the principles I try to illustrate by example is:

3. Know the difference between structured and unstructured data

Participants learn that calendars on web pages and in PDF files don't syndicate around the web, but calendars in the structured iCalendar format do. They also learn a subtler lesson about structured data. Curators of elmcity hubs manage the settings for their hubs, and for their feeds, by tagging Delicious bookmarks using a name=value syntax that enables those bookmarks to work as associative arrays (also known as dictionaries, hashtables, and mappings). For example, here's a picture of the bookmark that defines the settings for the Honolulu hub:

The bookmark's tags represent these attributes:

contact: elmcity@alohavibe.com
facebook: yes
header_image: no
radius: 300
title: Hawaii Events courtesy of AlohaVIBE.com
twitter: alohavibe
tz: hawaiian
where: honolulu,hi

Curators use Delicious to declare these sets of attributes, but the elmcity service doesn't always retrieve them from Delicious. Instead it syncs the data to Azure tables and blobs. When it needs to use one of these associative arrays it fetches an XML chunk from an Azure table or a JSON blob from the Azure blob store. Both arguably qualify as NoSQL mechanisms but I prefer to define things according to what they are instead of what they're not. To me these are just ways to store and retrieve associative arrays.
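
The service itself is written in C# and keeps these dictionaries in Azure tables and blobs, but the underlying move is easy to sketch. Here's a minimal Python illustration, using made-up tags, of how name=value tags become an associative array that can be stored as a JSON blob:

import json

# Hypothetical name=value tags copied from a curator's Delicious bookmark.
tags = [
    "contact=elmcity@alohavibe.com",
    "facebook=yes",
    "radius=300",
    "tz=hawaiian",
    "where=honolulu,hi",
]

# Split each tag on the first '=' to build the associative array.
settings = dict(tag.split("=", 1) for tag in tags)

# Serialize it; a chunk like this is what gets stored as a JSON blob
# (or, equivalently, as an XML chunk in an Azure table).
print(json.dumps(settings, indent=2))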

Visualizing change history for elmcity metadata

Recently I've added a feature that enables curators to review the changes they've made to the metadata for their hubs and feeds. The other day, for example, I made two changes to the Keene hub's registry. I added a new feed, and I added a tag to an existing feed. You can see both changes highlighted in green on this change history page. A few hours later I renamed the tag I'd added. That change shows up in yellow here. On the following day I deleted three obsolete feeds. That change shows up in yellow here and in red here.

These look a lot like Wikipedia change histories, or the "diffs" that programmers use to compare versions of source files. But Wikipedia histories and version control diffs compare unstructured texts. When you change structured data you can, at least in theory, visualize your changes in more focused ways.

One of the great ironies of software development is that although computer programs are highly structured texts, we treat them just like Wikipedia articles when we compare versions. I've had many discussions about this over the years with my friend Greg Wilson, proprietor of the Software Carpentry project. We've always hoped that mainstream version control systems would become aware of the structure of computer programs. So far we've been disappointed, and I guess I can understand why. Old habits run deep. I am, after all, writing these words in a text editor that emulates emacs.


Maybe, though, we can make a fresh start as the web of data emerges. The lingua franca of the data web was going to be XML; now the torch has passed to JSON (JavaScript Object Notation). Both are widely used to represent all kinds of data structures whose changes we might want to visualize in structured ways.

The component at the heart of the elmcity's new change visualizer is a sweet little all-in-one-page web app by Tom Robinson. It's called JSON Diff. To try it out in a simple way, let's use this JSON construct:

{
  "Evernote" :
   {
   "InstalledOn":"3/19/2008",
   "Version": "3.0.0.539"
   },
  "Ghostery IE Plugin" :
   {
   "InstalledOn":"7/9/2011",
   "Version": "2.5.2.0"
   }
}

These are a couple of entries from the Programs and Features applet in my Windows Control Panel. If my system were taking JSON snapshots of those entries whenever they changed, and if I were later to upgrade the Ghostery plugin to (fictitious future version) 2.6.1.0, I could see this JSON Diff report:

You can try it yourself at this JSON Diff URL. Or if you're running Internet Explorer, which the original JSON Diff doesn't support, you can copy that JSON chunk and paste it into one of the earlier examples. The elmcity adaptation of JSON Diff, which uses jQuery to abstract browser differences, does work with IE.

It's worth noting that the original JSON Diff has the remarkable ability to remember any changes you make by dynamically tweaking its URL. It does so by tacking your JSON fragments onto the end of the URL after the hash symbol (#) as one long fragment identifier! The elmcity version sacrifices that feature in order to avoid running into browser-specific URL-length limits, and because it works with a server-side companion that can feed it data. But it's cool to see how a self-contained single-page web app can deliver what is, in effect, a web service.
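
To make the comparison concrete, here's a minimal Python sketch, not Tom Robinson's code, of the kind of structured diff JSON Diff renders: given two snapshots of the installed-programs data shown above, report keys that were added, removed, or changed.

# Two snapshots of the "Ghostery IE Plugin" entry shown above, before and
# after the (fictitious) upgrade to 2.6.1.0.
before = {"Ghostery IE Plugin": {"InstalledOn": "7/9/2011", "Version": "2.5.2.0"}}
after  = {"Ghostery IE Plugin": {"InstalledOn": "7/9/2011", "Version": "2.6.1.0"}}

def diff(old, new, path=""):
    # Walk both dictionaries and report keys added, removed, or changed.
    for key in sorted(set(old) | set(new)):
        where = path + "/" + key
        if key not in old:
            print("added  ", where, "=", new[key])
        elif key not in new:
            print("removed", where, "=", old[key])
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            diff(old[key], new[key], where)
        elif old[key] != new[key]:
            print("changed", where, ":", old[key], "->", new[key])

diff(before, after)
# changed /Ghostery IE Plugin/Version : 2.5.2.0 -> 2.6.1.0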

What changed, and when?

A key question, in many contexts, is: "What changed, and when?" In the Heavy Metal Umlaut screencast I animated the version history of a Wikipedia page. It was a fascinating exercise that sparked ideas about tools that would automate the process. Those tools haven't arrived yet. We could really use them, and not just for Wikipedia. In law and in journalism the version control discipline practiced by many (but not all!) programmers is tragically unknown. In these and in other fields we should expect at least what Wikipedia provides -- and ideally better ways to visualize textual change histories.

But we can also expect more. Think about the records that describe the status of your health, finances, insurance policies, vehicles, and computers. Or the products and personnel of companies you work for or transact with. Or the policies of governments you elect. All these records can be summarized by key status indicators that are conceptually just sets of name/value pairs. If the systems that manage these records could produce timestamped JSON snapshots when indicators change, it would be much easier to find out what changed, and when.






June 09 2011

Why Facebook isn't the best home for your public events

In an earlier episode of this series I discussed how Facebook events can flow through elmcity hubs by way of Facebook's search API. Last week I added another, and more direct, method. Now you can use a Facebook iCalendar URL (the export link at the bottom of Facebook's Events page) to route your public events through an elmcity hub.

The benefit, of course, is convenience. If you're promoting a public community event, Facebook is a great way to get the word out and keep track of who's coming. Ideally you should only have to write down the event data once. If you can enter the data in Facebook and then syndicate it elsewhere, that seems like a win.

In Syndicating Facebook events I explain how this can work. But I also suggest that your Facebook account might not be the best authoritative home for your public event data. Let's consider why not.

Here's a public event that I'm promoting:

Facebook public event

Here's how it looks in a rendering of the Keene elmcity hub:

Rendering of the Keene elmcity hub

And here's the link to the End of the world (again) event:

https://www.facebook.com/event.php?eid=207438602626457

Did you click it? If so, one of two things happened. If you were logged into Facebook, you saw the event. If not, you saw this:



Facebook login page




Is this a public event or not? It depends on what you mean by public. In this case the event is public within Facebook but not available on the open web. The restriction is problematic. Elmcity hubs are transparent conduits: they reveal their sources, curators do their work out in the open, and communities served by elmcity hubs can see how those hubs are constituted. Quasi-public URLs like this one aren't in the spirit of the project.

My end-of-the-world event is obviously an illustrative joke. But consider two other organizations whose events appear in that elmcity screenshot: the Gilsum Church and the City of Keene. These organizations are currently using Google Calendar to manage their public events. They use Google Calendar's widget to display events on their websites, and they route Google Calendar's iCalendar feeds through the elmcity hub.

Now that elmcity can receive iCalendar feeds from Facebook, the church and the city could use their Facebook accounts, instead of Google Calendar, to manage their public events. Should they? I think not. Public information should be really public, not just quasi-public.

What's more, organizations should strive to own and control their online identities (and associated data) to the extent they can. From that perspective, using services like Google Calendar or Hotmail Calendar is also problematic. But you have choices. While it's convenient to use the free services of Google Calendar or Hotmail Calendar, and I recommend both, I regard them as training wheels. An organization that cares about owning its identity and data, as all ultimately should, can use any standard calendar system to publish a feed to a URL served by a host that it pays and trusts, using an Internet domain name that it paid for and owns.

Either way, how could an organization manage its public event stream using standard calendar software while still tapping into Facebook's excellent social dynamics? Here's what I'd like to see:

Example Facebook login page

It's great that Facebook offers outbound iCalendar feeds. I'd also like to see it accept inbound feeds. And that should work everywhere, by the way, not just for Facebook and not just for calendar events. Consider photos. I should be able to pay a service to archive and manage my complete photo stream. If I choose to share some of those photos on Facebook and others on Flickr, both should syndicate the photos from my online archive using a standard feed protocol -- say Atom, or if richer type information is needed, OData.

The elmcity project is, above all, an invitation to explore what it means to be the authoritative source of your own data. Among other things, it means that we should expect services to be able to use our data without owning our data. And that services should be able to acquire our data not only by capturing our keystrokes, but also by syndicating from URLs that we claim as our authoritative sources.






April 20 2011

Uniform APIs for the data web

The elmcity service connects to a half-dozen other services, including Eventful, Upcoming, EventBrite, Facebook, Delicious, and Yahoo. It's nice that each of these services provides an API that enables elmcity to read their data. It would be even nicer, though, if elmcity didn't have to query, navigate, and interpret the results of each of these APIs in different ways.

For example, the elmcity service asks the same question of Eventful, Upcoming, and EventBrite: "What are the titles, dates, times, locations, and URLs of recent events within radius R of location L?" It has to ask that question three different ways, and then interpret the answers three different ways. Can we imagine a more frictionless approach?

I can. Here's how the question might be asked in a general way using the Open Data Protocol (OData):

http://odata.[eventful|upcoming|eventbrite].com/events?$filter=type eq 'recent' and radius lt 5

An OData reply is a feed of Atom entries, optionally annotated with types. Here's a sketch of how one of those entries might look as part of a general OData answer to the question:

<entry>
  <id>http://odata.hypothetical.com/events/ids('1428475')</id>
  <title type="text">Carola Zertuche presents traditional flamenco</title>
  <updated>2011-04-18T23:00:00Z</updated>
  <author><name /></author>
  <m:properties 
       xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" 
       xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
    <d:Id>1428475</d:Id>
    <d:start m:type="Edm.DateTime">2011-05-12T12:00</d:start>
    <d:latitude m:type="Edm.Double">37.809600000000003</d:latitude>
    <d:longitude m:type="Edm.Double">-122.4106</d:longitude>
    <d:url>http://hypothetical.com/events/1428475</d:url>
    <d:venue_id>83329</d:venue_id>
  </m:properties>
</entry>

(With the addition of $format=json to the query URL, the same information arrives as a JSON payload.)

Of course there would still be differences among these APIs. Each of the three services in this example has its own naming conventions and its own way of modeling events and venues. It would still take some work to abstract away those differences. But you'd be using a common query mechanism, a common set of data representations, a common way of linking them together, and a common set of helper libraries for many programming environments.
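
Here's a rough Python sketch of what that commonality buys you. The feed below is the hypothetical entry shown above, inlined so the sketch runs as-is; the point is that a single loop over Atom entries and typed d: properties would work against any of the three services.

import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
D = "{http://schemas.microsoft.com/ado/2007/08/dataservices}"
M = "{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}"

# A feed with one entry of the kind shown above, inlined so the sketch runs as-is.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
  <id>http://odata.hypothetical.com/events/ids('1428475')</id>
  <title type="text">Carola Zertuche presents traditional flamenco</title>
  <m:properties
       xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
       xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
    <d:start m:type="Edm.DateTime">2011-05-12T12:00</d:start>
    <d:latitude m:type="Edm.Double">37.8096</d:latitude>
    <d:longitude m:type="Edm.Double">-122.4106</d:longitude>
    <d:url>http://hypothetical.com/events/1428475</d:url>
  </m:properties>
</entry>
</feed>"""

# The common interpretation step: one loop reads typed properties out of the
# Atom entries, no matter which event service produced them.
for entry in ET.fromstring(feed).findall(ATOM + "entry"):
    props = entry.find(M + "properties")
    print(entry.findtext(ATOM + "title"),
          props.findtext(D + "start"),
          props.findtext(D + "latitude"),
          props.findtext(D + "longitude"),
          props.findtext(D + "url"))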

A WordPress thought experiment

Blog publishing systems have long implemented APIs that enable client applications to fetch and post blog entries. For historical reasons there are a variety of these APIs. Because they're widely adopted in the blog domain, it's pretty likely that an application that works with one blog system's implementation of one of the APIs will work with another blog system's implementation of the same API. But these APIs are specific to the blog domain.

What if blogs had come of age in an era when a uniform kind of API was expected? We could then ask questions of blogs in the same way we could ask questions of event services in the hypothetical example shown above, or of any other kind of service. And we could interpret the answers in the same way too.

Suppose we want to ask a blog service: "What are the published entries since April 10, 2011?" Here's an OData version of the question:

http://odata.hypothetical.com/wp_posts?$filter=post_date gt datetime'2011-04-10' and post_status eq 'publish'

And here's an answer, in JSON format, from a hypothetical WordPress OData service:

{"d" : { "results": [ 
  { "__metadata": {
    "uri": "https://odata.sqlazurelabs.com/OData.svc/v0.1/goyot8lwrc/wordpress/wp_posts(7L)", 
    "type": "wordpress.wp_posts" }, 
    "comment_count": "0", 
    "comment_status": "open", 
    "guid": "http://elmcity-companion.cloudapp.net/wordpress/?p=7", 
    "ID": "7", 
    "post_author": "1", 
    "post_content": "OData as universal API?", 
    "post_date": "\/Date(1303216978000)\/", 
    "post_name": "a-wordpress-thought-experiment", 
    "post_status": "publish", 
    "post_title": "A WordPress thought experiment"
    }
  ]
}}

Except it's not hypothetical! The guid shown in this example points to a real WordPress post. And the uri in the example points to a live OData service that emits the chunk of JSON we see here. If you're so inclined, you can start at the root of the service and explore all the tables used in that WordPress blog.

How is this possible? I'm running WordPress on Azure; this instance of WordPress uses the SQL Azure database; the database is OData-enabled. In this case I'm allowing only read access. But if the database were writable a blog client could add new entries by sending HTTP POST requests with Atom payloads.
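
For what it's worth, here's a sketch of what such an insert might look like. The endpoint is hypothetical and the payload is pared down; a real OData service would also demand authentication and whatever columns the table requires.

import urllib.request

# Hypothetical writable endpoint for the wp_posts collection.
url = "https://odata.hypothetical.com/wordpress/wp_posts"

# A minimal Atom entry carrying the new row's properties.
entry = """<?xml version="1.0" encoding="utf-8"?>
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
       xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <title type="text"></title>
  <updated>2011-04-20T00:00:00Z</updated>
  <author><name /></author>
  <content type="application/xml">
    <m:properties>
      <d:post_title>Posted via OData</d:post_title>
      <d:post_content>Hello from a hypothetical OData blog client.</d:post_content>
      <d:post_status>publish</d:post_status>
    </m:properties>
  </content>
</entry>"""

request = urllib.request.Request(
    url,
    data=entry.encode("utf-8"),
    headers={"Content-Type": "application/atom+xml"},
    method="POST",
)
# urllib.request.urlopen(request)  # would create the row, if the service allowed writes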

OData for MySQL

Of course WordPress more typically runs on MySQL. Can we do the same kind of thing there? Sort of. Here's a query that fetches posts from a Linux/MySQL instance of WordPress and returns them as an Atom feed with OData annotations:

http://www.elmcity.info/mysqlodata/?wp_posts

In this case the OData view of the underlying MySQL database is provided by MySQLOData, a "PHP-based MySQL OData Server library which exposes all data within a MySQL database to the world in OData ATOM or JSON format."

There are two issues here. One is my fault. I'm not fluent in PHP and I haven't been able to get MySQLOData working to its full capability. Do you know of a live instance of MySQLOData that is properly installed and configured? If so, please show me the URL; I'd like to try it out.

The second issue is more fundamental. Suppose MySQLOData becomes a full implementation of OData. In any environment where there is PHP and MySQL, any application built on MySQL could automatically expose an API based on a common query mechanism, a common set of data representations, a common way of linking them together, and a common set of helper libraries. Great! But what if there's no PHP in the environment? What if there's only Python? Or only Ruby? A Django- or Rails-based service shouldn't have to add PHP to the mix in order to provide a uniform API.

If MySQL itself could present an OData interface, then layered services written in any language could automatically provide APIs in a standard way. Here's a description of how that might work:

If we provide access to existing databases as though they were in hypertext form, the system will get off the ground quicker ... What is required is a gateway program which will map an existing structure onto the hypertext model, and allow limited (perhaps read-only) access to it.

If you know your web history that may sound familiar. It's from Tim Berners-Lee's 1989 proposal for the World Wide Web.

There's more than one way to do it

Of course OData isn't the only way services could automatically provide uniform APIs. Such things typically come in several flavors. In the blog domain there have always been a few of them: the Blogger API, the metaWeblog API, etc. I think it's unlikely that we'll end up with a single flavor of uniform API. But right now we don't have any uniform flavor! Every service that provides an API has to invent its own query mechanism, data representations, and helper libraries. If you want to mash up services — as we increasingly do — the differences among these APIs create a lot of friction.

OData looks to me like one good way to overcome that friction. I'd love to see OData gateways co-located with every popular database. With such gateways in place, the web of data we're collectively trying to build would get off the ground quicker.





December 22 2010

How will the elmcity service scale? Like the web!

During a recent talk at Harvard's Berkman Center, Scott MacLeod asked (via the IRC backchannel): "How does the elmcity service scale?" He wondered, in particular, whether the service could support an online university like the World University and School that might produce an unlimited number of class schedules.

My short answer was that the elmcity service scales like the web. But what does that really mean? I promised Scott that I'd spell it out here. We'll start with an analogy. As I mentioned in The power of informal contracts, the elmcity project envisions a web of calendar feeds that's analogous to the blogosphere's web of RSS and Atom feeds. We take for granted that the blogosphere scales like the web. A blog feed is just a special kind of web page. Anybody can create a blog and publish its feed at some URL. Why not calendars too? We haven't thought about them in the same way, but the ICS (iCalendar) files that our calendar programs export are the moral equivalents of the RSS and Atom feeds that our blog publishing tools export. Anybody can create a calendar and publish its feed at some URL.

These webs -- of HTML pages, of blog feeds, of calendar feeds -- are notionally webs of peers. We can all publish, and we can all read, without relying on a central authority or privileged hub. There are, to be sure, powerful centralized services. My blog, for example, is one of millions hosted at wordpress.com, aggregated by Bloglines and Google Reader, and indexed by Google and Bing. But these services, while convenient, are optional. So long as we can publish our blogs somewhere online, advertise their URLs, and get the DNS to resolve their domain names, we can have a working blogosphere. The necessary and sufficient condition is that we can all publish resources (e.g., pages and feeds), and that we can all access those resources.

For the calendarsphere that I envision, a service like elmcity is likewise optional. Let's suppose that the World University and School succeeds wildly. At any given moment there are tens of thousands of courses on offer, each with its own course page and also with its own calendar. Instructors publish course pages using any web publishing tool, and also publish calendars using any calendar publishing tool -- Google Calendar, or Outlook, or Apple iCal, or another calendar program. Students pick schedules of courses, bookmark the course pages, and load the course calendars into any of these same calendar programs. The calendar software merges the separate course calendars and combines them with the students' personal calendars. These calendar programs are thus aggregators of calendar feeds in the same way that feedreaders like NetNewsWire or Google Reader are aggregators of blog feeds.

Given a baseline web of peers, it's useful to be able to merge our individual views of them into pooled spaces. NetNewsWire is a personal feedreader, but Google Reader is social. In the pool created by Google Reader, data finds data and people find people. The elmcity service aims to create that same kind of effect in the realm of public calendar events. When we pool our separate calendars, we publicize the events that we are promoting, we discover events that others are promoting, and we see all our public events on common timelines.

What constrains our ability to scale out pools of calendars? Let's continue the analogy to the blogosphere. Google Reader constitutes one pooled space for blog feeds, Bloglines another. Because the data aggregated by these services conforms to open standards (i.e., RSS and Atom), other services can create blog pools too. Likewise in the calendarsphere, Google Calendar is one way to pool calendars, the elmcity service is another, Calagator is a third. Others can play too.

How can we scale these providers of calendar pools? Along one axis, each provider needs to be able to grow its computing power. Google Calendar scales on this axis by using Google's cloud platform. The elmcity service uses Azure, the Microsoft cloud platform. Note that elmcity, unlike Google Calendar, is an open source service. That means you could run your own instance of it, using your own Azure account, but you'd still be relying on the Azure compute fabric.

Calagator, based on Ruby on Rails, could be deployed either to a conventional hosting environment or to a cloud platform. It would thus scale, along the compute axis, as either environment allows. The elmcity service could be used in this way too. The service is written for Azure, but the core aggregation engine is independent of Azure and could be deployed to a conventional hosting environment.

For feed aggregators, another axis of scale is the number of feeds that can be processed. When that number grows, the time required to connect to many feeds and ingest their contents becomes a constraint. The elmcity service currently supports 50 calendar hubs. Thrice daily, each hub pulls data from Eventful, Upcoming, EventBrite, Facebook, and a list of iCalendar feeds. So far a single Azure worker role can easily do all this work. I'll dial up the number of workers if needed, but first I want to squeeze as much parallelism as I can out of each worker. To that end, I recently upgraded to the 4.0 version of the .NET Framework in order to exploit its dramatically simplified parallel processing. In this week's companion article I show how the elmcity service uses that new capability to optimize the time required to gather feeds from many sources.
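
The gathering itself is done in C# with the .NET 4 parallel extensions, but the pattern is language-neutral. Here's a minimal Python sketch of the same idea, with placeholder feed URLs: because the work is dominated by waiting on the network, fetching the feeds concurrently collapses many sequential waits into one.

from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder feed URLs; a real hub reads its list from the registry.
feed_urls = [
    "http://example.org/arts.ics",
    "http://example.org/sports.ics",
    "http://example.org/city-government.ics",
]

def fetch(url):
    # Fetch one iCalendar feed; network time dominates, so these overlap well.
    with urllib.request.urlopen(url, timeout=30) as response:
        return url, response.read()

# Gathering feeds is I/O-bound, so a small thread pool does the trick.
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, body in pool.map(fetch, feed_urls):
        print(url, len(body), "bytes")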

Pub/sub networks can also scale by coalescing feeds. Consider a calendar hub operated, for some city, by the online arm of that city's newspaper. One model is flat. The newspaper runs a hub whose registry lists all the calendar feeds in town. But another model is hierarchical. In that model, there's a hub for arts and culture, a hub for sports and recreation, a hub for city government, and so on. Each hub gathers events from many feeds, and publishes the merged result on its own website for its own constituency. If the newspaper wants to include all those feeds, it can list them individually in its own registry. But why aggregate arts, sports, or recreation feeds more than once? The newspaper's uber-hub can, instead, reuse the arts, sports, and recreation feeds curated by those respective hubs, adding their merged outputs to its own set of curated feeds. Such reuse can cut down the computational time and effort required to propagate feeds throughout the network.

None of these mechanisms will matter, though, until a vibrant ecosystem of calendar feeds requires them. That's the ultimate constraint. Scaling the calendarsphere isn't a problem yet, but it would be a good problem to have. First, though, we've got to light up a whole bunch of feeds.







November 12 2010

The iCalendar chicken-and-egg conundrum

If you're running a website for a school, or a business, or a band, or a club, there's probably a tab on your site's home page labeled Events. The calendar that shows up on that page is most likely driven by some kind of content management system that collects your events using a form, stores them in a database, and renders them through an HTML template to produce your events page.

One of the premises of this series is that publishing calendars as HTML, for people to read on your events page, is necessary but not sufficient. You should also want to publish events as data for computers to process and for networks to syndicate. And you should expect your content management system, which already has the data in a machine-friendly format, to do that for you. But this almost never happens, and so we have a classic chicken-and-egg conundrum. Nobody expects that a website's events page should offer a parallel data feed, so nobody demands that CMS vendors provide one, and as a result hardly any do.

Consider the case of the Hannah Grimes Center in Keene, NH. Its events page is provided by the Constant Contact service as part of a package of event marketing features that include registration and social media promotion. But there's only an HTML page; no companion iCalendar feed is available. That means the Hannah Grimes Center's events aren't available as a stream of data that can syndicate through networks and merge with other calendars.

The irony here is that a previous version of the Hannah Grimes site did provide an iCalendar feed. That version of the site was based on Drupal, which has a calendar module that publishes event data two ways: as HTML, and as iCalendar. Last week I happened to meet the site's coordinator; she wondered why their event stream had dropped out of syndication. I explained, and she wrote back the other day saying: "We're in the process of converting back to Drupal for events!"

Drupal is a great system, and that's fine, but it shouldn't be necessary. Every CMS that produces an events page should also produce an iCalendar feed. When I talk to CMS vendors they confirm what I've said here: Customers don't ask, so vendors don't provide. What I also hear from them, though, is that producing an iCalendar feed seems like a lot of work. It isn't. Libraries are available to simplify the translation from programming-language data structures to iCalendar feeds. And even without such libraries, it's not hard to whip up a simple iCalendar feed.

My vision for a network of iCalendar feeds is inspired by the early blogosphere's network of RSS feeds. Content management systems, like Radio UserLand, produced those feeds automatically. But many of us also produced our own feeds without the aid of kits. We did it "by hand" -- that is, by interpolating variables into XML templates. It was easy to do, and it worked well enough to enable our homegrown feeds to participate in the emerging RSS ecosystem. In this week's companion article I show how the elmcity service uses a library to produce iCalendar feeds. But if you look at the sample feed at the end of that article, you'll see that -- like the RSS 0.9 many of us created by hand -- it isn't really rocket science.
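
For the record, here's roughly what the by-hand approach amounts to in Python terms: interpolate each event's fields into an iCalendar template, just as early bloggers interpolated into RSS templates. (The sample event is invented, and the sketch skips niceties like line folding and time zones, which a library such as DDay.iCal handles for you.)

from datetime import datetime

# Events as they might come out of a CMS database (invented sample data).
events = [
    {"title": "Board meeting", "start": datetime(2010, 11, 16, 19, 0),
     "end": datetime(2010, 11, 16, 20, 30)},
]

vevent_template = """BEGIN:VEVENT
UID:{uid}
DTSTART:{start:%Y%m%dT%H%M%S}
DTEND:{end:%Y%m%dT%H%M%S}
SUMMARY:{title}
END:VEVENT"""

# Interpolate each event into the template, then wrap the results in a calendar.
body = "\n".join(
    vevent_template.format(uid=f"{i}@example.org", **event)
    for i, event in enumerate(events)
)

feed = f"BEGIN:VCALENDAR\nVERSION:2.0\nPRODID:-//example.org//handmade//EN\n{body}\nEND:VCALENDAR"
print(feed)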

There was, of course, another reason why the homegrown approach worked as well as it did. It's true that we didn't have tools to help us write RSS feeds, but we did have a tool to validate them. The Feed Validator, originally created by Mark Pilgrim and Sam Ruby, was -- and remains -- a pillar supporting the RSS (and now, RSS/Atom) ecosystem.

When I set out to bootstrap an iCalendar network, I realized that there was no analogous service to validate iCalendar feeds. So I reached out to Doug Day. He's the author of DDay.iCal, which is the open source library that the elmcity service uses to parse the iCalendar feeds it gathers into each hub, and also write iCalendar feeds that merge all of the inputs to each hub. Doug thought there should be a validator for iCalendar feeds, so he stepped up to the plate and created one at icalvalid.cloudapp.net.

If you're a school or a business or a band or a club whose website sports an Events tab that doesn't offer a companion iCalendar feed, I hope you'll ask your CMS vendor why not. If you're one of the vast majority of CMS vendors whose systems create events pages but don't produce iCalendar feeds, I hope you won't wait to be asked. With or without tool support, it's not hard to make them. (See this week's companion article for a C# example based on Doug Day's DDay.iCal library.) And, thanks to Doug, there's now a validation service to help you make them right.







November 04 2010

Heds, deks, and ledes

When a copy editor applies a real or virtual red pencil to a piece of journalistic prose, he or she is likely to use weird spellings: hed for head (headline), dek for deck (subhead), lede for lead (first paragraph). The idea is that these intentional misspellings will help distinguish an editor's commentary from a writer's prose.

Whether this is a useful convention or just an antiquated habit I really can't say. But the principle of heads, decks, and leads matters more than ever, and not just in journalism. We're all publishers now in one way or another. None of us can predict the contexts in which what we publish will be found. But if we're careful about writing heads, decks, and leads, we'll improve the odds that it will be found.

You can apply this principle to any package of information: an email message, a blog post, an event listing. In an email message the head is clearly the Subject: line; in a blog posting it's the title. What about the deck? Here's where creative analogies come into play. In these cases, only the head will be seen when readers scan their inboxes or RSS feeds. So you might want to collapse the idea of head and deck into a single robust head. As for the lead, well, it's always best to begin an email message or a blog post with a compelling first paragraph.

Event listings are trickier. Here are some possible mappings for two examples from Facebook:

Example 1

Head: World History @ Starving Artist (Keene NH)

Deck: Sunday, November 14 · 8:00pm - 11:30pm, The Starving Artist, 10 West St, Keene, NH

Lead: Music by: WORLD HISTORY, www.worldhistoryband.com, YOUNG MOUNTAIN, www.myspace.com/youngmountainnh, ALL AGES are welcome, $5, http://www.thestarvingartistcollective.com/

Here the best analog for head (or title, or subject) is the What are you planning? field. A reasonable analog for deck combines the When? and Where? fields. And the More info? field maps neatly to the lead.

Example 2

Head: Cold River Ranters at Armadillo's Burritos

Deck: Friday, November 5 · 7:00pm - 9:00pm, 82 Main Street

Lead: None

Following the same pattern, we can map the head and the deck and omit the lead. Now, compare the two mappings. Example 2 lacks some critical information. And that missing information makes it less discoverable than Example 1. Can you spot what's missing? Here's a hint: It's not the absence of the lead. Another hint: It's not even the absence of a full street address including the city and state. The critical difference is that Example 1 includes the city and state in the head and Example 2 doesn't.

I only realized this when I added Facebook to the set of elmcity event sources. The elmcity service uses the Facebook graph API to search for events using search terms like "Keene, NH" or "Santa Rosa, CA." If you run that search for Keene you'll see that Example 1 shows up but Example 2 is missing in action.

Here's a version that would have worked for the Cold River Ranters:

Example 2a

Head: Cold River Ranters at Armadillo's Burritos, Keene, NH

Deck: Friday, November 5 · 7:00pm - 9:00pm, 82 Main Street

But here's a version that wouldn't have worked:

Example 2b

Head: Cold River Ranters at Armadillo's Burritos

Deck: Friday, November 5 · 7:00pm - 9:00pm, 82 Main Street, Keene, NH



It seems that when you use Facebook's API to search for events, it only looks for your phrase in the heads. So, for example, if you searched Facebook events for Starving Artist the results would include:

      {
         "name": "Starving Artist Ent. Presents: This Is Hip Hop",
         "start_time": "2010-11-28T04:00:00+0000",
         "end_time": "2010-11-28T09:00:00+0000",
         "location": "My House Bar & Lounge",
         "id": "157463474273974"
      },
      {
         "name": "Starving Artist Project",
         "start_time": "2010-11-19T03:30:00+0000",
         "end_time": "2010-11-19T06:00:00+0000",
         "location": "Mary's Attic Chicago",
         "id": "137831546269077"
      },
      {
         "name": "World History @ Starving Artist (Keene NH)",
         "start_time": "2010-11-15T04:00:00+0000",
         "end_time": "2010-11-15T07:30:00+0000",
         "location": "The Starving Artist",
         "id": "100599506662874"
      }

The last of these is the same event that's included in a search for Keene, NH. By writing a head that includes the location, the Starving Artist Collective succeeded in making its event discoverable by Keene's event hub. By failing to include the location, the Cold River Ranters missed out on that opportunity. (So did Mary's Attic: A hub tuned to Facebook's virtual channel for Chicago events would have missed that one.)
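
Here's a toy Python simulation, not Facebook's code, of the behavior at work: a location search that consults only each event's name field finds the Starving Artist event and misses the Cold River Ranters.

# A simulation of head-only matching: the search looks only at "name",
# never at "location".
events = [
    {"name": "World History @ Starving Artist (Keene NH)",
     "location": "The Starving Artist"},
    {"name": "Cold River Ranters at Armadillo's Burritos",
     "location": "Armadillo's Burritos, Keene, NH"},
]

def search(events, phrase):
    phrase = phrase.lower()
    return [e for e in events if phrase in e["name"].lower()]

print(search(events, "Keene"))
# Only the Starving Artist event comes back; the Ranters' location lives in
# a field the search never consults.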

I'm sure that neither the Starving Artist Collective nor the Cold River Ranters knows anything about the workings of the Facebook search API. But the Starving Artist Collective has intuited a crucial principle: Headlines matter. Always pack as much distinguishing data into them as available space allows. Heads will always be visible to a scan or a search; decks and leads are active in far fewer contexts. If your headline doesn't create access to those supporting contexts, it will be much harder for people to reach them serendipitously.

In this week's companion article I show how the elmcity service uses the Facebook API. If you're not a developer you won't care about that. But everybody should care about the principle of heads, decks, and leads. We're all publishers. We publish in order to be found, to be read, to connect, to have influence. When we're careful about how we package and layer our information, we become more effective publishers.







October 26 2010

A lesson in civics, public data, and computational principles

Among the elmcity hubs that started up last week were Santa Rosa, Calif., and Bellingham, Wash. Both towns' local newspapers, the Press Democrat and the Bellingham Herald, use a service called Zvents to manage their online calendars. The curators who started the Santa Rosa and Bellingham hubs, Sean Boisen and Tim Sawtell, wondered if they could subscribe these hubs to iCalendar feeds from Zvents.

At first glance the answer was yes. On the Press Democrat's site, for example, if you view the Arts & Crafts category, you'll find this encouraging cluster of icons and links:

An iCalendar feed? Sweet! But alas, while that "Save as iCal" does yield an iCalendar response, it's an empty shell:

BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:PUBLISH
PRODID:Zvents Ical
END:VCALENDAR

Why? Beats me; if someone knows, I'd love to hear the explanation. Meanwhile, what about the corresponding RSS feed? I wasn't hopeful. In my work on the elmcity project I often see an error of the sort I discussed in "Developing intuitions about data." People tend to conflate the purposes of an RSS feed, which typically conveys headlines and links to people, and an iCalendar feed, which conveys dates and times to computers. This category error is so common that I've enshrined it in a slide I've used in several recent talks.

But I opened up the Press Democrat's RSS feed anyway, and here is what I found:

<item>
<title>Event: JLNS Holiday Home Tour & Winter Market at Friedman Event Center, Sat, Nov 20 10:00a</title>
<description>The Junior League of Napa-Sonoma presents a tour of prestigious homes in Bennett Valley, all festively decorated, with all proceeds to benefit local charities</description>
<link>http://events.pressdemocrat.com/santa-rosa-ca/events/show/139081965-jlns-holiday-home-tour-winter-market</link>
<xCal:dtstart>2010-11-20 10:00:00 +0000</xCal:dtstart>
<xCal:dtend>2010-11-20 16:00:00 +0000</xCal:dtend>
<xCal:location>http://events.pressdemocrat.com/santa-rosa-ca/venues/show/672937-friedman-event-center</xCal:location>
</item>

Fig. 1: An event in the Press Democrat's RSS events feed

Even if you know about such things as XML, RSS, and xCal, pretend for a moment that you don't. Anyone can see that there is structure here: <xCal:dtstart>2010-11-20 10:00:00 +0000</xCal:dtstart>. That makes this feed very different from most RSS feeds that purport to represent calendar events, which typically look like this:

<item>
<title>Event: JLNS Holiday Home Tour & Winter Market at Friedman Event Center, Sat, Nov 20 10:00a</title>
<description>The Junior League of Napa-Sonoma presents a tour of prestigious homes in Bennett Valley, all festively decorated, with all proceeds to benefit local charities</description>
<link>http://events.pressdemocrat.com/santa-rosa-ca/events/show/139081965-jlns-holiday-home-tour-winter-market</link>
</item>

Fig. 2: Same event in a typical RSS events feed



We humans have no trouble understanding Sat, Nov 20 10:00a. The year is omitted but we know what's meant. Likewise we can parse a wide range of alternatives, such as Saturday, November 20, at 10:00. Does that mean AM or PM? We just know that it's AM; a home tour wouldn't start on Saturday at 10PM. Conversely we just know that a blues band wouldn't start playing on Saturday at 10AM.

Since we aren't aware that we hold this tacit knowledge, it doesn't occur to us that computers lack it, or that as a result they require explicit rules and structure. But if you want your data to syndicate around the web, you've got to provide rule-based structure. Since iCalendar is the most ubiquitous format for event data, that's currently the best way to do it. Here's that same event in iCalendar:

BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:PUBLISH
PRODID:Zvents Ical
BEGIN:VEVENT
DTSTART:20101120T100000
DTEND:20101120T160000
SUMMARY:Event: JLNS Holiday Home Tour & Winter Market at Friedman 
  Event Center, Sat, Nov 20 10:00a
END:VEVENT
END:VCALENDAR
Fig. 3: Same event in an iCalendar feed

A point that technologists often miss, when we fight religious wars amongst ourselves about competing formats -- RSS versus Atom, iCalendar versus xCalendar, and so on -- is that the existence of structure matters far more than the kind of structure. Fig. 1 and Fig. 3 are two species within the same genus. Fig. 2, though, belongs to another phylum altogether. If you're using the method shown in Fig. 2 to syndicate your data on the web, you're doing it wrong. That RSS feed is no more useful for the purpose than a PDF file, or an HTML file.

When I realized that Zvents produces RSS+xCal feeds, and that multiple newspaper sites rely on Zvents, I added support for that format to the elmcity service. A translator reads RSS+xCal and writes iCalendar. Because the Zvents flavor of RSS+xCal is well structured, it was trivial to create that translator.
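
A sketch of such a translator is short enough to show here. This isn't the elmcity code (that lives in the C# service), and it simplifies date handling and line folding, but the shape is the same: read the xCal fields out of each RSS item and write the corresponding VEVENT. The sample item is the one shown in Fig. 1, and the xCal namespace URI is assumed.

import xml.etree.ElementTree as ET

XCAL = "{urn:ietf:params:xml:ns:xcal}"   # assumed namespace URI for the xCal fields

rss = """<rss version="2.0" xmlns:xCal="urn:ietf:params:xml:ns:xcal"><channel><item>
<title>Event: JLNS Holiday Home Tour &amp; Winter Market at Friedman Event Center, Sat, Nov 20 10:00a</title>
<link>http://events.pressdemocrat.com/santa-rosa-ca/events/show/139081965-jlns-holiday-home-tour-winter-market</link>
<xCal:dtstart>2010-11-20 10:00:00 +0000</xCal:dtstart>
<xCal:dtend>2010-11-20 16:00:00 +0000</xCal:dtend>
</item></channel></rss>"""

def to_ical_datetime(text):
    # '2010-11-20 10:00:00 +0000' -> '20101120T100000' (offset ignored for brevity).
    date, time = text.split()[:2]
    return date.replace("-", "") + "T" + time.replace(":", "")

events = []
for item in ET.fromstring(rss).iter("item"):
    events.append(
        "BEGIN:VEVENT\n"
        f"DTSTART:{to_ical_datetime(item.findtext(XCAL + 'dtstart'))}\n"
        f"DTEND:{to_ical_datetime(item.findtext(XCAL + 'dtend'))}\n"
        f"SUMMARY:{item.findtext('title')}\n"
        f"URL:{item.findtext('link')}\n"
        "END:VEVENT"
    )

print("BEGIN:VCALENDAR\nVERSION:2.0\n" + "\n".join(events) + "\nEND:VCALENDAR")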

This new feature for elmcity hubs creates some interesting opportunities. For example, since each Zvents feed is the result of a query, the set of these RSS+xCal feeds is unbounded. Here's one kind of query used on the Press Democrat's events page; it lists events in the "Dance" category.

http://events.pressdemocrat.com/search?cat=4&st=event

We can easily transform that URL into one that yields the corresponding RSS feed:

http://events.pressdemocrat.com/search?cat=4&st=event&rss=1

Observing this, Tim Sawtell was able to merge a set of categorized feeds into the Santa Rosa hub. In doing so, he illustrated a number of key principles that computational thinkers know and apply:

  1. query -- the feed is the output of an open-ended search
  2. data structure -- a structured representation of the search is available as RSS+xCal
  3. transformation -- from RSS+xCal to iCalendar
  4. abstraction and generalization -- what works for one category works for all

Even more is possible. Suppose you're a grief counselor in Santa Rosa, and you would like to provide your clientele with a comprehensive list of support resources. Here's a useful search:

http://events.pressdemocrat.com/search?swhat=bereavement

It yields two recurring events for two different support groups at Hospice By The Bay.

Free Hospice By The Bay Drop-in Group Supports Newly Bereaved
Join others who are beginning the journey through grief at a free, ongoing, drop-in ...
10/26/2010 Tuesday, 12:00p to 1:00p (repeats 9 times)
Hospice By The Bay, Sonoma CA

Hospice By The Bay Support Group for Spousal/Partner Loss
Hospice By The Bay offers an eight-week support group to help adults who have lost ...
10/26/2010 Tuesday, 10:00a to 11:30a
Hospice By The Bay, Sonoma CA
Fig. 4: Bereavement support group meetings in Santa Rosa, via the Press Democrat

Here's a transformation of that search URL that yields an RSS+xCal data feed:

http://events.pressdemocrat.com/search?swhat=bereavement&rss=1

That feed can now be further transformed into an iCalendar feed and included in an elmcity hub, or in any other cloud-based service or device-based app that reads iCalendar feeds. If you wanted to create a bereavement category in an elmcity hub you'd be off to a great start! But where else would you look? There's plenty of information about public events on the web today. But only a tiny fraction of it exists as structured data that can flow through syndicated networks. Most of it lives in PDF files, or HTML files, that are only valuable to people who find their way to the sites that serve up those files.

In an effort to visualize this iceberg of unstructured information below the waterline of the data web, I added a feature to the elmcity service that searches for recurring events. It works by looking for the kinds of phrases that we humans use in our discourse: first Monday of every month at 9PM or 2nd and 4th Tuesday, 6:15-7:45 pm. In this week's companion article I show how that search harvests pages containing these terms from Google and Bing. Here, let's consider a few of the 3,500 items found when running that kind of search for Santa Rosa:

google: 1253
bing: 2023
google_and_bing: 292

1. Hannah Caratti, Pre-Licensed Professional, Santa Rosa, CA 95404 ... (google)

Every Monday at 6pm - 7pm $20 - $30 per session. Meditation & Stress Reduction Group ... Chronic Pain or Illness Therapist in Santa Rosa, CA ...

2. Bob Greenberg, Marriage & Family Therapist, Santa Rosa, CA 95404 ... (google)

Every Monday at 12am - 12am $40+ per session. An in depth group for adult ...

3. Classes at the Women's Health and Birth Center (google)

Every Monday (except for holiday Mondays). Group/walk-in from 12 noon ... Women's Health and Birth Center since 1993, 583 Summerfield Road Santa Rosa, CA 95405.

4. North Bay Bereavement and Grief Support Programs (google)

Every Monday, Noon-1:30 p.m.. Back to top ... 547 Mendocino Avenue, Santa Rosa, CA 95401 (Parking garage 521 7th Street) ...

Fig. 5: Unstructured event data for Santa Rosa
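
The kind of query behind those results is easy to sketch: combine a location with the recurring-event phrases people actually write, and hand each combination to a search engine. The phrase list and the search URL below are illustrative; the actual harvesting from Google and Bing is described in the companion article.

import urllib.parse

location = "Santa Rosa, CA"

# Phrases people use when announcing recurring events.
recurring_phrases = [
    "every Monday",
    "every Tuesday",
    "first Monday of every month",
    "2nd and 4th Tuesday",
]

# One web search per phrase; pages that match are candidates for
# hand-conversion into structured calendar feeds.
for phrase in recurring_phrases:
    q = urllib.parse.quote_plus(f'"{phrase}" "{location}"')
    print(f"https://www.bing.com/search?q={q}")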

Investigating the fourth item, North Bay Bereavement and Grief Support Programs, we find a bunch of events represented in an unstructured way:

Bereaved Parents: For parents whose young or adult child has died. 2nd and 4th Thursdays, 6:00 - 7:30 p.m.



Family and Caregiver Support Groups: For adults whose loved one has a life-threatening illness.
Every Tuesday, 4:00-5:30 p.m.



Survivors of Suicide: For those who have lost a loved one to suicide.
Every Monday, Noon-1:30 p.m.



People in Grief: For people whose loved one has died.
Every Wednesday, 6:00-7:30 p.m.



Partner Loss - Evening: For adults whose spouse or partner has died.
2nd and 4th Tuesday, 6:15-7:45 p.m.



Partner Loss - Daytime: For adults whose spouse or partner has died.
Every Wednesday, 11:00 a.m. - 12:30 p.m.



Fig. 6: Unstructured event data about bereavement support groups in Santa Rosa

I'm sure the Press Democrat would love to include these events on its calendar. It can't, though, because there's only one way for Sutter VNA and Hospice to get its support group meetings onto the Press Democrat's calendar. Somebody has to log into the site and input the data.

That model has never worked well, and it never will. The folks at Sutter VNA and Hospice only want to input that information once, on their own website. And that's all they should be expected to do! Their site ought to be the authoritative source for both human-readable information about events and machine-readable data that can syndicate to the Press Democrat or to any other site that needs it.

Unfortunately the Sutter VNA folks don't know about this dual possibility, and don't realize that they could achieve it using Google Calendar, or Hotmail Calendar, or any other single source of human-readable text and machine-readable data about public events.

Likewise, the Press Democrat does not realize that it could subscribe to a data feed from Sutter VNA, once, and thereafter automatically receive a stream of data as comprehensive and accurate as the authoritative source wishes to provide.

This model for collective information management relies on principles that computational thinkers know and apply, including:

  1. pub/sub -- the communication pattern is publish/subscribe
  2. indirection -- event data is passed by reference, not by value, from publisher to subscriber
  3. syndication -- a loosely-coupled network of publishers and subscribers

How might we teach these kinds of principles to the Sutter VNAs and Press Democrats of the world? Maybe we can start by teaching them to the kids we think are digital natives, but who don't actually learn these principles -- because we haven't formulated them and don't teach them.

If you teach in a middle school or a high school, here's an interesting civics lesson you could try. Spin up an elmcity hub for your town, point kids at the unstructured iceberg revealed by the search feature, and show them how to use a service like Google Calendar or Hotmail Calendar to convert unstructured event information into structured event data that can syndicate through the hub.

The task can easily be parallelized by carving the list of search results into chunks, and assigning chunks to individual students or teams of students. Working together they should soon be able to produce a substantial calendar of events that won't appear in any existing online directory. That calendar will be both a valuable civic contribution and a lesson in underlying principles.

For extra credit, have the students engage with the sources and explain the principles to them. The script might go like this:

Dear Mr. Jones,

We're students at the Jefferson Middle School, and we're working on a class project to improve the amount and quality of online event information for our community. We noticed that the following information is available on your website: [EXAMPLES].

However, these events aren't published in a form that enables them to show up automatically elsewhere -- for example, on the Herald's site, or the Chamber of Commerce site, or on people's personal calendars. To show how that can work, we have reformulated your information as a data feed. You can see it merged together with other data feeds here: [EXAMPLE].

This is just a demonstration. We're not the appropriate source for your data, you are. As part of our class project, we're reaching out to organizations like yours to show them how they can publish their own event information in two ways: as text for people to read, and also as data for computers to process and for networks to syndicate.

We know that sounds complicated, but it's really just a way of applying the ordinary calendar software that you probably already have and use. May we contact the person in your organization who's responsible for the events page on your website, and make a presentation about how you could be publishing event information in a more useful way?

Sincerely,

Kayla Smith, Tim Miller, Samantha Williams
Jefferson Middle School Civic Data Project

If you're not a teacher yourself, but you know teachers who might like to try this project-based exercise in civic data gathering and computational thinking, by all means invite them to contact me. I'll be happy to help set up the exercise, support it, and document the outcome.







October 07 2010

Developing intuitions about data

In The laws of information chemistry I mentioned that my local high school uses a PDF file to publish the school's calendar of events. Let's look at some different ways to represent the calendar entries for Oct 6, 2010. First I'll divide these representations into two major categories: "What People See," and "What Computers See." Then I'll discuss how the various formats serve various purposes.

Category 1: What People See

Here's a piece of the PDF file for the week of Oct 4, 2010.


Fig. 1a: How the PDF looks to a person

And here's how the same entries might look in Google Calendar (or in any other calendar program).


Fig. 1b: How the calendar looks to a person

Category 2: What Computers See

The PDF file describes fonts and layout in a highly structured way. But the calendar's data -- dates, times, descriptions -- only lives in free-form text. Computers use it to enable people to read or print that text.



10/6



-Junior class NECAP testing info. Meeting block 4 (aud.)

-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenson/Sintros)

-Field trip: Physics to Arnone’s 7:40-11 a.m. (Lybarger/Romano) List will be sent.

-Senior workshop: “Tips & tricks for writing your college essay 8:05-8:45 & 1:200-2:02 (GCR)

-New teacher workshop 2:15-3:00 p.m. (PCR) “Guidance & Special Ed. Responsibilities”


Fig. 2a: How the data in the PDF file looks to a computer


When your browser renders the calendar, it sees a mixture of HTML and JavaScript. Computers use that mixture to enable people to read, print, and also interact with the text.

<TR class="lv-row lv-newdate lv-firstevent lv-alt">

<TH class=lv-datecell rowSpan=5><A class=lv-datelink 
href="javascript:void(Vaa('20101006'))">Wed Oct 6</A></TH>

<TD class="lv-eventcell lv-status"> </TD>

<TD class="lv-eventcell lv-time"><SPAN class=lv-event-time
onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;">All
day</SPAN></TD>

<TD class="lv-eventcell lv-titlecell">
<DIV id=listviewzYzFmYT...b2tAZw20101006 class=lv-zippy
onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;"></DIV>

<DIV class=lv-event-title-line><A style="COLOR: #1f753c" class=lv-event-title
onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;"
href="javascript:void(0)">-Junior class NECAP testing info. Meeting block 4
<SPAN dir=ltr>(aud.)</SPAN></A> </DIV>

Fig. 2b: How the HTML looks to a computer



A calendar application or service that knows how to use a standard format called iCalendar will receive a structured representation of the data. It relies on that structure to identify, recombine, and exchange the dates, times, and descriptions.

BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20101006T113000Z
DTEND:20101006T190000Z
DTSTAMP:20101005T172506Z
UID:bccvmn5aooodokincjbgl8crc0@google.com
CREATED:20101005T161914Z
DESCRIPTION:
LOCATION:
SUMMARY:-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenso
 n/Sintros)
END:VEVENT



Fig. 2c: How the iCalendar feed looks


If a proposed format called xCalendar is approved as a standard, and is widely adopted by calendar applications and services, then calendar applications or services might also use that format to identify, recombine, and exchange dates, times, and descriptions.



<icalendar xmlns="urn:ietf:params:xml:ns:icalendar-2.0">
<vcalendar>
<properties>
<prodid>
<text>-//Google Inc//Google Calendar 70.9054//EN</text>
</prodid>
<version>
<text>2.0</text>
</version>
</properties>
<components>
<vevent>
<properties>
<dtstamp>20101005T172506Z</dtstamp>
<dtstart>20101006T113000Z</dtstart>
<dtend>20101006T190000Z</dtend>
<uid>
<text>bccvmn5aooodokincjbgl8crc0@google.com</text>
</uid>
<summary>
<text>-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm
(Davenson/Sintros)</text>
</summary>
</properties>
</vevent>
</components>
</vcalendar>
</icalendar>

Fig. 2d: How an xCalendar feed might look



Note that Fig. 2c (iCalendar) and Fig 2d (xCalendar) look very different. The iCalendar format uses lines of plain text to represent name:value pairs. The xCalendar format uses a package of nested XML entities to represent the same data. Technical experts can, and do, endlessly debate the pros and cons of these different approaches. But for our purposes here, the key observations are:

  • Fig. 2c and Fig. 2d contain the same data


  • Computers can reliably extract that data

  • Computers can transform either format into the other without loss of fidelity

  • Computers can also transform either format into one that's more directly useful to people -- e.g., HTML or PDF

It's also worth noting that this simple name:value technique, which has been the Internet calendar standard for over a decade, is broadly useful. Curators of elmcity calendar hubs, for example, follow a convention for representing name:value pairs as tags, attached to Delicious bookmarks, that have the form name=value. A similar convention enables any calendar event, made by any calendar program, to specify the URL for the event and the categories that it belongs to. In this week's companion article on answers.oreilly.com I show how to extract these name:value pairs from free text.
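
Extracting those pairs takes little more than a pattern match. Here's a hedged Python sketch of the idea; the regex and the sample description are illustrative, not the elmcity implementation:

import re

# A calendar event's free-text description, with embedded name=value pairs.
description = ("Monthly meeting of the chess club. "
               "url=http://example.org/chess category=club,games")

# Pull out every name=value pair and build an associative array.
pairs = dict(re.findall(r"(\w+)=(\S+)", description))
print(pairs)
# {'url': 'http://example.org/chess', 'category': 'club,games'}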

A taxonomy of representations and purposes

Let's chart these representations and arrange them according to purpose.




What people see | Why? | What computers see | Why?
Fig. 1a: pdf | To view and print | Fig. 2a: pdf | To enable people to view and print
Fig. 1b: html | To view, print, and interact | Fig. 2b: html | To enable people to view, print, and interact
 | | Fig. 2c: iCalendar | To enable data to flow reliably and recombine easily
 | | Fig. 2d: xCalendar | To enable data to flow reliably and recombine easily

To most people, all four items in the What Computers See column are roughly equivalent. They're understood to be computer files of one sort or another. But when computers use these files on our behalf, they use them in very different ways. The first two uses enable people to read, print, and interact online. The latter two enable computers to exchange data without loss of fidelity, so that other people can read, print, and interact online.

The laws of information chemistry say that if we want to exchange data, we must provide it in a format that's useful for that purpose. In this example the PDF and HTML formats aren't; the iCalendar and xCalendar formats are. To most people it's not obvious why that's so. Our brains are such powerful pattern recognizers, and we know so much about the world in which the patterns occur, that we can look at Fig. 2a and see that the text clearly implies a structure involving dates, times, titles, and descriptions. Computers can't do that so easily or so well.

Computers are, of course, getting smarter all the time. Google Calendar's Quick Add feature is a perfect example. I used it to create the example shown in Fig. 1b, and it did a great job of parsing out the times and titles of the events. But that was only possible because I inserted the events, one at a time, into a container that Google Calendar understood to represent Wed Oct 6. It couldn't have imported the free-form text that was the original source for the PDF file. No other calendar program could either.

The surprising difficulty of structured information

It's counter-intuitive that computers don't recognize structure easily or reliably. But many other things are counter-intuitive too. For example:

You have $100. It grows by 25%, then shrinks by 25%. Do you end up with more or less?

You can live a long time without ever developing an intuition that the final amount is less. And you may be profoundly harmed because you lack that intuition. If you have it, you most likely didn't acquire it all by yourself. Either somebody taught it to you, or nobody did.
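
For the record, the arithmetic is quick to check:

amount = 100
amount *= 1.25   # grows by 25% -> 125.0
amount *= 0.75   # shrinks by 25% -> 93.75
print(amount)    # 93.75, less than the original 100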

Although our sample PDF file contains no structured representation of the events that it exists to convey, it does contain some other structured data:






Title:        Microsoft Word - weekly draft
Made_by:      Word
Created_with: Mac OS X 10.4.11 Quartz PDFContext
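
That document-information block is itself structured data that a program can read. Here's a minimal sketch, assuming the third-party pypdf package and an invented filename:

from pypdf import PdfReader   # third-party package; 'pip install pypdf'

reader = PdfReader("weekly_draft.pdf")   # invented filename for this sketch
info = reader.metadata

# The same kinds of facts shown in the table above.
print("Title:       ", info.title)
print("Made_by:     ", info.creator)
print("Created_with:", info.producer)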

From this we learn that that calendar originates in Microsoft Word. Why Word instead of a calendar program? Available cloud-based applications include Google Calendar and Hotmail Calendar. On the Mac desktop where the document originated, there's Apple iCal. If one of these alternatives were even considered, a number of valid concerns would arise:

  1. It's cumbersome to enter data into a calendar program's input fields; it's much easier and quicker to type into a Word table

  2. The document doesn't only contain structured data; it is also a textual narrative. Calendar programs don't flexibly accommodate narrative.

  3. The webmaster knows how to post a PDF, but wouldn't know what to do with dual outputs from a calendar program (one for humans to read, another for computers to process).

And if alternatives were considered, we could discuss those concerns:

  1. Yes, it is more cumbersome to enter data into a calendar program. But do we want students and teachers and parents to be able to pull these events into their own calendars? Do we want the events to also be able to flow automatically to community-wide calendars? If so, these are big payoffs for a fairly small investment of extra effort. And by doing things this way, we'll demonstrate the 21st-century skills that we say our students need to learn and apply.
  2. Yes, it's true that calendar programs don't accommodate narrative. But we're publishing to the web. We can use documents and links to build a context that includes: the calendar in an HTML format that people can read, print, and interact with; the calendar in another format that can syndicate to other calendars; narrative related to the calendar.

  3. Yes, but the webmaster needn't even be tasked with this chore. Various tools -- some that we already have and use, others that are freely available -- enable us to publish the desired formats ourselves.

Since alternatives are almost never considered, though, the ensuing discussion almost never happens. Why not? Key intuitions are missing. Some kinds of computer files have different properties than others, and thus serve different purposes. Structured representation of data is one such property. If we are trying to put data onto the web, and if we want others to have the use of that data, and if we hope it will flow reliably through networks to all the places where it's needed, then we ought to consider how the files we choose to publish do, or don't, respect that property.

Nobody is born knowing this stuff. We need to learn it. Schools aren't the only source of instruction. But they ought to teach core principles that govern the emerging web of people, data, and services. And they ought to cultivate intuitions about when, why, and how to apply those principles.





September 30 2010

The principle of indirection

Programmers learn, early on, that there's a difference between values stored in memory and pointers to (or references to, or addresses of) values stored in memory. The key distinction emerges when these things move around within programs, and it is captured by a pair of phrases: pass by value and pass by reference.

Suppose the value stored in some memory location represents the number 6. In a pass-by-value regime, 6 is copied from one part of the program to another part of the program. If the value stored in the original memory location then becomes 8, those parts of the program that got copies of 6 still represent 6. The 6 was "passed by value."

In a pass-by-reference regime, though, what's copied from one part of the program to another isn't the value, 6, but rather a reference, or pointer, to the memory location where 6 is stored. In this case, if the value stored in that memory location becomes 8, the parts of the program that received references to 6's memory location now represent 8 too. The 6 was "passed by reference" and, by way of that reference, has become 8.
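
Here's a minimal sketch of that distinction, using Python only because it's handy; any language with both copyable values and shared references would do.

# Pass by value: the parts of the program that got copies still see 6.
a = 6
b = a        # b now independently represents 6; rebinding a won't affect it
a = 8
print(b)     # 6 -- the copy is unaffected

# Pass by reference: what's shared is a pointer to the place where 6 lives.
cell = [6]
alias = cell     # alias refers to the same storage, not a copy of it
cell[0] = 8
print(alias[0])  # 8 -- both names now see the updated value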

It used to be that nobody except programmers had to appreciate this subtle distinction. But along came the web, and now everybody does. Why? Another name for a pointer, or a reference, or an address, is a hyperlink. We use hyperlinks every day. But most people don't use them as well as they could, because most people don't see "pass by value" and "pass by reference" at work in our everyday online discourse.

Here's a quiz that looks easy, and should be, but turns out to be quite hard for most people. Somebody asks you: "What information do you have about topic X?" It's a multiple-choice quiz. There are two ways to answer:

1. Make a list of things, and send a copy of the list.

2. Make a list of things, and send a reference to the list.

Most people choose 1 -- that is, pass by value. The value they send, in this case, is a list of things we can describe using words, phrases, sentences, paragraphs, URLs. The list can be printed on paper and delivered by hand. Or it can be typed and sent as an email or text message. Either way, what's sent is a copy of the list. The original list remains in situ. When it changes, over time, those changes don't propagate through the network of copies.

The minority who choose 2 -- that is, to pass by reference -- achieve the same two goals as do the pass-by-value majority. One goal is to send a social signal: "Here is information I want to give you." The other is to convey the actual information. You get these same two effects no matter whether you pass by value or pass by reference.

When I send you a link to the list, though, instead of a copy of the list, I connect you to a live list that provides four extra benefits:

1. I am the authoritative source for the list. It lives at a location in memory (that is, at a URL in the cloud) that's under my control, and is bound to my identity.

2. The list is always up-to-date. When I add items, you (and everyone else) will see a freshly-updated list when you follow the link I sent you.

3. The list is social. If other people cite my link, I can find their citations and connect with them.

4. The list is collaborative. Suppose you want to extend my list. In a pass-by-value world, the best you can do is add to the copy I sent you. I won't see what you've added, and neither will anybody else. In a pass-by-reference world, though, we can both keep our own lists, publish references to them, and then produce a merged list by combining the referents.

(Of course there's no free lunch. If you depend on the link and it fails, we're out of luck. This week's companion piece at answers.oreilly.com explores one way to handle transient failure.)

The fourth benefit, the collaborative one, is rather abstract. So let's nail it down to a common real-world scenario. Suppose you're running a newspaper, or a hyperlocal website, or some other nexus for community information. And suppose I am a source for that information. Almost always, as things stand today, you'll ask me to pass information to you by value. If I'm promoting a council meeting, or a church supper, or a riverside cleanup, or an open mic night, you'll expect me to inform you about my event's date, time, and description by sending you an email, or by visiting your website and typing the data into a form. Either way, it boils down to: "Give me a copy of your information."

Before 1994 there was no alternative. My original, whether it was a piece of paper in my drawer or a file on my computer's hard drive, wasn't immediately available to you. It could only be passed by value. Since 1994 we've had an exciting new option, albeit one the world mostly hasn't yet caught up to. Now the original can reside on the web, at a permanent and well-known address within its vast memory. And it can be passed by reference.

So providers of information about community events -- the city government, the church, the environmental group, the musicians -- can post references to information about their events. Those references can appear wherever the providers choose to establish their online identities: on conventional websites, on blogs, on Twitter, on Facebook. Purveyors of that information -- newspapers, hyperlocal websites, other nexuses -- can use those references to create views that join many sources, from many perspectives, for many purposes.

That's still a notch too abstract so let's make it even more concrete. City governments provide calendars of council and committee meetings. Local newspapers purvey those calendars. Citizens use them. In the prevailing pass-by-value model, the city gives copies of its event information to the newspaper, which in turn makes copies to give to citizens, who in turn may need to make more copies to pass around. Where's the original? In a document on a computer at city hall.

In a pass-by-reference world, the original resides in the cloud at a unique URL. That URL refers to a list of events. And each item on the list -- each event on the calendar -- has its own URL. The city publishes its calendar on its own website, in HTML, so citizens can read it there. But instead of giving the newspaper copies of event information, it gives the newspaper a link to the calendar's feed. The newspaper, by subscribing to the link, ensures that the information it receives from the city is as timely, accurate, and complete as the city cares to make it. Of course the newspaper still has to make copies for its print version. But online, along with the subset of facts about each event that it chooses to relay, it provides the event's URL. Citizens can click through the event URL to see the whole description, and to check for updates. Citizens can also subscribe directly to the city's calendar URL, and thus merge its stream of civic event data with their own streams of personal event data.

I've yet to convince a local newspaper to adopt this model. It could be that they fear disintermediation. After all, if citizens can subscribe directly to calendar feeds, why will they need the newspaper to tell them about what's going on? But I don't think that's the real problem. There will always be community attention hubs. Newspapers, or whatever they evolve into, will continue to occupy that niche. In their role as purveyors of community information, though, pass-by-value makes them less effective than pass-by-reference could.

The real problem, I think, is that if you're a newspaper editor, or a city official, or a citizen, pass-by-reference just isn't part of your mental toolkit. We teach the principle of indirection to programmers. But until recently there was no obvious need to teach it to everybody else, so we don't.

I've noticed that educators do, nowadays, talk a lot about systems thinking and digital literacy and 21st-century skills. Good! Now let's codify what we mean. Networks of people and data are governed by principles as basic as the commutative law of addition and multiplication. Indirection is one of those principles. Others include pub/sub syndication, universal naming, and data structure. First we need to write them down. Then we need to figure out how to teach them.




September 22 2010

Personal data stores and pub/sub networks

The elmcity project joins five streams of data about public calendar events. Four of them are well-known services: Facebook, EventBrite, Upcoming, and Eventful. They all work the same way. You sign up for a service, you post your events there, other people can go there to find out about your events. What they find, when they go there, are copies of your event data. If you want to promote an event in more than one place, you have to push a copy to each place. If you change the time or day of an event, you have to revisit all those places and push new copies to each.

The fifth stream works differently. It's a loosely-coupled network of publishers and subscribers. To join it you post events once to your own website, blog, or online calendar, in a way that yields two complementary outputs. For people, you offer HTML files that can be read and printed. For mechanized web services like elmcity, you offer iCalendar feeds that can be aggregated and syndicated. If you want to promote an event in more than one place, you ask other services to subscribe to your feed. If you change the time or day of the event, every subscriber sees the change.
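
Here's a rough sketch of what a subscriber in that network does: fetch iCalendar feeds and merge their events. It assumes the third-party icalendar package, and the feed URLs are invented placeholders.

from urllib.request import urlopen
from icalendar import Calendar   # third-party package; 'pip install icalendar'

# Invented placeholder URLs; a real hub gets these from its curator's list.
feed_urls = [
    "http://example.org/city/events.ics",
    "http://example.org/library/events.ics",
]

merged = []
for url in feed_urls:
    ics = urlopen(url).read()
    cal = Calendar.from_ical(ics)
    for event in cal.walk("VEVENT"):
        merged.append((event.get("DTSTART").dt, str(event.get("SUMMARY")), url))

# One merged, time-ordered view drawn from many independently published sources.
for start, title, source in sorted(merged):
    print(start, title, source)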

The first and best example of a decentralized pub/sub network is the blogosphere. My original blogging tool, Radio UserLand, embodied the pub/sub pattern. It made everything you wrote automatically available in two ways: as HTML for people to read, and as RSS for machines to process. What's more, Radio UserLand didn't just produce RSS feeds that other services could read and aggregate. It was itself an aggregator that pointed the way toward what became a vibrant ecosystem of applications -- and services -- that knew how to merge RSS streams. In that network the feeds we published flowed freely, and appeared in many contexts. But they always remained tethered to original sources that we stamped with our identities, hosted wherever we liked, and controlled ourselves. Every RSS feed that was published, no matter where it was published, contributed to a global pool of RSS feeds. Any aggregator could create a view of the blogosphere by merging a set of feeds, chosen from the global pool, based on subject, author, place, time, or combinations of these selectors.

Now social streams have largely eclipsed RSS readers, and the feed reading service I've used for years -- Bloglines -- will soon go dark. Dave Winer thinks the RSS ecosystem could be rebooted, and argues for centralized subscription handling on the next turn of the crank. Of course definitions tend to blur when we talk about centralized versus decentralized services. Consider FriendFeed. It's centralized in the sense that a single provider offers the service. But it can be used to create many RSS hubs that merge many streams for many purposes. In The power of informal contracts I showed how an instance of FriendFeed merges a particular set of RSS feeds to create a news service just for elmcity curators. The elmcity service itself has the same kind of dual nature. A single provider offers the service. But many curators can use it to spin up many event hubs, each tuned to a location or topic.

The early blogosphere proved that we could create and share many views drawn from the same pool of feeds. That's one of the bedrock principles that I hope we'll remember and carry forward to other pub/sub networks. Another principle is that we ought to control and syndicate our data. Radio UserLand, for example, was happy to host your blog, just as Twitter and Facebook are now happy to host your online social presence. But unlike Twitter and Facebook, Radio UserLand was just as happy to let you push your data to another host. To play in the syndication network your feed just had to exist -- it didn't matter where -- and be known to one or more hubs.

This notion of a cloud-based personal data store is only now starting to come into focus. When I was groping for a term to describe it back in 2007 I came up with hosted lifebits. More recently the Internet Identity Workshop gang have settled on personal data store, as recently described by Kaliya Hamlin and Phil Windley. The acronym is variously PDS or PDX, where X, as Kaliya says, stands for "store, service, locker, bank, broker, vault, etc." Phil elaborates:

The term itself is a problem. When you say "store" or "locker" people assume that this is a place to put things (not surprisingly). While there will certainly be data stored in the PDS, that really misses its primary purposes: acting as a broker for all the data you've got stored all over the place, and managing the metadata about that data. That is, it is a single place, but a place of indirection not storage. The PDS is the place where services that need access to your data will come for permission, metadata, and location.

The elmcity service aligns with that vision. If we require the calendar data for a city, town, or neighborhood to live in a single place of storage, we'll never agree to use the same place. Thus the elmcity service merges streams from Facebook, EventBrite, Upcoming, and Eventful. But those streams are fed by people who put copies of their events into them, one event at a time, once per stream. What if we managed our public calendar data canonically, in personal (or organizational) data stores fed from our own preferred calendar applications? These data stores would in turn feed downstream hubs like Facebook, EventBrite, Upcoming, and Eventful, all of which could -- although they currently don't -- receive and transmit such feeds. Other hubs, based on instances of the elmcity service or a similar system, would enable curators to create particular geographic or topical views.

I've identified a handful of common calendar applications that can publish calendar data at URLs accessible to any such hub, in a format (iCalendar) that enables automated processing. The short list includes Google Calendar, Outlook, Apple iCal, and Windows Live Calendar. But there are many others. Here's the full list of producers as captured so far by the elmcity service:

feed producer | # of feeds
-//Google Inc//Google Calendar 70.9054//EN | 151
-//Meetup Inc//RemoteApi//EN | 14
unknown | 14
iCalendar-Ruby | 6
e-vanced event management system | 6
-//DDay.iCal//NONSGML ddaysoftware.com//EN | 5
-//Last.fm Limited Event Feeds//NONSGML//EN | 4
-//openmikes.org/NONSGML openmikes.org//EN | 3
-//CollegeNET Inc//NONSGML R25//EN | 3
-//Drupal iCal API//EN | 3
-//Microsoft Corporation//Windows Live Calendar//EN | 3
-//Trumba Corporation//Trumba Calendar Services 0.11.6830//EN | 2
-//herald-dispatch/calendar//NONSGML v1.0//EN | 1
-//WebCalendar-v1.1.2 | 1
Zvents Ical | 1
Coldfusion8 | 1
-//Intand Corporation//Tandem for Schools//EN | 1
-//strange bird labs//Drupal iCal API//EN | 1
-//SchoolCenter/NONSGML Calendar v9.0//EN | 1
-//blogTO//NONSGML Toronto Events V1.0//EN | 1
-//Events at Stanford//iCal4j 1.0//EN | 1
-//University of California\\, Berkeley//UCB Events Calendar//EN | 1
-//EVDB//www.eventful.com//EN | 1
-//mySportSite Inc.//mySportSite//EN | 1
Mobile Geographics Tides 3988 2010 | 1

Google Calendar dominates overwhelmingly, but the long tail hints at the variety of event sources that could feed into a calendar-oriented pub/sub network. How much of the total event flow comes by way of this assortment of iCalendar sources, as compared to centralized sources? Here's the breakdown:


[Chart: contribution of the five event streams -- Eventful, Upcoming, iCalendar, EventBrite, Facebook -- to each hub]

It's roughly half Eventful, a third Upcoming, a fifth iCalendar. There's negligible flow from EventBrite, which focuses on big events, and likewise from Facebook, where the focus, though it's evolving, remains on group rather than world visibility.

In a companion piece at O'Reilly Answers I show how I made this visualization. It's a nice example of another kind of pub/sub network, in this case one that's enabled by the OData protocol. For our purposes here, I just want to draw attention to the varying contributions made by the five streams to each of the hubs. The Eventful stream is strong almost everywhere. The Upcoming and iCalendar tributaries are only strong in some places. But where the iCalendar stream does flow powerfully, there's a curator who has mined one or more rich veins of data from a school system, or a city government, or a newspaper. Today the vast majority of these organizations think of the calendar information they push as text for people to read. Few realize it is also data for networks to syndicate. When that mindset changes, a river of data will be unleashed.





September 10 2010

Twitter kills the password anti-pattern, but at what cost?

Last week Twitter ended basic name/password authentication to its API. This was announced well in advance. Still, like many procrastinators, I found myself scrambling to learn what is now the only available way to authenticate to the Twitter API: OAuth.

Some people can visualize how these kinds of protocols work by reading specs and flow diagrams, but I'm not one of them. For me it doesn't really sink in until I create a working implementation. So I embraced the opportunity to solidify what had been just a theoretical understanding of the OAuth protocol.

I've also long embraced the principle that motivates OAuth. You should never have to give your name/password credentials to a third-party application or service so that it can impersonate you. This so-called password anti-pattern is profoundly wrong. When legitimate applications and services ask for permission to impersonate us, we learn that it's OK to do things that way. It isn't. Malicious actors can and do exploit our willingness to give up our credentials.

How can I authorize a third party to access my data on a service like Twitter, and use its capabilities on my behalf, without giving it my credentials? OAuth is a clever solution to this hard problem. A third party that would otherwise need my credentials to do something can interrupt the transaction, send me to Twitter where I authorize it to act on my behalf, save a token that represents that authorization, and then resume the transaction.

In my case, though, that's a solution to a problem that I didn't have. OAuth is a three-party protocol. It typically involves a user, an application or service that the user authenticates to directly, and a third-party application or service to which the user delegates agency. The elmcity/Twitter scenario, though, is really just a two-party game. As I discussed in "The power of informal contracts," the elmcity service uses Twitter as a channel for authenticated messages sent from event curators to the service.

Here's how it works. If you're the curator of an elmcity hub, you can associate your hub with a Twitter account. The elmcity service, which exists in Twitterspace as @elmcity_azure, will follow your Twitter account. That enables you to send direct messages to @elmcity_azure. The service reads and reacts to messages sent through this trusted channel.

In this scenario there is no third party. As far as Twitter and the curators are concerned, the elmcity service is just another Twitter user. Of course it isn't a human sitting in front of a browser, but rather a service running in the cloud. So it has to use the Twitter API to follow curators, and to read the direct messages they send to it.
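
For the curious, here's roughly what that API access looks like once OAuth is required. This is a sketch, not the elmcity code: it assumes the requests and requests_oauthlib packages, placeholder credentials, and the direct-messages endpoint and field names as they appeared in the Twitter API of that era.

import requests
from requests_oauthlib import OAuth1   # third-party; 'pip install requests requests_oauthlib'

# Placeholder credentials, obtained by registering the app and authorizing the account.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# The era's endpoint for reading direct messages sent to the authenticated account.
url = "https://api.twitter.com/1/direct_messages.json"

# Every request must be signed; the library handles the canonicalization and HMAC.
response = requests.get(url, auth=auth)
for message in response.json():
    print(message["sender_screen_name"], message["text"])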

When I had the idea to do this, the implementation came quickly and easily. Why? Because the activation threshold was low. Until recently, it was dead simple to access the Twitter API, using HTTP basic authentication, in order to try out interesting ideas. Earlier this year, for example, I wondered what it would be like to visualize the names of your Twitter lists as a tag cloud. I sketched a solution using a simple web page that used JavaScript to call the Twitter API and render results.

The page asked for your Twitter name and password, but I explained that things weren't as they might have seemed:

The first time through, you'll be prompted to authenticate to api.twitter.com. This looks like the password anti-pattern, but really isn't. You're authenticating yourself to the Twitter API in the same way that you normally do to the Twitter website.

When you try that page now, the Twitter API responds:

{"errors":[{"code":53,"message":"Basic authentication is not supported" }] }

Are there JavaScript libraries that would enable me to recreate this app using OAuth? Yes. Will I bother to acquire, evaluate, learn, configure, and use them? No way; life's too short. And anyway there's no great loss here; it was just a sketch of an idea.

What is lost, though, is the possibility to sketch a lot of other ideas. The unexpected way in which the elmcity service uses Twitter is, I argue, an interesting and useful model. Having concocted it, I went to a fair bit of trouble last week to move it from basic authentication to OAuth. In a companion article at answers.oreilly.com I share what I learned about libraries, tools, and techniques that can help developers work with OAuth. It's a complicated dance of cryptography, redirection, and callbacks. So complicated, in fact, that if I'd had my idea today instead of six months ago I probably wouldn't have bothered to jump through all the hoops.

According to Eran Hammer-Lahav, who is the editor of OAuth 1.0 and also an editor of OAuth 2.0, the 2.0 protocol will make it possible again to sketch ideas using simple tools:

OAuth 2.0 provides a cryptography-free option for authentication which is based on existing cookie authentication architecture. Instead of sending signed requests using HMAC and token secrets, the token itself is used as a secret sent over HTTPS. This allows making API calls using cURL and other simple scripting tools without having to canonicalize the request and sign it.
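
In other words, under that OAuth 2.0 option a sketch can shrink back to something like this (assuming the requests package, and a placeholder token and URL):

import requests

# The bearer token itself is the secret; HTTPS protects it in transit.
token = "PLACEHOLDER_ACCESS_TOKEN"
response = requests.get(
    "https://api.example.com/resource",        # placeholder URL
    headers={"Authorization": "Bearer " + token},
)
print(response.status_code, response.text)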

Good! But OAuth 1.0 is only now gathering momentum. OAuth 2.0, which isn't backward-compatible, will require its own long runway. I'm glad to see Twitter driving a stake into the heart of the password anti-pattern. But the Twitter ecosystem wouldn't exist if it hadn't been possible to sketch ideas, and to explore the unanticipated uses that can emerge from the soup of active ingredients that the web has become.

We must not teach people to give up their credentials and expose themselves to attack. We must, however, teach them to savor that soup of ingredients, to expect novel concoctions to arise from it, and where possible to create their own. These innovations require us to assert and delegate our identities. Today that can be easy or it can be safe. Until it can be both easy and safe, we'll be forced to make painful trade-offs.



August 18 2010

The laws of information chemistry

In the course of my work on the elmcity project I've talked to a lot of people about forming networks of calendars. One of the major hurdles has been the very idea that we can form such networks, in an ad-hoc way, using informal contracts. Later in this series I'll explore why that's a tough concept, and mull over how we might soften it up. Here I'll focus on an even more basic conceptual stumbling block: information structure.

Everybody learns that things in the physical world are structured in ways that govern how they can or cannot interact. Whether it's proteins folded into biochemical locks and keys, or metallic parts formed into real locks and keys, we know the drill. The right shape will open the door, the wrong one won't. You can't get through grade school without being exposed to that idea.

Unless you're on an IT track, though, you'll likely graduate from college without ever learning this corollary: The right information structures open doors, the wrong ones won't.

My project has shown me that many otherwise well-educated professionals have no intuitive sense of the differences between these various representations of a calendar:

  1. As a PDF file

  2. As an HTML page

  3. As an RSS feed

  4. As an iCalendar feed

These are all just different flavors of computer files, most people think. Pick a format that can be read on a PC or a Mac and you're good to go. So my local high school, for example, uses PDF:

It irks me that the school publishes this data without acknowledging that it is data, or providing it in a form appropriate to the kind of data it is. In 2010, one of the "tools to succeed in a diverse and interdependent world" has to be a basic working knowledge of information chemistry. The quotation about learning that appears at the top of that image speaks to the underlying principle:

"Treat it as an active process of constructing ideas, rather than a passive process of absorbing information." - Daniel J. Boorstin

I looked for that quotation's context, by the way, and didn't find it in any of Boorstin's works. Instead it shows up in "From Risk To Renewal: Charting a Course for Reform."

Anyway, I agree with the authors of that report. We don't just passively dwell in social information networks, we actively co-create them. To do that effectively we need to know what will or won't catalyze a chemical reaction in data space.

The reaction I hope the elmcity project will help catalyze is one that unlocks calendar data and enables it to flow freely through networks without loss of fidelity. In theory, any of the four document flavors listed above could work, if supported by tools that encode and exchange the core structure of an event: a title, a date and time, a link to the authoritative source. In fact there's only one common flavor that preserves that structure: iCalendar. And there's only one widely-deployed kind of application that reads and writes it; examples include Google Calendar, Microsoft Outlook, Apple iCal, and Lotus Notes.
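
The core structure really is that small. Here's a sketch that writes it out in the iCalendar flavor; the event details and URL are invented, and a real feed would also need properties like PRODID, UID, and DTSTAMP.

# Invented example data: the three facts a calendar network needs to carry.
title = "Riverside cleanup"
start = "20101009T090000"
link = "http://example.org/events/riverside-cleanup"

vevent = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "BEGIN:VEVENT",
    "SUMMARY:" + title,
    "DTSTART:" + start,
    "URL:" + link,
    "END:VEVENT",
    "END:VCALENDAR",
])

print(vevent)   # a minimal .ics payload that calendar programs can subscribe to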

Among technical folk there's been an on-again, off-again effort to migrate the iCalendar standard from its existing plain text format to one based on XML, which didn't exist when iCalendar was born. For me it's a wash. Either flavor can encode the basic facts in a way that enables calendar networks to form. Will translation between the flavors be a problem? It shouldn't be, but if so I'd regard it as a good problem to have as compared to the one we've actually got, which is that nearly all the calendar information available online isn't in any calendar format. It's randomly dumped into PDF files, or into HTML pages that don't (as they might) encode event structure using the hCalendar microformat.

The calendar-like HTML page is so common that a service called FuseCal tried (with pretty good success) to scrape those web pages and turn them into standard iCalendar feeds. The service is gone now, and one piece of the elmcity project (which I describe in the companion how-to article "How to write an elmcity event parser plug-in") aims to recreate it in a modest way. I'm ambivalent about doing this, though, because web-page scraping sweeps the real problem under the rug. Of course we can't expect people to read and write raw data-exchange formats. But we can and should expect people to have a clue about what data-exchange formats are, and to know something about when and why to use them.

There has been progress. Starting with the early blogosphere, and continuing into the present era of Facebook and Twitter, the technocracy has introduced the masses to the concept of information feeds. Many people now know, in a general way, that some molecular strands of information combine more readily than others. But the concept isn't yet fully digested. So, for example, events pages on websites are far more likely to link to RSS or Atom feeds than to iCalendar feeds. With apologies to Guy Kawasaki and Terence Trent D'Arby, that's the Right Thing done the Wrong Way.

Here's why: Publishing a data feed is absolutely the right idea, but using RSS or Atom feeds to do it is a category error. Because these feeds don't encode dates, times, and locations in any standard way, they're part of the blogosphere but can't flow through calendar networks.

Calendars, of course, are just one of many types of data that can drive online chemical reactions. We're reaching a consensus that open publication of data is a necessary condition. But it's not sufficient. We've always expected educated citizens to know at least basic physics and chemistry. Now we need to discover, write down, and teach the analogous laws that govern social information networks.





August 11 2010

The power of informal contracts

Before elmcity could be released as an open source project, it had to be reviewed by a team of Microsoft lawyers. They found no patentable invention in the code and gave me the green light. Which is funny in a way, because I'm sure I couldn't ever create a patentable software invention. I'm just not that talented as a programmer. Writing code, for me, is mainly a way to explore ideas and illustrate possibilities. In this case, it's the idea that we can manage data in a social way by participating in networks of syndicated feeds. And it's the possibility that a lone developer, modestly capable but empowered by languages, libraries, tools, and cloud services, can bring that idea to life.

What is innovative about my project, I claim, is the network-oriented
way of thinking and acting that it embodies and promotes. The
approach is based on a set of principles that we have yet to fully
articulate, never mind teach along with the proverbial three R's.
Almost everybody learns the rules of reading, writing, and
arithmetic. But almost nobody learns the laws that govern the
structure and flow of information in networks, and that enable us to
make effective use of those networks.

We don't need software innovation to solve the problem that the
elmcity project addresses. Our perpetual inability to merge personal
and public calendar data can't be blamed on a lack of standard,
broadly-available software and services. We have all the stuff we
need, and have had for quite some time, and it all interoperates
pretty well. But we haven't internalized why, and how, to
use it in a network-oriented way.

One of the crucial ingredients is something that I'll call the
Principle of Informal Contracts. Here's an example from the early
blogosphere: RSS auto-discovery. It's the mechanism that associates the
pages of a website with a corresponding feed. Back in
2002 a remarkable collaboration brought it
into existence, summarized here by Mark Pilgrim (http://diveintomark.org/archives/2002/06/02/important_change_to_the_link_tag), who concluded:

Thank you to everyone who has been working on making this come together in the past few days. It has been surprisingly painless and friction-free. Together, we have come up with a new standard that is useful, elegant, forward-thinking, and widely implemented. In 4 days.

Thanks to that consensus, it has ever since been easier to subscribe to
blogs. But what about before then? The blogosphere's feed ecosystem was already
bootstrapped and thriving. RSS, implemented by various feed
publishers and feed readers, was the obvious enabler. But there was also an
informal contract. It went something like this:

My blog publishes a feed at a URL that I will tell you about one way or another. Maybe I'll use an orange icon, maybe I'll use a subscribe link, maybe both, maybe something else. By whatever means, I promise to produce a flow of new items at that URL. The feed itself might be one or another flavor of RSS. Whichever it is, your feed reader can count on finding certain things in a predictable way: a title, a link, a description.

That was enough to get things started in a big way. Here's a more contemporary
example: the contract that's implied by choosing a tag at the beginning of a
conference. It says:

We, the organizers, promise to bind the online resources that we produce for this conference to this tag. And we invite you, the attendees, to do the same. We do not promise to collect all the resources discoverable by means of this tag, dump them into a bucket, and assert authority over (or assume liability for) the contents of the bucket. We just want to be able to find all the stuff, yours and ours, and use it as needed. We want you to be able to do the same. And we think the virtual network routed by that tag can be valuable to everybody.

Members of a certain tribe, which you most likely belong to if you're
reading this, take that contract for granted. For you, the word "camp"
does not connote outdoor recreation, but rather a new kind of
conference that's co-created by the organizers and by the attendees.
One of the organizers' roles is to declare a tag for the conference.
I've watched newcomers to the tribe encounter this practice for the
first time, and then immediately adopt it. So I know the idea is
catching on.

But we haven't yet spelled out the underlying principle of informal
contracts. And we've got to do that, because we're living in a world
where networks of people and data can fruitfully use such contracts. They're easy to
create if you know how and why. Here's the contract for the ecosystem
of calendar feeds that I'm trying to bootstrap:

Anybody who wants to promote a series of public events agrees to publish a feed that transmits certain facts about events in a predictable way: the title, the starting date and time, a link back to the authoritative source. Anybody who wants to curate a view of sets of feeds made available in this way can do so by making a list of them.

Within the project itself, there are some other contracts. Here's one:

To make the lists of feeds that define these curated views, we'll agree to use the delicious social bookmarking service in a particular way. The elmcity service will define the extensible set of delicious accounts used for this purpose. We'll agree that users of those accounts will follow certain practices that define the settings for their instances of the service, and the lists of feeds they trust to flow through their hubs.

That agreement in turn enables another. Because every action in
delicious is a database query that produces an RSS feed, the
acquisition of a new calendar feed by a curator sends a message on a
virtual channel that can be read and reacted to. The service that
reads and reacts, in this case, is FriendFeed. Here's the contract:

Curators can subscribe to a project feed that's aggregated by an instance of the FriendFeed service. Items flowing through its Atom feed are significant events in the life of the project. They include:


  • The posting of a forum message or reply.

  • The posting elsewhere of a tweet, blog item, or other resource bound to the
    project tag.

  • The discovery of a new (and trusted) iCalendar resource by any hub's curator.


Like delicious, FriendFeed can support this kind of contract by empowering users to define sets of online resources and share them as feeds.
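
Here's a sketch of what "read and react" means in practice: poll the aggregated feed and notice new items. It assumes the third-party feedparser package and an invented feed URL; both delicious and FriendFeed exposed their streams as feeds of this kind.

import feedparser   # third-party; 'pip install feedparser'

# Invented URL standing in for a delicious or FriendFeed feed of curator activity.
feed_url = "http://example.org/elmcity/curators.atom"

seen = set()

def poll():
    # Each new entry is a significant event: a new trusted feed, a tagged post, etc.
    for entry in feedparser.parse(feed_url).entries:
        if entry.link not in seen:
            seen.add(entry.link)
            print("new item:", entry.title, entry.link)

poll()   # a real service would call this on a schedule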

With these kinds of contracts in place, interesting possibilities arise. Here's one that tickles me: The elmcity service doesn't yet need to implement its own system of user registration. If my approach to decentralized event curation takes off, I might someday need to create a registration system and require curators to use it. But for now it's trivial for me to bookmark a delicious account and tag it as one that the service regards as authoritative for a hub.

Likewise, the folks who provide the feeds trusted by curators and aggregated by hubs don't yet have to register with the service. For now, providers need only make feeds known to curators, one way or another. And curators who deem those feeds worthy of inclusion need only bookmark and tag them.

These are the kinds of workflows that can arise when informal contracts are in place. They're cheap to create and easy to evolve. Here's a final example. If you're a curator who adds a new feed to a hub, you naturally want to see your new events merged into the hub as soon as possible. Normally you'd have to wait until the next aggregation cycle, which might be as long as eight hours. So I had to come up with a way for a curator to send an authenticated message to the service, telling it to short-circuit the wait and gather new events right away.

My first thought was that here, finally, was the reason I'd need to build my own user registration system and require curators to use it. My second thought was, hang on, I already trust their delicious accounts, why not use those accounts as channels for authenticated messages? That's doable, but I didn't find a graceful way to do it. My third thought was: What about Twitter? Hence the Twitter contract:

Among the settings that a curator conveys to the elmcity service, by way of a trusted delicious account, is the Twitter account to be used in connection with that curator's elmcity hub. The service will follow the curator at that Twitter account. By doing so, it enables the curator to send authentic messages to the service. The vocabulary used by those messages will initially be just a single verb: start. When the service receives the start message from a hub, it will reaggregate the hub's list of feeds.

In a companion how-to article on O'Reilly Answers, I show how this piece of the service works. What's relevant here is that the code doesn't have to do very much. That's because the informal contract makes it possible to reuse Twitter in a novel way, as a channel for authenticated messages.

In a world full of services like delicious, FriendFeed, and Twitter -- services that can route feeds of data based on user-defined vocabularies -- you don't have to be a programmer to create useful mashups. You just have to understand, and find ways to apply, the Principle of Informal Contracts.



August 03 2010

Lessons learned building the elmcity service

In 1995 I started writing a column for BYTE about the development of the magazine's website, plus some early examples of what we now call web services and social media. When I started, I knew very little about Apache, Perl, and the Common Gateway Interface. But I was lucky to be able to learn by doing, by explaining what I learned to my readers, and by relaying what they were teaching me. Because I came to the project with a beginner's mind, the column became a launchpad for a lot of people who were just getting started on web development.

Nowadays I'm working on another web project, the elmcity calendar aggregator. And I came to this project with a different kind of beginner's mind. I had built a first version of the service a few years back in Python, on Linux, using the Django framework. After I joined Microsoft I decided to recreate it on Azure. I started in Python -- specifically, IronPython. But Azure was brand new at the time, and not very friendly to IronPython. So I switched to C# and .NET. I knew more about that environment than I had once known about Perl and CPAN, but not a whole lot more. That inexperience qualifies me to write another series of learning-by-doing essays, and that's what this will be.

The code, which is under an Apache 2.0 license, will live on github. I'll discuss it in detail over on O'Reilly Answers. In this space, I'll reflect on larger themes: building and operating a cloud service in 2010, in a way that cooperates with other services and straddles two different cultures.

You know the cultural stereotypes. In the open source realm, services written in dynamically-typed languages like Python and Ruby wrangle streams of open data for the public good. In the enterprise zone, services written in statically-typed languages like C# and Java manage proprietary data for profit. What happens when you mix open source goals, styles, and attitudes with Microsoft tools, languages, and frameworks? You get a cultural mashup. That's what the elmcity project is, and what this series will explore.

Recently I had dinner with Adrian Holovaty. He's the force behind Django, the popular Python-based web development framework, and EveryBlock, an engine for hyperlocal news and information. Adrian asked me what it's like to build software the way I've been doing it for the last year: in C# (and IronPython), on Azure, using Visual Studio Express. I picked the first example that came into my head: "When I rename a variable or method," I said, "it gets automatically renamed across the whole project." Adrian's response was: "I've never used a tool like that, so I don't know what I'm missing."

Of course it goes both ways. A lot of developers on the Microsoft side of the fence have never used Django, or Rails, and they don't know what they're missing either.

If you've followed my work over the years, you know I've always been a best-of-both-worlds pragmatist. So this will be an atypical narrative about C# and .NET development. I see through the lens of Perl, Python, HTTP, and REST, with a bias toward The Simplest Thing That Could Possibly Work.

You shouldn't have to drink a gallon of Kool-Aid, and then have a brain transplant, in order to start producing useful results. Back in the BYTE era I was struck by how little I actually had to learn about Perl and CGI in order to accomplish my goals. Likewise, I've barely scratched the surface of C#, .NET, Visual Studio, and Azure as I've developed the elmcity service.

I claim that's a good thing. There are many more services needing to be built than there are Adrian Holovatys available to build them. One of Microsoft's great strengths has always been the empowerment of the average developer. It should be possible for a useful service to be built, maintained, and evolved by somebody who isn't a great programmer. And trust me, I'm not. But the languages, tools, framework, and platform that I'm using for this project have enabled me to be better than I otherwise would be.

Finally, this series is about the wider goals of the elmcity project. It was born of my frustration with the web's longstanding failure to outperform posters on shop windows and kiosks as a source of information about goings-on in our cities, towns, and neighborhoods. I'm trying to bootstrap an ecosystem of iCalendar feeds that's analogous to the existing network of RSS and Atom feeds. The elmcity service is an example of what Rohit Khare memorably called syndication-oriented architecture. It embraces that style by syndicating with other services such as delicious and FriendFeed. And it will ultimately succeed only when everyone involved in the events ecosystem -- event owners and promoters as well as print and online aggregators -- can plug into a network of syndicated data feeds. So I'll talk about lessons learned while building and running the service, but also about why we need to broadly enable -- and popularize! -- a decentralized style of social information management. Because it's not just about events and calendars. We're all becoming publishers and consumers of many different kinds of data. Centralized repositories won't work. We have to learn how to network our data.
