July 28 2011

Visualizing structural change

My ongoing series about the elmcity project has shown a number of ways in which I invite participants to think like the web. One of the principles I try to illustrate by example is:

3. Know the difference between structured and unstructured data

Participants learn that calendars on web pages and in PDF files don't syndicate around the web, but calendars in the structured iCalendar format do. They also learn a subtler lesson about structured data. Curators of elmcity hubs manage the settings for their hubs, and for their feeds, by tagging Delicious bookmarks using a name=value syntax that enables those bookmarks to work as associative arrays (also known as dictionaries, hashtables, and mappings). For example, here's a picture of the bookmark that defines the settings for the Honolulu hub:

The bookmark's tags represent these attributes:

facebook: yes
header_image: no
radius: 300
title: Hawaii Events courtesy of
twitter: alohavibe
tz: hawaiian
where: honolulu,hi
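
To make the associative-array idea concrete, here is a minimal sketch of how a list of name=value tags could be turned into a dictionary. The function name `tags_to_dict` and the exact tag spellings are assumptions for illustration; this is not the elmcity service's actual code.

```python
def tags_to_dict(tags):
    """Turn name=value tags into a dictionary; tags without '=' are skipped."""
    settings = {}
    for tag in tags:
        name, sep, value = tag.partition("=")
        if sep:  # only tags that actually contain '=' carry a setting
            settings[name] = value
    return settings


# Hypothetical tag list modeled on the Honolulu attributes listed above.
honolulu_tags = [
    "facebook=yes",
    "header_image=no",
    "radius=300",
    "twitter=alohavibe",
    "tz=hawaiian",
    "where=honolulu,hi",
    "trusted",  # an ordinary tag with no '=' is ignored
]

hub = tags_to_dict(honolulu_tags)
print(hub["tz"], hub["radius"])
```

Every value stays a string here; a real consumer would decide per attribute whether to convert, say, the radius to a number.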

Curators use Delicious to declare these sets of attributes, but the elmcity service doesn't always retrieve them from Delicious. Instead it syncs the data to Azure tables and blobs. When it needs to use one of these associative arrays it fetches an XML chunk from an Azure table or a JSON blob from the Azure blob store. Both arguably qualify as NoSQL mechanisms but I prefer to define things according to what they are instead of what they're not. To me these are just ways to store and retrieve associative arrays.

Visualizing change history for elmcity metadata

Recently I've added a feature that enables curators to review the changes they've made to the metadata for their hubs and feeds. The other day, for example, I made two changes to the Keene hub's registry. I added a new feed, and I added a tag to an existing feed. You can see both changes highlighted in green on this change history page. A few hours later I renamed the tag I'd added. That change shows up in yellow here. On the following day I deleted three obsolete feeds. That change shows up in yellow here and in red here.

These look a lot like Wikipedia change histories, or the "diffs" that programmers use to compare versions of source files. But Wikipedia histories and version control diffs compare unstructured texts. When you change structured data you can, at least in theory, visualize your changes in more focused ways.
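
The contrast can be sketched in a few lines of Python. The two dictionaries are made-up feed settings, not real elmcity data. The text diff reports which lines of the serialized form differ, while the key-level comparison names the field that changed.

```python
import difflib
import json

# Hypothetical before/after metadata for one feed.
before = {"feed": "library", "tags": "books"}
after = {"feed": "library", "tags": "reading"}

# Unstructured view: compare the serialized text line by line.
text_diff = list(difflib.unified_diff(
    json.dumps(before, indent=1, sort_keys=True).splitlines(),
    json.dumps(after, indent=1, sort_keys=True).splitlines(),
    lineterm="",
))

# Structured view: compare the values stored under each key.
changed_keys = sorted(
    key for key in before.keys() | after.keys()
    if before.get(key) != after.get(key)
)

print("\n".join(text_diff))
print(changed_keys)
```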

One of the great ironies of software development is that although computer programs are highly structured texts, we treat them just like Wikipedia articles when we compare versions. I've had many discussions about this over the years with my friend Greg Wilson, proprietor of the Software Carpentry project. We've always hoped that mainstream version control systems would become aware of the structure of computer programs. So far we've been disappointed, and I guess I can understand why. Old habits run deep. I am, after all, writing these words in a text editor that emulates emacs.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Maybe, though, we can make a fresh start as the web of data emerges. The lingua franca of the data web was going to be XML; now the torch has passed to JSON (JavaScript Object Notation). Both are used widely to represent all kinds of data structures whose changes we might want to visualize in structured ways.

The component at the heart of elmcity's new change visualizer is a sweet little all-in-one-page web app by Tom Robinson. It's called JSON Diff. To try it out in a simple way, let's use this JSON construct:

  "Evernote" :
   "Version": ""
  "Ghostery IE Plugin" :
   "Version": ""

These are a couple of entries from the Programs and Features applet in my Windows Control Panel. If my system were taking JSON snapshots of those entries whenever they changed, and if I were later to upgrade the Ghostery plugin to (fictitious future version), I could see this JSON Diff report:
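
For readers who want to see the idea in code, here is a rough sketch of a recursive comparison of two JSON snapshots. It is not Tom Robinson's JSON Diff; the function `json_diff` is hypothetical, and the version strings "1.0" and "2.0" are placeholders, not the real version numbers.

```python
def json_diff(old, new, path=""):
    """Yield (path, kind, old_value, new_value) for each difference."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}/{key}"
            if key not in old:
                yield (child, "added", None, new[key])
            elif key not in new:
                yield (child, "removed", old[key], None)
            else:
                # Key exists on both sides: descend into its value.
                yield from json_diff(old[key], new[key], child)
    elif old != new:
        yield (path, "changed", old, new)


# Placeholder version strings, for illustration only.
snapshot_before = {
    "Evernote": {"Version": "1.0"},
    "Ghostery IE Plugin": {"Version": "1.0"},
}
snapshot_after = {
    "Evernote": {"Version": "1.0"},
    "Ghostery IE Plugin": {"Version": "2.0"},
}

for change in json_diff(snapshot_before, snapshot_after):
    print(change)
```

Only the Ghostery entry is reported, along with the path to the value that changed. Lists are compared as whole values in this sketch.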

You can try it yourself at this JSON Diff URL. Or if you're running Internet Explorer, which the original JSON Diff doesn't support, you can copy that JSON chunk and paste it into one of the earlier examples. The elmcity adaptation of JSON Diff, which uses jQuery to abstract browser differences, does work with IE.

It's worth noting that the original JSON Diff has the remarkable ability to remember any changes you make by dynamically tweaking its URL. It does so by tacking your JSON fragments onto the end of the URL after the hash symbol (#) as one long fragment identifier! The elmcity version sacrifices that feature in order to avoid running into browser-specific URL-length limits, and because it works with a server-side companion that can feed it data. But it's cool to see how a self-contained single-page web app can deliver what is, in effect, a web service.
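
The general trick of keeping state in the fragment identifier can be sketched like this. The encoding shown (one percent-encoded JSON object with "left" and "right" keys) and the example.org URL are assumptions; the original JSON Diff may lay out its fragment differently.

```python
import json
from urllib.parse import quote, unquote, urlsplit


def state_to_url(base_url, left, right):
    """Append both JSON documents to the URL as one fragment identifier."""
    payload = json.dumps({"left": left, "right": right}, separators=(",", ":"))
    return f"{base_url}#{quote(payload, safe='')}"


def url_to_state(url):
    """Recover the two JSON documents from the fragment identifier."""
    fragment = urlsplit(url).fragment
    data = json.loads(unquote(fragment))
    return data["left"], data["right"]


# Hypothetical page address and placeholder values.
url = state_to_url(
    "http://example.org/jsondiff.html",
    {"Version": "1.0"},
    {"Version": "2.0"},
)
print(url)
print(url_to_state(url))
```

Because everything after the # stays in the browser, no server has to store anything, which is also why long documents run into the URL-length limits mentioned above.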

What changed, and when?

A key question, in many contexts, is: "What changed, and when?" In the Heavy Metal Umlaut screencast I animated the version history of a Wikipedia page. It was a fascinating exercise that sparked ideas about tools that would automate the process. Those tools haven't arrived yet. We could really use them, and not just for Wikipedia. In law and in journalism the version control discipline practiced by many (but not all!) programmers is tragically unknown. In these and in other fields we should expect at least what Wikipedia provides -- and ideally better ways to visualize textual change histories.

But we can also expect more. Think about the records that describe the status of your health, finances, insurance policies, vehicles, and computers. Or the products and personnel of companies you work for or transact with. Or the policies of governments you elect. All these records can be summarized by key status indicators that are conceptually just sets of name/value pairs. If the systems that manage these records could produce timestamped JSON snapshots when indicators change, it would be much easier to find out what changed, and when.
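
As a thought experiment, here is a small sketch of what such a system might keep. The class `SnapshotLog` and the insurance-style indicator are invented for illustration. Each snapshot is stored as JSON text with a timestamp, and a history query answers "what changed, and when?" for one indicator.

```python
import json
from datetime import datetime, timezone


class SnapshotLog:
    """Keeps a timestamped JSON snapshot each time the indicators change."""

    def __init__(self):
        self.snapshots = []  # list of (ISO timestamp, JSON text)

    def record(self, indicators, when=None):
        """Store a snapshot unless it is identical to the previous one."""
        when = when or datetime.now(timezone.utc)
        text = json.dumps(indicators, sort_keys=True)
        if self.snapshots and self.snapshots[-1][1] == text:
            return False
        self.snapshots.append((when.isoformat(), text))
        return True

    def history(self, name):
        """List (timestamp, value) for each snapshot where `name` changed."""
        changes = []
        last = object()  # sentinel that equals no JSON value
        for stamp, text in self.snapshots:
            value = json.loads(text).get(name)  # None means the name is absent
            if value != last:
                changes.append((stamp, value))
                last = value
        return changes


log = SnapshotLog()
log.record({"premium": 100}, datetime(2011, 7, 1, tzinfo=timezone.utc))
log.record({"premium": 100}, datetime(2011, 7, 15, tzinfo=timezone.utc))  # unchanged, skipped
log.record({"premium": 120}, datetime(2011, 7, 28, tzinfo=timezone.utc))
print(log.history("premium"))
```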


June 21 2011

Taking it offline while staying online

The move toward a social web doesn't necessarily mean that everything we talk about online should be public. That's the argument of a new startup called SecretSocial. The company wants to build a place online where people can feel free to express themselves without worrying that data about each interaction, each click, and each word is accumulated and tracked.

SecretSocial only retains data about conversations on the site for the length of the conversation itself. Users' information isn't tracked or sold to advertisers.

I recently spoke with SecretSocial co-founder Zubin Wadia (@secretsocial) about the need for such a service and the challenges he's encountered while building a site for confidential conversations.

What sorts of problems does SecretSocial solve?

Zubin Wadia: SecretSocial is a place for people to engage in private and authentic conversations on the social web. The social web today is primarily a public affair with a static social graph associated with it. If you "follow" or "friend" someone, they remain in your sphere until you decide to curate your social graph again. At best, many social services offer the ability to create groups to partition your social engagements. We think groups are unintuitive. Human relationships are far too sophisticated and transient to be expressed that way.

SecretSocial makes conversations natural by letting the topic drive the degree of privacy and participation needed. The added twist to this is plausible deniability. No record of the conversation is kept on our servers upon conclusion or termination of the conversation.

We also have a "sudden death" mode where any participant can preemptively terminate a conversation and destroy the data if they're uncomfortable with it.

What are some of the use cases? Could this be a WikiLeaks-type service?

Zubin Wadia: Initially, we expect this to be popular among users of Twitter who regularly strike up interesting conversations with people they want to know better. It should also be well received by users of Gmail who are initiating emails that may be sensitive in nature. SecretSocial makes transitioning from that initial Twitter skirmish to a relationship extremely simple. You can also invite users over SMS and email.

In the future, we can see SecretSocial being used for journalism, pro-democracy movements and a number of professional use-cases. Our intention is not to be a WikiLeaks-type service, as we are inherently identity oriented. We simply want people to have genuine conversations about topics pertinent to them.

How do you make sure that users' data is really gone?

Zubin Wadia: For starters, we do not store conversations on disk. Everything happens in-memory within our system architecture. Conversation data in SecretSocial is, by default, ephemeral. There is no disk persistence. We take a page out of records management and essentially put a self-expiration tag on every conversation object. Conversations have to last between 15 minutes and 1 week. When time elapses, the objects are destroyed.
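
The self-expiration idea Wadia describes can be illustrated with a minimal in-memory sketch. This is not SecretSocial's implementation; the class `ConversationStore` and its method names are hypothetical, and only the 15-minute and one-week bounds come from the interview.

```python
import time

MIN_TTL = 15 * 60            # 15 minutes, in seconds
MAX_TTL = 7 * 24 * 60 * 60   # 1 week, in seconds


class ConversationStore:
    """Holds conversations in memory only, each with an expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # conversation id -> (expires_at, messages)

    def start(self, conv_id, ttl_seconds):
        if not MIN_TTL <= ttl_seconds <= MAX_TTL:
            raise ValueError("lifetime must be between 15 minutes and 1 week")
        self._items[conv_id] = (self._clock() + ttl_seconds, [])

    def post(self, conv_id, message):
        self._purge()
        self._items[conv_id][1].append(message)  # KeyError if expired or unknown

    def read(self, conv_id):
        self._purge()
        entry = self._items.get(conv_id)
        return list(entry[1]) if entry else None

    def terminate(self, conv_id):
        """Any participant may destroy the conversation early."""
        self._items.pop(conv_id, None)

    def _purge(self):
        now = self._clock()
        expired = [cid for cid, (exp, _) in self._items.items() if exp <= now]
        for cid in expired:
            del self._items[cid]


store = ConversationStore()
store.start("lunch-plans", ttl_seconds=MIN_TTL)
store.post("lunch-plans", "Noon at the usual place?")
print(store.read("lunch-plans"))
store.terminate("lunch-plans")
print(store.read("lunch-plans"))
```

Dropping a reference in Python does not scrub the underlying memory, so a production system would need more than this sketch to guarantee that the data is really gone.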

Also, traffic is encrypted end-to-end with TLS 1.0 and we take active measures to ensure nothing remains in the browser cache once a conversation concludes.

While we agree with Pete Warden on his assertion that you can't really anonymize your data, we do not think it applies to us because we don't aggregate or retain any of the conversation data beyond one week of the initiation. So passive correlations with other datasets to reveal identity are just not possible within the confines of our service.

What are your policies when it comes to sharing information with the authorities?

Zubin Wadia: Our privacy policy makes it clear that we will cooperate with authorities when a particular account comes under investigation. That being said, SecretSocial gives the user community immense power over what data is retained versus what is destroyed — which is very different from Facebook or Twitter.

It should also be understood that we take active measures to monitor, detect and ban users who are engaging in suspicious activities. We also anticipate the community being a powerful force in keeping SecretSocial a safe place for having unvarnished conversations.

How do you monitor? What kinds of things are considered "suspicious activities"?

Zubin Wadia: Monitoring is done through keyword matching right now. We discard all terms once the conversation time elapses or terminates. In the future it is going to get significantly more intelligent, but at this early stage it was the bare minimum we could do to keep people safe and not record or snoop.

Suspicious activities include anything related to terrorism or sexual predators. We have a zero-tolerance policy when proof of this is reported by one of our users or we pick up a pattern in an active conversation.

This interview was edited and condensed.


May 24 2011

The search for a minimum viable record

At first blush, bibliographic data seems like it would be a fairly straightforward thing: author, title, publisher, publication date. But that's really just the beginning of the sorts of data tracked in library catalogs. There's also a variety of metadata standards and information classification systems that need to be addressed.

The Open Library has run into these complexities and challenges as it seeks to create "one web page for every book ever published."

George Oates, Open Library lead, recently gave a presentation in which she surveyed audience members, asking them to list the five fields they thought necessary to adequately describe a book. In other words, what constitutes a "minimum viable record"? Akin to the idea of the "minimum viable product" for getting a web project coded and deployed quickly, the minimum viable record (MVR) could be a way to facilitate an easier exchange of information between library catalogs and information systems.

In the interview below, Oates explains the issues and opportunities attached to categorization and MVRs.

What are some of the challenges that libraries and archives face when compiling and comparing records?

George Oates: I think the challenges for compilation and comparison of records rest in different styles, and the innate human need to collect, organize, and describe the things around us. As Barbara Tillett noted in a 2004 paper: "Once you have a collection of over say 2,000 items, a human being can no longer remember every item and needs a system to help find things."

I was struck by an article I saw on a site called Apartment Therapy, about "10 Tiny Gardens," where the author surveyed extremely different decorations and outputs within remarkable constraints. That same concept can be dropped into cataloging, where even in the old days, when librarians described books within the boundaries of a physical index card, great variation still occurred. Trying to describe a book on a 3x5 card is oddly reductionist.

It's precisely this practice that's produced this "diabolical rationality" of library metadata that Karen Coyle describes [slide No. 38]. We're not designed to be rational like this, all marching to the same descriptive drum, even though these mythical levels of control and uniformity are still claimed. It seems to be a human imperative to stretch ontological boundaries and strive for greater levels of detail.

Some specific categorization challenges are found in the way people's names are cataloged. There's the very simple difference between "Lastname, Firstname" and "Firstname Lastname" or the myriad "disambiguators" that can help tell two authors with the same name apart — like a middle initial, a birthdate, title, common name, etc.
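
The simplest of these cases, the two name orders, can be sketched in code. The function `normalize_name` is a hypothetical illustration, not Open Library's import logic, and it deliberately ignores the harder disambiguators.

```python
def normalize_name(raw):
    """Rewrite 'Lastname, Firstname' as 'Firstname Lastname'."""
    raw = " ".join(raw.split())  # collapse stray whitespace
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        return f"{first} {last}".strip()
    return raw


print(normalize_name("Oates, George"))
print(normalize_name("George  Oates"))
```

Both calls produce the same string, so the two records can be matched. Anything with a second comma, such as a suffix or a birthdate, would be mangled by this rule, which is exactly the sort of special case that makes real catalogs hard.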

There are also challenges attached to the normal evolution of language, and a particular classification's ability to keep up. An example is the recent introduction of the word "cooking" as an official Library of Congress Subject Heading. "Cooking" supersedes "Cookery," so now you have to make sure all the records you have in your catalog that previously referred to "Cookery" now know about this newfangled "Cooking" word. This process is something of an ouroboros, although it's certainly made easier now that mass updates are possible with software.

A useful contrast to all this is the way tagging on Flickr was never controlled (even though several Flickr members crusaded for various patterns). Now, even from this chaos, order emerges. On Flickr it's now possible to find photos of red graffiti on walls in Brooklyn, all through tags. Using metadata "native" to a digital photograph, like the date it was taken, and various camera details, you can focus even deeper, to find photos taken with a Nikon in the winter of 2008. Even though that's awesome, I'm sure it rankles professionals since Flickr also has a bunch of photos that have no tags at all.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

In a blog post, you wrote about "a metastasized level of complexity." How does that connect to our need for minimum viable records?

George Oates: What I'm trying to get at is a sense that cataloging is a bit like case law: special cataloging rules apply in even the most specific of situations. Just take a quick glance at some of the documentation on cataloging rules for a sense of that. It's incredible. As a librarian friend of mine once said, "Some catalogers like it hard."

At Open Library, we're trying to ingest catalogs from all over the place, but we're constantly tripped up by fields we don't recognize, or things in fields that probably shouldn't be there. Trying to write an importing program that's filled with special treatments and exceptions doesn't seem practical since it would need constant tweaking to keep up with new styles or standards.

The desire to simplify this sort of thing isn't new. The Dublin Core (DC) initiative came out of a meeting hosted by OCLC in 1995. There are now 15 base DC fields that can describe pretty much anything, and DC is widely used as an approachable standard for all sorts of exchanges of data today. All in all, it's really successful.

Interestingly, after 16 years, DC now has an incorporated organization, loads of contributors, and documentation that seems much more complex than "just use these 15 fields for everything." As every good archivist would tell you, it's better to archive something than nothing, and to get as much information as you can from your source. The temptation for us is to keep trying to handle any kind of metadata at all times, which is super hard.

How do you see computers and electronic formats helping with minimum viable records?

George Oates: MVR might be an opportunity to create a simpler exchange of records. One computer says "Let me send you my MVR for an initial match." If the receiving computer can interpret it, then the systems can talk and ask each other for more.

The tricky part about digital humanities is that its lifeblood is in the details. For example, this section from the Tillett paper I mentioned earlier looked at the relationship between precision and recall:

Studies ... looked at precision and recall, demonstrating that the two are inversely related — greater recall means poorer precision and greater precision means poorer recall — high recall being the ability to retrieve everything that relates to a search request from the database searched, while precision is retrieving only those relevant to a user.

It's a huge step to sacrifice detail (hence, precision) in favor of recall. But, perhaps that's the step we need, as long as recall can elicit precision, if asked. Certainly in the case of computers, the less fiddly the special cases, the more straightforward it is to make a match.
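
The trade-off in the quoted passage is easy to see with numbers. In this sketch the record identifiers are made up, and `precision_recall` is a hypothetical helper using the standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of search results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up catalog: four records are truly relevant to the search.
relevant = {"b1", "b2", "b3", "b4"}

narrow_match = ["b1", "b2"]                                      # detailed, strict match
broad_match = ["b1", "b2", "b3", "b4", "x1", "x2", "x3", "x4"]   # loose, minimal match

print(precision_recall(narrow_match, relevant))
print(precision_recall(broad_match, relevant))
```

The strict match returns only correct records but misses half of them; the loose match finds them all at the cost of as many false matches.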

Photos: Catalogue by Helga's Lobster Stew, on Flickr; profile photo by Derek Powazek

This interview was edited and condensed.


  • The Library of the Commons: Rise of the Infodex

  • Rethinking museums and libraries as living structures

  • The quiet rise of machine learning
