May 30 2011

The Species Problem

These days the term 'species' is thrown around as if it were a definite unit, but the realities of defining a species are much trickier, and more contentious, than they seem. Biologists have argued over the definition of a species for centuries, but have yet to come up with a single species concept that works for all types of life. This film explores current concepts used across the various fields of biology and covers some of the specific problems they encounter.

May 24 2011

The search for a minimum viable record

At first blush, bibliographic data seems like it would be a fairly straightforward thing: author, title, publisher, publication date. But that's really just the beginning of the sorts of data tracked in library catalogs. There's also a variety of metadata standards and information classification systems that need to be addressed.

The Open Library has run into these complexities and challenges as it seeks to create "one web page for every book ever published."

George Oates, Open Library lead, recently gave a presentation in which she surveyed audience members, asking them to list the five fields they thought necessary to adequately describe a book. In other words, what constitutes a "minimum viable record"? Akin to the idea of the "minimum viable product" for getting a web project coded and deployed quickly, the minimum viable record (MVR) could be a way to facilitate an easier exchange of information between library catalogs and information systems.

In the interview below, Oates explains the issues and opportunities attached to categorization and MVRs.

What are some of the challenges that libraries and archives face when compiling and comparing records?

George Oates: I think the challenges for compilation and comparison of records rest in different styles, and the innate human need to collect, organize, and describe the things around us. As Barbara Tillett noted in a 2004 paper: "Once you have a collection of over say 2,000 items, a human being can no longer remember every item and needs a system to help find things."

I was struck by an article I saw on a site called Apartment Therapy, about "10 Tiny Gardens," where the author surveyed extremely different decorations and outputs within remarkable constraints. That same concept can be dropped into cataloging, where even in the old days, when librarians described books within the boundaries of a physical index card, great variation still occurred. Trying to describe a book on a 3x5 card is oddly reductionist.

It's precisely this practice that's produced this "diabolical rationality" of library metadata that Karen Coyle describes [slide No. 38]. We're not designed to be rational like this, all marching to the same descriptive drum, even though these mythical levels of control and uniformity are still claimed. It seems to be a human imperative to stretch ontological boundaries and strive for greater levels of detail.

Some specific categorization challenges are found in the way people's names are cataloged. There's the very simple difference between "Lastname, Firstname" and "Firstname Lastname" or the myriad "disambiguators" that can help tell two authors with the same name apart — like a middle initial, a birthdate, title, common name, etc.
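
As a toy illustration of the name-form problem (this is not Open Library code, and the heuristic is far too naive for real authority control, which also has to handle initials, birth dates, titles, and other disambiguators), a normalizer might start like this:

    def normalize_name(raw):
        """Normalize 'Lastname, Firstname' or 'Firstname Lastname' to one form.

        A toy heuristic for illustration only.
        """
        raw = raw.strip()
        if "," in raw:
            last, _, first = raw.partition(",")
            return f"{first.strip()} {last.strip()}"
        return raw

    print(normalize_name("Dickens, Charles"))  # Charles Dickens
    print(normalize_name("Charles Dickens"))   # Charles Dickens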

There are also challenges attached to the normal evolution of language, and a particular classification's ability to keep up. An example is the recent introduction of the word "cooking" as an official Library of Congress Subject Heading. "Cooking" supersedes "Cookery," so now you have to make sure all the records in your catalog that previously referred to "Cookery" now know about this newfangled "Cooking" word. This process is something of an ouroboros, although it's certainly made easier now that mass updates are possible with software.
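
A mass update like the Cookery-to-Cooking change can be scripted in a few lines. The sketch below assumes records are plain dictionaries with a list of subject headings, which is an invented shape rather than any particular catalog format:

    records = [
        {"title": "The Joy of Cooking", "subjects": ["Cookery", "Cookery, American"]},
        {"title": "Mastering the Art of French Cooking", "subjects": ["Cookery, French"]},
    ]

    # Replace the superseded heading "Cookery" with "Cooking" wherever it appears,
    # including compound headings such as "Cookery, French".
    for record in records:
        record["subjects"] = [
            s.replace("Cookery", "Cooking") for s in record["subjects"]
        ]

    print(records)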

A useful contrast to all this is the way tagging on Flickr was never controlled (even though several Flickr members crusaded for various patterns). Now, even from this chaos, order emerges. On Flickr it's now possible to find photos of red graffiti on walls in Brooklyn, all through tags. Using metadata "native" to a digital photograph, like the date it was taken, and various camera details, you can focus even deeper, to find photos taken with a Nikon in the winter of 2008. Even though that's awesome, I'm sure it rankles professionals since Flickr also has a bunch of photos that have no tags at all.


In a blog post, you wrote about "a metastasized level of complexity." How does that connect to our need for minimum viable records?

George Oates: What I'm trying to get at is a sense that cataloging is a bit like case law: special cataloging rules apply in even the most specific of situations. Just take a quick glance at some of the documentation on cataloging rules for a sense of that. It's incredible. As a librarian friend of mine once said, "Some catalogers like it hard."

At Open Library, we're trying to ingest catalogs from all over the place, but we're constantly tripped up by fields we don't recognize, or things in fields that probably shouldn't be there. Trying to write an importing program that's filled with special treatments and exceptions doesn't seem practical since it would need constant tweaking to keep up with new styles or standards.

The desire to simplify this sort of thing isn't new. The Dublin Core (DC) initiative came out of a meeting hosted by OCLC in 1995. There are now 15 base DC fields that can describe pretty much anything, and DC is widely used as an approachable standard for all sorts of exchanges of data today. All in all, it's really successful.
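
For a sense of how compact that baseline is, here is a single book described with the 15 Dublin Core elements. The values are illustrative, and in DC every element is optional and repeatable:

    # The 15 Dublin Core elements, filled in for one book (illustrative values only).
    dublin_core_record = {
        "title": "A Tale of Two Cities",
        "creator": "Dickens, Charles",
        "subject": "Historical fiction",
        "description": "A novel set in London and Paris before and during the French Revolution.",
        "publisher": "Chapman & Hall",
        "contributor": "Browne, Hablot Knight (illustrator)",
        "date": "1859",
        "type": "Text",
        "format": "print",
        "identifier": "urn:example:0001",  # placeholder identifier
        "source": "Library catalog import",
        "language": "en",
        "relation": "",
        "coverage": "",
        "rights": "Public domain",
    }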

Interestingly, after 16 years, DC now has an incorporated organization, loads of contributors, and documentation that seems much more complex than "just use these 15 fields for everything." As every good archivist would tell you, it's better to archive something than nothing, and to get as much information as you can from your source. The temptation for us is to keep trying to handle any kind of metadata at all times, which is super hard.

How do you see computers and electronic formats helping with minimum viable records?

George Oates: MVR might be an opportunity to create a simpler exchange of records. One computer says "Let me send you my MVR for an initial match." If the receiving computer can interpret it, then the systems can talk and ask each other for more.
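
A sketch of that handshake might look like the following; the chosen MVR fields and the matching rule are assumptions for illustration, not an Open Library protocol:

    MVR_FIELDS = ("title", "author", "publisher", "date", "identifier")  # assumed minimum

    def to_mvr(full_record):
        """Reduce a full catalog record to a minimum viable record."""
        return {field: full_record.get(field) for field in MVR_FIELDS}

    def initial_match(mvr, catalog):
        """Return catalog records whose populated MVR fields agree; the two
        systems can then ask each other for the richer records behind a match."""
        return [
            record for record in catalog
            if all(record.get(f) == mvr[f] for f in MVR_FIELDS if mvr[f] is not None)
        ]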

The tricky part about digital humanities is that its lifeblood is in the details. For example, this section from the Tillett paper I mentioned earlier looked at the relationship between precision and recall:

Studies ... looked at precision and recall, demonstrating that the two are inversely related — greater recall means poorer precision and greater precision means poorer recall — high recall being the ability to retrieve everything that relates to a search request from the database searched, while precision is retrieving only those relevant to a user.

It's a huge step to sacrifice detail (hence, precision) in favor of recall. But, perhaps that's the step we need, as long as recall can elicit precision, if asked. Certainly in the case of computers, the less fiddly the special cases, the more straightforward it is to make a match.
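
The two measures have simple definitions, sketched below with the standard formulas (nothing here is specific to library catalogs):

    def precision_recall(retrieved, relevant):
        """precision = hits / number retrieved; recall = hits / number relevant."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # A broad search finds everything relevant (high recall) but mixes in noise (low precision).
    print(precision_recall(retrieved=["a", "b", "c", "d"], relevant=["a", "b"]))  # (0.5, 1.0)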

Photos: Catalogue by Helga's Lobster Stew, on Flickr; profile photo by Derek Powazek

This interview was edited and condensed.



Related:


  • The Library of the Commons: Rise of the Infodex

  • Rethinking museums and libraries as living structures

  • The quiet rise of machine learning

    January 07 2010

    Pew Research asks questions about the Internet in 2020

    Pew Research, which seems to be interested in just about everything,
    conducts a "future of the Internet" survey every few years in which
    they throw outrageously open-ended and provocative questions at a
    chosen collection of observers in the areas of technology and
    society. Pew makes participation fun by finding questions so pointed
    that they make you choke a bit. You start by wondering, "Could I
    actually answer that?" and then think, "Hey, the whole concept is so
    absurd that I could say anything without repercussions!" So I
    participated in their 2006 survey
    (http://www.pewinternet.org/Reports/2006/The-Future-of-the-Internet-II.aspx)
    and did it again this week. The Pew report will
    aggregate the yes/no responses from the people they asked to
    participate, but I took the exercise as a chance to hammer home my own
    choices of issues.

    (If you'd like to take the survey, you can currently visit

    http://survey.confirmit.com/wix2/p1075078513.aspx

    and enter PIN 2000.)

    Will Google make us stupid?

    This first question is not about a technical or policy issue on the
    Internet or even how people use the Internet, but a purported risk to
    human intelligence and methods of inquiry. Usually, questions about
    how technology affects our learning or practice really concern our
    values and how we choose technologies, not the technology itself. And
    that's the basis on which I address such questions. I am not saying
    technology is neutral, but that it is created, adopted, and developed
    over time in a dialog with people's desires.

    I respect the questions posed by Nicholas Carr in his Atlantic
    article--although it's hard to take such worries seriously when he
    suggests that even the typewriter could impoverish writing--and would
    like to allay his concerns. The question is all about people's
    choices. If we value introspection as a road to insight, if we
    believe that long experience with issues contributes to good judgment
    on those issues, if we (in short) want knowledge that search engines
    don't give us, we'll maintain our depth of thinking and Google will
    only enhance it.

    There is a trend, of course, toward instant analysis and knee-jerk
    responses to events that degrades a lot of writing and discussion. We
    can't blame search engines for that. The urge to scoop our contacts
    intersects with the starvation of funds for investigative journalism
    to reduce the value of the reports we receive about things that are
    important for us. Google is not responsible for that either (unless
    you blame it for draining advertising revenue from newspapers and
    magazines, which I don't). In any case, social and business trends
    like these are the immediate influences on our ability to process
    information, and searching has nothing to do with them.

    What search engines do is provide more information, which we can use
    either to become dilettantes (Carr's worry) or to bolster our
    knowledge around the edges and do fact-checking while we rely mostly
    on information we've gained in more robust ways for our core analyses.
    Google frees the time we used to spend pulling together the last 10%
    of facts we need to complete our research. I read Carr's article when
    The Atlantic first published it, but I used a web search to pull it
    back up and review it before writing this response. Google is my
    friend.

    Will we live in the cloud or the desktop?

    Our computer usage will certainly move more and more to an environment
    of small devices (probably in our hands rather than on our desks)
    communicating with large data sets and applications in the cloud.
    This dual trend, bifurcating our computer resources between the tiny
    and the truly gargantuan, has many consequences that other people
    have explored in depth: privacy concerns, the risk that application
    providers will gather enough data to preclude competition, the
    consequent slowdown in innovation that could result, questions about
    data quality, worries about services becoming unavailable (like
    Twitter's fail whale, which I saw as recently as this morning), and
    more.

    One worry I have is that netbooks, tablets, and cell phones will
    become so dominant that meaty desktop systems will rise in cost
    until they are within reach only of institutions and professionals.
    That will discourage innovation by the wider populace and reduce us to
    software consumers. Innovation has benefited a great deal from the
    ability of ordinary computer users to bulk up their computers with a
    lot of software and interact with it at high speeds using high quality
    keyboards and large monitors. That kind of grassroots innovation may
    go away along with the systems that provide those generous resources.

    So I suggest that cloud application providers recognize the value of
    grassroots innovation--following Eric von Hippel's findings--and
    solicit changes in their services from their visitors. Make their code
    open source--but even more than that, set up test environments where
    visitors can hack on the code without having to download much
    software. Then anyone with a comfortable keyboard can become part of
    the development team.

    We'll know that software services are on a firm foundation for future
    success when each one offers a "Develop and share your plugin here"
    link.

    Will social relations get better?

    Like the question about Google, this one is more about our choices
    than our technology. I don't worry about people losing touch with
    friends and family. I think we'll continue to honor the human needs
    that have been hard-wired into us over the millions of years of
    evolution. I do think technologies ranging from email to social
    networks can help us make new friends and collaborate over long
    distances.

    I do worry, though, that social norms aren't keeping up with
    technology. For instance, it's hard to turn down a "friend" request
    on a social network, particularly from someone you know, and even
    harder to "unfriend" someone. We've got to learn that these things are
    OK to do. And we have to be able to partition our groups of contacts
    as we do in real life (work, church, etc.). More sophisticated social
    networks will probably evolve to reflect our real relationships more
    closely, but people have to take the lead and refuse to let technical
    options determine how they conduct their relationships.

    Will the state of reading and writing be improved?

    Our idea of writing changes over time. The Middle Ages left us lots of
    horribly written documents. The few people who learned to read and
    write often learned their Latin (or other language for writing) rather
    minimally. It took a long time for academies to impose canonical
    rules for rhetoric on the population. I doubt that a cover letter and
    resume from Shakespeare would meet the writing standards of a human
    resources department; he lived in an age before standardization and
    followed his ear more than rules.

    So I can't talk about "improving" reading and writing without
    addressing the question of norms. I'll write a bit about formalities
    and then about the more important question of whether we'll be able to
    communicate with each other (and enjoy what we read).

    In many cultures, writing and speech have diverged so greatly that
    they're almost separate languages. And English in Jamaica is very
    different from English in the US, although I imagine Jamaicans try
    hard to speak and write in US style when they're communicating with
    us. In other words, people do recognize norms, but usage depends on
    the context.

    Increasingly, nowadays, the context for writing is a very short form
    utterance, with constant interaction. I worry that people will lose
    the ability to state a thesis in unambiguous terms and a clear logical
    progression. But because they'll be in instantaneous contact with
    their audience, they can restate their ideas as needed until
    ambiguities are cleared up and their reasoning is unveiled. And
    they'll be learning from others along the way. Making an elegant and
    persuasive initial statement won't be so important because that
    statement will be only the first step of many.

    Let's admit that dialog is emerging as our generation's way to develop
    and share knowledge. The notion driving Ibsen's Hedda Gabler--that an
    independent philosopher such as Ejlert Løvborg could write a
    masterpiece that would in itself change the world--is passé. A
    modern Løvborg would release his insights in a series of blogs
    to which others would make thoughtful replies. If this eviscerated
    Løvborg's originality and prevented him from reaching the
    heights of inspiration--well, that would be Løvborg's fault for
    giving in to pressure from more conventional thinkers.

    If the Romantic ideal of the solitary genius is fading, what model for
    information exchange do we have? Check Plato's Symposium. Thinkers
    were expected to engage with each other (and to have fun while doing
    so). Socrates denigrated reading, because one could not interrogate
    the author. To him, dialog was more fertile and more conducive to
    truth.

    The ancient Jewish scholars also preferred debate to reading. They
    certainly had some received texts, but the vast majority of their
    teachings were generated through conversation and were not written
    down at all until the scholars realized they had to in order to avoid
    losing them.

    So as far as formal writing goes, I do believe we'll lose the subtle
    inflections and wordplay that come from a widespread knowledge of
    formal rules. I don't know how many people nowadays can appreciate all
    the ways Dickens sculpted language, for instance, but I think there
    will be fewer in the future than there were when Dickens rolled out
    his novels.

    But let's not get stuck on the aesthetics of any one period. Dickens
    drew on a writing style that was popular in his day. In the next
    century, Toni Morrison, John Updike, and Vladimir Nabokov wrote in a
    much less formal manner, but each is considered a beautiful stylist in
    his or her own way. Human inventiveness is infinite and language is a
    core skill in which we all take pleasure, so we'll find new ways to
    play with language that are appropriate to our age.

    I believe there will always remain standards for grammar and
    expression that will prove valuable in certain contexts, and people
    who take the trouble to learn and practice those standards. As an
    editor, I encounter lots of authors with wonderful insights and
    delightful turns of phrase, but with deficits in vocabulary, grammar,
    and other skills and resources that would enable them to write better.
    I work with these authors to bring them up to industry-recognized
    standards.

    Will those in GenY share as much information about themselves as they age?

    I really can't offer anything but baseless speculation in answer to
    this question, but my guess is that people will continue to share as
    much as they do now. After all, once they've put so much about
    themselves up on their sites, what good would it do to stop? In for a
    penny, in for a pound.

    Social norms will evolve to accept more candor. After all, Ronald
    Reagan got elected President despite having gone through a divorce,
    and Bill Clinton got elected despite having smoked marijuana.
    Society's expectations evolve.

    Will our relationship to key institutions change?

    I'm sure the survey designers picked this question knowing that its
    breadth makes it hard to answer, but in consequence it's something of
    a joy to explore.

    The widespread sharing of information and ideas will definitely change
    the relative power relationships of institutions and the masses, but
    they could move in two very different directions.

    In one scenario offered by many commentators, the ease of
    whistleblowing and of promulgating news about institutions will
    combine with the ability of individuals to associate over social
    networking to create movements for change that hold institutions more
    accountable and make them more responsive to the public.

    In the other scenario, large institutions exploit high-speed
    communications and large data stores to enforce even greater
    centralized control, and use surveillance to crush opposition.

    I don't know which way things will go. Experts continually urge
    governments and businesses to open up and accept public input, and
    those institutions resist doing so despite all the benefits. So I have
    to admit that in this area I tend toward pessimism.

    Will online anonymity still be prevalent?

    Yes, I believe people have many reasons to participate in groups and
    look for information without revealing who they are. Luckily, most new
    systems (such as U.S. government forums) are evolving in ways that
    build in privacy and anonymity. Businesses are more eager to attach
    our online behavior to our identities for marketing purposes, but
    perhaps we can find a compromise where someone can maintain a
    pseudonym associated with marketing information but not have it
    attached to his or her person.

    Unfortunately, most people don't appreciate the dangers of being
    identified. But those who do can take steps to be anonymous or
    pseudonymous. As for state repression, there is something of an
    escalating war between individuals doing illegal things and
    institutions who want to uncover those individuals. So far, anonymity
    seems to be holding on, thanks to a lot of effort by those who care.

    Will the Semantic Web have an impact?

    As organizations and news sites put more and more information online,
    they're learning the value of organizing and cross-linking
    information. I think the Semantic Web is taking off in a small way on
    site after site: a better breakdown of terms on one medical site, a
    taxonomy on a Drupal-powered blog, etc.

    But Berners-Lee had a much grander vision of the Semantic Web than
    better information retrieval on individual sites. He's gunning for
    content providers and Web designers the world over to pull together
    and provide easy navigation from one site to another, despite wide
    differences in their contributors, topics, styles, and viewpoints.

    This may happen someday, just as artificial intelligence is looking
    more feasible than it was ten years ago, but the chasm between the
    present and the future is enormous. To make the big vision work, we'll
    all have to use the same (or overlapping) ontologies, with standards
    for extending and varying the ontologies. We'll need to disambiguate
    things like webbed feet from the World Wide Web. I'm sure tools to
    help us do this will get smarter, but they need to get a whole lot
    smarter.
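
    As a small illustration of how distinct URIs keep homonyms apart, here
    is a sketch using the rdflib library and an invented example.org
    namespace (the class names are made up for the example):

        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF, RDFS

        EX = Namespace("http://example.org/terms/")  # invented namespace

        g = Graph()
        # Two different URIs keep the homonyms unambiguous for machines.
        g.add((EX.WebbedFoot, RDF.type, RDFS.Class))
        g.add((EX.WebbedFoot, RDFS.label, Literal("webbed foot (anatomy)")))
        g.add((EX.WorldWideWeb, RDF.type, RDFS.Class))
        g.add((EX.WorldWideWeb, RDFS.label,
               Literal("World Wide Web (information system)")))

        print(g.serialize(format="turtle"))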

    Even with tools and protocols in place, it will be hard to get
    billions of web sites to join the project. Here the cloud may be of
    help. If Google can perform the statistical analysis and create the
    relevant links, I don't have to do it on my own site. But I bet
    results would be much better if I had input.

    Are the next takeoff technologies evident now?

    Yes, I don't believe there's much doubt about the technologies that
    companies will commercialize and make widespread over the next five
    years. Many people have listed these technologies: more powerful
    mobile devices, ever-cheaper netbooks, virtualization and cloud
    computing, reputation systems for social networking and group
    collaboration, sensors and other small systems reporting limited
    amounts of information, do-it-yourself embedded systems, robots,
    sophisticated algorithms for slurping up data and performing
    statistical analysis, visualization tools to report the results of
    that analysis, affective technologies, personalized and location-aware
    services, excellent facial and voice recognition, electronic paper,
    anomaly-based security monitoring, self-healing systems--that's a
    reasonable list to get started with.

    Beyond five years, everything is wide open. One thing I'd like to see
    is a really good visual programming language, or something along those
    lines that is more closely matched to human strengths than our current
    languages. An easy high-level programming language would immensely
    increase productivity, reduce errors (and security flaws), and bring
    in more people to create a better Internet.

    Will the internet still be dominated by the end-to-end principle?

    I'll pick up here on the paragraph in my answer about takeoff
    technologies. The end-to-end principle is central to the Internet. I
    think everybody would like to change some things about the current
    essential Internet protocols, but they don't agree on what those things
    should be. So I have no expectation of a top-to-bottom redesign of the
    Internet at any point in our viewfinder. Furthermore, the inertia
    created by millions of systems running current protocols would be hard
    to overcome. So the end-to-end principle is enshrined for the
    foreseeable future.

    Mobile firms and ISPs may put up barriers, but anyone in an area of
    modern technology who tries to shut the spigot on outside
    contributions eventually becomes last year's big splash. So unless
    there's a coordinated assault by central institutions like
    governments, the inertia of current systems will combine with the
    momentum of innovation and public demand for new services to keep
    chokepoints from being serious problems.
