

July 09 2013

SPIP loves Open_Data


Opening up data and enabling citizens to reuse it requires technical skills (...) SPIP removes the technical obstacles to publishing and distributing on the Internet.
Its latest version, SPIP 3.0, now makes it possible to use the Web as a database.

a nice presentation by cerdic on the DATA loop, with examples of #cartography, #YAML, #XML (yuck)…

August 14 2012

Shrinking and stretching the boundaries of markup

It’s easy to forget that XML started out as a simplification process, trimming SGML into a more manageable and more parseable specification. Once XML reached a broad audience, of course, new specifications piled on top of it to create an ever-growing stack.

That stack, despite solving many problems, brings two new issues: it’s bulky, and there are a lot of problems that even that bulk can’t solve.

Two proposals at last week’s Balisage markup conference examined different approaches to working outside of the stack, though both were clearly capable of working with that stack when appropriate.

Shrinking to MicroXML

John Cowan presented on MicroXML, a project that started with a blog post from James Clark and has since grown to be a W3C Community Group. Cowan has been maintaining the Editor’s Draft of MicroXML, as well as creating a parser library for it, MicroLark. Uche Ogbuji, also chair of the W3C Community Group, has written a series of articles about it.

Cowan’s talk was a broad overview of practical progress on a subject that had once been controversial. Even in the early days of XML 1.0, there were plenty of people who thought that it was still too large a subset of SGML, and the Common XML specification I edited was one effort to describe how practitioners typically used less than the full feature set XML made available. The subset discussion has resurfaced repeatedly over the years, with proposals from Tim Bray and others. In an age where JSON has replaced many data-centric uses of XML, there is less pressure to emphasize programming needs (data types, for example), but still demand for simplifying to a cleaner document model.

MicroXML is both a syntax — compatible with XML 1.0 — and a model, kept as absolutely simple as possible. In many ways, the group is making syntax decisions based on their impact on the model.

Ideally, Cowan said, the model would just be element nodes (with attribute maps) containing other element nodes and content. That focus on a clean model means, for example, that while you can declare XML namespaces in MicroXML, the namespace declarations are just attributes. The model doesn’t reflect namespace URIs, and applications have to do that processing. Similarly, whitespace handling is simplified, and attributes become a simple unordered map of key names to values. The syntax allows comments, but they are discarded in the model. Processing instructions remain a question, because they would complicate the model substantially, but the XML declaration and CDATA sections would go. Empty element tags are in, for now …
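That model is small enough to sketch in a few lines. The following is a rough illustration (not MicroLark's actual API, and the sample document is invented): it flattens an ordinary parsed XML tree into the three-item structure MicroXML proposes — element name, attribute map, and an ordered list of child elements and text. Comments are already discarded by the parser, matching the MicroXML model.

```python
# Rough sketch of the MicroXML data model: every element becomes
# [name, {attributes}, [children]], where children mix text and elements.
import xml.etree.ElementTree as ET

def to_model(elem):
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(to_model(child))
        if child.tail:
            children.append(child.tail)
    return [elem.tag, dict(elem.attrib), children]

doc = '<para kind="note">Hello <em>micro</em> world</para>'
print(to_model(ET.fromstring(doc)))
# -> ['para', {'kind': 'note'}, ['Hello ', ['em', {}, ['micro']], ' world']]
```

The result happens to look a lot like JSONML's element-as-array encoding, which is part of why round-tripping through JSON seems plausible.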

The pieces that drew the most controversy during questions were the current proposal to limit element and attribute names to the ASCII character set, and the demise of processing instructions. I mostly heard cheers for the end of draconian error handling, though there were memories of the bug-compatibility wars of an earlier age reminding the audience that it could in fact be worse, or weirder.

There may be, as Cowan noted, “no consensus on a single conversion path” to or from JSON, but MicroXML takes some steps in that direction, suggesting that JSONML should be able to support round-tripping of the MicroXML model, and JSONx could work for JSON to XML.

While I suspect that MicroXML has a bright future ahead of it in the document space, it seems unlikely to take much territory back from JSON in the data space. MicroXML doesn’t seem to be aiming at JSON at all, however.

Overlapping information and hierarchical tools

Very few programmers want to think about overlapping data structures. In most computing cases — bad pointers? — overlapping data structures are a complex mess. However, they’re extremely common in human communications. LMNL (Layered Markup and Annotation Language), itself a decade-long conversation that has suffered badly from decaying links, has always been an outsider in the normally hierarchical markup conversation. There may be conflicts between XML trees and JSON maps, but both of those become uncomfortable when they have to represent an overlapped structure.

Wendell Piez examined the challenges of processing overlapping markup — LMNL — with tools that expect XML’s neat hierarchies. Obviously, feeding this

[excerpt
  [source [date}1915{][title}The Housekeeper{]]
  [author
    [name}Robert Frost{]
    [dates}1874-1963{]]
}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]

into an XML parser would just produce error messages, even if those tags used angle brackets. Piez compiles that into XML, which represents the content separately from the specified ranges. It is, of course, not close to pretty, but it is processable. At the show, he demonstrated using this to create an annotated HTML format as well as a format that handles overlap gracefully: graphics, using SVG (or here as a PNG if your browser doesn’t like SVG).
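The standoff idea underneath that compilation step can be sketched with invented names: keep the text as a plain string, record each LMNL-style range as (name, start, end) offsets over it, and overlap becomes simple arithmetic instead of a parse error.

```python
# Hypothetical standoff sketch: ranges are (name, start, end) offsets
# over plain text, so no tree structure is imposed on the content.

def overlaps(a, b):
    """True when two ranges genuinely overlap: they intersect,
    but neither contains the other (containment a tree could nest)."""
    (_, s1, e1), (_, s2, e2) = a, b
    intersect = s1 < e2 and s2 < e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return intersect and not nested

# a verse line and a sentence that runs across the line break
line = ("l", 0, 40)
sentence = ("s", 30, 60)
print(overlaps(line, sentence))     # True: no single tree can hold both
print(overlaps(line, ("n", 0, 3)))  # False: the line number nests cleanly
```

This is exactly the situation where XML's neat hierarchies give up: the sentence range would need to close and reopen around the line boundary, which is what Piez's compiled representation works around.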

Should you be paying attention to LMNL? If your main information concerns fit in neat hierarchies or clean boxes, probably not. If your challenges include representing human conversations, or other data forms where overlap happens, you may find these components critical, even though making them work with existing toolsets is difficult.

August 09 2012

Applying markup to complexity

When XML exploded onto the scene, it ignited visions of magical communications, simplified document storage, and a whole new wave of application capabilities. Reality has proved calmer, with competition from JSON and other formats tackling a wide variety of problems, while the biggest of the big data problems have such volume that adding markup seems likely to create new problems.

However, at the in-progress Balisage conference, it’s clear that markup remains really good at solving a middle category of problems, where its richer structures can shine without creating headaches of volume or complication. In the past, Balisage often focused on hard problems most people didn’t yet have, but this year’s program tackles challenges that more developers are encountering as their projects grow in complexity.


JSON gave programmers much of what they wanted: a simple format for shuttling (and sometimes storing) loosely structured data. Its simpler toolset, freed of a heritage of document formats and schemas, let programmers think less about information formats and more about the content of what they were sending.

Developers using XML, however, have found themselves cut off from that data flow, spending a lot of time creating ad hoc toolsets for consuming and creating JSON in otherwise XML-centric toolchains. That experience is leading toward experiments with more formal JSON integration in XQuery and XSLT — and raising some difficult questions about XML itself.

XML and JSON look at data through different lenses. XML is a tree structure of elements, attributes, and content, while JSON is arrays, objects, and values. Element order matters by default in XML, while JSON is far less ordered and contains many more anonymous structures. A paper by Mary Holstege focused primarily on the possibilities of type introspection in XQuery, but her talk also ventured into how that might help address the challenges presented by the different types in JSON.
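The order mismatch is easy to demonstrate. This hypothetical "naive" converter (an illustration, not anything from Holstege's paper) folds ordered, repeatable XML children into a JSON object, and silently loses both order and data:

```python
# Naive XML-to-JSON sketch: map each child element to an object key.
# Repeated tags overwrite each other, and document order disappears.
import json
import xml.etree.ElementTree as ET

def naive_to_json(elem):
    if len(elem) == 0:
        return elem.text
    obj = {}
    for child in elem:
        obj[child.tag] = naive_to_json(child)  # second <item> clobbers the first
    return obj

doc = "<po><item>widget</item><note>rush</note><item>gear</item></po>"
print(json.dumps(naive_to_json(ET.fromstring(doc))))
# -> {"item": "gear", "note": "rush"}  (the first <item> is gone)
```

Any serious mapping has to decide where arrays appear and whether interleaved order is worth preserving — which is precisely the design space these talks were exploring.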

Eric van der Vlist, while recognizing that XSLT 3.0 is taking some steps to integrate JSON, reported on broader visions of an XML/JSON “chimera”, though he hoped to come up with something more elegant than the legendary but impractical creature. After his talk, he also posted some broader reflections on a data model better able to accommodate both XML and JSON expectations.

Jonathan Robie reflected on the growing importance of JSON (and his own initial reluctance to take it seriously) as semi-structured data takes over the many tasks it can solve easily. He described XML as shining at handling complex documents and the XML toolset as excellent support for a “hub format,” but also thought that the XML toolchain needs something like JSON. He compared the proposed XSLT 3.0 features for handling maps with JSONiq, and agreed with Holstege and van der Vlist that different expectations about the importance of order created the largest mismatches.

Hans-Jürgen Rennau offered probably the most optimistic take, describing a Unified Document Language – not a markup syntax, but a model that could accommodate varied approaches to representing data. His proposal did include concrete syntax for representing this model in XML documents, as well as a description of alternate markup styles that help represent the model beyond XML.

I don’t expect that any of these proposals, even if and when they are widely implemented, will immediately grab the attention of people happily using JSON. In the short term they will serve primarily as bridges for data, helping XML and JSON systems co-exist. In the longer term, however, they may serve as bridges between the cultures of the two formats. Both approaches have their limitations: XML is cumbersome in many cases, while JSON is awkward at representing ordered document structures.

JSON freed web developers to create much more complex applications with data formats that feel less complicated. As developers grow more and more ambitious, however, they may find themselves moving back into complex situations where the XML toolkit looks more capable of handling information without the overhead of vast quantities of custom code. If that toolkit supports their existing formats, mixing and matching should be easier.

Metadata, content and design

Markup and data types are themselves metadata, providing information about the data they encapsulate, but Balisage and its predecessor conferences have often focused on metadata structures at higher levels — the Semantic Web, RDF, Topic Maps, and OWL. So far, this year’s talks have been cautious about making big metadata promises.

Kurt Cagle gave the only talk on a subject that once seemed to dominate the conference: ontologies and the tools for managing them. His metadata stack was large, and still changing near the end of the work, growing to include SPARQL over HTTP. If Semantic Web technologies can settle into the small and focused groove Cagle described, it seems like they might find a comfortable niche in web infrastructure rather than attempting to conquer it.

Diane Kennedy discussed the PRISM Source Vocabulary, a similarly focused effort to apply technology to a set of problems in a particularly difficult context. The technology in the talk was unsurprising, but the social context was not: she described a missionary effort to bring metadata ideas from a very “content first” crowd to magazines, a very “design first” crowd. Multiple delivery platforms, notably the iPad, have made design-first communities more willing to consider at least a subset of metadata and markup technology.

Markup and programming language boundaries

While designers have historically been a difficult crowd to sell semantic markup to, programmers have been a skeptical audience about the value of markup — especially when “you got your markup in my programming language.” There are, of course, many programmers attending and speaking at Balisage, but the boundaries between people who care primarily about the data and those who care primarily about the processing are a unique and ever-changing combination of blurry and (cutting) sharp.

A number of speakers described themselves as “not programmers” and their work as “not programming” despite the data processing work they were clearly doing. Ari Nordström opened his talk on moving as much coding as possible to XML by discussing his differences with the C# culture he works in. In another talk, Yves Marcoux said “I am not a programmer” only to be told immediately by the audience, “Yes, you are!”

XML’s document-centric approach to the world seems to drive people toward declarative and functional programming styles. That is partly a side-effect of looking at so many documents that it becomes convenient to turn programs into documents, an angle that is very hard to explain to those outside the document circle. However, the strong tendencies toward functional programming emerge, I suspect, from the headaches of processing markup in “traditional” object-oriented or imperative programming. The Document Object Model, long the most-criticized aspect of JavaScript development, exemplifies this mismatch (compounded by a mismatch between Java and JavaScript object models, of course). As jQuery users and many other developers know, navigating a document tree through declarative CSS selectors is much easier.
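The difference shows up even in a few lines of Python, with ElementTree's XPath-like path expressions standing in for CSS selectors (the document and class names here are invented): the imperative version spells out the walk, the declarative version just states the match.

```python
# Same query two ways: manual tree-walking versus a declarative path.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<ul><li class='a'>one</li><li>two</li><li class='a'>three</li></ul>")

# imperative: walk the children and filter by hand
found = []
for li in doc:
    if li.get("class") == "a":
        found.append(li.text)

# declarative: one path expression says what to match, not how to find it
same = [li.text for li in doc.findall("li[@class='a']")]

assert found == same == ["one", "three"]
```

Scale that up to nested structures and optional elements, and the declarative style stops being a convenience and starts being the only maintainable option — which is roughly the pull toward functional approaches described above.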

Steven Pemberton’s talk on serialization and abstraction examined these kinds of questions in the context of form development. Managing user interactions with forms has long been labor-intensive, with developers creating ever-more complex (and often ever-more fragile) tools for forms-based validation and interactivity. Pemberton described how decisions made early in the development of a programming discipline can leave lingering and growing costs as approaches that seemed to work for simple cases grow under the pressure of increasing requirements. The XForms work attempts to leave the growing JavaScript snarl for a more-manageable declarative approach, but has not succeeded in changing web forms culture so far.

Jorge Luis Williams and David Cramer, both of Rackspace, found a different target for a declarative approach, mixing documentation into their code for validating RESTful services. The divide between REST and other web service approaches isn’t quite the same as the declarative / imperative divide, but I still felt a natural complement between the validation approach they were using and the underlying services they were testing.

A series of talks Tuesday afternoon poked and prodded at how markup could provide services to programming languages, exploring the boundaries between them. Economist Matthew McCormick discussed a system that provided documentation and structure to libraries of mathematical functions written in a variety of programming languages. Markup served as glue between the libraries, describing common functionality. David Lee wanted a toolset that would let him extract documentation from code — not just the classic JavaDoc extraction, but compilers reporting much of their internal information about variables in a processable markup format.

Norm Walsh started in different territory, discussing the challenges of creating compact non-markup formats for XML vocabularies, but wound up in a similar place. Attempting to shift a vocabulary from an XML format to a C-like syntax introduces dissonance even as it reduces verbosity. After noting the unusual success of the RELAX NG compact syntax and the simplicity that made it possible, he showed some of his own work on creating a compact syntax for XProc, declared it flawed, and showed a shift toward more programming-like approaches that eased some of the mismatch.

If you’re a programmer reading this, you may be wondering why these boundaries should matter to you. These frontiers tend to get explored from the markup side, and it’s possible that this work doesn’t solve a problem for you now. As conference chair Tommie Usdin put it in her welcome, however, Balisage is a place to “be exposed to some things that matter to other people but not to you — or at least not to you right now.”

October 03 2011

The agile upside of XML

In a recent interview, Anna von Veh, a consultant at Say Books, and Mike McNamara, managing director at Araman Consulting Ltd & Outsell-Gilbane UK Affiliate, talked about the role of XML, the state of ebook design, and the tech-driven future of the publishing industry.

McNamara and von Veh will expand on these ideas in their presentation at TOC Frankfurt next week.

Our interview follows.

Why should publishers adopt XML?

Mike McNamara: There are many benefits to be gained from implementing XML in a production workflow. However, it really depends on what the publisher wants to do. For example, journal publishers probably want to reuse their content in a number of different ways for differing products and specific target markets. XML can deliver this flexibility and reusability.

A UK legal publisher I worked with wanted to enrich the online content it delivers to clients. The publisher added more metadata to its XML content, allowing its new search environment to deliver more accurate and focused results to clients. A fiction book publisher, on the other hand, might want to produce simple ebooks from original Microsoft Word source files and might not see any real business or technical benefit to using XML (however, I do think this will change in the future). A simple XHTML-to-ebook process might be a better option for this type of publisher.

Anna von Veh: The very term "XML" can cause many people to run for the hills, so it's sometimes helpful to look at it differently. Do publishers want to ensure that their content is searchable and reusable for a variety of formats, in a variety of ways, for a variety of devices and even for devices that haven't yet been invented? Do they want to be able to deliver customized content to customers? If so, XML — and I include XHTML in this — is the way.

There are a number of issues. One is the value of putting legacy content into XML to make it more usable, discoverable and valuable to the publisher. The second is incorporating XML into the workflow for the front list. And then, of course, there is the question of when to incorporate XML into the workflow — at the authoring stage, editing, typesetting, post-final, etc.

While the format-centered model that most publishers are familiar with produces beautiful products, it is not one that is likely to flourish in the new world of digital publishing. Digital requires a much more rapid, flexible and agile response. Using XML, though, doesn't mean that design or creativity is dead. The hope is that it will help automate work that is being done manually over and over again, and allow publishers the freedom to focus on great ideas and creative use of their content.

What is the best way to integrate XML into an existing workflow?

Mike McNamara: I don't believe there is one "best" way. Again, it comes down to what best suits that particular publisher. "XML first," "XML last" and "XML in the middle" all have their own costs, implementation requirements and benefits. I tend to favor the XML-first option, as I believe it delivers more benefits for the publisher, though it would probably introduce more change for an organization than the other options (XML last and XML in the middle).

Anna von Veh: If you're a large publishing company with a bigger budget and lots of legacy content, then you might want to move to a full content management system (CMS) with an XML-first workflow. But a smaller publisher may want to focus on a digital-friendly Word and InDesign workflow that makes "XML last" easier. However, incorporating XML early into the workflow certainly has benefits. The challenges revolve around changing how you think about producing, editing and designing content and managing the change process.

How future-proof is XML? Will it be supplanted at some point by something like JSON?

Mike McNamara: XML is a very future-proof method for ensuring long-term protection of content. It is the format chosen by many digital archives and national libraries. True, JSON has become very popular of late, but it is mainly used today for API development, financial transactions, and messaging — and by web developers. I think JSON has a long way to go before it supplants XML — as we know and use it today — as a structured content format for use in publishing.


Is ebook design in a rut?

Mike McNamara: No, it's still developing. More thought needs to be put into adding value to the content before it gets to the ebook. Take travel guides, for example. If I want a travel book to use in the field, say on a hiking holiday, I don't want to have to carry the print product. I want the same content reconfigured as an ebook with a GPS/Wi-Fi environment added to use on my smartphone, with everything referenced from the same map that I saw in the print version.

Publishers need to get smarter with the data they have, and then deliver it in the different ways that users need.

Anna von Veh: Many current ebooks are conversions from printed books, either scanned from the printed copy or converted from PDFs. These ebooks weren't designed or planned as ebooks, and in addition, quality control was lacking after they'd been made into ebooks — and these are very bad advertisements for ebook design. Many new ebooks (i.e. those in the front list) are much better designed. However, most are still based on the idea of the print book.

A key thing to focus on is the fact that a screen is not contained in the same way that a printed book is, and that it is an entirely different format (see Peter Meyers' great A New Kind of Book blog, and upcoming "Breaking the Page" book). I think of ebook design as being much more akin to website design, which is why I advocate hiring web designers. I like the idea of starting with the web and going to print from there. It seems right for the digital age. Also, I think anyone working in book production today — both editors and designers — should learn some web skills. Hand coding simple EPUBs is a good way to practice, and it is relevant, too.

How will digital publishing change over the next five years? Are we headed toward a world where books are URLs?

Mike McNamara: More and more content will continue to be published online. Many reference publishers are already looking to add more value to content through metadata. This would allow clients to find the right content for their immediate context via sophisticated search engines. Some publishers already allow clients to build their own licensed versions of publications from the publishers' content repositories, with automatic updates being applied as and when needed.

Consequently, publishers will continue to move toward having even smaller, more focused chunks of XML data, allowing easy assembly into virtual publications. These will all be available to download and read on multiple devices, focusing on smartphones and tablets.

The combination of smarter XML (with multimedia information), smarter search engines and smarter reading devices will define how content is created and delivered over the coming years.

Anna von Veh: In answer to the first part of the question, it depends on what we understand "digital publishing" to mean. I like to think of it as the process of publishing — i.e. the workflow itself rather than the format. In terms of the process, yes, I think the web will have a big role to play (see PressBooks), but once again, it depends on how open publishers are to change.

Much will depend, too, on exactly who the publishers are in the next five years. I think it is highly likely that tech startups will make up a large piece of the publishing pie, though they may be bought up by larger publishers and tech companies. Some of the big vendors that hold much of the current knowledge of digital publishing (and therefore, perhaps, power) may move into publishing. There are also the smaller indie and self publishers that aren't hampered so much by legacy issues. On the other hand, big publishers have financial muscle and experience in content creation, design and editorial. It's an exciting time.

As for the format, I wouldn't bet against the web there, either. I'm a fan of the web in general (my favorite ebook reader is the browser-based Ibis Reader). In mainstream publishing, a lot of educational content is migrating to the web and learning management systems (LMS).

Even if books become URLs, what is needed is a cheap and easy print-on-demand (POD) home printer and bookbinder, or print "ATM," like the Espresso Book Machine. There are many situations where printed books are still required, not the least of which are countries in poorer parts of the world where the web is a luxury. Arthur Attwell's startup Paperight is a great POD idea designed for developing countries, and it also provides publishers with income. Mobile phones, too, are gaining ground in developing countries, and they're being used for a variety of innovative businesses. Smartphones could well become the main way to read content all over the world, whether that content is contained in ebooks, website books, or other forms.

But this just looks at the technology side of things. People bond emotionally with books and stories, with the authors who create them, and with other readers who share their interests. Potentially, connections could be built between readers and the editors and designers who shape the books. In this digitally connected but often physically separated world, all these connections are becoming both easier and more important, irrespective of what form the content takes or where it lives.

This interview was edited and condensed.




December 03 2010

Four short links: 3 December 2010

1. Data is Snake Oil (Pete Warden) -- data is powerful but fickle. A lot of theoretically promising approaches don't work because there's so many barriers between spotting a possible relationship and turning it into something useful and actionable. This is the pin of reality which deflates the bubble of inflated expectations. Apologies for the camel's nose of rhetoric poking under the metaphoric tent.
2. XML vs the Web (James Clark) -- resignation and understanding from one of the markup legends. I think the Web community has spoken, and it's clear that what it wants is HTML5, JavaScript and JSON. XML isn't going away but I see it being less and less a Web technology; it won't be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire. (via Simon Willison)
3. Understanding Pac Man Ghost Behaviour -- The ghosts’ AI is very simple and short-sighted, which makes the complex behavior of the ghosts even more impressive. Ghosts only ever plan one step into the future as they move about the maze. Whenever a ghost enters a new tile, it looks ahead to the next tile that it will reach, and makes a decision about which direction it will turn when it gets there. Really detailed analysis of just one component of this very successful game. (via Hacker News)
4. The Full Stack (Facebook) -- we like to think that programming is easy. Programming is easy, but it is difficult to solve problems elegantly with programming. I like to think that a CS education teaches you this kind of "full stack" approach to looking at systems, but I suspect it's a side-effect and not a deliberate output. This is the core skill of great devops: to know what's happening up and down the stack so you're not solving a problem at level 5 that causes problems at level 3.
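The one-tile lookahead in the third link is simple enough to sketch. In this toy version the grid, walls, and target are invented for illustration (the real game also has a fixed tie-breaking order and per-ghost targeting schemes): on entering a tile, a ghost already knows its next tile, and it chooses, among the legal exits from that next tile, the one that minimizes straight-line distance to its target.

```python
# Toy sketch of Pac-Man ghost decision-making: plan exactly one tile ahead.
DIRS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def choose_turn(next_tile, entering_dir, target, walls):
    """Pick the exit from next_tile that gets closest to target.
    Reversing direction is forbidden, as in the original game."""
    x, y = next_tile
    best = None
    for name, (dx, dy) in DIRS.items():
        if name == OPPOSITE[entering_dir]:
            continue  # ghosts never reverse
        nx, ny = x + dx, y + dy
        if (nx, ny) in walls:
            continue
        dist = (nx - target[0]) ** 2 + (ny - target[1]) ** 2
        if best is None or dist < best[0]:
            best = (dist, name)
    return best[1]

# heading right into tile (5, 5), chasing a target straight up at (5, 1)
print(choose_turn((5, 5), "right", (5, 1), walls={(5, 6)}))  # -> "up"
```

The "complex" flocking and ambushing behavior emerges entirely from which target tile each ghost picks, not from any deeper planning — which is what makes the analysis in that link so striking.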

October 18 2010

The gravitational pull of information

I just posted an interview with "Website Architecture and Design with XML" instructor Bob Boiko over on O'Reilly Answers. Much of the piece deals with the nuts and bolts of XML, but Boiko also discussed the relationship between content creators, designers, and programmers. I thought his response worth sharing here as well.

Are the lines between designers, content creators / coders, and programmers blurring?

Bob Boiko: The lines blurred a long time ago. It's been more than 20 years since I sat in my first software development team meeting with artists, writers, programmers, and managers all trying to figure out how to talk with each other. They continue to struggle today.

The language of business value and information structure is a great candidate for the common tongue we're looking for. Everyone on a team should be able to say why they are doing this project and what part they play. That's business value.

More specifically, in the center of a project you'll find the information that the project collects, stores and delivers. Designers need to understand the structure of that information in order to present it. Content creators need to understand it to originate or edit it. Programmers need to understand it to build the machinery that moves it from place to place.

The lines that need to blur are not at the center of the design, content, or programming disciplines, but rather where they meet. And where they meet is in the information.


October 08 2010

Four short links: 8 October 2010

1. Training Lessons Learned: Interactivity (Selena Marie Deckelmann) -- again I see parallels between how the best school teachers work and the best trainers. I was working with a group of people with diverse IT backgrounds, and often, I asked individuals to try to explain in their own words various terms (like “transaction”). This helped engage the students in a way that simply stating definitions can’t. Observing their fellow students struggling with terminology helped them generate their own questions, and I saw the great results the next day - when students were able to define terms immediately that took five minutes to work through the day before.
2. Software Evolution Storylines -- very pretty visualizations of code development, inspired by an xkcd comic.
3. asmxml -- XML parser written in assembly language. (via donaldsclark on Twitter)
4. Poetic License -- the BSD license, translated into verse. Do tractor workers who love tractors a lot translate tractor manuals into blank verse? Do the best minds of plumberkind kid around by translating the California State Code into haikus? Computer people are like other people who love what they do. Computer people just manipulate symbols, whether they're keywords in Perl or metrical patterns in software licenses. It's not weird, really. I promise.
