November 21 2011

November 14 2011

Four short links: 14 November 2011

  1. Science Hack Day SF Videos -- the demos from Science Hack Day SF. The journey of a thousand miles starts with a Hack Day.
  2. A Cross-Sectional Study of Canine Tail-Chasing and Human Responses to It, Using a Free Video-Sharing Website (PLoSone) -- Approximately one third of tail-chasing dogs showed clinical signs, including habitual (daily or "all the time") or perseverative (difficult to distract) performance of the behaviour. These signs were observed across diverse breeds. Clinical signs appeared virtually unrecognised by the video owners and commenting viewers; laughter was recorded in 55% of videos, encouragement in 43%, and the commonest viewer descriptors were that the behaviour was "funny" (46%) or "cute" (42%).
  3. RSS Died For Your Sins (Danny O'Brien) -- if you have seven thousand people following you, a good six thousand of those are going to be people you don’t particularly like. The problem, as ever, is—how do you pick out the other thousand? Especially when they keep changing? I firmly believe that one of the pressing unsolved technological problems of the modern age is getting safely away from people you don't like, without actually throttling them to death beforehand, nor somehow coming to the conclusion that they don't exist, nor ending up turning yourself into a hateful monster.
  4. Generating Text from Functional Brain Images (Frontiers in Human Neuroscience) -- We built a model of the mental semantic representation of concrete concepts from text data and learned to map aspects of such representation to patterns of activation in the corresponding brain image. Turns out that the clustering of concepts in Wikipedia is similar to how they're clustered in the brain. They found clusters in Wikipedia, mapped to the brain activity for known words, and then used that mapping to find words for new images of brain activity. (via The Economist)

November 03 2011

Competition launched: WissensWert 2011

In this competition, Wikimedia Deutschland is once again making up to 5,000 euros available to selected initiatives that promote Free Knowledge. Individuals or groups have until 17 November 2011 to submit exciting project ideas that they have so far lacked the funds to carry out.

Participation is expressly open not only to members of the Wikipedia and Wikimedia community: representatives of the free culture movement, tinkerers and Creative Commons fans, and friends of the open web, open data, free software and free networks are likewise invited to submit their ideas.

In brief:
* Submission phase: until 17 November 2011
* Funding: between 500 and 5,000 euros per project
* Target group: anyone planning an exciting project around free knowledge

The author is a member of the jury that decides on the winners.

October 14 2011

Publishing News: Amazon fires up B&N and BAM

Here are the stories that caught my eye in the publishing space this week.

Amazon and DC Comics forge exclusive deal, B&N and BAM lash out

I first got wind of this story on Bleeding Cool when Neil Gaiman tweeted it. Basically, Amazon landed an exclusive deal with DC Comics to carry 100 of its best-selling graphic novels on the Kindle. B&N was first to take issue with the digital exclusivity and pulled the print versions of all 100 graphic novels from its shelves. This week, Books-A-Million, now the second largest chain bookstore in the US after the closing of Borders, joined the fray.

The reactions seem short-sighted and knee-jerk, especially in light of reports that the exclusive deal was for a limited period of four months. Handing Amazon, already a large retail platform, carte blanche over the print sales for that period seems a risky and questionable business strategy at best. CNN sums up the big-picture damage this fracas is causing: "Everyone is battling, and consumers are caught in the crossfire ..."

Content consumption increases threefold with Nook, Kindle

ShelfAwareness took a look at some of the digital publishing highlights from the conferences that took place this week ahead of the Frankfurt Book Fair. The increase in ereading is no surprise — Amazon announced in May that its sales of Kindle books surpassed print sales — but the speed of the transition and the accelerated ubiquity are notable. Some key points from the ShelfAwareness piece include:

  • Both Nook and Kindle users consume three times more content than they did before buying the device.
  • "Stephen Page, publisher and CEO of Faber & Faber, said that because of ebooks, the 85-year-old publishing house this year sold books in 20 countries where it had never sold a single book in the past."
  • The importance of digitizing backlists is becoming clear: Spanish publisher Santillana reports a substantial increase in sales after putting its backlist on the Kindle. "Before doing so, the ratio of sales of Santillana backlist titles in the U.S. to its other markets was 1:15; since the Kindle move, the ratio is 2:1."

What newspapers can learn from Wikipedia's success

Nieman Lab's Megan Garber took a look this week at Benjamin Mako Hill's research on the worldwide success of Wikipedia. Hill's analysis is interesting, but what really caught my eye was the application of that analysis to the newspaper industry, as suggested by Garber:

If you want user contributions, build platforms that are familiar and easy. Lower the barriers to participation; focus on helping users to understand what you want from them rather than on dazzling them. Though gamification — with incentives that encourage certain user behaviors, complete with individual rewards (badges! titles! mayors!) — certainly has a role to play in the new news ecosystem, Hill's findings suggest that the inverse of game dynamics can be a powerful force, as well. His research highlights the value of platforms that invite rather than challenge — and the validity of contributions made for the collective good rather than the individual.

These insights can also add to the discussion on the viability of paywalls, which saw some interesting activity this week as well. Press+ and the Knight Foundation teamed up to help college newspapers install metered paywalls, not so much to make broke college students pay to read their school's news as to provide a way to charge for subscriptions or solicit donations from parents and alumni outside the college community. In a similar vein, The Independent newspaper in the UK is going the paywall route as well, but only for readers outside the UK.


September 29 2011

Four short links: 29 September 2011

  1. Princeton Open Access Report (PDF) -- academics will need written permission to assign copyright of a paper to a journal. Of course, the faculty already had exclusive rights in the scholarly articles they write; the main effect of this new policy is to prevent them from giving away all their rights when they publish in a journal. (via CC Huang)
  2. Good Faith Collaboration -- a book on Wikipedia's culture, from MIT Press. Distributed, appropriately, under a Creative Commons Non-Commercial Share-Alike license.
  3. The Local-Global Flip -- an EDGE conversation (or monologue) by Jaron Lanier that contains more thought-provocation per column-inch than anything else you'll read this week. "[I]ncreasing efficiency by itself doesn't employ people. There is a difference between saving and making money when you're unemployed. Once you're already rich, saving money and making money is the same thing, but for people who are on the bottom or even in the middle classes, saving money doesn't help you if you don't have the money to save in the first place." and "The beauty of money is it creates a system of people leaving each other alone by mutual agreement. It's the only invention that does that that I'm aware of. In a world of finite limits where you don't have an infinite West you can expand into, money is the thing that gives you a little bit of peace and quiet, where you can say, 'It's my money, I'm spending it'." and "I'm astonished at how readily a great many people I know, young people, have accepted a reduced economic prospect and limited freedoms in any substantial sense, and basically traded them for being able to screw around online. There are just a lot of people who feel that being able to get their video or their tweet seen by somebody once in a while gets them enough ego gratification that it's okay with them to still be living with their parents in their 30s, and that's such a strange tradeoff. And if you project that forward, obviously it does become a problem." are things I'm still chewing on, many days after first reading.
  4. Trolled by Gerry Sussman (Bryan O'Sullivan) -- Bryan gave a tutorial on Haskell to a conference on leading-edge programming languages and distributed systems. At one point, Gerry had a pretty amusing epigram to offer. "Haskell is the best of the obsolete programming languages!" he pronounced, with a mischievous look. Now, I know when I’m being trolled, so I said nothing and waited a moment, whereupon he continued, "but don’t take it the wrong way—I think they’re all obsolete!"

September 20 2011

Four short links: 20 September 2011

  1. Plan 9 on Android -- replacing the Java stack on Android phones with Inferno. Inferno is the Plan 9-derived operating system originally from Bell Labs.
  2. SmartOS -- Joyent-created open source operating system built for virtualization. (via Nelson Minar)
  3. libtcod -- open source library for creating Rogue-like games. (via Nelson Minar)
  4. Wikipedia Miner -- toolkit for working with semantics in Wikipedia pages, e.g. find the connective topics that link two chosen topics. (via Alyona Medelyan)

September 15 2011

Strata Week: Investors circle big data

This was a busy week for data stories. Here are a few that caught my attention:

Big money for big data

There's recently been a steady stream of funding news for big data, database, and data mining companies. Last Thursday, Hadoop-based data analytics startup Platfora raised $5.7 million from Andreessen Horowitz. On Monday, 10gen announced it had raised $20 million for MongoDB, its open-source, NoSQL database. On Tuesday, Xignite said it had raised $10 million to build big data repositories for financial organizations; data storage provider Zetta announced a $9 million round; and Walmart announced it had acquired the ad targeting and data mining startup OneRiot (the terms of the deal were not disclosed). Finally, yesterday, big data analytics company Opera Solutions announced that it had raised a whopping $84 million in its first round of funding.

GigaOm's Derrick Harris offers the story behind Opera Solutions' massive round of funding, noting that the company was already growing fast and doing more than $100 million per year in revenue. He also points to the company's penchant for hiring PhDs (90 so far), "something that makes it more akin to blue-chipper IBM than to many of today's big data startups pushing Hadoop or NoSQL technologies." Harris also notes that at a half-billion-dollar valuation and with 600-plus employees, Opera Solutions isn't a great acquisition target for other big companies, even those wanting to beef up their analytics offerings. He contends this could allow Opera Solutions to remain independent and perhaps make some acquisitions of its own.

Ushahidi and Wikipedia team up for WikiSweeper

The crisis-mapping platform Ushahidi unveiled a new tool this week to help Wikipedia editors track changes and verify sources on articles. The project, called WikiSweeper, is aimed at those highly- and rapidly-edited articles that are associated with major events.

As Ushahidi writes on its blog:

When a globally-relevant news story breaks, relevant Wikipedia pages are the subject of hundreds of edits as events unfold. As each editor looks to editing and maintaining the quality and credibility of the page, they need to manually track the news cycle, each using their own spheres of reference. The decisions that are made to accept one source while rejecting others remains opaque, as are the strategies that editors develop to alert and keep track of the latest information coming in from a variety of different sources.

WikiSweeper is based on Ushahidi's own open-source Sweeper tool, and its application to Wikipedia will help Ushahidi in turn build out its own project. After all, during major events, information comes in from multiple sources at a breakneck pace, and in crisis response, the accuracy and trustworthiness of the sources need to be quickly and transparently identified. As Ushahidi points out, this makes it a "win-win" for both organizations as they gain better tools for dealing with real-time news and social data.

Angry Birds take down pigs and the economy

Invoking the seasonal declarations come March about the amount of time Americans waste at work watching the NCAA college basketball tournament, The Atlantic's Alexis Madrigal has pointed to a far more insidious and year-round problem: the number of hours American workers lose by playing Angry Birds.

Drawing on data about the number of minutes people spend playing Angry Birds per day — 200 million — Madrigal has calculated the resulting lost hours and lost wages. He estimates about 43,333,333 on-the-clock hours are spent playing Angry Birds each year, accounting for $1.5 billion in lost wages per year.

Obviously there are some really big assumptions in this calculation. The first is that five percent of the total Angry Bird hours are played by Americans at work ... we don't know the international breakdown, nor do we know how often people play at work. But, five percent seemed like a reasonable assumption. Second, the Pew income data for smartphone ownership is not that precise, particularly on the upper ($75,000+) and lower (less than $30,000) ends. I had to pick numbers, so I basically split Americans up into four categories: people earning $30,000, $50,000, $75,000, and $100,000, then I calculated simple hourly wages for those groups (income/52/40) and did a weighted average based on smartphone adoption in those categories. The $35 per hour number I used is comparable with the $38 that Challenger, Gray, and Christmas used for fantasy sports players. But this is certainly a rough approximation. Put it this way: I bet this estimate is right to the order of magnitude, if not in the details.
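The headline figures can be reproduced with a few lines of arithmetic. The sketch below is a reconstruction, not Madrigal's own calculation: the 260 working days per year is an assumption (it is not stated above), chosen because it reproduces the 43,333,333-hour figure, and all names are illustrative.

```python
# Inputs quoted in the text above.
MINUTES_PER_DAY = 200_000_000   # Angry Birds minutes played per day
AT_WORK_SHARE = 0.05            # share assumed to be played by Americans at work
HOURLY_WAGE = 35                # weighted-average hourly wage in dollars

# Assumption, not from the text: 52 weeks x 5 working days.
WORKING_DAYS_PER_YEAR = 260


def simple_hourly_wage(annual_income):
    """Hourly wage using the income/52/40 rule described in the quote."""
    return annual_income / 52 / 40


hours_per_day = MINUTES_PER_DAY / 60
lost_hours = hours_per_day * AT_WORK_SHARE * WORKING_DAYS_PER_YEAR
lost_wages = lost_hours * HOURLY_WAGE

print(f"Lost hours per year: {lost_hours:,.0f}")
print(f"Lost wages per year: ${lost_wages:,.0f}")
for income in (30_000, 50_000, 75_000, 100_000):
    print(f"${income:,} per year is about ${simple_hourly_wage(income):.2f} per hour")
```

Under these assumptions the result is about 43.3 million hours and roughly $1.5 billion, in line with the figures quoted above. The weighted average behind the $35 wage is not reproduced here because the smartphone-adoption weights are not given.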

Take that, Gladwell

Malcolm Gladwell raised the ire of many social-media-savvy activists last year by claiming that "the revolution will not be tweeted." Writing in The New Yorker, Gladwell dismissed social media as a tool for change. He argued that bonds formed online are "weak" and unable to withstand the sorts of demands necessary for social change.

Gladwell's assertions have been countered in many places, and a new article analyzing social media's role in the Arab Spring takes the rebuttals to a new level.

"After analyzing over 3 million tweets, gigabytes of YouTube content and thousands of blog posts, a new study finds that social media played a central role in shaping political debates in the Arab Spring. Conversations about revolution often preceded major events on the ground, and social media carried inspiring stories of protest across international borders," the authors write.

The authors describe their research methodology for extracting and analyzing the texts from blogs and tweets, but also lament some of the problems they faced, particularly with access to the Twitter archive.

Got data news?

Feel free to email me.


August 18 2011

July 11 2011

The week in review: Enquete dispute, content associations, Wiki-Watch

The Internet Enquete commission is arguing over its interim report, content associations are demanding retained data to combat copyright infringement, against the Pr


June 10 2011

Pavel Richter: Copyright law must become more of a participation right

The shortcoming of current copyright law is that access to Free Knowledge and forms of collaborative creation of works wei


June 08 2011

Four short links: 8 June 2011

  1. Who Writes Wikipedia -- reported widely as "bots make most of the contributions to Wikipedia", but which really should have been "edits are a lousy measure of contributions". The top bots are doing things like ensuring correctly formatted ISBN references and changing the names of navboxes--things which could be done by humans but which it would be a scandalous waste of human effort if they were. We analyse edits because it's easy to get data on edits; analysis of value is a different matter.
  2. How I Failed and Finally Succeeded at Learning How to Code (The Atlantic) -- great piece on teaching and learning programming, focusing on Project Euler. Kids are naturally curious. They love blank slates: a sandbox, a bag of LEGOs. Once you show them a little of what the machine can do they'll clamor for more. They'll want to know how to make that circle a little smaller or how to make that song go a little faster. They'll imagine a game in their head and then relentlessly fight to build it. Along the way, of course, they'll start to pick up all the concepts you wanted to teach them in the first place. And those concepts will stick because they learned them not in a vacuum, but in the service of a problem they were itching to solve.
  3. The Believing Brain -- Belief comes quickly and naturally, skepticism is slow and unnatural, and most people have a low tolerance for ambiguity.
  4. 3D Printed Rocket -- stainless steel rocket engine.

May 26 2011

Strata Week: The mortality rate of URLs

Here are some of the data stories that caught my eye this week.

Pinboard examines link rot

Pinboard's founder Maciej Ceglowski has analyzed URLs bookmarked on the site in order to examine "link rot — the depressing phenomenon in which perfectly healthy URLs stop working just a few years after appearing online." Ceglowski took a random sample of 300 URLs from every year between 1997 and 2011 in order to ascertain if the decay of URLs was linear or if, like plutonium, they tend to have a half life.

Almost half of the links from 1997 are dead, Ceglowski found. Roughly a quarter of the links from 2002 to 2006 are dead. And even 6% of links bookmarked in 2011 no longer resolve. Ceglowski has published the full results of his analysis.
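To make the linear-versus-half-life question concrete, the sketch below computes the half-life that a single observation would imply if link survival followed exponential decay. The decay model, the function name, and the example ages are assumptions for illustration (for instance, treating 1997 links as 14 years old in 2011 and reading "almost half" as exactly one half); this is not Ceglowski's method.

```python
import math


def implied_half_life(fraction_alive, age_years):
    """Half-life T (years) solving fraction_alive = 0.5 ** (age_years / T)."""
    if not 0 < fraction_alive < 1:
        raise ValueError("fraction_alive must be strictly between 0 and 1")
    return age_years * math.log(0.5) / math.log(fraction_alive)


# (label, fraction of links still working, assumed age in years)
observations = [
    ("1997 links", 0.50, 14),
    ("2002-2006 links", 0.75, 7),  # assumed midpoint age of the 2002-2006 range
]

for label, alive, age in observations:
    print(f"{label}: implied half-life {implied_half_life(alive, age):.1f} years")
```

If the implied half-lives for different cohorts roughly agree, a single exponential is a plausible fit; if they diverge, the decay is not a clean half-life process.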

Pinboard proportion of working links chart.

Ceglowski does note that there are some problems assessing the mortality of links: some dead links actually redirect and dead domains often end up full of ads. He asks some interesting questions about his methodology — Is there a simple programmatic way to detect parked domains? What is the attrition rate for shortened links?

He's posted the raw data for others to analyze.

Where in the world is Wikipedia edited?

According to Wikipedia, there have been some 463 million edits to the site, roughly 19 edits per page. Wikimedia's data analyst Erik Zachte has unveiled a new visualization that shows exactly where in the world these edits are occurring on any given day for the various language editions of Wikipedia.

The visualization is interactive: using various keyboard shortcuts, you can navigate between different views and event markers. You can zoom into a particular area (with the + key), for example, or filter the edits by language (with the space bar). There are three types of visualizations available with this new tool: an animation of edits, a bubble map, and a heat map, all highlighting the 400,000+ edits that occur in a given day.

The tool reveals some interesting trends. Not surprisingly, different language versions are more or less active depending on the time zone. It also demonstrates that most edits to the Chinese-language Wikipedia come from outside mainland China.

Zachte has written a blog post explaining how he created the visualization tool using HTML5 and JavaScript. He also addresses some of the measures he took to guard the privacy of Wikipedia authors, including adjusting the timestamps and rounding the latitude and longitude to a half degree.
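The coordinate rounding is simple to illustrate. This is a minimal sketch, not Zachte's code: the function name is hypothetical, the example coordinates are illustrative, and "to a half degree" is interpreted here as snapping to the nearest multiple of 0.5.

```python
def round_to_half_degree(value):
    """Snap a latitude or longitude to the nearest 0.5 degree."""
    return round(value * 2) / 2


# Example coordinates (illustrative values, not taken from the tool).
lat, lon = 37.7749, -122.4194
coarse = (round_to_half_degree(lat), round_to_half_degree(lon))
print(coarse)
```

Half a degree of latitude is roughly 55 km, so a rounded point identifies a region rather than an address.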

Analyzing iPhone autocorrect errors

Although meant to be a helpful feature, the iPhone autocorrect has generated plenty of laughs with its spelling and word suggestions.

Following this tweet by Andrew Parker:

My iPhone auto-corrected "Harvard" to "Garbage". Well played Apple engineers.

Brendan O'Connor decided to take a closer look at these autocorrection errors:

I was wondering how this would happen, and then noticed that each character pair has 0 to 2 distance on the QWERTY keyboard. Perhaps their model is eager to allow QWERTY-local character substitutions.

>>> zip('harvard', 'garbage')
[('h', 'g'), ('a', 'a'), ('r', 'r'), ('v', 'b'), ('a', 'a'), ('r', 'g'), ('d', 'e')]

O'Connor wonders if it's a problem with the corpus of the iOS language model or if that language model is under-penalizing the edit distance. Commenters on the post contend the problem is that the iOS language model is generic and not personalized. Moreover, the model doesn't actually account for the last word typed, so it tends to make non-grammatical suggestions.
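O'Connor's "0 to 2 distance" observation can be checked with a small sketch. The layout model here is an assumption: the three letter rows are treated as an unstaggered grid and distance is measured as Manhattan distance, which may differ from whatever metric O'Connor or the iOS model actually uses. All names are illustrative.

```python
# Letter rows of a QWERTY keyboard, treated as an unstaggered grid (an assumption).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POSITION = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def key_distance(a, b):
    """Manhattan distance between two letter keys on the grid above."""
    (r1, c1), (r2, c2) = POSITION[a], POSITION[b]
    return abs(r1 - r2) + abs(c1 - c2)


pairs = list(zip("harvard", "garbage"))
distances = [key_distance(a, b) for a, b in pairs]
print(distances)  # [1, 0, 0, 1, 0, 2, 1]
```

Under this grid model every substitution is at most two keys away, which is consistent with the idea that the model tolerates keyboard-local substitutions.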


May 02 2011

Four short links: 2 May 2011

  1. Chinese Internet Cafes (Bryce Roberts) -- a good quick read. My note: people valued the same things in Internet cafes that they value in public libraries, and the uses are very similar. They pose a similar threat to the already-successful, which is why public libraries are threatened in many Western countries.
  2. SIFT -- the Scale Invariant Feature Transform library, built on OpenCV, is a method to detect distinctive, invariant image feature points, which easily can be matched between images to perform tasks such as object detection and recognition, or to compute geometrical transformations between images. The licensing seems dodgy--MIT code but lots of "this isn't a license to use the patent!" warnings in the LICENSE file. (via Joshua Schachter)
  3. The Secret Life of Libraries (Guardian) -- I like the idea of the most-stolen-books revealing something about a region; it's an aspect of data revealing truth. For a while, Terry Pratchett was the most-shoplifted author in England but newspapers rarely carried articles about him or mentioned his books (because they were genre fiction not "real" literature). (via Brian Flaherty)
  4. Sweble -- MediaWiki parser library. Until today, Wikitext had been poorly defined. There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. (via Dirk Riehle)

April 15 2011


March 25 2011

Four short links: 25 March 2011

  1. Bruce Sterling at SxSW (YouTube) -- call to arms for "passionate virtuosity". (via Mike Brown)
  2. Developer Support Handbook -- Pamela Fox's collected wisdom from years of doing devrel at Google.
  3. Wikipedia Beautifier -- Chrome plugin that makes Wikipedia easier on the eyes.
  4. science.I/O -- an open science community. Comment on, recommend and submit papers. Get up-to-date on a research topic. Follow a journal or an author. science.I/O is in beta and is currently focused on Computer Science.

December 23 2010

Good as news

We can see design thinking at work in web phenomena such as Facebook, Twitter and YouTube, but the predicament of printed news remains an unsolved problem

In the 1850s, a New York publisher announced that newspapers were dead: he had seen a telegraph in action. In fact, the immediacy of the telegraph made people much hungrier for news from hundreds of miles away, and proved a major catalyst in the growth of newspapers.

The telegraph story is told by Arthur Sulzberger Jr, the publisher of the New York Times, in a new book called Designing Media. His interlocutor is Bill Moggridge, the man who designed the first laptop in 1980, went on to found IDEO, the largest design firm in the world, and is currently the director of the Cooper-Hewitt National Design Museum in New York. Sulzberger is one of 37 people that Moggridge interviews in the book, from editors and TV producers to the founders of Wikipedia, Facebook, YouTube and Twitter. It's a veritable Who's Who of the people who have revolutionised media in the last decade.

Reading the interviews (excerpts of which you can also download and watch on video), I had one question at the front of my mind: what, exactly, is the relationship between design and the media revolution we are experiencing? Or, to put it another way, why is this book – which contains many fascinating insights into the way media work, some of them design-related but most of them not – entitled Designing Media? I didn't find the explanation in the book, so I called up Moggridge to ask him. His answer was simple: because media is a form of design. In fact, he argued, everything is a form of design.

To be honest, I suspected he would say that. Most people may still think that "design" refers to manufactured objects – chairs, telephones and cars – but designers have become far more expansive in their worldview. They now design customer experience and services, from internet banking systems to patient flow in a hospital. Businesses are rapidly latching on to the notion of "design thinking" – the idea that the creative problem-solving used by designers can be applied outside of traditional design – as a means of becoming more effective. Moggridge himself is a paragon of the designer dissolving the boundaries of his discipline. He is the godfather of interaction design, which started out as the design of electronic interfaces but now refers to the design of any form of user experience, from navigating a BlackBerry to paying at a checkout.

From there, it takes no great leap of imagination to understand media as design. After all, many of the new media moguls are software designers. Indeed, Chad Hurley, the founder of YouTube, started out as a graphic designer (probably the only graphic designer in history to become a billionaire). I buy the argument that design thought processes can be applied to almost anything – whether that means we call those things "design" is a semantic discussion we'll save for another time. But I find it easier to understand the argument in relation to new media rather than traditional media. It doesn't seem far fetched at all to describe social networking platforms such as Facebook and Twitter, and user-generated content sites such as Wikipedia and YouTube as forms of design.

Wikipedia's Jimmy Wales actually describes what he does as "community design". It sounds like a form of social engineering, but what he means by the phrase is that Wikipedia is not just an anarchic piece of crowd-sourcing: it's a carefully designed eco-system. If people are going to work on an encyclopedia for free, you have to create the conditions in which they're willing to do so, by giving them recognition and not profiting from their labour. It was important to Wales to make Wikipedia an open system, and so it was designed around the principle that most people are honest and well-intentioned, rather than making it a closed shop to exclude the few bad apples who want to write false or slanderous entries – in truth, he tried the closed system first with Nupedia and it failed. Yet, while it's true that anyone can write or edit an entry on Wikipedia, everything there is carefully monitored. It's often described as "democratic", but Wales himself thinks of it more as a monarchy, with the writers overseen by moderators who are in turn overseen by the king – King Jimbo, as he's known. So the design aspect isn't just how the website looks, it's how users create the content.

Immediately you can see how different design rules suggest different ideologies. Like Wales, Facebook's Mark Zuckerberg is also fixated on the idea of openness. He fervently believes that designing a platform for people to share personal information helps make the world a more open place. And he found that making things human – "just seeing someone's face" – works best. It could have all looked like email, with its Spartan text-only interface that betrays its origins in the military. But it doesn't. It's designed to make people feel more present, and engaged with a community rather than an individual. Moggridge is right to suggest that the secret to Zuckerberg's success – you may have seen him on the cover of Time this month – lies in having designed a social network where there is no layer of technology getting in people's way.

However, here's the question. We all know that the media are in a turbulent state of flux, but in what way does reading the situation as "design" help? Is it just semantics, down to the fact that the word "design" is just so malleable? Paola Antonelli, senior design curator at New York's Museum of Modern Art, doesn't think so. She recently predicted in the Economist that in the near future designers would be involved in everything from science to politics. She sees design as the uber-profession, with a skill-set that transcends all boundaries. "For a simple reason: one of design's most fundamental tasks is to help people deal with change," she says.

The design world is in confident mood, but for these predictions to come true the rest of the world needs to buy into the argument. If I was Arthur Sulzberger Jr, I'd be thinking about how designers could get me out of a massive dilemma that was costing my company hundreds of millions of dollars a year. There's only one reason why newspapers haven't yet gone the way of the telegraph and that's because they still make about 20 times more advertising revenue than websites. If you were to grant Sulzberger just one wish, I have no doubt that he would reply: I wish someone would design a way for us to make as much advertising revenue from the website as we used to make from the newspaper. Banner ads? Forget it. The fact that you can't give over most of a webpage to an ad the way you could a printed page is simply because we've all been conditioned by the early days of the web when everything was free. There's a design challenge that everyone's trying to crack.

© Guardian News & Media Limited 2010

December 02 2010

Strata Gems: Use Wikipedia as training data

We'll be publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Try MongoDB without installing anything.

One of the most exciting analytical techniques is natural language processing and sentiment analysis. Given natural language text, can we use a computer to discover what's being said? Applications range from user interfaces to marketing and espionage.

The hard part of the problem is teaching a computer what words mean, and working out from context which meaning of a word applies. The word "apple" could refer to the fruit, the computer company, or the Beatles' record label. Or a bank of the same name, the rock band, New York City, the singer Fiona Apple; the list goes on.

One answer is to use a classifier, which can differentiate between the different contexts in which a word is used in order to determine its sense. Most anti-spam filtering solutions use a classifier. Classifiers must be trained to be effective though, as anybody who has used anti-spam systems will tell you.

It's relatively easy to differentiate between spam and non-spam email, but how do you go about breaking down the English language to find training data for each word sense?

Fortunately, there's a large open data source available that has put a lot of effort into the disambiguation of terms such as "apple" - Wikipedia. Data scientists often use information from Wikipedia to aid in the identification of real world entities in their work, and its use for disambiguation has been described in several reports, including this 2007 paper from Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation (PDF).

The key concept is that in the Wikipedia article for the Apple computer company, the word "apple" is used in the context of meaning the company, so you can use it to train natural language classifiers for that sense of the word. The Wikipedia article for apple the fruit offers a similar corpus for the fruity context, and so on. The Wikipedia URL for a particular concept is an unambiguous tag that you can then use to identify word sense.

Fortunately, you don't need to be a deep researcher to start using Wikipedia in this way. A recent blog post from Jim Plush shows how to use Wikipedia and Python to disambiguate words from Twitter posts. With a relatively brief Python script and training data culled from Wikipedia, Plush was able to distinguish between apple the fruit and Apple the company in the text of Twitter posts mentioning "apple".
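To make the idea concrete, here is a minimal sketch of a word-sense classifier in plain Python. It is not Plush's script: it is a small naive Bayes classifier written for illustration, and it rests on several assumptions. The training snippets are hand-written placeholders standing in for article text (nothing is fetched from Wikipedia), the sense tags "wiki/Apple_Inc." and "wiki/Apple" are just label strings modelled on article paths, and the function names and stopword list are hypothetical.

```python
import math
import re
from collections import Counter

# Placeholder text standing in for each Wikipedia article (an assumption:
# these are hand-written snippets, not real Wikipedia excerpts).
# The keys are illustrative sense tags modelled on article paths.
TRAINING_TEXT = {
    "wiki/Apple_Inc.": "apple designs computers phones and software "
                       "the company sells the mac iphone and ipad",
    "wiki/Apple": "the apple is a sweet fruit that grows on a tree "
                  "apples are eaten fresh or baked in pie",
}

# Words that carry no sense information, including the ambiguous word itself.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "in", "on",
             "of", "that", "to", "apple", "apples"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def train(corpus):
    """Count words per sense and collect the shared vocabulary."""
    counts = {sense: Counter(tokenize(text)) for sense, text in corpus.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(text, counts, vocab):
    """Return the sense with the highest naive Bayes log-score.

    Priors are assumed equal, and add-one smoothing keeps unseen
    words from zeroing out a sense. Words outside the training
    vocabulary are ignored.
    """
    scores = {}
    for sense, word_counts in counts.items():
        total = sum(word_counts.values())
        score = 0.0
        for word in tokenize(text):
            if word in vocab:
                score += math.log((word_counts[word] + 1) / (total + len(vocab)))
        scores[sense] = score
    return max(scores, key=scores.get)


counts, vocab = train(TRAINING_TEXT)
print(classify("I just bought a new apple iphone", counts, vocab))
print(classify("baked an apple pie with fresh fruit", counts, vocab))
```

With real article text in place of the placeholder snippets, the same structure applies: each article supplies the training corpus for one sense, and the article's URL serves as the label. A post that shares no words with the training data scores the same for every sense, so a real system would need a fallback for that case.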

For more information, check out the Python Natural Language Toolkit web site. Also, the Strata panel Online Sentiment, Machine Learning, and Prediction will dive into real world uses of sentiment analysis and machine learning.

November 16 2010

Four short links: 16 November 2010

  1. A Room to Let in Old Aldgate -- a lovely collection of photographs of lost buildings from The Society for Photographing Relics of Old London. Think of them as the Wayback Machine of their day. (via Fiona Rigby on Twitter)
  2. Wikipedia Fundraising A/B Tests -- get a glimpse into the science that resulted in Jimmy Wales's hollow haunted gaze staring at you with the eerie intensity of a creepy hobo talking about how tasty human liver is.
  3. It Takes A Lot of Money to Stay in Business (Ponoko) -- guest blogs by Chris Anderson on the lessons and rules of maker businesses. Most Maker businesses that I’ve talked to have to hold parts inventory closer to 25% of their annual sales.
  4. Sencha Touch -- mobile multitouch Javascript toolkit, now fully GPLed. (via Simon St Laurent)

October 15 2010

September 17 2010

Four short links: 17 September 2010

  1. BBC Jobs -- looking for someone to devise advanced machine intelligence techniques to infer high level classification metadata of audio and video content from low-level features extracted from it. (via mattb on Delicious)
  2. A History of the Iraq War Through Wikipedia Changelogs -- printed and bound volumes of the Wikipedia changelogs during the Iraq war. This is historiography. This is what culture actually looks like: a process of argument, of dissenting and accreting opinion, of gradual and not always correct codification. And for the first time in history, we’re building a system that, perhaps only for a brief time but certainly for the moment, is capable of recording every single one of those infinitely valuable pieces of information. Everything should have a history button. We need to talk about historiography, to surface this process, to challenge absolutist narratives of the past, and thus, those of the present and our future. (via Flowing Data)
  3. Nuggetize -- pulls highlights out of a page before you visit it. (via titine on Delicious)
  4. Antimov -- SparkFun running contest where a robot violates one of Asimov's three laws (not the one about hurting people though). I am in LOVE with the logo, check it out.
