October 14 2011

Visualization of the Week: Sentiment in the Bible

New textual analysis tools are providing interesting insights into classic works of literature. Last month, for example, we looked at a visualization based on character frequency in Jane Austen novels.

Along similar lines, OpenBible has just released a visualization showing a sentiment analysis of the Bible.

A blog post announcing the visualization outlines the ebbs and flows that were uncovered:

Things start off well with creation, turn negative with Job and the patriarchs, improve again with Moses, dip with the period of the judges, recover with David, and have a mixed record (especially negative when Samaria is around) during the monarchy. The exilic period isn't as negative as you might expect, nor the return period as positive. In the New Testament, things start off fine with Jesus, then quickly turn negative as opposition to his message grows. The story of the early church, especially in the epistles, is largely positive.

Screenshot from OpenBible's Bible sentiment visualization
This Bible visualization from OpenBible includes both the Old and New Testaments. Black indicates a positive sentiment, red negative.

OpenBible created the visualization by running the Viralheat Sentiment API across a number of translations. The raw data from OpenBible's visualization is available for download.

A second visualization breaks down the sentiment by specific book, making it easier to see those that contain overwhelmingly positive sentiment (Psalms, for example), those that contain negative sentiment (Job), and those that go from bad to worse (Jonah).
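To get a feel for how a chart like this could be assembled, here is a minimal Python sketch. It is not OpenBible's pipeline and it does not call the Viralheat Sentiment API. It assumes a tiny hand-made word list (`POSITIVE`, `NEGATIVE`), a hypothetical `score_verse` function, and three made-up sample sentences, then smooths the per-verse scores with a trailing moving average so that ebbs and flows become visible.

```python
from collections import deque

# Toy lexicon: an assumption for illustration only, not real sentiment data.
POSITIVE = {"good", "blessed", "joy", "love", "rejoice"}
NEGATIVE = {"evil", "wept", "curse", "death", "wrath"}


def score_verse(text):
    """Return (positive word count - negative word count) for one verse."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def moving_average(scores, window=3):
    """Smooth a sequence of scores with a trailing window."""
    recent = deque(maxlen=window)
    smoothed = []
    for s in scores:
        recent.append(s)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical sample text, invented for this sketch.
verses = [
    "And God saw that it was good.",
    "They wept and feared death.",
    "Rejoice, for great is your joy.",
]
scores = [score_verse(v) for v in verses]
smoothed = moving_average(scores, window=2)
print(scores, smoothed)
```

A real sentiment service would replace `score_verse`. The smoothing step is the part that turns thousands of noisy verse scores into a readable curve.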

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

September 20 2011

Four short links: 20 September 2011

  1. Plan 9 on Android -- replacing the Java stack on Android phones with Inferno. Inferno is an operating system descended from Plan 9, originally from Bell Labs.
  2. SmartOS -- Joyent-created open source operating system built for virtualization. (via Nelson Minar)
  3. libtcod -- open source library for creating Rogue-like games. (via Nelson Minar)
  4. Wikipedia Miner -- toolkit for working with semantics in Wikipedia pages, e.g. find the connective topics that link two chosen topics. (via Alyona Medelyan)

September 06 2011

Four short links: 6 September 2011

  1. The Secret Life of JavaScript Primitives -- good writing and clever headlines can make even the dullest topic seem interesting. This is interesting, I hasten to add.
  2. Backup Bouncer -- software to test how effective your backup tools are: you copy files to a test area by whatever means you like, then run this tool to see whether permissions, flags, owners, contents, timestamps, etc. are preserved. (via Joshua Schachter)
  3. reVerb -- open source (GPLv3) toolkit for learning triples from text. See the paper for more details.
  4. Patterns for Large-Scale JavaScript Architecture -- enterprise (aka "scalable") architectures for JavaScript apps.
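The idea behind the backup-testing link above (copy files by whatever means, then check what survived) can be sketched in a few lines of Python. This is not Backup Bouncer's code or interface. The `metadata_differences` helper is hypothetical, and it checks only permissions, modification time, and contents, a small subset of what a real tool would cover.

```python
import os
import shutil
import stat
import tempfile


def metadata_differences(original, copy):
    """Return the names of the checked attributes that differ between two files."""
    a, b = os.stat(original), os.stat(copy)
    diffs = []
    if stat.S_IMODE(a.st_mode) != stat.S_IMODE(b.st_mode):
        diffs.append("permissions")
    if int(a.st_mtime) != int(b.st_mtime):
        diffs.append("mtime")
    with open(original, "rb") as f, open(copy, "rb") as g:
        if f.read() != g.read():
            diffs.append("contents")
    return diffs


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    with open(src, "w") as f:
        f.write("hello")
    os.chmod(src, 0o640)
    os.utime(src, (1_000_000_000, 1_000_000_000))  # set an old timestamp

    good = os.path.join(tmp, "good.txt")
    shutil.copy2(src, good)      # copy2 also copies mode bits and timestamps
    bad = os.path.join(tmp, "bad.txt")
    shutil.copyfile(src, bad)    # copyfile copies the contents only

    good_diffs = metadata_differences(src, good)
    bad_diffs = metadata_differences(src, bad)
    print(good_diffs, bad_diffs)
```

Swapping `shutil.copy2` for your own backup command is the interesting experiment: whatever shows up in the returned list is what your tool silently drops.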

July 18 2011

Four short links: 18 July 2011

  1. Organisational Warfare (Simon Wardley) -- notes on the commoditisation of software, with interesting analyses of the positions of some large players. On closer inspection, Salesforce seems to be doing more than just commoditisation with an ILC pattern, as can be clearly seen from the Radian6 acquisition. They also seem to be operating a tower and moat strategy, i.e. creating a tower of revenue (the service) around which is built a moat devoid of differential value with high barriers to entry. When their competitors finally wake up and realise that the future world of CRM is in this service space, they'll discover a new player dominating this space who has not only removed many of the opportunities to differentiate (e.g. social CRM, mobile CRM) but built a large ecosystem that creates high rates of new innovation. This should be a fairly fatal combination.
  2. Learning to Win by Reading Manuals in a Monte-Carlo Framework (MIT) -- starting with no prior knowledge of the game or its UI, the system learns how to play and to win by experimenting, and from parsed manual text. They used FreeCiv, and assessed the influence of parsing the manual shallowly and deeply. Trust MIT to turn RTFM into a paper. For a human-readable explanation, see the press release.
  3. A Shapefile of the TZ Timezones of the World -- I have nothing but sympathy for the poor gentleman who compiled this. Political boundaries are notoriously arbitrary, and timezones are even worse because they don't need a war to change. (via Matt Biddulph)
  4. Microsoft Adventure -- 1979 Microsoft game for the TRS-80 has fascinating threads into the past and into what would become Microsoft's future.

June 20 2011

Four short links: 20 June 2011

  1. HD Video Recording Glasses (Kickstarter) -- as Bryce says, "wearable computing is on the rise. As the price for enabling components drops, always on connectivity in our pockets and purses increases, and access to low cost manufacturing resources and know-how rises we’ll see innovation continue to push into these most personal forms of computing." (via Bryce Roberts)
  2. Sketching in Food (Chris Heathcote) -- a set of taste tests to demonstrate that we've been food hacking for a very long time. We started with two chemical coated strips - sodium benzoate, a preservative used in lots of food that a significant percentage of people can taste (interestingly in different ways, sweet, sour and bitter). Secondly was a chemical known as PTC that about 70% of people perceive as bitter, and a smaller number perceiving as really really horribly bitter. This was to show that taste is genetic, and different people perceive the same food differently. He includes pointers to sources for the materials in the taste test.
  3. Investigating Millions of Documents by Visualizing Clusters -- recording of talk about our recent work at the AP with the Iraq and Afghanistan war logs.
  4. Managing Crowdsourced Human Computation (Slideshare) -- half a six-hour tutorial at WWW2011 on crowdsourcing and human computation. See also the author's comments. (via Matt Biddulph)

May 24 2011

Four short links: 24 May 2011

  1. Delivereads -- genius idea, a mailing list for Kindles. Yes, if you can send email then you can be a Kindle publisher. (via Sacha Judd)
  2. Abnormal Returns From the Common Stock Investments of Members of the U.S. House of Representatives -- We measure abnormal returns for more than 16,000 common stock transactions made by approximately 300 House delegates from 1985 to 2001. Consistent with the study of Senatorial trading activity, we find stocks purchased by Representatives also earn significant positive abnormal returns (albeit considerably smaller returns). A portfolio that mimics the purchases of House Members beats the market by 55 basis points per month (approximately 6% annually). (via Ellen Miller)
  3. Google News Archive Ends -- the post hypothesizes that old material was "too hard" to make sense of, but that seems unlikely to me. More likely is that it wasn't useful enough to their machine learning efforts. Newspapers can have their scanned/OCRed content for free now that the program is being closed.
  4. Week Report 310 -- BERG's first (that I've seen) video report of the week, and it's a cracker. No newsreel, just some really clever evocation of the mood of the place and the nature of the projects. I continue to be impressed by the BERG crew's conscious creation of culture.
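The House-trading abstract above quotes 55 basis points per month as "approximately 6% annually". The arithmetic is easy to check in Python. Nothing here comes from the paper beyond the 55 basis point figure; the two annualisation conventions (simple and compounded) are standard, and which one the authors used is not stated in the abstract.

```python
# 55 basis points = 0.55% excess return per month (figure quoted in the abstract).
monthly_excess = 55 / 10_000

# Simple annualisation: twelve months added together.
simple_annual = monthly_excess * 12

# Compounded annualisation: twelve months multiplied together.
compounded_annual = (1 + monthly_excess) ** 12 - 1

print(f"simple: {simple_annual:.2%}, compounded: {compounded_annual:.2%}")
```

Both conventions land between 6.5% and 7%, so the abstract's "approximately 6%" reads as a rounded figure.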

May 18 2011

Four short links: 18 May 2011

  1. The Future of the Library (Seth Godin) -- We need librarians more than we ever did. What we don't need are mere clerks who guard dead paper. Librarians are too important to be a dwindling voice in our culture. For the right librarian, this is the chance of a lifetime. Passionate railing against a straw man. The library profession is diverse, but huge numbers of them are grappling with the new identity of the library in a digital age. This kind of facile outside-in "get with the Internet times" message is almost laughably displaying ignorance of actual librarians, as much as "the book is dead!" displays ignorance of books and literacy. Libraries are already much more than book caves, and already see themselves as navigators to a world of knowledge for people who need that navigation help. They disproportionately serve the under-privileged, they are public spaces, they are brave and constant battlers at the front line of freedom to access information. This kind of patronising "wake up and smell the digital roses!" wank is exactly what gives technologists a bad name in other professions. Go back to your tribes of purple cows, Seth, and leave librarians to get on with helping people find, access, and use information.
  2. An Old Word for a New World (PDF) -- paper on how "innovation", which used to be pejorative, has now come to be laudable. (via Evgeny Morozov)
  3. AlchemyAPI -- free (as in beer) entity extraction API. (via Andy Baio)
  4. Referrals by LinkedIn -- the thing with social software is that outsiders can have strong visibility into the success of your software, in a way that antisocial software can't.

May 17 2011

Four short links: 17 May 2011

  1. Sorting Out 9/11 (New Yorker) -- the thorniest problem for the 9/11 memorial was the ordering of the names. Computer science to the rescue!
  2. Tagger -- Python library for extracting tags (statistically significant words or phrases) from a piece of text.
  3. Free Science, One Paper at a Time (Wired) -- Jonathan Eisen's attempt to collect and distribute his father's scientific papers (which were written while a federal employee, so in the public domain), thwarted by old-fashioned scientific publishing. “But now,” says Jonathan Eisen, “there’s this thing called the Internet. It changes not just how things can be done but how they should be done.”
  4. Internet Archive Launches Physical Archive -- I'm keen to see how this develops, because physical storage has problems that digital does not. I'd love to see the donor agreement require the donor to give the archive full rights to digitize and distribute under open licenses. That'd put the Internet Archive a step in front of traditional archives, museums, libraries, and galleries, whose donor agreements typically let donors place arbitrary specifications on use and reuse ("must be inaccessible for 50 years", "no commercial use", "no use that compromises the work", etc.), all of which are barriers to wholesale digitization and reuse.
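The tag-extraction idea in the Tagger link above can be approximated very roughly with word counts. The sketch below is not the Tagger library's API or algorithm. It assumes a tiny stopword list and a hypothetical `extract_tags` function, and it ranks words by raw frequency, a crude stand-in for a proper test of statistical significance.

```python
import re
from collections import Counter

# Minimal stopword list: an assumption for illustration, far from complete.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on", "was"}


def extract_tags(text, n=3):
    """Return the n most frequent words that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


# Hypothetical sample text, invented for this sketch.
text = ("The archive stores papers. The archive digitizes papers, "
        "and the archive shares them.")
tags = extract_tags(text, n=2)
print(tags)
```

A real tagger would compare these counts against a background corpus, so that words common everywhere do not crowd out the ones that are distinctive for this text.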

May 16 2011

Four short links: 16 May 2011

  1. Entering the Minority Report Era -- a survey of technology inspired by or reminiscent of Minority Report, which came out ten years ago. (via Hacker News)
  2. Sally -- a tool for embedding strings in matrices, as used in machine learning. (via Matt Biddulph)
  3. GNU SIP Witch Released -- can be used to deploy private secure calling networks, whether stand-alone or in conjunction with existing VoIP infrastructure, for private institutions and national governments. (via Hacker News)
  4. Chilling Story of Genius in a Land of Chronic Unemployment (TechCrunch) -- fascinating story of Nigerian criminal tech entrepreneurs. He helps build them up; he listens to their problems. He makes them feel loved. He calls each an innocuous pet name, lest he accidentally type the wrong message into the wrong chat window. He asks for a little bit of money here and there, until men are sending him steady amounts from each paycheck. He says it takes exactly one month for a man to fall in love with him, and once he has a man’s heart, no woman can take it. I wonder what designers of social software can learn from these master emotional manipulators?

May 04 2011

Trading on sentiment

Computers don't get emotionally invested in financial trades, but they do take feelings seriously.

Case in point: The financial trading dashboard managed by Thomson Reuters uses sentiment analysis data from Lexalytics to track news on 20,000 stocks and thousands of commodities. The Lexalytics system parses text from multiple sources, looking for keywords, tone, relevance and freshness. The resulting textual analysis (the meaning of the text) and sentiment analysis (the emotions in the text) are then incorporated into widely used algorithmic trading systems.

Mark Thompson, CEO of McKinley Software (the parent company of Lexalytics), told me more about this emotion-to-data conversion. "Our financial engine is something we developed over an 8-year period, and the main partner for that is Thomson Reuters," Thompson said. "The Thomson Reuters news passes through our black box and we kick out scores based on 80 different variables for all of the articles."

Algorithmic trading is automated trading where trading software takes various inputs, or "trading signals," and uses them to decide what trades to make. Trades are executed in a matter of milliseconds and there is no human intervention. In 2009, algorithmic trading accounted for more than 25 percent of all shares traded on the buy side. No human being can read the latest financial news fast enough to contribute to those buy or sell decisions. That's where sentiment analysis comes in.

"By scoring the news, within milliseconds we get a very accurate view of what's being said about a particular stock or sector," Thompson said. "Thomson Reuters sells that output to trading houses who then plug this data into their algorithmic trading models. We have found that we can predict stock market movements. We provide an extra layer of richness that trading staff haven't been able to get their hands on. Otherwise, you are just doing very two-dimensional quant processing."

Rochester Cahan, VP of Global Equity Quantitative Strategy at Deutsche Bank, has been experimenting with the Thomson Reuters system. Cahan told me that he has seen significant improvements in trading performance when the text analysis and sentiment scores are used as trading inputs. In addition, the scores are uncorrelated with existing trading signals — in other words, they provide new information to the trading system.

The most positive sentiment levels (e.g. Apple releases the iPad to universal acclaim) are not necessarily the most useful for trading. The stock price reacts very quickly so it's difficult to take advantage of the information. However, Cahan said stocks with moderate positivity tend to be overlooked by the market and can make for good buys.
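Cahan's point about moderate positivity can be pictured as a simple band filter on sentiment scores. The sketch below is only an illustration. The tickers, the scores on a -1 to 1 scale, the `low` and `high` thresholds, and the `moderate_positive` function are all invented; they do not reflect how Lexalytics, Thomson Reuters, or Deutsche Bank actually score or screen stocks.

```python
def moderate_positive(scores, low=0.2, high=0.6):
    """Return tickers whose sentiment score lies within [low, high]."""
    return sorted(t for t, s in scores.items() if low <= s <= high)


# Hypothetical sentiment scores on a -1..1 scale; invented for illustration.
news_scores = {"AAA": 0.95, "BBB": 0.35, "CCC": -0.40, "DDD": 0.50, "EEE": 0.05}

candidates = moderate_positive(news_scores)
print(candidates)
```

The very positive `AAA` is excluded on purpose: by the argument above, the price has probably already reacted to news that good.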

I asked Thompson about the limitations of the sentiment analysis technology. He explained that even human beings don't agree on the sentiment of an article more than about 85% of the time. "The problem with our kind of engine is trying to get above 85% accuracy," Thompson said. "Beyond that level, you get a diminishing return and you need more human intervention. This leaves the human analyst to pass different types of judgements."

The competitive edge may be lost if all trading systems use sentiment analysis, but Thompson thinks there is some distance to go before we get to that point. "Everyone has a slightly different way of composing the model and using the news, and there are always advances in the technology," he said. "But there will come a point when sentiment becomes an ordinary part of the trading mix."

Photo: ABOVE by Lyfetime, on Flickr


May 03 2011

Four short links: 3 May 2011

  1. SentiWordNet -- WordNet with hints as to sentiment of particular terms, for use in sentiment analysis. (via Matt Biddulph)
  2. Word Frequency Lists and Dictionaries -- also for text analysis. This site contains what we believe is the most accurate frequency data of English. It contains word frequency lists of the top 60,000 words (lemmas) in English, collocates lists (looking at nearby words to see word meaning and use), and n-grams (the frequency of all two and three-word sequences in the corpora).
  3. Crash Course in Web Design for Startups -- When I was a wee pixel pusher I would overuse whatever graphic effect I had just learned. Text-shadow? Awesome, let's put 5px 5px 5px #444. Border-radius? Knock that up to 15px. Gradients? How about from red to black? You can imagine how horrible everything looked. Now my rule of thumb in most cases is applying just enough to make it perceivable, no more. This usually means no blur on text-shadow and just a 1px offset, or only dealing with gradients moving between a very narrow color range. Almost everything in life is improved with this rule.
  4. Leafsnap -- Columbia University, the University of Maryland and the Smithsonian Institution have pooled their expertise to create the world’s first plant identification mobile app using visual search—Leafsnap. This electronic field guide allows users to identify tree species simply by taking a photograph of the tree’s leaves. In addition to the species name, Leafsnap provides high-resolution photographs and information about the tree’s flowers, fruit, seeds and bark—giving the user a comprehensive understanding of the species. iPhone for now, Android and iPad to come. (via Fiona Romeo)
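The n-gram frequencies mentioned in the word-frequency link above are simple to compute for a small text. Here is a minimal Python sketch, assuming whitespace tokenisation and a made-up sample sentence. The `ngrams` helper is hypothetical, and real frequency lists are built from very large corpora with far more careful tokenisation and lemmatisation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample sentence, invented for this sketch.
tokens = "the cat sat on the mat and the cat slept".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))
```

Collocate lists follow the same pattern, except that they count words appearing within a few positions of a target word instead of strictly adjacent ones.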

May 02 2011

Four short links: 2 May 2011

  1. Chinese Internet Cafes (Bryce Roberts) -- a good quick read. My note: people valued the same things in Internet cafes that they value in public libraries, and the uses are very similar. They pose a similar threat to the already-successful, which is why public libraries are threatened in many Western countries.
  2. SIFT -- the Scale Invariant Feature Transform library, built on OpenCV, is a method to detect distinctive, invariant image feature points, which easily can be matched between images to perform tasks such as object detection and recognition, or to compute geometrical transformations between images. The licensing seems dodgy--MIT code but lots of "this isn't a license to use the patent!" warnings in the LICENSE file. (via Joshua Schachter)
  3. The Secret Life of Libraries (Guardian) -- I like the idea of the most-stolen-books revealing something about a region; it's an aspect of data revealing truth. For a while, Terry Pratchett was the most-shoplifted author in England but newspapers rarely carried articles about him or mentioned his books (because they were genre fiction not "real" literature). (via Brian Flaherty)
  4. Sweble -- MediaWiki parser library. Until today, Wikitext had been poorly defined. There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. (via Dirk Riehle)

