
November 15 2011

Four short links: 15 November 2011

  1. Cost-Effectiveness of Internet-Based Self-Management Compared with Usual Care in Asthma (PLoS ONE) -- Internet-based self-management of asthma can be as effective as current asthma care and costs are similar.
  2. Apache Lucy -- full-text search engine library written in C and targeted at dynamic languages. It is a "loose C" port of Apache Lucene™, a search engine library for Java.
  3. The Near Future of Citizen Science (Fiona Romeo) -- near future of science is all about honing the division of labour between professionals, amateurs and bots. See Bryce's bionic software riff. (via Matt Jones)
  4. Microsoft's Patent Claims Against Android (Groklaw) -- behold, citizen, the formidable might of Microsoft's patents and how they justify a royalty from every Android device equal to that which you would owe if you built a Windows Mobile device: These Microsoft patents can be divided into several basic categories: (1) the '372 and '780 patents relate to web browsers; (2) the '551 and '233 patents relate to electronic document annotation and highlighting; (3) the '522 patent relates to resources provided by operating systems; (4) the '517 and '352 patents deal with compatibility with file names once employed by old, unused, and outmoded operating systems; (5) the '536 and '853 patents relate to simulating mouse inputs using non-mouse devices; and (6) the '913 patent relates to storing input/output access factors in a shared data structure. A shabby display of patent menacing.
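The search libraries in item 2 (Lucene and its C port, Lucy) are both built around the inverted index: a map from each term to the documents that contain it, which turns full-text search into fast set operations. Here's a minimal Python sketch of that idea — a toy illustration, not Lucene's or Lucy's actual API:

```python
from collections import defaultdict

# Minimal inverted index: the core data structure behind Lucene and Lucy.
# Maps each term to the set of document ids that contain it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    # AND query: intersect the posting sets of every query term.
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "full text search engine", 2: "search engine library in C"}
idx = build_index(docs)
print(sorted(search(idx, "search", "engine")))  # → [1, 2]
```

Real engines add tokenization, scoring, and on-disk index formats on top, but the term-to-postings mapping is the heart of it.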

September 30 2010

Strata Week: Behind LinkedIn Signal

Professional social networking site LinkedIn yesterday announced a new service, Signal, that applies the filters of the LinkedIn network over status updates, such as those from Twitter. Signal lets you do things such as watch tweets from particular industries, companies or locales, or filter by your professional network. All in real time.

Screenshot of LinkedIn Signal

Overlaying the Twitter nation with LinkedIn's map is a great idea, so what's the technology behind Signal? Like fellow social networks Facebook and Twitter, LinkedIn has a smart big data and analytics team, who often leverage or create open source solutions.

LinkedIn engineer John Wang (@javasoze) gave some clues as to Signal's infrastructure of "Zoie, Bobo, Sensei and Lucene", and I thought it would be fascinating to examine the parts in more detail.

Signal uses a variety of open source technologies, some developed in-house at LinkedIn by their Search, Network and Analytics team.

  • Zoie (source code) is a real-time search and indexing system built on top of the Apache Lucene search platform. As documents are added to the index, they become immediately searchable.
  • Bobo is another extension to Apache Lucene. While Lucene is great for searching free text data, Bobo takes it a step further and provides faceted searching and browsing over data sets (source code).
  • Sensei (source code) is a distributed, scalable database offering fast searching and indexing. It is particularly tuned to answer the kind of queries LinkedIn excels at: free text search, restricted over various axes in their social network. Sensei uses Bobo and Zoie, adding clustered, elastic database features.
  • Voldemort is an open source fault-tolerant distributed key-value store, similar to Amazon's Dynamo.
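The faceted browsing that Bobo layers on top of Lucene boils down to counting how search results distribute across fields like industry or locale — exactly the filters Signal exposes. This is a conceptual Python sketch of facet counting, with invented field names and nothing resembling Bobo's real Java API:

```python
from collections import Counter

# Faceted browsing in miniature: alongside the result list, count how the
# hits distribute over each facet field (the idea behind Bobo).
def facet_counts(results, facet_fields):
    counts = {f: Counter() for f in facet_fields}
    for doc in results:
        for f in facet_fields:
            if f in doc:
                counts[f][doc[f]] += 1
    return counts

# Hypothetical status-update hits with Signal-style metadata.
hits = [
    {"title": "Signal launch",   "industry": "software", "locale": "US"},
    {"title": "Search at scale", "industry": "software", "locale": "UK"},
    {"title": "Hiring update",   "industry": "finance",  "locale": "US"},
]
print(facet_counts(hits, ["industry", "locale"]))
```

The output feeds the familiar sidebar of "software (2), finance (1)" links that lets a user drill into a result set one facet at a time.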

LinkedIn also uses the Scala and JRuby JVM programming languages, alongside Java.

If you're interested in hearing more about LinkedIn Signal, check out the coverage on TechCrunch, Mashable and The Daily Beast.

Bringing visualization back to the future

Speaking at this week's Web 2.0 Expo in New York, Julia Grace of IBM encouraged attendees to raise their game with data visualization. As long ago as the 1980s, movie directors envisioned exciting and dynamic data visualizations, but today most people are still sharing flat two-dimensional charts, which restrict the opportunities for understanding and telling stories with data. Julia decided to make some location-based data very real by projecting it onto a massive globe.

Julia's talk is embedded below, and you can also read an extended interview with her published earlier this month on O'Reilly Radar.

Hadoop goes viral

Software vendor Karmasphere creates developer tools for data intelligence that work with Hadoop-based SMAQ big data systems. They recently commissioned a study into Hadoop usage. One of the most interesting results of the survey suggests that Hadoop systems tend to start as skunkworks projects inside organizations, and move rapidly into production.

Once used inside an organization, Hadoop appears to spread:

Additionally, organizations are finding that the longer Hadoop is used, the more useful it is found to be; 65% of organizations using Hadoop for a year or more indicated more than three reasons for using Hadoop, as compared to 36% for new users.

There are challenges too. Hadoop offers the benefits of affordable big data processing, but it has an immature ecosystem that is only just starting to emerge. Respondents to the Karmasphere survey indicated that pain points included a steep learning curve, hiring qualified people, and the availability of tools and educational materials.

This is good news for vendors such as Karmasphere, Datameer and IBM, all of whom are concentrating on making Hadoop work in ways that are familiar to enterprises, through the medium of IDEs and spreadsheet interfaces.
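Part of that learning curve is simply the MapReduce model itself: a mapper that emits key/value pairs and a reducer that aggregates them after the framework sorts by key. The canonical word-count example, sketched here in the stdin/stdout style used by Hadoop Streaming (simplified — a real job would be launched via the streaming jar, which is not shown):

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit one "word<TAB>1" pair per word, Hadoop Streaming style.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_pairs):
    # Reduce step: pairs arrive sorted by key, so counts can be summed
    # with a single pass per distinct word.
    for word, group in groupby(sorted_pairs, key=lambda p: p.split("\t")[0]):
        total = sum(int(p.split("\t")[1]) for p in group)
        yield f"{word}\t{total}"

# Locally, the framework's shuffle/sort phase is just sorted():
lines = ["hadoop makes big data affordable", "big data at scale"]
for out in reducer(sorted(mapper(lines))):
    print(out)
```

The appeal of the spreadsheet and IDE products mentioned below is precisely that they hide this plumbing from the analyst.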

SciDB source released

The SciDB database is an answer to the data and analytic needs of the scientific world, serving fields such as biology, physics, and astronomy. In the words of its website, it is a database "for the toughest problems on the planet." SciDB Inc., the sponsor of the open source project, says that although science has become steadily more data intensive, scientists have had to use databases intended for commercial, rather than scientific, applications.

One of the most intriguing aspects of SciDB is that it emanates from the work of serial database innovator Michael Stonebraker. Scientific data is inherently multi-dimensional, Stonebraker told The Register earlier this month, and thus ill-suited for use with traditional relational databases.

The SciDB project has now made its source code available. The current release, R0.5, is an early stage product for the "curious and intrepid". It features a new array query language, AQL, an SQL-like language extended for SciDB's array data model. The release runs on Linux systems, and is expected to be followed at the end of the year by a more robust and stable version.

SciDB is available under the GPL3 free software license, and may be downloaded on application to the SciDB team. According to the authors, more customary use of open source repositories is likely to follow soon.
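Stonebraker's point about multi-dimensional data is easiest to see with the array model itself: cells addressed by integer coordinates, queried by ranges along each axis rather than by relational joins. A toy Python sketch of that model — hypothetical class and method names, and no relation to AQL's actual syntax:

```python
# Sketch of the array data model SciDB targets: cells addressed by integer
# coordinates, with range ("window") queries along each dimension.
class SparseArray2D:
    def __init__(self):
        self.cells = {}  # (i, j) -> value

    def set(self, i, j, value):
        self.cells[(i, j)] = value

    def window(self, i_range, j_range):
        # All cells with i_range[0] <= i < i_range[1], likewise for j --
        # e.g. "every object in this patch of sky" in one query.
        return {
            (i, j): v for (i, j), v in self.cells.items()
            if i_range[0] <= i < i_range[1] and j_range[0] <= j < j_range[1]
        }

sky = SparseArray2D()
sky.set(10, 20, "galaxy")
sky.set(11, 21, "star")
sky.set(90, 5, "quasar")
print(sky.window((10, 12), (20, 22)))  # the two nearby objects
```

Expressing that same window query over a relational table of (i, j, value) rows is possible but awkward, which is the gap SciDB aims to fill.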


May 25 2010

Four short links: 25 May 2010

  1. Lending Merry-Go-Round -- these guys have been Australia's sharpest satire for years, filling the role of the Daily Show. Here they ask some strong questions about the state of Europe's economies ... (via jdub on Twitter)
  2. What's Powering the Guardian's Content API -- Scala and Solr/Lucene on EC2 is the short answer. The long answer reveals the details of their setup, including some of their indexing tricks that mean Solr can index all their content in just an hour. (via Simon Willison)
  3. What I Learned About Engineering from the Panama Canal (Pete Warden) -- I consider myself a cheerful pessimist. I've been through enough that I know how steep the odds of success are, but I've made a choice that even a hopeless fight in a good cause is worthwhile. What a lovely attitude!
  4. Mapping the Evolution of Scientific Fields (PLoS ONE) -- clever use of data. We build an idea network consisting of American Physical Society Physics and Astronomy Classification Scheme (PACS) numbers as nodes representing scientific concepts. Two PACS numbers are linked if there exist publications that reference them simultaneously. We locate scientific fields using a community finding algorithm, and describe the time evolution of these fields over the course of 1985-2006. The communities we identify map to known scientific fields, and their age depends on their size and activity. We expect our approach to quantifying the evolution of ideas to be relevant for making predictions about the future of science and thus help to guide its development.
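The network construction in item 4 is simple to sketch: treat each paper as the set of PACS codes it references, and weight an edge between two codes by how many papers cite both. A minimal Python illustration with invented codes and toy data (the paper then runs community detection over the resulting weighted graph, which is not shown):

```python
from collections import Counter
from itertools import combinations

# Each paper is the set of PACS codes it references (codes invented here).
papers = [
    {"03.67", "42.50"},           # quantum information + quantum optics
    {"03.67", "42.50", "05.30"},  # ... plus quantum statistical mechanics
    {"98.80", "95.36"},           # cosmology + dark energy
]

# Co-occurrence edges: two codes are linked once per paper citing both.
edges = Counter()
for pacs in papers:
    for a, b in combinations(sorted(pacs), 2):
        edges[(a, b)] += 1

print(edges.most_common(2))
```

With real APS data the edge weights over time are what let the authors watch fields merge, split, and age.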
