June 07 2012

Strata Week: Data prospecting with Kaggle

Here are a few of the data stories that caught my attention this week:

Prospecting for data

The data science competition site Kaggle is extending its features with a new service called Prospect. Prospect allows companies to submit a data sample to the site without having a pre-ordained plan for a contest. In turn, the data scientists using Kaggle can suggest ways in which machine learning could best uncover new insights and answer less-obvious questions — and what sorts of data competitions could be based on the data.

As GigaOm's Derrick Harris describes it: "It's part of a natural evolution of Kaggle from a plucky startup to an IT company with legs, but it's actually more like a prequel to Kaggle's flagship predictive modeling competitions than it is a sequel." It's certainly a good way for companies to get their feet wet with predictive modeling.

Practice Fusion, a web-based electronic health records system for physicians, has launched the inaugural Kaggle Prospect challenge.

HP's big data plans

Last year, Hewlett Packard made a move away from the personal computing business and toward enterprise software and information management. It's a move that was marked in part by the $10 billion it paid to acquire Autonomy. Now we know a bit more about HP's big data plans for its Information Optimization Portfolio, which has been built around Autonomy's Intelligent Data Operating Layer (IDOL).

ReadWriteWeb's Scott M. Fulton takes a closer look at HP's big data plans.

The latest from Cloudera

Cloudera released a number of new products this week: Cloudera Manager 3.7.6; Hue 2.0.1; and of course CDH 4.0, its Hadoop distribution.

CDH 4.0 includes:

"... high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations."

Social data platform DataSift also announced this week that it was powering its Hadoop clusters with CDH to perform the "Big Data heavy lifting to help deliver DataSift's Historics, a cloud-computing platform that enables entrepreneurs and enterprises to extract business insights from historical public Tweets."

Have data news to share?

Feel free to email us.

OSCON 2012 Data Track — Today's system architectures embrace many flavors of data: relational, NoSQL, big data and streaming. Learn more in the Data track at OSCON 2012, being held July 16-20 in Portland, Oregon.

Save 20% on registration with the code RADAR


May 10 2012

Four short links: 10 May 2012

  1. Gravity in the Margins (Got Medieval) -- illuminating illuminated manuscripts with Mario. (via BoingBoing)
  2. Hours, Days, Who's Counting? (Jon Udell) -- What prompted me to check? My friend Mike Caulfield, who’s been teaching and writing about quantitative literacy, says it’s because in this case I did have some touchstone facts parked in my head, including the number 10 million (roughly) for barrels of oil imported daily to the US. The reason I’ve been working through a bunch of WolframAlpha exercises lately is that I know I don’t have those touchstones in other areas, and want to develop them. The idea of "touchstone facts" resonates with me.
  3. Spotting Fake Reviewer Groups in Consumer Reviews (PDF) -- gotta love any paper that says We calculated the "spamicity" (degree of spam) of each group by assigning 1 point for each spam judgment, 0.5 point for each borderline judgment and 0 point for each non-spam judgment a group received and took the average of all 8 labelers. A toy version of this calculation appears after this list. (via Google Research Blog)
  4. Visualizing Physical Activity Using Abstract Ambient Art (Quantified Self) -- kinda like the iTunes visualizer but for your Fitbit Tracker.
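
The spamicity score in item 3 is just an average of the labelers' points. A toy calculation, with made-up judgments for a single reviewer group:

    # Hypothetical judgments from the 8 labelers for one reviewer group.
    points = {"spam": 1.0, "borderline": 0.5, "non-spam": 0.0}
    judgments = ["spam", "spam", "borderline", "spam", "non-spam",
                 "borderline", "spam", "spam"]

    # Spamicity = average points across all labelers.
    spamicity = sum(points[j] for j in judgments) / len(judgments)
    print(spamicity)  # 0.75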

May 02 2012

Four short links: 2 May 2012

  1. Punting on SxSW (Brad Feld) -- I came across this old post and thought: if you can make money by being a dick, or make money by being a caring family person, why would you choose to be a dick? As far as I can tell, being a dick is optional. Brogrammers, take note. Be more like Brad Feld, who prioritises his family and acts accordingly.
  2. Probabilistic Structures for Data Mining -- readable introduction to useful algorithms and datastructures showing their performance, reliability, and resources trade-off. (via Hacker News)
  3. Dataset -- a Javascript library for transforming, querying, manipulating data from different sources.
  4. Many HTTPS Servers are Insecure -- 75% still vulnerable to the BEAST attack.

April 19 2012

Four short links: 19 April 2012

  1. Superfastmatch -- open source text comparison tool, used to locate plagiarism/churnalism in online news sites. You can pull out the text engine and use it for your own "find where this text is used elsewhere" applications (e.g., what's being forwarded out in email, how much of this RFP is copy and paste, what's NOT boilerplate in this contract, etc.). (via Pete Warden)
  2. Ten Design Principles for Engaging Math Tasks (Dan Meyer) -- education gold, engagement gold, and some serious ideas you can use in your own apps.
  3. Clustering Related Stories (Jenny Finkel) -- description of how to cluster related stories, talks about some of the tricks. Interesting without being too scary.
  4. Prince of Persia (GitHub) -- I have waited to see if the novelty wore off, but I still find this cool: 1980s source code on GitHub.

April 16 2012

What it takes to build great machine learning products

Machine learning (ML) is all the rage, riding tight on the coattails of the "big data" wave. Like most technology hype, the enthusiasm far exceeds the realization of actual products. Arguably, not since Google's tremendous innovations in the late '90s/early 2000s has algorithmic technology led to a product that has permeated the popular culture. That's not to say there haven't been great ML wins since, but none have been as impactful or had computational algorithms at their core. Netflix may use recommendation technology, but Netflix is still Netflix without it. There would be no Google if Page, Brin, et al., hadn't exploited the graph structure of the web and anchor text to improve search.

So why is this? It's not for lack of trying. How many startups have aimed to bring natural language processing (NLP) technology to the masses, only to fade into oblivion after people actually try their products? The challenge in building great products with ML lies not just in understanding basic ML theory, but in understanding the domain and problem sufficiently to operationalize intuitions into model design. Interesting problems don't have simple off-the-shelf ML solutions. Progress in important ML application areas, like NLP, comes from insights specific to these problems, rather than generic ML machinery. Often, specific insights into a problem and careful model design make the difference between a system that doesn't work at all and one that people will actually use.

The goal of this essay is not to discourage people from building amazing products with ML at their cores, but to be clear about where I think the difficulty lies.

Progress in machine learning

Machine learning has come a long way over the last decade. Before I started grad school, training a large-margin classifier (e.g., SVM) was done via John Platt's batch SMO algorithm. In that case, training time scaled poorly with the amount of training data. Writing the algorithm itself required understanding quadratic programming and was riddled with heuristics for selecting active constraints and black-art parameter tuning. Now, we know how to train a nearly performance-equivalent large-margin classifier in linear time using a (relatively) simple online algorithm (PDF). Similar strides have been made in (probabilistic) graphical models: Markov-chain Monte Carlo (MCMC) and variational methods have facilitated inference for arbitrarily complex graphical models [1]. Anecdotally, take a look at papers over the last eight years in the proceedings of the Association for Computational Linguistics (ACL), the premier natural language processing publication. A top paper from 2011 has orders of magnitude more technical ML sophistication than one from 2003.
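
To make the shift concrete, here is a minimal sketch of online, linear-time large-margin training: a Pegasos-style stochastic subgradient step on the hinge loss. It is an illustration in the spirit of the linked work, not a reproduction of it; the toy data, learning-rate schedule, and hyperparameters are placeholders.

    import numpy as np

    def train_linear_svm(X, y, epochs=5, lam=0.01):
        """Online hinge-loss training: each example is visited in turn,
        so training time grows linearly with the amount of data.
        X: (n_samples, n_features); y: labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        t = 0
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                t += 1
                eta = 1.0 / (lam * t)          # decaying step size
                margin = y_i * np.dot(w, x_i)
                w *= (1 - eta * lam)           # shrink (regularization)
                if margin < 1:                 # hinge loss is active
                    w += eta * y_i * x_i       # push toward the example
        return w

    # Toy usage: two linearly separable blobs.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)
    w = train_linear_svm(X, y)
    print((np.sign(X @ w) == y).mean())  # training accuracy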

On the education front, we've come a long way as well. As an undergrad at Stanford in the early-to-mid 2000s, I took Andrew Ng's ML course and Daphne Koller's probabilistic graphical model course. Both of these classes were among the best I took at Stanford and were only available to about 100 students a year. Koller's course in particular was not only the best course I took at Stanford, but the one that taught me the most about teaching. Now, anyone can take these courses online.

As an applied ML person — specifically, natural language processing — much of this progress has made aspects of research significantly easier. However, the core decisions I make are not which abstract ML algorithm, loss-function, or objective to use, but what features and structure are relevant to solving my problem. This skill only comes with practice. So, while it's great that a much wider audience will have an understanding of basic ML, it's not the most difficult part of building intelligent systems.

Interesting problems are never off the shelf

The interesting problems that you'd actually want to solve are far messier than the abstractions used to describe standard ML problems. Take machine translation (MT), for example. Naively, MT looks like a statistical classification problem: You get an input foreign sentence and have to predict a target English sentence. Unfortunately, because the space of possible English is combinatorially large, you can't treat MT as a black-box classification problem. Instead, like most interesting ML applications, MT problems have a lot of structure and part of the job of a good researcher is decomposing the problem into smaller pieces that can be learned or encoded deterministically. My claim is that progress in complex problems like MT comes mostly from how we decompose and structure the solution space, rather than ML techniques used to learn within this space.

Machine translation has improved by leaps and bounds throughout the last decade. I think this progress has largely, but not entirely, come from keen insights into the specific problem, rather than generic ML improvements. Modern statistical MT originates from an amazing paper, "The mathematics of statistical machine translation" (PDF), which introduced the noisy-channel architecture on which future MT systems would be based. At a very simplistic level, this is how the model works [2]: For each foreign word, there are potential English translations (including the null word for foreign words that have no English equivalent). Think of this as a probabilistic dictionary. These candidate translation words are then re-ordered to create a plausible English translation. There are many intricacies being glossed over: how to efficiently consider candidate English sentences and their permutations, what model is used to learn the systematic ways in which reordering occurs between languages, and the details about how to score the plausibility of the English candidate (the language model).
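
A toy rendering of that noisy-channel decision rule: pick the English candidate that maximizes P(foreign | english) * P(english). The vocabulary, probabilities, and "language model" below are invented for illustration, and reordering, a central complication in real MT, is ignored.

    import itertools

    # Invented word-translation probabilities P(foreign word | English word).
    p_f_given_e = {
        ("ich", "i"): 0.9, ("gehe", "go"): 0.8, ("gehe", "walk"): 0.2,
        ("heim", "home"): 0.7, ("heim", "homeward"): 0.3,
    }
    candidates = {"ich": ["i"], "gehe": ["go", "walk"], "heim": ["home", "homeward"]}

    def lm(english):
        """Placeholder language model: plausibility of the English candidate."""
        good = {("i", "go", "home"): 0.05, ("i", "walk", "home"): 0.02}
        return good.get(tuple(english), 1e-6)

    def translate(foreign):
        """argmax over English candidates of P(f | e) * P(e), word for word."""
        best, best_score = None, 0.0
        for english in itertools.product(*(candidates[f] for f in foreign)):
            channel = 1.0
            for f, e in zip(foreign, english):
                channel *= p_f_given_e.get((f, e), 1e-6)
            score = channel * lm(english)
            if score > best_score:
                best, best_score = list(english), score
        return best

    print(translate(["ich", "gehe", "heim"]))  # -> ['i', 'go', 'home']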

The core improvement in MT came from changing this model: rather than learning translation probabilities for individual words, phrase-based systems learn models of how to translate foreign phrases to English phrases. For instance, the German word "abends" translates roughly to the English prepositional phrase "in the evening." Before phrase-based translation (PDF), a word-based model would only get to translate to a single English word, making it unlikely to arrive at the correct English translation [3]. Phrase-based translation generally results in more accurate translations with fluid, idiomatic English output. Of course, adding phrase-based emissions introduces several additional complexities, including how to estimate phrase emissions given that we never observe phrase segmentation; no one tells us that "in the evening" is a phrase that should match up to some foreign phrase. What's surprising here is that it isn't general ML improvements that are making this difference, but problem-specific model design. People can and have implemented more sophisticated ML techniques for various pieces of an MT system. And these do yield improvements, but typically far smaller than good problem-specific research insights.
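
To see what the phrase table buys you, here is a hypothetical fragment of one: a single foreign token like "abends" can now emit a multi-word English phrase as one unit, which a word-for-word model cannot do. The entries and probabilities are made up.

    # Hypothetical phrase table: foreign phrase -> [(English phrase, probability), ...]
    phrase_table = {
        ("abends",): [(("in", "the", "evening"), 0.6), (("evenings",), 0.3)],
        ("guten", "morgen"): [(("good", "morning"), 0.9)],
    }

    def phrase_options(foreign_tokens):
        """Enumerate contiguous spans the (toy) phrase table can translate."""
        n = len(foreign_tokens)
        for i in range(n):
            for j in range(i + 1, n + 1):
                span = tuple(foreign_tokens[i:j])
                for english, prob in phrase_table.get(span, []):
                    yield span, english, prob

    for span, english, prob in phrase_options(["abends"]):
        print(span, "->", " ".join(english), prob)
    # ('abends',) -> in the evening 0.6
    # ('abends',) -> evenings 0.3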

Franz Och, one of the authors of the original phrase-based papers, went on to Google and became the principal person behind the search company's translation efforts. While the intellectual underpinnings of Google's system go back to Och's days as a research scientist at the Information Sciences Institute (and earlier as a graduate student), much of the gains beyond the insights underlying phrase-based translation (and minimum-error rate training, another of Och's innovations) came from a massive software engineering effort to scale these ideas to the web. That effort itself yielded impressive research into large-scale language models and other areas of NLP. It's important to note that Och, in addition to being a world-class researcher, is also, by all accounts, an incredibly impressive hacker and builder. It's this rare combination of skills that can bring ideas all the way from a research project to where Google Translate is today.

Defining the problem

But I think there's an even bigger barrier beyond ingenious model design and engineering skills. In the case of machine translation and speech recognition, the problem being solved is straightforward to understand and well-specified. Many of the NLP technologies that I think will revolutionize consumer products over the next decade are much vaguer. How, exactly, can we take the excellent research in structured topic models, discourse processing, or sentiment analysis and make a mass-appeal consumer product?

Consider summarization. We all know that in some way, we'll want products that summarize and structure content. However, for computational and research reasons, you need to restrict the scope of this problem to something for which you can build a model, an algorithm, and ultimately evaluate. For instance, in the summarization literature, the problem of multi-document summarization is typically formulated as selecting a subset of sentences from the document collection and ordering them. Is this the right problem to be solving? Is the best way to summarize a piece of text a handful of full-length sentences? Even if a summary is accurate, does the Franken-sentence structure yield summaries that feel inorganic to users?
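
As a concrete (and deliberately crude) version of that sentence-selection formulation, the sketch below scores each sentence by how many of the collection's frequent content words it covers and keeps the top few. Real systems also handle redundancy, ordering, and far better scoring; the stop-word list here is just a stand-in.

    import re
    from collections import Counter

    STOP = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that", "it"}

    def extractive_summary(documents, k=3):
        """Pick the k sentences whose words best cover the collection's frequent terms."""
        sentences = [s.strip() for doc in documents
                     for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
        words = lambda s: [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP]
        freq = Counter(w for s in sentences for w in words(s))
        score = lambda s: sum(freq[w] for w in set(words(s)))
        ranked = sorted(sentences, key=score, reverse=True)
        return ranked[:k]   # note: no redundancy removal or reordering here

    docs = ["Summarization systems select sentences. Selection ignores fluency.",
            "Users may find selected sentences disjointed. Systems rarely rewrite sentences."]
    print(extractive_summary(docs, k=2))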

Or, consider sentiment analysis. Do people really just want a coarse-grained thumbs-up or thumbs-down on a product or event? Or do they want a richer picture of sentiments toward individual aspects of an item (e.g., loved the food, hated the decor)? Do people care about determining sentiment attitudes of individual reviewers/utterances, or producing an accurate assessment of aggregate sentiment?

Typically, these decisions are made by a product person and are passed off to researchers and engineers to implement. The problem with this approach is that ML-core products are intimately constrained by what is technically and algorithmically feasible. In my experience, having a technical understanding of the range of related ML problems can inspire product ideas that might not occur to someone without this understanding. To draw a loose analogy, it's like architecture. So much of the construction of a bridge is constrained by material resources and physics that it doesn't make sense to have people without that technical background design a bridge.

The goal of all this is to say that if you want to build a rich ML product, you need a rich product/design/research/engineering team: expertise all the way from the nitty-gritty of how ML theory works, to building systems, to domain knowledge, to higher-level product thinking, to interaction and graphic design; preferably people who are world-class in one of these areas but also good in several. Small talented teams with all of these skills are better equipped to navigate the joint uncertainty over product vision and model design. Large companies that have research and product people in entirely different buildings are ill-equipped to tackle these kinds of problems. The ML products of the future will come from startups with small founding teams that have this full context and can all fit in the proverbial garage.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

[1]: Although MCMC is a much older statistical technique, its broad use in large-scale machine learning applications is relatively recent.

[2]: The model is generative, so what's being described here is from the point-of-view of inference; the model's generative story works in reverse.

[3]: IBM model 3 introduced the concept of fertility to allow a given word to generate multiple independent target translation words. While this could generate the required translation, the probability of the model doing so is relatively low.


April 13 2012

Top Stories: April 9-13, 2012

Here's a look at the top stories published across O'Reilly sites this week.


Carsharing saves U.S. city governments millions in operating costs
Carsharing initiatives in a number of U.S. cities are part of a broader trend that suggests the ways we work, play and learn are changing.

Complexity fails: A lesson from storage simplification
Simple systems scale effectively, while complex systems struggle to overcome the multiplicative effect of potential failure points. This shows us why the most reliable and scalable clouds are those made up of fewer, simpler parts.

Operations, machine learning and premature babies
Machine learning and access to huge amounts of data allowed IBM to make an important discovery about premature infants. If web operations teams could capture everything — network data, environmental data, I/O subsystem data, etc. — what would they find out?


State of the Computer Book Market 2011
In his annual report, Mike Hendrickson analyzes tech book sales and industry data: Part 1, Overall Market; Part 2, The Categories; Part 3, The Publishers; Part 4, The Languages; Part 5, Wrap-Up and Digital.


Never, ever "out of print"
In a recent interview, attorney Dana Newman tackled issues surrounding publishing rights in the digital landscape. She said changes in the current model are needed to keep things equitable for both publishers and authors.


Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference, May 29 - 31 in San Francisco. Save 20% on registration with the code RADAR20.

Photo of servers: Google Production Server, v1 by Pargon, on Flickr

April 05 2012

Editorial Radar with Mike Loukides & Mike Hendrickson

Mike Loukides and Mike Hendrickson, two of O'Reilly Media's editors, sat down recently to talk about what's on their editorial radars. Mike and Mike have almost 50 years of combined technical book publishing experience and I always enjoy listening to their insight.

In this session, they discuss what they see in the tech space including:

  • How 3D printing and personal manufacturing will revolutionize the way business is conducted in the U.S. [Discussed at the 00:43 mark]
  • The rise of mobile and device sensors and how intelligence will be added to all sorts of devices. [Discussed at the 02:15 mark]
  • Clear winners in today's code space: JavaScript. With Node.js, D3, and HTML5, JavaScript is stepping up to the plate. [Discussed at the 04:12 mark]
  • A discussion of the best first language to teach programming and how we need to provide learners with instruction for the things they want to do. [Discussed at the 06:03 mark]

You can view the entire interview in the following video.

Next month, Mike and Mike will be talking about functional languages.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

March 23 2012

Top Stories: March 19-23, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Why StreetEasy rolled its own maps
Google's decision to start charging for its Maps API is leading some companies to mull other options. StreetEasy's Sebastian Delmont explains why and how his team made a change.

What is Dart?
Dart is a new structured web programming platform designed to enable complex, high-performance apps for the modern web. Kathy Walrath and Seth Ladd, members of Google's developer relations team, explain Dart's purpose and its applications.

My Paleo Media Diet
Jim Stogdill is tired of running on the info treadmill, so he's changing his media habits. His new approach: "Where I can, adapt to my surroundings; where I can't, adapt my surroundings to me."


The unreasonable necessity of subject experts
We can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. Subject experts are the ones who find the stories data wants to tell.

Direct sales uncover hidden trends for publishers
A recent O'Reilly customer survey revealed unusual results (e.g. laptops/desktops remain popular ereading devices). These sorts of insights are made possible by O'Reilly's direct sales channel.


Where Conference 2012 is where the people working on and using location technologies explore emerging trends in software development, tools, business strategies and marketing. Save 20% on registration with the code RADAR20.

March 22 2012

Strata Week: Machine learning vs domain expertise

Here are a few of the data stories that caught my attention this week:

Debating the future of subject area expertise

The "Data Science Debate" panel at Strata California 2012. Watch the debate.

The Oxford-style debate at Strata continues to be one of the most-talked-about events from the conference. This week, it's O'Reilly's Mike Loukides who weighs in with his thoughts on the debate, which had the motion "In data science, domain expertise is more important than machine learning skill." (For those who weren't there, the machine learning side "won." See Mike Driscoll's summary and full video from the debate.)

Loukides moves from the unreasonable effectiveness of data to examine the "unreasonable necessity of subject experts." He writes that:

"Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes 'unreasonably effective' through the conversation that takes place after the numbers have been crunched ... We can only take our inexplicable results at face value if we're just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they're based. And that's the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems."

Microsoft hires former Yahoo chief scientist

Microsoft has hired Raghu Ramakrishnan as a technical fellow for its Server and Tools Business (STB), reports ZDNet's Mary Jo Foley. According to his new company bio, Ramakrishnan's work will involve "big data and integration between STB's cloud offerings and the Online Services Division's platform assets."

Ramakrishnan comes to Microsoft from Yahoo, where he's been the chief scientist for three divisions — Audience, Cloud Platforms and Search. As Foley notes, Ramakrishnan's move is another indication that Microsoft is serious about "playing up its big data assets." Strata chair Edd Dumbill examined Microsoft's big data strategy earlier this year, noting in particular its work on a Hadoop distribution for Windows server and Azure.

Analyzing the value of social media data

How much is your data worth? The Atlantic's Alexis Madrigal does a little napkin math based on figures from the Internet Advertising Bureau to come up with a broad and ambiguous range between half a cent and $1,200 — depending on how you decide to make the calculation, of course.

In an effort to make those measurements easier and more useful, Google unveiled some additional reports as part of its Analytics product this week. It's a move Google says will help marketers:

"... identify the full value of traffic coming from social sites and measure how they lead to direct conversions or assist in future conversions; understand social activities happening both on and off of your site to help you optimize user engagement and increase social key performance indicators (KPIs); and make better, more efficient data-driven decisions in your social media marketing programs."

Engagement and conversion metrics for each social network will now be trackable through Google Analytics. Partners for this new Social Data Hub include Disqus, Echo, Reddit, Diigo, and Digg, among others.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.


March 16 2012

Four short links: 16 March 2012

  1. Militarizing Your Backyard With Python and Computer Vision (video) -- using a water cannon, computer vision, Arduino, and Python to keep marauding squirrel hordes under control. See the finished result for Yakkity Saxed moist rodent goodness.
  2. Soundbite -- dialogue search for Apple's Final Cut Pro and Adobe Premiere Pro. Boris Soundbite quickly and accurately finds any word or phrase spoken in recorded media. Shoot squirrels with computer vision, search audio with computer hearing. We live in the future, people. (via Andy Baio)
  3. Single Page Apps with Backbone.js -- interesting and detailed dissection of how one site did it. Single page apps are where the server sends back one HTML file which changes (via Javascript) in response to the user's activity, possibly with API calls happening in the background, but where the browser is very definitely not requesting more full HTML pages from the server. The idea is to have speed (pull less across the wire each time the page changes) and also to use the language you already know to build the web page (Javascript).
  4. Why Finish Books? (NY Review of Books) -- the more bad books you finish, the fewer good ones you'll have time to start. Applying this to the rest of life is left as an exercise for the reader.

February 22 2012

Four short links: 22 February 2012

  1. Hashbangs (Dan Webb) -- why those terrible #! URLs are a bad idea. Looks like they're going away with pushState coming to browsers. As Dan says, "URLs are forever". Let's get them right. I'm fascinated by how URLs are changing meaning and use over time.
  2. DNA Sequencing on a USB Stick -- this has been going the rounds, but I think there's a time coming when scientific data generation can be crowdsourced. I care about a particular type of fish, but it hasn't been sequenced. Can I catch one, sequence it, upload the sequence, and get insight into the animal by automated detection of similar genes from other animals? Let those who care do the boring work, let scientists work on the analysis.
  3. The US Recording Industry is Stealing From Me (Bruce Simpson) -- automated content detection at YouTube has created an industry of parasites who claim copyright infringement and then receive royalties from the ads shown on the allegedly infringing videos.
  4. Ubuntu on Android -- carry a desktop in your pocket? Tempting. It's for manufacturers, not something you install on existing handsets, which I'm sure will create tension with the open source world at Ubuntu's heart. Then again, creating tension with the open source world at Ubuntu's heart does seem to be Canonical's core competency ....

December 26 2011

Four short links: 26 December 2011

  1. Pattern -- a BSD-licensed bundle of Python tools for data retrieval, text analysis, and data visualization. If you were going to get started with accessible data (Twitter, Google), the fundamentals of analysis (entity extraction, clustering), and some basic visualizations of graph relationships, you could do a lot worse than to start here.
  2. Factorie (Google Code) -- Apache-licensed Scala library for a probabilistic modeling technique successfully applied to [...] named entity recognition, entity resolution, relation extraction, parsing, schema matching, ontology alignment, latent-variable generative models, including latent Dirichlet allocation. The state-of-the-art big data analysis tools are increasingly open source, presumably because the value lies in their application not in their existence. This is good news for everyone with a new application.
  3. Playtomic -- analytics as a service for gaming companies to learn what players actually do in their games. There aren't many fields untouched by analytics.
  4. Write or Die -- iPad app for writers where, if you don't keep writing, it begins to delete what you wrote earlier. Good for production to deadlines; reflective editing and deep thought not included.

November 03 2011

Four short links: 3 November 2011

  1. Feedback Without Frustration (YouTube) -- Scott Berkun at the HIVE conference talks about how feedback fails, and how to get it successfully. He is so good.
  2. Americhrome -- history of the official palette of the United States of America.
  3. Discovering Talented Musicians with Musical Analysis (Google Research blog) -- very clever, they do acoustical analysis and then train up a machine learning engine by asking humans to rate some tracks. Then they set it loose on YouTube and it finds people who are good but not yet popular. My favourite: I'll Follow You Into The Dark by a gentleman with a wonderful voice.
  4. Dark Sky (Kickstarter) -- hyperlocal hyper-realtime weather prediction. Uses radar imagery to figure out what's going on around you, then tells you what the weather will be like for the next 30-60 minutes. Clever use of data plus software.

September 08 2011

Four short links: 8 September 2011

  1. x86 Opcode Sheet (PDF) -- I love instruction set charts, they're the periodic table of opcodes. Just as the table of elements makes visually apparent the regular construction and common properties of the elements, so too do these charts convey the regular construction and common behaviour of the opcodes. (via programming.reddit)
  2. Monopoly as a Markov Chain -- see the probability of being on any given square after a given move; a simplified transition-matrix sketch follows this list. (via Joe Johnston)
  3. BoxerApp -- package old DOS games for your Mac.
  4. How Old Is Your Globe? -- determine the age of your globe from the no-longer-existent countries it contains. (via Richard Soderberg)
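
The transition-matrix sketch mentioned in item 2: dice moves only, ignoring jail, Chance, and Community Chest, which a fuller analysis like the linked one would also model. It computes the distribution over squares after a given number of moves.

    import numpy as np

    N = 40  # squares on the board

    # Probability of rolling each total with two dice.
    roll = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s) / 36.0
            for s in range(2, 13)}

    # Transition matrix: from square i, move ahead by the dice total (mod 40).
    P = np.zeros((N, N))
    for i in range(N):
        for s, p in roll.items():
            P[i, (i + s) % N] += p

    dist = np.zeros(N)
    dist[0] = 1.0            # start on Go
    for _ in range(10):      # distribution after 10 moves
        dist = dist @ P
    print(dist.round(3))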

September 06 2011

Four short links: 6 September 2011

  1. The Secret Life of Javascript Primitives -- good writing and clever headlines can make even the dullest topic seem interesting. This is interesting, I hasten to add.
  2. Backup Bouncer -- software to test how effective your backup tools are: you copy files to a test area by whatever means you like, then run this tool to see whether permissions, flags, owners, contents, timestamps, etc. are preserved. (via Joshua Schachter)
  3. reVerb -- open source (GPLv3) toolkit for learning triples from text. See the paper for more details.
  4. Patterns for Large-Scale Javascript Architecture -- enterprise (aka "scalable") architectures for Javascript apps.

August 12 2011

Four short links: 12 August 2011

  1. Hippocampus Text Adventure -- written as an exercise in learning Python, you explore the hippocampus. It's simple, but I like the idea of educational text adventures. (Well, educational in that you learn about more than the axe-throwing behaviour of the cave-dwelling dwarf)
  2. Pandas -- BSD-licensed Python data analysis library.
  3. Building Lanyrd -- Simon Willison's talk (with slides) about the technology under Lanyrd and the challenges in building with and deploying it.
  4. Electronic Skin Monitors Heart, Brain, and Muscles (Discover Magazine blogs) -- this is a freaking awesome proof-of-concept. Interview with the creator of a skin-mounted sensor that attaches like a sticker, is flexible, inductively powered, and much more. This represents a major step forward in possibilities for personal data-gathering. (via Courtney Johnston)

July 18 2011

Four short links: 18 July 2011

  1. Organisational Warfare (Simon Wardley) -- notes on the commoditisation of software, with interesting analyses of the positions of some large players. On closer inspection, Salesforce seems to be doing more than just commoditisation with an ILC pattern, as can be clearly seen from the Radian6 acquisition. They also seem to be operating a tower and moat strategy, i.e. creating a tower of revenue (the service) around which is built a moat devoid of differential value with high barriers to entry. When their competitors finally wake up and realise that the future world of CRM is in this service space, they'll discover a new player dominating this space who has not only removed many of the opportunities to differentiate (e.g. social CRM, mobile CRM) but built a large ecosystem that creates high rates of new innovation. This should be a fairly fatal combination.
  2. Learning to Win by Reading Manuals in a Monte-Carlo Framework (MIT) -- starting with no prior knowledge of the game or its UI, the system learns how to play and to win by experimenting, and from parsed manual text. They used FreeCiv, and assessed the influence of parsing the manual shallowly and deeply. Trust MIT to turn RTFM into a paper. For human-readable explanation, see the press release.
  3. A Shapefile of the TZ Timezones of the World -- I have nothing but sympathy for the poor gentleman who compiled this. Political boundaries are notoriously arbitrary, and timezones are even worse because they don't need a war to change. (via Matt Biddulph)
  4. Microsoft Adventure -- 1979 Microsoft game for the TRS-80 has fascinating threads into the past and into what would become Microsoft's future.

July 13 2011

Four short links: 13 July 2011

  1. Freebase in Node.js (github) -- handy library for interacting with Freebase from node code. (via Rob McKinnon)
  2. Formalize -- CSS library to provide a standard style for form elements. (via Emma Jane Hogbin)
  3. Suggesting More Friends Using the Implicit Social Graph (PDF) -- Google paper on the algorithm behind Friend Suggest. Related: Katango. (via Big Data)
  4. Dyslexia -- a typeface for dyslexics. (via Richard Soderberg)
