January 02 2014

Four short links: 3 January 2014

  1. Commotion — open source mesh networks.
  2. WriteLaTeX — online collaborative LaTeX editor. No, really. This exists. In 2014.
  3. Distributed Systems — free book for download, goal is to bring together the ideas behind many of the more recent distributed systems – systems such as Amazon’s Dynamo, Google’s BigTable and MapReduce, Apache’s Hadoop etc.
  4. How Netflix Reverse-Engineered Hollywood (The Atlantic) — Using large teams of people specially trained to watch movies, Netflix deconstructed Hollywood. They paid people to watch films and tag them with all kinds of metadata. This process is so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness.

November 05 2013

Four short links: 5 November 2013

  1. InfluxDB -- open-source, distributed, time series, events, and metrics database with no external dependencies.
  2. Omega (PDF) -- flexible, scalable schedulers for large compute clusters. From Google Research.
  3. GraspJS -- Search and replace your JavaScript code based on its structure rather than its text.
  4. Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.

September 24 2013

August 14 2013

August 08 2013

Four short links: 9 August 2013

  1. DEFCON Documentary — free download, I’m looking forward to watching it on the flight back to NZ.
  2. Global-Scale Systems — botnets as example of the scale of networks and systems we’ll have to build but don’t have experience in.
  3. MediaGoblin — GNU project to build a decentralized alternative to Flickr, YouTube, SoundCloud, etc.
  4. Teaching TCP/IP Headers with Legos — genius. (via BoingBoing)

April 19 2012

    Strata Week: The rise of the robot essay graders

    Here are a few of the data stories that caught my attention this week.

    Automated essay-scoring software scores as well as humans

    Robot essay graders: They grade the same as humans. That's the conclusion of a study conducted by University of Akron's Dean of the College of Education Mark Shermis and Kaggle data scientist Ben Hamner. The researchers examined some 22,000 essays that were administered to junior and high school students as part of their states' standardized testing process, comparing the grades given by human graders and those given by automated grading software. They found that "overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genre" (PDF of the report).

    "The demonstration showed conclusively that automated essay scoring systems are fast, accurate, and cost effective," says Tom Vander Ark, managing partner at the investment firm Learn Capital, in a press release touting the study's results.

    The study coincides with an active competition hosted on Kaggle and sponsored by the Hewlett Foundation, in which data scientists are challenged with developing the best algorithm to automatically grade student essays. "Better tests support better learning," noted the foundation's Education Program Director Barbara Chow in the press release. "This demonstration of rapid and accurate automated essay scoring will encourage states to include more writing in their state assessments. And, the more we can use essays to assess what students have learned, the greater the likelihood they'll master important academic content, critical thinking, and effective communication."

    Personally, I like writing for a human audience. Bots leave really stupid blog comments — but I bet there's an algorithm for that too.

    Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

    Save 20% on registration with the code RADAR20

    Scaling Instagram

    The billion-dollar acquisition of the mobile photo-sharing app Instagram was big news last week. The news coincided with a presentation by co-founder Mike Krieger at an AirBnB Tech Talk about how the startup managed to scale to 30 million users worldwide with a small team of back-end developers (a very small team, in fact). Krieger's presentation is interesting in its own right, of course, but news of the acquisition by Facebook certainly fueled interest — in the deal and in the tech under the Instagram hood.

    Krieger's slides can be found here. The presentation details some of the early and ongoing challenges of handling the app's increasing number of users and their photos (including the recent roll-out of an Android app, which added another million new users in just 12 hours). Although Instagram hasn't suffered any major outages like those seen at Twitter and Tumblr, Krieger does note a number of early problems, including a missing favicon.ico that was causing a lot of 404 errors in Django.

    Auditing data.gov.uk

    The UK's National Audit Office has just released its look at the government's open data efforts, reports The Guardian. Although the open data initiative gets good marks for the "tsunami of data" it's released — 8,300 datasets — there remain questions about cost and usage.

    Governmental departments estimate they spend between £53,000 and £500,000 each year on publishing the data, with the police crime maps, for example, costing £300,000 to set up and £150,000 per year to maintain. And it's not clear that the data is in demand, according to the National Audit Office report: "None of the departments reported significant spontaneous public demand for the standard dataset releases." This doesn't account for the ways in which third-party vendors may be using the data, however.

    Big Data Week

    April 23-29 is "Big Data Week," an event created by DataSift that will feature meetups and hackathons in several cities around the world. Big Data Week aims to bring together the "core communities" — data scientists, data technologies, data visualization, and data business. A list of events is available on the Big Data Week website.

    Got data news?

    Feel free to email me.

    Photo: Taking a test at the Real Estate Investing College by Casey Serin, on Flickr

    February 16 2012

    Strata Week: The data behind Yahoo's front page

    Here are a few of the data stories that caught my attention this week.

    Data and personalization drive Yahoo's front page

    Yahoo offered a peek behind the scenes of its front page with the release of the Yahoo C.O.R.E. Data Visualization. The visualization provides a way to view some of the demographic details behind what Yahoo visitors are clicking on.

    The C.O.R.E. (Content Optimization and Relevance Engine) technology was created by Yahoo Labs. The tech is used by Yahoo News and its Today module to personalize results for its visitors — resulting in some 13,000,000 unique story combinations per day. According to Yahoo:

    "C.O.R.E. determines how stories should be ordered, dependent on each user. Similarly, C.O.R.E. figures out which story categories (i.e. technology, health, finance, or entertainment) should be displayed prominently on the page to help deepen engagement for each viewer."

    Screenshot from Yahoo's CORE data visualization. See the full visualization here.

    Scaling Tumblr

    Over on the High Scalability blog, Todd Hoff examines how the blogging site Tumblr was able to scale its infrastructure, something that Hoff describes as more challenging than the scaling that was necessary at Twitter.

    To give some idea of the scope of the problem, Hoff cites these figures:

    "Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers."
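
As a rough back-of-envelope reading of those figures, the sketch below converts the daily totals into per-second and per-server rates. It assumes decimal units (1 TB = 1,000 GB), treats "1000+ servers" as exactly 1,000, and spreads load evenly, none of which the article states. Page views and requests are different units, so the average and the quoted peak are not directly comparable.

```python
# Back-of-envelope arithmetic on the figures quoted above.
# Assumptions (not from the article): decimal units, exactly 1000 servers,
# and load spread evenly across the day and across machines.
SECONDS_PER_DAY = 24 * 60 * 60

page_views_per_day = 500_000_000
new_data_tb_per_day = 3
servers = 1000

# Average page views per second over a whole day.
avg_page_views_per_second = page_views_per_day / SECONDS_PER_DAY

# New data each server would absorb per day if it were spread evenly.
new_data_gb_per_server_per_day = new_data_tb_per_day * 1000 / servers

# Sustained ingest rate implied by the daily storage growth.
new_data_mb_per_second = new_data_tb_per_day * 1_000_000 / SECONDS_PER_DAY

print(f"{avg_page_views_per_second:,.0f} page views/s on average")
print(f"{new_data_gb_per_server_per_day:.1f} GB of new data per server per day")
print(f"{new_data_mb_per_second:.1f} MB/s of new data")
```

Even as an average, that is several thousand page views every second, which gives a sense of why the quoted peak request rate is so much higher.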

    Hoff interviews Blake Matheny, distributed systems engineer at Tumblr, for a look at the architecture of both "old" and "new" Tumblr. When the startup began, it was hosted on Rackspace where "it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate."

    The article also describes the Tumblr firehose, noting again its differences from Twitter's. "A challenge is to distribute so much data in real-time," Hoff writes. "[Tumblr] wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed." Although Tumblr initially used Scribe/Hadoop, "this model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second."

    Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.


    Visualization creation

    Data scientist Pete Warden offers his own lessons learned about building visualizations this week in a story here on Radar. His first tip: "Play with your data" -- that is, before you decide what problem you want to solve or visualization you want to create, take the time to know the data you're working with.

    Warden writes:

    "The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization."

    Warden explains how he was able to create a visualization for his new travel startup, Jetpac, that showed where American Facebook users go on vacation. Warden's tips aren't simply about the tools he used; he also walks through the conceptualization of the project as well as the crunching of the data.

    March 09 2011

    Four short links: 9 March 2011

    1. R Studio -- AGPLv3-licensed IDE for R. It brings your R console, source code, plots, help, history, and workspace browser into one cohesive package. We've added some neat productivity features like a searchable endless command history, function/symbol completion, data import dialog with preview, one-click Sweave compile, and more. Source on github. Built as a web-app on Google AppEngine, from Joe Cheng who did Windows Live Writer at Microsoft. (via DeWitt Clinton)
    2. Adventures in Participatory Audience -- Nina Simon helped thirteen students produce three projects to encourage participation in museum audiences: Xavier, Stringing Connections, and Dirty Laundry. My favourite was Dirty Laundry, where people shared secrets connected to works of art. Nina's description of what she learned has some nuggets: friendly faces welcoming people in gets better response than a card with instructions, and I am still flummoxed as to what would make someone admit to an affair or bad parenting in a sterile art gallery, or the devastating one that read, "I avoid the important, difficult conversations with those I love the most." Audience participation in the real world has lessons on what works for those who would build social software.
    3. Why Generic Machine Learning Fails -- Returns for increasing data size come from two sources: (1) the importance of tails and (2) the cost of model innovation. When tails are important, or when model innovation is difficult relative to cost of data capture, then more data is the answer. [...] Machine learning is not undifferentiated heavy lifting, it’s not commoditizable like EC2, and closer to design than coding. The Netflix prize is a good example: the last 10% reduction in RMSE wasn't due to more powerful generic algorithms, but rather due to some very clever thinking about the structure of the problem; observations like "people who rate a whole slew of movies at one time tend to be rating movies they saw a long time ago" from BellKor.
    4. Anatomy of a Crushing -- Maciej Ceglowski describes how pinboard.in survived the flood of Delicious émigrées. It took several rounds of rewrites to get the simple tag cloud script right, and this made me very skittish about touching any other parts of the code over the next few days, even when the fixes were easy and obvious. The part of my brain that knew what to do no longer seemed to be connected directly to my hands.

    February 18 2011

    Four short links: 18 February 2011

    1. DSPL: DataSet Publishing Language (Google Code) -- a representation language for the data and metadata of datasets. Datasets described in this format can be processed by Google and visualized in the Google Public Data Explorer. XML metadata on CSV, geo-enabled, with linkable data. (via Michal Migurski on Delicious)
    2. Why is Evidence So Hard for Politicians -- Ben Goldacre nails how politicians go about "evidence-based policy making": So the Minister has cherry picked only the good findings, from only one report, while ignoring the peer-reviewed literature. Most crucially, he cherry-picks findings he likes whilst explicitly claiming that he is fairly citing the totality of the evidence from a thorough analysis. I can produce good evidence that I have a magical two-headed coin, if I simply disregard all the throws where it comes out tails.
    3. Celery: Distributed Task Queue -- asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. MIT-style licensed, written in Python, RabbitMQ is the recommended message broker. (via Joshua Schachter on Delicious)
    4. pixelfari -- Safari hacked to look like it's running on an 8-bit computer. This sense of playfulness with the medium is something I love about the best coders. They think "ha, wouldn't it be funny if ..." and then can make it happen.

    September 22 2010

    Four short links: 22 September 2010

    1. The Rise of Amazon Web Services -- Stephen O'Grady points out that Amazon has become an enterprise sales company but we don't treat it as such because we think of it as a retail company that's dabbling in technology. I think of Amazon as an automation company: they automate and optimize everything, and a data center is just a warehouse for MIPS. (via Matt Asay)
    2. Celery Project -- a distributed task queue. (via joshua on Delicious)
    3. Memory Upgrade (The Economist) -- a photofit system that uses evolutionary algorithms to generate the suspects' faces, and does clever things like animated distortions to call out features the witness might recall. Technology going beyond automated sketch artists.
    4. The Particle Adventure: The Fundamentals of Matter and Force -- basic physics in easy-to-understand language with illustrations, all in bite-size pieces (and 1998-era web design). I'm pondering what one of these would be like for computers, and whether "how do these actually work?" has the same romance as "how does the world really work?".

    September 09 2010

    Four short links: 9 September 2010

    1. CloudUSB -- a USB key containing your operating environment and your data + a protected folder so nobody can access your data, even if you lose the key + a backup program which keeps a copy of your data on an online disk, with double password protection. (via ferrouswheel on Twitter)
    2. FCC APIs -- for spectrum licenses, consumer broadband tests, census block search, and more. (via rjweeks70 on Twitter)
    3. Sibyl: A system for large scale machine learning (PDF) -- paper from Google researchers on how to build machine learning on top of a system designed for batch processing. (via Greg Linden)
    4. The Surprisingness of What We Say About Ourselves (BERG London) -- I made a chart of word-by-word surprisingness: given the statement so far, could Scribe predict what would come next?
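
The word-by-word "surprisingness" in the last link can be illustrated with the information-theoretic notion of surprisal, -log2 p(word | previous words). The sketch below is only an illustration: it swaps in a tiny bigram model with add-one smoothing in place of Scribe, and the corpus and the function names `train_bigrams` and `surprisal` are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def surprisal(counts, sentence, vocab_size):
    """Return (word, bits) pairs; higher bits means a less predictable word."""
    words = ["<s>"] + sentence.lower().split()
    result = []
    for prev, nxt in zip(words, words[1:]):
        following = counts.get(prev, Counter())
        # Add-one smoothing so unseen words get a small non-zero probability.
        p = (following[nxt] + 1) / (sum(following.values()) + vocab_size)
        result.append((nxt, -math.log2(p)))
    return result

# Hypothetical toy corpus of self-descriptions.
corpus = ["i am a designer", "i am a writer", "i am a designer and a father"]
counts = train_bigrams(corpus)
# One extra vocabulary slot stands in for any word never seen in training.
vocab_size = len({w for s in corpus for w in s.lower().split()}) + 1

scores = surprisal(counts, "i am a plumber", vocab_size)
for word, bits in scores:
    print(f"{word:10s} {bits:.2f} bits")
```

In this toy run the formulaic opening words score low and the unseen final word scores highest, which is the shape a word-by-word surprisingness chart would show.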

    July 29 2010

    Four short links: 29 July 2010

    1. How to Raise Funds for Non-Profits (Joi Ito) -- One organization sent a message to all of their donors during the Haiti crisis asking them to give to an NGO that they had vetted. They didn't ask for any money for themselves. This had a hugely positive effect and the donors' trust in the group increased. Wallets aren't zero sum.
    2. legislation.gov.uk -- very elegant legislation system for the UK. Check out the annual analysis, for example. (via rchards on Twitter)
    3. The Great WebKit Comparison Table -- So far I’ve tested 14 different mobile WebKits, and they are all slightly different. You can find the details below. (via Andrew Savikas)
    4. Node and Scaling in the Small vs Scaling in the Large (al3x) -- In a system of no significant scale, basically anything works. The power of today’s hardware is such that, for example, you can build a web application that supports thousands of users using one of the slowest available programming languages, brutally inefficient datastore access and storage patterns, zero caching, no sensible distribution of work, no attention to locality, etc. etc. Basically, you can apply every available anti-pattern and still come out the other end with a workable system, simply because the hardware can move faster than your bad decision-making.

    May 18 2010

    April 28 2010

    February 09 2010

    Four short links: 9 February 2010

    1. Track DC -- informative drill-down report from Washington DC government about the different departments. (via Sunlight Labs blog)
    2. Errors in Scientific Software -- a 1994 study of scientific software that found inconsistent interfaces (1 in 7 for Fortran, 1 in 37 for C) and poor use of arithmetic such that significant figures declined from 6sf in the data to 1sf in the result. (via "If you're going to do good science, release the computer code too" in the Guardian)
    3. How Farmville Scales -- 75M players/month (28M/day), 1/4 of disk activity is writes, 50% higher load spikes, 3G/s traffic go between Farmville and Facebook at peak, LAMP stack, nagios+munin+puppet. (via Hacker News)
    4. Mathematical Philology -- when two manuscripts of the same text differ, which is correct? This PLoSONE paper looked at all such discrepancies in Lucretius's De Rerum Natura and found that the traditional principle of choosing the more difficult reading (on the grounds that errors are from humans unconsciously simplifying) has a strong information theory justification for it. Interesting to see this less than a week after an MIT Technology Review article on quantum teleportation remarked, There is a growing sense that the properties of the universe are best described not by the laws that govern matter but by the laws that govern information.

    January 04 2010

    Four short links: 4 January 2010

    1. Why Git Is So Fast -- interesting mailing list post about the problems that the JGit folks had when they tried to make their Java version of Git go faster. Higher level languages hide enough of the machine that we can't make all of these optimizations. A reminder that you must know and control the systems you're running on if you want to get great performance. (via Hacker News)
    2. Wooden Combination Lock -- you'll easily understand how combination locks work with this fine piece of crafty construction work.
    3. From Moleskine to Market -- how a leading font designer designs fonts. Fascinating, and beautiful, and it makes me covet his skills.
    4. Terrastore -- open source distributed document store, HTTP accessible, data and queries are distributed, built on Terracotta which is built on ehcache. A NoSQL database built on Java tools that serious Java developers respect, the first such one that I've noticed. Notice that all the interesting work going on in the NoSQL arena is happening in open source projects.

    December 22 2009

    Four short links: 22 December 2009

    1. Trading Shares in Milliseconds (Technology Review) -- With the rise of automation, the bulk of U.S. stock trading has moved from the once-crowded floor of Manhattan's New York Stock Exchange (NYSE) to silent server farms run by exchanges and broker-dealers across the country: the proportion of all trades that the NYSE handles has shrunk from 80 percent in 2005 to 40 percent today. Trading is now essentially a virtual art, and its practitioners put such a premium on speed that NASDAQ has considered issuing equal 100-foot lengths of cable to the brokers who send orders to its exchange servers. (via Hacker News)
    2. Stream iTunes Over SSH -- short script that lets you tunnel iTunes from one machine to another over SSH (by default iTunes only shares on the local network).
    3. Doodle -- simple way to schedule a common meeting time. (via joshua on Delicious)
    4. Crowdsourcing -- Simon Willison's thoughtful "lessons learned" from his crowdsourcing projects at the Guardian. Crowdsourcing is not as simple as "give them a wiki and they will fill it" (this is related to the failed "everyone in the world wants to work on my broken payroll system" theory of open source), and Simon explains some of the subtleties. The reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn’t want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time. For the new version, we tried to provide a much better feeling of activity around the site. We added “top reviewer” tables to every assignment, MP and political party as well as a “most active reviewers in the past 48 hours” table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.
