
January 28 2014

Four short links: 28 January 2014

  1. Intel On-Device Voice Recognition (Quartz) — interesting because the tension between client-side and server-side functionality is still alive and well. Features migrate from core to edge and back again as cycles, data, algorithms, and responsiveness expectations change.
  2. Meet Microsoft’s Personal Assistant (Bloomberg) — total information awareness assistant. By Seeing, Hearing, and Knowing All, in the future even elevators will be trying to read our minds. (via The Next Web)
  3. Microsoft Contributes Cloud Server Designs to Open Compute Project — As part of this effort, Microsoft Open Technologies Inc. is open sourcing the software code we created for the management of hardware operations, such as server diagnostics, power supply and fan control. We would like to help build an open source software community within OCP as well. (via Data Center Knowledge)
  4. Open Tissue Wiki — open source (ZLib license) generic algorithms and data structures for rapid development of interactive modeling and simulation.

January 02 2014

Four short links: 3 January 2014

  1. Commotion — open source mesh networks.
  2. WriteLaTeX — online collaborative LaTeX editor. No, really. This exists. In 2014.
  3. Distributed Systems — free book for download, goal is to bring together the ideas behind many of the more recent distributed systems – systems such as Amazon’s Dynamo, Google’s BigTable and MapReduce, Apache’s Hadoop etc.
  4. How Netflix Reverse-Engineered Hollywood (The Atlantic) — Using large teams of people specially trained to watch movies, Netflix deconstructed Hollywood. They paid people to watch films and tag them with all kinds of metadata. This process is so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness.

November 19 2013

Four short links: 19 November 2013

  1. Why The Banner Ad is Heroic — enough to make Dave Eggers cry. Advertising triumphalism rampant.
  2. Udacity/Thrun Profile — A student taking college algebra in person was 52% more likely to pass than one taking a Udacity class, making the $150 price tag–roughly one-third the normal in-state tuition–seem like something less than a bargain. In which Udacity pivots to hiring-sponsored workforce training and the new educational revolution looks remarkably like sponsored content.
  3. Amazon is Building Substations (GigaOm) — the company even has firmware engineers whose job it is to rewrite the archaic code that normally runs on the switchgear designed to control the flow of power to electricity infrastructure. Pretty sure that wasn’t a line item in the pitch deck for “the first Internet bookstore”.
  4. Panoramic Images — throw the camera in the air, get a 360×360 image from 36 2-megapixel lenses. Not sure that throwing was previously a recognised UI gesture.

November 18 2013

Four short links: 18 November 2013

  1. The Virtuous Pipeline of Code (Public Resource) — Chicago partnering with Public Resource to open its legal codes for good. “This is great! What can we do to help?” Bravo Chicago, and everyone else—take note!
  2. Smithsonian’s 3D Data — models of 21 objects, from a gunboat to the Wright Brothers’ plane, to a wooly mammoth skeleton, to Lincoln’s life masks. I wasn’t able to find a rights statement on the site which explicitly governed the 3D models. (via Smithsonian Magazine)
  3. Anki’s Robot Cars (Xconomy) — The common characteristics of these future products, in Sofman’s mind: “Relatively simple and elegant hardware; incredibly complicated software; and Web and wireless connectivity to be able to continually expand the experience over time.” (via Slashdot)
  4. An Empirical Evaluation of TCP Performance in Online Games — We show that because TCP was originally designed for unidirectional and network-limited bulk data transfers, it cannot adapt well to MMORPG traffic. In particular, the window-based congestion control and the fast retransmit algorithm for loss recovery are ineffective. Furthermore, TCP is overkill, as not every game packet needs to be transmitted in a reliably and orderly manner. We also show that the degraded network performance did impact users’ willingness to continue a game.

September 04 2013

Sharing is a competitive advantage

In October, we’re bringing our Velocity conference to New York for the first time. Let’s face it, a company expanding its conference to other locations isn’t anything that unique. And given the thriving startup scene in New York, there’s no real surprise we’d like to have a presence there, either. In that sense, we’ll be doing what we’ve already been doing for years with the Velocity conference in California: sharing expert knowledge about the skills and technologies that are critical for building scalable, resilient, high-availability websites and services.

But there’s an even more compelling reason we’re looking to New York: the finance industry. We’d be foolish and remiss if we acted like it didn’t factor in to our decision, and that we didn’t also share some common concerns, especially on the operational side of things. The Velocity community spends a great deal of time navigating significant operational realities — infrastructure, cost, risk, failures, resiliency; we have a great deal to share with people working in finance, and I’d wager, a great deal to learn in return. If Google or Amazon go down, they lose money. (I’m not saying this is a good thing, mind you.) When a “technical glitch” occurs in financial service systems, we get flash crashes, a complete suspension of the Nasdaq, and whatever else comes next — all with potentially catastrophic outcomes.

There are some massive cultural and regulatory differences between web companies and financial organizations, yet both have a relentless focus on rapid growth. However, they’ve approached growth (and the risk management associated with it) in two very different ways. Financial companies are highly competitive in a very closed manner — trading algorithms are held close to the chest like cards in a poker game. Conversely, the kind of exponential growth seen in the web industry over the past 20 years or so is borne out of a uniquely open ethos of sharing. Don’t get me wrong, startups are competitive, as they should be. And yet, in the second year of the Velocity conference in 2010, there was someone from Facebook operations talking about how they keep the site running, and another person talked about their front-end engineering frameworks, tools, and processes. And the same from Twitter.

In 2011, things got really interesting. People started talking about how they’d screwed up — and this wasn’t just swapping disaster stories; they talked about how to learn and benefit from screwing up, how to make failure a feature — something planned for and expected. A robust, honest culture of sharing information has evolved at Velocity that reflects a unique, counterintuitive startup mentality: sharing is a competitive advantage. Being able to see and talk openly about what other companies were doing spurred innovation and accelerated the rate of change. A thousand developers fixing bugs in Apache is more efficient than 1,000 developers fixing bugs in 1,000 different proprietary web servers. To those familiar with open source software, admittedly, this will not seem such a transcendent concept, but in other industries it could rightfully be considered the exact opposite: giving away your competitive advantage.

Velocity is doing something different from just sharing software. It’s sharing ideas about operations and performance in high-stakes environments. And if the stakes are high in Silicon Valley, they’re that much higher in NYC. So, what can someone in the finance development or operations world learn from Velocity? For starters, dig a bit deeper into reports on recent outages and you start to see that the “technical glitches” are more often than not some combination of manual configuration and human error issues — they exist in the fuzzy boundaries between human decision making and automation. This is a problem space that the Velocity conference and community at large has been investigating and innovating in for some time now. Automation — of configuration management, development hand-offs, testing, and deployment — has had a dramatic effect on the speed, stability, and precision of web software in recent years, and I’d argue a similar shift needs to happen soon for financial software as well. That said, automation in a vacuum, or done poorly, can be worse than the alternatives.

It’s these types of conversations we aim to bring to a new location and new industries come October. We hope you’ll join us there.

August 08 2013

The web performance I want

There’s been a lot said and written about web performance since the Velocity conference. And steps both forward and back — is the web getting faster? Are developers using increased performance to add more useless gunk to their pages, taking back performance gains almost as quickly as they’re achieved?

I don’t want to leap into that argument; Arvind Jain did a good job of discussing the issues at Velocity Santa Clara and in a blog post on Google’s analytics site. But I do want to discuss (all right, flame about) one issue that bugs me.

I see a lot of pages that appear to load quickly. I click on a site, and within a second, I have an apparently readable page.

“Apparently,” however, is a loaded word because a second later, some new component of the page loads, causing the browser to re-layout the page, so everything jumps around. Then comes the pop-over screen, asking if I want to subscribe or take a survey. (Most online renditions of print magazines: THIS MEANS YOU!). Then another resize, as another component appears. If I want to scroll down past the lead picture, which is usually uninteresting, I often find that I can’t because the browser is still laying out bits and pieces of the page. It’s almost as if the developers don’t want me to read the page. That’s certainly the effect they achieve.

I see plenty of sites that take 10, 20 seconds to load and function properly, where “function properly” means that the page scrolls, the text doesn’t jump around, and there’s no extraneous crap obscuring it. You can bet that I’m not waiting that long. No way. I don’t even mind pop-overs that much (well, I do, really), but when I click “no,” I want them to go away immediately, not hang around until tons of bloat have finished loading.

I do understand that many sites need to make money, and I’m not unsympathetic to paywalls. But man, if your strategy for getting me to subscribe is to annoy the hell out of me, it’s not working. It’s not going to work. And I really don’t understand how anyone could think that it would work. A site that’s a pleasure to read is going to get me as a repeat visitor, and maybe even a subscriber. A site that’s a pain to use, that frustrates me every time I visit — well, what do you think?

We’ve learned a lot about web performance in the last few years, but it seems to me that we’ve learned it the wrong way. Cruftifying web pages in ways that make them unusable until the entire page has loaded is not what Velocity is about. That’s not what web performance is about. And it’s a great way to prevent your audience from returning.

August 05 2013

Four short links: 6 August 2013

  1. White Hat’s Dilemma (Google Docs) — amazeballs preso with lots of tough ethical questions for people in the computer field.
  2. Chinese Hacking Team Caught Taking Over Decoy Water Plant (MIT Tech Review) — Wilhoit went on to show evidence that other hacking groups besides APT1 intentionally seek out and compromise water plant systems. Between March and June this year, 12 honeypots deployed across eight different countries attracted 74 intentional attacks, 10 of which were sophisticated enough to wrest complete control of the dummy control system.
  3. Web Tracing Framework — Rich tools for instrumenting, analyzing, and visualizing web apps.
  4. CoreOS — Linux kernel + systemd. That’s about it. CoreOS has just enough bits to run containers, but does not ship a package manager itself. In fact, the root partition is completely read-only, to guarantee consistency and make updates reliable. Docker-compatible.

February 14 2013

Four short links: 14 February 2013

  1. Welcome to the Malware-Industrial Complex (MIT) — brilliant phrase, sound analysis.
  2. Stupid Stupid xBox — The hardcore/soft-tv transition and any lead they feel they have is simply not defensible by licensing other industries’ generic video or music content because those industries will gladly sell and license the same content to all other players. A single custom studio of 150 employees also can not generate enough content to defensibly satisfy 76M+ customers. Only with quality primary software content from thousands of independent developers can you defend the brand and the product. Only by making the user experience simple, quick, and seamless can you defend the brand and the product. Never seen a better put statement of why an ecosystem of indies is essential.
  3. Data Feedback Loops for TV (Salon) — Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
  4. wrk — a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue.

December 18 2012

Four short links: 18 December 2012

  1. Credibility Ranking of Tweets During High Impact Events (PDF) — interesting research. Situational awareness information is information that leads to gain in the knowledge or update about details of the event, like the location, people affected, causes, etc. We found that on average, 30% content about an event, provides situational awareness information about the event, while 14% was spam. (via BoingBoing)
  2. The Commodore 64 — interesting that Chuck Peddle (who designed the 6502) and Bob Yannes (who designed the SID chip) are still alive. This article safely qualifies as Far More Than You Ever Thought You Wanted To Know About The C64 but it is fascinating. The BASIC housed in its ROM (“BASIC 2.0”) was painfully antiquated. It was actually the same BASIC that Tramiel had bought from Microsoft for the original PET back in 1977. Bill Gates, in a rare display of naivete, sold him the software outright for a flat fee of $10,000, figuring Commodore would have to come back soon for another, better version. He obviously didn’t know Jack Tramiel very well. Ironically, Commodore did have on hand a better BASIC 4.0 they had used in some of the later PET models, but Tramiel nixed using it in the Commodore 64 because it would require a more expensive 16 K rather than 8 K of ROM chips to house.
  3. The Performance Calendar — an article each day about speed. (via Steve Souders)
  4. Mr China Comes to America (The Atlantic) — long piece on the return of manufacturing to America, featuring Foo camper Liam Casey.

October 11 2012

Four short links: 11 October 2012

  1. ABalytics — dead simple A/B testing with Google Analytics. (via Dan Mazzini)
  2. Fastest Rubik Cube Solver is Made of Lego — it takes less than six seconds to solve the cube. Watch the video, it’s … wow. Also cool is watching it fail. (via Hacker News)
  3. Fairfax Watches BitTorrent (TorrentFreak) — At a government broadband conference in Sydney, Fairfax’s head of video Ricky Sutton admitted that in a country with one of the highest percentage of BitTorrent users worldwide, his company determines what shows to buy based on the popularity of pirated videos online.
  4. Web Performance Tools (Steve Souders) — compilation of popular web performance tools. Reminds me of nmap’s list of top security tools.

October 09 2012

Six themes from Velocity Europe

By Steve Souders and John Allspaw

More than 700 performance and operations engineers were in London last week for Velocity Europe 2012. Below, Velocity co-chairs Steve Souders and John Allspaw note high-level themes from across the various tracks (especially the hallway track) that are emerging for the WPO and DevOps communities.

Velocity Europe 2012 in London

Performance themes from Steve Souders

I was in awe of the speaker and exhibitor lineup going into Velocity Europe. It was filled with knowledgeable gurus and industry leaders. As Velocity Europe unfolded a few themes kept recurring, and I wanted to share those with you.

Performance matters more — The places and ways that web performance matters keep growing. The talks at Velocity covered desktop, mobile (native, web, and hybrid), tablet, TV, DSL, cable, FiOS, 3G, 4G, LTE, and WiMAX across social, financial, ecommerce, media, games, sports, video, search, analytics, advertising, and enterprise. Although all of the speakers were technical, they talked about how the focus on performance extends to other departments in their companies as well as the impact performance has on their users. Web performance has permeated all aspects of the web and has become a primary focus for web companies.

Organizational challenges are the hardest — Lonely Planet and SoundCloud talked about how the challenges in shifting their organizational culture to focus on performance were more difficult than the technical work required to actually improve performance. During the hallway track, a few other speakers and I were asked about ways to initiate this culture shift. There’s growing interest in figuring out how to change a company’s culture to value and invest in performance. This reminded me of our theme from Velocity 2009, the impact of performance and operations on the bottom line, where we brought in case studies that described the benefits of web performance using the vocabulary of the business. In 2013 I predict we’ll see a heavier emphasis on case studies and best practices for making performance a priority for the organization using a slightly different vocabulary, with terms like “culture,” “buy-in” and “DNA.”

The community is huge — As of today there are 42 web performance meetup groups totaling nearly 15,000 members worldwide: 15,000 members in just over three years! In addition to meetup groups, Aaron Kulick and Stephen Thair organized the inaugural WebPerfDays events in Santa Clara, Calif. and London (respectively). WebPerfDays, modelled after DevOpsDays, is an unconference for the web performance community organized by the web performance community. Although these two events coincided with Velocity, the intent is that anyone in the world can use the resources (templates, website, Twitter handle, etc.) to organize their own WebPerfDays. A growing web performance community means more projects, events, analyses, etc. reaching more people. I encourage you to attend your local web performance meetup group. If there isn’t one, then organize it. And consider organizing your own WebPerfDays as a one-day mini-Velocity in your own backyard.

Operations themes from John Allspaw

As if it was an extension of what we saw at Velocity U.S., there were a number of talks that underscored the importance of the human factor in web operations. I gave a tutorial called “Escalating Scenarios: A Deep Dive Into Outage Pitfalls” that mostly centered around the situations when ops teams find themselves responding to complex failure scenarios. Stephen Nelson-Smith gave a whirlwind tour of patterns and anti-patterns on workflows and getting things done in an engineering and operations context.

Gene Kim, Damon Edwards, John Willis, and Patrick Debois looked at the fundamentals surrounding development and operations cooperation and collaboration, in “DevOps Patterns Distilled.” Mike Rembetsy and Patrick McDonnell followed up with the implementation of those fundamentals at Etsy over a four-year period.

Theo Schlossnagle, ever the “dig deep” engineer, spoke on monitoring and observability. He gave some pretty surgical techniques for peering into production infrastructure in order to get an idea of what’s going on under the hood, with DTrace and tcpdump.

A number of talks covered handling large-scale growth:

These are just a few of the highlights we saw at Velocity Europe in London. As usual, half the fun was the hallway track: engineers trading stories, details, and approaches over food and drink. A fun and educational time was had by all.

June 26 2012

Four short links: 26 June 2012

  1. SnapItHD -- camera captures full 360-degree panorama and users select and zoom regions afterward. (via Idealog)
  2. Iago (GitHub) -- Twitter's load-generation tool.
  3. AutoCAD Worm Stealing Blueprints -- lovely, malware that targets inventions. The worm, known as ACAD/Medre.A, is spreading through infected AutoCAD templates and is sending tens of thousands of stolen documents to email addresses in China. This one has soured, but give the field time ... anything that can be stolen digitally, will be. (via Slashdot)
  4. Designing For and Against the Manufactured Normalcy Field (Greg Borenstein) -- Tim said this was one of his favourite sessions at this year's Foo Camp: breaking the artificial normality that we try to cast over new experiences so as to make them safe and comfortable.

June 08 2012

Four short links: 8 June 2012

  1. HAproxy -- high availability proxy, cf Varnish.
  2. Opera Reviews SPDY -- thoughts on the high-performance HTTP++ from a team with experience implementing their own protocols. Section 2 makes a good intro to the features of SPDY if you've not been keeping up.
  3. Jetpants -- Tumblr's automation toolkit for handling monstrously large MySQL database topologies. (via Hacker News)
  4. LeakedIn -- check if your LinkedIn password was leaked. Chris Shiflett had this site up before LinkedIn had publicly admitted the leak.

June 07 2012

What is DevOps?

Adrian Cockcroft's article about NoOps at Netflix ignited a controversy that has been smouldering for some months. John Allspaw's detailed response to Adrian's article makes a key point: What Adrian described as "NoOps" isn't really. Operations doesn't go away. Responsibilities can, and do, shift over time, and as they shift, so do job descriptions. But no matter how you slice it, the same jobs need to be done, and one of those jobs is operations. What Adrian is calling NoOps at Netflix isn't all that different from Operations at Etsy. But that just begs the question: What do we mean by "operations" in the 21st century? If NoOps is a movement for replacing operations with something that looks suspiciously like operations, there's clearly confusion. Now that some of the passion has died down, it's time to get to a better understanding of what we mean by operations and how it's changed over the years.

At a recent lunch, John noted that back in the dawn of the computer age, there was no distinction between dev and ops. If you developed, you operated. You mounted the tapes, you flipped the switches on the front panel, you rebooted when things crashed, and possibly even replaced the burned out vacuum tubes. And you got to wear a geeky white lab coat. Dev and ops started to separate in the '60s, when programmer/analysts dumped boxes of punch cards into readers, and "computer operators" behind a glass wall scurried around mounting tapes in response to IBM JCL. The operators also pulled printouts from line printers and shoved them in labeled cubbyholes, where you got your output filed under your last name.

The arrival of minicomputers in the 1970s and PCs in the '80s broke down the wall between mainframe operators and users, leading to the system and network administrators of the 1980s and '90s. That was the birth of modern "IT operations" culture. Minicomputer users tended to be computing professionals with just enough knowledge to be dangerous. (I remember when a new director was given the root password and told to "create an account for yourself" ... and promptly crashed the VAX, which was shared by about 30 users). PC users required networks; they required support; they required shared resources, such as file servers and mail servers. And yes, BOFH ("Bastard Operator from Hell") serves as a reminder of those days. I remember being told that "no one" else is having the problem you're having — and not getting beyond it until at a company meeting we found that everyone was having the exact same problem, in slightly different ways. No wonder we want ops to disappear. No wonder we wanted a wall between the developers and the sysadmins, particularly since, in theory, the advent of the personal computer and desktop workstation meant that we could all be responsible for our own machines.

But somebody has to keep the infrastructure running, including the increasingly important websites. As companies and computing facilities grew larger, the fire-fighting mentality of many system administrators didn't scale. When the whole company runs on one 386 box (like O'Reilly in 1990), mumbling obscure command-line incantations is an appropriate way to fix problems. But that doesn't work when you're talking hundreds or thousands of nodes at Rackspace or Amazon. From an operations standpoint, the big story of the web isn't the evolution toward full-fledged applications that run in the browser; it's the growth from single servers to tens of servers to hundreds, to thousands, to (in the case of Google or Facebook) millions. When you're running at that scale, fixing problems on the command line just isn't an option. You can't afford to let machines get out of sync through ad-hoc fixes and patches. Being told "We need 125 servers online ASAP, and there's no time to automate it" (as Sascha Bates encountered) is a recipe for disaster.

The response of the operations community to the problem of scale isn't surprising. One of the themes of O'Reilly's Velocity Conference is "Infrastructure as Code." If you're going to do operations reliably, you need to make it reproducible and programmatic. Hence virtual machines to shield software from configuration issues. Hence Puppet and Chef to automate configuration, so you know every machine has an identical software configuration and is running the right services. Hence Vagrant to ensure that all your virtual machines are constructed identically from the start. Hence automated monitoring tools to ensure that your clusters are running properly. It doesn't matter whether the nodes are in your own data center, in a hosting facility, or in a public cloud. If you're not writing software to manage them, you're not surviving.
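To make "reproducible and programmatic" concrete, here is a minimal sketch in Python of the idea behind configuration tools of this kind: declare the desired state once, then converge every machine toward it. It is not how Puppet or Chef are implemented. The `Node` class, the `converge` function, and the package names and versions are all hypothetical, and the "machines" are just in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for one machine: installed packages and running services."""
    name: str
    packages: dict = field(default_factory=dict)  # package name -> version
    services: set = field(default_factory=set)    # names of running services


# Hypothetical desired state, shared by every node in the cluster.
DESIRED = {
    "packages": {"nginx": "1.2.1", "ntp": "4.2.6"},
    "services": {"nginx", "ntp"},
}


def converge(node, desired):
    """Bring node in line with desired; return the list of changes made."""
    changes = []
    for pkg, version in desired["packages"].items():
        if node.packages.get(pkg) != version:
            node.packages[pkg] = version
            changes.append(f"install {pkg}={version}")
    for pkg in sorted(set(node.packages) - set(desired["packages"])):
        del node.packages[pkg]
        changes.append(f"remove {pkg}")
    for svc in sorted(desired["services"] - node.services):
        node.services.add(svc)
        changes.append(f"start {svc}")
    for svc in sorted(node.services - desired["services"]):
        node.services.discard(svc)
        changes.append(f"stop {svc}")
    return changes


fleet = [
    Node("web1", {"nginx": "1.0.15"}, {"nginx"}),  # drifted: old nginx, no ntp
    Node("web2", {"nginx": "1.2.1", "telnetd": "0.17"}, {"nginx", "telnetd"}),
    Node("web3"),                                   # freshly provisioned
]

first_pass = {node.name: converge(node, DESIRED) for node in fleet}
second_pass = {node.name: converge(node, DESIRED) for node in fleet}

for name, changes in first_pass.items():
    print(name, changes)
print("second pass:", second_pass)
```

The second pass reports no changes for any node. That idempotence is the point: running the same description again is safe, and three machines that started in three different states end up identical.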

Furthermore, as we move further and further away from traditional hardware servers and networks, and into a world that's virtualized on every level, old-style system administration ceases to work. Physical machines in a physical machine room won't disappear, but they're no longer the only thing a system administrator has to worry about. Where's the root disk drive on a virtual instance running at some colocation facility? Where's a network port on a virtual switch? Sure, system administrators of the '90s managed these resources with software; no sysadmin worth his salt came without a portfolio of Perl scripts. The difference is that now the resources themselves may be physical, or they may just be software; a network port, a disk drive, or a CPU has nothing to do with a physical entity you can point at or unplug. The only effective way to manage this layered reality is through software.

So infrastructure had to become code. All those Perl scripts show that it was already becoming code as early as the late '80s; indeed, Perl was designed as a programming language for automating system administration. It didn't take long for leading-edge sysadmins to realize that handcrafted configurations and non-reproducible incantations were a bad way to run their shops. It's possible that this trend means the end of traditional system administrators, whose jobs are reduced to racking up systems for Amazon or Rackspace. But that's only likely to be the fate of those sysadmins who refuse to grow and adapt as the computing industry evolves. (And I suspect that sysadmins who refuse to adapt swell the ranks of the BOFH fraternity, and most of us would be happy to see them leave.) Good sysadmins have always realized that automation was a significant component of their job and will adapt as automation becomes even more important. The new sysadmin won't power down a machine, replace a failing disk drive, reboot, and restore from backup; he'll write software to detect a misbehaving EC2 instance automatically, destroy the bad instance, spin up a new one, and configure it, all without interrupting service. With automation at this level, the new "ops guy" won't care if he's responsible for a dozen systems or 10,000. And the modern BOFH is, more often than not, an old-school sysadmin who has chosen not to adapt.
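The detect, destroy, replace, and configure cycle described above can be sketched as a small reconciliation loop. This is an illustration only: `FakeCloud` is a hypothetical in-memory stand-in, not the EC2 API or any real provider SDK, and a real health check would probe the instance rather than read a flag.

```python
import itertools


class FakeCloud:
    """In-memory stand-in for a cloud API; not a real provider SDK."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.instances = {}  # instance id -> {"healthy": bool, "configured": bool}

    def launch(self):
        instance_id = f"i-{next(self._ids):04d}"
        self.instances[instance_id] = {"healthy": True, "configured": False}
        return instance_id

    def configure(self, instance_id):
        self.instances[instance_id]["configured"] = True

    def is_healthy(self, instance_id):
        return self.instances[instance_id]["healthy"]

    def terminate(self, instance_id):
        del self.instances[instance_id]


def reconcile(cloud, desired_count):
    """One pass: replace unhealthy instances, then top up to desired_count."""
    actions = []
    for instance_id in list(cloud.instances):  # snapshot; we mutate below
        if not cloud.is_healthy(instance_id):
            # Bring the replacement up before removing the bad instance,
            # so capacity never drops during the swap.
            replacement = cloud.launch()
            cloud.configure(replacement)
            cloud.terminate(instance_id)
            actions.append(f"replaced {instance_id} with {replacement}")
    while len(cloud.instances) < desired_count:
        new_id = cloud.launch()
        cloud.configure(new_id)
        actions.append(f"launched {new_id}")
    return actions


cloud = FakeCloud()
reconcile(cloud, desired_count=3)              # builds i-0001 .. i-0003
cloud.instances["i-0002"]["healthy"] = False   # simulate a misbehaving instance
actions = reconcile(cloud, desired_count=3)
print(actions)
```

The same function works whether `desired_count` is a dozen or 10,000, which is why the size of the fleet stops mattering to the person who wrote it.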

James Urquhart nails it when he describes how modern applications, running in the cloud, still need to be resilient and fault tolerant, still need monitoring, still need to adapt to huge swings in load, etc. But he notes that those features, formerly provided by the IT/operations infrastructures, now need to be part of the application, particularly in "platform as a service" environments. Operations doesn't go away, it becomes part of the development. And rather than envision some sort of uber developer, who understands big data, web performance optimization, application middleware, and fault tolerance in a massively distributed environment, we need operations specialists on the development teams. The infrastructure doesn't go away — it moves into the code; and the people responsible for the infrastructure, the system administrators and corporate IT groups, evolve so that they can write the code that maintains the infrastructure. Rather than being isolated, they need to cooperate and collaborate with the developers who create the applications. This is the movement informally known as "DevOps."

Amazon's EBS outage last year demonstrates how the nature of "operations" has changed. There was a marked distinction between companies that suffered and lost money, and companies that rode through the outage just fine. What was the difference? The companies that didn't suffer, including Netflix, knew how to design for reliability; they understood resilience, spreading data across zones, and a whole lot of reliability engineering. Furthermore, they understood that resilience was a property of the application, and they worked with the development teams to ensure that the applications could survive when parts of the network went down. More important than the flames about Amazon's services are the testimonials of how intelligent and careful design kept applications running while EBS was down. Netflix's ChaosMonkey is an excellent, if extreme, example of a tool to ensure that a complex distributed application can survive outages; ChaosMonkey randomly kills instances and services within the application. The development and operations teams collaborate to ensure that the application is sufficiently robust to withstand constant random (and self-inflicted!) outages without degrading.
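A toy version of the same idea shows why resilience has to be a property of the application. The sketch below is not Netflix's tool. `Service` and `chaos_run` are hypothetical names, the "service" is a set of replica numbers, and recovery is simulated by simply restoring the set.

```python
import random


class Service:
    """Toy replicated service: a request succeeds if any replica is alive."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.alive = set(range(replicas))

    def kill(self, replica):
        self.alive.discard(replica)

    def recover(self):
        # Stand-in for automated recovery bringing every replica back.
        self.alive = set(range(self.replicas))

    def handle_request(self):
        return bool(self.alive)


def chaos_run(replicas, rounds, rng):
    """Kill one random live replica per round; count failed requests."""
    service = Service(replicas)
    failures = 0
    for _ in range(rounds):
        victim = rng.choice(sorted(service.alive))
        service.kill(victim)
        if not service.handle_request():
            failures += 1
        service.recover()
    return failures


failures_replicated = chaos_run(replicas=3, rounds=1000, rng=random.Random(42))
failures_single = chaos_run(replicas=1, rounds=1000, rng=random.Random(42))
print(failures_replicated, failures_single)
```

With three replicas, losing any one of them never fails a request, so the first count is 0. With a single replica, every kill is an outage, so the second count is 1000. The random killer is identical in both runs; only the design of the application differs.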

On the other hand, during the EBS outage, nobody who wasn't an Amazon employee touched a single piece of hardware. At the time, JD Long tweeted that the best thing about the EBS outage was that his guys weren't running around like crazy trying to fix things. That's how it should be. It's important, though, to notice how this differs from operations practices 20, even 10 years ago. It was all over before the outage even occurred: The sites that dealt with it successfully had written software that was robust, and carefully managed their data so that it wasn't reliant on a single zone. And similarly, the sites that scrambled to recover from the outage were those that hadn't built resilience into their applications and hadn't replicated their data across different zones.

In addition to this redistribution of responsibility, from the lower layers of the stack to the application itself, we're also seeing a redistribution of costs. It's a mistake to think that the cost of operations goes away. Capital expense for new servers may be replaced by monthly bills from Amazon, but it's still cost. There may be fewer traditional IT staff, and there will certainly be a higher ratio of servers to staff, but that's because some IT functions have disappeared into the development groups. The bonding is fluid, but that's precisely the point. The task — providing a solid, stable application for customers — is the same. The locations of the servers on which that application runs, and how they're managed, are all that changes.

One important task of operations is understanding the cost trade-offs between public clouds like Amazon's, private clouds, traditional colocation, and building their own infrastructure. It's hard to beat Amazon if you're a startup trying to conserve cash and need to allocate or deallocate hardware to respond to fluctuations in load. You don't want to own a huge cluster to handle your peak capacity but leave it idle most of the time. But Amazon isn't inexpensive, and a larger company can probably get a better deal taking its infrastructure to a colocation facility. A few of the largest companies will build their own datacenters. Cost versus flexibility is an important trade-off; scaling is inherently slow when you own physical hardware, and when you build your data centers to handle peak loads, your facility is underutilized most of the time. Smaller companies will develop hybrid strategies, with parts of the infrastructure hosted on public clouds like AWS or Rackspace, part running on private hosting services, and part running in-house. Optimizing how tasks are distributed between these facilities isn't simple; that is the province of operations groups. Developing applications that can run effectively in a hybrid environment: that's the responsibility of developers, with healthy cooperation with an operations team.
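The peak-versus-idle trade-off is easier to see with numbers. The sketch below uses made-up load figures and made-up prices (none of them come from Amazon or any real provider) to compare paying for peak capacity all month against paying only for the server-hours used.

```python
def owned_cost(peak_servers, monthly_cost_per_server):
    """Owning hardware means paying for peak capacity every month."""
    return peak_servers * monthly_cost_per_server

def on_demand_cost(hourly_load, hourly_rate):
    """On-demand means paying only for the server-hours actually used."""
    return sum(hourly_load) * hourly_rate

def utilization(hourly_load):
    """Fraction of peak capacity that is actually used over the period."""
    return sum(hourly_load) / (max(hourly_load) * len(hourly_load))

# Hypothetical 720-hour month: 10 servers normally, 100 during a 20-hour spike.
spiky = [10] * 700 + [100] * 20
# Hypothetical steady load: 100 servers around the clock.
steady = [100] * 720

# Hypothetical prices, chosen only to keep the arithmetic easy to follow.
MONTHLY_PER_OWNED_SERVER = 150
HOURLY_ON_DEMAND_RATE = 0.50

for name, load in (("spiky", spiky), ("steady", steady)):
    owned = owned_cost(max(load), MONTHLY_PER_OWNED_SERVER)
    rented = on_demand_cost(load, HOURLY_ON_DEMAND_RATE)
    print(name, owned, rented, round(utilization(load), 3))
```

With these invented figures, the spiky workload is far cheaper on demand because owned hardware would sit at 12.5% utilization, while the steady workload is cheaper to own. The crossover point depends entirely on real prices and real load, which is the analysis operations has to do.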

The use of metrics to monitor system performance is another respect in which system administration has evolved. In the '80s or early '90s, you knew when a machine crashed because you started getting phone calls. Early system monitoring tools like HP's OpenView provided limited visibility into system and network behavior but didn't give much more information than simple heartbeats or reachability tests. Modern tools like DTrace provide insight into almost every aspect of system behavior; one of the biggest challenges facing modern operations groups is developing analytic tools and metrics that can take advantage of the data that's available to predict problems before they become outages. We now have access to the data we need; we just don't know how to use it. And the more we rely on distributed systems, the more important monitoring becomes. As with so much else, monitoring needs to become part of the application itself. Operations is crucial to success, but operations can only succeed to the extent that it collaborates with developers and participates in the development of applications that can monitor and heal themselves.
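As one small example of predicting a problem before it becomes an outage, the sketch below fits a straight line to a handful of disk-usage readings and extrapolates to the point where the volume fills. The readings, the capacity, and the function name are hypothetical, and a least-squares line is only the simplest possible model; real metrics are rarely this tidy.

```python
def hours_until_full(samples, capacity):
    """Fit a least-squares line to (hour, used) samples and extrapolate to `capacity`.

    Returns the number of hours after the last sample at which the fitted
    line reaches `capacity`, or None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    covariance = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    last_t = samples[-1][0]
    return (capacity - intercept) / slope - last_t

# Hypothetical readings: (hour, GB used) on a 500 GB volume.
readings = [(0, 400), (1, 410), (2, 420), (3, 430)]
remaining = hours_until_full(readings, capacity=500)
print(remaining)  # 7.0
```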

Success isn't based entirely on integrating operations into development. It's naive to think that even the best development groups, aware of the challenges of high-performance, distributed applications, can write software that won't fail. On this two-way street, do developers wear the beepers, or IT staff? As Allspaw points out, it's important not to divorce developers from the consequences of their work since the fires are frequently set by their code. So, both developers and operations carry the beepers. Sharing responsibilities has another benefit. Rather than finger-pointing post-mortems that try to figure out whether an outage was caused by bad code or operational errors, when operations and development teams work together to solve outages, a post-mortem can focus less on assigning blame than on making systems more resilient in the future. Although we used to practice "root cause analysis" after failures, we're recognizing that finding out the single cause is unhelpful. Almost every outage is the result of a "perfect storm" of normal, everyday mishaps. Instead of figuring out what went wrong and building procedures to ensure that something bad can never happen again (a process that almost always introduces inefficiencies and unanticipated vulnerabilities), modern operations designs systems that are resilient in the face of everyday errors, even when they occur in unpredictable combinations.

In the past decade, we've seen major changes in software development practice. We've moved from various versions of the "waterfall" method, with interminable up-front planning, to "minimum viable product," continuous integration, and continuous deployment. It's important to understand that the waterfall methodologies of the '80s weren't "bad ideas" or mistakes. They were perfectly adapted to an age of shrink-wrapped software. When you produce a "gold disk" and manufacture thousands (or millions) of copies, the penalties for getting something wrong are huge. If there's a bug, you can't fix it until the next release. In this environment, a software release is a huge event. But in this age of web and mobile applications, deployment isn't such a big thing. We can release early, and release often; we've moved from continuous integration to continuous deployment. We've developed techniques for quick resolution in case a new release has serious problems; we've mastered A/B testing to test releases on a small subset of the user base.
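Exposing a release to a small subset of users needs a stable way to decide who is in the subset. One common approach, offered here as an assumption rather than a description of how any particular site does it, is to hash each user ID into a fixed number of buckets and release to the lowest few. The user IDs and function names below are invented for the sketch.

```python
import hashlib

def bucket(user_id, buckets=100):
    """Map a user id to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def sees_new_release(user_id, rollout_percent):
    """True if this user falls inside the rollout slice."""
    return bucket(user_id) < rollout_percent

# Hypothetical user base of 10,000 ids.
users = [f"user-{n}" for n in range(10_000)]
exposed = [u for u in users if sees_new_release(u, rollout_percent=5)]
share = len(exposed) / len(users)
print(f"{share:.1%} of users see the new release")
```

Because the bucket depends only on the ID, a given user gets the same answer on every request, and widening the rollout from 5 to 20 percent keeps everyone who already had the new release.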

All of these changes require cooperation and collaboration between developers and operations staff. Operations groups are adopting, and in many cases, leading in the effort to implement these changes. They're the specialists in resilience, in monitoring, in deploying changes and rolling them back. And the many attendees, hallway discussions, talks, and keynotes at O'Reilly's Velocity conference show us that they are adapting. They're learning about adopting approaches to resilience that are completely new to software engineering; they're learning about monitoring and diagnosing distributed systems, doing large-scale automation, and debugging under pressure. At a recent meeting, Jesse Robbins described scheduling EMT training sessions for operations staff so that they understood how to handle themselves and communicate with each other in an emergency. It's an interesting and provocative idea, and one of many things that modern operations staff bring to the mix when they work with developers.

What does the future hold for operations? System and network monitoring used to be exotic and bleeding-edge; now, it's expected. But we haven't taken it far enough. We're still learning how to monitor systems, how to analyze the data generated by modern monitoring tools, and how to build dashboards that let us see and use the results effectively. I've joked about "using a Hadoop cluster to monitor the Hadoop cluster," but that may not be far from reality. The amount of information we can capture is tremendous, and far beyond what humans can analyze without techniques like machine learning.

Likewise, operations groups are playing a huge role in the deployment of new, more efficient protocols for the web, like SPDY. Operations is involved, more than ever, in tuning the performance of operating systems and servers (even ones that aren't under our physical control); a lot of our "best practices" for TCP tuning were developed in the days of ISDN and 56 Kbps analog modems, and haven't been adapted to the reality of Gigabit Ethernet, OC48* fiber, and their descendants. Operations groups are responsible for figuring out how to use these technologies (and their successors) effectively. We're only beginning to digest IPv6 and the changes it implies for network infrastructure. And, while I've written a lot about building resilience into applications, so far we've only taken baby steps. There's a lot there that we still don't know. Operations groups have been leaders in taking best practices from older disciplines (control systems theory, manufacturing, medicine) and integrating them into software development.

And what about NoOps? Ultimately, it's a bad name, but the name doesn't really matter. A group practicing "NoOps" successfully hasn't banished operations. It's just moved operations elsewhere and called it something else. Whether a poorly chosen name helps or hinders progress remains to be seen, but operations won't go away; it will evolve to meet the challenges of delivering effective, reliable software to customers. Old-style system administrators may indeed be disappearing. But if so, they are being replaced by more sophisticated operations experts who work closely with development teams to get continuous deployment right; to build highly distributed systems that are resilient; and yes, to answer the pagers in the middle of the night when EBS goes down. DevOps.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

Photo: Taken at IBM's headquarters in Armonk, NY. By Mike Loukides.


June 06 2012

A crazy awesome gaming infrastructure

In this Velocity Podcast, I had a conversation with Sarah Novotny (@sarahnovotny), CIO of Meteor Entertainment. This conversation centers mostly on building a high-performance gaming infrastructure and bridging the gap between IT and business. Sarah has some great insights into building an environment for human performance that goes along with your quest for more reliable, scalable, tolerant, and secure web properties.

Our conversation lasted 00:15:44 and if you want to pinpoint any particular topic, you can find the specific timing below. Sarah provides some of her background and experience as well as what she is currently doing at Meteor here. The full conversation is outlined below.

  • As a CIO, how do you bridge the gap between technology and business? 00:02:28

  • How do you educate corporate stakeholders about the importance of DevOps and the impact it can have on IT? 00:03:26

  • How does someone set up best practices in an organization? 00:05:24

  • Are there signals that DevOps is actually happening where development and operations are merging? 00:08:31

  • How do you measure performance and make large changes in an online game without disrupting players? 00:09:59

  • How do you prepare for scaling your crazy awesome infrastructure needed for game play? 00:12:28

  • Have you gathered metrics on public and private clouds and do you know which ones to scale to when needed? 00:14:03

In addition to her work at Meteor, Sarah is co-chair of OSCON 2012 (being held July 16-20 in Portland, Ore.). We hope to see you there. You can also read Sarah's blog for more insights into what she's up to.



May 24 2012

Four short links: 24 May 2012

  1. Last Saturday My Son Found His People at the Maker Faire -- aww to the power of INFINITY.
  2. Dictionaries Linking Words to Concepts (Google Research) -- Wikipedia entries for concepts, text strings from searches and the oppressed workers down the Text Mines, and a count indicating how often the two were related.
  3. Magic Wand (Kickstarter) -- I don't want the game, I want a Bluetooth magic wand. I don't want to click the OK button, I want to wave a wand and make it so! (via Pete Warden)
  4. E-Commerce Performance (Luke Wroblewski) -- If a page load takes more than two seconds, 40% are likely to abandon that site. This is why you should follow Steve Souders like a hawk: if your site is slower than it could be, you're leaving money on the table.

May 10 2012

O'Reilly Radar Show 5/10/12: The surprising rise of JavaScript

Below you'll find the script and associated links from the May 10, 2012 episode of O'Reilly Radar. An archive of past shows is available through O'Reilly Media's YouTube channel and you can subscribe to episodes of O’Reilly Radar via iTunes.

The Radar interview

JavaScript’s ascendance has caught many people by surprise. Fluent Conference co-chair Peter Cooper explains why and how it happened in this episode of O’Reilly Radar [interview begins 12 seconds in].

Radar posts of note

Here’s a look at some of the top stories recently published across O’Reilly [segment begins at 11:58].

First up, Mike Hendrickson has published his annual five-part analysis of the computer book market. "State of the Computer Book Market" is a must-read for publishers and developers alike. The full report is also available as a free ebook. Read the series.

In a recent interview with Etsy's Mike Brittain we learned that a failure in secondary content doesn't need to take down an entire website. Brittain explains how to build resilience into UIs and allow for graceful failures. Read the post.

Finally, in our piece "Big data in Europe" Big Data Week organizers Stewart Townsend and Carlos Somohano share the distinctions and opportunities of Europe's data scene. Read the post.

As always, links to these stories and other resources mentioned during this episode are available at

Radar video spotlight

During a recent podcast interview, Velocity Conference chair Steve Souders described himself as an "optimization nut." Find out what that means — and discover how to stay on top of the latest web ops and performance techniques — in this episode’s video spotlight [segment begins at 13:04].

Here are the web operations and performance resources Steve Souders mentions during the video spotlight segment:


All of the links and resources noted during this episode — including those mentioned by Steve Souders in the previous segment — are available at

Also, you can always catch episodes of O’Reilly Radar at and subscribe to episodes through iTunes.

That’s all we have for now. Thanks for joining us and we’ll see you again soon.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

May 09 2012

Theo Schlossnagle on DevOps as a career

In this new Velocity Podcast, I had a conversation with Theo Schlossnagle (@postwait), the founder and CEO of OmniTI. This conversation centers mostly on DevOps as a discipline and career. Theo, as always, has some interesting insights into DevOps and how to build a successful career in this industry.

Our conversation lasted 00:13:21. If you want to pinpoint any particular topic, you can find the specific timing below. I will apologize now: Theo's image froze a couple minutes into our conversation, but since it was our second attempt at this, and it is a conversation, I feel the content of his answers is what most of us want to hear, not whether or not he is smiling or gesturing.

  • Are we splitting hairs with our terms of WebOps, DevOps, WebDev, etc? 00:00:42
  • What are the important goals developers should have in mind when building Systems that Operate? 00:01:28
  • How do you define, spec and set best practices for your DevOps organization so that your whole team is working well? 00:02:38
  • What does a typical day look like in the DevOps world? 00:03:39
  • What are the key attributes and skills someone should have to become a skilled DevOps? 00:04:50
  • What is the hardest to master for a young DevOps: security, scalability, reliability or performance? 00:06:22
  • Is DevOps more of a craft, discipline, methodology, way of thinking, what is it? 00:07:35
  • If your DevOps is operating well, do you notice it and how do you measure it if all is well? 00:08:47
  • What do you think the most significant thing a sharp DevOps person can contribute to an organization, and how do they know if they have achieved excellence? 00:10:16

If you would like to hear Theo speak on "It's All About Telemetry," he is presenting at the 2012 Velocity Conference in Santa Clara, Calif. on Tuesday 6/26/12 at 1:00pm. We hope to see you there.



Giving the Velocity website a performance makeover

Zebulon Young and I, web producers at O'Reilly Media, recently spent time focusing on the performance of the Velocity website. We were surprised by the results we achieved with a relatively small amount of effort. In two days we dropped Velocity's page weight by 49% and reduced the total average U.S. load time by 3.5 seconds¹. This is how we did it.

Velocity is about speed, right?

To set the stage, here's the average load time for Velocity's home page as measured² by Keynote before our work:

Chart: 7 Second Load Times

As the averages hovered above seven seconds, these load times definitely needed work. But where to start?

The big picture

If you take a look at the raw numbers for Velocity, you'll see that, while it's a relatively simple page, there's something much bigger behind the scenes. As measured³ above, the full page weight was 507 kB and there were 87 objects. This meant that the first time someone visited Velocity, their browser had to request and display a total of 87 pieces of HTML, images, CSS, and more, the whole of which totaled nearly half a megabyte:

Chart: Total Bytes 507k, Total Objects 87

Here's a breakdown of the content types by size:

Content Pie Chart

To top it off, a lot of these objects were still being served directly from our Santa Rosa, Calif. data center, instead of our Content Delivery Network (CDN). The problem with expecting every visitor to connect to our servers in California is simple: Not every visitor is near Santa Rosa. Velocity's visitors are all over the globe, so proper use of a CDN means that remote visitors will be served objects much closer to the connection they are currently using. Proximity improves delivery.

Getting started

At this point, we had three simple goals to slim down Velocity:

  1. Move all static objects to the CDN
  2. Cut down total page weight (kilobytes)
  3. Minimize the number of objects

1) CDN relocation and image compression

Our first task was compressing images and relocating static objects to the CDN. Using and the Google Page Speed lossless compression tools, we got to work crushing those image file sizes down.

To get a visual of the gains that we made, here are before and after waterfall charts from tests that we performed using Look at the download times for ubergizmo.jpg:

Before CDN Waterfall

You can see that the total download time for that one image dropped from 2.5 seconds to 0.3 seconds. This is far from a scientific A/B comparison, so you won't always see results this dramatic from CDN usage and compression, but we're definitely on the right track.

2) Lazy loading images

When you're trimming fat from your pages to improve load time, an obvious step is to only load what you need, and only load it when you need it. The Velocity website features a column of sponsor logos down the right-hand side of most pages. At the time of this writing, 48 images appear in that column, weighing in at 233 kB. However, only a fraction of those logos appear in even a large browser window without scrolling down.

Sidebar Sponsor Image Illustration

We addressed the impact these images had on load time in two ways. First, we deferred the load of these images until after the rest of the page had rendered — allowing the core page content to take priority. Second, when we did load these images, we only loaded those that would be visible in the current viewport. Additional logos are then loaded as they are scrolled into view.

These actions were accomplished by replacing the <img> tags in the HTML rendered by the server with text and meta-data that is then acted upon by JavaScript after the page loads. The code, which has room for additional enhancements, can be downloaded from GitHub.

The result of this enhancement was the removal of 48 requests and a full 233 kB from the initial page load, just for the sponsor images⁴. Even when the page has been fully rendered in the most common browser window size of 1366 x 768 pixels, this means cutting up to 44 objects and 217 kB from the page weight. Of course, the final page weight varies by how much of the page a visitor views, but the bottom line is that these resources don't delay the rendering of the primary page content. This comes at the cost of only a slight delay before the targeted images are displayed when the page initially loads and when it is scrolled. This delay might not be acceptable in all cases, but it's a valuable tool to have on your belt.
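The decision at the heart of this technique is simply "which placeholders overlap the viewport right now?" The code we actually run is JavaScript in the browser and is not reproduced here; the Python sketch below only models that decision. The `Placeholder` type, the file names, and the layout (48 logos, each 100 pixels tall, spaced 120 pixels apart, starting 400 pixels down the page) are assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Placeholder:
    """Stand-in for an <img>: where it sits on the page and what to load."""
    src: str
    top: int      # distance from the top of the page, in pixels
    height: int
    loaded: bool = False

def load_visible(placeholders, scroll_top, viewport_height):
    """Mark placeholders overlapping the viewport as loaded; return the new srcs."""
    bottom = scroll_top + viewport_height
    newly_loaded = []
    for p in placeholders:
        overlaps = p.top < bottom and p.top + p.height > scroll_top
        if overlaps and not p.loaded:
            p.loaded = True
            newly_loaded.append(p.src)
    return newly_loaded

# Hypothetical sponsor column: 48 logos stacked down the right-hand side.
logos = [Placeholder(f"logo-{i}.png", top=400 + i * 120, height=100) for i in range(48)]

first = load_visible(logos, scroll_top=0, viewport_height=768)
after_scroll = load_visible(logos, scroll_top=600, viewport_height=768)
print(len(first), len(after_scroll))  # 4 5
```

In the browser, the same function would be called once after the page renders and again from a scroll handler, with each returned entry turned into a real image request.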

3) Using Sprites

The concept of using sprites for images has always been closely tied to Steve Souders' first rule for faster-loading websites: make fewer HTTP requests. The idea is simple: combine your background images into a single image, then use CSS to display only the important parts.

Historically there's been some reluctance to embrace the use of sprites because it seems as though there's a lot of work for marginal benefits. In the case of Velocity, I found that creation of the sprites only took minutes with the use of Steve Souders' simple SpriteMe tool. The results were surprising:

Sprite Consolidation Illustration

Just by combining some images and (once again) compressing the results, we saw a drop of page weight by 47 kB and the total number of objects reduced by 11.
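For readers who haven't worked with sprites, the bookkeeping looks roughly like this. The sketch is not how SpriteMe works internally; it just stacks a few hypothetical images top to bottom, records where each one lands, and prints the CSS that shifts the combined sheet so only one image shows. The class names, image sizes, and `sprites.png` file name are invented.

```python
def pack_vertical(images):
    """Stack images top to bottom; return the sheet size and each image's y offset."""
    offsets, y, sheet_width = {}, 0, 0
    for name, (width, height) in images.items():
        offsets[name] = y
        y += height
        sheet_width = max(sheet_width, width)
    return (sheet_width, y), offsets

def css_rules(images, offsets, sheet_url):
    """One CSS rule per image, shifting the sheet up so only that image shows."""
    rules = []
    for name, (width, height) in images.items():
        shift = f"-{offsets[name]}px" if offsets[name] else "0"
        rules.append(
            f".{name} {{ background: url({sheet_url}) 0 {shift} no-repeat; "
            f"width: {width}px; height: {height}px; }}"
        )
    return rules

# Hypothetical background images: CSS class name -> (width, height) in pixels.
images = {"icon-rss": (16, 16), "icon-twitter": (16, 16), "banner": (200, 40)}

sheet_size, offsets = pack_vertical(images)
rules = css_rules(images, offsets, "sprites.png")
for rule in rules:
    print(rule)
```

Three images become one request, and each element picks out its own piece with a negative vertical `background-position`.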

4) Reassessing third-party widgets (Flickr and Twitter)

Third-party widget optimization can be one of the most difficult performance challenges to face. The code often isn't your own, isn't hosted on your servers, and, because of this, there are inherent inflexibilities. In the case of Velocity, we didn't have many widgets to review and optimize. After we spent some time surveying the site, we found two widgets that needed some attention.

The Flickr widget

The Flickr widget on Velocity was using JavaScript to pull three 75x75 pixel images directly from Flickr so they could be displayed on the "2011 PHOTOS" section seen here:

Flickr Widget Screenshot

There were a couple of problems with this. One, the randomization of images isn't essential to the user experience. Two, even though the images from Flickr are only 75x75, they were averaging about 25 kB each, which is huge for a tiny JPEG. With this in mind, we did away with the JavaScript altogether and simply hosted compressed versions of the images on our CDN.

With that simple change, we saved 56 kB (going from 76 kB to 20 kB) in file size alone.

The "Tweet" widget

As luck would have it, there had already been talk of removing the Tweet widget from the Velocity site before we began our performance efforts. After some investigation into how often the widget was used, then some discussion of its usefulness, we decided the Twitter widget was no longer essential. We removed the Twitter widget and the JavaScript that was backing it.

Tweet Widget Screenshot

The results

So without further ado, let's look at the results of our two-day WPO deep dive. As you can see by our "after" Keynote readings, the total downloaded size dropped to 258.6 kB and the object count slimmed down to 34:

After WPO Content Breakdown

After WPO Content Pie Chart

Our starting point of 507 kB and 87 objects was cut to 258.6 kB and 34 objects: 49% less page weight and 61% fewer objects on the page.
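The percentages follow directly from the Keynote readings quoted above (507 kB and 87 objects before, 258.6 kB and 34 objects after). A quick check, with a helper name chosen just for this example:

```python
def percent_reduction(before, after):
    """How much smaller `after` is than `before`, as a percentage of `before`."""
    return (before - after) / before * 100

weight_cut = percent_reduction(507, 258.6)   # page weight in kB
object_cut = percent_reduction(87, 34)       # number of requested objects

print(f"page weight: {weight_cut:.0f}% smaller")   # page weight: 49% smaller
print(f"object count: {object_cut:.0f}% fewer")    # object count: 61% fewer
```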

And for the most impressive illustration of the performance gains that were made, here's the long-term graph of Velocity's load times, in which they start around 7 seconds and settle around 2.5 seconds:

Chart Showing Drop to 2.5 Second Average Load Times


The biggest lesson we learned throughout this optimization process was that there isn't one single change that makes your website fast. All of the small performance changes we made added up, and suddenly we were taking seconds off our page's load times. With a little time and consideration, you may find similar performance enhancements in your own site.

And one last thing: Zeb and I will see you at Velocity in June.


¹, ², ³ Measurements and comparisons taken with Keynote (Application Perspective - ApP) Emulated Browser monitoring tools.

⁴ We also applied this treatment to the sponsor banner in the page footer, for additional savings.


April 27 2012

Top Stories: April 23-27, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Design your website for a graceful fail
A failure in secondary content doesn't need to take down an entire website. Here, Etsy's Mike Brittain explains how to build resilience into UIs and allow for graceful failures.

Big data in Europe
European application of big data is ramping up, but its spread is different from the patterns seen in the U.S. In this interview, Big Data Week organizers Stewart Townsend and Carlos Somohano share the key distinctions and opportunities associated with Europe's data scene.

The rewards of simple code
Simple code is born from planning, discipline and grinding work. But as author Max Kanat-Alexander notes in this interview, the benefits of simple code are worth the considerable effort it requires.

Fitness for geeks
Programmers who spend 14 hours a day in front of a computer know how hard it is to step away from the cubicle. But as "Fitness for Geeks" author Bruce Perry notes in this podcast, getting fit doesn't need to be daunting.

Joshua Bixby on the business of performance
Strangeloop's Joshua Bixby discusses the business of speed and why web performance optimization is an institutional need.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference, May 29 - 31 in San Francisco. Save 20% on registration with the code RADAR20.
