
June 12 2012

Velocity Profile: Schlomo Schapiro

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Schlomo Schapiro
Systems Architect, Open Source Evangelist

How did you get into web operations and performance?

Previously I was working as a consultant for Linux, open source tools and virtualization. While this is a great job, it has one major drawback: One usually does not stay with one customer long enough to enable the really big changes, especially with regard to how the customer works. When ImmobilienScout24 came along and offered me the job as a Systems Architect, this was my ticket out of consulting and into diving deeply into a single customer scenario. The challenges that ImmobilienScout24 faced were very much along the lines that occupied me as well:

  • How to change from "stable operations" to "stable change."
  • How to fully automate a large data center and stop doing repeating tasks manually.
  • How to drastically increase the velocity of our release cycles.

What are your most memorable projects?

There are a number of them:

  • An internal open source project to manage the IT desktops by the people who use them.
  • An open source project, Lab Manager Light, that turns a standard VMware vSphere environment into a self-service cloud.
  • The biggest and still very much ongoing project is the new deployment and systems automation for our data center. The approach — which is also new — is to unify the management of our Linux servers under the built-in package manager, in our case RPM. That way all files on the servers are already taken care of and we only need to centrally orchestrate the package roll-out waves and service start/stop. The tools we use for this are published here.
  • Help to nudge us to embrace DevOps last year after the development went agile some three years ago.
  • Most important of all, I feel that ImmobilienScout24 is now on its way to maintain and build upon the technological edge matching our market share as the dominating real-estate listing portal in Germany. This will actually enable us to keep growing and setting the pace in the ever-faster Internet world.

What's the toughest problem you've had to solve?

The real challenge is not to hack up a quick solution but to work as a team to build a sustainable world. Technical debt discussions are now a major part of my daily work. As tedious as they can be, I strongly believe that at our current state sustainability is at least as important as innovation.

What tools and techniques do you rely on most?

Asking questions and trying to understand with everybody together how things really work. Walking a lot through the office with a coffee cup and talking to people. Taking the time to sit down with a colleague at the keyboard and seeing things through. Sometimes it helps to shorten a discussion with a little hacking and "look, it just works," but this should always be a way to start a discussion. The real work is better done together as a team.

What is your web operations and performance super power?

I hope that I manage to help us all to look forward to the next day at work. I also try to simplify things until they are really simple, and I annoy everybody by nagging about separation of concerns.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20


June 07 2012

What is DevOps?

Adrian Cockcroft's article about NoOps at Netflix ignited a controversy that has been smouldering for some months. John Allspaw's detailed response to Adrian's article makes a key point: What Adrian described as "NoOps" isn't really. Operations doesn't go away. Responsibilities can, and do, shift over time, and as they shift, so do job descriptions. But no matter how you slice it, the same jobs need to be done, and one of those jobs is operations. What Adrian is calling NoOps at Netflix isn't all that different from Operations at Etsy. But that just begs the question: What do we mean by "operations" in the 21st century? If NoOps is a movement for replacing operations with something that looks suspiciously like operations, there's clearly confusion. Now that some of the passion has died down, it's time to get to a better understanding of what we mean by operations and how it's changed over the years.

At a recent lunch, John noted that back in the dawn of the computer age, there was no distinction between dev and ops. If you developed, you operated. You mounted the tapes, you flipped the switches on the front panel, you rebooted when things crashed, and possibly even replaced the burned out vacuum tubes. And you got to wear a geeky white lab coat. Dev and ops started to separate in the '60s, when programmer/analysts dumped boxes of punch cards into readers, and "computer operators" behind a glass wall scurried around mounting tapes in response to IBM JCL. The operators also pulled printouts from line printers and shoved them in labeled cubbyholes, where you got your output filed under your last name.

The arrival of minicomputers in the 1970s and PCs in the '80s broke down the wall between mainframe operators and users, leading to the system and network administrators of the 1980s and '90s. That was the birth of modern "IT operations" culture. Minicomputer users tended to be computing professionals with just enough knowledge to be dangerous. (I remember when a new director was given the root password and told to "create an account for yourself" ... and promptly crashed the VAX, which was shared by about 30 users). PC users required networks; they required support; they required shared resources, such as file servers and mail servers. And yes, BOFH ("Bastard Operator from Hell") serves as a reminder of those days. I remember being told that "no one" else is having the problem you're having — and not getting beyond it until at a company meeting we found that everyone was having the exact same problem, in slightly different ways. No wonder we want ops to disappear. No wonder we wanted a wall between the developers and the sysadmins, particularly since, in theory, the advent of the personal computer and desktop workstation meant that we could all be responsible for our own machines.

But somebody has to keep the infrastructure running, including the increasingly important websites. As companies and computing facilities grew larger, the fire-fighting mentality of many system administrators didn't scale. When the whole company runs on one 386 box (like O'Reilly in 1990), mumbling obscure command-line incantations is an appropriate way to fix problems. But that doesn't work when you're talking hundreds or thousands of nodes at Rackspace or Amazon. From an operations standpoint, the big story of the web isn't the evolution toward full-fledged applications that run in the browser; it's the growth from single servers to tens of servers to hundreds, to thousands, to (in the case of Google or Facebook) millions. When you're running at that scale, fixing problems on the command line just isn't an option. You can't afford letting machines get out of sync through ad-hoc fixes and patches. Being told "We need 125 servers online ASAP, and there's no time to automate it" (as Sascha Bates encountered) is a recipe for disaster.

The response of the operations community to the problem of scale isn't surprising. One of the themes of O'Reilly's Velocity Conference is "Infrastructure as Code." If you're going to do operations reliably, you need to make it reproducible and programmatic. Hence virtual machines to shield software from configuration issues. Hence Puppet and Chef to automate configuration, so you know every machine has an identical software configuration and is running the right services. Hence Vagrant to ensure that all your virtual machines are constructed identically from the start. Hence automated monitoring tools to ensure that your clusters are running properly. It doesn't matter whether the nodes are in your own data center, in a hosting facility, or in a public cloud. If you're not writing software to manage them, you're not surviving.

Furthermore, as we move further and further away from traditional hardware servers and networks, and into a world that's virtualized on every level, old-style system administration ceases to work. Physical machines in a physical machine room won't disappear, but they're no longer the only thing a system administrator has to worry about. Where's the root disk drive on a virtual instance running at some colocation facility? Where's a network port on a virtual switch? Sure, system administrators of the '90s managed these resources with software; no sysadmin worth his salt came without a portfolio of Perl scripts. The difference is that now the resources themselves may be physical, or they may just be software; a network port, a disk drive, or a CPU has nothing to do with a physical entity you can point at or unplug. The only effective way to manage this layered reality is through software.

So infrastructure had to become code. All those Perl scripts show that it was already becoming code as early as the late '80s; indeed, Perl was designed as a programming language for automating system administration. It didn't take long for leading-edge sysadmins to realize that handcrafted configurations and non-reproducible incantations were a bad way to run their shops. It's possible that this trend means the end of traditional system administrators, whose jobs are reduced to racking up systems for Amazon or Rackspace. But that's only likely to be the fate of those sysadmins who refuse to grow and adapt as the computing industry evolves. (And I suspect that sysadmins who refuse to adapt swell the ranks of the BOFH fraternity, and most of us would be happy to see them leave.) Good sysadmins have always realized that automation was a significant component of their job and will adapt as automation becomes even more important. The new sysadmin won't power down a machine, replace a failing disk drive, reboot, and restore from backup; he'll write software to detect a misbehaving EC2 instance automatically, destroy the bad instance, spin up a new one, and configure it, all without interrupting service. With automation at this level, the new "ops guy" won't care if he's responsible for a dozen systems or 10,000. And the modern BOFH is, more often than not, an old-school sysadmin who has chosen not to adapt.
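The detect, destroy, and replace cycle described above can be sketched in a few lines. This is a minimal illustration, not a real cloud client: `Fleet`, `Instance`, and `heal` are hypothetical names invented for the example, and the in-memory `Fleet` only stands in for whatever provider API an operations team would actually call.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, List, Tuple


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a cloud provider's instance API."""
    instances: Dict[str, Instance] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self) -> Instance:
        inst = Instance(f"i-{next(self._ids):04d}")
        self.instances[inst.instance_id] = inst
        return inst

    def terminate(self, instance_id: str) -> None:
        del self.instances[instance_id]


def heal(
    fleet: Fleet,
    is_healthy: Callable[[Instance], bool],
    configure: Callable[[Instance], None],
) -> List[Tuple[str, str]]:
    """Replace every unhealthy instance; return (old_id, new_id) pairs."""
    replaced = []
    # Iterate over a snapshot so launching replacements does not
    # change the collection while we walk it.
    for inst in list(fleet.instances.values()):
        if is_healthy(inst):
            continue
        # Bring the replacement up and configure it before removing
        # the bad instance, so capacity never drops.
        new = fleet.launch()
        configure(new)
        fleet.terminate(inst.instance_id)
        replaced.append((inst.instance_id, new.instance_id))
    return replaced
```

Run on a schedule, the same function behaves identically for a dozen instances or 10,000, which is the point of the paragraph above: the operator maintains the loop, not the individual machines.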

James Urquhart nails it when he describes how modern applications, running in the cloud, still need to be resilient and fault tolerant, still need monitoring, still need to adapt to huge swings in load, etc. But he notes that those features, formerly provided by the IT/operations infrastructures, now need to be part of the application, particularly in "platform as a service" environments. Operations doesn't go away, it becomes part of the development. And rather than envision some sort of uber developer, who understands big data, web performance optimization, application middleware, and fault tolerance in a massively distributed environment, we need operations specialists on the development teams. The infrastructure doesn't go away — it moves into the code; and the people responsible for the infrastructure, the system administrators and corporate IT groups, evolve so that they can write the code that maintains the infrastructure. Rather than being isolated, they need to cooperate and collaborate with the developers who create the applications. This is the movement informally known as "DevOps."

Amazon's EBS outage last year demonstrates how the nature of "operations" has changed. There was a marked distinction between companies that suffered and lost money, and companies that rode through the outage just fine. What was the difference? The companies that didn't suffer, including Netflix, knew how to design for reliability; they understood resilience, spreading data across zones, and a whole lot of reliability engineering. Furthermore, they understood that resilience was a property of the application, and they worked with the development teams to ensure that the applications could survive when parts of the network went down. More important than the flames about Amazon's services are the testimonials of how intelligent and careful design kept applications running while EBS was down. Netflix's ChaosMonkey is an excellent, if extreme, example of a tool to ensure that a complex distributed application can survive outages; ChaosMonkey randomly kills instances and services within the application. The development and operations teams collaborate to ensure that the application is sufficiently robust to withstand constant random (and self-inflicted!) outages without degrading.

On the other hand, during the EBS outage, nobody who wasn't an Amazon employee touched a single piece of hardware. At the time, JD Long tweeted that the best thing about the EBS outage was that his guys weren't running around like crazy trying to fix things. That's how it should be. It's important, though, to notice how this differs from operations practices 20, even 10 years ago. It was all over before the outage even occurred: The sites that dealt with it successfully had written software that was robust, and carefully managed their data so that it wasn't reliant on a single zone. And similarly, the sites that scrambled to recover from the outage were those that hadn't built resilience into their applications and hadn't replicated their data across different zones.

In addition to this redistribution of responsibility, from the lower layers of the stack to the application itself, we're also seeing a redistribution of costs. It's a mistake to think that the cost of operations goes away. Capital expense for new servers may be replaced by monthly bills from Amazon, but it's still cost. There may be fewer traditional IT staff, and there will certainly be a higher ratio of servers to staff, but that's because some IT functions have disappeared into the development groups. The bonding is fluid, but that's precisely the point. The task — providing a solid, stable application for customers — is the same. The locations of the servers on which that application runs, and how they're managed, are all that changes.

One important task of operations is understanding the cost trade-offs between public clouds like Amazon's, private clouds, traditional colocation, and building their own infrastructure. It's hard to beat Amazon if you're a startup trying to conserve cash and need to allocate or deallocate hardware to respond to fluctuations in load. You don't want to own a huge cluster to handle your peak capacity but leave it idle most of the time. But Amazon isn't inexpensive, and a larger company can probably get a better deal taking its infrastructure to a colocation facility. A few of the largest companies will build their own datacenters. Cost versus flexibility is an important trade-off; scaling is inherently slow when you own physical hardware, and when you build your data centers to handle peak loads, your facility is underutilized most of the time. Smaller companies will develop hybrid strategies, with parts of the infrastructure hosted on public clouds like AWS or Rackspace, part running on private hosting services, and part running in-house. Optimizing how tasks are distributed between these facilities isn't simple; that is the province of operations groups. Developing applications that can run effectively in a hybrid environment: that's the responsibility of developers, with healthy cooperation with an operations team.

The use of metrics to monitor system performance is another respect in which system administration has evolved. In the '80s or early '90s, you knew when a machine crashed because you started getting phone calls. Early system monitoring tools like HP's OpenView provided limited visibility into system and network behavior but didn't give much more information than simple heartbeats or reachability tests. Modern tools like DTrace provide insight into almost every aspect of system behavior; one of the biggest challenges facing modern operations groups is developing analytic tools and metrics that can take advantage of the data that's available to predict problems before they become outages. We now have access to the data we need; we just don't know how to use it. And the more we rely on distributed systems, the more important monitoring becomes. As with so much else, monitoring needs to become part of the application itself. Operations is crucial to success, but operations can only succeed to the extent that it collaborates with developers and participates in the development of applications that can monitor and heal themselves.

Success isn't based entirely on integrating operations into development. It's naive to think that even the best development groups, aware of the challenges of high-performance, distributed applications, can write software that won't fail. On this two-way street, do developers wear the beepers, or IT staff? As Allspaw points out, it's important not to divorce developers from the consequences of their work since the fires are frequently set by their code. So, both developers and operations carry the beepers. Sharing responsibilities has another benefit. Rather than finger-pointing post-mortems that try to figure out whether an outage was caused by bad code or operational errors, when operations and development teams work together to solve outages, a post-mortem can focus less on assigning blame than on making systems more resilient in the future. Although we used to practice "root cause analysis" after failures, we're recognizing that finding out the single cause is unhelpful. Almost every outage is the result of a "perfect storm" of normal, everyday mishaps. Instead of figuring out what went wrong and building procedures to ensure that something bad can never happen again (a process that almost always introduces inefficiencies and unanticipated vulnerabilities), modern operations designs systems that are resilient in the face of everyday errors, even when they occur in unpredictable combinations.

In the past decade, we've seen major changes in software development practice. We've moved from various versions of the "waterfall" method, with interminable up-front planning, to "minimum viable product," continuous integration, and continuous deployment. It's important to understand that the waterfall methodology of the '80s wasn't a "bad idea" or a mistake. It was perfectly adapted to an age of shrink-wrapped software. When you produce a "gold disk" and manufacture thousands (or millions) of copies, the penalties for getting something wrong are huge. If there's a bug, you can't fix it until the next release. In this environment, a software release is a huge event. But in this age of web and mobile applications, deployment isn't such a big thing. We can release early, and release often; we've moved from continuous integration to continuous deployment. We've developed techniques for quick resolution in case a new release has serious problems; we've mastered A/B testing to test releases on a small subset of the user base.

All of these changes require cooperation and collaboration between developers and operations staff. Operations groups are adopting, and in many cases, leading in the effort to implement these changes. They're the specialists in resilience, in monitoring, in deploying changes and rolling them back. And the many attendees, hallway discussions, talks, and keynotes at O'Reilly's Velocity conference show us that they are adapting. They're learning about adopting approaches to resilience that are completely new to software engineering; they're learning about monitoring and diagnosing distributed systems, doing large-scale automation, and debugging under pressure. At a recent meeting, Jesse Robbins described scheduling EMT training sessions for operations staff so that they understood how to handle themselves and communicate with each other in an emergency. It's an interesting and provocative idea, and one of many things that modern operations staff bring to the mix when they work with developers.

What does the future hold for operations? System and network monitoring used to be exotic and bleeding-edge; now, it's expected. But we haven't taken it far enough. We're still learning how to monitor systems, how to analyze the data generated by modern monitoring tools, and how to build dashboards that let us see and use the results effectively. I've joked about "using a Hadoop cluster to monitor the Hadoop cluster," but that may not be far from reality. The amount of information we can capture is tremendous, and far beyond what humans can analyze without techniques like machine learning.

Likewise, operations groups are playing a huge role in the deployment of new, more efficient protocols for the web, like SPDY. Operations is involved, more than ever, in tuning the performance of operating systems and servers (even ones that aren't under our physical control); a lot of our "best practices" for TCP tuning were developed in the days of ISDN and 56 Kbps analog modems, and haven't been adapted to the reality of Gigabit Ethernet, OC48 fiber, and their descendants. Operations groups are responsible for figuring out how to use these technologies (and their successors) effectively. We're only beginning to digest IPv6 and the changes it implies for network infrastructure. And, while I've written a lot about building resilience into applications, so far we've only taken baby steps. There's a lot there that we still don't know. Operations groups have been leaders in taking best practices from older disciplines (control systems theory, manufacturing, medicine) and integrating them into software development.

And what about NoOps? Ultimately, it's a bad name, but the name doesn't really matter. A group practicing "NoOps" successfully hasn't banished operations. It's just moved operations elsewhere and called it something else. Whether a poorly chosen name helps or hinders progress remains to be seen, but operations won't go away; it will evolve to meet the challenges of delivering effective, reliable software to customers. Old-style system administrators may indeed be disappearing. But if so, they are being replaced by more sophisticated operations experts who work closely with development teams to get continuous deployment right; to build highly distributed systems that are resilient; and yes, to answer the pagers in the middle of the night when EBS goes down. DevOps.


Photo: Taken at IBM's headquarters in Armonk, NY. By Mike Loukides.



June 05 2012

Velocity Profile: Kate Matsudaira

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Kate Matsudaira
VP Engineering

How did you get into web operations and performance?

I started working as a software engineer, and being at Amazon working on the internals of the retail website it was almost impossible not to have some exposure to pager duty and operations. As my career progressed and I moved into leadership roles on teams working on 24/7 websites, typically spanning hundreds of servers (and now instances), it was necessary to understand operations and performance.

What was your most memorable project?

Memorable can be two things, really good or really bad. Right now I am excited about the work we have been doing to make our website super fast and work well across devices (and all the data mining and machine learning is also really interesting).

As for really bad, though, there was a launch almost a decade ago where we implemented an analytics datastore on top of a relational database instead of something like map/reduce. If only Hadoop and all the other great data technologies were around and prevalent back then!

What's the toughest problem you've had to solve?

Building an index of all the links on the web (a link search engine, basically) in one year with less than $1 million, including the team.

What tools and techniques do you rely on most?

Tools: pick the best one for the job at hand. Techniques: take the time to slow down before making snap judgements.

Who do you follow in the web operations and performance world?

Artur Bergman, Cliff Moon, Ben Black, John Allspaw, Rob Treat, and Theo Schlossnagle.

What is your web operations and performance super power?

Software architecture. You have to design your applications to be operational.



May 30 2012

Which is easier to tune, humans or machines?

In this new Velocity Podcast, I had a conversation with Kate Matsudaira (@katemats), Vice President of Engineering. This conversation centers mostly on the human side of engineering and performance. Kate has some great insights into building an environment for human performance that goes along with your quest for more performant, reliable, scalable, tolerant, secure web properties.

Our conversation lasted 00:20:00, and if you want to pinpoint any particular topic, you can find the specific timing below. Kate provides some of her background and experience as well as what she is currently doing. The full conversation is outlined below.

  • Which is easier to tune for performance, humans or machines? 00:00:30

  • To achieve better performance from people, how do you teach people to trade-off the variables time, cost, quality and scope? 00:02:32

  • What do you look for when you hire engineers that will work on highly performant web properties? 00:05:06

  • In this talent-surplus economy, do you find it more difficult to hire engineers? 00:07:10

  • How do you demonstrate DevOps and Performance engineering value to an organization? 00:08:36

  • How does one go about monitoring everything and not slow down your web properties with monitoring everything? 00:12:56

  • Does continuous improvement help deliver performant properties? 00:15:14

If you would like to hear Kate speak on "Leveling up - Taking your operations and engineering role to the next level," she is presenting at the 2012 Velocity Conference in Santa Clara, Calif. on Wednesday 6/27/12 at 1:00 pm. We hope to see you there.



April 13 2012

Top Stories: April 9-13, 2012

Here's a look at the top stories published across O'Reilly sites this week.

Carsharing saves U.S. city governments millions in operating costs
Carsharing initiatives in a number of U.S. cities are part of a broader trend that suggests the ways we work, play and learn are changing.

Complexity fails: A lesson from storage simplification
Simple systems scale effectively, while complex systems struggle to overcome the multiplicative effect of potential failure points. This shows us why the most reliable and scalable clouds are those made up of fewer, simpler parts.

Operations, machine learning and premature babies
Machine learning and access to huge amounts of data allowed IBM to make an important discovery about premature infants. If web operations teams could capture everything — network data, environmental data, I/O subsystem data, etc. — what would they find out?

State of the Computer Book Market 2011
In his annual report, Mike Hendrickson analyzes tech book sales and industry data: Part 1, Overall Market; Part 2, The Categories; Part 3, The Publishers; Part 4, The Languages; Part 5, Wrap-Up and Digital.

Never, ever "out of print"
In a recent interview, attorney Dana Newman tackled issues surrounding publishing rights in the digital landscape. She said changes in the current model are needed to keep things equitable for both publishers and authors.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference, May 29 - 31 in San Francisco. Save 20% on registration with the code RADAR20.

Photo of servers: Google Production Server, v1 by Pargon, on Flickr

April 09 2012

Operations, machine learning and premature babies

Julie Steele and I recently had lunch with Etsy's John Allspaw and Kellan Elliott-McCrea. I'm not sure how we got there, but we made a connection that was (to me) astonishing between web operations and medical care for premature infants.

I've written several times about IBM's work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That's amazing in itself, but what's more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you'd intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM's Vice President of Big Data, the telltale signal wasn't spikes or irregularities, but the opposite. There's a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn't exhibit the variation. Their heart rate was too normal; it didn't change throughout the day as much as it should.

That observation strikes me as revolutionary. It's easy to detect problems when something goes out of spec: If you have a fever, you know you're sick. But how do you detect problems that don't set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?

In our conversation, we started wondering how this applied to web operations. We have gigabytes of data streaming off of our servers, but the state of system and network monitoring hasn't changed in years. We look for parameters that are out of spec, thresholds that are crossed. And that's good for a lot of problems: You need to know if the number of packets coming into an interface suddenly goes to zero. But what if the symptom we should look for is radically different? What if crossing a threshold isn't what indicates trouble, but the disappearance (or diminution) of some regular pattern? Is it possible that our computing infrastructure also exhibits symptoms that are too subtle for a human to notice but would easily be detectable via machine learning?
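To make the contrast with threshold alarms concrete, here is a minimal sketch of an alarm that fires when a metric becomes too steady rather than too extreme. It is only an illustration of the idea, not IBM's method or any monitoring product's API; the function name `flatline_alerts` and the `window` and `min_stdev` parameters are assumptions made up for the example.

```python
from collections import deque
from statistics import pstdev


def flatline_alerts(samples, window=12, min_stdev=1.0):
    """Yield the index of each sample at which the spread of the most
    recent `window` samples has dropped below `min_stdev`.

    A conventional alarm asks "is the value too high or too low?".
    This one asks "has the normal variation disappeared?".
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(samples):
        recent.append(value)
        # Only judge full windows; a short window has too little data.
        if len(recent) == window and pstdev(recent) < min_stdev:
            yield i
```

For example, a request-rate series that normally swings between 10 and 20 and then sits at exactly 15 never crosses an upper or lower threshold, yet `list(flatline_alerts(series, window=4))` starts reporting indices once the last four samples are all identical. Choosing `window` and `min_stdev` still requires knowing what "normal variation" looks like for that metric.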

We talked a bit about whether it was possible to alarm on the first (and second) derivatives of some key parameters, and of course it is. Doing so would require more sophistication than our current monitoring systems have, but it's not too hard to imagine. But it also misses the point. Once you know what to look for, it's relatively easy to figure out how to detect it. IBM's insight wasn't detecting the patterns that indicated a baby was about to become sick, but using machine learning to figure out what the patterns were. Can we do the same? It's not inconceivable, though it wouldn't be easy.
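Alarming on derivatives, as mentioned above, can be sketched with simple finite differences. This is a rough illustration under stated assumptions: samples arrive at a fixed `interval`, and the names `derivative_alarms`, `max_rate`, and `max_accel` are hypothetical, not part of any existing monitoring tool.

```python
def derivative_alarms(samples, interval=1.0, max_rate=None, max_accel=None):
    """Return (index, kind, value) tuples where the first difference
    ("rate") or second difference ("accel") of `samples` exceeds the
    given limit in absolute value. Samples are assumed evenly spaced
    `interval` time units apart.
    """
    alarms = []
    for i in range(1, len(samples)):
        # First derivative: change per unit time since the last sample.
        rate = (samples[i] - samples[i - 1]) / interval
        if max_rate is not None and abs(rate) > max_rate:
            alarms.append((i, "rate", rate))
        if i >= 2 and max_accel is not None:
            # Second derivative: how fast the rate itself is changing.
            prev_rate = (samples[i - 1] - samples[i - 2]) / interval
            accel = (rate - prev_rate) / interval
            if abs(accel) > max_accel:
                alarms.append((i, "accel", accel))
    return alarms
```

As the paragraph notes, this only helps once you already know which parameter and which limits matter; it detects a known pattern rather than discovering an unknown one.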

Web operations has been on the forefront of "big data" since the beginning. Long before we were talking about sentiment analysis or recommendation engines, webmasters and system administrators were analyzing problems by looking through gigabytes of server and system logs, using tools that were primitive or non-existent. MRTG and HP's OpenView were early attempts to put together information dashboards for IT groups. But at most enterprises, operations hasn't taken the next step. Operations staff doesn't have the resources (neither computational nor human) to apply machine intelligence to our problems. We'd have to capture all the data coming off our servers for extended periods, not just the server logs that we capture now, but every kind of data we can collect: network data, environmental data, I/O subsystem data, you name it. At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20


January 04 2012

The feedback economy

Military strategist John Boyd spent a lot of time understanding how to win battles. Building on his experience as a fighter pilot, he broke down the process of observing and reacting into something called an Observe, Orient, Decide, and Act (OODA) loop. Combat, he realized, consisted of observing your circumstances, orienting yourself to your enemy's way of thinking and your environment, deciding on a course of action, and then acting on it.

Figure: The Observe, Orient, Decide, and Act (OODA) loop.

The most important part of this loop isn't included in the OODA acronym, however. It's the fact that it's a loop. The results of earlier actions feed back into later, hopefully wiser, ones. Over time, the fighter "gets inside" their opponent's loop, outsmarting and outmaneuvering them. The system learns.

Boyd's genius was to realize that winning requires two things: being able to collect and analyze information better, and being able to act on that information faster, incorporating what's learned into the next iteration. Today, what Boyd learned in a cockpit applies to nearly everything we do.

Data-obese, digital-fast

In our always-on lives we're flooded with cheap, abundant information. We need to capture and analyze it well, separating digital wheat from digital chaff, identifying meaningful undercurrents while ignoring meaningless social flotsam. Clay Johnson argues that we need to go on an information diet, and makes a good case for conscious consumption. In an era of information obesity, we need to eat better. There's a reason they call it a feed, after all.

It's not just an overabundance of data that makes Boyd's insights vital. In the last 20 years, much of human interaction has shifted from atoms to bits. When interactions become digital, they become instantaneous, interactive, and easily copied. It's as easy to tell the world as to tell a friend, and a day's shopping is reduced to a few clicks.

The move from atoms to bits reduces the coefficient of friction of entire industries to zero. Teenagers shun e-mail as too slow, opting for instant messages. The digitization of our world means that trips around the OODA loop happen faster than ever, and continue to accelerate.

We're drowning in data. Bits are faster than atoms. Our jungle-surplus wetware can't keep up. At least, not without Boyd's help. In a society where every person, tethered to their smartphone, is both a sensor and an end node, we need better ways to observe and orient, whether we're at home or at work, solving the world's problems or planning a play date. And we need to be constantly deciding, acting, and experimenting, feeding what we learn back into future behavior.

We're entering a feedback economy.

The big data supply chain

Consider how a company collects, analyzes, and acts on data.

Figure: The big data supply chain.

Let's look at these components in order.

Data collection

The first step in a data supply chain is to get the data in the first place.

Information comes in from a variety of sources, both public and private. We're a promiscuous society online, and with the advent of low-cost data marketplaces, it's possible to get nearly any nugget of data relatively affordably. From social network sentiment, to weather reports, to economic indicators, public information is grist for the big data mill. Alongside this, we have organization-specific data such as retail traffic, call center volumes, product recalls, or customer loyalty indicators.

The legality of collection is perhaps more restrictive than getting the data in the first place. Some data is heavily regulated: HIPAA governs healthcare, while PCI restricts financial transactions. In other cases, the act of combining data may be illegal because it generates personally identifiable information (PII). For example, courts have ruled differently on whether IP addresses are PII, and the California Supreme Court ruled that zip codes are. Navigating these regulations imposes some serious constraints on what can be collected and how it can be combined.

The era of ubiquitous computing means that everyone is a potential source of data, too. A modern smartphone can sense light, sound, motion, location, nearby networks and devices, and more, making it a perfect data collector. As consumers opt into loyalty programs and install applications, they become sensors that can feed the data supply chain.

In big data, the collection is often challenging because of the sheer volume of information, or the speed with which it arrives, both of which demand new approaches and architectures.

Ingesting and cleaning

Once the data is collected, it must be ingested. In traditional business intelligence (BI) parlance, this is known as Extract, Transform, and Load (ETL): the act of putting the right information into the correct tables of a database schema and manipulating certain fields to make them easier to work with.

One of the distinguishing characteristics of big data, however, is that the data is often unstructured. That means we don't know the inherent schema of the information before we start to analyze it. We may still transform the information — replacing an IP address with the name of a city, for example, or anonymizing certain fields with a one-way hash function — but we may hold onto the original data and only define its structure as we analyze it.
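A minimal sketch of such a transformation follows, assuming a tiny in-memory lookup table in place of a real geolocation database and a keyed SHA-256 hash as the one-way function. The key, the table, and the field names are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, kept out of the dataset
CITY_BY_IP = {"203.0.113.7": "Berlin"}      # hypothetical stand-in for a GeoIP database

def pseudonymize(value, key=SECRET_KEY):
    """One-way keyed hash: same input gives same token, but it cannot be reversed."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def transform(record):
    """Return a cleaned copy of the record; the original is left untouched."""
    cleaned = dict(record)
    cleaned["city"] = CITY_BY_IP.get(record["ip"], "unknown")
    cleaned["user"] = pseudonymize(record["user"])
    del cleaned["ip"]
    return cleaned

raw = {"ip": "203.0.113.7", "user": "alice@example.com", "bytes": 5120}
clean = transform(raw)
print(clean["city"], clean["user"][:12])
```

Because `transform` works on a copy, the raw record can still be kept and given more structure later.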


The information we've ingested needs to be analyzed by people and machines. That means hardware, in the form of computing, storage, and networks. Big data doesn't change this, but it does change how it's used. Virtualization, for example, allows operators to spin up many machines temporarily, then destroy them once the processing is over.

Cloud computing is also a boon to big data. Paying by consumption destroys the barriers to entry that would prohibit many organizations from playing with large datasets, because there's no up-front investment. In many ways, big data gives clouds something to do.


Where big data is new is in the platforms and frameworks we create to crunch large amounts of information quickly. One way to speed up data analysis is to break the data into chunks that can be analyzed in parallel. Another is to build a pipeline of processing steps, each optimized for a particular task.
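The chunk-and-combine idea can be sketched in a few lines. This example uses a thread pool purely for illustration, and the log format and function names are hypothetical; a production system would more likely spread the chunks across processes or a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def count_statuses(lines):
    """Analyze one chunk: count the status code at the end of each line."""
    return Counter(line.split()[-1] for line in lines)

def parallel_status_counts(lines, chunk_size=2, workers=4):
    """Analyze the chunks concurrently, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_statuses, chunks(lines, chunk_size)):
            total.update(partial)
    return total

# Hypothetical access-log lines: method, path, status.
log = ["GET / 200", "GET /a 404", "GET /b 200", "POST /c 500", "GET /d 200"]
print(parallel_status_counts(log))
```

Each chunk is analyzed independently, so adding workers (or machines) shortens the time to an answer without changing the result.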

Big data is often about fast results, rather than simply crunching a large amount of information. That's important for two reasons:

  1. Much of the big data work going on today is related to user interfaces and the web. Suggesting what books someone will enjoy, or delivering search results, or finding the best flight, requires an answer in the time it takes a page to load. The only way to accomplish this is to spread out the task, which is one of the reasons why Google has nearly a million servers.
  2. We analyze unstructured data iteratively. As we first explore a dataset, we don't know which dimensions matter. What if we segment by age? Filter by country? Sort by purchase price? Split the results by gender? This kind of "what if" analysis is exploratory in nature, and analysts are only as productive as their ability to explore freely. Big data may be big. But if it's not fast, it's unintelligible.
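The "what if" questions in the second point amount to repeatedly regrouping and filtering the same rows. A small sketch, with hypothetical order records and helper names:

```python
from collections import defaultdict

# Hypothetical purchase records.
orders = [
    {"age": 24, "country": "DE", "price": 30.0},
    {"age": 37, "country": "US", "price": 80.0},
    {"age": 41, "country": "DE", "price": 50.0},
    {"age": 29, "country": "US", "price": 20.0},
]

def segment(rows, key, where=lambda row: True):
    """Total the price per segment, optionally filtering rows first."""
    totals = defaultdict(float)
    for row in rows:
        if where(row):
            totals[key(row)] += row["price"]
    return dict(totals)

def age_band(row):
    return "under 30" if row["age"] < 30 else "30+"

# What if we segment by age?
print(segment(orders, key=age_band))
# What if we also filter by country?
print(segment(orders, key=age_band, where=lambda r: r["country"] == "DE"))
```

Each new question is a one-line change, which is only useful if the answer comes back quickly enough to prompt the next question.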

Much of the hype around big data companies today is a result of the retooling of enterprise BI. For decades, companies have relied on structured relational databases and data warehouses — many of them can't handle the exploration, lack of structure, speed, and massive sizes of big data applications.

Machine learning

One way to think about big data is that it's "more data than you can go through by hand." For much of the data we want to analyze today, we need a machine's help.

Part of that help happens at ingestion. For example, natural language processing tries to read unstructured text and deduce what it means: Was this Twitter user happy or sad? Is this call center recording good, or was the customer angry?

Machine learning is important elsewhere in the data supply chain. When we analyze information, we're trying to find signal within the noise, to discern patterns. Humans can't find signal well by themselves. Just as astronomers use algorithms to scan the night's sky for signals, then verify any promising anomalies themselves, so too can data analysts use machines to find interesting dimensions, groupings, or patterns within the data. Machines can work at a lower signal-to-noise ratio than people.

Human exploration

While machine learning is an important tool to the data analyst, there's no substitute for human eyes and ears. Displaying the data in human-readable form is hard work, stretching the limits of multi-dimensional visualization. While most analysts work with spreadsheets or simple query languages today, that's changing.

Creve Maples, an early advocate of better computer interaction, designs systems that take dozens of independent data sources and display them in navigable 3D environments, complete with sound and other cues. Maples' studies show that when we feed an analyst data in this way, they can often find answers in minutes instead of months.

This kind of interactivity requires the speed and parallelism explained above, as well as new interfaces and multi-sensory environments that allow an analyst to work alongside the machine, immersed in the data.


Big data takes a lot of storage. In addition to the actual information in its raw form, there's the transformed information; the virtual machines used to crunch it; the schemas and tables resulting from analysis; and the many formats that legacy tools require so they can work alongside new technology. Often, storage is a combination of cloud and on-premise storage, using traditional flat-file and relational databases alongside more recent, post-SQL storage systems.

During and after analysis, the big data supply chain needs a warehouse. Comparing year-on-year progress or changes over time means we have to keep copies of everything, along with the algorithms and queries with which we analyzed it.

Sharing and acting

All of this analysis isn't much good if we can't act on it. As with collection, this isn't simply a technical matter — it involves legislation, organizational politics, and a willingness to experiment. The data might be shared openly with the world, or closely guarded.

The best companies tie big data results into everything from hiring and firing decisions, to strategic planning, to market positioning. While it's easy to buy into big data technology, it's far harder to shift an organization's culture. In many ways, big data adoption isn't a hardware retirement issue, it's an employee retirement one.

We've seen similar resistance to change each time there's a big change in information technology. Mainframes, client-server computing, packet-based networks, and the web all had their detractors. A NASA study into the failure of Ada concluded that proponents had over-promised, and there was a lack of a supporting ecosystem to help the new language flourish. Big data, and its close cousin, cloud computing, are likely to encounter similar obstacles.

A big data mindset is one of experimentation, of taking measured risks and assessing their impact quickly. It's similar to the Lean Startup movement, which advocates fast, iterative learning and tight links to customers. But while a small startup can be lean because it's nascent and close to its market, a big organization needs big data and an OODA loop to react well and iterate fast.

The big data supply chain is the organizational OODA loop. It's the big business answer to the lean startup.

Measuring and collecting feedback

Just as John Boyd's OODA loop is mostly about the loop, so big data is mostly about feedback. Simply analyzing information isn't particularly useful. To work, the organization has to choose a course of action from the results, then observe what happens and use that information to collect new data or analyze things in a different way. It's a process of continuous optimization that affects every facet of a business.

Replacing everything with data

Software is eating the world. Verticals like publishing, music, real estate and banking once had strong barriers to entry. Now they've been entirely disrupted by the elimination of middlemen. The last film projector rolled off the line in 2011: movies are now digital from camera to projector. The Post Office stumbles because nobody writes letters, even as Federal Express becomes the planet's supply chain.

Companies that get themselves on a feedback footing will dominate their industries, building better things faster for less money. Those that don't are already the walking dead, and will soon be little more than case studies and colorful anecdotes. Big data, new interfaces, and ubiquitous computing are tectonic shifts in the way we live and work.

A feedback economy

Big data, continuous optimization, and replacing everything with data pave the way for something far larger, and far more important, than simple business efficiency. They usher in a new era for humanity, with all its warts and glory. They herald the arrival of the feedback economy.

The efficiencies and optimizations that come from constant, iterative feedback will soon become the norm for businesses and governments. We're moving beyond an information economy. Information on its own isn't an advantage, anyway. Instead, this is the era of the feedback economy, and Boyd is, in many ways, the first feedback economist.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


October 27 2011

What's on the agenda for Velocity Europe

Velocity Europe is less than two weeks away. It's happening November 8-9 in Berlin at the Hotel Maritim ProArte. I've heard good things about the venue and am excited to get there and check it out.

This event has been a long time coming. A handful of web performance and operations savants (including members of the Program Committee) have been encouraging us for years to bring Velocity to Europe, and now it's actually happening. And (drum roll please) the price is only EUR 600 (excl. VAT) if you use the 20% discount code veu11sts.

The Velocity Europe speaker line-up is exceptional. Some highlights include:

  • Jon Jenkins, Director of Software Development for Amazon Silk, is talking about the team's approach to the challenges of mobile browsing. I'm looking forward to more details about Silk's split architecture.
  • Tim Morrow delivers the background for Betfair's promise to deliver a fast experience to their customers, and their progress on that promise.
  • Theo Schlossnagle is a recognized leader at Velocity. He's giving two talks on web operations careers and monitoring.
  • Estelle Weyl joins Velocity for the first time talking about the nuances of mobile rendering performance. I learn something new every time I hear Estelle speak, so I'm excited to welcome her to Velocity.
  • Ivo Teel discusses the balance we all face between features and performance and how they're handling that at Spil Games.
  • Jeff Veen knows the importance of third-party performance and availability as the CEO of Typekit. Jeff is an amazing, engaging speaker. Reading his session description gave me goosebumps with anticipation: "Jeff sat on a couch in the Typekit offices, staring out the window, and wondering if everything their company had been working towards was about to slip through their fingers …"

There's much much more – lightning demos, browser vendor talks, John Allspaw on anticipating failure, David Mandelin on JavaScript performance – I've got to stop here but please check out the entire schedule.

I want to give a shout out to the Velocity Europe Program Committee: Patrick Debois, Aaron Peters, Schlomo Schapiro, Jeroen Tjepkema, and Sean Treadway. They've participated in numerous video concalls (yay Google Hangouts!) to review proposals, build the program, and shape Velocity to be a European conference. And they might have one more card up their sleeve – more on that later.

If you're heading to Berlin you should also check out CouchConf Berlin on Nov 7. NoSQL has great performance benefits and Couchbase is a good choice for many mobile apps. Use couchconf_discount for 10% off registration.

The last time I was in Berlin was in 2009. The city had a high-tech vibe and the crowd was extremely knowledgeable and enthusiastic. I'm excited to get back to Berlin for Velocity Europe and do the web performance and operations deep dives that are the core of Velocity. If you want to have a website that's always fast and always up, Velocity Europe is the place to be. I hope to see you there.

Velocity Europe, being held Nov. 8-9 in Berlin, will bring together the web operations and performance communities for two days of critical training, best practices, and case studies.

Save 20% on registration with the code veu11sts


June 02 2011

Velocity 2011

This year is our fourth Velocity conference on web performance and operations. What began with a meeting between Steve Souders, Jesse Robbins, Tim O'Reilly, and others at OSCON 2007, has become a thriving community. We're expecting this year to sell out again, even with significantly more space than we had last year. It will be the largest Velocity yet.

According to Tim O'Reilly, the motivation behind the 2007 meeting was a call to "gather the tribe" of people who cared about web performance and create a new conference. The audience for this conference, in 2007 as it is today, is made up of the people who keep the web running, the people behind the biggest websites in the world. The participants in that first meeting reflected a change that was underway in the relationship between web development and web operations. Even saying that there was a single "tribe" to be gathered was important. Participants in the 2007 meeting had already realized that there was a single web performance community, uniting both developers and operations staff. But in many organizations, web performance and web operations were disciplines that were poorly defined, poorly documented, and insufficiently recognized. In some organizations, then and now, people involved with web operations and web development were hardly even aware of each other.

The participants in that 2007 meeting came from organizations that were in the process of finding common ground between developers and operations, of making the cultural changes that allowed them to work together productively, and wanted to share those insights with the rest of the Web community. Both developers and operations staff are trying to solve the same problems. Customers aren't happy if the site isn't up; customers aren't happy if the site is slow and unresponsive; new features don't do anyone any good if they can't be deployed in a production environment. Developers and operations staff have no trouble agreeing on these unifying principles. Once they do, they quickly discover that they speak the same language: they both know the code intimately, understand performance issues, and understand tweaking servers and hand-tuning code to optimize performance. And when we held the first Velocity conference in 2008, it was indeed a "meeting of the tribe": two groups that found they were really allies.

Velocity 2011, being held June 14-16 in Santa Clara, Calif., offers the skills and tools you need to master web performance and operations.

Save 20% on registration with the code VEL11RAD

Agility, infrastructure, and code

Velocity became the ground for discussing and testing a number of important new ideas. Perhaps one of the most important was the idea of agile operations. Agile development methodologies had taken the software world by storm: instead of long, waterfall-driven release cycles, software developers started by building a minimal product, then iterating quickly to add features, fix bugs, and refactor the design. Continuous integration soon became part of the agile world, with frequent builds and testing of the entire software package. This practice couldn't help but affect operations, and (with a certain amount of trepidation), forward-thinking companies like Flickr started deploying many times a day. Each deployment represented a small change: part of a new feature, a bug fix, whatever. This was revolutionary. Frequent deployment meant that bugs surfaced before the developers had moved on to their next project, and they were still available to fix problems.

At the same time, tools for managing large networks were improving. They had to improve; we were long past the stage where networks of computers could be set up and managed by hand, on a one-at-a-time basis. Better tools were particularly important in a server environment, where software installation and configuration were increasingly complex, and web companies had long since moved from individual servers to server farms. Cfengine, the first tool for automating software installation and configuration, started a revolution in the mid-'90s, which is carried on by Puppet and Chef. Along with better tools came a change in the nature of the job. Rather than mumbling individual incantations, combined with some ad-hoc scripting, system administration became software development, and infrastructure became code. If there was any doubt about this shift, Amazon Web Services put an end to it. You can't administer servers from the console when you don't even know where the servers are, and you're starting them and shutting them down by the dozens, if not thousands.

Optimizing the client side

One of the discoveries that led to Velocity comes from Steve Souders' "High Performance Web Sites." We've long known that performance was important, but for most of the history of the Web, we thought the answer to performance problems lay in the infrastructure: make the servers faster, get more out of the databases, etc. These are certainly important, but Steve showed convincingly that the biggest contribution to slow web pages is stuff that happens after the browser gets the response. Optimizing what you send to the browser is key, and tuning the servers secondary for creating a faster user experience. Hence a hallmark of Velocity is extended discussions of client-side performance optimization: compression, breaking JavaScript up into small, digestible chunks that can be loaded as required, optimizing the use of CSS, and so on. Another hallmark of Velocity is the presence of lead developers from all the major browser vendors, ready to talk about standards and making it easier for developers to optimize their code so that it works across all browsers.

One talk in particular crystallized just how important performance is: In their 2009 presentation, "Performance Related Changes and their User Impact," Eric Schurman of Microsoft's Bing and Jake Brutlag of Google showed that imperceptibly small increases in response time cause users to move away from your site and to another site. If response time is more than a second, you're losing a significant portion of your traffic. Here was proof that even milliseconds counted and users clearly respond to degradation that they can't detect.

But perhaps the companies the speakers represented were even more important than their results: developers from Microsoft and Google were talking, together, about the importance of performance to the future of the Web. As important as the results were, getting competitors like Microsoft and Google on the same stage to talk about web performance was a tremendous validation of Velocity's core premise that performance is central to taking the web into the next decade.

Mobile, HTML5, Node, and what lies ahead

We're now in the next decade, and mobile has become part of the discussion in ways we could only begin to anticipate a few years ago. Web performance for mobile devices is not the same as desktop web performance; and it's becoming ever more important. Last year, we heard that users expect mobile websites to be as responsive as desktop sites. But are the same techniques effective for optimizing mobile performance? We're in the process of finding out. It looks like client-side optimization is even more important in the mobile world than for the desktop/laptop web.

With those broader themes in mind, what's Velocity about this year? We have plenty of what you've come to expect: lots of material on the culture and integration of development and operations teams, plenty of sessions on measuring performance, plenty of ways to optimize your HTML, CSS, and JavaScript. There's a new track specifically on mobile performance, and a new track specifically for products and services, where vendors can showcase their offerings.

Here are some of the topics that we'll explore at Velocity 2011:

  • HTML5 is a lot more than a bunch of new tags; it's a major change in how we write and deliver web applications. It represents a significant change in the balance of power between the client-side and the server-side, and promises to have a huge impact on web optimization.

  • Node.js is a new high performance server platform; you couldn't go very far at Velocity 2010 without hearing someone talk about it in the hall. Its event-driven architecture is particularly suited for high performance, low-latency web sites. Sites based wholly or partially on Node are showing up everywhere, and forcing us to rethink the design of web applications.

  • Since mobile is increasing in importance, we've given it a whole conference track, covering mobile performance measurement and optimization, realtime analytics, concrete performance tips, and more.

  • In the past, we've frequently talked about building systems that are robust in the face of various kinds of damage; we've got more lined up on resilience engineering and reliability.

  • We're finally out of IPv4 address space, and the move to IPv6 has definite implications for operations and performance optimization. While we only have one IPv6 talk in this year's program, we can expect to see more in the future.

This year, we're expecting our largest crowd ever. It's going to be an exciting show, with people like:

  • Nicole Sullivan, looking at what's really important in HTML5 and CSS3
  • Steve Souders, introducing this year's crop of performance tools
  • Sarah Novotny, discussing best strategies for effective web caching
  • John Allspaw, on conducting post-mortems effectively

And much more.

Finally, Velocity doesn't end on June 16. We're planning Velocity conferences to take place in Europe and China later in 2011 — details are coming soon and we hope to see you there. And if you can't make it to either of those locations, we'll see you again in June, 2012.


August 02 2010

Operations: The secret sauce revisited

Guest blogger Andrew Clay Shafer is helping telcos and hosting providers implement cloud services at Cloudscaling. He co-founded Reductive Labs, creators of Puppet, the configuration management framework. Andrew preaches the "infrastructure is code" gospel, and he supports approaches for applying agile methods to infrastructure and operations. Some of those perspectives were captured in his chapter in the O'Reilly book "Web Operations."

"Technical debt" is used two ways in the analysis of software systems. The phrase was first introduced in 1992 by Ward Cunningham to describe the premise that increased speed of delivery provides other advantages, and that the debt leveraged to gain those advantages should be strategically paid back.

Somewhere along the way, technical debt also became synonymous with poor implementation; reflecting the difference between the current state of a code base and an idealized one. I have used the term both ways, and I think they both have merit.

Technical debt can be extended and discussed along several additional axes: process debt, personnel debt, experience debt, user experience debt, security debt, documentation debt, etc. For this discussion, I won't quibble about the nuances of categorization. Instead, I want to take a high-level look at operations and infrastructure choices people make and the impact of those choices.

The technical debt metaphor

Debts are created by some combination of choice and circumstance. Modern economies are predicated on the flow of debt as much as anything else, but not all debt is created equal. There is a qualitative difference between a mortgage and carrying significant debt on maxed-out credit cards. The point is that there are a variety of ways to incur debt, and debts of different quality have different consequences.

Jesse Robbins' Radar post about operations as the secret sauce talked about bootstrapping web startups in 80 hours. It included an infographic showing the time cost of traditional versus special sauce operations.

I contend that the ongoing difference in time cost between the two solutions is the interest being paid on technical debt.

Understanding is really the crux of the matter. No one who really understands compound interest would intentionally make frivolous purchases on a credit card and not make every effort to pay down high interest debt. Just as no one who really understands web operations would create infrastructure with an exponentially increasing cost of maintenance. Yet, people do both of these things.
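To make the compound interest analogy concrete, here is a small worked sketch. The numbers are purely hypothetical: it assumes a team that starts with 10 hours of maintenance per month and compares a cost that stays flat with one that grows 10% every month.

```python
def compound(amount, rate, periods):
    """Standard compound growth: amount * (1 + rate) ** periods."""
    return amount * (1 + rate) ** periods

def maintenance_hours(initial_hours, growth_rate, months):
    """Monthly maintenance cost for month 0 through `months`."""
    return [compound(initial_hours, growth_rate, m) for m in range(months + 1)]

# Hypothetical teams: one pays no "interest", the other pays 10% per month.
flat = maintenance_hours(10, 0.0, 12)
growing = maintenance_hours(10, 0.10, 12)

print(round(flat[-1], 2))     # still 10.0 hours a month after a year
print(round(growing[-1], 2))  # about 31.38 hours a month after a year
```

Under these assumptions the second team spends roughly three times as much on upkeep after one year, and the gap keeps widening.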

As the graph is projected out, the ongoing cost of maintenance in both projects reflects the maxim of "the rich get richer." One project can focus on adding value and differentiating itself in the market while the other will eventually be crushed under the weight of its own maintenance.

Technical debt and the Big Ball of Mud

Without a counterbalancing investment, system and software architectures succumb to entropy and become more difficult to understand. The classic "Big Ball of Mud" by Brian Foote and Joseph Yoder catalogs forces that contribute to the creation of haphazard and undifferentiated software architectures. They are:

  • Time
  • Cost
  • Experience
  • Skill
  • Visibility
  • Complexity
  • Change
  • Scale

These same forces apply just as much to infrastructure and operations, especially if you understand the "infrastructure is code" mantra. If you look at the original "Tale of Two Ops Teams" graphic, both teams spent almost the same amount of time before the launch. If we assume that these are representative, then the difference between the two approaches is essentially experience and skill, which is likely to be highly correlated with cost. As the project moves forward, the difference in experience and skill reflects itself in how the teams spend time, provide visibility and handle complexity, change and scale.

Using this list, and the assumption that balls of mud are synonymous with high technical debt, combating technical debt becomes an exercise in minimizing the impact of these forces.

  • Time and cost are what they are, and often have an inverse relationship. From a management perspective, I would like everything now and for free, so everything else is a compromise. Undue time pressure will always result in something else being compromised. That compromise will often start charging interest immediately.
  • Experience is invaluable, but sometimes hard to measure and overvalued in technology. Doing the same thing over and over with a technology is not 10 years of experience, it is the first year of experience 10 times. Intangible experience should not be measured in time, and experience in this sense is related to skill.
  • Visibility has two facets in ops work: Visibility into the design and responsibilities of the systems, and real-time metrics and alerting on the state of the system. The first allows us to take action, the second informs us that we should.
  • Complex problems can require complex solutions. Scale and change add complexity. Complexity obscures visibility and understanding.

Each of these forces and specific examples of how they impact infrastructure would fill a book, but hopefully that is enough to get people thinking and frame a discussion.

There is a force that may be missing from the "Big Ball of Mud": tools (which might be an oversight, might be an attempt to remain tool-agnostic, or might be considered a cross-cutting aspect of cost, experience and skill). That's not to say that tools don't add some complexity and the potential for technical debt as well. But done well, tools provide ongoing insight into how and why systems are configured the way they are, illumination of the complexity and connections of the systems, and a mechanism to rapidly implement changes. That is just an example. Every tool choice, from the operating system, to the web server, to the database, to the monitoring and more, has an impact on the complexity, visibility and flexibility of the systems, and therefore impacts operations effectiveness.

Many parallels can be drawn between operations and fire departments. One big difference is most fire departments don't spend much time actually putting out fires. If operations is reacting all the time, that indicates considerable technical debt. Furthermore, in reactive environments, the probability is high that the solutions of today are contributing to the technical debt and the fires of tomorrow.

Focus must be directed toward getting the fires under control in a way that doesn't contribute to future fires. The coarse metric of time spent reactively responding to incidents versus the time spent proactively completing ops-related projects is a great starting point for understanding the situation. One way to ensure operations is always a cost center is to keep treating it like one. When the flow of technical debt is understood and well managed, operations is certainly a competitive advantage.
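The coarse reactive-versus-proactive metric could be tracked with something as simple as the following sketch. The `WorkLog` record and `reactive_fraction` function are hypothetical names, and the sketch assumes each block of ops time has already been labeled as either incident response or planned project work.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    hours: float
    reactive: bool  # True for incident response, False for planned project work


def reactive_fraction(entries):
    """Return the share of logged hours spent reacting to incidents (0.0 to 1.0)."""
    total = sum(entry.hours for entry in entries)
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    reactive = sum(entry.hours for entry in entries if entry.reactive)
    return reactive / total


# Hypothetical week: 30 hours of firefighting, 10 hours of project work.
week = [WorkLog(hours=30, reactive=True), WorkLog(hours=10, reactive=False)]

if __name__ == "__main__":
    print(f"Reactive share of ops time: {reactive_fraction(week):.0%}")
```

Watching how this number trends from week to week is likely to be more informative than any single reading.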


June 29 2010

Creating Cultural Change

At Velocity 2010, John Rauser presented four funny & powerful examples of cultural change, from a campaign at his office to get people to refill the coffee pot after taking the last cup, to an award-winning advertising campaign. This talk explains how to "sneak past people's mental filters" and make things happen.

June 21 2010

On the performance of clouds

Public clouds are based on the economics of sharing. Cloud providers can charge less, and sell computing on an hourly basis without long-term contracts, because they're spreading costs and skills across many customers.

But a shared model means that your application is competing with other users' applications for scarce resources. The pact you're making with a public cloud, for better or worse, is that the advantages of elasticity and pay-as-you-go economics outweigh any problems you'll face.

Enterprises are skeptical because clouds force them to relinquish control over the underlying networks and architectures on which their applications run. Is performance acceptable? Will clouds be reliable? What's the tradeoff, particularly now that we know speed matters so much?

We (Bitcurrent) decided to find out. With the help of Webmetrics, we built four test applications: a small object, a large object, a million calculations, and a 500,000-row table scan. We ported the applications to five different clouds, and monitored them for a month. We discovered that performance varies widely by test type and cloud:

[Figure: cloud performance results]

Here are some of the lessons learned:

  • All of the services handled the small image well.
  • PaaS clouds were more efficient at delivering the large object, possibly because of their ability to distribute workload out to caching tiers better than an individual virtual machine can do.
  • didn't handle CPU workloads well, even with a tenth of the load of other agents. Amazon was slow for CPU, but we were using the least-powerful of Amazon's EC2 machines.
  • Google's ability to handle I/O, even under heavy load, was unmatched. Rackspace also dispatched the I/O tests quickly. Then again, it took us 37 hours to insert the data into Google's Bigtable.

In the end, it's clear that there's no single "best" cloud: PaaS (App Engine, scales easily, but locks you in; IaaS (Rackspace, Amazon, Terremark) offers portability, but leaves you doing all the scaling work yourself.

The full 50-page report is available free from Webmetrics.

Web performance and cloud architecture will be key topics at this week's Velocity conference.

June 03 2010

Velocity Culture: Web Operations, DevOps, etc...

Velocity 2010 is happening on June 22-24 (right around the corner!). This year we've added a third track, Velocity Culture, dedicated to exploring what we've learned about how great teams & organizations work together to succeed at scale.

Web Operations, or WebOps, is what many of us have been calling these ideas for years.  Recently the term "DevOps" has become a kind of rallying cry that is resonating with many, along with variations on Agile Operations. No matter what you call it, our experiences over the past decade taught us that Culture matters more than any tool or technology in building, adapting, and scaling the web.

Here is a small sample of the upcoming Velocity Culture sessions:

Ops Meta-Metrics: The Currency You Use to Pay For Change
John Allspaw (
Change to production environments can cause a good deal of stress and strain amongst development and operations teams. More and more organizations are seeing benefits from deploying small code changes more frequently, for stability and productivity reasons. But how can you figure out how much change is appropriate for your application or your culture?

A Day in the Life of Facebook Operations
Tom Cook (Facebook)
Facebook’s Technical Operations team has to balance this need for constant availability with a fast-moving and experimental engineering culture. We release code every day. Additionally, we are supporting exponential user growth while still managing an exceptionally high ratio of users per employee within engineering and operations.

This talk will go into how Facebook is “run” day-to-day with particular focus on actual tools in use (configuration management systems, monitoring, automation, etc), how we detect anomalies and respond to them, and the processes we use internally for rapidly pushing out changes while still keeping a handle on site stability.

Change Management: A Scientific Classification
Andrew Shafer (Cloudscaling)
Change management is the combination of process and tools by which changes are made to production systems. Approaches range from cowboy style, making changes to the live site, to complex rituals with secret incantations, coming full circle to continuous deployment. This presentation will highlight milestone practices along this spectrum, establishing a matrix for evaluating deployment process.

There is a tremendous amount happening in our space in the coming weeks in addition to the conference itself. First, the "Web Operations" book, which John Allspaw & I edited, goes to print on June 15th. We're really excited about how it came together. Then, immediately after Velocity is DevOpsDays, which is a great community event that continues the conversation after Velocity (and is free). Hope to see you all there!

November 24 2009

Velocity 2010: Fast By Default

We're entering our third year of Velocity, the Web Performance & Operations Conference. Velocity 2010 will be June 22-24, 2010 in Santa Clara, CA. It's going to be another incredible year.

Steve Souders & I have set a new theme this year, "Fast by Default". We want the broader Velocity community to adopt it as a shared mission & mantra. The reason for this is simple...

Fast isn't a Feature. Fast is a Requirement.

At Velocity earlier this year Marissa Mayer explained why performance mattered so much to Google. Then Eric Schurman (Bing & Velocity Program Committee member) and Jake Brutlag (Google Search) made history with a co-presentation on just how crucial performance is to revenue.

Phil Dixon of Shopzilla explained that a 5-second performance improvement increased their revenue by 7-12 percent while reducing hardware spend by 50 percent!!!

Fast means Client, Server, Infrastructure, Operations, & Organizations

Getting to Fast isn't just about any one part of the system. Browser & Client performance is crucial, and requires an equally fast server & infrastructure to support it. When load increases, infrastructure must scale quickly or performance suffers. The operational tools and processes for managing software & infrastructure must support rapid changes in a dynamic environment, and be backed by an organization & culture that embraces it.

We're Looking for Speakers - Submit your Proposals Now!

Do you have ideas and experience for improving Web Performance & Operations and making things "Fast by Default"? We want you as a speaker at Velocity 2010.

Submit your Proposals Now! Entries are due no later than January 11th, 2010 at 11:59 PM Pacific.

One more thing...

Quite a few people have asked us to have Velocity conferences more frequently & beyond the SF Bay Area, and so we're going to try something new. On December 8 we'll be running our first ever Velocity Online Conference.

Past Velocity Conference participants get a 50% discount & get a 25% discount off Velocity 2010.

See the full schedule after the jump...

Velocity Online Conference
Tuesday, December 8, 2009
9:00am-12:40pm PT

Introduction to SPDY

Speaker: Mike Belshe

Faster Load Times Through Deferred JavaScript Evaluation

Speaker: Charles Jolley

Making Rails Even Faster by Default

Speaker: Yehuda Katz

Load Balancing & Reverse Proxies with Varnish & More + Q&A

Speaker: Artur Bergman

Browserscope: Profiling the Way to a Better Browser

Speaker: Lindsey Simon

CouchDB from 10,000 ft + Q&A

Speaker: J Chris Anderson

Operations Roundtable

Moderator: Jesse Robbins
Speakers: Artur Bergman, Adam Jacob, John Allspaw
