
June 27 2011

Velocity 2011 retrospective

This was my third Velocity conference, and it's been great to see it grow. Smoothly running a conference with roughly 2,000 registered attendees (well over 50% more than 2010) is itself a testament to the importance of operations.

Velocity 2011 had many highlights, and below you'll find mine. I'll confess to some biases up front: I'm more interested in operations than development, so if you think mobile and web performance are underrepresented here, you're probably right. There was plenty of excellent content in those tracks, so please add your thoughts about those and other sessions in the comments area.

Blame and resilience engineering

I was particularly impressed by John Allspaw's session on "Advanced Post-mortem Fu." To summarize it in a sentence: it was about getting beyond blame. The job of a post-mortem analysis isn't to assign blame for failure, but to accept that failure happens and to create the conditions under which failure will be less likely in the future. This isn't just an operational issue; moving beyond blame is a broader cultural imperative. Cultural historians have made much of the transition from shame culture — where if you failed you were "shamed" and obliged to leave the community — to guilt culture, where shame is internalized (the world of Hawthorne's "The Scarlet Letter"). Now we're clearly in a "blame culture," where it's always someone else's fault and nothing is ever over until the proper people have been blamed (and, more than likely, sued). That's not a way forward, for web ops any more than for finance or medicine.

John presented some new ways of thinking about failure, studying it, and making it less likely without assigning blame. There is no single root cause; many factors contribute to both success and failure, and you won't understand either until you take the whole system into account. Once you've done that, you can work on improving operations to make failure less likely.


John's talk raised the idea of "resilience engineering," an important theme that's emerging from the Velocity culture. Resilience engineering isn't just about making things that work; anyone can do that. It's about designing systems that keep running in the face of problems. Along similar lines, Justin Sheehy's talk was specifically about resilience in the design of Riak. It was fascinating to see how to design a distributed database so that any node can suddenly disappear with no loss of data. Erlang, which Riak uses, encourages developers to partition tasks into small pieces that are free to crash, running under a supervisor that restarts failed tasks.
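That "let it crash" supervision pattern can be sketched outside Erlang, too. Here is a minimal Python analogy — the worker, its failure mode, and the restart policy are all invented for illustration, not taken from Riak:

```python
def flaky_worker(attempt):
    """A hypothetical task that fails on its first run and
    succeeds once restarted (purely illustrative)."""
    if attempt == 0:
        raise RuntimeError("worker crashed")
    return "ok"

def supervise(task, max_restarts=3):
    """Run `task` under a supervisor that restarts it on failure,
    in the spirit of an Erlang supervisor tree: the task is free
    to crash; recovery is the supervisor's job."""
    for attempt in range(max_restarts + 1):
        try:
            return attempt, task(attempt)
        except Exception:
            continue  # let it crash, then restart
    raise RuntimeError("gave up after max restarts")
```

Calling `supervise(flaky_worker)` here returns after one restart. The design point is that error handling lives in the supervisor, not scattered through the task itself.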

Bryan Cantrill's excellent presentation on instrumenting the real-time web using Node.js and DTrace would win my vote for the best technical presentation of the conference, but it was most notable for his rant on Greenland and Antarctica's plot to take over the world. While the rant was funny, it's important not to forget the real message: DTrace is an underused but extremely flexible tool that can tell you exactly what is going on inside an application. It's more complex than other profiling tools I've seen, but in return for complexity, it lets you specify exactly what you want to know, and delivers results without requiring special compilation or even affecting the application's performance.
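DTrace itself is scripted in its own D language, but the underlying idea — attaching probes that aggregate events from a running program without modifying it — can be loosely imitated in Python. This sketch uses `sys.settrace` purely as an analogy; the function names and workload are invented:

```python
import sys
from collections import Counter

call_counts = Counter()

def probe(frame, event, arg):
    """A crude 'probe': count every function entry, roughly like a
    DTrace entry probe aggregating with count()."""
    if event == "call":
        call_counts[frame.f_code.co_name] += 1
    return None  # no line-by-line tracing inside the frame

def traced(fn, *args):
    """Run fn with the probe attached, then detach it."""
    sys.settrace(probe)
    try:
        return fn(*args)
    finally:
        sys.settrace(None)

def helper():
    return 1

def workload():
    total = 0
    for _ in range(3):
        total += helper()
    return total

traced(workload)
```

After the run, `call_counts` shows `workload` entered once and `helper` three times — the kind of "exactly what is going on inside the application" question a real DTrace probe answers, though DTrace does it system-wide and with far lower overhead than an interpreter-level trace hook.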

Data and performance

John Rauser's workshop on statistics ("Decisions in the Face of Uncertainty"), together with his keynote ("Look at your Data"), was another highlight. The workshop was an excellent introduction to working with data, an increasingly important tool for anyone interested in performance. But the keynote took it a step further, going beyond the statistics and looking at the actual raw data, spread across the living room floor. That was a powerful reminder that summary statistics are not always the last word in data: the actual data, the individual entries in your server logs, may hold the clues to your performance problem.
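The point about summary statistics hiding structure is easy to demonstrate. The two "response time" samples below are invented for illustration, but they share the same mean while telling very different operational stories:

```python
from statistics import mean, stdev

# Two illustrative latency samples (ms) with identical means:
steady = [100, 101, 99, 100, 100, 100, 101, 99]  # consistently ~100 ms
spiky = [20, 20, 20, 20, 20, 20, 20, 660]        # mostly fast, one huge outlier

assert mean(steady) == mean(spiky) == 100

# The mean alone says the two servers behave identically;
# the raw data (and the spread) say otherwise.
print(f"steady: mean={mean(steady):.0f} stdev={stdev(steady):.2f}")
print(f"spiky:  mean={mean(spiky):.0f} stdev={stdev(spiky):.2f}")
```

The single slow request in the second sample — the kind of entry you only notice by reading the log lines themselves — is exactly what an average smooths away.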

Velocity observations and overarching themes

There were many other memorable moments at Velocity (Steve Souders' belly dance wasn't one of them). I was amazed that Sean Power managed to do an Ignite Karaoke (a short impromptu presentation against a set of slides he didn't see in advance) that wasn't just funny, but actually almost made sense.

I could continue, but I would end up listing every session I attended; my only regret is that I couldn't attend more. Video for the conference keynotes is available online, so you can catch up on some of what you missed. The post-conference online access pass provides video for all the sessions for which presenters gave us permission.

The excellent sessions aren't the only news from Velocity. The Velocity Women's Networking Meetup had more than double the previous years' attendance; the group photo has more people than I can count. The job board was stuffed to the gills. The exhibit hall dwarfed 2010's — I'd guess we had three times as many exhibitors — and there were doughnuts! But more than the individual sessions, the exhibits, the food, or the parties, I'll remember the overarching themes of cultural change; the many resources available for studying and improving performance; and most of all, the incredible people I met, all of whom contributed to making this conference a success.

We'll see you at the upcoming Velocity Europe in November and Velocity China in December, and next year at Velocity 2012 in California.


May 18 2011

How resilience engineering applies to the web world

In a recent interview, John Allspaw (@allspaw), vice president of technical operations at Etsy and a speaker at Velocity 2011, talked about how resilience engineering can be applied to web environments. Allspaw said the typical "name/blame/shame" postmortem meeting isn't an effective approach. "Second stories" are where the real vulnerabilities and solutions will be found.

Allspaw elaborates in the following Q&A.

What is resilience engineering?

John Allspaw: In the past 20 years, experts in the safety and human factors fields have been crystallizing some of the patterns that they saw when investigating disasters and failures in the "high risk" industries: aviation, space travel, chemical manufacturing, healthcare, etc. These patterns formed the basis for resilience engineering. They all surround the concept that a resilient system is one that can adjust its functioning prior to, during, and after an unexpected or undesired event.

There is a lot that web development and operations can learn from this field because the concepts map easily to the requirements for running successful systems online. One of the pieces of resilience engineering that I find fascinating is in the practical realization that the "system" in that context isn't just the software and machines that have been built to do work, but also the humans who build, operate, and maintain these infrastructures. This means not only looking at faults — or the potential for faults — at the component level, but at the human and process level as well.

This approach is supported by a rich history of complex systems enduring unexpected changes only because of operators' adaptive capacities. I don't think I've ever felt so inspired by another field of engineering. As the web engineering discipline matures, we should be paying attention to research that comes from elsewhere, not just from our own little world. Resilience engineering is an excellent example of that.


How is resilience engineering shaped by human factors science?

John Allspaw: To be clear, I consider myself a student of these research topics, not an expert by any stretch. Human factors is a wide and multidisciplinary field that includes how humans relate to their environments in general. This includes how we react and work within socio-technical systems as they relate to safety and other concerns at the man-machine boundary. That's where cognitive and resilience engineering overlap.

Resilience engineering as its own distinct field is also quite young, but it has roots in human factors, safety and reliability engineering.

Does web engineering put too much focus on tools?

John Allspaw: While resilience engineering might feel like scenario and contingency planning with some academic rigor behind it, I think it's more than that.

By looking at not only what went wrong with complex systems, but also what went right, the pattern emerges that no matter how much intelligence and automation we build into the guts of an application or network, it's still the designers and operators who are at the core of what makes a system resilient. In addition, I think real resilience is built when the designers and operators themselves are considered a critical part of the system.

In web engineering the focus is too often on the machines and tooling. The idea that automation and redundancy alone will make a system resilient is a naive one. Really successful web organizations understand this, as do the high-risk industries I mentioned before.

What are the most common human errors in web operations? What can companies do to avoid them?

John Allspaw: Categorizing human errors is in and of itself an entire topic, and may or may not be useful in learning how to prevent future failures. Lapses, slips, and violations are known as common types, each with their own subtypes and "solutions," but these are very context-specific. Operators can make errors related to the complexity of the specific task, their training or experience, etc. I think James Reason's "Human Error" could be considered the canonical source on human error types and forms.

There's an idea among many resilience engineering researchers that attributing root cause to "human error" is ineffective in preventing future failures, for a number of different reasons. One is that for complex systems, failures almost never have a single "root" cause, but instead have multiple contributing factors. Some of those might be latent failure scenarios that pre-existed but were never exercised until they combined with some other context.

Another reason that the label is ineffective is that it doesn't result in a specific remediation. You can't simply end a postmortem meeting or root cause analysis with "well, let's just chalk it up to human error." The remediation items can't simply be "follow the procedure next time" or "be more vigilant." On the contrary, that's where you should begin the investigation. Actionable remediation items and real learning about failures emerge only after digging deeper into the intentions and motivations for performing actions that contributed to the unexpected outage. Human error research calls this approach looking at "second stories."

Documenting what operators should have done doesn't explain why it made sense for them to do what they did. If you get at those second stories, you'll be learning a lot more about how failures occur to begin with.

How can companies put failure to good use?

John Allspaw: Erik Hollnagel has said that the four cornerstones of resilience engineering are: anticipation, monitoring, response, and learning.

Learning in that context includes having a postmortem or post-issue meeting or process free of a "name/blame/shame" approach. Instead, it should be one that searches for those second stories and improves a company's anticipation of failure by finding the underlying systemic vulnerabilities.

Responding to failure as it's happening is interesting to me as well, and patterns can be found in the research across industries. Troubleshooting complex systems is a difficult business. Getting a team to calmly and purposefully carry out troubleshooting under time, money, and cultural pressures is an even tougher problem, but successful companies are good at it. My field can always be better at building organizational resilience in the face of escalating situations.

Postmortems are learning opportunities, and when they're done well, they feed back into how organizations can bolster their abilities to anticipate, look for, and respond to unexpected events. They also provide rich details on how a team's response to an outage can improve, which can point to all sorts of non-technical adjustments as well.

Should "resilience" be a management objective?

John Allspaw: That's the obvious conclusion made from the high-risk industries, and I think it's intuitive to think that way in online businesses. "Faster, cheaper, better" needs to be augmented with "more resilient" in order to get a full view of how an organization progresses with handling unexpected scenarios. We see successful companies taking this to heart. On the technical front, you see approaches like continuous deployment and gameday exercises, and on the business side we're starting to see postmortems on business decisions and direction that are guiding design and product road maps.

This interview was edited and condensed.

