

July 29 2011

Happy SysAdmin Appreciation Day!

Today is the 12th Annual System Administrator Appreciation Day. If you are reading this (or any other) page, sending a message, watching a video, reading an email, or doing anything else that touches the web ... you can thank a SysAdmin.

To all of you that care so much about building & running the infrastructure that we depend on every day... thank you. You are exceptional people, and you are doing work that matters.

Here's a video of Tim O'Reilly talking about how he came to learn how awesome SysAdmins are.


June 27 2011

Velocity 2011 retrospective

This was my third Velocity conference, and it's been great to see it grow. Smoothly running a conference with roughly 2,000 registered attendees (well over 50% more than 2010) is itself a testament to the importance of operations.

Velocity 2011 had many highlights, and below you'll find mine. I'll confess to some biases up front: I'm more interested in operations than development, so if you think mobile and web performance are underrepresented here, you're probably right. There was plenty of excellent content in those tracks, so please add your thoughts about those and other sessions in the comments area.

Blame and resilience engineering

I was particularly impressed by John Allspaw's session on "Advanced Post-mortem Fu." To summarize it in a sentence, it was about getting beyond blame. The job of a post-mortem analysis isn't to assign blame for failure, but to realize that failure happens and to plan the conditions under which failure will be less likely in the future. This isn't just an operational issue; moving beyond blame is a broader cultural imperative. Cultural historians have made much of the transition from shame culture — where if you failed you were "shamed" and obliged to leave the community — to guilt culture, where shame is internalized (the world of Hawthorne's "Scarlet Letter"). Now, we're clearly in a "blame culture," where it's always someone else's fault and nothing is ever over until the proper people have been blamed (and, more than likely, sued). That's not a way forward, for web ops any more than finance or medicine.

John presented some new ways for thinking about failure, studying it, and making it less likely without assigning blame. There is no single root cause; many factors contribute to both success and failure, and you won't understand either until you take the whole system into account. Once you've done that, you can work on ways to improve operations to make failure less likely.

Velocity 2011 Online Access Pass
Couldn't make it to Velocity? Purchase the Velocity Online Access Pass for $495 and get the following: the Velocity bookshelf (six essential O'Reilly ebooks), access to three upcoming online conferences, and access to the extensive Velocity 2011 conference video archive.

Learn more about the Velocity Online Access Pass

John's talk raised the idea of "resilience engineering," an important theme that's emerging from the Velocity culture. Resilience engineering isn't just about making things that work; anyone can do that. It's about designing systems that stay running in the face of problems. Along similar lines, Justin Sheehy's talk was specifically about resilience in the design of Riak. It was fascinating to see how to design a distributed database so that any node could suddenly disappear with no loss of data. Erlang, which Riak is written in, encourages developers to partition work into small tasks that are free to crash, running under a supervisor that restarts failed tasks.
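The supervision idea can be sketched outside Erlang too. Here is a minimal, hypothetical restart loop in JavaScript; Riak itself relies on Erlang/OTP supervisors, so this illustrates only the pattern, not how Riak works:

```javascript
// Sketch of the Erlang-style "let it crash" idea: a supervisor runs a task
// and simply restarts it when it fails, rather than trying to handle every
// error inside the task itself.
function supervise(task, maxRestarts) {
  let restarts = 0;
  while (true) {
    try {
      return task(); // the task either succeeds...
    } catch (err) {
      restarts += 1; // ...or crashes and is restarted
      if (restarts > maxRestarts) throw err; // give up past the limit
    }
  }
}

// A flaky task that crashes twice before succeeding.
let attempts = 0;
function flaky() {
  attempts += 1;
  if (attempts < 3) throw new Error("crash");
  return "ok";
}

const result = supervise(flaky, 5); // "ok" after two restarts
```

The point of the pattern is that the task itself stays simple: failure handling lives in one place, in the supervisor, instead of being threaded through every piece of code.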

Bryan Cantrill's excellent presentation on instrumenting the real-time web using Node.js and DTrace would win my vote for the best technical presentation of the conference, but it was most notable for his rant on Greenland and Antarctica's plot to take over the world. While the rant was funny, it's important not to forget the real message: DTrace is an underused but extremely flexible tool that can tell you exactly what is going on inside an application. It's more complex than other profiling tools I've seen, but in return for complexity, it lets you specify exactly what you want to know, and delivers results without requiring special compilation or even affecting the application's performance.

Data and performance

John Rauser's workshop on statistics ("Decisions in the Face of Uncertainty"), together with his keynote ("Look at your Data"), was another highlight. The workshop was an excellent introduction to working with data, an increasingly important tool for anyone interested in performance. But the keynote took it a step further, going beyond the statistics and looking at the actual raw data, spread across the living room floor. That was a powerful reminder that summary statistics are not always the last word in data: the actual data, the individual entries in your server logs, may hold the clues to your performance problem.

Velocity observations and overarching themes

There were many other memorable moments at Velocity (Steve Souders' belly dance wasn't one of them). I was amazed that Sean Power managed to do an Ignite Karaoke (a short impromptu presentation against a set of slides he didn't see in advance) that wasn't just funny, but actually almost made sense.

I could continue, but I would end up listing every session I attended; my only regret is that I couldn't attend more. Video for the conference keynotes is available online, so you can catch up on some of what you missed. The post-conference online access pass provides video for all the sessions for which presenters gave us permission.

The excellent sessions aren't the only news from Velocity. The Velocity Women's Networking Meetup had more than double the previous years' attendance; the group photo (right) has more people than I can count. The job board was stuffed to the gills. The exhibit hall dwarfed 2010's — I'd guess we had three times as many exhibitors — and there were doughnuts! But more than the individual sessions, the exhibits, the food, or the parties, I'll remember the overarching themes of cultural change; the many resources available for studying and improving performance; and most of all, the incredible people I met, all of whom contributed to making this conference a success.

We'll see you at the upcoming Velocity Europe in November and Velocity China in December, and next year at Velocity 2012 in California.


June 17 2011

Velocity 2011 debrief

Velocity wrapped up yesterday. This was Velocity's fourth year and every year has seen significant growth, but this year felt like a tremendous step up in all areas. Total attendance grew from 1,200 last year to more than 2,000 people. The workshops were huge, the keynotes were packed, and the sessions in each track were bigger than anyone expected. The exhibit hall was more than twice as big as last year and it was still crowded every time I was there.

Sample some of the tweets to see the reaction of attendees, sponsors, and exhibitors.

Several folks on the #velocityconf Twitter stream have been asking about slides and videos. You can find those on the Velocity Slides and Videos page. There are about 25 slide decks up there right now. The rest of the slides will be posted as we receive them from the speakers. Videos of all the keynotes will be made available for free. Several are already posted, including "Career Development" by Theo Schlossnagle, "JavaScript & Metaperformance" by Doug Crockford, and "Look at Your Data" by the omni-awesome John Rauser. Videos of every afternoon session are also available via the Velocity Online Access Pass ($495).

Velocity 2011 had a great crowd with a lot of energy. Check out the Velocity photos to get a feel for what was happening. We had more women speakers than ever before and I was psyched when I saw this photo of the Women's Networking Meet Up that took place during the conference (also posted above).

Velocity 2011: Take-aways, Trends, and Highlights — In this webcast following Velocity 2011, program chairs Steve Souders and John Allspaw will identify and discuss key trends and announcements that came out of the event and how they will impact the web industry in the year to come.

Join us on Friday, June 24, 2011, at 10 am PT

Register for this free webcast

Make sure to check out all the announcements that were made at Velocity. There were a couple big announcements about Velocity itself, including:

  • After four years Jesse Robbins is passing the co-chair mantle to John Allspaw. I worked with John at Yahoo! when he was with Flickr. John is VP of Tech Ops at Etsy now. He stepped into many of the co-chair duties at this Velocity in preparation for taking on the role at the next Velocity.
  • Speaking of the next Velocity, we announced there will be a Velocity Europe in November in Berlin. The exact venue and dates will be announced soon, followed quickly by a call for proposals. We're extremely excited about expanding Velocity to Europe and look forward to connecting with the performance and operations communities there, and helping grow the WPO and devops industries in that part of the world. In addition, the second Velocity China will be held in Beijing in December 2011.
  • And of course we'll be back next June for our fifth year of Velocity here in the Bay Area.

I covered a lot in this post and didn't even talk about any of the themes, trends, and takeaways. John and I will be doing that at the Velocity Wrap-up Webcast on Friday, June 24 at 10am PT. It's free so invite your friends and colleagues to join in.


June 02 2011

Velocity 2011

This year is our fourth Velocity conference on web performance and operations. What began with a meeting between Steve Souders, Jesse Robbins, Tim O'Reilly, and others at OSCON 2007, has become a thriving community. We're expecting this year to sell out again, even with significantly more space than we had last year. It will be the largest Velocity yet.

According to Tim O'Reilly, the motivation behind the 2007 meeting was a call to "gather the tribe" of people who cared about web performance and create a new conference. The audience for this conference, in 2007 as it is today, is made up of the people who keep the web running, the people behind the biggest websites in the world. The participants in that first meeting reflected a change that was underway in the relationship between web development and web operations. Even saying that there was a single "tribe" to be gathered was important. Participants in the 2007 meeting had already realized that there was a single web performance community, uniting both developers and operations staff. But in many organizations, web performance and web operations were disciplines that were poorly defined, poorly documented, and insufficiently recognized. In some organizations, then and now, people involved with web operations and web development were hardly even aware of each other.

The participants in that 2007 meeting came from organizations that were in the process of finding common ground between developers and operations, of making the cultural changes that allowed them to work together productively, and wanted to share those insights with the rest of the Web community. Both developers and operations staff are trying to solve the same problems. Customers aren't happy if the site isn't up; customers aren't happy if the site is slow and unresponsive; new features don't do anyone any good if they can't be deployed in a production environment. Developers and operations staff have no trouble agreeing on these unifying principles. Having agreed on these unifying principles, developers and operations staff quickly discover that they speak the same language: they both know the code intimately, understand performance issues, and understand tweaking servers and hand-tuning code to optimize performance. And when we held the first Velocity conference in 2008, it was indeed a "meeting of the tribe" — two groups that found they were really allies.

Velocity 2011, being held June 14-16 in Santa Clara, Calif., offers the skills and tools you need to master web performance and operations.

Save 20% on registration with the code VEL11RAD

Agility, infrastructure, and code

Velocity became the ground for discussing and testing a number of important new ideas. Perhaps one of the most important was the idea of agile operations. Agile development methodologies had taken the software world by storm: instead of long, waterfall-driven release cycles, software developers started by building a minimal product, then iterating quickly to add features, fix bugs, and refactor the design. Continuous integration soon became part of the agile world, with frequent builds and testing of the entire software package. This practice couldn't help but affect operations, and (with a certain amount of trepidation) forward-thinking companies like Flickr started deploying many times a day. Each deployment represented a small change: part of a new feature, a bug fix, whatever. This was revolutionary. Frequent deployment meant that bugs surfaced before the developers had moved on to their next project, and they were still available to fix problems.

At the same time, tools for managing large networks were improving. They had to improve; we were long past the stage where networks of computers could be set up and managed by hand, on a one-at-a-time basis. Better tools were particularly important in a server environment, where software installation and configuration were increasingly complex, and web companies had long since moved from individual servers to server farms. Cfengine, the first tool for automating software installation and configuration, started a revolution in the mid-'90s, which is carried on by Puppet and Chef. Along with better tools came a change in the nature of the job. Rather than mumbling individual incantations, combined with some ad-hoc scripting, system administration became software development, and infrastructure became code. If there was any doubt about this shift, Amazon Web Services put an end to it. You can't administer servers from the console when you don't even know where the servers are, and you're starting them and shutting them down by the dozens, if not thousands.

Optimizing the client side

One of the discoveries that led to Velocity comes from Steve Souders' "High Performance Web Sites." We've long known that performance was important, but for most of the history of the Web, we thought the answer to performance problems lay in the infrastructure: make the servers faster, get more out of the databases, etc. These are certainly important, but Steve showed convincingly that the biggest contribution to slow web pages is stuff that happens after the browser gets the response. Optimizing what you send to the browser is key to creating a faster user experience; tuning the servers is secondary. Hence a hallmark of Velocity is extended discussions of client-side performance optimization: compression, breaking JavaScript up into small, digestible chunks that can be loaded as required, optimizing the use of CSS, and so on. Another hallmark of Velocity is the presence of lead developers from all the major browser vendors, ready to talk about standards and making it easier for developers to optimize their code so that it works across all browsers.
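The "small, digestible chunks loaded as required" idea can be sketched as a lazy loader. This is a hypothetical stand-in: in a real page, the load function would insert a script tag or fetch a module asynchronously, but the pattern is the same.

```javascript
// Lazy loading: pay the download/parse cost of a chunk only when the
// feature is first used, and reuse the cached result afterwards.
function lazy(loadFn) {
  let cached = null;
  let loads = 0;
  return {
    get() {
      if (cached === null) {
        loads += 1;      // first use: actually "load" the chunk
        cached = loadFn();
      }
      return cached;     // later uses: reuse the cached chunk
    },
    loadCount: () => loads,
  };
}

// Hypothetical feature module; nothing is loaded until it's used.
const widget = lazy(() => ({ render: () => "rendered" }));

const before = widget.loadCount(); // 0: not loaded yet
widget.get().render();
widget.get().render();
const after = widget.loadCount();  // 1: loaded once, reused
```

The win for page load time is that code for features the user never touches is never downloaded or parsed at all.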

One talk in particular crystallized just how important performance is: In their 2009 presentation, "Performance Related Changes and their User Impact," Eric Schurman of Microsoft's Bing and Jake Brutlag of Google showed that imperceptibly small increases in response time cause users to move away from your site to another. If response time is more than a second, you're losing a significant portion of your traffic. Here was proof that even milliseconds counted, and that users respond to degradation they can't consciously perceive.

But perhaps the companies the speakers represented were even more important than their results: developers from Microsoft and Google were talking, together, about the importance of performance to the future of the Web. As important as the results were, getting competitors like Microsoft and Google on the same stage to talk about web performance was a tremendous validation of Velocity's core premise that performance is central to taking the web into the next decade.

Mobile, HTML5, Node, and what lies ahead

We're now in the next decade, and mobile has become part of the discussion in ways we could only begin to anticipate a few years ago. Web performance for mobile devices is not the same as desktop web performance, and it's becoming ever more important. Last year, we heard that users expect mobile websites to be as responsive as desktop sites. But are the same techniques effective for optimizing mobile performance? We're in the process of finding out. It looks like client-side optimization is even more important in the mobile world than for the desktop/laptop web.

With those broader themes in mind, what's Velocity about this year? We have plenty of what you've come to expect: lots of material on the culture and integration of development and operations teams, plenty of sessions on measuring performance, plenty of ways to optimize your HTML, CSS, and JavaScript. There's a new track specifically on mobile performance, and a new track specifically for products and services, where vendors can showcase their offerings.

Here are some of the topics that we'll explore at Velocity 2011:

  • HTML5 is a lot more than a bunch of new tags; it's a major change in how we write and deliver web applications. It represents a significant change in the balance of power between the client-side and the server-side, and promises to have a huge impact on web optimization.

  • Node.js is a new high performance server platform; you couldn't go very far at Velocity 2010 without hearing someone talk about it in the hall. Its event-driven architecture is particularly suited for high performance, low-latency web sites. Sites based wholly or partially on Node are showing up everywhere, and forcing us to rethink the design of web applications.

  • Since mobile is increasing in importance, we've given it a whole conference track, covering mobile performance measurement and optimization, realtime analytics, concrete performance tips, and more.

  • In the past, we've frequently talked about building systems that are robust in the face of various kinds of damage; we've got more lined up on resilience engineering and reliability.

  • We're finally out of IPv4 address space, and the move to IPv6 has definite implications for operations and performance optimization. While we only have one IPv6 talk in this year's program, we can expect to see more in the future.

This year, we're expecting our largest crowd ever. It's going to be an exciting show, with people like Nicole Sullivan looking at what's really important in HTML5 and CSS3; Steve Souders introducing this year's crop of performance tools; Sarah Novotny discussing best strategies for effective web caching; John Allspaw on conducting post-mortems effectively; and much more.

Finally, Velocity doesn't end on June 16. We're planning Velocity conferences to take place in Europe and China later in 2011 — details are coming soon and we hope to see you there. And if you can't make it to either of those locations, we'll see you again in June, 2012.


June 01 2011

The state of speed and the quirks of mobile optimization

Google's performance evangelist and Velocity co-chair Steve Souders (@souders) recently talked with me about speed, browser wars, and desktop performance vs mobile performance. He also discussed a new project he's working on called the HTTP Archive, which documents how web content is constructed, how it changes over time, and how those changes play out.

Our interview follows.

What are the major factors slowing down site performance?

Steve Souders: For years when developers started focusing on the performance of their websites, they would start on the back end, optimizing C++ code or database queries. Then we discovered that about 10% or 20% of the overall page load time was spent on the back end. So if you cut that in half, you only improve things 5%, maybe 10%. In many cases, you can reduce the back end time to zero and most users won't notice.

So really, improvement comes from the time spent on the front end, on the network transferring resources and in the browser pulling in those resources. In the case of JavaScript and CSS, it's in parsing them and executing JavaScript. Without any changes in user network connection speeds, websites are able to cut their page load times in half. And that's because even with fast connection speeds or slow connection speeds, there are ways that the browser downloads these resources that developers can control. For example, more parallel downloads or less network overhead in managing connections.

We can work around some of the network problems, but inside the browser there's very little that developers can do in how JavaScript and CSS are handled. And of those two, JavaScript is a much bigger problem. Websites have a lot more JavaScript on them than CSS, and what they do with that JavaScript takes a lot more time. I always tell website owners: "If you care about how fast your website is, the first place to look is JavaScript. And if you can adopt some of the performance best practices we have around JavaScript, you can easily make gains in having your website load faster."

Why do load times vary by browser?

Steve Souders: Even if you're on the same machine, loading a page in one browser vs another can lead to very different timing. Some of the factors that affect this difference are things like the JavaScript engine, caching, network behavior, and rendering.

I don't think it's likely that we ever will see standardization in all of those areas — which I think is a good thing. If we look at the competition in the last few years in JavaScript engines, I think we would all agree that that competition has resulted in tremendous technological growth in that space. We can see that same growth in other areas as the focus on performance continues to grow.

Velocity 2011, being held June 14-16 in Santa Clara, Calif., offers the skills and tools you need to master web performance and operations.

Save 20% on registration with the code VEL11RAD

Are we in the middle of a "speed arms race" between browser developers?

Steve Souders: We're certainly in a phase where there's a lot of competition across the browser teams, and speed is one of the major competitive differentiators. That's music to my ears. I don't know if we're in the middle of it, because it's been going on for two or three years now. Going forward, I think speed is always going to be a critical factor for a browser to be successful. So perhaps we're just at the very beginning of that race.

Starting around 2005 and 2006, we started to see web apps far outpacing the capabilities of the browsers that they ran in, mostly in terms of JavaScript and CSS but also in resource downloads and the size of resources. I'll be honest, I was nervous about the adoption of AJAX and Web 2.0, given the state of the browsers at the time, but after that explosion, the browsers took notice, and I think that's when this focus on performance really took off. We've seen huge improvements in network behavior, parallel downloads, and JavaScript performance. JavaScript engines have become much faster, and improvements in CSS and layout — and some of the awareness around these performance best practices — have helped as well.

We're just reaching the point where the browsers are catching up with the rich interactive web apps that they're hosting. And all of a sudden, HTML5 came on the scene — the audio tag, video tag, canvas, SVG, web workers, custom font files — and I think as we see these HTML5 features get wider adoption, browsers are going to have to put even more focus on performance. Certainly mobile is another area where browser performance is going to have a lot of growth and is going to have a critical impact on the adoption and success of the web.

What new optimization quirks or obstacles does mobile browsing create?

Steve Souders: As large and multi-dimensional as the browser matrix currently is, it's nothing compared to the size of that matrix for mobile, where we have even more browsers, hardware profiles, connection speeds, types of connections, and proxies.

One of the biggest challenges developers are going to face on the mobile side is getting a handle on the performance of what we're building across the devices we care about. I talked earlier about how on the desktop, without any change in connection speed, developers could work around some of the network obstacles and get significant improvement in their page load times. On mobile, that's going to be more difficult.

The connections on mobile are slower, but they're also constrained in other ways. For example, the number of connections per server and the maximum number of connections across all servers are typically lower on mobile devices than they are on the desktop. And the path that HTTP requests take from a mobile device to the origin server can be much more convoluted and slower than it is on the desktop.

So, network performance is going to be a major obstacle, but we can't forget about JavaScript and CSS. Mobile devices have less power than desktops. The same amount of JavaScript and CSS — or even half the amount of JavaScript and CSS — that we have in the desktop could take significantly longer when executed on a mobile platform.

What should companies be doing to optimize mobile browsing?

Steve Souders: Developers are in a great place when it comes to building desktop websites because there's a significant number of performance best practices out there with a lot of research behind them and a lot of tooling and automation. The problem is, we don't have any of that for mobile. That's the goal, but right now, it doesn't exist.

When I started talking a year or so ago about switching the focus to mobile performance, most people would respond with, "Don't the best practices we have for desktop also apply to mobile?" And I always said the same thing, "I don't know, but I'm going to find out." My guess is that half of them are important, a quarter of them don't really matter, and a quarter of them actually hurt mobile performance. Then there's a whole set of performance best practices that are really important for mobile but don't matter so much for the desktop, so no one's really focused on them.

In the first few months that I've been able to focus on mobile, that's played out pretty well. There are some things, like domain sharding, that are really great for desktop performance but actually hurt mobile performance. And there are other things — like "data: URIs" for embedding images, and relying on localStorage for long-term caching — that are great for mobile and don't exist in any of the popular lists of performance best practices. Unfortunately, companies that want to invest in mobile performance don't have a large body of best practices to refer to. And that's where we are now, at least that's where I am now — trying to identify those best practices. Once we have them, we can evangelize, codify, and automate them.
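As a concrete illustration of the data: URI technique mentioned above: a small image's bytes can be base64-encoded straight into the page, trading an extra HTTP request for a slightly larger document. The three bytes below are just a stand-in for real image data, and the example only builds the URI string; in a page it would go into an img src or a CSS background-image.

```javascript
// Embed "image" bytes as a data: URI, avoiding a separate HTTP request.
// On mobile, where connections are slow and limited, saving a request can
// outweigh the ~33% size overhead of base64 encoding.
const pixel = Buffer.from([0x47, 0x49, 0x46]); // stand-in bytes ("GIF")
const dataUri = "data:image/gif;base64," + pixel.toString("base64");
// dataUri is now usable as e.g. <img src="..."> in generated HTML.
```

The trade-off is that data: URIs are inlined into the document, so they can't be cached independently; that's why Souders pairs the technique with localStorage for longer-term caching on mobile.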

What is the HTTP Archive and how can developers use it to improve site speed?

Steve Souders: Over the last five years, we've seen a lot of interest in website optimization — and websites have changed over that time. Unfortunately, we don't have any record of what those changes have been, how significant they've been, or what areas we've seen change and what areas we haven't seen change. The purpose of the HTTP Archive is to give us that history.

It's similar to the Internet Archive started by Brewster Kahle in 1996 — the Internet Archive collects the web's content and the HTTP Archive archives how that content was built and served. Both are important: The Internet Archive provides society with a record of the evolution of digital media on the web, and the HTTP Archive provides a record of how that digital content has been served to users and how it's changing, specifically for people interested in website performance.

This project will highlight areas where we've seen good traction of performance best practices and where we haven't. Another insight that will come from this is that it's important, for example, for browser vendors and JavaScript framework developers to develop tools and features that can be adopted by developers to improve performance. It's also important to provide support for development patterns that are currently popular on the Internet. We can't ignore the way the web is currently built and just author new features and wait for developers to adopt them. The HTTP Archive will provide some perspective on current development practices and patterns so browser developers can focus on performance optimizations that fit with those patterns.

Image transfer size and image request chart (see more trends data from the HTTP Archive).

Right now, there aren't that many slices of the data, but the ones that are there are pretty powerful. I think the most impactful ones are the trending charts because we can see how the web is changing over time. For example, we noticed that the size of images has grown about 12% over the last five months. That's pretty significant. And there are new technologies that address performance issues like image size — Google has recently released the WebP proposal for a new image format that reduces image size. So, the adoption of that new format by developers and other browsers might be accelerated when they see that image size is growing and will consume even more bandwidth going forward.

Associated photo on index pages: Speedy Gonzales by blmurch, on Flickr


May 20 2011

Four short links: 20 May 2011

  1. BitCoin Watch -- news and market analysis for this artificial currency. (If you're outside the BitCoin world wondering wtf the fuss is all about, try We Use Coins for a gentle primer and then Is BitCoin a Good Idea? for the case against.) (via Andy Baio)
  2. Time Capsule -- send your Flickr photos from a year ago. I love that technology helps us connect not just with other people right now, but with ourselves in the future. Compare TwitShift and Foursquare and Seven Years Ago. (via Really Interesting Group)
  3. HTTP Archive Mobile -- mobile performance data. The top 100 web pages average out at 271kb vs 401kb for their desktop incarnations, which still seems unjustifiably high to me.
  4. Skype at Conferences -- The two editors of the book were due to lead the session but were at the wrong ends of a skype three way video conference which stuttered into a dalekian half life without really quite making the breakthrough into comprehensibility. After various attempts to rewire, reconfigure and reboot, we gave up and had what turned into a good conversation among the dozen people round the table in London. Conference organizers, take note: Skype at conferences is a recipe for fail.

May 18 2011

How resilience engineering applies to the web world

In a recent interview, John Allspaw (@allspaw), vice president of technical operations at Etsy and a speaker at Velocity 2011, talked about how resilience engineering can be applied to web environments. Allspaw said the typical "name/blame/shame" postmortem meeting isn't an effective approach. "Second stories" are where the real vulnerabilities and solutions will be found.

Allspaw elaborates in the following Q&A.

What is resilience engineering?

John Allspaw: In the past 20 years, experts in the safety and human factors fields have been crystallizing some of the patterns that they saw when investigating disasters and failures in the "high risk" industries: aviation, space travel, chemical manufacturing, healthcare, etc. These patterns formed the basis for resilience engineering. They all surround the concept that a resilient system is one that can adjust its functioning prior to, during, and after an unexpected or undesired event.

There is a lot that web development and operations can learn from this field because the concepts map easily to the requirements for running successful systems online. One of the pieces of resilience engineering that I find fascinating is in the practical realization that the "system" in that context isn't just the software and machines that have been built to do work, but also the humans who build, operate, and maintain these infrastructures. This means not only looking at faults — or the potential for faults — at the component level, but at the human and process level as well.

This approach is supported by a rich history of complex systems enduring unexpected changes only because of operator's adaptive capacities. I don't think I've felt so inspired by another field of engineering. As the web engineering discipline matures, we should be paying attention to research that comes from elsewhere, not just in our own little world. Resilience engineering is an excellent example of that.

Velocity 2011, being held June 14-16 in Santa Clara, Calif., offers the skills and tools you need to master web performance and operations.

Save 20% on registration with the code VEL11RAD

How is resilience engineering shaped by human factors science?

John Allspaw: To be clear, I consider myself a student of these research topics, not an expert by any stretch. Human factors is a wide and multidisciplinary field that includes how humans relate to their environments in general. This includes how we react and work within socio-technical systems as it relates to safety and other concerns at the man-machine boundary. That's where cognitive and resilience engineering overlap.

Resilience engineering as its own distinct field is also quite young, but it has roots in human factors, safety and reliability engineering.

Does web engineering put too much focus on tools?

John Allspaw: While resilience engineering might feel like scenario and contingency planning with some academic rigor behind it, I think it's more than that.

By looking at not only what went wrong with complex systems, but also what went right, the pattern emerges that no matter how much intelligence and automation we build into the guts of an application or network, it's still the designers and operators who are at the core of what makes a system resilient. In addition, I think real resilience is built when the designers and operators themselves are considered a critical part of the system.

In web engineering the focus is too often on the machines and tooling. The idea that automation and redundancy alone will make a system resilient is a naive one. Really successful web organizations understand this, as do the high-risk industries I mentioned before.

What are the most common human errors in web operations? What can companies do to avoid them?

John Allspaw: Categorizing human errors is in and of itself an entire topic, and may or may not be useful in learning how to prevent future failures. Lapses, slips, and violations are known as common types, each with their own subtypes and "solutions," but these are very context-specific. Operators can make errors related to the complexity of the specific task, their training or experience, etc. I think James Reason's "Human Error" could be considered the canonical source on human error types and forms.

There's an idea among many resilience engineering researchers that attributing root cause to "human error" is ineffective in preventing future failures, for a number of different reasons. One is that for complex systems, failures almost never have a single "root" cause, but instead have multiple contributing factors. Some of those might be latent failure scenarios that pre-existed but were never exercised until they combined with some other context.

Another reason that the label is ineffective is that it doesn't result in a specific remediation. You can't simply end a postmortem meeting or root cause analysis with "well, let's just chalk it up to human error." The remediation items can't simply be "follow the procedure next time" or "be more vigilant." On the contrary, that's where you should begin the investigation. Actionable remediation items and real learning about failures emerge only after digging deeper into the intentions and motivations for performing actions that contributed to the unexpected outage. Human error research calls this approach looking at "second stories."

Documenting what operators should have done doesn't explain why it made sense for them to do what they did. If you get at those second stories, you'll be learning a lot more about how failures occur to begin with.

How can companies put failure to good use?

John Allspaw: Erik Hollnagel has said that the four cornerstones of resilience engineering are: anticipation, monitoring, response, and learning.

Learning in that context includes having a postmortem or post-issue meeting or process void of a "name/blame/shame" approach. Instead, it should be one that searches for those second stories and improves a company's anticipation for failure by finding the underlying systemic vulnerabilities.

Responding to failure as it's happening is interesting to me as well, and patterns can be found in the research across industries. Troubleshooting complex systems is a difficult business. Getting a team to calmly and purposefully carry out troubleshooting under time, money, and cultural pressures is an even tougher problem, but successful companies are good at it. My field can always be better at building organizational resilience in the face of escalating situations.

Postmortems are learning opportunities, and when they're done well, they feed back into how organizations can bolster their abilities to anticipate, look for, and respond to unexpected events. They also provide rich details on how a team's response to an outage can improve, which can point to all sorts of non-technical adjustments as well.

Should "resilience" be a management objective?

John Allspaw: That's the obvious conclusion made from the high-risk industries, and I think it's intuitive to think that way in online businesses. "Faster, cheaper, better" needs to be augmented with "more resilient" in order to get a full view of how an organization progresses with handling unexpected scenarios. We see successful companies taking this to heart. On the technical front, you see approaches like continuous deployment and gameday exercises, and on the business side we're starting to see postmortems on business decisions and direction that are guiding design and product road maps.

This interview was edited and condensed.


May 11 2011

How the cloud helps Netflix

As Internet-based companies outgrow their data centers, they're looking at larger cloud-based infrastructures such as those offered by Microsoft, Google, and Amazon. Last year, Netflix made such a transition when it moved some of its services into Amazon's cloud.

In a recent interview, Adrian Cockcroft (@adrianco), cloud architect at Netflix and a speaker at Velocity 2011, talked about what it took to move Netflix to the cloud, why they chose Amazon's platform, and how the company is accommodating the increasing demands of streaming.

Our interview follows.

Why did Netflix choose to migrate to Amazon's cloud?

Adrian Cockcroft: We couldn't build our own data centers fast enough to track our growth rate and global roll out, so we leveraged Amazon's ability to build and run large-scale infrastructure. In doing that, we got extreme agility. For example, when we decided to test world-wide deployment of services, our developers were immediately able to launch large-scale deployments and tests on another continent, with no planning delay.

What architectural changes were required to move from a conventional data center to a cloud environment?

Adrian Cockcroft: We took the opportunity to re-work our apps to a fine-grain SOA-style architecture, where each developer pushes his own auto-scaled service. We made a clean separation of stateful services and stateless business logic, and designed with the assumption that large numbers of systems would fail and that we should keep running without intervention. This was largely about paying down our technical debt and building a scalable web-based product using current best practices.


What issues are you facing as streaming demand increases?

Adrian Cockcroft: We work with all three "terabit-scale" content delivery networks — Level 3, Limelight, and Akamai. They stream our movies to the end customer, and if there is a problem with one of them, traffic automatically switches to another. We don't see any limits on how much traffic we can stream. We aren't trying to feed everyone in the world from a single central point — it's widely distributed.

Netflix doesn't ask customers to change much on their side (browsers, speeds, etc.) — how do you achieve this level of inclusivity, and do you see it continuing?

Adrian Cockcroft: We have very wide ranging support for streaming devices and expect this to continue. We are working on the HTML5 video tag standards, which may eventually allow DRM-protected playback of movies on any browser with no plugin. We currently depend on Silverlight for Windows and Mac OS, and we don't have a supported DRM mechanism for playback on Linux browsers.

For hardware devices, we work with the chip manufacturers to build Netflix-ready versions of the chipsets used to build TV sets and Blu-ray players. That way we are included in almost all new Internet-connected TV devices.

This interview was edited and condensed.


April 29 2011

Four short links: 29 April 2011

  1. Kathy Sierra Nails Gamification -- I rarely link to things on O'Reilly sites, and have never before linked to something on Radar, but the comments here from Kathy Sierra are fantastic. She nails what makes me queasy about shallow gamification behaviours: replacing innate rewards with artificial ones papers over shitty products/experiences instead of fixing them, and doesn't get people to a flow state, which is what is truly motivating for its own sake (like getting people to try snowboarding the first few times... The beer may be what gets them there, but the feeling of flying through fresh powder is what sustains it, but only if we quit making it Just About The Beer and frickin teach them to fly). (via Jim Stogdill)
  2. Patient Driven Social Network Refutes Study, Publishes Its Own Results -- The health-data-sharing website PatientsLikeMe published what it is calling a “patient-initiated observational study” refuting a 2008 report that found the drug lithium carbonate could slow the progression of the neurodegenerative disease amyotrophic lateral sclerosis or ALS. The new findings were published earlier this week in the journal Nature Biotechnology. (via mthomps)
  3. Corporate Transparency -- learn where, when and by whom your chocolate bar was made, from which chocolate stock, etc. This kind of traceability and provenance information is underrated in business. (via Jim Stogdill)
  4. SPDY -- Google's effort to replace HTTP with something faster. It has been the protocol between Chrome and Google's servers, now they hope it will go wider. All connections are encrypted and compressed out of the box.

April 21 2011

Developing countries and Open Compute

During a panel discussion after the recent Facebook Open Compute announcement, a couple of panelists — Jason Waxman, GM in Intel's server platforms group, and Forrest Norrod, VP and GM of Dell's server platform — indicated the project could be beneficial to developing countries. Waxman said:

The reality is, you walk into data centers in emerging countries and it's a 2-kilowatt rack and there's maybe three servers in that rack, and the whole data center is powered inefficiently — their air is going every which way and it's hot, it's cold. It costs a lot. It's not ecologically conscious. By opening up this platform and by building awareness of what the best practices are in how to build a data center, how to make efficient servers and why you should care about building efficient servers and how to densely populate into a rack, there are a lot of places ... that can benefit from this type of information.

In a similar vein, Norrod said:

I think what you're going to see happen here is an opportunity for those Internet companies in the developing world to take a leap forward, jumping over the last 15 years of learnings, and exploiting the most efficient data center and server designs that we have today.

The developing countries angle intrigued me, so I sent an email to Benetech founder and CEO Jim Fruchterman to get his take. Fruchterman's company has a unique focus: apply the "intellectual capital and resources of Silicon Valley" to create solutions around the world for a variety of social problems. Recent projects have focused on human rights, literacy, and the development of the Miradi nature conservation project software.

His verdict? While efficient data centers are useful, they're secondary to pressing issues like infrastructure, reliable power, and basic literacy.

Fruchterman's reply follows:

While I'm excited about an open initiative coming from Facebook, I'm not so sure that its impact on developing countries will be all that significant in the foreseeable future. Watching the announcement video, I didn't find these words coming out of the Facebook teams' mouths, but instead the Intel and Dell panelists. And, their comments focused mostly on India, China and Brazil — not exactly your typical "developing" countries.

The good news is, of course, that these open plans show how to reduce energy and acquisition costs per compute cycle. So, anyone building a data center can build a cheaper and lower power data center. That's great. But, building data centers is probably not on the top of the wish lists of most developing countries. Telecom and broadband infrastructure, reliable power (at the grid level, not the server power supply level), end-user device cost and reliability, localization, and even basic literacy seem to be more crucial to these communities. And, most of these factors are prerequisites to investing significantly in data centers.

Of course, our biggest concerns around Facebook are around free speech, anonymous speech, and the protection of human rights defenders. Facebook is increasingly a standard part of global user experience, and we think that it's crucial that Facebook get in front of these concerns, rather than being inadvertently a tool of repressive governments. We're glad that groups like the Electronic Frontier Foundation (EFF) have been working with Facebook and seeing progress, but we need more.

Fruchterman's response was edited and condensed.


April 07 2011

What Facebook's Open Compute Project means

Today, Jonathan Heiliger, VP of Operations at Facebook, and his team announced the Open Compute Project, releasing their data center hardware stack as open source. This is a revolutionary project, and I believe it's one of the most important in infrastructure history. Let me explain why.

The way we operate systems and datacenters at web scale is fundamentally different than the world most server vendors seem to design their products to run in.

Web-scale systems focus on the entire system as a whole. In our world, individual servers are not special, and treating them as special can be dangerous. We expect servers to fail and we increasingly rely on the software we write to manage those failures. In many cases, the most valuable thing we can do when hardware fails is to simply provision a new one as quickly as possible. That means having enough capacity to do that, a way of programmatically managing the infrastructure, and an easy way to replace the failed components.

The server vendors have been slow to make this transition because they have been focused on individual servers, rather than systems as a whole. What we want to buy is racks of machines, with power and networking preconfigured, which we can wheel in, bolt down, and plug in. For the most part we don't care about logos, faceplates, and paint jobs. We won't use complex integrated proprietary management interfaces, and we haven't cared about video cards in a long time ... although it is still very hard to buy a server without them.
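The "provision a replacement rather than repair" stance described above can be sketched as a reconciliation loop: compare the healthy fleet against desired capacity and top it up programmatically. The inventory shape and `provision_server` call below are illustrative stand-ins, not a real provisioning API.

```python
# Minimal sketch of web-scale failure handling: failed servers are
# dropped and fresh capacity is provisioned to replace them.

import uuid

def provision_server(role):
    """Pretend to provision a fresh machine for the given role."""
    return {"id": uuid.uuid4().hex[:8], "role": role, "healthy": True}

def reconcile(fleet, desired_count, role="web"):
    """Drop failed servers and top the fleet back up to capacity."""
    healthy = [s for s in fleet if s["healthy"]]
    while len(healthy) < desired_count:
        healthy.append(provision_server(role))
    return healthy

fleet = [{"id": "a1", "role": "web", "healthy": True},
         {"id": "b2", "role": "web", "healthy": False}]
fleet = reconcile(fleet, desired_count=3)
print(len(fleet))   # back at target capacity
```

The point of the sketch is that no individual machine is special: the failed server is never repaired, only replaced, which is exactly the property that makes commodity racks more valuable than vendor management features.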

This gap is what led Google to build their own machines optimized for their own applications in their own datacenters. When Google did this, they gained a significant competitive advantage. Nobody else could deploy as much compute power as quickly and efficiently. To compete with Google's developers you also must compete with their operations and data center teams. As Tim O'Reilly said: "Operations is the new secret sauce."

When Jonathan and his team set out to build Facebook's new datacenter in Oregon, they knew they would have to do something similar to achieve the needed efficiency. Jonathan says that the Prineville, Ore. data center uses 38% less energy to do the same work as Facebook's existing facilities, while costing 24% less.

Facebook then took the revolutionary step of releasing the designs for most of the hardware in the datacenter under the Creative Commons license. They released everything from the power supply and battery backup systems to the rack hardware, motherboards, chassis, battery cabinets, and even their electrical and mechanical construction specifications.

This is a gigantic step for open source hardware, for the evolution of the web and cloud computing, and for infrastructure and operations in general. This is the beginning of a shift that began with open source software, from vendors and consumers to a participatory and collaborative model. Jonathan explains:

"The ultimate goal of the Open Compute Project, however, is to spark a collaborative dialogue. We're already talking with our peers about how we can work together on Open Compute Project technology. We want to recruit others to be part of this collaboration — and we invite you to join us in this mission to collectively develop the most efficient computing infrastructure possible."

At the announcement this morning, Graham Weston of Rackspace announced that they would be participating in Open Compute, which is an ideal complement to the OpenStack cloud computing projects. Representatives from Dell and HP spoke at the announcement and also said that they would participate in this new project. The conversation has already begun.

March 02 2011

Four short links: 2 March 2011

  1. Unicode in Python, Completely Demystified -- a good introduction to Unicode in Python, which helped me with some code. (via Hacker News)
  2. A Ban on Brain-Boosting Drugs (Chronicle of Higher Education) -- Simply calling the use of study drugs "unfair" tells us nothing about why colleges should ban them. If such drugs really do improve academic performance among healthy students (and the evidence is scant), shouldn't colleges put them in the drinking water instead? After all, it would be unfair to permit wealthy students to use them if less privileged students can't afford them. As we start to hack our bodies and minds, we'll face more questions about legitimacy and ethics of those actions. Not, of course, about using coffee and Coca-Cola, ubiquitous performance-enhancing stimulants that are mysteriously absent from bans and prohibitions.
  3. Copywrongs -- Matt Blaze spits the dummy on IEEE and ACM copyright policies. In particular, the IEEE is explicitly preventing authors from distributing copies of the final paper. We write scientific papers first and last because we want them read. When papers were disseminated solely in print form it might have been reasonable to expect authors to donate the copyright in exchange for production and distribution. Today, of course, this model seems, at best, quaintly out of touch with the needs of researchers and academics who no longer desire or tolerate the delay and expense of seeking out printed copies of far-flung documents. We expect to find it on the open web, and not hidden behind a paywall, either.
  4. On the Engineering of SaaS -- An upgrade process, for example, is an entirely different beast. Making it robust and repeatable is far less important than making it quick and reversible. This is because the upgrade only ever happens once: on your install. Also, it only ever has to work right in one exact variant of the environment: yours. And while typical customers of software can schedule an outage to perform an upgrade, scheduling downtime in SaaS is nearly impossible. So, you must be able to deploy new releases quickly, if not entirely seamlessly — and in the event of failure, roll back just as rapidly.
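The quick-deploy, fast-rollback discipline in that last link is often implemented by keeping releases side by side and flipping a "current" pointer between them, so rollback is just repointing. The release names below are illustrative; in practice the pointer is usually a symlink or load-balancer target.

```python
# Sketch of pointer-flip deploys: cutover and rollback are both
# instant because releases are never modified in place.

releases = ["r101", "r102"]        # deployed release directories
current = releases[-1]             # pointer to the live release

def deploy(release):
    global current
    releases.append(release)
    current = release              # flip the pointer: near-instant cutover

def rollback():
    global current
    releases.pop()                 # discard the bad release
    current = releases[-1]         # repoint to the previous one

deploy("r103")
rollback()                         # r103 misbehaved; back to r102
print(current)
```

Because nothing is upgraded in place, the failure mode of a bad release is a pointer flip rather than a restore from backup.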

October 26 2010

Four short links: 26 October 2010

  1. 12 Months with MongoDB (Wordnik) -- every type of retrieval got faster than their old MySQL store, and there are some other benefits too. They note that the admin tools aren't really there for MongoDB, so "there is a blurry hand-off between IT Ops and Engineering." (via Hacker News)
  2. Dawn of a New Day -- Ray Ozzie's farewell note to Microsoft. Clear definition of the challenges to come: At first blush, this world of continuous services and connected devices doesn’t seem very different than today. But those who build, deploy and manage today’s websites understand viscerally that fielding a truly continuous service is incredibly difficult and is only achieved by the most sophisticated high-scale consumer websites. And those who build and deploy application fabrics targeting connected devices understand how challenging it can be to simply & reliably just ‘sync’ or ‘stream’. To achieve these seemingly simple objectives will require dramatic innovation in human interface, hardware, software and services. (via Tim O'Reilly on Twitter)
  3. A Civic Hacktivism Abecedary -- good ideas matched with exquisite quotes and language. My favourite: Kick at the darkness until it bleeds daylight. (via Francis Irving on Twitter)
  4. UI Guidelines for Mobile and Web Programming -- collection of pointers to official UI guidelines from Nokia, Apple, Microsoft, MeeGo, and more.

August 02 2010

Operations: The secret sauce revisited

Guest blogger Andrew Clay Shafer is helping telcos and hosting providers implement cloud services at Cloudscaling. He co-founded Reductive Labs, creators of Puppet, the configuration management framework. Andrew preaches the "infrastructure is code" gospel, and he supports approaches for applying agile methods to infrastructure and operations. Some of those perspectives were captured in his chapter in the O'Reilly book "Web Operations."

"Technical debt" is used two ways in the analysis of software systems. The phrase was first introduced in 1992 by Ward Cunningham to describe the premise that increased speed of delivery provides other advantages, and that the debt leveraged to gain those advantages should be strategically paid back.

Somewhere along the way, technical debt also became synonymous with poor implementation; reflecting the difference between the current state of a code base and an idealized one. I have used the term both ways, and I think they both have merit.

Technical debt can be extended and discussed along several additional axes: process debt, personnel debt, experience debt, user experience debt, security debt, documentation debt, etc. For this discussion, I won't quibble about the nuances of categorization. Instead, I want to take a high-level look at operations and infrastructure choices people make and the impact of those choices.

The technical debt metaphor

Debts are created by some combination of choice and circumstance. Modern economies are predicated on the flow of debt as much as anything else, but not all debt is created equal. There is a qualitative difference between a mortgage and carrying significant debt on maxed-out credit cards. The point being that there are a variety of ways to incur debt, and the quality of debts have different consequences.

Jesse Robbins' Radar post about operations as the secret sauce talked about boot strapping web startups in 80 hours. It included the following infographic showing the time cost of traditional versus special sauce operations:

I contend that the ongoing difference in time cost between the two solutions is the interest being paid on technical debt.

Understanding is really the crux of the matter. No one who really understands compound interest would intentionally make frivolous purchases on a credit card and not make every effort to pay down high interest debt. Just as no one who really understands web operations would create infrastructure with an exponentially increasing cost of maintenance. Yet, people do both of these things.

As the graph is projected out, the ongoing cost of maintenance in both projects reflects the maxim of "the rich get richer." One project can focus on adding value and differentiating itself in the market while the other will eventually be crushed under the weight of its own maintenance.
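The compound-interest analogy above can be made concrete with a toy model: two teams start with identical maintenance loads, but one team's load grows a little each week as unpaid debt accrues. The starting hours and the 5% weekly growth rate are assumptions chosen for illustration, not data from the original infographic.

```python
# Toy model of technical-debt "interest": flat vs compounding
# weekly maintenance cost over one year.

flat_hours = 4.0        # team A: steady maintenance per week
debt_hours = 4.0        # team B: starts the same...
growth = 1.05           # ...but grows 5% per week as debt accrues

total_a = total_b = 0.0
for week in range(52):
    total_a += flat_hours
    total_b += debt_hours * growth ** week

print(f"year of maintenance: A={total_a:.0f}h, B={total_b:.0f}h")
```

Even a modest weekly growth rate leaves team B spending several times team A's total by year's end, which is the "crushed under the weight of its own maintenance" outcome in miniature.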

Technical debt and the Big Ball of Mud

Without a counterbalancing investment, system and software architectures succumb to entropy and become more difficult to understand. The classic "Big Ball of Mud" by Brian Foote and Joseph Yoder catalogs forces that contribute to the creation of haphazard and undifferentiated software architectures. They are:

  • Time
  • Cost
  • Experience
  • Skill
  • Visibility
  • Complexity
  • Change
  • Scale

These same forces apply just as much to infrastructure and operations, especially if you understand the "infrastructure is code" mantra. If you look at the original "Tale of Two Ops Teams" graphic, both teams spent almost the same amount of time before the launch. If we assume that these are representative, then the difference between the two approaches is essentially experience and skill, which is likely to be highly correlated with cost. As the project moves forward, the difference in experience and skill reflects itself in how the teams spend time, provide visibility and handle complexity, change and scale.

Using this list, and the assumption that balls of mud are synonymous with high technical debt, combating technical debt becomes an exercise in minimizing the impact of these forces.

  • Time and cost are what they are, and often have an inverse relationship. From a management perspective, I would like everything now and for free, so everything else is a compromise. Undue time pressure will always result in something else being compromised. That compromise will often start charging interest immediately.
  • Experience is invaluable, but sometimes hard to measure and overvalued in technology. Doing the same thing over and over with a technology is not 10 years of experience; it is the first year of experience 10 times. Intangible experience should not be measured in time, and experience in this sense is related to skill.
  • Visibility has two facets in ops work: visibility into the design and responsibilities of the systems, and real-time metrics and alerting on the state of the system. The first allows us to take action; the second informs us that we should.
  • Complex problems can require complex solutions. Scale and change add complexity. Complexity obscures visibility and understanding.
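The second facet of visibility — real-time metrics and alerting — reduces to a simple loop: compare current readings against thresholds and fire on breaches. The metric names and limits below are illustrative assumptions, not from any particular monitoring system.

```python
# Minimal sketch of threshold-based alerting on live metrics.

thresholds = {"error_rate": 0.05, "p95_latency_ms": 800}

def check(metrics):
    """Return the list of metrics that should fire an alert."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

sample = {"error_rate": 0.11, "p95_latency_ms": 420}
print(check(sample))   # only error_rate is over its limit
```

Real systems layer on deduplication, escalation, and trend-based checks, but the core "informs us that we should act" function is just this comparison running continuously.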

Each of these forces and specific examples of how they impact infrastructure would fill a book, but hopefully that is enough to get people thinking and frame a discussion.

There is a force that may be missing from the "Big Ball of Mud": tools (which might be an oversight, might be an attempt to remain tool-agnostic, or might be considered a cross-cutting aspect of cost, experience and skill). That's not to say that tools don't add some complexity and the potential for technical debt as well. But done well, tools provide ongoing insight into how and why systems are configured the way they are, illumination of the complexity and connections of the systems, and a mechanism to rapidly implement changes. That is just an example. Every tool choice, from the operating system, to the web server, to the database, to the monitoring and more, has an impact on the complexity, visibility and flexibility of the systems, and therefore impacts operations effectiveness.

Many parallels can be drawn between operations and fire departments. One big difference is most fire departments don't spend much time actually putting out fires. If operations is reacting all the time, that indicates considerable technical debt. Furthermore, in reactive environments, the probability is high that the solutions of today are contributing to the technical debt and the fires of tomorrow.

Focus must be directed toward getting the fires under control in a way that doesn't contribute to future fires. The coarse metric of time spent reactively responding to incidents versus the time spent proactively completing ops-related projects is a great starting point for understanding the situation. One way to ensure operations is always a cost center is to keep treating it like one. When the flow of technical debt is understood and well managed, operations is certainly a competitive advantage.


June 21 2010

On the performance of clouds

Public clouds are based on the economics of sharing. Cloud providers can charge less, and sell computing on an hourly basis without long-term contracts, because they're spreading costs and skills across many customers.

But a shared model means that your application is competing with other users' applications for scarce resources. The pact you're making with a public cloud, for better or worse, is that the advantages of elasticity and pay-as-you-go economics outweigh any problems you'll face.

Enterprises are skeptical because clouds force them to relinquish control over the underlying networks and architectures on which their applications run. Is performance acceptable? Will clouds be reliable? What's the tradeoff, particularly now that we know speed matters so much?

We (Bitcurrent) decided to find out. With the help of Webmetrics, we built four test applications: a small object, a large object, a million calculations, and a 500,000-row table scan. We ported the applications to five different clouds, and monitored them for a month. We discovered that performance varies widely by test type and cloud:

cloud performance results

Here are some of the lessons learned:

  • All of the services handled the small image well.
  • PaaS clouds were more efficient at delivering the large object, possibly because of their ability to distribute workload out to caching tiers better than an individual virtual machine can do.
  • One service didn't handle CPU workloads well, even with a tenth of the load of other agents. Amazon was slow for CPU, but we were using the least-powerful of Amazon's EC2 machines.
  • Google's ability to handle I/O, even under heavy load, was unmatched. Rackspace also dispatched the I/O tests quickly. Then again, it took us 37 hours to insert the data into Google's Bigtable.

In the end, it's clear that there's no single "best" cloud: PaaS (App Engine) scales easily, but locks you in; IaaS (Rackspace, Amazon, Terremark) offers portability, but leaves you to do all the scaling work yourself.

The full 50-page report is available free from Webmetrics.

Web performance and cloud architecture will be key topics at this week's Velocity conference.

April 14 2010

Web operators are brain surgeons

As humans rely on the Internet for all aspects of our lives, our ability to think increasingly depends on fast, reliable applications. The web is our collective consciousness, which means web operators become the brain surgeons of our distributed nervous system.

Each technology we embrace makes us more and more reliant on the web. Armed with mobile phones, we forget phone numbers. Given personal email, we ditch our friends' postal addresses. With maps on our hips, we ignore the ones in our glovebox.

For much of the Western world, technology, culture, and society are indistinguishable. We're sneaking up on the hive mind, as the ubiquitous computing envisioned by Mark Weiser over 20 years ago becomes a reality. Today's web tells you what's interesting. It learns from your behavior. It shares, connects, and suggests. It's real-time and contextual. These connected systems augment humanity, and we rely on them more and more while realizing that dependency less and less. Twitter isn't a site; it's a message bus for humans.

The singularity is indeed near, and its grey matter is the web.

Now think what that means for those who make the web run smoothly. Take away our peripheral brains, and we're helpless. We'll suddenly be unable to do things we took for granted, much as a stroke victim loses the ability to speak. Take away our web, and we'll be unable to find our way, or translate text, or tap into the wisdom of crowds, or alert others to an emergency.

We're not ready for this. Alvin Toffler once said, "The future always arrives too fast ... and in the wrong order." A slowdown will feel like collective Alzheimer's. Web latency will make us sluggish, not only because thoughts travel more slowly, but also because delay makes us less productive. In 1981, IBM showed that as applications speed up, workers become exponentially more productive (pdf).

Web operators are responsible for keeping the grey matter running. As we become more dependent on our collective consciousness, web operators will be much more involved in end-user experience measurement, from application design to real user monitoring. They'll need to upgrade their sniffer skills to include psychology and cognitive modeling. And they'll track new metrics -- like productivity, number of tasks completed per hour, mistakes made, and so on -- along with their lower-level operational metrics.

They'll also be specialists, brought in to diagnose and repair complex problems. They'll have to drill down from high-level issues like poor adoption and high bounce rates into root causes: heavy page load, packet loss, BGP, big data constraints, caching, and so on. Finally, they'll become systems thinkers, understanding how the combination of data center, cloud, network, storage, and client technologies produce a particular end-user experience.

So give your web operator some respect. Forget the central nervous system; it's the century of the distributed nervous system, and web operators are its brain surgeons.
