August 03 2012

Four short links: 3 August 2012

  1. Urban Camouflage Workshop -- Most of the day was spent crafting urban camouflage intended to hide the wearer from the Kinect computer vision system. By the end of the workshop we understood how to dress to avoid detection for the three different Kinect formats. (via Beta Knowledge)
  2. Starting a Django Project The Right Way (Jeff Knupp) — I wish more people did this: it’s not enough to learn syntax these days. Projects live in a web of best practices for source code management, deployment, testing, and migrations.
  3. FailCon -- a one-day conference for technology entrepreneurs, investors, developers and designers to study their own and others’ failures and prepare for success. Figure out how to learn from failures; they’re far more common than successes. (via Krissy Mo)
  4. Google Fiber in the Real World (GigaOM) -- These tests show one of the limitations of Google’s Fiber network: other services. Since Google Fiber is providing virtually unheard of speeds for their subscribers, companies like Apple and I suspect Hulu, Netflix and Amazon will need to keep up. Are you serving DSL speeds to fiber customers? (via Jonathan Brewer)

April 26 2012

Design your website for a graceful fail

Websites go down. It happens. But in many cases it might be possible to deal with and explain a failure while keeping user frustration to a minimum.

Mike Brittain (@mikebrittain), director of engineering at Etsy, addressed the resilient user experience in our recent interview. Among his insights from the full interview (below):

  • Designing an experience that can adapt to individual service failures and partial degradations requires an intermingling between software engineers, operations teams and product and design teams.
  • Previous experience designing for cable-connected devices may skew our connectivity expectations when it comes to more fragile mobile networks.

Brittain will expand on these ideas and more in his keynote address "Building Resilient User Experiences" at Velocity 2012 in June.

Our full interview follows.

What is a "resilient" user experience — and what are a few of the main practices involved in ensuring an acceptable UX during an outage?

Mike Brittain: Resilient user experiences are adaptable to individual failure modes within the system, allowing users to continue to use the service even in a partially degraded scenario.

Large-scale websites are driven by numerous databases, APIs, and other back-end services. Without thoughtful application design, any failure in an individual service might bubble up as a generic "Server Error." This sort of response completely blocks the user from any further experience and has the potential to degrade the user's confidence in your website, software or brand.

Consider an article page on the New York Times' website. There is the primary content of the page: the body of the article itself. And then there are all sorts of ancillary content and modules, such as social sharing tools, personalization details if you're signed-in, comments, articles recommended for you, most emailed articles, advertisements, etc. If something were to go wrong while retrieving the primary content for the page — the article body — you might not be able to provide anything meaningful to the reader. But if one or more services failed for generating any of those ancillary modules, it's likely to have a much lower impact on the reader. So, a resilient design would allow for any of those individual modules to fail gracefully without blocking the reader from completing the primary action on the site — reading news.

Here's another example closer to my own heart: The primary action for visitors to Etsy is to find, review, and purchase handcrafted goods. A product page on Etsy.com includes all sorts of ancillary information and tools, including a mechanism for marking a product as a "favorite." If the Favorites system goes down, we wouldn't want to return an error page to the visitor. Instead, we would hide the tool altogether. Meanwhile, visitors can continue to find and purchase products during this degradation. In fact, many of them may be blissfully unaware that the feature even exists while it is unavailable.

In the DevOps culture, we see increasing intermingling of experience and knowledge between software engineers and operations teams. Engineers who understand well how their software is operated, and the interplay between various services and back-ends, often understand failure modes and can adapt. Their software and hardware architecture may take advantage of patterns like redundant services, failover services, or retry attempts after failures.

Resilient user experiences require further intermingling with product and design teams. Product design is focused almost entirely on user experience when the system is assumed to be working properly. So, we need to have product designers commingling with engineers to better understand individual failure modes and to plan for them.

Do these interface practices vary for desktops/laptops versus mobile or tablets?

Mike Brittain: These principles apply to any user interface. But as we move into using more mobile devices and networks, we need to consider the relative fragility of the network that connects our software (e.g. a smartphone app) to servers on the Internet.

Our design process may be hampered by our prior experiences in which computers and web browsers connected to the Internet by physical cables suffered relatively low network failure rates. As such, our expectations may be that the network is seldom a failure point. We're moving rapidly into a world where mobile software connects to back-end services over cellular data networks — not to mention that the handset may be moving at high speed by car or train. So, we need to design resilience into our UIs anywhere we depend on network availability for data.

How do you set up a front-end to fail gracefully?

Mike Brittain: Front-end could mean client-side, or it could refer to the forward-most server-side script in the request flow, which talks to other back-end services to generate the HTML for a web page. Both situations are valid for resilient design.

In designing resilient UIs, you expect failures in each and every back-end service. Examples might include connection failures, connection timeouts, response timeouts, or corrupted/incomplete data in a response. A resilient UI traps these failures at a low level and provides a usable response, rather than throwing a general exception that causes the entire page to fail.
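As a rough illustration of trapping failures at a low level, here is a minimal Python sketch. It is not Etsy's code: the names `ServiceError`, `load_module`, and `render_page`, and the three fetcher functions, are hypothetical and exist only for this example. Each ancillary module is fetched behind a wrapper that turns a failure into "nothing to show," so only a failure of the primary content can fail the page.

```python
from typing import Callable, Dict, Optional


class ServiceError(Exception):
    """Hypothetical stand-in for a back-end failure or corrupt response."""


def load_module(fetch: Callable[[], str]) -> Optional[str]:
    """Call a back-end fetcher; return None instead of raising on failure."""
    try:
        return fetch()
    except (ServiceError, TimeoutError, ConnectionError):
        return None  # the caller hides the module rather than failing the page


def render_page(primary: Callable[[], str],
                ancillary: Dict[str, Callable[[], str]]) -> str:
    # The primary content is not wrapped: without it there is nothing to show.
    parts = [primary()]
    for name, fetch in ancillary.items():
        html = load_module(fetch)
        if html is not None:
            parts.append(f"[{name}] {html}")
    return "\n".join(parts)


def article_body() -> str:
    return "Article text"


def comments() -> str:
    return "3 comments"


def favorites() -> str:
    raise ServiceError("favorites back-end unreachable")


page = render_page(article_body, {"comments": comments, "favorites": favorites})
print(page)
```

In this sketch the failing favorites module simply does not appear in the output, while the article and comments render as usual.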

On the client side, this could mean detecting failures in Ajax responses and either allowing the user experience to continue unblocked or retrying after a given amount of time. This could happen during page render, or during a user interaction. Those familiar with Gmail may recognize that during periods of network congestion or back-end failures, the small status message that reads "sending" when you send an email sometimes changes to "still trying …" or "offline." This is preferable to a general "failed to send email" after a single attempt.
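The retry behavior described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not how Gmail works: `send_with_retry`, its parameters, the "sent" status, and the `flaky_send` stand-in are all hypothetical. The `sleep` function is injectable so the demo runs without actually waiting.

```python
import time
from typing import Callable, List


def send_with_retry(send: Callable[[], None],
                    on_status: Callable[[str], None],
                    attempts: int = 3,
                    delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send, reporting progress to the UI instead of failing outright."""
    on_status("sending")
    for attempt in range(1, attempts + 1):
        try:
            send()
            on_status("sent")
            return True
        except ConnectionError:
            if attempt < attempts:
                on_status("still trying ...")
                sleep(delay * attempt)  # wait a little longer before each retry
    on_status("offline")
    return False


# Demo: a hypothetical sender that fails twice, then succeeds.
calls = {"n": 0}


def flaky_send() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network congestion")


statuses: List[str] = []
ok = send_with_retry(flaky_send, statuses.append, sleep=lambda seconds: None)
print(ok, statuses)
```

The user sees a sequence of status changes rather than a single hard failure, and the interface stays usable throughout.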

Some general patterns for resilient UI include:

  • Disable or hide features that are failing.
  • Provide fallback (default) content in place of dynamic content or features that cannot be reached or displayed.
  • Avoid behaviors that block UI display or interaction.
  • Detect service failures and allow for retries.
  • Fail over to redundant services.
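Two of the patterns listed above, failing over to a redundant service and falling back to default content, can be combined in one small helper. The following Python sketch is illustrative only; `first_available` and the three fetcher functions are hypothetical names, and a real system would likely add timeouts and logging.

```python
from typing import Callable, Sequence


def first_available(sources: Sequence[Callable[[], str]], default: str) -> str:
    """Try each redundant source in order; use default content if all fail."""
    for fetch in sources:
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            continue  # fail over to the next source
    return default


def primary_recs() -> str:
    raise TimeoutError("primary recommendations service timed out")


def replica_recs() -> str:
    return "Recommended for you: A, B, C"


def dead_replica() -> str:
    raise ConnectionError("replica unreachable")


with_failover = first_available([primary_recs, replica_recs], "Most popular items")
with_fallback = first_available([primary_recs, dead_replica], "Most popular items")
print(with_failover)
print(with_fallback)
```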

Systems engineers may recognize these patterns in low-level services or protocols. But these patterns are not as familiar to front-end engineers, product developers, and designers — who plan more around success than around failure. I don't mean for that statement to be divisive, but I do think it's true of the current state of how we build software and how we build the web.

How do you make your community aware of a failure?

Mike Brittain: In the case of small failures, the idea is to obscure the failure in a way that it does not block the primary use case for the site (e.g. we don't shut down product pages because the Favorites service is failing). Your community may not need much communication around this.

When things really go wrong, you want to be upfront and clear about failures. Use specific terms, rather than general. Provide context of time and estimated time to resolution whenever possible. If you have a service that fails and will be unavailable until you restore data over a period of, say, three hours, it's better to tell your visitors to check back in three hours than to have them hammering the refresh button on their browser for 20 minutes as they build up frustration.

You want to make sure this information is within reach for your users. I actually think at Etsy we have some pretty good patterns for this. We start with a status blog that is hosted outside of our primary network and should be available even if our data center is unreachable. Most service warnings or error messages on Etsy.com will provide a link to this blog. And anytime we have a service outage posted to this blog, a service warning is automatically posted at the top of any pages within our community forums and anywhere else that members would go looking for help on our site.

In your Velocity 2012 keynote summary, you mention "validating failure scenarios with 'game days'." What's a game day and how does it work?

Mike Brittain: The term "game day" describes an exercise that tests some failure scenario in production. These drills are used to test hypotheses about how our systems will react to specific failures. They also surface any surprises about how the system reacts while we are actively observing.

We do this in production because development, testing, and staging environments are seldom 100% symmetric with production. You may have different numbers of machines, different volumes of data, or simulated versus live traffic. The downside is that these drills will impact real visitors. The upside is that you build real confidence within your team and exercise your abilities to cope with real failures.

We regularly test configuration flags across our site to ensure that we haven't unwired configuration logic for features we have been patching or improving. We also want to confirm that the user experience degrades gracefully when the flag is turned off. For example, when we disable the Favorites service on our site, we expect reads and writes to the data store to stop and we would expect various parts of the UI to hide the Favorites tools. Our game day would allow us to prove these out.

We would be surprised to find that disabling Favorites causes entire pages on the site to fail, rather than to degrade gracefully. We would be surprised if some processes continued to read from or write to the service while the config flag was disabled. And we would be further surprised to find unrelated services failing outright when the Favorites service was disabled. These are scenarios that might not be observed by simulated testing outside of production.
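The Favorites hypothesis described above (with the flag off, the tool is hidden and the data store sees no reads or writes) can be written down as a small check. The Python sketch below is a toy simulation, not Etsy's tooling and not a production drill: `FavoritesStore`, `FLAGS`, `render_product`, and `favorites_drill` are hypothetical names invented for this example.

```python
from typing import Dict, List, Set, Tuple


class FavoritesStore:
    """Hypothetical data store that counts reads and writes for observation."""

    def __init__(self) -> None:
        self.reads = 0
        self.writes = 0
        self._items: Set[Tuple[str, str]] = set()

    def add(self, user: str, item: str) -> None:
        self.writes += 1
        self._items.add((user, item))

    def contains(self, user: str, item: str) -> bool:
        self.reads += 1
        return (user, item) in self._items


FLAGS: Dict[str, bool] = {"favorites_enabled": True}


def render_product(store: FavoritesStore, user: str, item: str) -> List[str]:
    widgets = [f"Product: {item}", "[Buy]"]
    if FLAGS["favorites_enabled"]:
        # Only touch the Favorites store when the feature is switched on.
        label = "Unfavorite" if store.contains(user, item) else "Favorite"
        widgets.append(f"[{label}]")
    return widgets


def favorites_drill(store: FavoritesStore) -> Tuple[List[str], bool]:
    """Turn the flag off, render a page, and report whether the store was touched."""
    previous = FLAGS["favorites_enabled"]
    FLAGS["favorites_enabled"] = False
    try:
        before = (store.reads, store.writes)
        page = render_product(store, "alice", "mug")
        untouched = (store.reads, store.writes) == before
    finally:
        FLAGS["favorites_enabled"] = previous  # always restore the flag
    return page, untouched


store = FavoritesStore()
store.add("alice", "mug")
normal_page = render_product(store, "alice", "mug")
drill_page, untouched = favorites_drill(store)
print(normal_page)
print(drill_page, untouched)
```

A check like this only covers the code paths it exercises, which is part of why the interview argues for running the real drill in production.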

This interview was edited and condensed.

Associated photo on home and category pages: 404 error message something went wrong by IvanWalsh.com, on Flickr

April 24 2012

Four short links: 24 April 2012

  1. 3D-Printing Pharmaceuticals (BoingBoing) -- Prof Cronin added: "3D printers are becoming increasingly common and affordable. It's entirely possible that, in the future, we could see chemical engineering technology which is prohibitively expensive today filter down to laboratories and small commercial enterprises. Even more importantly, we could use 3D printers to revolutionise access to health care in the developing world, allowing diagnosis and treatment to happen in a much more efficient and economical way than is possible now."
  2. Bolt Action Tactical Pen (Uncrate) -- silliness.
  3. Ken Robinson's Sunday Sermon (Vimeo) -- In our culture, not to know is to be at fault socially… People pretend to know lots of things they don’t know. Because the worst thing to do is appear to be uninformed about something, to not have an opinion… We should know the limits of our knowledge and understand what we don’t know, and be willing to explore things we don’t know without feeling embarrassed of not knowing about them. If you work with someone who hides ignorance or failure, you're working with a timebomb and one of your highest priorities should be to change that mindset or replace the person. (via Maria Popova)
  4. Using Android Camera in HTML Apps (David Calhoun) -- From your browser you can now upload pictures and videos from the camera as well as sounds from the microphone. The returned data should be available to manipulate via the File API. (via Josh Clark)

June 27 2011

Velocity 2011 retrospective

This was my third Velocity conference, and it's been great to see it grow. Smoothly running a conference with roughly 2,000 registered attendees (well over 50% more than 2010) is itself a testament to the importance of operations.

Velocity 2011 had many highlights, and below you'll find mine. I'll confess to some biases up front: I'm more interested in operations than development, so if you think mobile and web performance are underrepresented here, you're probably right. There was plenty of excellent content in those tracks, so please add your thoughts about those and other sessions in the comments area.

Blame and resilience engineering

I was particularly impressed by John Allspaw's session on "Advanced Post-mortem Fu." To summarize it in a sentence, it was about getting beyond blame. The job of a post-mortem analysis isn't to assign blame for failure, but to realize that failure happens and to plan the conditions under which failure will be less likely in the future. This isn't just an operational issue; moving beyond blame is a broader cultural imperative. Cultural historians have made much of the transition from shame culture — where if you failed you were "shamed" and obliged to leave the community — to guilt culture, where shame is internalized (the world of Hawthorne's "Scarlet Letter"). Now, we're clearly in a "blame culture," where it's always someone else's fault and nothing is ever over until the proper people have been blamed (and, more than likely, sued). That's not a way forward, for web ops any more than finance or medicine. John presented some new ways for thinking about failure, studying it, and making it less likely without assigning blame. There is no single root cause; many factors contribute to both success and failure, and you won't understand either until you take the whole system into account. Once you've done that, you can work on ways to improve operations to make failure less likely.

John's talk raised the idea of "resilience engineering," an important theme that's emerging from the Velocity culture. Resilience engineering isn't just about making things that work; anyone can do that. It's about designing systems that stay running in the face of problems. Along similar lines, Justin Sheehy's talk was specifically about resilience in the design of Riak. It was fascinating to see how to design a distributed database so that any node could suddenly disappear with no loss of data. Erlang, which Riak uses, encourages developers to partition tasks into small pieces that are free to crash, running under a supervisor that restarts failed tasks.
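The "free to crash, restarted by a supervisor" idea can be shown in miniature. The sketch below is plain Python, not Erlang/OTP and not Riak's code; `supervise` and `flaky_worker` are hypothetical names, and a real supervisor would run tasks concurrently and apply restart policies rather than loop in place.

```python
from typing import Callable, List


def supervise(task: Callable[[], str], max_restarts: int = 3) -> str:
    """Run a task, restarting it after a crash up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return task()
        except Exception:
            if restarts >= max_restarts:
                raise  # give up and let the failure propagate upward
            restarts += 1


attempts: List[int] = []


def flaky_worker() -> str:
    # Hypothetical small task that crashes on its first two runs.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("worker crashed")
    return "done"


result = supervise(flaky_worker)
print(result, len(attempts))
```

The worker itself contains no error handling; recovery is the supervisor's job, which keeps each small task simple.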

Bryan Cantrill's excellent presentation on instrumenting the real-time web using Node.js and DTrace would win my vote for the best technical presentation of the conference, but it was most notable for his rant on Greenland and Antarctica's plot to take over the world. While the rant was funny, it's important not to forget the real message: DTrace is an underused but extremely flexible tool that can tell you exactly what is going on inside an application. It's more complex than other profiling tools I've seen, but in return for complexity, it lets you specify exactly what you want to know, and delivers results without requiring special compilation or even affecting the application's performance.

Data and performance

John Rauser's workshop on statistics ("Decisions in the Face of Uncertainty"), together with his keynote ("Look at your Data"), was another highlight. The workshop was an excellent introduction to working with data, an increasingly important tool for anyone interested in performance. But the keynote took it a step further, going beyond the statistics and looking at the actual raw data, spread across the living room floor. That was a powerful reminder that summary statistics are not always the last word in data: the actual data, the individual entries in your server logs, may hold the clues to your performance problem.
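A tiny example of why the raw entries matter: the numbers below are made up for illustration (they are not from the talk), and show how one slow request distorts the mean while the median hides it entirely. Only looking at the individual values reveals what happened.

```python
import statistics

# Hypothetical response times in milliseconds; one request was very slow.
latencies_ms = [110, 120, 115, 118, 112, 121, 117, 4000, 119, 116]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
slowest_ms = max(latencies_ms)

print(f"mean={mean_ms} median={median_ms} slowest={slowest_ms}")
```

Here the mean suggests every request took about half a second, the median suggests nothing is wrong, and the raw data shows nine fast requests and one four-second outlier.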

Velocity observations and overarching themes

There were many other memorable moments at Velocity (Steve Souders' belly dance wasn't one of them). I was amazed that Sean Power managed to do an Ignite Karaoke (a short impromptu presentation against a set of slides he didn't see in advance) that wasn't just funny, but actually almost made sense.

I could continue, but I would end up listing every session I attended; my only regret is that I couldn't attend more. Video for the conference keynotes is available online, so you can catch up on some of what you missed. The post-conference online access pass provides video for all the sessions for which presenters gave us permission.

The excellent sessions aren't the only news from Velocity. The Velocity Women's Networking Meetup had more than double the previous year's attendance; the group photo (right) has more people than I can count. The job board was stuffed to the gills. The exhibit hall dwarfed 2010's (I'd guess we had three times as many exhibitors), and there were doughnuts! But more than the individual sessions, the exhibits, the food, or the parties, I'll remember the overarching themes of cultural change; the many resources available for studying and improving performance; and most of all, the incredible people I met, all of whom contributed to making this conference a success.

We'll see you at the upcoming Velocity Europe in November and Velocity China in December, and next year at Velocity 2012 in California.




February 03 2011

Four short links: 3 February 2011

  1. Curveship -- a new interactive fiction system that can tell the same story in many different ways. Check out the examples on the home page. Important because interactive fiction and the command-lines of our lives are inextricably intertwined.
  2. Egypt's Revolution: Coming to an Economy Near You (Umair Haque) -- more dystopic prediction, but this phrase rings true: The lesson: You can't steal the future forever — and, in a hyperconnected world, you probably can't steal as much of it for as long.
  3. Why Startups Fail -- failure is a more instructive teacher than success, so simply studying successful startups isn't enough. (via Hacker News)
  4. Computer Science and Philosophy -- Oxford is offering a program studying CS and Philosophy together. The two disciplines share a broad focus on the representation of information and rational inference, embracing common interests in algorithms, cognition, intelligence, language, models, proof, and verification. Computer Scientists need to be able to reflect critically and philosophically about these, as they push forward into novel domains. Philosophers need to understand them within a world increasingly shaped by computer technology, in which a whole new range of enquiry has opened up, from the philosophy of AI, artificial life and computation, to the ethics of privacy and intellectual property, to the epistemology of computer models (e.g. of global warming). I wish every CS student had taken a course in ethics.
