August 12 2013

Four short links: 13 August 2013

  1. How Things Work: Summer Games Edition — admire the real craftsmanship in those early games. This has a great description of using raster interrupts to extend the number of sprites, and how and why double-buffering was expensive in terms of memory.
  2. IAMA: Etsy Ops Team (Reddit) — the Etsy ops team does an IAMA on Reddit. Everything from uptime to this sage advice about fluid data: A nice 18 year old Glenfiddich scales extremely well, especially if used in an active active configuration with a glass in each hand. The part of Scotland where Glenfiddich is located also benefits from near-permanent exposure to the Cloud (several clouds in fact). (via Nelson Minar)
  3. Who Learns What When You Log Into Facebook (Tim Bray) — nice breakdown of who learns what and how, part of Tim’s work raising the quality of conversation about online federated identity.
  4. lolcommits — takes a photo of the programmer on each git commit. (via Nelson Minar)

November 09 2012

June 14 2012

One Short Link: 14 June 2012

Etsy did something significant. I'm not talking about funding scholarships to Hacker School, though kudos to Etsy, 37Signals, and Yammer for putting money into it. And serious respect to Marc Hedlund for putting it together—he didn't just submit a bug report on the world, he submitted a patch. Marc's Ignite talk at Foo about this was incredibly moving: he accomplished something at scale, something beyond a single hiring decision.

What I find truly significant is the stark quantification of the untapped (previously uninvited) interest: 661 women applied where 7 had applied before. The number of scholarships and the size of the programming class were dwarfed by the number of women who wanted in, and jubilation at the success of the Etsy campaign has to be accompanied by serious thought about how to tackle the next order of magnitude in scale. And because it's a problem worthy of your cleverness, I've made this the only short link today. Use the time you would have spent reading about Map/Reduce and devops to solve this scaling problem instead—you'll truly be working on something that matters.

April 26 2012

Design your website for a graceful fail

Websites go down. It happens. But in many cases it might be possible to deal with and explain a failure while keeping user frustration to a minimum.

Mike Brittain (@mikebrittain), director of engineering at Etsy, addressed the resilient user experience in our recent interview. Among his insights from the full interview (below):

  • Designing an experience that can adapt to individual service failures and partial degradations requires an intermingling between software engineers, operations teams and product and design teams.
  • Previous experience designing for cable-connected devices may skew our connectivity expectations when it comes to more fragile mobile networks.

Brittain will expand on these ideas and more in his keynote address "Building Resilient User Experiences" at Velocity 2012 in June.

Our full interview follows.

What is a "resilient" user experience — and what are a few of the main practices involved in ensuring an acceptable UX during an outage?

Mike Brittain: Resilient user experiences are adaptable to individual failure modes within the system, allowing users to continue to use the service even in a partially degraded scenario.

Large-scale websites are driven by numerous databases, APIs, and other back-end services. Without thoughtful application design, any failure in an individual service might bubble up as a generic "Server Error." This sort of response completely blocks the user from any further experience and has the potential to degrade the user's confidence in your website, software or brand.

Consider an article page on the New York Times' website. There is the primary content of the page: the body of the article itself. And then there are all sorts of ancillary content and modules, such as social sharing tools, personalization details if you're signed-in, comments, articles recommended for you, most emailed articles, advertisements, etc. If something were to go wrong while retrieving the primary content for the page — the article body — you might not be able to provide anything meaningful to the reader. But if one or more services failed for generating any of those ancillary modules, it's likely to have a much lower impact on the reader. So, a resilient design would allow for any of those individual modules to fail gracefully without blocking the reader from completing the primary action on the site — reading news.

Here's another example closer to my own heart: The primary action for visitors to Etsy is to find, review, and purchase handcrafted goods. A product page on Etsy.com includes all sorts of ancillary information and tools, including a mechanism for marking a product as a "favorite." If the Favorites system goes down, we wouldn't want to return an error page to the visitor. Instead, we would hide the tool altogether. Meanwhile, visitors can continue to find and purchase products during this degradation. In fact, many of them may be blissfully unaware that the feature even exists while it is unavailable.

In the DevOps culture, we see increasing intermingling of experience and knowledge between software engineers and operations teams. Engineers who understand well how their software is operated, and the interplay between various services and back-ends, often understand failure modes and can adapt. Their software and hardware architecture may take advantage of patterns like redundant services, failover services, or retry attempts after failures.

Resilient user experiences require further intermingling with product and design teams. Product design is focused almost entirely on user experience when the system is assumed to be working properly. So, we need to have product designers commingling with engineers to better understand individual failure modes and to plan for them.

Do these interface practices vary for desktops/laptops versus mobile or tablets?

Mike Brittain: These principles apply to any user interface. But as we move into using more mobile devices and networks, we need to consider the relative fragility of the network that connects our software (e.g. a smartphone app) to servers on the Internet.

Our design process may be hampered by our prior experiences in which computers and web browsers connected to the Internet by physical cables suffered relatively low network failure rates. As such, our expectations may be that the network is seldom a failure point. We're moving rapidly into a world where mobile software connects to back-end services over cellular data networks — not to mention that the handset may be moving at high speed by car or train. So, we need to design resilience into our UIs anywhere we depend on network availability for data.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

How do you set up a front-end to fail gracefully?

Mike Brittain: Front-end could mean client-side, or it could refer to the forward-most server-side script in the request flow, which talks to other back-end services to generate the HTML for a web page. Both situations are valid for resilient design.

In designing resilient UIs, you expect failures in each and every back-end service. Examples might include connection failures, connection timeouts, response timeouts, or corrupted/incomplete data in a response. A resilient UI traps these failures at a low level and provides a usable response, rather than throwing a general exception that causes the entire page to fail.
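
As an illustration of trapping failures at a low level, here is a minimal Python sketch of a server-side page renderer. It is not Etsy's code: `BackendError`, `render_module`, and `render_product_page` are hypothetical names, and the sketch assumes each module is fetched by a callable that raises on failure.

```python
from typing import Callable, Dict


class BackendError(Exception):
    """Hypothetical error for bad or incomplete data from a back-end service."""


def render_module(fetch: Callable[[], str], fallback: str = "") -> str:
    """Return the module's HTML, or the fallback if the back-end call fails."""
    try:
        return fetch()
    except (BackendError, TimeoutError, ConnectionError):
        # Trap the failure here so it never bubbles up as a page-wide error.
        return fallback


def render_product_page(primary: Callable[[], str],
                        ancillary: Dict[str, Callable[[], str]]) -> str:
    """Render the primary content plus whichever ancillary modules succeed."""
    # The primary content is not wrapped: without it there is no meaningful page.
    parts = [primary()]
    for fetch in ancillary.values():
        parts.append(render_module(fetch))
    # Failed modules render as empty strings and are simply left out.
    return "\n".join(part for part in parts if part)
```

With this shape, a failing Favorites call drops only the Favorites module; the listing and the remaining modules still render.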

On the client side, this could mean detecting failures in Ajax responses and either allowing the user experience to continue unblocked or retrying after a given amount of time. This could be during page render, or maybe during a user interaction. Those familiar with Gmail may recognize that during periods of network congestion or back-end failures, the small status message that reads, "sending," when you send an email sometimes changes to "still trying …" or "offline." This is preferred over a general "failed to send email" after a single attempt.
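
The retry-with-status behavior described above can be sketched as follows. This is an assumption-laden illustration in Python rather than browser code, and it is not how Gmail is implemented: `send_with_retries`, its parameters, and the final "sent" status are hypothetical.

```python
import time
from typing import Callable


def send_with_retries(send: Callable[[], None],
                      report: Callable[[str], None],
                      max_attempts: int = 3,
                      delay_seconds: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send, updating a status message instead of failing outright."""
    report("sending")
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            report("sent")  # hypothetical success status
            return True
        except (ConnectionError, TimeoutError):
            if attempt < max_attempts:
                # Keep the user informed and wait before the next attempt.
                report("still trying …")
                sleep(delay_seconds)
    # Every attempt failed: say so, but only after retrying.
    report("offline")
    return False
```

The `sleep` parameter is injectable so the delay can be skipped in tests; a real client would more likely schedule the retry asynchronously than block.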

Some general patterns for resilient UI include:

  • Disable or hide features that are failing.
  • Provide fallback (default) content in place of dynamic content or features that cannot be reached or displayed.
  • Avoid behaviors that block UI display or interaction.
  • Detect service failures and allow for retries.
  • Failover to redundant services.
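
The last pattern in the list, failover to redundant services, might look like this in a minimal Python sketch. `fetch_with_failover` and the idea of passing replicas as callables are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def fetch_with_failover(replicas: Sequence[Callable[[], str]],
                        default: Optional[str] = None) -> Optional[str]:
    """Return the first successful response, trying each replica in order."""
    for fetch in replicas:
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            continue  # this replica failed; move on to the next one
    # No replica answered: fall back to default content rather than raising.
    return default
```

Returning a default when every replica fails combines failover with the fallback-content pattern, so the caller never has to handle an exception in the display path.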

Systems engineers may recognize these patterns in low-level services or protocols. But these patterns are not as familiar to front-end engineers, product developers, and designers — who plan more around success than around failure. I don't mean for that statement to be divisive, but I do think it's true of the current state of how we build software and how we build the web.

How do you make your community aware of a failure?

Mike Brittain: In the case of small failures, the idea is to obscure the failure in a way that it does not block the primary use case for the site (e.g. we don't shut down product pages because the Favorites service is failing). Your community may not need much communication around this.

When things really go wrong, you want to be upfront and clear about failures. Use specific terms, rather than general. Provide context of time and estimated time to resolution whenever possible. If you have a service that fails and will be unavailable until you restore data over a period of, say, three hours, it's better to tell your visitors to check back in three hours than to have them hammering the refresh button on their browser for 20 minutes as they build up frustration.

You want to make sure this information is within reach for your users. I actually think at Etsy we have some pretty good patterns for this. We start with a status blog that is hosted outside of our primary network and should be available even if our data center is unreachable. Most service warnings or error messages on Etsy.com will provide a link to this blog. And anytime we have a service outage posted to this blog, a service warning is automatically posted at the top of any pages within our community forums and anywhere else that members would go looking for help on our site.

In your Velocity 2012 keynote summary, you mention "validating failure scenarios with 'game days'." What's a game day and how does it work?

Mike Brittain: The term "game day" describes an exercise that tests some failure scenario in production. These drills are used to test hypotheses about how our systems will react to specific failures. They also surface any surprises about how the system reacts while we are actively observing.

We do this in production because development, testing, and staging environments are seldom 100% symmetric with production. You may have different numbers of machines, different volumes of data, or simulated versus live traffic. The downside is that these drills will impact real visitors. The upside is that you build real confidence within your team and exercise your abilities to cope with real failures.

We regularly test configuration flags across our site to ensure that we haven't unwired configuration logic for features we have been patching or improving. We also want to confirm that the user experience degrades gracefully when the flag is turned off. For example, when we disable the Favorites service on our site, we expect reads and writes to the data store to stop and we would expect various parts of the UI to hide the Favorites tools. Our game day would allow us to prove these out.

We would be surprised to find that disabling Favorites causes entire pages on the site to fail, rather than to degrade gracefully. We would be surprised if some processes continued to read from or write to the service while the config flag was disabled. And we would be further surprised to find unrelated services failing outright when the Favorites service was disabled. These are scenarios that might not be observed by simulated testing outside of production.
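
The expectations above (no reads or writes while the flag is off, and the UI hiding the tool) can be expressed as a small check. This is a hypothetical Python sketch, not Etsy's configuration system: `FavoritesStore`, the `favorites_enabled` flag name, and both functions are invented for illustration.

```python
class FavoritesStore:
    """Hypothetical data store that counts reads and writes."""

    def __init__(self):
        self.reads = 0
        self.writes = 0
        self._data = set()

    def is_favorite(self, user, item):
        self.reads += 1
        return (user, item) in self._data

    def add(self, user, item):
        self.writes += 1
        self._data.add((user, item))


def favorite_button(flags, store, user, item):
    """Return the button HTML, or an empty string when the feature is off."""
    if not flags.get("favorites_enabled", False):
        return ""  # hide the tool and do not touch the store
    label = "Unfavorite" if store.is_favorite(user, item) else "Favorite"
    return f"<button>{label}</button>"


def add_favorite(flags, store, user, item):
    """Record a favorite only while the feature flag is on."""
    if not flags.get("favorites_enabled", False):
        return False
    store.add(user, item)
    return True
```

A drill would flip the flag off, exercise the pages, and then confirm that the store's read and write counters did not move.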

This interview was edited and condensed.

Associated photo on home and category pages: 404 error message something went wrong by IvanWalsh.com, on Flickr

April 12 2012

Four short links: 12 April 2012

  1. Big Data in Finance (PDF, 9M) -- Algo trading systems have begun to resemble an arms race. Competition, data, and the race for real-time.
  2. A Parent's Guide to 21st Century Learning (Edutopia, free registration required to download) -- What should collaboration, creativity, communication, and critical thinking look like in a modern classroom? How can parents help educators accomplish their goals? We hope this guide helps bring more parents into the conversation about improving education. (via Derek Wenmoth)
  3. Chess Intelligence and Winning -- survey of IQ gaps between contestants needed to win competitions. We could view cops and killers as being involved in a grim contest. In the USA around 65% of all murders are solved. That converts to an average “murder” ELO rating difference between police and murderers of 108 ELO points. It is also known that the mean IQs of murderers and policemen are 87 and 102, respectively. So if successfully solving murders is a puzzle, then the “a” coefficient is 0.041, and each IQ point difference is worth 7.2 ELO points. I suspect this is masturbatory math extrapolation rather than anything significant or predictive, but the cops-vs-robbers IQ contest was an interesting angle. (via Dr Data's Blog)
  4. Etsy Hacker Grants: Supporting Women in Technology -- Today, in conjunction with Hacker School, Etsy is announcing a new scholarship and sponsorship program for women in technology: we’ll be hosting the summer 2012 session of Hacker School in the Etsy headquarters, and we’re providing ten Etsy Hacker Grants of $5,000 each — a total of $50,000 — to women who want to join but need financial support to do so. Our goal is to bring 20 women to New York to participate, and we hope this will be the first of many steps to encourage more women into engineering at Etsy and across the industry.
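
The Elo figures quoted in item 3 can be reproduced with a short calculation. This sketch assumes the standard Elo expected-score formula, which the excerpt does not spell out, and treats the 65% solve rate as the police "win" probability; `elo_gap` is a hypothetical helper name.

```python
import math


def elo_gap(win_probability: float) -> float:
    """Rating difference implied by a win probability under the standard Elo model."""
    return 400 * math.log10(win_probability / (1 - win_probability))


gap = round(elo_gap(0.65))          # 65% of murders solved
iq_difference = 102 - 87            # mean IQs quoted in the excerpt
print(gap)                          # 108
print(round(gap / iq_difference, 1))  # 7.2
```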

July 26 2011
