May 16 2012

Velocity Profile: Justin Huff

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Justin Huff
Software Engineer
PicMonkey
@jjhuff

How did you get into web operations and performance?

Picnik's founders Mike Harrington and Darrin Massena needed someone who knew something about Linux. Darrin and I had known each other for a few years, so my name came up. At the time, I was doing embedded systems work, but ended up moonlighting for Picnik. It wasn't long before I came over full time. I always expected to help them get off the ground and then they'd find a "real sysadmin" to take over. Turns out, I ended up enjoying ops! I was lucky enough to straddle the world between ops and back-end dev. Sound familiar?

What is your most memorable project?

Completing a tight database upgrade at a Starbucks mid-way between Seattle and Portland. "Replicate faster, PLEASE!" Also, in the build-up to Picnik's acquisition by Google, Mike asked me what it would take to handle 10 times our current traffic and to do it in 30 days. We doubled Picnik's hardware, including a complete network overhaul. It went flawlessly and continued to serve Picnik until Google shut it down in April of this year.

What's the toughest problem you've had to solve?

When Flickr launched with Picnik as its photo editor, we started to see really weird behavior causing some Flickr API calls to hang. I spent a good chunk of that day on the phone with John Allspaw and finally identified an issue with how our NAT box was munging TCP timestamps that were interacting badly with Flickr's servers. I learned a couple things: First, both John and I were able to gather highly detailed info (tcpdumps) at key points in our networks (and hosts) — sometimes you just have to go deep; second, it's absolutely imperative that you have good technical contacts with your partners.

What tools and techniques do you rely on most?

Graphs and monitoring are critical. Vim, because I can't figure out Emacs. Automation, because I can't even remember what I had for breakfast.

Who do you follow in the web operations and performance world?

Bryan Berry (@bryanwb) is great. Joe Williams (@williamsjoe) is doing great stuff — and his Twitter profile pic is awesome.

What is your web operations and performance super power?

I think I'm good at building, maintaining, and understanding complete systems. Other engineering disciplines are typically concerned about the details of a single part of a larger system. As web engineers, we have to grok the system, the components, and their interactions ... at 2 AM.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

May 09 2012

Theo Schlossnagle on DevOps as a career

In this new Velocity Podcast, I had a conversation with Theo Schlossnagle (@postwait), the founder and CEO of OmniTI. This conversation centers mostly on DevOps as a discipline and career. Theo, as always, has some interesting insights into DevOps and how to build a successful career in this industry.

Our conversation lasted 00:13:21. If you want to pinpoint any particular topic, you can find the specific timing below. I will apologize now: Theo's image froze a couple of minutes into our conversation, but since it was our second attempt at this, and it is a conversation, I feel the content of his answers is what most of us want to hear, not whether or not he is smiling or gesturing.

  • Are we splitting hairs with our terms of WebOps, DevOps, WebDev, etc? 00:00:42
  • What are the important goals developers should have in mind when building Systems that Operate? 00:01:28
  • How do you define, spec and set best practices for your DevOps organization so that your whole team is working well? 00:02:38
  • What does a typical day look like in the DevOps world? 00:03:39
  • What are the key attributes and skills someone should have to become a skilled DevOps? 00:04:50
  • What is the hardest to master for a young DevOps: security, scalability, reliability or performance? 00:06:22
  • Is DevOps more of a craft, discipline, methodology, way of thinking, what is it? 00:07:35
  • If your DevOps is operating well, do you notice it and how do you measure it if all is well? 00:08:47
  • What do you think the most significant thing a sharp DevOps person can contribute to an organization, and how do they know if they have achieved excellence? 00:10:16

If you would like to hear Theo speak on "It's All About Telemetry," he is presenting at the 2012 Velocity Conference in Santa Clara, Calif. on Tuesday 6/26/12 at 1:00pm. We hope to see you there.

May 02 2012

Velocity Profile: Sergey Chernyshev

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Sergey Chernyshev
Director of web systems and applications, truTV
Organizer, New York Web Performance Meetup
@sergeyche, @perfplanet

How did you get into web operations and performance?

I've been doing web development and operations since 1996. Before there were different people running websites, one person was responsible for everything. So in addition to adding features, I was making sure websites were running and running fast. In 2007, I heard Steve Souders and Tenni Theurer present their first findings at the Web 2.0 Expo, and after that, I was converted to the church of web performance optimization (WPO).

What is your most memorable project?

The most memorable are the two projects I'm most active on: Show Slow and running the New York Web Performance Meetup.

What's the toughest problem you've had to solve?

The toughest is to make people believe that WPO is important and change perspectives on how to approach performance. It's far from solved, but I hope I helped by kick-starting a local community movement — we now have 16 active groups across the globe with more than 5,000 members.

What tools and techniques do you rely on most?

Show Slow and WebPageTest.

Who do you follow in the web operations and performance world?

I run the @perfplanet account on Twitter where I follow a bunch of people and re-tweet WPO-related tweets. You can see my list here.

In addition, Brad Fitzpatrick of LiveJournal fame isn't doing much of this work these days, but he's behind many great technologies, including Memcached, Gearman and more.

April 09 2012

Operations, machine learning and premature babies

Julie Steele and I recently had lunch with Etsy's John Allspaw and Kellan Elliott-McCrea. I'm not sure how we got there, but we made a connection that was (to me) astonishing between web operations and medical care for premature infants.

I've written several times about IBM's work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That's amazing in itself, but what's more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you'd intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM's Vice President of Big Data, the telltale signal wasn't spikes or irregularities, but the opposite. There's a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn't exhibit the variation. Their heart rate was too normal; it didn't change throughout the day as much as it should.

That observation strikes me as revolutionary. It's easy to detect problems when something goes out of spec: If you have a fever, you know you're sick. But how do you detect problems that don't set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?

In our conversation, we started wondering how this applied to web operations. We have gigabytes of data streaming off of our servers, but the state of system and network monitoring hasn't changed in years. We look for parameters that are out of spec, thresholds that are crossed. And that's good for a lot of problems: You need to know if the number of packets coming into an interface suddenly goes to zero. But what if the symptom we should look for is radically different? What if crossing a threshold isn't what indicates trouble, but the disappearance (or diminution) of some regular pattern? Is it possible that our computing infrastructure also exhibits symptoms that are too subtle for a human to notice but would easily be detectable via machine learning?
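As a rough illustration of alerting on the disappearance of normal variation rather than on a crossed threshold, here is a minimal sketch. The function name, the window size, the `min_ratio` cutoff and the sample numbers are all assumptions made up for this example; it is not how IBM's system or any particular monitoring tool works.

```python
from statistics import pstdev

def variability_alerts(samples, window, baseline_std, min_ratio=0.5):
    """Return indices where the rolling standard deviation drops
    below min_ratio * baseline_std (the signal has gone 'too calm')."""
    alerts = []
    for end in range(window, len(samples) + 1):
        recent = samples[end - window:end]
        if pstdev(recent) < min_ratio * baseline_std:
            alerts.append(end - 1)  # index of the newest sample in the window
    return alerts

# Hypothetical requests-per-second readings: healthy traffic oscillates,
# then flattens out while every value stays well inside the usual range.
healthy = [100, 120, 90, 130, 95, 125, 105, 115]
flat = [110, 110, 111, 110, 109, 110, 110, 111]
series = healthy + flat

baseline = pstdev(healthy)
alerts = variability_alerts(series, window=4, baseline_std=baseline)
print(alerts)
```

No reading in the flat stretch would trip a conventional high or low threshold, yet the rolling spread collapses and the check fires. Knowing to look at variability in the first place is the hard part that this sketch takes for granted.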

We talked a bit about whether it was possible to alarm on the first (and second) derivatives of some key parameters, and of course it is. Doing so would require more sophistication than our current monitoring systems have, but it's not too hard to imagine. But it also misses the point. Once you know what to look for, it's relatively easy to figure out how to detect it. IBM's insight wasn't detecting the patterns that indicated a baby was about to become sick, but using machine learning to figure out what the patterns were. Can we do the same? It's not inconceivable, though it wouldn't be easy.
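For the simpler case mentioned above, alarming on first and second derivatives, a sketch using discrete differences might look like the following. The metric (a queue depth), the limits and the function names are hypothetical, and real samples would need evenly spaced timestamps and probably some smoothing.

```python
def differences(values):
    """Discrete derivative: change between consecutive samples."""
    return [b - a for a, b in zip(values, values[1:])]

def derivative_alarms(values, max_slope, max_accel):
    """Flag samples whose first or second difference exceeds a limit."""
    first = differences(values)    # rate of change
    second = differences(first)    # change in the rate of change
    alarms = []
    for i, slope in enumerate(first):
        if abs(slope) > max_slope:
            alarms.append((i + 1, "slope", slope))
    for i, accel in enumerate(second):
        if abs(accel) > max_accel:
            alarms.append((i + 2, "acceleration", accel))
    return sorted(alarms)

# Hypothetical queue depth sampled at a fixed interval.
queue_depth = [10, 11, 12, 14, 18, 26, 42]
alarms = derivative_alarms(queue_depth, max_slope=10, max_accel=5)
print(alarms)
```

Each alarm is a tuple of sample index, kind and value. This only detects a pattern someone already chose to look for; it does nothing to discover which patterns matter.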

Web operations has been on the forefront of "big data" since the beginning. Long before we were talking about sentiment analysis or recommendations engines, webmasters and system administrators were analyzing problems by looking through gigabytes of server and system logs, using tools that were primitive or non-existent. MRTG and HP's OpenView were savage attempts to put together information dashboards for IT groups. But at most enterprises, operations hasn't taken the next step. Operations staff doesn't have the resources (neither computational nor human) to apply machine intelligence to our problems. We'd have to capture all the data coming off our servers for extended periods, not just the server logs that we capture now, but every kind of data we can collect: network data, environmental data, I/O subsystem data, you name it. At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.

February 16 2012

Strata Week: The data behind Yahoo's front page

Here are a few of the data stories that caught my attention this week.

Data and personalization drive Yahoo's front page

Yahoo offered a peek behind the scenes of its front page with the release of the Yahoo C.O.R.E. Data Visualization. The visualization provides a way to view some of the demographic details behind what Yahoo visitors are clicking on.

The C.O.R.E. (Content Optimization and Relevance Engine) technology was created by Yahoo Labs. The tech is used by Yahoo News and its Today module to personalize results for its visitors — resulting in some 13,000,000 unique story combinations per day. According to Yahoo:

"C.O.R.E. determines how stories should be ordered, dependent on each user. Similarly, C.O.R.E. figures out which story categories (i.e. technology, health, finance, or entertainment) should be displayed prominently on the page to help deepen engagement for each viewer."

Screenshot from Yahoo's CORE data visualization. See the full visualization here.

Scaling Tumblr

Over on the High Scalability blog, Todd Hoff examines how the blogging site Tumblr was able to scale its infrastructure, something that Hoff describes as more challenging than the scaling that was necessary at Twitter.

To give some idea of the scope of the problem, Hoff cites these figures:

"Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers."

Hoff interviews Blake Matheny, distributed systems engineer at Tumblr, for a look at the architecture of both "old" and "new" Tumblr. When the startup began, it was hosted on Rackspace where "it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate."

The article also describes the Tumblr firehose, noting again its differences from Twitter's. "A challenge is to distribute so much data in real-time," Hoff writes. "[Tumblr] wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed." Although Tumblr initially used Scribe/Hadoop, "this model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


Visualization creation

Data scientist Pete Warden offers his own lessons learned about building visualizations this week in a story here on Radar. His first tip: "Play with your data" -- that is, before you decide what problem you want to solve or visualization you want to create, take the time to know the data you're working with.

Warden writes:

"The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization."

Warden explains how he was able to create a visualization for his new travel startup, Jetpac, that showed where American Facebook users go on vacation. Warden's tips aren't simply about the tools he used; he also walks through the conceptualization of the project as well as the crunching of the data.

Got data news?

Feel free to email me.

October 13 2011

Velocity is coming to Europe

The first Velocity Europe conference is happening November 8-9 in Berlin. I'm excited about this because there is a good deal of innovation and web-engineering thinking going on in Europe that isn't well known. It'll be great to meet and connect with people in this part of the world who are making a difference in our tribe of web operations and performance. (Pssst ... here's a 20% discount code you can use to register: veu11jsp.)

One of the things that I love about Velocity is that it brings to light topics that haven't had widespread public discussion. What were once considered exotic areas of technology and practice — things like database scaling, configuration management, TCP internals and mobile performance — are now becoming commonplace.

In his work evangelizing Web Performance Optimization (WPO), Steve Souders has been talking about the importance of client-side performance. The results are dramatic. Having come from a server-centric background, I can remember events when the website would feel slow to a non-technical CEO. He'd walk over to the systems administrators and ask naive questions like "Why don't we make sure we have enough servers?!" or "Can't we get a faster network?" The assumption made here is that performance lies solely on the server side, whose destiny lies in the hands of systems and network admins.

This typically results in a performance tuning exercise by sysadmins and DBAs that doesn't usually help, because the focus is in the wrong place. When the page generation takes a couple hundred milliseconds and the rendering of the full page in the browser takes a couple of seconds, it's almost like putting racing tires on a bulldozer; it's not going to increase performance very much.
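To put rough numbers on the bulldozer analogy, the sketch below assumes 200 ms of page generation and 2,000 ms of browser-side loading and rendering. Those figures are illustrative stand-ins for "a couple hundred milliseconds" and "a couple of seconds," not measurements, and the function name is made up for this example.

```python
def saved_fraction(backend_ms, frontend_ms, backend_factor=1.0, frontend_factor=1.0):
    """Fraction of total user-perceived time saved when the back end
    and/or front end is made faster by the given factors."""
    before = backend_ms + frontend_ms
    after = backend_ms / backend_factor + frontend_ms / frontend_factor
    return (before - after) / before

# Assumed timings: 200 ms server-side generation, 2000 ms in the browser.
backend_only = saved_fraction(200, 2000, backend_factor=2)
frontend_only = saved_fraction(200, 2000, frontend_factor=2)

print(f"Server twice as fast:    {backend_only:.1%} of total time saved")
print(f"Front end twice as fast: {frontend_only:.1%} of total time saved")
```

Under these assumptions, doubling server speed trims under 5% of what the user waits for, while the same relative gain on the front end trims about 45%.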

Steve has said in the past that only about 20% of response time is spent on the server side; the other 80% or so is spent in the front end. On the face of it, you'd think that this realization would get server-side engineers off the hook for being solely responsible for the site being "slow." This is certainly true, but there's more to it.

As a result of the research from Steve and others in the WPO industry, we've seen an expansion of domain expertise: server-side engineers are expanding their focus from just looking at server-side responses to client-side performance. And similarly, developers who normally write client-side code are learning more about things like TCP handshakes, DNS lookups, and the various mechanics of Content Delivery Networks, which hadn't previously been their focus.

Velocity has helped bring about this change in focus — from a world where server-side and client-side engineers focused only on their own respective areas, to one where all web engineers are expected to have a broader, more "holistic" view of the pipeline, up and down the stack.

Steve told me a story of hearing a backend engineer espousing longer Expires headers, inlining (data: URIs), and sprites as ways to reduce server load. "It brings joy to my heart when I hear these words," he said. "I realize all sides are working together to create the most optimized web delivery system possible. The end result is a better, faster, more enjoyable experience for users."
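Two of the techniques named here, far-future Expires headers and data: URI inlining, can be sketched with the Python standard library alone. The helper names and the one-year lifetime are assumptions for illustration; in practice the header would usually be set in the web server or CDN configuration rather than in application code.

```python
import base64
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def far_future_expires(now, days=365):
    """Build an HTTP-date for an Expires header `days` from `now`.
    `now` must be a timezone-aware UTC datetime."""
    return format_datetime(now + timedelta(days=days), usegmt=True)

def data_uri(payload, mime_type):
    """Inline a small resource as a base64 data: URI so the browser
    does not have to make a separate request for it."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

now = datetime(2011, 10, 13, 12, 0, 0, tzinfo=timezone.utc)
expires = far_future_expires(now)
inline = data_uri(b"GIF89a", "image/gif")  # just the GIF signature bytes, as a stand-in

print("Expires:", expires)
print(inline)
```

Both reduce server load the same way: a cached or inlined resource is a request the server never sees. Inlining trades that for larger HTML or CSS, so it tends to suit small images only.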

That's a good thing. Come to Velocity Europe and help us build this faster future!

June 29 2010

Creating Cultural Change

At Velocity 2010, John Rauser presented four funny and powerful examples of cultural change, from a campaign at his office to get people to fill the coffee pot after taking the last cup, to an award-winning advertising campaign. This talk explains how to "sneak past people's mental filters" and make things happen.

June 03 2010

Velocity Culture: Web Operations, DevOps, etc...


Velocity 2010 is happening on June 22-24 (right around the corner!). This year we've added a third track, Velocity Culture, dedicated to exploring what we've learned about how great teams & organizations work together to succeed at scale.

Web Operations, or WebOps, is what many of us have been calling these ideas for years. Recently the term "DevOps" has become a kind of rallying cry that is resonating with many, along with variations on Agile Operations. No matter what you call it, our experiences over the past decade taught us that Culture matters more than any tool or technology in building, adapting, and scaling the web.

Here is a small sample of the upcoming Velocity Culture sessions:

Ops Meta-Metrics: The Currency You Use to Pay For Change
John Allspaw (Etsy.com)
Change to production environments can cause a good deal of stress and strain amongst development and operations teams. More and more organizations are seeing benefits from deploying small code changes more frequently, for stability and productivity reasons. But how can you figure out how much change is appropriate for your application or your culture?

A Day in the Life of Facebook Operations
Tom Cook (Facebook)
Facebook’s Technical Operations team has to balance this need for constant availability with a fast-moving and experimental engineering culture. We release code every day. Additionally, we are supporting exponential user growth while still managing an exceptionally high ratio of users per employee within engineering and operations.

This talk will go into how Facebook is “run” day-to-day with particular focus on actual tools in use (configuration management systems, monitoring, automation, etc), how we detect anomalies and respond to them, and the processes we use internally for rapidly pushing out changes while still keeping a handle on site stability.

Change Management: A Scientific Classification
Andrew Shafer (Cloudscaling)
Change management is the combination of process and tools by which changes are made to production systems. Approaches range from cowboy style, making changes to the live site, to complex rituals with secret incantations, coming full circle to continuous deployment. This presentation will highlight milestone practices along this spectrum, establishing a matrix for evaluating deployment process.

There is a tremendous amount happening in our space in the coming weeks in addition to the conference itself. First, the "Web Operations" book, which John Allspaw & I edited, goes to print on June 15th. We're really excited about how it came together. Then, immediately after Velocity is DevOpsDays, which is a great community event that continues the conversation after Velocity (and is free). Hope to see you all there!