
April 17 2013

Four short links: 17 April 2013

  1. Computer Software Archive (Jason Scott) — The Internet Archive is the largest collection of historical software online in the world. Find me someone bigger. Through these terabytes (!) of software, the whole of the software landscape of the last 50 years is settling in. (And documentation and magazines and …). Wow.
  2. 7 in 10 Doctors Have a Self-Tracking Patient — the most common ways of sharing data with a doctor, according to the physicians, were writing it out by hand or giving the doctor a paper printout. (via Richard MacManus)
  3. opsmezzo — open-sourced provisioning tools from the Nodejitsu team. (via Nuno Job)
  4. Hacking Secret Ciphers with Python — teaches complete beginners how to program in the Python programming language. The book features the source code to several ciphers and hacking programs for these ciphers. The programs include the Caesar cipher, transposition cipher, simple substitution cipher, multiplicative & affine ciphers, Vigenère cipher, and hacking programs for each of these ciphers. The final chapters cover the modern RSA cipher and public key cryptography.
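
The ciphers the book starts with are tiny programs. As a taste, here is a Caesar cipher in a dozen lines of Python (my own sketch, not the book's source code):

```python
# A minimal Caesar cipher: shift each letter by `key` positions.
# Illustrative sketch only; the book's own programs differ in detail.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def caesar(text, key, decrypt=False):
    """Shift letters by `key`; non-letters pass through unchanged."""
    if decrypt:
        key = -key
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            out.append(ch)
    return "".join(out)
```

Decryption is just encryption with the negated key, which is what makes the cipher so easy to break by trying all 26 shifts.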

March 07 2013

Four short links: 6 March 2013

  1. High Performance Networking in Google Chrome — far more than you ever wanted to know about how Chrome is so damn fast.
  2. Tactical Chat — how the military uses IRC to wage war.
  3. http-console — a REPL loop for HTTP.
  4. Inductive Charger for Magic Mouse — my biggest bugbear with Bluetooth devices is the incessant appetite for batteries. Huzzah!

February 25 2013

Four short links: 25 February 2013

  1. Xenotext — Sci Foo Camper Christian Bök is closer to his goal of “living poetry”: A short stanza enciphered into a string of DNA and injected into an “unkillable” bacterium, Bök’s poem is designed to trigger the micro-organism to create a corresponding protein that, when decoded, is a verse created by the organism. In other words, the harmless bacterium, Deinococcus radiodurans (known as an extremophile because of its ability to survive freezing, scorching, or the vacuum of outer space), will be a poetic bug.
  2. Notes on Distributed Systems for Young Bloods — why distributed systems are different. Coordination is very hard. Avoid coordinating machines wherever possible. This is often described as “horizontal scalability”. The real trick of horizontal scalability is independence – being able to get data to machines such that communication and consensus between those machines is kept to a minimum. Every time two machines have to agree on something, the service is harder to implement. Information has an upper limit to the speed it can travel, and networked communication is flakier than you think, and your idea of what constitutes consensus is probably wrong.
  3. Lemnos Labs — hardware incubator in SF. (via Jim Stogdill)
  4. OLPC Built the Young Lady’s Illustrated Primer — Neil Stephenson imagined it, OLPC built it. Science fiction is a hugely powerful focusing device for creativity and imagination. (via Matt Jones)
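
The consensus point in item 2 has concrete arithmetic behind it. The Dynamo-style quorum rule (my illustration, with invented helper names, not from the article) says a read and a write are guaranteed to overlap on an up-to-date replica only when R + W > N, which is exactly why every extra bit of agreement makes each operation more expensive:

```python
# Dynamo-style quorum rule: with N replicas, a write acknowledged by W
# nodes and a read consulting R nodes must share at least one current
# replica iff R + W > N. Smaller quorums mean less coordination per
# operation, at the cost of weaker guarantees.

def reads_see_latest_write(n, r, w):
    """True if every read quorum intersects every write quorum."""
    return r + w > n

def min_read_quorum(n, w):
    """Smallest R that still guarantees read-your-writes for a given W."""
    return n - w + 1
```

With N=3, R=W=2 gives strong reads; dropping to R=W=1 is cheaper per request but a read may miss the latest write entirely.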

February 19 2013

Four short links: 19 February 2013

  1. Using Silk Road — exploring the transactions, probability of being busted, and more. Had me at the heading Silk Road as Cyphernomicon’s black markets. Estimates of risk of participating in the underground economy.
  2. Travis CI — a hosted continuous integration service for the open source community. It is integrated with GitHub.
  3. Chinese Cyber-Espionage Unit (PDF) — exposé of one of China’s Cyber Espionage Units. (via Reddit /r/netsec)
  4. $250 Arduino-Powered Hand Made by a Teen — the third version of his robotic hand. The hand is primarily made with 3D printing, with the exception of motors, gears, and other hardware. The control system is activated by flexing a pre-chosen muscle, such as curling your toes, then the movement is chosen and controlled by a series of eyeblinks and an EEG headset to measure brainwaves. The most remarkable part is that the hand costs a mere $250.

February 14 2013

Four short links: 14 February 2013

  1. Welcome to the Malware-Industrial Complex (MIT) — brilliant phrase, sound analysis.
  2. Stupid Stupid xBox — The hardcore/soft-tv transition and any lead they feel they have is simply not defensible by licensing other industries’ generic video or music content because those industries will gladly sell and license the same content to all other players. A single custom studio of 150 employees also can not generate enough content to defensibly satisfy 76M+ customers. Only with quality primary software content from thousands of independent developers can you defend the brand and the product. Only by making the user experience simple, quick, and seamless can you defend the brand and the product. I’ve never seen a better-put statement of why an ecosystem of indies is essential.
  3. Data Feedback Loops for TV (Salon) — Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
  4. wrk — a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue.
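
For a sense of what wrk's multithreaded design amounts to, here is a toy Python analogue: several threads hammering one URL and counting completed requests. (wrk itself is C on epoll/kqueue; this sketch of mine only illustrates the shape of the idea, against a throwaway local server.)

```python
# Toy load generator: N threads each issue M requests to one URL.
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # silence per-request logging
        pass

def run_load(url, threads=4, requests_per_thread=25):
    done = []
    lock = threading.Lock()
    def worker():
        for _ in range(requests_per_thread):
            with urllib.request.urlopen(url) as resp:
                assert resp.status == 200
            with lock:
                done.append(1)
    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return len(done)

server = HTTPServer(("127.0.0.1", 0), Hello)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
total = run_load("http://127.0.0.1:%d/" % server.server_port)
server.shutdown()
```

A real benchmarker would also track latency percentiles and use non-blocking I/O rather than one thread per connection.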

January 28 2013

Four short links: 28 January 2013

  1. Aaron’s Army — powerful words from Carl Malamud. Aaron was part of an army of citizens that believes democracy only works when the citizenry are informed, when we know about our rights—and our obligations. An army that believes we must make justice and knowledge available to all—not just the well born or those that have grabbed the reins of power—so that we may govern ourselves more wisely.
  2. Vaurien, the Chaos TCP Monkey — a project at Netflix to enhance the infrastructure tolerance. The Chaos Monkey will randomly shut down some servers or block some network connections, and the system is supposed to survive to these events. It’s a way to verify the high availability and tolerance of the system. (via Pete Warden)
  3. Foto Forensics — tool which uses image processing algorithms to help you identify doctoring in images. The creator’s deconstruction of Victoria’s Secret catalogue model photos is impressive. (via Nelson Minar)
  4. All Trials Registered — Ben Goldacre steps up his campaign to ensure trial data is reported and used accurately. I’m astonished that there are people who would withhold data, obfuscate results, or opt out of the system entirely, let alone that those people would vigorously assert that they are, in fact, professional scientists.
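
The Chaos Monkey idea in item 2 reduces to a very small loop: pick a random member of the fleet and kill it, so the system must prove it survives. A sketch (the fleet structure and kill hook here are my stand-ins, not Netflix's API):

```python
# Pick one randomly chosen running instance and terminate it.
import random

def chaos_step(fleet, kill, rng=random):
    """fleet: {instance_id: state}; kill: callable taking an id.
    Returns the terminated id, or None if nothing was running."""
    victims = [i for i in fleet if fleet[i] == "running"]
    if not victims:
        return None
    victim = rng.choice(victims)
    kill(victim)
    return victim

fleet = {"i-1": "running", "i-2": "stopped"}
killed = []
victim = chaos_step(fleet, killed.append)
```

The point is not the loop itself but running it continuously in production, so that tolerance to failure is verified every day rather than assumed.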

January 03 2013

Four short links: 3 January 2013

  1. Community Memory (Wired) — In the early 1970s, Efrem Lipkin, Mark Szpakowski and Lee Felsenstein set up a series of these terminals around San Francisco and Berkeley, providing access to an electronic bulletin board housed by a XDS-940 mainframe computer. This started out as a social experiment to see if people would be willing to share via computer — a kind of “information flea market,” a “communication system which allows people to make contact with each other on the basis of mutually expressed interest,” according to a brochure from the time. What evolved was a proto-Facebook-Twitter-Yelp-Craigslist-esque database filled with searchable roommate-wanted and for-sale items ads, restaurant recommendations, and, well, status updates, complete with graphics and social commentary. But did it have retargeted ads, promoted tweets, and opt-in messages from partners? I THOUGHT NOT. (via BoingBoing)
  2. Latency Numbers Every Programmer Should Know (EECS Berkeley) — exactly that. I was always impressed by Artur Bergman’s familiarity with the speed of packets across switches, RAM cache misses, and HDD mean seek times. Now you can be that impressive person.
  3. Feds Requiring Black Boxes in All Vehicles (Wired) — [Q]uestions remain about the black boxes and data. Among them, how long should a black box retain event data, who owns the data, can a motorist turn off the black box and can the authorities get the data without a warrant. This is starting as regulatory compliance, but should be seized as an opportunity to have a quantified self.
  4. Average Age of StackExchange Users by Tag (Brian Bondy) — no tag is associated with people who have a mean age over 30. Did I miss the plague that wiped out all the programmers over the age of 30? Or does age bring with it supreme knowledge so that old people like me never have to use StackExchange? Yes, that must be it. *cough*
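
If you want the cheat sheet handy, the canonical figures fit in a small table. These are the widely circulated approximate numbers (circa 2012; exact values vary by source and hardware), and the ratios are what make them memorable:

```python
# Canonical "latency numbers every programmer should know", in ns.
# Approximate, circa-2012 figures; exact values vary by hardware.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "branch mispredict": 5,
    "L2 cache reference": 7,
    "mutex lock/unlock": 25,
    "main memory reference": 100,
    "compress 1KB with Zippy": 3_000,
    "send 1KB over 1 Gbps network": 10_000,
    "read 4KB randomly from SSD": 150_000,
    "read 1MB sequentially from memory": 250_000,
    "round trip within same datacenter": 500_000,
    "disk seek": 10_000_000,
    "packet CA -> Netherlands -> CA": 150_000_000,
}

def times_slower(a, b):
    """How many times slower operation `a` is than operation `b`."""
    return LATENCY_NS[a] / LATENCY_NS[b]
```

One disk seek costs as much as a hundred thousand main-memory references, which is the kind of comparison Artur Bergman could produce from memory.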

December 05 2012

Four short links: 5 December 2012

  1. The Benefits of Poetry for Professionals (HBR) — Harman Industries founder Sidney Harman once told The New York Times, “I used to tell my senior staff to get me poets as managers. Poets are our original systems thinkers. They look at our most complex environments and they reduce the complexity to something they begin to understand.”
  2. First Few Milliseconds of an HTTPS Connection — far more than you ever wanted to know about how HTTPS connections are initiated.
  3. Google Earth Engine — Develop, access and run algorithms on the full Earth Engine data archive, all using Google’s parallel processing platform. (via Nelson Minar)
  4. 3D Printing Popup Store Opens in NYC (Makezine Blog) — MAKE has partnered with 3DEA, a pop up 3D printing emporium in New York City’s fashion district. The store will sell printers and 3D printed objects as well as offer a lineup of classes, workshops, and presentations from the likes of jewelry maker Kevin Wei, 3D printing artist Josh Harker, and Shapeways’ Duann Scott. This. is. awesome!

November 28 2012

Four short links: 28 November 2012

  1. Moral Machines — it will no longer be optional for machines to have ethical systems. Your car is speeding along a bridge at fifty miles per hour when an errant school bus carrying forty innocent children crosses its path. Should your car swerve, possibly risking the life of its owner (you), in order to save the children, or keep going, putting all forty kids at risk? If the decision must be made in milliseconds, the computer will have to make the call. (via BoingBoing)
  2. Hystrix — a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable. More information. (via Tom Loosemore)
  3. Offline First: A Better HTML5 Experience — I can’t emphasize enough how important it is to have offline functionality for the parts of the world that don’t have blanket 3G/LTE/etc coverage. (280 south from SF, for example).
  4. Disaster of Biblical Proportions (Business Insider) — impressive collection of graphs and data showing commodity prices indicate our species is living beyond its means.
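
The pattern at Hystrix's core is a circuit breaker: stop hammering a dependency once it has failed repeatedly, serve a fallback, and only probe again after a cooling-off period. A bare-bones sketch (mine, not Hystrix's actual design):

```python
# Minimal circuit breaker: after max_failures consecutive errors, stop
# calling the remote and serve the fallback until reset_after elapses.
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: short-circuit, don't touch remote
            self.opened_at = None      # half-open: try the real call again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

calls = []
def flaky():
    calls.append(1)          # count how often the "remote" is really hit
    raise RuntimeError("remote down")

breaker = CircuitBreaker(max_failures=2, reset_after=60)
results = [breaker.call(flaky, lambda: "fallback") for _ in range(3)]
```

After the second failure the breaker opens, so the third call never reaches the failing dependency: that is the "stop cascading failure" part.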

November 26 2012

The Narwhal and the Orca

Leaving politics aside, there’s a lot that can be learned from the technical efforts of the Obama and Romney campaigns. Just about everyone agrees that the Obama campaign’s Narwhal project was a great success, and that Romney’s Orca was a failure. But why?

I have one very short answer. If you follow technology, you don’t have to read between the lines much to realize that Narwhal embodied the best of the DevOps movement: rapid iteration, minimal barriers between developers and operations staff, heavy use of “cloud” technology, and constant testing to prove that you can handle outages and heavy load. In contrast, Romney’s Orca was a traditional corporate IT project gone bad. The plans were developed by non-technical people (it’s not clear what technical expertise Romney had on his team), and implemented by consultants. There were token testing efforts, but as far as I could tell, no serious attempts to break or stress the system so the team understood how to keep it running when the going got tough.

It’s particularly important to look at two factors: the way testing was done, which I’ve already mentioned, and the use of cloud computing. While Orca was “tested,” there is a big difference between passing automated test suites and the sort of game day exercise that the Narwhal team performed several times. In a game day, you’re actively trying to break the system in real time: in this case, a fully deployed copy of the actual system. You unplug the routers; you shut down the clusters; you bombard the system with traffic at inconceivable volumes. And the people responsible for keeping the system up are on the team that built it, and the team that ran it in real life. If you read the Atlantic account of Narwhal’s game day, you’ll see that it involved everyone, from the most senior developers on down, sweating it out to keep the system running in the face of disaster. They even simulated what would happen if one of Amazon’s availability zones went down, as has happened several times in the past few years (and happened again a few days before the election). Game day gave the Obama team a detailed plan for just about every conceivable outage, and the confidence that, if something inconceivable happened, they could handle it.

You never get the sense that Orca was tested in the same way. If it had been, the Romney team would have had a plan for what happened when network outages occurred, or when the server load went critical. I don’t see any evidence that the consultants who wrote the code were involved when operational problems started to show up. There were minimal plans for backup or disaster recovery, and on election day, they found out that the network couldn’t take the load. The Romney campaign had someone asking the right questions, but he didn’t get any answers.

Narwhal’s use of Amazon Web Services was another significant advantage. “In the cloud” is a cliché, but the capabilities Amazon provided were anything but. The Narwhal team didn’t have to worry about running out of compute capacity, because they could start more server instances as needed. Their disaster strategy included maintaining a hot backup of their applications in Amazon’s Western zone, in case the Eastern zone failed. Because they were relying on Amazon’s network services, network capacity wasn’t a concern. Amazon’s network handles Black Friday and Christmas with ease, along with many popular Internet sites. Election day wasn’t a challenge for their network.
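
That hot-backup arrangement is, at bottom, a health-checked routing decision: send traffic to the primary region while its health check passes, and fail over to the standby when it doesn't. A minimal sketch (the region names and health-check map are hypothetical, not the Obama team's actual configuration):

```python
# Route to the primary region while healthy; otherwise fail over.
def pick_region(health, primary="us-east-1", backup="us-west-1"):
    """health: {region: bool from a health check}. Returns target region."""
    return primary if health.get(primary, False) else backup

normal = pick_region({"us-east-1": True, "us-west-1": True})
outage = pick_region({"us-east-1": False, "us-west-1": True})
```

The hard part, of course, is everything around this one line: keeping the standby's data warm enough that failing over loses nothing, which is exactly what game days rehearse.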

While nobody knows exactly what the Orca team did, it’s believed that they were operating out of a single data center, either in Boston Garden or nearby, running on a fixed set of servers. This looks very much like a traditional IT operation, where the computing facilities are all owned and either on-premises or at a colocation facility. Arguably, this gives you increased control and security, though I believe that these advantages are a mirage. I’d much rather have Amazon’s staff fighting attempts to compromise their infrastructure than anyone I could afford to hire. The downside is that the Romney team wasn’t able to add capacity as load increased on election day. It’s just not possible to acquire and integrate new hardware on that time scale. Furthermore, by concentrating their servers at a single location, the Orca team effectively concentrated all their traffic, leading to outages when load exceeded capacity.

I’ve seen studies claiming that 68% of IT projects fail, so the end result isn’t a big surprise. I don’t believe that Orca’s failure was the determining factor in the election. But it is a cautionary tale for anyone working in IT, whether at a web startup or a large, traditional IT shop. Separating developers from the operations staff responsible for running the system is inviting trouble. Consulting contracts that extend beyond development to deployment and operations aren’t unknown, but neither are they common. DevOps is impossible when the devs have met their contractual obligations and have left the building. Inadequate testing, particularly stress testing of the entire system, is a further step in the wrong direction. And while cloud computing in itself doesn’t prevent or forestall disaster, building an IT infrastructure that runs entirely on-premises denies you the flexibility you need to deal with the problems that will inevitably arise.

What will we see in the 2016 election? By then, DevOps may well sound staid and corporate, and we’ll be looking forward to the next trendy thing. But I can guarantee that the major campaigns in the next presidential election will have learned the lessons of 2012: take advantage of the cloud, don’t separate your development staff from operations, and Test All The Things. If you’re working in IT now, you don’t have to wait to put those lessons into practice.


October 09 2012

Six themes from Velocity Europe

By Steve Souders and John Allspaw

More than 700 performance and operations engineers were in London last week for Velocity Europe 2012. Below, Velocity co-chairs Steve Souders and John Allspaw note high-level themes from across the various tracks (especially the hallway track) that are emerging for the WPO and DevOps communities.

Velocity Europe 2012 in London

Performance themes from Steve Souders

I was in awe of the speaker and exhibitor lineup going into Velocity Europe. It was filled with knowledgeable gurus and industry leaders. As Velocity Europe unfolded a few themes kept recurring, and I wanted to share those with you.

Performance matters more — The places and ways that web performance matters keep growing. The talks at Velocity covered desktop, mobile (native, web, and hybrid), tablet, TV, DSL, cable, FiOS, 3G, 4G, LTE, and WiMAX across social, financial, ecommerce, media, games, sports, video, search, analytics, advertising, and enterprise. Although all of the speakers were technical, they talked about how the focus on performance extends to other departments in their companies as well as the impact performance has on their users. Web performance has permeated all aspects of the web and has become a primary focus for web companies.

Organizational challenges are the hardest — Lonely Planet and SoundCloud talked about how the challenges in shifting their organizational culture to focus on performance were more difficult than the technical work required to actually improve performance. During the hallway track, a few other speakers and I were asked about ways to initiate this culture shift. There’s growing interest in figuring out how to change a company’s culture to value and invest in performance. This reminded me of our theme from Velocity 2009, the impact of performance and operations on the bottom line, where we brought in case studies that described the benefits of web performance using the vocabulary of the business. In 2013 I predict we’ll see a heavier emphasis on case studies and best practices for making performance a priority for the organization using a slightly different vocabulary, with terms like “culture,” “buy-in” and “DNA.”

The community is huge — As of today there are 42 web performance meetup groups totaling nearly 15,000 members worldwide: 15,000 members in just over three years! In addition to meetup groups, Aaron Kulick and Stephen Thair organized the inaugural WebPerfDays events in Santa Clara, Calif. and London (respectively). WebPerfDays, modelled after DevOpsDays, is an unconference for the web performance community organized by the web performance community. Although these two events coincided with Velocity, the intent is that anyone in the world can use the resources (templates, website, Twitter handle, etc.) to organize their own WebPerfDays. A growing web performance community means more projects, events, analyses, etc. reaching more people. I encourage you to attend your local web performance meetup group. If there isn’t one, then organize it. And consider organizing your own WebPerfDays as a one-day mini-Velocity in your own backyard.

Operations themes from John Allspaw

As if it was an extension of what we saw at Velocity U.S., there were a number of talks that underscored the importance of the human factor in web operations. I gave a tutorial called “Escalating Scenarios: A Deep Dive Into Outage Pitfalls” that mostly centered around the situations when ops teams find themselves responding to complex failure scenarios. Stephen Nelson-Smith gave a whirlwind tour of patterns and anti-patterns on workflows and getting things done in an engineering and operations context.

Gene Kim, Damon Edwards, John Willis, and Patrick Debois looked at the fundamentals surrounding development and operations cooperation and collaboration, in “DevOps Patterns Distilled.” Mike Rembetsy and Patrick McDonnell followed up with the implementation of those fundamentals at Etsy over a four-year period.

Theo Schlossnagle, ever the “dig deep” engineer, spoke on monitoring and observability. He gave some pretty surgical techniques for peering into production infrastructure in order to get an idea of what’s going on under the hood, with DTrace and tcpdump.

A number of talks covered handling large-scale growth.

These are just a few of the highlights we saw at Velocity Europe in London. As usual, half the fun was the hallway track: engineers trading stories, details, and approaches over food and drink. A fun and educational time was had by all.

August 03 2012

Four short links: 3 August 2012

  1. Urban Camouflage Workshop — Most of the day was spent crafting urban camouflage intended to hide the wearer from the Kinect computer vision system. By the end of the workshop we understood how to dress to avoid detection for the three different Kinect formats. (via Beta Knowledge)
  2. Starting a Django Project The Right Way (Jeff Knupp) — I wish more people did this: it’s not enough to learn syntax these days. Projects live in a web of best practices for source code management, deployment, testing, and migrations.
  3. FailCon — a one-day conference for technology entrepreneurs, investors, developers and designers to study their own and others’ failures and prepare for success. Figure out how to learn from failures—they’re far more common than successes. (via Krissy Mo)
  4. Google Fiber in the Real World (Giga Om) — These tests show one of the limitations of Google’s Fiber network: other services. Since Google Fiber is providing virtually unheard of speeds for their subscribers, companies like Apple and I suspect Hulu, Netflix and Amazon will need to keep up. Are you serving DSL speeds to fiber customers? (via Jonathan Brewer)

June 26 2012

Four short links: 26 June 2012

  1. SnapItHD — camera captures full 360-degree panorama and users select and zoom regions afterward. (via Idealog)
  2. Iago (GitHub) — Twitter’s load-generation tool.
  3. AutoCAD Worm Stealing Blueprints — lovely, malware that targets inventions. The worm, known as ACAD/Medre.A, is spreading through infected AutoCAD templates and is sending tens of thousands of stolen documents to email addresses in China. This one has soured, but give the field time … anything that can be stolen digitally, will be. (via Slashdot)
  4. Designing For and Against the Manufactured Normalcy Field (Greg Borenstein) — Tim said this was one of his favourite sessions at this year’s Foo Camp: breaking the artificial normality that we try to cast over new experiences so as to make them safe and comfortable.

June 12 2012

Velocity Profile: Schlomo Schapiro

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Schlomo Schapiro
Systems Architect, Open Source Evangelist

How did you get into web operations and performance?

Previously I was working as a consultant for Linux, open source tools and virtualization. While this is a great job, it has one major drawback: One usually does not stay with one customer long enough to enable the really big changes, especially with regard to how the customer works. When ImmobilienScout24 came along and offered me the job as a Systems Architect, this was my ticket out of consulting and into diving deeply into a single customer scenario. The challenges that ImmobilienScout24 faced were very much along the lines that occupied me as well:

  • How to change from "stable operations" to "stable change."
  • How to fully automate a large data center and stop doing repeating tasks manually.
  • How to drastically increase the velocity of our release cycles.

What are your most memorable projects?

There are a number of them:

  • An internal open source project to manage the IT desktops by the people who use them.
  • An open source project, Lab Manager Light, that turns a standard VMware vSphere environment into a self-service cloud.
  • The biggest and still very much ongoing project is the new deployment and systems automation for our data center. The approach — which is also new — is to unify the management of our Linux servers under the built-in package manager, in our case RPM. That way all files on the servers are already taken care of and we only need to centrally orchestrate the package roll-out waves and service start/stop. The tools we use for this are published here.
  • Helping to nudge us to embrace DevOps last year, after development went agile some three years ago.
  • Most important of all, I feel that ImmobilienScout24 is now on its way to maintaining and building upon the technological edge matching our market share as the dominant real-estate listing portal in Germany. This will actually enable us to keep growing and setting the pace in the ever-faster Internet world.

What's the toughest problem you've had to solve?

The real challenge is not to hack up a quick solution but to work as a team to build a sustainable world. Technical debt discussions are now a major part of my daily work. As tedious as they can be, I strongly believe that at our current state sustainability is at least as important as innovation.

What tools and techniques do you rely on most?

Asking questions and trying to understand with everybody together how things really work. Walking a lot through the office with a coffee cup and talking to people. Taking the time to sit down with a colleague at the keyboard and seeing things through. Sometimes it helps to shorten a discussion with a little hacking and "look, it just works" — but this should always be a way to start a discussion. The real work is better done together as a team.

What is your web operations and performance super power?

I hope that I manage to help us all to look forward to the next day at work. I also try to simplify things until they are really simple, and I annoy everybody by nagging about separation of concerns.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20


June 08 2012

Four short links: 8 June 2012

  1. HAProxy — high-availability proxy; cf. Varnish.
  2. Opera Reviews SPDY — thoughts on the high-performance HTTP++ from a team with experience implementing their own protocols. Section 2 makes a good intro to the features of SPDY if you’ve not been keeping up.
  3. Jetpants — Tumblr’s automation toolkit for handling monstrously large MySQL database topologies. (via Hacker News)
  4. LeakedIn — check if your LinkedIn password was leaked. Chris Shiflett had this site up before LinkedIn had publicly admitted the leak.
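
LeakedIn's check could be simple because the 2012 LinkedIn dump was unsalted SHA-1: hash what the user types and look it up in the leaked set. A sketch of the idea (the "leaked" set below is made up for illustration, not the real dump):

```python
# Check a password against a set of leaked unsalted SHA-1 hashes.
import hashlib

def sha1_hex(password):
    return hashlib.sha1(password.encode("utf-8")).hexdigest()

def is_leaked(password, leaked_hashes):
    return sha1_hex(password) in leaked_hashes

# Toy stand-in for the leaked hash list.
leaked = {sha1_hex("password"), sha1_hex("linkedin")}
```

Unsalted fast hashes are exactly why the dump cracked so quickly; salting and a slow KDF like bcrypt would have made both the attack and this kind of lookup service impractical.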

June 07 2012

What is DevOps?

Adrian Cockcroft's article about NoOps at Netflix ignited a controversy that has been smouldering for some months. John Allspaw's detailed response to Adrian's article makes a key point: What Adrian described as "NoOps" isn't really. Operations doesn't go away. Responsibilities can, and do, shift over time, and as they shift, so do job descriptions. But no matter how you slice it, the same jobs need to be done, and one of those jobs is operations. What Adrian is calling NoOps at Netflix isn't all that different from Operations at Etsy. But that just raises the question: What do we mean by "operations" in the 21st century? If NoOps is a movement for replacing operations with something that looks suspiciously like operations, there's clearly confusion. Now that some of the passion has died down, it's time to get to a better understanding of what we mean by operations and how it's changed over the years.

At a recent lunch, John noted that back in the dawn of the computer age, there was no distinction between dev and ops. If you developed, you operated. You mounted the tapes, you flipped the switches on the front panel, you rebooted when things crashed, and possibly even replaced the burned out vacuum tubes. And you got to wear a geeky white lab coat. Dev and ops started to separate in the '60s, when programmer/analysts dumped boxes of punch cards into readers, and "computer operators" behind a glass wall scurried around mounting tapes in response to IBM JCL. The operators also pulled printouts from line printers and shoved them in labeled cubbyholes, where you got your output filed under your last name.

The arrival of minicomputers in the 1970s and PCs in the '80s broke down the wall between mainframe operators and users, leading to the system and network administrators of the 1980s and '90s. That was the birth of modern "IT operations" culture. Minicomputer users tended to be computing professionals with just enough knowledge to be dangerous. (I remember when a new director was given the root password and told to "create an account for yourself" ... and promptly crashed the VAX, which was shared by about 30 users). PC users required networks; they required support; they required shared resources, such as file servers and mail servers. And yes, BOFH ("Bastard Operator from Hell") serves as a reminder of those days. I remember being told that "no one" else was having the problem I was having — and not getting beyond it until at a company meeting we found that everyone was having the exact same problem, in slightly different ways. No wonder we want ops to disappear. No wonder we wanted a wall between the developers and the sysadmins, particularly since, in theory, the advent of the personal computer and desktop workstation meant that we could all be responsible for our own machines.

But somebody has to keep the infrastructure running, including the increasingly important websites. As companies and computing facilities grew larger, the fire-fighting mentality of many system administrators didn't scale. When the whole company runs on one 386 box (like O'Reilly in 1990), mumbling obscure command-line incantations is an appropriate way to fix problems. But that doesn't work when you're talking hundreds or thousands of nodes at Rackspace or Amazon. From an operations standpoint, the big story of the web isn't the evolution toward full-fledged applications that run in the browser; it's the growth from single servers to tens of servers to hundreds, to thousands, to (in the case of Google or Facebook) millions. When you're running at that scale, fixing problems on the command line just isn't an option. You can't afford to let machines get out of sync through ad-hoc fixes and patches. Being told "We need 125 servers online ASAP, and there's no time to automate it" (as Sascha Bates encountered) is a recipe for disaster.

The response of the operations community to the problem of scale isn't surprising. One of the themes of O'Reilly's Velocity Conference is "Infrastructure as Code." If you're going to do operations reliably, you need to make it reproducible and programmatic. Hence virtual machines to shield software from configuration issues. Hence Puppet and Chef to automate configuration, so you know every machine has an identical software configuration and is running the right services. Hence Vagrant to ensure that all your virtual machines are constructed identically from the start. Hence automated monitoring tools to ensure that your clusters are running properly. It doesn't matter whether the nodes are in your own data center, in a hosting facility, or in a public cloud. If you're not writing software to manage them, you're not surviving.
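
The core move behind tools like Puppet and Chef is declarative convergence: state what a resource should look like, change it only if reality differs, and make re-running the whole thing a no-op. A file-only sketch of that idea (my own, not any real tool's API):

```python
# Idempotent "infrastructure as code" in miniature: converge a file
# to desired content, reporting whether anything had to change.
import os
import tempfile

def ensure_file(path, content):
    """Converge `path` to `content`; return True iff a change was made."""
    try:
        with open(path) as f:
            if f.read() == content:
                return False      # already converged: idempotent no-op
    except FileNotFoundError:
        pass
    with open(path, "w") as f:
        f.write(content)
    return True

demo_path = os.path.join(tempfile.mkdtemp(), "motd")
first = ensure_file(demo_path, "welcome\n")    # creates the file
second = ensure_file(demo_path, "welcome\n")   # nothing to do
```

Because each run reports whether it changed anything, you can run it on a thousand machines every half hour and alert only on drift, which is how these tools keep fleets identical.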

Furthermore, as we move further and further away from traditional hardware servers and networks, and into a world that's virtualized on every level, old-style system administration ceases to work. Physical machines in a physical machine room won't disappear, but they're no longer the only thing a system administrator has to worry about. Where's the root disk drive on a virtual instance running at some colocation facility? Where's a network port on a virtual switch? Sure, system administrators of the '90s managed these resources with software; no sysadmin worth his salt came without a portfolio of Perl scripts. The difference is that now the resources themselves may be physical, or they may just be software; a network port, a disk drive, or a CPU may have nothing to do with a physical entity you can point at or unplug. The only effective way to manage this layered reality is through software.

So infrastructure had to become code. All those Perl scripts show that it was already becoming code as early as the late '80s; indeed, Perl was designed as a programming language for automating system administration. It didn't take long for leading-edge sysadmins to realize that handcrafted configurations and non-reproducible incantations were a bad way to run their shops. It's possible that this trend means the end of traditional system administrators, whose jobs are reduced to racking up systems for Amazon or Rackspace. But that's only likely to be the fate of those sysadmins who refuse to grow and adapt as the computing industry evolves. (And I suspect that sysadmins who refuse to adapt swell the ranks of the BOFH fraternity, and most of us would be happy to see them leave.) Good sysadmins have always realized that automation was a significant component of their job and will adapt as automation becomes even more important. The new sysadmin won't power down a machine, replace a failing disk drive, reboot, and restore from backup; he'll write software to detect a misbehaving EC2 instance automatically, destroy the bad instance, spin up a new one, and configure it, all without interrupting service. With automation at this level, the new "ops guy" won't care if he's responsible for a dozen systems or 10,000. And the modern BOFH is, more often than not, an old-school sysadmin who has chosen not to adapt.
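That detect-destroy-replace loop can be sketched against a simulated fleet. The stub `launch_instance` below stands in for a real cloud API call (a real version would use something like the EC2 API); the names and IDs are invented for the example:

```python
# Hedged sketch of "don't repair the instance, replace it": find every
# unhealthy instance, terminate it, and launch a fresh, identically
# configured one, without interrupting the healthy ones.

import itertools

_ids = itertools.count(1)

def launch_instance():
    """Stand-in for a cloud API call that boots a configured instance."""
    return {"id": f"i-{next(_ids):04d}", "healthy": True}

def heal(fleet, is_healthy):
    """Replace every unhealthy instance in place; return terminated IDs."""
    replaced = []
    for i, inst in enumerate(fleet):
        if not is_healthy(inst):
            replaced.append(inst["id"])   # terminate the misbehaving instance
            fleet[i] = launch_instance()  # spin up a configured replacement
    return replaced

fleet = [launch_instance() for _ in range(3)]
fleet[1]["healthy"] = False               # simulate a misbehaving instance
gone = heal(fleet, lambda inst: inst["healthy"])
print(gone)        # IDs of terminated instances
print(len(fleet))  # fleet size is unchanged: 3
```

The loop doesn't care whether the fleet holds a dozen entries or 10,000, which is exactly the property the paragraph above describes.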

James Urquhart nails it when he describes how modern applications, running in the cloud, still need to be resilient and fault tolerant, still need monitoring, still need to adapt to huge swings in load, etc. But he notes that those features, formerly provided by the IT/operations infrastructures, now need to be part of the application, particularly in "platform as a service" environments. Operations doesn't go away; it becomes part of development. And rather than envisioning some sort of uber-developer who understands big data, web performance optimization, application middleware, and fault tolerance in a massively distributed environment, we need operations specialists on the development teams. The infrastructure doesn't go away — it moves into the code; and the people responsible for the infrastructure, the system administrators and corporate IT groups, evolve so that they can write the code that maintains the infrastructure. Rather than being isolated, they need to cooperate and collaborate with the developers who create the applications. This is the movement informally known as "DevOps."

Amazon's EBS outage last year demonstrates how the nature of "operations" has changed. There was a marked distinction between companies that suffered and lost money, and companies that rode through the outage just fine. What was the difference? The companies that didn't suffer, including Netflix, knew how to design for reliability; they understood resilience, spreading data across zones, and a whole lot of reliability engineering. Furthermore, they understood that resilience was a property of the application, and they worked with the development teams to ensure that the applications could survive when parts of the network went down. More important than the flames about Amazon's services are the testimonials of how intelligent and careful design kept applications running while EBS was down. Netflix's Chaos Monkey is an excellent, if extreme, example of a tool to ensure that a complex distributed application can survive outages; Chaos Monkey randomly kills instances and services within the application. The development and operations teams collaborate to ensure that the application is sufficiently robust to withstand constant random (and self-inflicted!) outages without degrading.
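A toy version of the idea fits in a few lines. This is an illustrative sketch in the spirit of Chaos Monkey, not Netflix's actual tool: kill a random instance, then verify that the redundant service still answers.

```python
# Chaos-Monkey-style sketch: a redundant service should survive the
# random loss of any single instance. Names and topology are invented.

import random

def serve(fleet, request):
    """A redundant service: any live instance can answer a request."""
    live = [inst for inst in fleet if inst["up"]]
    if not live:
        raise RuntimeError("total outage")
    return f"{request} handled by {random.choice(live)['name']}"

def chaos_monkey(fleet, rng):
    """Inflict a deliberate, random failure; return the victim's name."""
    victim = rng.choice(fleet)
    victim["up"] = False
    return victim["name"]

rng = random.Random(42)  # seeded so the example is repeatable
fleet = [{"name": f"node{i}", "up": True} for i in range(3)]
killed = chaos_monkey(fleet, rng)
print(killed, "->", serve(fleet, "GET /"))  # the service survives the kill
```

Running failures like this constantly, in production, is what turns resilience from a design document into a verified property.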

On the other hand, during the EBS outage, nobody who wasn't an Amazon employee touched a single piece of hardware. At the time, JD Long tweeted that the best thing about the EBS outage was that his guys weren't running around like crazy trying to fix things. That's how it should be. It's important, though, to notice how this differs from operations practices 20, even 10 years ago. It was all over before the outage even occurred: The sites that dealt with it successfully had written software that was robust, and carefully managed their data so that it wasn't reliant on a single zone. And similarly, the sites that scrambled to recover from the outage were those that hadn't built resilience into their applications and hadn't replicated their data across different zones.

In addition to this redistribution of responsibility, from the lower layers of the stack to the application itself, we're also seeing a redistribution of costs. It's a mistake to think that the cost of operations goes away. Capital expense for new servers may be replaced by monthly bills from Amazon, but it's still cost. There may be fewer traditional IT staff, and there will certainly be a higher ratio of servers to staff, but that's because some IT functions have disappeared into the development groups. The boundary is fluid, but that's precisely the point. The task — providing a solid, stable application for customers — is the same. The locations of the servers on which that application runs, and how they're managed, are all that changes.

One important task of operations is understanding the cost trade-offs between public clouds like Amazon's, private clouds, traditional colocation, and building their own infrastructure. It's hard to beat Amazon if you're a startup trying to conserve cash and need to allocate or deallocate hardware to respond to fluctuations in load. You don't want to own a huge cluster to handle your peak capacity but leave it idle most of the time. But Amazon isn't inexpensive, and a larger company can probably get a better deal taking its infrastructure to a colocation facility. A few of the largest companies will build their own data centers. Cost versus flexibility is an important trade-off; scaling is inherently slow when you own physical hardware, and when you build your data centers to handle peak loads, your facility is underutilized most of the time. Smaller companies will develop hybrid strategies, with parts of the infrastructure hosted on public clouds like AWS or Rackspace, part running on private hosting services, and part running in-house. Optimizing how tasks are distributed between these facilities isn't simple; that is the province of operations groups. Developing applications that can run effectively in a hybrid environment is the responsibility of developers, working in close cooperation with the operations team.
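The peak-versus-average arithmetic behind that trade-off is easy to sketch. Every price below is a made-up placeholder, not a real quote; the point is the shape of the comparison, not the numbers:

```python
# Cloud bills track average usage; owned hardware must be sized for peak.
# Which wins depends on how far average load sits below peak.

HOURS_PER_MONTH = 730

def monthly_cloud(avg_servers, hourly_rate):
    """Cloud: pay only for the capacity you actually use."""
    return avg_servers * hourly_rate * HOURS_PER_MONTH

def monthly_colo(peak_servers, server_capex, amortize_months, opex_per_server):
    """Colo: buy, power, and house enough hardware for peak load."""
    return peak_servers * (server_capex / amortize_months + opex_per_server)

# Hypothetical numbers: $0.50/hr on demand, vs a $4,000 server amortized
# over 36 months plus $120/month for power, space, and bandwidth.
# Peak load needs 100 servers; average load varies.
for avg in (20, 60, 90):
    cloud = monthly_cloud(avg, 0.50)
    colo = monthly_colo(100, 4000, 36, 120)
    winner = "cloud" if cloud < colo else "colo"
    print(f"avg {avg:>2}/100 servers: cloud ${cloud:,.0f} vs colo ${colo:,.0f} -> {winner}")
```

With a spiky load (average far below peak), paying by the hour wins; with a flat load near peak, owning the hardware wins. Real decisions add data transfer, staffing, and migration costs, but the structure of the calculation is the same.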

The use of metrics to monitor system performance is another respect in which system administration has evolved. In the late '80s or early '90s, you knew when a machine crashed because you started getting phone calls. Early system monitoring tools like HP's OpenView provided limited visibility into system and network behavior but didn't give much more information than simple heartbeats or reachability tests. Modern tools like DTrace provide insight into almost every aspect of system behavior; one of the biggest challenges facing modern operations groups is developing analytic tools and metrics that can take advantage of the data that's available to predict problems before they become outages. We now have access to the data we need; we just don't know how to use it. And the more we rely on distributed systems, the more important monitoring becomes. As with so much else, monitoring needs to become part of the application itself. Operations is crucial to success, but operations can only succeed to the extent that it collaborates with developers and participates in the development of applications that can monitor and heal themselves.
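As a minimal sketch of what "predict problems before they become outages" can mean, the detector below flags a metric that drifts several standard deviations away from its recent baseline, rather than waiting for a hard failure. The window size and threshold are illustrative:

```python
# A tiny anomaly detector: keep a rolling window of recent samples and
# flag any new value more than n_sigma standard deviations from the
# window's mean. Real systems layer far more sophistication on top,
# but the principle -- alert on deviation from baseline -- is the same.

from collections import deque
from statistics import mean, stdev

def make_detector(window=20, n_sigma=3.0):
    history = deque(maxlen=window)
    def check(value):
        anomalous = (
            len(history) >= window
            and stdev(history) > 0
            and abs(value - mean(history)) > n_sigma * stdev(history)
        )
        history.append(value)
        return anomalous
    return check

check = make_detector()
latencies = [100 + (i % 5) for i in range(30)] + [450]  # steady, then a spike
flags = [check(v) for v in latencies]
print(flags[-1])  # True: the spike stands out against the steady baseline
```

Nothing failed outright here; the service just got slow, and that is exactly the kind of signal a phone-call-driven monitoring culture never sees until customers start calling.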

Success isn't based entirely on integrating operations into development. It's naive to think that even the best development groups, aware of the challenges of high-performance, distributed applications, can write software that won't fail. On this two-way street, do developers wear the beepers, or IT staff? As Allspaw points out, it's important not to divorce developers from the consequences of their work, since the fires are frequently set by their code. So, both developers and operations carry the beepers. Sharing responsibilities has another benefit. When operations and development teams work together to solve outages, post-mortems can move past finger-pointing over whether an outage was caused by bad code or by operational error, and focus less on assigning blame than on making systems more resilient in the future. Although we used to practice "root cause analysis" after failures, we're recognizing that hunting for a single cause is usually unhelpful. Almost every outage is the result of a "perfect storm" of normal, everyday mishaps. Instead of figuring out what went wrong and building procedures to ensure that something bad can never happen again (a process that almost always introduces inefficiencies and unanticipated vulnerabilities), modern operations designs systems that are resilient in the face of everyday errors, even when they occur in unpredictable combinations.

In the past decade, we've seen major changes in software development practice. We've moved from various versions of the "waterfall" method, with interminable up-front planning, to "minimum viable product," continuous integration, and continuous deployment. It's important to understand that the waterfall methodologies of the '80s aren't "bad ideas" or mistakes. They were perfectly adapted to an age of shrink-wrapped software. When you produce a "gold disk" and manufacture thousands (or millions) of copies, the penalties for getting something wrong are huge. If there's a bug, you can't fix it until the next release. In this environment, a software release is a huge event. But in this age of web and mobile applications, deployment isn't such a big deal. We can release early, and release often; we've moved from continuous integration to continuous deployment. We've developed techniques for quick resolution in case a new release has serious problems; we've mastered A/B testing to test releases on a small subset of the user base.
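One building block of A/B testing and gradual rollout is deterministic user bucketing: a stable slice of users sees the new release, and the same user always lands in the same bucket. The 5% canary slice below is an arbitrary example:

```python
# Deterministic bucketing for canary releases and A/B tests: hash the
# user ID and route a fixed percentage of the hash space to the canary.
# No per-user state is needed, and assignment is stable across requests.

import hashlib

def bucket(user_id, canary_percent=5):
    """Return 'canary' for a stable ~canary_percent slice of users."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "canary" if h % 100 < canary_percent else "stable"

assignments = {u: bucket(u) for u in (f"user{i}" for i in range(1000))}
canary_share = sum(v == "canary" for v in assignments.values()) / len(assignments)
print(f"{canary_share:.1%} of users on the canary")  # roughly 5%
print(bucket("user42") == bucket("user42"))          # True: assignment is stable
```

From here, comparing error rates and latency between the two buckets tells you whether to widen the rollout or roll back, which is what makes frequent releases safe.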

All of these changes require cooperation and collaboration between developers and operations staff. Operations groups are adopting these changes and, in many cases, leading the effort to implement them. They're the specialists in resilience, in monitoring, in deploying changes and rolling them back. And the many attendees, hallway discussions, talks, and keynotes at O'Reilly's Velocity conference show us that they are adapting. They're adopting approaches to resilience that are completely new to software engineering; they're learning about monitoring and diagnosing distributed systems, doing large-scale automation, and debugging under pressure. At a recent meeting, Jesse Robbins described scheduling EMT training sessions for operations staff so that they understood how to handle themselves and communicate with each other in an emergency. It's an interesting and provocative idea, and one of many things that modern operations staff bring to the mix when they work with developers.

What does the future hold for operations? System and network monitoring used to be exotic and bleeding-edge; now, it's expected. But we haven't taken it far enough. We're still learning how to monitor systems, how to analyze the data generated by modern monitoring tools, and how to build dashboards that let us see and use the results effectively. I've joked about "using a Hadoop cluster to monitor the Hadoop cluster," but that may not be far from reality. The amount of information we can capture is tremendous, and far beyond what humans can analyze without techniques like machine learning.

Likewise, operations groups are playing a huge role in the deployment of new, more efficient protocols for the web, like SPDY. Operations is involved, more than ever, in tuning the performance of operating systems and servers (even ones that aren't under our physical control); a lot of our "best practices" for TCP tuning were developed in the days of ISDN and 56 Kbps analog modems, and haven't been adapted to the reality of Gigabit Ethernet, OC-48 fiber, and their descendants. Operations groups are responsible for figuring out how to use these technologies (and their successors) effectively. We're only beginning to digest IPv6 and the changes it implies for network infrastructure. And, while I've written a lot about building resilience into applications, so far we've only taken baby steps. There's a lot there that we still don't know. Operations groups have been leaders in taking best practices from older disciplines (control systems theory, manufacturing, medicine) and integrating them into software development.
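Much of that TCP-tuning work reduces to the bandwidth-delay product: a sender needs roughly bandwidth times round-trip time of data in flight to keep a link full. A quick calculation shows why modem-era buffer sizes starve a gigabit path:

```python
# Bandwidth-delay product: the number of unacknowledged bytes a TCP
# sender must keep in flight to saturate a link. Buffers tuned for
# 56 Kbps modems are thousands of times too small for gigabit links.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bytes in flight needed to fill a link of the given bandwidth."""
    return int(bandwidth_bps * rtt_seconds / 8)

# 56 Kbps modem vs Gigabit Ethernet, both at a 70 ms round trip:
print(bdp_bytes(56_000, 0.070))          # 490 bytes -- a few packets
print(bdp_bytes(1_000_000_000, 0.070))   # 8,750,000 bytes, about 8.75 MB
```

The 70 ms round trip is just an illustrative cross-country figure, but the four-orders-of-magnitude gap between the two answers is why window scaling and larger socket buffers matter on modern links.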

And what about NoOps? Ultimately, it's a bad name, but the name doesn't really matter. A group practicing "NoOps" successfully hasn't banished operations. It's just moved operations elsewhere and called it something else. Whether a poorly chosen name helps or hinders progress remains to be seen, but operations won't go away; it will evolve to meet the challenges of delivering effective, reliable software to customers. Old-style system administrators may indeed be disappearing. But if so, they are being replaced by more sophisticated operations experts who work closely with development teams to get continuous deployment right; to build highly distributed systems that are resilient; and yes, to answer the pagers in the middle of the night when EBS goes down. DevOps.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20

Photo: Taken at IBM's headquarters in Armonk, NY. By Mike Loukides.


June 06 2012

A crazy awesome gaming infrastructure

In this Velocity Podcast, I had a conversation with Sarah Novotny (@sarahnovotny), CIO of Meteor Entertainment. This conversation centers mostly on building a high-performance gaming infrastructure and bridging the gap between IT and business. Sarah has some great insights into building an environment for human performance that goes along with your quest for more reliable, scalable, tolerant, and secure web properties.

Our conversation lasted 00:15:44, and if you want to pinpoint any particular topic, you can find the specific timing below. Sarah provides some of her background and experience, as well as what she is currently doing at Meteor. The full conversation is outlined below.

  • As a CIO, how do you bridge the gap between technology and business? 00:02:28

  • How do you educate corporate stakeholders about the importance of DevOps and the impact it can have on IT? 00:03:26

  • How does someone set up best practices in an organization? 00:05:24

  • Are there signals that DevOps is actually happening where development and operations are merging? 00:08:31

  • How do you measure performance and make large changes in an online game without disrupting players? 00:09:59

  • How do you prepare for scaling your crazy awesome infrastructure needed for game play? 00:12:28

  • Have you gathered metrics on public and private clouds and do you know which ones to scale to when needed? 00:14:03

In addition to her work at Meteor, Sarah is co-chair of OSCON 2012 (being held July 16-20 in Portland, Ore.). We hope to see you there. You can also read Sarah's blog for more insights into what she's up to.



June 05 2012

Velocity Profile: Kate Matsudaira

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Kate Matsudaira
VP Engineering

How did you get into web operations and performance?

I started working as a software engineer, and at Amazon, working on the internals of the retail website, it was almost impossible not to have some exposure to pager duty and operations. As my career progressed and I moved into leadership roles on teams working on 24/7 websites, typically spanning hundreds of servers (and now instances), it was necessary to understand operations and performance.

What was your most memorable project?

Memorable can mean two things: really good or really bad. Right now I am excited about the work we have been doing to make our website super fast and work well across devices (and all the data mining and machine learning is also really interesting).

As for really bad, though, there was a launch almost a decade ago where we implemented an analytics datastore on top of a relational database instead of something like map/reduce. If only Hadoop and all the other great data technologies were around and prevalent back then!

What's the toughest problem you've had to solve?

Building an index of all the links on the web (a link search engine, basically) in one year with less than $1 million, including the team.

What tools and techniques do you rely on most?

Tools: pick the best one for the job at hand. Techniques: take the time to slow down before making snap judgements.

Who do you follow in the web operations and performance world?

Artur Bergman, Cliff Moon, Ben Black, John Allspaw, Rob Treat, and Theo Schlossnagle.

What is your web operations and performance super power?

Software architecture. You have to design your applications to be operational.



May 16 2012

Velocity Profile: Justin Huff

This is part of the Velocity Profiles series, which highlights the work and knowledge of web ops and performance experts.

Justin Huff
Software Engineer

How did you get into web operations and performance?

Picnik's founders Mike Harrington and Darrin Massena needed someone who knew something about Linux. Darrin and I had known each other for a few years, so my name came up. At the time, I was doing embedded systems work, but ended up moonlighting for Picnik. It wasn't long before I came over full time. I always expected to help them get off the ground and then they'd find a "real sysadmin" to take over. Turns out, I ended up enjoying ops! I was lucky enough to straddle the world between ops and back-end dev. Sound familiar?

What is your most memorable project?

Completing a tight database upgrade at a Starbucks mid-way between Seattle and Portland. "Replicate faster, PLEASE!" Also, in the build-up to Picnik's acquisition by Google, Mike asked me what it would take to handle 10 times our current traffic and to do it in 30 days. We doubled Picnik's hardware, including a complete network overhaul. It went flawlessly and continued to serve Picnik until Google shut it down in April of this year.

What's the toughest problem you've had to solve?

When Flickr launched with Picnik as its photo editor, we started to see really weird behavior causing some Flickr API calls to hang. I spent a good chunk of that day on the phone with John Allspaw and finally identified an issue with how our NAT box was munging TCP timestamps that were interacting badly with Flickr's servers. I learned a couple things: First, both John and I were able to gather highly detailed info (tcpdumps) at key points in our networks (and hosts) — sometimes you just have to go deep; second, it's absolutely imperative that you have good technical contacts with your partners.

What tools and techniques do you rely on most?

Graphs and monitoring are critical. Vim, because I can't figure out Emacs. Automation, because I can't even remember what I had for breakfast.

Who do you follow in the web operations and performance world?

Bryan Berry (@bryanwb) is great. Joe Williams (@williamsjoe) is doing great stuff — and his Twitter profile pic is awesome.

What is your web operations and performance super power?

I think I'm good at building, maintaining, and understanding complete systems. Other engineering disciplines are typically concerned about the details of a single part of a larger system. As web engineers, we have to grok the system, the components, and their interactions ... at 2 AM.


