
March 12 2012

Tullio Crali – Tempo!

La forza della curva

Bombardamento aereo

Wikipedia's entry on the Italian Futurist painter Tullio Crali (1910 - 2000).


December 15 2011

Strata Week: A new Internet data transfer speed record

Here are a few of the data stories that caught my attention this week:

New world record for data transfer speed

Scientists announced this week that they had broken the world record for Internet speed by transferring data at 186 Gbps.

Researchers built an optical fiber network between the University of Victoria Computing Centre in Victoria, British Columbia, and the Washington State Convention Center in Seattle, Wash. According to a Caltech press release, "with a simultaneous data rate of 88 Gbps in the opposite direction, the team reached a sustained two-way data rate of 186 Gbps between two data centers, breaking the team's previous peak-rate record of 119 Gbps set in 2009."

The new record-breaking speed is fast enough to transfer roughly 100,000 Blu-ray disks a day. The research on faster Internet speeds is underway to better handle the data coming from the Large Hadron Collider at CERN. "More than 100 petabytes (more than four million Blu-ray disks) of data have been processed, distributed, and analyzed using a global grid of 300 computing and storage facilities located at laboratories and universities around the world," according to Caltech, "and the data volume is expected to rise a thousand-fold as physicists crank up the collision rates and energies at the LHC." Faster data transfer will hopefully make it possible for more researchers to be able to work with the petabyte-scale data from CERN.
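The throughput claims above are easy to sanity-check. A quick back-of-the-envelope calculation, assuming 25 GB single-layer Blu-ray disks and decimal units (the article's "roughly 100,000" figure evidently uses looser rounding):

```python
# Sanity check of the 186 Gbps figure against the "Blu-ray disks per day" claim.
# Assumes 25 GB single-layer disks; all constants are round decimal numbers.

GBPS = 186                      # sustained two-way rate, gigabits per second
SECONDS_PER_DAY = 86_400
BITS_PER_GB = 8 * 1e9           # one gigabyte expressed in bits (decimal units)

bits_per_day = GBPS * 1e9 * SECONDS_PER_DAY
gigabytes_per_day = bits_per_day / BITS_PER_GB
bluray_per_day = gigabytes_per_day / 25     # 25 GB per disk

print(round(bluray_per_day))    # about 80,000 single-layer disks per day
```

The same arithmetic confirms the CERN figure: 100 petabytes divided by 25 GB per disk is indeed four million disks.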

The following video explains the hardware and technology behind the latest speed record:

Data predictions for 2012

This was a "coming out" year for big data and data science, according to O'Reilly's Edd Dumbill, who posted his 2012 data predictions this week. Dumbill has identified five areas in which he thinks we'll see more development in the next year:

  • More powerful and expressive tools for analysis. Specifically, better programming language support.
  • Development of data science workflows and tools. In other words, there will be clearer processes for how data teams work.
  • Rise of data marketplaces — the "directory" and the "delivery."
  • Streaming data processing, as opposed to batch processing.
  • Increased understanding of and demand for visualization. "If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills," Dumbill writes.
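The batch-versus-streaming distinction in Dumbill's fourth prediction can be shown in a few lines. This is an illustrative sketch, not any particular streaming framework's API:

```python
# Batch vs. streaming aggregation in miniature. The event values and
# function names are invented for illustration.

events = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch: wait until the full dataset exists, then compute once.
def batch_total(data):
    return sum(data)

# Streaming: maintain a running result as each event arrives,
# so an answer is available at every point in time.
def stream_totals(data):
    total = 0
    for x in data:
        total += x
        yield total

print(batch_total(events))               # 31
print(list(stream_totals(events))[-1])   # 31, but updated incrementally
```

Both arrive at the same final answer; the difference is that the streaming version can report an up-to-date total after every event rather than only at the end.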

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.


October 27 2011

What's on the agenda for Velocity Europe

Velocity Europe is less than two weeks away. It's happening November 8-9 in Berlin at the Hotel Maritim ProArte. I've heard good things about the venue and am excited to get there and check it out.

This event has been a long time coming. A handful of web performance and operations savants (including members of the Program Committee) have been encouraging us for years to bring Velocity to Europe, and now it's actually happening. And (drum roll please) the price is only EUR 600 (excl. VAT) if you use the 20% discount code veu11sts.

The Velocity Europe speaker line-up is exceptional. Some highlights include:

  • Jon Jenkins from Amazon is talking about their approach to the challenges of mobile browsing. Jon is the Director of Software Development for Amazon Silk. I'm looking forward to more details about Silk's split architecture.
  • Tim Morrow presents the background behind Betfair's promise to deliver a fast experience to their customers, and their progress on that promise.
  • Theo Schlossnagle is a recognized leader at Velocity. He's giving two talks on web operations careers and monitoring.
  • Estelle Weyl joins Velocity for the first time talking about the nuances of mobile rendering performance. I learn something new every time I hear Estelle speak, so I'm excited to welcome her to Velocity.
  • Ivo Teel discusses the balance we all face between features and performance and how they're handling that at Spil Games.
  • Jeff Veen knows the importance of third-party performance and availability as the CEO of Typekit. Jeff is an amazing, engaging speaker. Reading his session description gave me goosebumps with anticipation: "Jeff sat on a couch in the Typekit offices, staring out the window, and wondering if everything their company had been working towards was about to slip through their fingers …"

There's much much more – lightning demos, browser vendor talks, John Allspaw on anticipating failure, David Mandelin on JavaScript performance – I've got to stop here but please check out the entire schedule.

I want to give a shout out to the Velocity Europe Program Committee: Patrick Debois, Aaron Peters, Schlomo Schapiro, Jeroen Tjepkema, and Sean Treadway. They've participated in numerous video concalls (yay Google Hangouts!) to review proposals, build the program, and shape Velocity to be a European conference. And they might have one more card up their sleeve – more on that later.

If you're heading to Berlin you should also check out CouchConf Berlin on Nov 7. NoSQL has great performance benefits and Couchbase is a good choice for many mobile apps. Use couchconf_discount for 10% off registration.

The last time I was in Berlin was in 2009. The city had a high-tech vibe and the crowd was extremely knowledgeable and enthusiastic. I'm excited to get back to Berlin for Velocity Europe and do the web performance and operations deep dives that are the core of Velocity. If you want to have a website that's always fast and always up, Velocity Europe is the place to be. I hope to see you there.

Velocity Europe, being held Nov. 8-9 in Berlin, will bring together the web operations and performance communities for two days of critical training, best practices, and case studies.

Save 20% on registration with the code veu11sts


June 17 2011

Velocity 2011 debrief

Velocity wrapped up yesterday. This was Velocity's fourth year and every year has seen significant growth, but this year felt like a tremendous step up in all areas. Total attendance grew from 1,200 last year to more than 2,000 people. The workshops were huge, the keynotes were packed, and the sessions in each track were bigger than anyone expected. The exhibit hall was more than twice as big as last year and it was still crowded every time I was there.

Sample some of the tweets to see the reaction of attendees, sponsors, and exhibitors.

Several folks on the #velocityconf Twitter stream have been asking about slides and videos. You can find those on the Velocity Slides and Videos page. There are about 25 slide decks up there right now. The rest of the slides will be posted as we receive them from the speakers. Videos of all the keynotes will be made available for free. Several are already posted, including "Career Development" by Theo Schlossnagle, "JavaScript & Metaperformance" by Doug Crockford, and "Look at Your Data" by the omni-awesome John Rauser. Videos of every afternoon session are also available via the Velocity Online Access Pass ($495).

Velocity 2011 had a great crowd with a lot of energy. Check out the Velocity photos to get a feel for what was happening. We had more women speakers than ever before and I was psyched when I saw this photo of the Women's Networking Meet Up that took place during the conference (also posted above).

Velocity 2011: Take-aways, Trends, and Highlights — In this webcast following Velocity 2011, program chairs Steve Souders and John Allspaw will identify and discuss key trends and announcements that came out of the event and how they will impact the web industry in the year to come.

Join us on Friday, June 24, 2011, at 10 am PT

Register for this free webcast

Make sure to check out all the announcements that were made at Velocity. There were a couple big announcements about Velocity itself, including:

  • After four years Jesse Robbins is passing the co-chair mantle to John Allspaw. I worked with John at Yahoo! when he was with Flickr. John is VP of Tech Ops at Etsy now. He stepped into many of the co-chair duties at this Velocity in preparation for taking on the role at the next Velocity.
  • Speaking of the next Velocity, we announced there will be a Velocity Europe in November in Berlin.
    The exact venue and dates will be announced soon, followed quickly by a call for proposals.
    We're extremely excited about expanding Velocity to Europe and look forward to connecting with the performance and operations communities there,
    and helping grow the WPO and devops industries in that part of the world.
    In addition, the second Velocity China will be held in Beijing in December 2011.
  • And of course we'll be back next June for our fifth year of Velocity here in the Bay Area.

I covered a lot in this post and didn't even talk about any of the themes, trends, and takeaways. John and I will be doing that at the Velocity Wrap-up Webcast on Friday, June 24 at 10am PT. It's free so invite your friends and colleagues to join in.


May 24 2011

To the end of bloated code and broken websites

In a recent discussion, Nicole Sullivan (@stubbornella), architect at Stubbornella Consulting Group and a speaker at Velocity 2011, talked about the state of CSS — how it's adapting to mobile, how it's improving performance, and how some CSS best practices have led to "bloated code and broken websites."

Our interview follows.

How are CSS best practices evolving?

Nicole Sullivan: New tools are being added to browsers, and the Chrome team is really pushing the limits of what we can do with CSS, but there is still an uphill battle. Some of the best practices are actually bad for the domain.

I recently wrote an article about the best practices and what's wrong with them. I figured out this year that it wasn't just that the best practices weren't ideal — it's that they were absolutely, every single time, leading to bloated code and broken websites. It was a revelation for me to realize the best practices were often causing issues.

How are architect-level CSS tools improving?

Nicole Sullivan: The preprocessors have gotten much better. They were partially created because people didn't like the syntax of CSS and wanted a new one, but the preprocessors changed a bunch of things that weren't necessarily useful to change. In the last year or so, the preprocessors have embraced CSS and have become a testing ground for what can go into browsers. At the same time, the Chrome team is pushing forward on WebKit — it's a pretty exciting time to be working on this stuff.

Are you encountering browser support issues when building with CSS and HTML5?

Nicole Sullivan: Particularly with CSS3, there's a ton of variation and levels of support. But what CSS3 gives us are ways of doing visual decorations without actually needing images. Stoyan Stefanov and I built a tool a few years ago to crush and optimize images because we realized that image weight was one of the big problems on the web. Overall, CSS was sort of the source of the problem because it was bringing in all of this extra media via images.

The cool thing with CSS3 is that now we can eliminate a lot of those images by using the more advanced properties — "border-radius" can give us rounded corners without images; you can get gradients now without images; you can get drop shadows and things like that. The thing is to be flexible enough with design that it's still going to work if, say, it doesn't have that gradient. And to realize that for users on an older browser, it's not worth the weight you'd add to the page to get them that gradient or the rounded corners — they're much more interested in having a snappy, usable experience than they are in having every visual flourish possible.

Velocity 2011, being held June 14-16 in Santa Clara, Calif., offers the skills and tools you need to master web performance and operations.

Save 20% on registration with the code VEL11RAD

How about at the mobile level — what are the major issues you're facing in that space?

Nicole Sullivan: Media queries are the biggest issue for mobile right now. Designers and developers are excited to be able to query, for example, the size of the screen and to offer different layouts for the iPhone or the iPad. But that means you're sending your entire layout for a desktop view and a mobile view down to a mobile phone or down to an iPad, which is way more than you want to be sending over the wire. Designers need to put mobile first and then maybe layer on a desktop experience — but then only sending that code to a desktop user. All of this requires more of a server-side solution.
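The server-side approach Sullivan describes can be sketched roughly as follows: decide what to send based on the requesting device, rather than shipping both layouts and hiding one with media queries. The device check, payload shape, and asset names here are invented for illustration:

```python
# Hedged sketch of a mobile-first, server-side response strategy.
# MOBILE_TOKENS, the payload dicts, and the asset names are all assumptions;
# a real system would use a proper device-detection library.

MOBILE_TOKENS = ("iphone", "android", "mobile")

def payload_for(user_agent: str) -> dict:
    ua = user_agent.lower()
    is_mobile = any(token in ua for token in MOBILE_TOKENS)
    if is_mobile:
        # Mobile-first: only the essentials go over the wire.
        return {"layout": "mobile", "assets": ["core.css", "core.js"]}
    # Desktop layers extra assets on top of the mobile core,
    # and only desktop users pay for them.
    return {"layout": "desktop",
            "assets": ["core.css", "core.js", "desktop.css", "widgets.js"]}

print(payload_for("Mozilla/5.0 (iPhone; ...)")["layout"])     # mobile
print(payload_for("Mozilla/5.0 (Windows NT ...)")["layout"])  # desktop
```

The point is that the phone never downloads `desktop.css` or `widgets.js` at all, which media queries alone cannot guarantee.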

Do developers need to build two different sites to accomplish that?

Nicole Sullivan: It depends. On my little iPhone, there's not a lot of screen real estate. If I go to a travel website, I don't want every feature they've got cluttering up my iPhone. I want to know what flight I'm on, what my confirmation number is — that kind of thing. It makes sense on the design side to think about why your users are coming to the mobile site and then designing for those needs.

What happens to desktop design is there's sort of a land grab. Each team tries to grab a little bit of space and add stuff so they'll get traffic to their part of the site. It creates a disjointed user experience. The great thing about mobile is that people aren't doing that — there isn't enough screen real estate to have a land grab yet.

This interview was edited and condensed.


April 26 2011

Why speed matters

For the past several years I have been thinking about the role of speed in customer experience and business strategy. We live in an ever-accelerating world and the competitive terms of business are built upon achieving speed for many reasons. Here are just a few, from the obvious to the more speculative.

Speed is our default setting

Human beings live and operate in a constant state of now; we process extraordinary volumes of information in real time. The acceleration of technology is simply an effort to catch up to our zero-latency experience of being. Whenever given a choice, we will opt for a service that delivers response times as fast as our own nervous system.

The technology and processes around us are nowhere close to catching up — yet wherever they do, we see incredible value creation. Any information processing technology that moves from batch to "real-time" experiences a quantum leap in value, especially for those who adopt it first. Consider the arbitrage opportunity in financial systems capable of receiving market prices (or other data) in real time, or the efficiency of inventory management occurring in real time across the supply chain. All of the systems that surround and support modern life are accelerating into real-time systems.

Velocity 2011, being held June 14-16 in Santa Clara, Calif., offers the skills and tools you need to master web performance and operations.

Save 20% on registration with the code VEL11RAD

Speed is money saved

Walmart's competitive advantage came from accelerating inventory information to near real-time throughout its supply chain. The result was incredible efficiency and huge cost savings that were the basis for its domination of the American landscape.

Speed is gratification delivered

When I worked in e-commerce in the mid '90s, we quantified the obvious: faster page load times equaled more revenue. Our analytics showed that milliseconds spelled the difference between a sale and a lost customer.

Today we see the rise of flash drives in consumer electronics not because they are more reliable or durable (they are not) but largely because they wake your computer from sleep faster.

The magic of the new iPad 2 comes from its internal speed — it uses a flash drive — and speed via an external accessory: the Smart Cover automatically wakes the device and bypasses the estimated 3 seconds it takes to click and swipe.

Speed is loyalty earned

Money is a metaphor for our use of time. We pay attention and we spend time. Taking too much of a customer's time is a form of theft that can cost your business. Conversely, if a product or service saves us time and costs less in attention, we feel rewarded.

Speed equals certainty, delay equals doubt

I have heard it argued that Google won the search battle as much due to the speed of delivering results as the vaunted relevance of those results. They put their response times in milliseconds on every results page. In a social interaction, any pause before responding to a simple question ("does this dress make me look big?") qualifies the inevitable response ("absolutely not") as less certain. My example is a stereotype and a bit whimsical, but it is emblematic of how we transfer these same emotions to our interactions with people and services. In other words, speed/responsiveness engenders feelings of trust, certainty and comfort.

Speed is a key facet of business strategy

All of this amounts to a simple edict: Consider speed as a dimension to your business strategy, not as a by-product of seeking efficiency but as a means of winning customers. I have used examples from the digital domain but the same premise applies to any offline experience — from hotel check-ins to the "out-of-box" experience of your new product. In more ways than one speed can deliver advantages beyond quality or efficiency. Speed can deliver intangibles like trust and loyalty.

Speed is a pain

Delivering on speed puts stress on an organization and, more importantly, on people. Our schedules get compressed, our deadlines tighten and the bar for competitive productivity keeps going up. While many lament the increasing pace of modern life, it is a futile complaint because it focuses on the effect rather than the cause of increasing speed. Over and over, we reward speed with our attention and with our business. As customers we demand speed from the products and services we purchase. The consequence is that as employees or business owners we find ourselves subordinated to an accelerating pace of work to deliver on that demand.

Speed is a choice we make

I believe that the terms of success for people in the world will increasingly reside with managing their own pace and flow of attention against the demands of speed. Those capable of strategically disconnecting and applying selective focus will be at an advantage in business or in life (hasn't this always been the case?) because foresight and judgment, two critical life skills, are not necessarily improved by speed. Quite the opposite.

But this isn't the same as saying that we must slow down in business wholesale. As long as society rewards speed with equity, it will be the fundamental basis for competitive advantage and worth our attention.

Associated photo on index pages: Speedy Gonzales by blmurch, on Flickr


January 06 2011

Big data faster: A conversation with Bradford Stephens

To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with big data and analytics. Today, we hear from Bradford Stephens, founder of Drawn to Scale.

Drawn to Scale is a database platform that works with large data sets. Stephens describes its focus as slightly different from that of other big data tools: "Other tools out there concentrate on doing complex things with your data in seconds to minutes. We really concentrate on doing simple things with your data in milliseconds."

Stephens calls such speed "user time" and he credits Drawn to Scale's performance to its indexing system working in parallel with backend batch tools. Like other big data tools, Drawn to Scale uses MapReduce and Hadoop for batch processing on the back end. But on the front end, a series of secondary indices on top of the storage layer speed up retrieval. "We find that when you index data in the manner in which you wish to use it, it's basically one single call to the disk to access it," Stephens says. "So it can be extremely fast."
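The pattern Stephens describes, building secondary indexes in batch so that reads become a single lookup rather than a scan, can be modeled in a few lines. Plain dicts stand in for the storage and index layers; the row data and names are invented:

```python
# Illustrative model of batch-built secondary indexes for "user time" reads.
# In a real system the primary store and the index would live on disk
# (e.g. in a distributed database); dicts stand in for both here.

rows = {
    "r1": {"user": "alice", "city": "Seattle"},
    "r2": {"user": "bob",   "city": "Victoria"},
    "r3": {"user": "carol", "city": "Seattle"},
}

# "Batch" step: index the data in the manner in which it will be queried.
by_city = {}
for row_id, row in rows.items():
    by_city.setdefault(row["city"], []).append(row_id)

# "User time" step: one index lookup, no scan over the full dataset.
def users_in(city):
    return [rows[r]["user"] for r in by_city.get(city, [])]

print(sorted(users_in("Seattle")))   # ['alice', 'carol']
```

This is the trade Stephens is pointing at: extra batch work and storage up front, in exchange for reads that are "basically one single call to the disk."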

Big data tools and applications will be examined at the Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Drawn to Scale's customers include organizations working with analytics, in social media, in mobile ad targeting and delivery, and also organizations with large arrays of sensor networks. While he expects to see some consolidation on the commercial side ("I see a lot of vendors out there doing similar things"), on the open source side he expects to see a proliferation of tools available in areas such as geo data and managing time series. "People have some very specific requirements that they're going to cook up in open source."

You'll find the full interview in the following video:

August 02 2010

Operations: The secret sauce revisited

Guest blogger Andrew Clay Shafer is helping telcos and hosting providers implement cloud services at Cloudscaling. He co-founded Reductive Labs, creators of Puppet, the configuration management framework. Andrew preaches the "infrastructure is code" gospel, and he supports approaches for applying agile methods to infrastructure and operations. Some of those perspectives were captured in his chapter in the O'Reilly book "Web Operations."

"Technical debt" is used two ways in the analysis of software systems. The phrase was first introduced in 1992 by Ward Cunningham to describe the premise that increased speed of delivery provides other advantages, and that the debt leveraged to gain those advantages should be strategically paid back.

Somewhere along the way, technical debt also became synonymous with poor implementation; reflecting the difference between the current state of a code base and an idealized one. I have used the term both ways, and I think they both have merit.

Technical debt can be extended and discussed along several additional axes: process debt, personnel debt, experience debt, user experience debt, security debt, documentation debt, etc. For this discussion, I won't quibble about the nuances of categorization. Instead, I want to take a high-level look at operations and infrastructure choices people make and the impact of those choices.

The technical debt metaphor

Debts are created by some combination of choice and circumstance. Modern economies are predicated on the flow of debt as much as anything else, but not all debt is created equal. There is a qualitative difference between a mortgage and carrying significant debt on maxed-out credit cards. The point being that there are a variety of ways to incur debt, and debts of different quality carry different consequences.

Jesse Robbins' Radar post about operations as the secret sauce talked about bootstrapping web startups in 80 hours. It included the following infographic showing the time cost of traditional versus special sauce operations:

I contend that the ongoing difference in time cost between the two solutions is the interest being paid on technical debt.

Understanding is really the crux of the matter. No one who really understands compound interest would intentionally make frivolous purchases on a credit card and not make every effort to pay down high interest debt. Just as no one who really understands web operations would create infrastructure with an exponentially increasing cost of maintenance. Yet, people do both of these things.

As the graph is projected out, the ongoing cost of maintenance in both projects reflects the maxim of "the rich get richer." One project can focus on adding value and differentiating itself in the market while the other will eventually be crushed under the weight of its own maintenance.
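The compound-interest analogy above can be made concrete with a toy projection: a flat weekly maintenance cost versus one that compounds a few percent per week. The rates and hours are made up purely for illustration:

```python
# Toy model of "interest on technical debt": maintenance cost that
# compounds weekly vs. one held flat. All numbers are invented.

def total_cost(base_hours, weekly_growth, weeks):
    cost, total = base_hours, 0.0
    for _ in range(weeks):
        total += cost
        cost *= 1 + weekly_growth   # unpaid debt charges interest
    return total

steady = total_cost(5, 0.00, 52)       # disciplined ops: 5 h/week, flat
compounding = total_cost(5, 0.05, 52)  # maintenance grows 5% every week

print(round(steady))        # 260 hours over a year
print(round(compounding))   # well over 1,000 hours over the same year
```

At a modest 5% weekly growth rate, the second team spends more than four times as many hours on maintenance in a year, which is exactly the "rich get richer" divergence the graph projects.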

Technical debt and the Big Ball of Mud

Without a counterbalancing investment, system and software architectures succumb to entropy and become more difficult to understand. The classic "Big Ball of Mud" by Brian Foote and Joseph Yoder catalogs forces that contribute to the creation of haphazard and undifferentiated software architectures. They are:

  • Time
  • Cost
  • Experience
  • Skill
  • Visibility
  • Complexity
  • Change
  • Scale

These same forces apply just as much to infrastructure and operations, especially if you understand the "infrastructure is code" mantra. If you look at the original "Tale of Two Ops Teams" graphic, both teams spent almost the same amount of time before the launch. If we assume that these are representative, then the difference between the two approaches is essentially experience and skill, which is likely to be highly correlated with cost. As the project moves forward, the difference in experience and skill reflects itself in how the teams spend time, provide visibility and handle complexity, change and scale.

Using this list, and the assumption that balls of mud are synonymous with high technical debt, combating technical debt becomes an exercise in minimizing the impact of these forces.

  • Time and cost are what they are, and often have an inverse relationship. From a management perspective, I would like everything now and for free, so everything else is a compromise. Undue time pressure will always result in something else being compromised. That compromise will often start charging interest immediately.
  • Experience is invaluable, but sometimes hard to measure and overvalued in technology. Doing the same thing over and over with a technology is not 10 years of experience, it is the first year of experience 10 times. Intangible experience should not be measured in time, and experience in this sense is related to skill.
  • Visibility has two facets in ops work: Visibility into the design and responsibilities of the systems, and real-time metrics and alerting on the state of the system. The first allows us to take action, the second informs us that we should.
  • Complex problems can require complex solutions. Scale and change add complexity. Complexity obscures visibility and understanding.

Each of these forces and specific examples of how they impact infrastructure would fill a book, but hopefully that is enough to get people thinking and frame a discussion.

There is a force that may be missing from the "Big Ball of Mud": tools (which might be an oversight, might be an attempt to remain tool-agnostic, or might be considered a cross-cutting aspect of cost, experience and skill). That's not to say that tools don't add some complexity and the potential for technical debt as well. But done well, tools provide ongoing insight into how and why systems are configured the way they are, illumination of the complexity and connections of the systems, and a mechanism to rapidly implement changes. That is just an example. Every tool choice, from the operating system, to the web server, to the database, to the monitoring and more, has an impact on the complexity, visibility and flexibility of the systems, and therefore impacts operations effectiveness.

Many parallels can be drawn between operations and fire departments. One big difference is most fire departments don't spend much time actually putting out fires. If operations is reacting all the time, that indicates considerable technical debt. Furthermore, in reactive environments, the probability is high that the solutions of today are contributing to the technical debt and the fires of tomorrow.

Focus must be directed toward getting the fires under control in a way that doesn't contribute to future fires. The coarse metric of time spent reactively responding to incidents versus the time spent proactively completing ops-related projects is a great starting point for understanding the situation. One way to ensure operations is always a cost center is to keep treating it like one. When the flow of technical debt is understood and well managed, operations is certainly a competitive advantage.


June 03 2010

How Facebook satisfied a need for speed

Remember how Facebook used to lumber and strain? And have you noticed how it doesn't feel slow anymore? That's because the engineering team pulled off an impressive feat: an in-depth optimization and rewrite project made the site twice as fast.

Robert Johnson, Facebook's director of engineering and a speaker at the upcoming Velocity and OSCON conferences, discusses that project and its accompanying lessons learned below. Johnson's insights have broad application -- you don't need hundreds of millions of users to reap the rewards.

Facebook recently overhauled its platform to improve performance. How long did that process take to complete?

Robert Johnson: Making the site faster isn't something we're ever really done with, but we did make a big push the second half of last year. It took about a month of planning and six months of work to make the site twice as fast.

What big technical changes were made during the rewrite?

Robert Johnson: The two biggest changes were to pipeline the page content to overlap generation, network, and render time, and to move to a very small core JavaScript library for features that are required on the initial page load.

The pipelining project was called BigPipe, and it streams content back to the browser as soon as it's ready. The browser can start downloading static resources and render the most important parts of the page while the server is still generating the rest of the page. The new JavaScript library is called Primer.
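The pipelining idea behind BigPipe can be sketched with a generator: flush the page skeleton immediately, then stream each portion of the page as its (slow) server-side work finishes. This is not Facebook's code; the section names and delays are invented:

```python
# Illustrative generator-based sketch of BigPipe-style pipelining.
# A real implementation streams these chunks over a chunked HTTP
# response so the browser renders and fetches assets while the
# server is still generating the rest of the page.

import time

def render_page():
    # Flush the skeleton first so the browser can start working.
    yield "<html><body><div id=skeleton>loading...</div>"
    for name, work_seconds in [("header", 0.01), ("feed", 0.03)]:
        time.sleep(work_seconds)   # stand-in for server generation time
        yield f"<div id={name}>content for {name}</div>"
    yield "</body></html>"

chunks = list(render_page())
print(len(chunks))   # 4 chunks, each sent as soon as it was ready
```

Generation, network transfer, and rendering overlap instead of running strictly one after another, which is where the perceived speedup comes from.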

In addition to these big site-wide projects, we also performed a lot of general cleanup to make everything smaller and lighter, and we incorporated best practices such as image spriting.

Were developers encouraged to work in different ways?

This was one of the trickiest parts of the project. Moving fast is one of our most important values, and we didn't want to do anything to slow down development. So most of our focus was on building tools to make things perform well when developers do the things that are easiest for them. For example, with Primer, making it easy to integrate and hard to misuse was as important to its design as making it fast.

We also built detailed monitoring of everything that could affect performance, and set up systems to check code before release.

It's really important that developers be automatically alerted when there's a problem, instead of developers having to go out of their way for every change. That way, people can continue innovating quickly, and only stop to deal with performance in the relatively unusual case that they've caused a problem.
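The release-time checks described above amount to comparing a change's measured cost against a budget and flagging regressions automatically. A minimal sketch, with an invented page-weight budget and invented measurements:

```python
# Minimal automated performance gate: flag pages whose measured weight
# exceeds a budget, so developers are alerted only when they cause a
# problem. The budget and the measurement numbers are assumptions.

BUDGET_KB = 500

def check_release(measurements_kb):
    # Return only the pages that regressed past the budget.
    return {page: kb for page, kb in measurements_kb.items()
            if kb > BUDGET_KB}

before_release = {"home": 310, "profile": 480, "photos": 612}
print(check_release(before_release))   # {'photos': 612}
```

In the common case the check returns nothing and nobody is interrupted, which is the property Johnson emphasizes: developers keep moving fast and only stop when a real regression appears.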

How do you address exponential growth? How do you get ahead of it?

You never get ahead of everything, but you have to keep ahead of most things most of the time. So whenever you go in to make a particular system scale better, you can't settle for twice as good, you really need to shoot for 10 or 100 times as good. Making something twice as good only buys a few months, and you're back at it again as soon as you're done.

In general, this means scaling things by allowing greater federation and parallelism and not just making things more efficient. Efficiency is of course important, too, but it's really a separate issue.

Two other important things: have good data about how things are trending so you catch problems before you're in trouble, and test everything you can before you have to rely on it.

In most cases the easiest way for us to test something new is to put it in production for a small number of users or on a small number of machines. For things that are completely new, we set up "dark launches" that are invisible to the user but mimic the load from the real product as much as possible. For example, before we launched chat we had millions of JavaScript clients connecting to our backend to make sure it could handle the load.
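A dark launch can be sketched as a deterministic bucketing function plus a request handler that exercises the new code path but discards its output. This is a minimal illustration with hypothetical names, not Facebook's infrastructure:

```python
import hashlib

def in_dark_launch(user_id, fraction):
    """Deterministically bucket users so the same user is always in
    (or out of) the dark launch across requests and servers."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % 1000 < fraction * 1000

def handle_request(user_id, serve_old, serve_new, fraction=0.01):
    """Serve the existing code path, but also exercise the new one
    for a fraction of traffic. The new path's output is discarded,
    so users never see it -- but its load on the backend is real."""
    result = serve_old()
    if in_dark_launch(user_id, fraction):
        try:
            serve_new()  # generate real load; ignore the result
        except Exception:
            pass  # a failing dark path must never affect users
    return result

# Demo: with fraction=1.0 every request also hits the new path.
calls = []
result = handle_request(7, lambda: "old",
                        lambda: calls.append(1), fraction=1.0)
```

Hashing the user id (rather than picking randomly per request) keeps each user's experience consistent and makes the generated load pattern realistic.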

Facebook's size and traffic aren't representative of most sites, but are there speed and scaling lessons you've learned that have universal application?

The most important one isn't novel, but it's worth repeating: scale everything horizontally.

For example, if you had a database for users that couldn't handle the load, you might decide to break it into two functions -- say, accounts and profiles -- and put them on different databases. This would get you through the day, but it's a lot of work and it only buys you twice the capacity. Instead, you should write the code to handle the case where two users aren't on the same database. This is probably even more work than splitting the application code in half, but it will continue to pay off for a very long time.
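The "users on different databases" approach amounts to sharding by user id. A minimal sketch, with hypothetical shard names standing in for real database connections:

```python
def shard_for_user(user_id, num_shards):
    """Pure function of the user id, so any server can compute the
    shard without coordination or a central lookup service."""
    return user_id % num_shards

# Hypothetical shard names; in practice these would be connection
# handles, and there would be far more than four.
SHARDS = [f"db{i}" for i in range(4)]

def db_for_user(user_id):
    return SHARDS[shard_for_user(user_id, len(SHARDS))]
```

Modulo is the simplest possible scheme; its weakness is that changing the shard count moves most keys, which is why production systems often prefer consistent hashing or a directory service. The payoff the interview describes is the same either way: once application code tolerates two users living on different databases, capacity grows by adding machines rather than by re-architecting.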

The most important thing here isn't to have fancy systems for failover or load balancing. In fact, those things tend to take a lot of time and get you in trouble if you don't get them right. You really just need to be able to split any function to run on multiple machines that operate as independently as possible.

The second lesson is to measure everything you can. Performance bottlenecks and scaling problems are often in unexpected places. The things you think will be hard are often not the biggest problems, because they're the things you've thought about a lot. It's actually a lot more like debugging than people realize. You can't be sure your product doesn't have bugs just by looking at the code, and similarly you can't be sure your product will scale because you designed it well. You have to actually set it up and pound it with traffic -- real or test -- and measure what happens.
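"Measure everything you can" often starts with instrumentation as simple as a timing decorator. This is an illustrative sketch, not Facebook's monitoring stack; the function name is made up:

```python
import time
from collections import defaultdict

# Per-function call timings, keyed by function name.
timings = defaultdict(list)

def measured(fn):
    """Record wall-clock time for every call so hot spots show up
    in data instead of in guesses about where the code is slow."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@measured
def render_profile(user_id):
    # Stand-in for real work; only the timing plumbing matters here.
    return f"profile:{user_id}"

result = render_profile(42)
```

In production the recorded samples would be shipped to an aggregation system rather than kept in memory, but the principle is the one in the interview: you find bottlenecks by pounding the system with traffic and measuring, not by inspection.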

What is Scribe? How is it used within Facebook?

Scribe is a system we wrote to aggregate log data from thousands of servers. It turned out to be generally useful in a lot of places where you need to move large amounts of data asynchronously and you don't need database-level reliability.

Scribe scales to extremely large volumes -- I think we handle more than 100 billion messages a day now. It has a simple, easy-to-use interface, and it handles temporary network or machine failures gracefully.
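The "handles temporary failures, without database-level reliability" tradeoff can be illustrated with a toy buffering client. This is a sketch of the general pattern, not Scribe's actual C++ implementation: messages are buffered locally, delivery is retried on failure, and only a full buffer forces the oldest messages to be dropped.

```python
from collections import deque

class LogClient:
    """Toy Scribe-style delivery: buffer locally, retry on failure,
    and drop the oldest messages only on overflow -- high throughput
    with best-effort reliability, not transactional guarantees."""

    def __init__(self, send, max_buffer=10000):
        self.send = send  # ships one message downstream; may raise
        self.buffer = deque(maxlen=max_buffer)  # overflow drops oldest

    def log(self, category, message):
        self.buffer.append((category, message))

    def flush(self):
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except OSError:
                return False  # keep buffered; retry on next flush
            self.buffer.popleft()
        return True

# Simulated transport that fails once, then succeeds.
sent = []
state = {"fail": True}
def send(msg):
    if state["fail"]:
        state["fail"] = False
        raise OSError("network down")
    sent.append(msg)

client = LogClient(send, max_buffer=3)
client.log("perf", "page_load 120ms")
first = client.flush()   # transport fails; message stays buffered
second = client.flush()  # retry succeeds
```

Dropping on overflow instead of blocking is exactly what makes this class of system unsuitable for user data but ideal for high-volume metrics, as the next answers discuss.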

We use Scribe for everything from logging performance data, to updating search indexes, to gathering metrics for platform apps and pages. There are more than 100 different logs in use at Facebook today.

I was struck by a phrase in one of your recent blog posts: You said Scribe has a "reasonable level of reliability for a lot of use cases." How did you sell that internally?

For some use cases I didn't. We can't use the system for user data because it's not sufficiently reliable, and keeping user data safe is something we take extremely seriously.

But there are a lot of things that aren't user data, and in practice, data loss in Scribe is extremely rare. For many use cases it's well worth it to be able to collect a massive amount of data.

For example, the statistics we provide to page owners depend on a large amount of data logged from the site. Some of this is from large pages where we could just take a sample of the data, but most of it is from small pages that need detailed reporting and can't be sampled. A rare gap in this data is much better than having to limit the number of things we're able to report to page owners, or only giving approximate numbers that aren't useful for smaller pages.

This interview was condensed and edited.

Robert Johnson will discuss Facebook's optimization techniques at the Velocity Conference (6/22-6/24) and OSCON (7/19-7/23).
