@andrewchen


Built to Fail: How companies like Google, IDEO, and 37signals build failure-tolerant systems for anything!


Failure is fun, but sometimes only for the people watching – courtesy of GapingVoid

Planning for success, not failure
High-achieving people with a long history of success often plan accordingly: they plan for success in whatever they do. And when you take a successful person and put them in a successful big company that’s already making money from its products, there’s even more reason to plan for high-achievement outcomes.

But let’s say that you take these successful people and put them in environments of great uncertainty, like a Silicon Valley startup – what happens? That’s when realities collide! When you apply the big successful company playbook to startups, you can end up with monolithic planning processes, products that can’t find their markets, and lots of money being spent on launches for the wrong products. It’s not that these tactics are stupid, it’s just that they don’t work as well when you’re dealing with ill-defined customer problems with unknown solutions.

At the heart of this conversation is a question: what happens when you take something that’s usually assumed to be successful, and instead assume that it’s very likely to fail?

In a way, you can think of this as planning to fail, but then building the support structure around the failure in order to create a failure-tolerant system. Let’s dive into this.

Planning for failure, not success
The title of this post refers to the fact that companies like Google, IDEO, and 37signals all have the culture of “Failure is OK” built into them.

At Google:

  • Google makes money by being always available, ubiquitous, and having a great product
  • To deliver their service, they have hundreds of thousands of servers (maybe more?)
  • Any one of these servers has a high likelihood of failing at any time
  • To create a fault-tolerant system, they have lots of redundancy and lots of sophistication around what happens when an individual box fails
  • Contrast this to a big-iron approach that builds all the redundancy into specialized hardware that’s designed to never fail
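
To make the redundancy idea concrete, here is a minimal sketch in Python. It is not Google's actual design: the independence assumption in availability(), and the names serve(), dead_box(), and healthy_box(), are all hypothetical and only there to illustrate the idea of cheap boxes plus failover.

```python
from typing import Callable, Sequence


def availability(p_fail: float, replicas: int) -> float:
    """Chance that at least one of `replicas` boxes is up.

    Assumes (for illustration only) that each box fails
    independently with probability p_fail.
    """
    return 1.0 - p_fail ** replicas


def serve(request: str, replicas: Sequence[Callable[[str], str]]) -> str:
    """Try each replica in turn and return the first successful response."""
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            continue  # this box is down; move on to the next one
    raise RuntimeError(f"all {len(replicas)} replicas failed")


def dead_box(request: str) -> str:
    """Hypothetical replica that is always down."""
    raise ConnectionError("box is down")


def healthy_box(request: str) -> str:
    """Hypothetical replica that always answers."""
    return f"results for {request!r}"
```

Calling serve("startups", [dead_box, healthy_box]) skips the dead box and answers from the healthy one. Under the independence assumption, three boxes that are each down 10% of the time are all down together only about 0.1% of the time, which is why many unreliable boxes can add up to a reliable service.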

At IDEO:

  • Companies hire IDEO to give them fresh designs based on a customer-focused approach
  • Part of every project involves lots of brainstorming and coming up with ideas
  • However, any specific idea is likely bad (for example, 12 out of 4,000 toy ideas were actually successful = 0.3%)
  • Thus, IDEO combines structured brainstorming, rapid prototyping, and field research to rapidly try out new concepts and get to good products
  • Contrast this to a process where the “Great Man” designer thinks about a design problem and then comes up with the right solution spontaneously
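
The 0.3% figure above is worth a bit of arithmetic. This sketch takes the 12-out-of-4,000 numbers from the list and asks how many ideas you would need for a decent shot at one hit. It assumes every idea succeeds independently at the same rate, which is a simplification for illustration, not a claim about how IDEO works; the function names are hypothetical.

```python
import math

HITS = 12                 # successful toy ideas, from the figure above
IDEAS = 4000              # total toy ideas, from the figure above
HIT_RATE = HITS / IDEAS   # 0.003, i.e. 0.3%


def p_at_least_one_hit(n_ideas: int, hit_rate: float = HIT_RATE) -> float:
    """Chance of at least one success among n_ideas independent tries."""
    return 1.0 - (1.0 - hit_rate) ** n_ideas


def ideas_needed(target: float, hit_rate: float = HIT_RATE) -> int:
    """Smallest number of ideas giving at least `target` chance of one hit."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - hit_rate))
```

With these assumptions, ideas_needed(0.95) comes out on the order of a thousand ideas, which is the whole point: when any single idea is probably bad, volume and fast prototyping matter more than betting on one "right" design.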

At 37signals, in particular Ruby on Rails:

  • Rails is a framework built for programmers to build websites
  • Of course, every web project requires lots of lines of code which can easily break at any moment
  • If you assume that programmers will more often write code that is buggy and breaks, then you’ll want to make testing and iteration easy – this is at the heart of Agile, TDD, continuous integration, and other related disciplines
  • Contrast this to a waterfall engineering approach which assumes the correct design and architecture can be thought out by experienced software engineers
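
Here is roughly what "assume the code is buggy, so make testing easy" looks like in practice. Rails itself is Ruby and ships its own testing tools; this sketch uses Python purely for illustration, and slugify() is a hypothetical helper, not something from Rails or 37signals.

```python
import re


def slugify(title: str) -> str:
    """Turn a post title into a URL-friendly slug (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


# Small, fast tests written alongside the code. A runner such as pytest
# would pick these up by name; they can also be called directly.
def test_basic_title():
    assert slugify("Built to Fail") == "built-to-fail"


def test_punctuation_and_padding():
    assert slugify("  Failure is OK!  ") == "failure-is-ok"


def test_nothing_usable():
    assert slugify("!!!") == ""
```

The tests take seconds to write and milliseconds to run, so a broken change gets caught on the next run rather than after launch.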

Each of these examples is unique in its own way, but similar themes pervade all of them.

Characteristics of failure-tolerant systems
Each one of these systems takes the central part of a process and assumes failure, and then builds up a support system around it.

This happens by building on a few core principles:

  • Acceptance of failure: You have to accept that shit happens and failure is commonplace – this needs to be internalized so that failure isn’t punished, but rather embraced!
  • Massive redundancy: Then, it needs to be easy to have lots of redundancy built into the system – for designers, that means lots of designs get generated. For startups, that means lots of ideas are tested, and for Google, that means lots of servers are used
  • Cheap, easy, fast: As a side-effect of the redundancy, it needs to be easy, cheap, and fast to have lots of ideas, lots of servers, or write lots of code. The harder it is, the harder it will be to create redundancy
  • Iterative, reality-based testing: Testing these individual components constantly becomes key – you need to force failure on the system to figure out how it reacts from a system-wide level
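
The last principle, forcing failure on the system to see how it reacts, can be sketched as a tiny fault-injection test. Everything here (the Pool class, chaos_test(), the seed) is hypothetical and deliberately toy-sized; it only illustrates the idea of breaking parts on purpose and checking the whole.

```python
import random


class Pool:
    """A toy pool of interchangeable boxes, each either up or down."""

    def __init__(self, size: int) -> None:
        self.up = [True] * size

    def kill(self, index: int) -> None:
        self.up[index] = False

    def handle(self, request: str) -> str:
        # Any healthy box can answer; use the first one found.
        for index, alive in enumerate(self.up):
            if alive:
                return f"box-{index} handled {request}"
        raise RuntimeError("no healthy boxes left")


def chaos_test(pool_size: int, kills: int, seed: int = 0) -> bool:
    """Knock out `kills` random boxes and report whether the pool still serves."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    pool = Pool(pool_size)
    for index in rng.sample(range(pool_size), kills):
        pool.kill(index)
    try:
        pool.handle("ping")
        return True
    except RuntimeError:
        return False
```

In this toy, chaos_test(5, 4) survives because one box is left, while chaos_test(5, 5) does not. Running checks like this constantly is how you find out how much failure the system can actually absorb.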

Building up processes based on the ideas above makes it easier and easier to deal with failure and come out on the other side!

Conclusion and next ideas
There are lots of interesting directions that this line of thinking can go.

This line of thinking started with the hiring process, and the idea that maybe interviews don’t work at all – there’s a bunch of academic research that implies that, actually. So how would you build a failure-tolerant system around the hiring process, if you assume that doing well in an interview has no correlation with being a successful employee?

For dating, what happens if you assume that the people you like to date may not be the kind of person you’d have a successful marriage with? What if people suck at figuring out what kind of guy or gal is the “type you’d bring home to Mom?” I think anyone could attest to the idea that many people suck at figuring out the right person to date, much less the right kind of person to marry. I personally find it crazy that people make a 50+ year decision to be married based on an 18-month sample size :-)

For careers, what if it turns out that people are really bad at figuring out what they’ll actually want to do 40 hours a week, 50 weeks a year, for the rest of their lives? How would you figure out the right career sooner rather than later?

All of these are great thought experiments, I think.

What else am I missing? :-) I’d love to take any suggestions and write up some thought experiments around it.

Want more?
If you liked this post, please subscribe or follow me on Twitter. You can also find more essays here.

Written by Andrew Chen
July 13th, 2009 at 8:30 am
  • rogervalade

    Nice article, Andrew. If you don't mind, I'll be making a reference to it from my blog, Fail Fast (http://failfast.me) — it is very relevant to the theme I'm interested in!

    Roger.

  • http://www.vladimiroane.com Vladimir Oane

    The fail fast idea is quite stupid I think. You should spend more time on it and iterate fast. Don't quit just fast….

  • http://www.veerwest.com/ cedric

    Interesting point. Perhaps a better example with Google would be their 20% policy, where engineers are encouraged to pursue their own ideas. Most of it is discarded, but a few projects turn into highly successful products (gmail & co..)

  • http://blog.aisleten.com MicahWedemeyer

    A little on the harsh side, but I agree. Give yourself a little time to see if you're on the right track. Real success takes years, not a couple days.

  • http://twitter.com/ksyed0 Kamal Syed

    This sounds very random.

    There are a lot of analysts that would wholeheartedly disagree with this concept.

    I think it's more likely that this just demonstrates that relationships and planned outcomes are a lot more complicated than we can expect or perhaps even understand fully.

    “Planning to fail” is a key part of contingency planning and risk mitigation, but it's a very different thing than expecting random outcomes (particularly in personal relationships).

    MKS

  • http://Scale.cc Vincent Chan

    As Ram Charan said: Failure is a fact of life for companies that pursue innovation seriously and leaders should know that failures represent opportunities to learn.

    The key should be finding the right metrics to measure failures and learn from them quickly. If you have read “The Game-Changer” by A.G. Lafley and Ram, you will find out even P&G has adopted similar approaches.

    Additional information on how failure breeds success, including the examples of GE, Intuit, Coke…etc:

    http://www.businessweek.com/magazine/content/06…

  • http://www.zoscomm.com jonziskind

    Andrew –

    We are scaling up our company right now and pressing forward and this piece is fantastic. Thanks.

    Jon at zhiing

  • http://jagtesh.tumblr.com jagtesh

    Good article, Andrew.

  • samnet

    I agree and it's sort of a learning process. How many times did you fall down before you learned to walk or run?
    You didn't just one day say — ok lets start walking today!

  • peter_zaballos

    Great post Andrew;
    A nuance I think you touch on is that organizations who embrace the reality of failure are not so much motivated by a fear of not succeeding; they know that the knowledge gained from failure only speeds them to success. So, yes, they love the idea of lots of low cost experiments because it's the failure as much as the success with these that will provide a durable advantage. IDEO's “structured brainstorming” is a wonderful “operationalization” of this philosophy.

    Organizations who fear failure build in a structural inability to be nimble, to learn, and to make the critical adaptations rapidly that will get them to even be in the same game. They may succeed at not making a “mistake” but winning that battle will lose them the war.

    I wrote a post on this theme of “Lots of Low Cost Experiments” earlier in the year, which touches on some of the key points in your post today. http://openambition.com/2009/04/22/lots-of-low-…

    Pete

  • http://twitter.com/wmbenedetto Warren Benedetto

    You might want to watch the first few minutes of Jason Fried (37 Signals) giving the keynote at BigOmaha recently: http://vimeo.com/4717683?pg=embed&sec=

    Based on his comments about not embracing failure, I have a feeling he might disagree with your thesis about his company. Maybe you should consider updating the post to reference Ruby On Rails specifically, rather than 37 Signals. That seems to be a more accurate comparison.

  • http://twitter.com/rodet Stephane Rodet

    I just limit myself to the Google argumentation:
    Well, it's a different architecture, with different needs too. The mainframe is much more expensive, so it all comes down to requirements. If one Google box that references 10,000 websites fails, and those sites are missing from a search result, that's no big deal. Likewise, if 1,000 Gmail users can't log in for 15 minutes, that's no big deal – no one is going to be fired or sue Google over that.
    Now think about 1,000 transactions on the NYSE failing. That's a different kind of consequence: it means $$$ in losses. Or would you like your monthly paycheck to be lost at random? Probably not. These are the cases where you need special hardware. It's fault-tolerant, complex, and expensive, because it would be even more expensive not to have that degree of fault tolerance. That's why both architectures coexist. There is no “one size fits all” solution.

    BTW, some study has shown that people that have a plan B tend to be less depressive because they can adapt better to the new reality… So failure has to be in the plan, too.

  • http://highlandersys.com ssaikia

    WHO IS YOUR TARGET AUDIENCE FOR THIS POST?

    The concepts outlined in this post probably apply to bigger organizations that have significant resources. A small resource-constrained startup is not going to plan for failure and build the redundancy that you prescribe.

    NOT a good post Andrew – it does not sound like you know your material! Sorry for this harsh criticism.

  • http://twitter.com/_HappyCloud_ Happy Cloud Moments

    Hi Andrew,

    Thanks for stopping by my blog! You have some interesting articles here.

  • segdeha

    There is a corollary here with different styles of dog training (believe it or not). Some people practice corrective training where the dog is “punished” (usually verbally) for incorrect behavior and praised/rewarded for correct behavior. Another style, using “markers”, uses rewards (sometimes praise, but more often food) for correct behavior and verbal markers for feedback when the dog is doing either the wrong thing or is on the right track.

    The difference is subtle between “no” as feedback and “no” as correction, but the result is dogs that are problem solvers rather than ones afraid to try new things for fear of being punished. The goal of any company should be to “breed” problem solvers. You do this, as Andrew points out, by planning for, accepting, and even celebrating failed attempts to solve problems.

  • mxaddison

    Andrew — I can say with experience, at least to the part about marriage, that if you have pure passion for each other you'll figure out the rest. Perhaps the same holds true for entrepreneurship.

  • http://twitter.com/purlem Martin Thomas

    Interested in your thoughts.. Does 37signals or E-myth have the right philosophy for business today? http://www.purlem.com/blog/?p=38

  • woodka

    A waterfall model with iterative design-build-test rapid prototype built within it can work well, especially if everything is well documented. For instance, design with user interface models to determine what interface a user likes and features needed, then move on to the build cycle to build the actual system, loop back into design if you need to. Keeps you from building things before the user knows what they want, and lets you go back to change things as needed. But still move through the waterfall steps as you go. It also helps if you ALWAYS build in testability and use a very modular design. Have standard code pieces in the library once verified so you're not reinventing wheels. This does take a lot of discipline, and what I've found doing software process consulting is that a lot of those who dis software engineering models like waterfall simply have never worked in a well-structured software environment. Once they're introduced to a good process with design and code reviews, good documentation and configuration management, they love it and find it really helps their creativity and ability to get things done.

    Don't play to the “cowboys” here — there is always a need for process, and anything has a process, even if it is chaotic. It can be documented and managed well to produce a better result.

  • http://www.philsimonsystems.com/ Phil Simon

    Interesting post. I have a few thoughts. Failure is probably more tolerated at Google than at other organizations because they have so many things going on. A problem with Orkut or Google Maps might be inconvenient, but it will hardly tarnish Google's brand. Plus, Google's apps are so widely used in many cases that problems will be discovered and fixed relatively quickly.

    Also, as Cedric mentioned, Google's 20% policy is huge. In a way, they bake failure into the culture. Relative to other companies, at Google there doesn't appear to be the blame placed on those who tried and failed.

    However, for a large organization in the middle of a massive IT project affecting supplies, employees, or vendors, failure may be less acceptable. The consequences can be so severe that it affects employees receiving paychecks, corporate security, the accuracy of financial reports, and the like. This is particularly pronounced in development efforts that follow the Waterfall method.

    If adopted, methodologies like Agile software development should decrease failure rates. Of course, people will ultimately determine whether these projects fail much more than any methodology or technology.

  • http://twitter.com/kevinshaum Kevin Shaum

    I think you are mistaken about what the term “fail fast” means:

    http://en.wikipedia.org/wiki/Fail-fast

  • sumomo

    excellent article, you made my day, i'm a teen, and i learn more from this one page of interesting ideas and concepts than the 1000 page of boring history book that is required.

    you know what, the educational system in America needs a REALITY CHECK.

  • http://twitter.com/bryanzmijewski Bryan Zmijewski

    …little late to this post, but I was on the Skyline team (acquired by IDEO) that came up with those 12 ideas each year :)

    In my three years, I was involved in about 40 ideas that got licensed and probably 15 that made it on the shelves. If you counted all the ideas that we came up with that never made it to a sketch, we probably had over 20,000 ideas. I'd say the hit ratio is even smaller!
