Designed to Fail Well
Originally published on September 23, 2010, and updated on May 22, 2023.
Think about the last time you used an ATM. Chances are there is one near you, and you've used it more times than you can count. Have you ever received the wrong amount of money from an ATM? I'm guessing your answer is no, despite the millions of bills they distribute every year. That reliability, in a task as demanding as dispensing exactly the right amount of cash, is a testament to the ingenuity of their design. But how do they achieve it?
It may surprise you that the answer is rather straightforward. ATMs employ a relatively simple mechanism that pulls bills one by one from a stack, verifying their thickness as they pass through. If a bill is thicker than expected—likely because multiple bills are stuck together—it’s rejected and set aside for human inspection. The system isn’t perfect, but it’s designed to “fail well.”
So, What Does This Have to Do With Software Design?
The ATM’s approach offers valuable lessons:
- Perform a task.
- Verify the task.
- If verification fails, stop and try again.
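The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real ATM firmware works: the `Dispenser` class, the thickness values, and the tolerance are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical numbers; a real machine would calibrate these per currency.
NOMINAL_THICKNESS_MM = 0.11
TOLERANCE_MM = 0.03


@dataclass
class Dispenser:
    stack: list[float]  # measured thickness of each pick, in mm
    reject_bin: list[float] = field(default_factory=list)

    def dispense(self, count: int) -> int:
        """Pull picks one at a time and return how many bills were delivered."""
        delivered = 0
        while delivered < count and self.stack:
            thickness = self.stack.pop()  # perform: pull one pick off the stack
            if thickness > NOMINAL_THICKNESS_MM + TOLERANCE_MM:  # verify
                # Too thick, probably bills stuck together: set the pick
                # aside for a human and try again with the next one.
                self.reject_bin.append(thickness)
                continue
            delivered += 1
        return delivered
```

Note that nothing here tries to separate the stuck bills. The pick is simply set aside and the loop moves on.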
The beauty of this design is in its simplicity. Instead of creating a complex mechanism to ensure 100% reliability, ATMs are designed to handle failure gracefully. It’s a lesson that software developers can take to heart.
Somehow Many Developers Didn’t Get This Memo
In order to build reliable systems, you need to:
- Perform a single, hopefully verifiable, task.
- Verify said task.
- If verification failed, then undo what can be undone, and notify someone.
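As a rough sketch, the list above might translate into a small helper like the one below. The `run_step` name and its callback parameters are made up for illustration, and "notify someone" is reduced to an error log entry; a pager or email hook would fit in the same spot.

```python
import logging
from typing import Callable

log = logging.getLogger("fail_well")


def run_step(
    name: str,
    perform: Callable[[], None],
    verify: Callable[[], bool],
    undo: Callable[[], None],
) -> bool:
    """Run one task, verify it, and on failure undo and notify. Returns success."""
    try:
        perform()
        if verify():
            return True
        reason = "verification failed"
    except OSError as exc:  # catch only the failures we expect; let the rest surface
        reason = f"perform raised {exc!r}"
    undo()
    # Notify: here just a log entry that a human can act on.
    log.error("step %r failed: %s; changes undone", name, reason)
    return False
```

For example, a step that appends to a list but fails its check leaves the list as it found it:

```python
ledger: list[int] = []
ok = run_step(
    "record withdrawal",
    perform=lambda: ledger.append(40),
    verify=lambda: sum(ledger) == 80,  # fails: only 40 was recorded
    undo=lambda: ledger.pop(),
)
```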
However, there are also practices to avoid when designing software:
- Don’t perform multiple operations before verification: Keep things simple and verify each step.
- Avoid deleting things for cleanup: Instead, quarantine problematic files to preserve state information that can help resolve issues later.
- Don’t automatically fix the issue: Unless you’re sure what the problem is, code that tries to repair it can cause additional errors.
- Never fail without logging or isolating state: When something fails, gather relevant data and put it in a known error location.
- Don’t perform operations in a shared location: If failure is possible, carry out the operation in a staging area first.
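Several of these points (stage first, quarantine instead of delete, log on failure) can be combined in one short sketch. The `publish` function and its directory parameters are hypothetical, and comparing SHA-256 digests is just one possible verification step.

```python
import hashlib
import logging
import os
import shutil
from pathlib import Path

log = logging.getLogger("publish")


def publish(data: bytes, name: str, staging: Path, shared: Path, quarantine: Path) -> bool:
    """Write to a staging area, verify, then move into the shared location."""
    for directory in (staging, shared, quarantine):
        directory.mkdir(parents=True, exist_ok=True)

    staged = staging / name
    staged.write_bytes(data)

    # Verify by reading the file back and comparing digests.
    expected = hashlib.sha256(data).hexdigest()
    actual = hashlib.sha256(staged.read_bytes()).hexdigest()
    if actual != expected:
        # No delete and no attempted repair: keep the evidence for a human.
        shutil.move(str(staged), str(quarantine / name))
        log.error("publish of %s failed verification; file moved to %s", name, quarantine)
        return False

    # os.replace is atomic when both paths are on the same filesystem,
    # so readers of the shared location never see a half-written file.
    os.replace(staged, shared / name)
    return True
```

The shared location is only touched in the last line, after the check has passed.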
Simplicity, Verification, and Effective Failure Handling
In a nutshell, the philosophy of designing software should be akin to the design of ATMs—embrace simplicity, ensure verification, and handle failure effectively. Software doesn’t need to be complex to be reliable; it needs to fail well. Let’s continue to draw inspiration from the physical engineering world in designing effective software mechanisms and remember this important lesson. Have a project or problem that could use some advice? Let’s schedule a time to talk about how we can help with the solution.
Loved the article? Hated it? Didn’t even read it?
We’d love to hear from you.
ATMs can still give out the wrong amounts if the technician swaps the $10 and $20 cartridges, yielding the depressing truth that there is always room for error. I know because it happened to me once, and I immediately marched on over to a nearby branch and joined quite the line of folks who were not amused … imagine getting $80 deducted but only getting $40 in cash doled out! (Some machines seem to have mitigated this by saying "20s only.")
Still, this is good advice. Like traffic lights–if a timing error causes lights in opposing directions to go green at once, then a physical fuse is blown, causing them all to go to the flashing yellows/reds circuit. A lot of software could use the "fuse technique," where it just throws up its hands and says "I give up, nobody does anything till somebody takes a good hard look at me." Unfortunately, though, this kind of fail-safe in software usually means "crash" instead of "reduced functionality mode."
Especially true if failure is rare. Why have brittle exception code that is rarely tested or exercised when you can just hit "Retry"?
I’ve had incorrect money from an ATM… only once, mind, and it was a hardware problem – the note was old and crinkled and got lost in the mechanism somehow.
@Nicholas @mat This is the internet, I knew as soon as I posted this I would get 1000 comments about people who got wrong amounts from an ATM 🙂
So, to summarize the entire article…
Oh, and KISS.
Plus, verify (which is simpler to do because of KISS).
Which is why we have developer guidelines about NOT using `catch(Exception)` (unless followed by a quick software exit), because it’s (1) not simple, and (2) it’s impossible to know what exactly you’re catching (OutOfMemoryException, anyone?) and whether it’s actually safe for your program to continue executing…
Any thoughts on how to remove complexity from systems? I find that many developers take perverse joy in over-complicating things… 😉
Keep it Simple, Stupid.
Because if we think we’re all that, we’ve already lost. We need to keep reminding ourselves that we’re really Not So Smart, and by doing so we’ll ensure that the software we write can actually be understood by Mere Mortals, which behooves us all because we are mere mortals…
Well, of course! We’re software developers, obsessed with the corner cases instead of the big picture =)
It isn’t identical, but the goal to "try not to let failure cases complicate your design" feels very similar to the "let it fail" approach of Erlang.
I haven’t worked in it, but from what I understand, Erlang has an interesting SRP take on handling failure. You don’t muddy up your domain implementation with a ton of error-compensating code. Instead you have supervisor processes that watch your implementation process and decide what to do if a failure occurs. Is that about right?
On a completely different note, this also reminds me of one of the ways I get a lot of value from TDD. When practicing TDD, I find I’m much more inclined to think about how the code should fail. And often, I’m able to redesign the API so that instead of compensating for an error, the API doesn’t allow the error state to exist.
@Al It is funny that you say that, because this whole post almost turned into a post about failures in software. I agree, in most cases polluting your code with error handling is just a waste. It is usually better to instead spend time logging, so when an unexpected error occurs, you know what happened and are able to compensate by fixing the software, not having it go through elaborate gyrations in an attempt to fix itself.
I can’t be the only person who’s seen a BSOD and a "DHCP address not available" on an ATM tho…
@Damien Yep, once us software engineers get our grubby little mitts in there, all bets are off. 🙂
The ATM is a very good example of a bad design that was redesigned correctly: from withdraw/return money/return card to withdraw/return card/return money. This could also be an example of engineers designing something and not using the product themselves right away. 2nd-year philosophy of science… I love this article though.