Designed to Fail Well
Originally published on September 23, 2010, and updated on May 22, 2023.
Think about the last time you used an ATM. Chances are, you have one in your vicinity, and you’ve interacted with it more times than you can count. Have you ever received an incorrect amount of money from an ATM? I’m guessing your answer is no, despite the millions of bills they distribute every year. The reliability of ATMs, despite handling a task as complex as dispensing the correct amount of cash, is a testament to the ingenuity of their design. But how do they achieve such high reliability?
It may surprise you that the answer is rather straightforward. ATMs employ a relatively simple mechanism that pulls bills one by one from a stack, verifying their thickness as they pass through. If a bill is thicker than expected—likely because multiple bills are stuck together—it’s rejected and set aside for human inspection. The system isn’t perfect, but it’s designed to “fail well.”
So, What Does This Have to Do With Software Design?
The ATM’s approach offers valuable lessons:
- Perform a task.
- Verify the task.
- If verification fails, stop and try again.
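In software terms, the three steps above might look something like the following minimal Python sketch. The names `perform_verified`, `reject_bin`, and `VerificationError` are hypothetical and chosen only for illustration. The retry limit is also an assumption, added so that "try again" cannot loop forever.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class VerificationError(Exception):
    """Raised when no attempt produced a result that passed verification."""


def perform_verified(
    perform: Callable[[], T],
    verify: Callable[[T], bool],
    reject_bin: List[T],
    max_attempts: int = 3,
) -> T:
    """Perform a task, verify it, and try again if verification fails."""
    for _ in range(max_attempts):
        result = perform()          # 1. perform the task
        if verify(result):          # 2. verify the task
            return result
        reject_bin.append(result)   # 3. set the bad result aside, then retry
    raise VerificationError(f"no verifiable result after {max_attempts} attempts")
```

Rejected results go into `reject_bin` rather than being thrown away, loosely mirroring the way the ATM sets stuck bills aside for a person to inspect.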
The beauty of this design is in its simplicity. Instead of creating a complex mechanism to ensure 100% reliability, ATMs are designed to handle failure gracefully. It’s a lesson that software developers can take to heart.
Somehow Many Developers Didn’t Get This Memo
In order to build reliable systems, you need to:
- Perform a single, hopefully verifiable, task.
- Verify said task.
- If verification failed, then undo what can be undone, and notify someone.
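One way to picture this recipe in code is the sketch below. It is only an illustration under stated assumptions: `Step`, `run_step`, and the `notify` callback are hypothetical names, and treating an exception during the task as a failed verification is a design choice made here, not something the recipe prescribes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One small task bundled with its own check and its own undo."""
    name: str
    perform: Callable[[], None]
    verify: Callable[[], bool]
    undo: Callable[[], None]


def run_step(step: Step, notify: Callable[[str], None]) -> bool:
    """Run a single step; on failure, undo it and tell someone."""
    reason = "failed verification"
    try:
        step.perform()
        ok = step.verify()
    except Exception as exc:  # assumption: an exception counts as a failure
        ok, reason = False, f"raised {exc!r}"
    if ok:
        return True
    step.undo()  # undo what can be undone
    notify(f"step {step.name!r} {reason}; undo was attempted")
    return False


if __name__ == "__main__":
    ledger = {"balance": 50}
    withdraw = Step(
        name="withdraw 80",
        perform=lambda: ledger.update(balance=ledger["balance"] - 80),
        verify=lambda: ledger["balance"] >= 0,
        undo=lambda: ledger.update(balance=ledger["balance"] + 80),
    )
    run_step(withdraw, notify=print)
```

Keeping each step small is what makes the undo tractable: there is exactly one change to reverse and one message to send.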
However, there are also practices to avoid when designing software:
- Don’t perform multiple operations before verification: Keep things simple and verify each step.
- Avoid deleting things for cleanup: Instead, quarantine problematic files to preserve state information that can help resolve issues later.
- Don’t automatically fix the issue: Unless you’re certain what went wrong, code that tries to repair an error can introduce new ones.
- Never fail without logging or isolating state: When something fails, gather relevant data and put it in a known error location.
- Don’t perform operations in a shared location: If failure is possible, carry out the operation in a staging area first.
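Several of these points can be combined in one small file-publishing routine. The sketch below is hedged accordingly: `publish`, `same_checksum`, and the directory parameters are hypothetical names, and comparing SHA-256 checksums is just one assumed form of verification. The work happens in a staging directory, a failed copy is quarantined instead of deleted, and the failure is logged along with where the evidence was kept.

```python
import hashlib
import logging
import shutil
import tempfile
from pathlib import Path
from typing import Callable

log = logging.getLogger("publish")


def same_checksum(source: Path, staged: Path) -> bool:
    """Assumed verification: the staged copy has the same SHA-256 as the source."""
    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    return digest(source) == digest(staged)


def publish(
    source: Path,
    shared_dir: Path,
    quarantine_dir: Path,
    verify: Callable[[Path, Path], bool] = same_checksum,
) -> bool:
    """Copy source into shared_dir, but only after a verified staging copy."""
    staging_dir = Path(tempfile.mkdtemp(prefix="staging-"))
    staged = staging_dir / source.name
    shutil.copy2(source, staged)  # work in the staging area, not the shared one

    if not verify(source, staged):
        # Quarantine the bad copy so its state can be inspected later.
        quarantine_dir.mkdir(parents=True, exist_ok=True)
        kept = quarantine_dir / staged.name
        shutil.move(str(staged), str(kept))
        staging_dir.rmdir()  # the staging directory is empty at this point
        log.error("verification failed for %s; staged copy kept at %s", source, kept)
        return False

    shared_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staged), str(shared_dir / source.name))
    staging_dir.rmdir()
    return True
```

A call such as `publish(Path("report.txt"), Path("shared"), Path("quarantine"))` touches the shared directory only after the staged copy has passed its check. The sketch does not handle name collisions in the quarantine directory; a real version would probably add a timestamp or unique suffix.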
Simplicity, Verification, and Effective Failure Handling
In a nutshell, software design should follow the philosophy behind the ATM: embrace simplicity, verify each step, and handle failure effectively. Software doesn’t need to be complex to be reliable; it needs to fail well. Let’s keep drawing inspiration from physical engineering when we design software mechanisms.