Your Software Can Learn A Lot From ATMs

How many times have you used an ATM? A hundred times? A thousand? Depending on where you are from, you probably have an ATM on just about every corner, and you probably use them on a fairly regular basis. How many times has an ATM given you an incorrect sum of money? How many times have you heard of a friend receiving an incorrect sum from an ATM? I’m willing to bet the answer is never. Pretty amazing, right?

With the millions of bills that are fed out of ATMs every year, it is amazing how reliable the mechanisms inside them are, right? I mean, if ATMs so rarely give out incorrect amounts of money, they must have developed an amazingly reliable mechanism to divvy up the bills and feed them out to the user, right? All those crisp new bills are so insanely hard to get apart, I just had to wonder what sort of engineering had gone into making absolutely sure that bills didn’t get stuck together!

Well, I was watching one of those “how it works” sort of shows the other day, and it turns out that the solution is devilishly simple. You see, most ATMs have a relatively simple mechanism of bands which drags bills one by one off of a stack and counts them out. While this system is fairly reliable, it still leaves some room for error. The reliability comes from a check that happens after each bill is grabbed off the stack: the machine passes the bill through a mechanism which measures its thickness. If the thickness doesn’t fall within a certain threshold, the bill is rejected and dropped into a reject bin for inspection by a human.

So, what can our software learn from an ATM? Well, let’s break down the process:

  1. Perform task.
  2. Verify task.
  3. If verification failed, quit and retry.
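The loop above can be sketched in a few lines of Python. The thickness numbers and data representation here are made up purely for illustration:

```python
def dispense(stack, count, min_t=0.09, max_t=0.13):
    """Dispense `count` bills from `stack` (thicknesses in mm),
    rejecting any bill whose measured thickness is out of range."""
    dispensed, reject_bin = [], []
    while stack and len(dispensed) < count:
        bill = stack.pop()                 # 1. perform task
        if min_t <= bill <= max_t:         # 2. verify task
            dispensed.append(bill)
        else:
            reject_bin.append(bill)        # 3. reject, then retry with the next bill
    return dispensed, reject_bin
```

Note that a rejected bill never blocks the machine; it is set aside for a human, and the loop simply tries again.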

Interesting. The mechanism wasn’t designed to be 100% reliable, it was designed to fail well. Instead of trying to implement a system which would detect that a bill was too thick and then process it further in order to make the task succeed, it simply ditches the bill and moves on. From a physical engineering perspective this is probably pretty obvious. More mechanisms mean more cost, and more things which can go wrong. It wouldn’t make sense to try and build a more complex mechanism.

Somehow many developers didn’t get this memo. In order to build reliable systems, you need to:

  1. Perform a single, hopefully verifiable, task.
  2. Verify said task.
  3. If verification failed, then undo what can be undone, and notify someone.
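A generic skeleton of these three steps might look like the following. The callables and the use of `logging.error` as the "notify someone" channel are assumptions for the sketch, not a prescribed API:

```python
import logging

def run_step(perform, verify, undo):
    """Run one verifiable task; on failure, undo what can be
    undone and notify someone instead of pressing on."""
    result = perform()            # 1. perform a single, verifiable task
    if verify(result):            # 2. verify said task
        return result
    undo(result)                  # 3. undo what can be undone...
    logging.error("step failed, result=%r", result)  # ...and notify someone
    return None
```

In a real system "notify" might mean paging a human or raising an alert; the key point is that the failure path is short and dumb.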

What you don’t do:

  1. Perform multiple operations, then verify: If possible, keep things small and simple, and verify at each step. If you’re worried about failure, then keep things simple and isolated.
  2. Start deleting things to cleanup: If you start deleting things, and not everything can be removed, then you might lose state that will help you resolve issues later. What if you delete the file that tells you where files were originally put, and then the rest of the file copy fails? It would be better to try and quarantine the files.
  3. Try to automatically fix the issue: Unless you can say, with certainty, that you know exactly what the problem is, writing code to handle an error situation will likely just cause another error. In fact, it may cause an error that is harder to detect or find, and errors that occur undetected are by far the worst kind of errors to have. If you do decide that fixing an issue automatically is the best course, then log exactly what actions you performed to fix it.
  4. Fail quickly without logging or isolating state: Log what happened and the state. Copy a file into a known error location. Put the message into a dead letter or poison queue. You get the idea: when something fails, try to gather things up and put them somewhere so that you can find them later!
  5. Perform operations in a shared location: If you know something can fail, perform the operation in a staging area. For example, if you need to bring in a file and perform some operations on it, then bring the file into a staging area, perform the operations and verify that they succeeded. Once you can verify they succeeded, then move the file to where it needs to be.
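Item 5 can be sketched with Python's standard library. The size check is a stand-in for whatever verification your operation actually needs, and the paths are hypothetical:

```python
import shutil
import tempfile
from pathlib import Path

def import_file(src: Path, dest_dir: Path) -> Path:
    """Copy a file via a staging area; only touch the shared
    destination after the copy has been verified."""
    with tempfile.TemporaryDirectory() as staging:
        staged = Path(staging) / src.name
        shutil.copy2(src, staged)                        # work in the staging area
        if staged.stat().st_size != src.stat().st_size:  # verify the copy
            raise IOError(f"copy of {src} failed verification")
        final = dest_dir / src.name
        shutil.move(str(staged), str(final))             # only now touch the shared location
        return final
```

If verification fails, the shared location was never modified, so there is nothing to clean up there.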

If you looked at this list and thought, “I already avoid all of these things, all the time,” well, you may need to stop lying to yourself. If all of these ideas seem obvious to you, then pat yourself on the back. Either way, you should make every effort to consider failure in your designs, but more importantly, try not to let failure cases complicate your design. The more complexity you introduce, the more likely the system is to fail. Always weigh the consequences of failure against the effort required to deal with it.

Comments (11)

  1. ATMs can still give out the wrong amounts if the technician swaps the $10 and $20 cartridges, yielding the depressing truth that there is always room for error. I know because it happened to me once, and I immediately marched on over to a nearby branch and joined quite the line of folks who were not amused … imagine getting $80 deducted but only getting $40 in cash doled out! (Some machines seem to have mitigated this by saying "20s only.")

    Still, this is good advice. Like traffic lights–if a timing error causes lights in opposing directions to go green at once, then a physical fuse is blown, causing them all to go to the flashing yellows/reds circuit. A lot of software could use the "fuse technique," where it just throws up its hands and says "I give up, nobody does anything till somebody takes a good hard look at me." Unfortunately, though, this kind of fail-safe in software usually means "crash" instead of "reduced functionality mode."

    Especially true if failure is rare. Why have brittle exception code that is rarely tested or exercised when you can just hit "Retry"?

  2. So, to summarize the entire article…

    KISS [0].
    KISS [0].
    KISS [0].
    KISS [0].

    Oh, and KISS [0].

    Plus, verify (which is simpler to do because of KISS [0]).

    Which is why we have developer guidelines about NOT using `catch(Exception)` (unless followed by a quick software exit), because it’s (1) not simple, and (2) it’s impossible to know what exactly you’re catching (OutOfMemoryException, anyone?) and whether it’s actually safe for your program to continue executing…
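That guideline (catch the narrowest exception you can actually handle, never a blanket `catch(Exception)`) can be illustrated in Python, though the same idea applies to .NET's typed exceptions. Here `KeyError` stands in for the one failure we know how to recover from:

```python
def lookup(table, key, default=None):
    """Handle only the failure we understand; anything unexpected
    (MemoryError, bugs, ...) propagates and fails loudly."""
    try:
        return table[key]
    except KeyError:   # the one error known to be safe to recover from
        return default
    # deliberately no bare `except Exception:` to swallow unknown errors
```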

    Any thoughts on how to remove complexity from systems? I find that many developers take perverse joy in over-complicating things… 😉

    [0] Keep it Simple, Stupid [1]
    [1] Because if we think we’re all that, we’ve already lost. We need to keep reminding ourselves that we’re really Not So Smart, and by doing so we’ll ensure that the software we write can actually be understood by Mere Mortals, which behooves us all because we *are* mere mortals…

  3. It isn’t identical, but the goal to "try not to let failure cases complicate your design" feels very similar to the "let it fail" approach of Erlang.

    I haven’t worked in it, but from what I understand, Erlang has an interesting SRP take on handling failure. You don’t muddy up your domain implementation with a ton of error-compensating code. Instead you have supervisor processes that watch your implementation process and decide what to do if a failure occurs. Is that about right?

    On a completely different note, this also reminds me of one of the ways I get a lot of value from TDD. When practicing TDD, I find I’m much more inclined to think about how the code should fail. And often, I’m able to redesign the API so that instead of compensating for an error, the API doesn’t allow the error state to exist.

  4. @Al It is funny that you say that, because this whole post almost turned into a post about failures in software. I agree, in most cases polluting your code with error handling is just a waste. It is usually better to instead spend time logging, so when an unexpected error occurs, you know what happened and are able to compensate by fixing the software, not having it go through elaborate gyrations in an attempt to fix itself.

  5. The ATM is a very good example of a bad design that was redesigned correctly: withdraw/return money/return card became withdraw/return card/return money. This could also be an example of engineers designing something and not using the product themselves right away. Second-year philosophy of science… I love this article though.

