Subject: HELP! The Site’s Down!
On a long enough timeline, every web developer will receive this email.
Every web developer learns to dread this email. Your day just got a whole lot worse.
Imagine a major site your team is responsible for just went down – or started timing out, throwing errors, booting users off, etc… You don’t actually know the problem yet. All you know is that people are freaking out. The boss is yelling at you, clients are calling, an executive from another business unit is demanding an update.
Your adrenaline starts pumping, you feel that pit forming in your stomach, and if you are not extremely careful, you are about to make some bad decisions.
However, there are some steps you can take to minimize risk. Take a deep breath. We’ve all been there.
First, Do No Harm
The biggest engineering mistakes I’ve made in my career have occurred while trying to fix an urgent problem. This is true for many world-class engineers I know.
There is an old saying in politics: It’s not the crime; it’s the cover-up that does the most damage. Incident management is similar; the response often causes a bigger calamity than the original problem.
Before you take any action or make any changes, you need to understand this: the wrong action will make the situation far worse.
Don’t let the fear of a mistake paralyze you, but remember that the system being down or slow is not actually the worst-case scenario. What could be worse? Data loss. Data leakage. Security breaches. Improperly charging customers. These situations are almost always an order of magnitude worse than an outage.
Between us, we could name about a hundred other ways things could get much worse. So even though this incident feels urgent – and it probably is – our first obligation is to not make things worse.
Doubt Your Instincts
There are many reasons people make bad decisions in a crisis. A big part of it is simple human biology; it’s nearly impossible to think clearly while panicking, which most people do in a crisis. Practice and repeated exposure can lessen that instinctive reaction, but it’s not something you can control entirely.
It helps to be aware of it, but you’ll still make bad decisions, even accounting for that fact. Don’t completely disregard your instincts, but verify hunches before acting on them.
Slow Down to Speed Up
Another reason for mistakes is that in a crisis, people want to take action, to do something, anything, to try to fix the problem. Other stakeholders want to see something being done. There’s pressure to act now.
We know that’s probably a bad idea. There’s a mantra in performance management that sometimes you need to slow down in order to speed up, i.e., being more deliberate and methodical in the short-term can help you achieve your objectives faster over the long-term. This is often true in incident response as well: extra caution in the first half hour can greatly shorten the total time to resolution.
So how do we push back against our bad instincts and all of this pressure to act quickly?
Get a Second Opinion, Always
Ideally you’d have another engineer to talk through these problems with, but if not, grab anyone who can listen and ask questions.
Even Dan from accounting, who is always so annoying with his constant questions and change requests?
Yes, he’s perfect.
Before you take any action, explain it in detail: what you hope it will accomplish and why.
Even if you’re both panicking, even if your troubleshooting partner is completely nontechnical, the mere act of talking through the problem and your proposed solution will help you slow down and order your thoughts.
This Is Not Normal
The other big reason people make mistakes in a crisis is that it’s a completely different context from normal operation.
For example, imagine your site traffic is growing predictably, and you can see your database performance will be a bottleneck in a few months. Thankfully, you already moved to a database cluster that supports horizontal scaling. So you add another node, which starts taking some of the load, and everything is smooth sailing. Great work, your Mensa membership is safe for another year.
Now let’s switch to an emergency scenario. The site’s up, but certain admin sections are extremely slow, even unresponsive. You can reproduce the issue, but there’s no clear pattern, no obvious source of the problem. Maybe it’s something about the reporting updates from earlier in the week. The one thing you know is that the database servers are nearly pegged, with CPU utilization almost constantly above 97%.
Okay, while we’re trying to figure out what’s happening, let’s add another node to the database cluster to give us some breathing room.
Good plan, right?
I just felt a great disturbance… as if millions of DevOps engineers cried out in terror, and were suddenly silenced. I fear something terrible has happened.
Adding a node to a cluster would be a reasonable idea under normal circumstances, but it often adds nontrivial load to the cluster initially, placing extra strain on other nodes as they figure out how to rebalance, push data over to the new node, etc.
Adding a node may have no impact, or it may make the situation far worse. It may turn that scenario with a partial outage in the admin section into a full outage of the entire system. Your temporary fix just took everything down and can’t be easily rolled back without risking data inconsistencies. Whoops.
Know Your Tools (and Experts 👋)
The exact impact of adding a node to your cluster will depend on the technology and the type of stress your system is under.
Large, complex systems fail in weird, complex ways. This is why sophisticated engineering organizations use tools like Chaos Monkey to simulate random failures and test their systems’ robustness.
Observing the fallout of real failures is the only way to truly know how your tools and systems will behave in failure mode. That’s one of the intangible benefits that senior engineers bring to a team; they’ve simply survived more failures.
That said, it’s unrealistic to know every part of your toolchain at an expert level. It is realistic, and important, to know what tools you are running and which experts you can reach out to for help.
Have a Communication Plan
In a fantasy world, you’d get out the runbook for your system and turn to the section that explicitly addresses this exact problem. Then you simply follow the clear, concise instructions for diagnosing and resolving the problem. It’s almost too easy.
But of course, that’s not reality for most of us, unless you’re working on, say, a spaceship or a power plant. It’s unrealistic to have a plan for all of the unexpected ways your systems will fail.
However, it’s completely realistic, necessary even, to plan for them to fail eventually.
Planning for failure is a big topic I hope to explore more in later posts, but for now, let’s focus on a few basic questions around communication:
- What downstream teams or systems could this affect?
- Who needs to know about the problem?
- Who else could already be working on the problem?
- Could this problem trigger any legal or regulatory obligations?
- How quickly does this need to be escalated? To whom?
You don’t need to be psychic to know the types of concerns and questions that will arise in a system outage – any type of outage.
A simple example is to have emergency contacts available in a redundant location that does not require access to the corporate network.
If a large cloud provider is having a problem affecting your public site, it’s entirely plausible that it could affect other systems. Do you know how to reach key personnel without access to your corporate address book? Or email? Or Slack?
The best time to plant a tree was 20 years ago. The second best time is now.
Getting an emergency communication plan in place is the type of basic preparation you can do that will help increase your resilience in any incident. If you haven’t done so before the incident, task someone to consider communication channels as step one in your response.
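If you want a concrete starting point, here is a minimal sketch in Python of what that preparation might look like: the plan is kept as plain data and rendered to a text sheet you can print or store somewhere outside the corporate network. Everything in it is an assumption for illustration. The class names, fields, people, and phone numbers are hypothetical placeholders, not a recommended format.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    role: str
    phone: str              # a channel that works without corporate email or Slack
    notify_within_min: int  # how quickly this person needs to hear about an incident


@dataclass
class CommunicationPlan:
    system: str
    downstream: list[str] = field(default_factory=list)
    contacts: list[Contact] = field(default_factory=list)

    def escalation_order(self) -> list[Contact]:
        """Contacts sorted by how urgently they need to be told."""
        return sorted(self.contacts, key=lambda c: c.notify_within_min)

    def to_text(self) -> str:
        """Render a plain-text sheet suitable for printing or offline storage."""
        lines = [f"Emergency contacts for {self.system}"]
        if self.downstream:
            lines.append("Downstream systems: " + ", ".join(self.downstream))
        for c in self.escalation_order():
            lines.append(
                f"- within {c.notify_within_min} min: {c.name} ({c.role}), {c.phone}"
            )
        return "\n".join(lines)


# Hypothetical example data; replace with your own people and systems.
plan = CommunicationPlan(
    system="public website",
    downstream=["order processing", "reporting"],
    contacts=[
        Contact("Example Manager", "engineering manager", "555-0100", 30),
        Contact("Example On-Call", "on-call engineer", "555-0101", 5),
    ],
)
print(plan.to_text())
```

The format matters far less than the fact that the sheet answers the questions above and can be read when the usual tools are unavailable.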
Know Your Last Resort
Depending on the problem, there is often an option of last resort that could fix it or at least contain the damage – typically at a steep price though.
For example, if you have a runaway job emailing every user in the system, killing all queue worker processes will stop the immediate problem. Emails will stop being sent. But it also might leave some long-running batch jobs in an unknown state and create mountains of work for other departments to reconcile reports, etc.
Or maybe, the site is under so much load that customer orders are getting processed incorrectly, e.g., not getting payment verification saved correctly. Taking the site down will stop the invalid order records getting created, but the site being down will obviously have serious repercussions – on revenue and reputation, among others.
It’s almost always the wrong move to unilaterally decide to take an irreversible action like these.
Large cybersecurity and operations teams have protocols for when and how a frontline operator is authorized to take systems offline, but in general, acting alone is extremely risky.
For one, remember that your instincts are probably not great now, but also, there may be other issues at stake here, beyond the one problem you’re handling. Your system is typically only one part of a much larger software ecosystem, which individual engineers rarely have full visibility into.
That said, this can be a tricky issue to navigate. If your algorithmic stock trading system is making unexplainable trades, you can’t wait for your request to make it up 5 levels of managers to get approval to take it down. You need to act as fast as possible to literally save the organization.
Too much process and hand-wringing can delay critical remediation tasks, at great cost. Frontline engineers need to be empowered without promoting recklessness.
As you start to understand the shape of an incident, it’s important to start considering your Last Resort and the necessary logistics.
- Who is the proper person to make the call?
- Who knows the safest shutdown sequence?
- Who has the requisite system permissions?
- If your system is a collection of FaaS endpoints, e.g., Lambda functions, how does a shutdown work?
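One way to keep track of those questions during an incident is to write the Last Resort down as a small record and check which answers are still missing. The sketch below is only an illustration in Python; the `LastResort` class, its field names, and the example people are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LastResort:
    action: str                              # e.g. "take the site offline"
    decision_maker: Optional[str] = None     # the proper person to make the call
    shutdown_expert: Optional[str] = None    # who knows the safest shutdown sequence
    permission_holder: Optional[str] = None  # who has the requisite system permissions
    shutdown_steps: Optional[str] = None     # how a shutdown actually works for this system

    def open_questions(self) -> list[str]:
        """Names of the fields that nobody has answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def ready(self) -> bool:
        """True only when every question has an answer."""
        return not self.open_questions()


# Hypothetical incident: two of the four questions are still unanswered.
option = LastResort(
    action="take the site offline",
    decision_maker="Example VP of Engineering",
    permission_holder="Example On-Call",
)
print(option.open_questions())
print(option.ready())
```

A record like this does not make the decision for anyone. It only makes it obvious which logistics still need an owner before the Last Resort is a real option.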
In a follow-up post, I want to talk about monitoring and how to avoid getting this email.
Fortunately, we don’t see this type of urgent support email very often these days at Simple Thread. It’s not because we’re so amazing and build perfect systems with zero incidents; it’s because we are typically the ones emailing our clients before a problem becomes noticeable to them.
That’s due to monitoring and instrumentation, which also happen to be essential to understanding an incident well enough to confidently and quickly resolve it.
The truth is that incident response is a vast, complex topic that can challenge even the best organizations; it is impossible to perfectly thread the needle between the desire to exert control in a crisis and the need for rapid response, in every unpredictable scenario.
Ideal processes are focused on facilitating human communication and supporting expert decision-making – not replacing human judgment with an inflexible policy.
There is a natural tension between preparedness planning and efficiency. To borrow a phrase from cybersecurity, incident response planning always feels excessive until it’s suddenly not enough.
Time To Face Reality
The site will go down. Your system will stop responding or start behaving in unexpected ways. It’s inevitable.
Trying to predict every mode of system failure has diminishing returns and is probably a waste of time for your organization.
However, planning for how your team will react is hugely valuable. It can eliminate so much of the chaos and confusion at the start of a major incident – simply by understanding who will need to communicate and what information they’ll need.
If your system being up is literally a matter of life and death, well, I hope you have better resources at your disposal than my rambling blog post. 🙂
For everybody else, take a deep breath. Some downtime is not the end of the world. If you have a plan, try to stay calm and follow the plan.
If you don’t have a plan, remember these key points:
- First, do no harm.
- Doubt your instincts.
- Slow down to speed up.
- Get a second opinion, always.
- This is not normal.
- Know your tools – and experts.
- Have a communication plan, or make one now.
- Know your option of last resort.
- You can plan for this.
May the force be with you.