Changes in complex software systems seem like they take forever, don’t they? Even to engineers it often feels like changes take longer than they should, and we understand the reasons for the underlying complexity in the system!
For stakeholders it can be even more opaque and frustrating. This can be exacerbated by the incidental complexity introduced over time in systems that haven’t been properly maintained. It can feel like we are bailing water out of a ship with a thousand leaks.
So it can be incredibly frustrating and deflating when one day you get a message from a stakeholder saying, “Why in the world will this take so long?” But we have to remember that as software engineers we have a window into a world that our stakeholders often don’t have visibility into. They put a lot of trust in us to deliver for them, but sometimes a seemingly small change can end up taking a large amount of time. This leads to a frustration that results in a curt “explain to me why this takes so long.”
Don’t be offended by this question; see it as an opportunity to empathize with your stakeholders and to give them a clearer picture of the complexity of the system. And at the same time, you have the opportunity to suggest ways in which to improve the situation. There is no better time than when someone is frustrated by something to tell them you have an idea for how to improve it!
Below we have a letter that we have written variations of numerous times over the years. We hope this will help you to think about how you communicate with your stakeholders the next time you receive a similar question.
I saw your comment on the “Notify Before Task Due” story card, and I’d be happy to discuss it in our next meeting. I’ll summarize my thoughts here for reference, no need to reply.
To paraphrase your note:
Changing the tasks due email to be delivered a day earlier should be a one-line change. How could that take 4-8 hours? What am I missing?
On one hand, I agree with you. This request is simply changing part of a query from “tasks due <= today” to “tasks due <= tomorrow”.
On the other hand, by reducing it to that simplistic idea, we’re inadvertently ignoring inherent complexity and making a number of engineering choices – some we should discuss.
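To make the "one-line change" concrete, here is a minimal sketch of what it might look like. The task records, field names, and the `tasks_to_notify` helper are hypothetical stand-ins for illustration; the real query lives in the application and is not shown in this letter.

```python
from datetime import date, timedelta

# Hypothetical task records; the real schema is not shown in this letter.
TASKS = [
    {"id": 1, "due": date(2024, 3, 10)},
    {"id": 2, "due": date(2024, 3, 11)},
    {"id": 3, "due": date(2024, 3, 12)},
]


def tasks_to_notify(tasks, today, lead_days=0):
    """Return tasks whose due date falls on or before today + lead_days."""
    cutoff = today + timedelta(days=lead_days)
    return [t for t in tasks if t["due"] <= cutoff]


today = date(2024, 3, 10)
current = tasks_to_notify(TASKS, today)                 # "tasks due <= today"
requested = tasks_to_notify(TASKS, today, lead_days=1)  # "tasks due <= tomorrow"

print([t["id"] for t in current])    # [1]
print([t["id"] for t in requested])  # [1, 2]
```

The edit itself really is tiny. Everything that follows is about knowing that this is the right line, and that changing it is safe.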
Part 1: Why is this small change bigger than it seems?
It’s an easy, small change, one line of code. Spending a full day, even a half day, on it sounds excessive.
Of course, you can’t just deploy a change to production without, at least, running it locally or on a test server to make sure the code executes correctly, and in the case of a query change, you need to compare the output to make sure it seems roughly correct.
In this case, that query output comparison can be fairly minimal: spot-check a handful of results, ensure the result counts make sense, etc. It’s a notification for internal employees. If our date math is wrong (an easy mistake), we’ll hear about it from the teams quickly. If this were, say, an email to your customers, deeper scrutiny would be needed. But for this light testing and review, 20-40 minutes is reasonable, depending on whether anything weird or unexpected comes up. Digging into data can eat time. Pushing out a change without doing a review is simply professionally negligent.
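As a rough illustration of that light comparison, the sketch below summarizes how the output of the changed query differs from the original. The `compare_results` helper and the task IDs are made up for the example; in practice the two lists would come from running both versions of the query against the same data.

```python
def compare_results(old_ids, new_ids):
    """Summarize how the new query's output differs from the old one."""
    old, new = set(old_ids), set(new_ids)
    return {
        "old_count": len(old),
        "new_count": len(new),
        "added": sorted(new - old),
        # Widening the date window should never drop a task,
        # so anything listed here deserves a closer look.
        "dropped": sorted(old - new),
    }


# Hypothetical task IDs returned by each version of the query.
summary = compare_results(
    old_ids=[101, 102, 105],
    new_ids=[101, 102, 105, 110, 111],
)
print(summary)
# {'old_count': 3, 'new_count': 5, 'added': [110, 111], 'dropped': []}
```

The counts give the sanity check, and the "added" list is the handful worth spot-checking by hand.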
So, adding in time for normal logistics like committing the code, merging changes, deployment, etc., the time spent from start to live-in-production is at minimum an hour – for a competent, professional engineer.
Of course, that assumes you know exactly what line of code to change. Task workflow mostly lives in the old system, but some pieces of logic live in the new system. Moving logic out of the old system is a good thing, but it does mean the Task functionality is currently split across two systems.
Because we’ve worked together for so long, our team happens to know which process sends the due/overdue task email and can point to the line of code in the new system that initiates the process. So we don’t need to spend time figuring that out at least.
But if we look at the task code in the old system, there are at least 4 different ways to determine if a task is due. Plus looking around at templates and email behavior, there are at least 2 more places that seem to perform custom due logic.
And then the task notification logic is more complicated than you might guess. It needs to reason about team vs individual tasks, public vs private tasks, recurring tasks, whether a task is so overdue that a manager should be notified, etc. But we can figure out fairly quickly that only 2 of the 6+ overdue definitions are actually used in this notification process. And only 1 needs to be changed to accomplish the goal.
That review could easily take another half hour or so, maybe less if you’ve been in that part of the codebase recently. Also, the hidden complexity means we might exceed our manual-testing estimate. But let’s just add 30 minutes for higher effort than expected.
So we’re up to 1.5 hours to do this change and feel confident that it will achieve the stated request.
Of course, we didn’t check if any other processes are using the query we’re changing. We don’t want to accidentally break other functionality by changing the concept of “due” to mean the day before the task is due. We need to review the codebase from this perspective. In this case, there seem to be no major dependencies – probably because the bulk of task UI is still in the old system. So we don’t need to worry about changing or testing other processes. In this best case scenario, that’s another 15-30 minutes.
Oh and since the bulk of the task user interface does still live in the old system, we really should do a quick review of the task functionality in that system to see if we’re giving inconsistent feedback. For example, if the UI highlights a task when it’s due, we may want to change that logic also to match the notification – or at least go back and ask our stakeholder what she wants to do. I have not looked at task functionality in the old system lately and don’t remember if it has any notion of due/overdue. This review is another 15-30 minutes, possibly more if there are also multiple definitions of “task due” in the old system, etc.
So we’re up to the 2 – 2.5 hour range to do this and feel confident it will achieve the stated request, without unintended side effects or user experience confusion.
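For reference, the running total can be tallied as below. The minute figures are the ones quoted in the steps above; the step labels are shorthand of ours.

```python
# Each entry: (step, low estimate in minutes, high estimate in minutes).
steps = [
    ("change, light testing, commit, merge, deploy", 60, 60),
    ("working out which definition of 'due' to change", 30, 30),
    ("checking other processes that use the query", 15, 30),
    ("reviewing the old-system UI for consistency", 15, 30),
]

low = sum(step[1] for step in steps)
high = sum(step[2] for step in steps)

print(f"{low / 60:.1f} - {high / 60:.1f} hours")  # 2.0 - 2.5 hours
```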
Part 2: How can we reduce that time?
The suboptimal, frustrating part is that the only output of this effort is the requested change. The knowledge, say, Sam gained in the process is personal and ephemeral. If a different developer (or ourselves 6 months from now) needed to make a change in this part of the code again, the process would need to be repeated.
There are two main tactics to improve that:
- 1) Actively clean a codebase – focused around reducing duplication and complexity.
- 2) Write automated tests.
Side Note: We have discussed documentation in the past, but documentation is not a good solution in this case. Documentation can be great for high-level ideas, e.g., explaining the business reasons behind system behavior, or for highly repeatable processes, e.g., my “new integration partner” checklist. But when it comes to documenting application code, it quickly becomes far too voluminous and out of date as the code changes.
You’ll notice neither of those two improvement tactics is included in our 2 – 2.5 hours.
As an example, keeping a codebase clean would mean that instead of just focusing on the requested change, we’d ask questions:
- Why are there so many different ways to find due/overdue tasks?
- Are all of those needed or even still used?
- Can we collapse those to one or two concepts/methods?
- If the concept of due is split across the old and new systems, can we consolidate this?
And so on.
Answering those questions may be fairly quick – e.g., if they’re obviously dead code. Or it may take several hours – e.g., if they’re used in a lot of complicated processes. Once we had those answers, it would take more time still to refactor the code to reduce duplication/confusion and have a single notion of “due” – or rename concepts in code to make it obviously clear how they’re different and why.
But at the end of it, this part of the codebase would be far simpler, easier to reason about, and easier to change.
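As a sketch of where that cleanup could end up, the example below shows a single shared definition of "due" that both the notification and the UI could call. The `Task` fields and function names are assumptions for illustration, and the real rules (team vs individual tasks, recurring tasks, and so on) would add conditions that are left out here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Task:
    # Hypothetical fields; the real task model has more to it.
    due_on: date
    completed: bool = False


def is_due(task: Task, today: date, lead_days: int = 0) -> bool:
    """The one shared definition of 'due': open, and due on or before today + lead_days."""
    return not task.completed and task.due_on <= today + timedelta(days=lead_days)


def is_overdue(task: Task, today: date) -> bool:
    """Overdue uses the same fields: open, and strictly past the due date."""
    return not task.completed and task.due_on < today


task = Task(due_on=date(2024, 3, 11))
print(is_due(task, date(2024, 3, 10)))               # False
print(is_due(task, date(2024, 3, 10), lead_days=1))  # True
```

With one definition in place, "notify a day earlier" becomes a parameter at the call site rather than a hunt through six or more variations.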
Another tactic we use commonly is automated testing. In some ways, automated tests are like documentation – documentation that can’t get out of date and is easier to discover. Instead of manually running code and reviewing the output, we’d write test code that runs the query and programmatically verifies the output. So any developer can run that test code to understand how the system is supposed to work and make sure it still works that way.
If you have a system with decent test coverage, it can greatly speed up these types of changes. You can make the logic change and then run the full test suite to have confidence that
a) your change works correctly and, even more valuable,
b) your change hasn’t broken anything else.
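A test for the change discussed in this letter might look something like the sketch below, using Python’s built-in `unittest` module. The `tasks_due_by` function is a hypothetical stand-in for the real notification query, and the task data is invented for the example.

```python
import unittest
from datetime import date, timedelta


def tasks_due_by(tasks, cutoff):
    """Hypothetical stand-in for the notification query."""
    return [t for t in tasks if t["due"] <= cutoff]


class NotifyBeforeTaskDueTest(unittest.TestCase):
    def setUp(self):
        self.today = date(2024, 3, 10)
        self.tomorrow = self.today + timedelta(days=1)
        self.tasks = [
            {"id": 1, "due": self.today},
            {"id": 2, "due": self.tomorrow},
            {"id": 3, "due": self.today + timedelta(days=2)},
        ]

    def test_includes_tasks_due_tomorrow(self):
        ids = [t["id"] for t in tasks_due_by(self.tasks, self.tomorrow)]
        self.assertIn(2, ids)

    def test_still_includes_tasks_due_today(self):
        ids = [t["id"] for t in tasks_due_by(self.tasks, self.tomorrow)]
        self.assertIn(1, ids)

    def test_excludes_tasks_due_later(self):
        ids = [t["id"] for t in tasks_due_by(self.tasks, self.tomorrow)]
        self.assertNotIn(3, ids)
```

Saved to a file, this can be run with `python -m unittest`. Each test name states one expectation about the behavior, which is what lets the suite double as documentation for the next developer.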
When we build systems from scratch at Simple Thread, we always include time for writing automated tests in our estimates. It can slow down initial development, but it greatly improves the efficiency of operating and maintaining a software system. It isn’t until a system grows that you truly start to feel the pain of not having tests, and by that point it can be a monumental task to work tests back into the system. It also makes it much easier to on-board new developers, and it makes it much faster and safer to change behavior.
Part 3: Where have we come from? Where are we going?
To date, we rarely include explicit cleaning or test-writing time in estimates for you. That’s partially because while writing tests alongside new code adds relatively little overhead, adding tests to a codebase retroactively is a lot of work, kind of like rebuilding the foundation under a house that people live in.
It’s also partially because when we started working with you, we immediately went into emergency triage mode. We had near-daily problems with the 3rd-party data syncs, weekly problems with report generation, constant support requests for small data changes, inadequate system monitoring and logging, etc. The codebase was sinking under the weight of technical debt, and we were feverishly trying to keep the systems afloat while also patching holes with duct tape.
Over time, we’ve gotten the systems more stable and reliable, and we’ve automated or provided self-service UIs for many frequent support requests. We still have a lot of technical debt, but we’re out of the emergency room. However, I don’t think we’ve ever fully shifted away from that reactive triage mentality to a more proactive, mature plan-and-execute mentality.
We try to clean up code we’re already changing, and we always desk-test thoroughly. But being careful and diligent is different from proactive refactoring and building the infrastructure needed for good automated tests.
If we don’t start paying down some technical debt, we’ll never meaningfully improve the situation, and it’ll continue to take months for highly experienced, highly competent developers to get oriented enough to make nontrivial changes.
In other words, spending 1/2 day to a full day (4-8 hours) on this task is roughly 2x-4x the effort of the 2 – 2.5 hour estimate above, but it could provide a benefit that greatly reduces the effort to make a similar change in the future. If this part of the codebase were cleaner and had good automated test coverage, I’d expect a competent developer to perform it in an hour or less, 1/2 or less of the original time. And a key point is that it wouldn’t take much longer for a new developer to do it than for a developer experienced in the system.
This shift is one that we are going to need buy-in from you to make. It is a conscious effort to improve how your systems work at a fundamental level, not just how users perceive the system. I know that investments like this can be hard to make, precisely because there aren’t any new visible rewards, but we are happy to sit down with you and put together some hard numbers that can show how these investments would pay off long-term from an engineering perspective.