Changes in complex software systems seem like they take forever, don’t they? Even to engineers it often feels like changes take longer than they should, and we understand the reasons for the underlying complexity in the system!
For stakeholders it can be even more obtuse and frustrating. This can be exacerbated by the incidental complexity introduced over time in systems that haven’t been properly maintained. It can feel like we are bailing water out of a ship with a thousand leaks.
So it can be incredibly frustrating and deflating when one day you get a message from a stakeholder saying, “Why in the world will this take so long?” But we have to remember that as software engineers we have visibility into a world that our stakeholders often can’t see. They put a lot of trust in us to deliver for them, but sometimes a seemingly small change can take a surprisingly large amount of time, and that frustration boils over into a curt “explain to me why this takes so long.”
Don’t be offended by this question; see it as an opportunity to empathize with your stakeholders and to give them a clearer picture of the complexity of the system. And at the same time, you have the opportunity to suggest ways in which to improve the situation. There is no better time than when someone is frustrated by something to tell them you have an idea for how to improve it!
Below we have a letter that we have written variations of numerous times over the years. We hope this will help you to think about how you communicate with your stakeholders the next time you receive a similar question.
I saw your comment on the “Notify Before Task Due” story card, and I’d be happy to discuss it in our next meeting. I’ll summarize my thoughts here for reference; no need to reply.
To paraphrase your note:
Changing the tasks due email to be delivered a day earlier should be a one-line change. How could that take 4-8 hours? What am I missing?
On one hand, I agree with you. This request is simply changing part of a query from “tasks due <= today” to “tasks due <= tomorrow”.
On the other hand, by reducing it to that simplistic idea, we’re inadvertently ignoring inherent complexity and making a number of engineering choices – some we should discuss.
Part 1: Why is this small change bigger than it seems?
It’s an easy, small change, one line of code. Spending a full day, even a half day, on it sounds excessive.
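To make that concrete, here is roughly what the “one line” looks like. This is a minimal sketch in Python – the function and field names are illustrative, not our actual codebase:

```python
from datetime import date, timedelta

# Hypothetical sketch of the notification's selection logic.
def tasks_due_soon(tasks, today=None):
    today = today or date.today()
    # Before: notify for tasks due on or before today.
    # cutoff = today
    # After: notify a day earlier, i.e. tasks due on or before tomorrow.
    cutoff = today + timedelta(days=1)
    return [t for t in tasks if t["due"] <= cutoff]
```

The diff really is one line. Everything that follows is the work of being confident it’s the right line.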
Of course, you can’t just deploy a change to production without, at least, running it locally or on a test server to make sure the code executes correctly, and in the case of a query change, you need to compare the output to make sure it seems roughly correct.
In this case, that query output comparison can be fairly minimal: spot-check a handful of results, ensure the result counts make sense, etc. It’s a notification for internal employees. If our date math is wrong (an easy mistake), we’ll hear about it from the teams quickly. If this were, say, an email to your customers, deeper scrutiny would be needed. But for this light testing and review, 20-40 minutes is reasonable, depending on whether anything weird or unexpected comes up. Digging into data can eat time. Pushing out a change without doing any review is simply professionally negligent.
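As a sketch of what that light comparison might look like – assuming we can pull the task IDs returned by the old and new versions of the query (the helper name is hypothetical):

```python
def spot_check(old_ids, new_ids):
    """Quick sanity comparison between the old and new query output."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        # Should be empty: the new cutoff is strictly broader than the old one.
        "missing": sorted(old_ids - new_ids),
        # Tasks newly picked up: should look like roughly one extra day's worth.
        "extra": sorted(new_ids - old_ids),
    }
```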
So, adding in time for normal logistics like committing the code, merging changes, deployment, etc., the time spent from start to live-in-production is at minimum an hour – for a competent, professional engineer.
Of course, that assumes you know exactly what line of code to change. Task workflow mostly lives in the old system, but some pieces of logic live in the new system. Moving logic out of the old system is a good thing, but it does mean the Task functionality is currently split across two systems.
Because we’ve worked together for so long, our team happens to know which process sends the due/overdue task email and can point to the line of code in the new system that initiates the process. So we don’t need to spend time figuring that out at least.
But if we look at the task code in the old system, there are at least 4 different ways to determine if a task is due. Plus looking around at templates and email behavior, there are at least 2 more places that seem to perform custom due logic.
And then the task notification logic is more complicated than you might guess. It needs to reason about team vs. individual tasks, public vs. private tasks, recurring tasks, whether it’s so overdue that a manager should be notified, etc. But we can figure out fairly quickly that only 2 of the 6+ overdue definitions are actually used in this notification process. And only 1 needs to be changed to accomplish the goal.
That review could easily take another half hour or so, maybe less if you’ve been in that part of the codebase recently. Also, the hidden complexity means we might exceed our manual-testing estimate. But let’s just add 30 minutes for higher effort than expected.
So we’re up to 1.5 hours to do this change and feel confident that it will achieve the stated request.
Of course, we didn’t check if any other processes are using the query we’re changing. We don’t want to accidentally break other functionality by changing the concept of “due” to mean the day before the task is due. We need to review the codebase from this perspective. In this case, there seem to be no major dependencies – probably because the bulk of task UI is still in the old system. So we don’t need to worry about changing or testing other processes. In this best case scenario, that’s another 15-30 minutes.
Oh and since the bulk of the task user interface does still live in the old system, we really should do a quick review of the task functionality in that system to see if we’re giving inconsistent feedback. For example, if the UI highlights a task when it’s due, we may want to change that logic also to match the notification – or at least go back and ask our stakeholder what she wants to do. I have not looked at task functionality in the old system lately and don’t remember if it has any notion of due/overdue. This review is another 15-30 minutes, possibly more if there are also multiple definitions of “task due” in the old system, etc.
So we’re up to the 2 to 2.5 hour range to do this and feel confident it will achieve the stated request, without unintended side effects or user experience confusion.
Part 2: How can we reduce that time?
The suboptimal, frustrating part is that the only output of this effort is the requested change. The knowledge, say, Sam gained in the process is personal and ephemeral. If a different developer (or ourselves 6 months from now) needed to make a change in this part of the code again, the process would need to be repeated.
There are two main tactics to improve that:
- Actively clean the codebase – focused on reducing duplication and complexity.
- Write automated tests
Side Note: We have discussed documentation in the past, but documentation is not a good solution in this case. Documentation can be great for high-level ideas, e.g., explaining the business reasons behind system behavior, or for highly repeatable processes, e.g., my “new integration partner” checklist. But when it comes to documenting application code, it quickly becomes far too voluminous and out of date as the code changes.
You’ll notice neither of those two improvement tactics is included in our 2 – 2.5 hours.
As an example, keeping a codebase clean would mean that instead of just focusing on the requested change, we’d ask questions:
- Why are there so many different ways to find due/overdue tasks?
- Are all of those needed or even still used?
- Can we collapse those to one or two concepts/methods?
- If the concept of due is split across the old and new systems, can we consolidate this?
And so on.
Answering those questions may be fairly quick – e.g., if they’re obviously dead code. Or it may take several hours – e.g., if they’re used in a lot of complicated processes. Once we had those answers, it would take more time still to refactor the code to reduce duplication/confusion and have a single notion of “due” – or rename concepts in code to make it obviously clear how they’re different and why.
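As a sketch of where that refactoring would land (the names are hypothetical – the point is the shape, not the specifics), the six-plus scattered definitions collapse into one parameterized notion of “due” that every caller shares:

```python
from datetime import date, timedelta

def is_due(task, as_of=None, lead_days=0):
    """The single, canonical definition of 'due'.

    lead_days lets callers look ahead, e.g. lead_days=1
    means 'due by tomorrow' for the early notification.
    """
    as_of = as_of or date.today()
    return task["due"] <= as_of + timedelta(days=lead_days)
```

With one definition, “deliver the email a day earlier” becomes changing a single argument at a single call site.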
But at the end of it, this part of the codebase would be far simpler, easier to reason about, and easier to change.
Another tactic we use commonly is automated testing. In some ways, automated tests are like documentation – documentation that can’t get out of date and is easier to discover. Instead of manually running code and reviewing the output, we’d write test code that runs the query and programmatically verifies the output. So any developer can run that test code to understand how the system is supposed to work and make sure it still works that way.
If you have a system with decent test coverage, it can greatly speed up these types of changes. You can make the logic change and then run the full test suite to have confidence that (a) your change works correctly and, even more valuably, (b) your change hasn’t broken anything else.
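For this particular change, such a test might look like the following sketch (pytest-style; the helper stands in for the real notification query):

```python
from datetime import date, timedelta

# Stand-in for the real notification query (illustrative, not our actual code).
def tasks_to_notify(tasks, today):
    return [t for t in tasks if t["due"] <= today + timedelta(days=1)]

def test_notifies_a_day_early():
    tasks = [
        {"id": 1, "due": date(2024, 5, 1)},  # already due
        {"id": 2, "due": date(2024, 5, 2)},  # due tomorrow: now included
        {"id": 3, "due": date(2024, 5, 3)},  # not due yet
    ]
    notified = tasks_to_notify(tasks, today=date(2024, 5, 1))
    assert [t["id"] for t in notified] == [1, 2]
```

A test like this documents the intended behavior and catches regressions the next time anyone touches the “due” logic.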
When we build systems from scratch at Simple Thread, we always include time for writing automated tests in our estimates. It can slow down initial development, but it greatly improves the efficiency of operating and maintaining a software system, makes it much easier to onboard new developers, and makes changing behavior much faster and safer. It isn’t until a system grows that you truly start to feel the pain of not having tests, and by that point it can be a monumental task to work tests back into the system.
Part 3: Where have we come from? Where are we going?
To date, we rarely include explicit cleanup or test-writing time in our estimates for you. That’s partly because, while writing tests as you build a system from scratch adds relatively little overhead, adding tests to an existing codebase retroactively is a lot of work – kind of like rebuilding the foundation under a house that people are living in.
It’s also partly because when we started working with you, we immediately went into emergency triage mode. We had near-daily problems with the 3rd-party data syncs, weekly problems with report generation, constant support requests for small data changes, inadequate system monitoring and logging, etc. The codebase was sinking under the weight of technical debt, and we were feverishly trying to keep the systems afloat while also patching holes with duct tape.
Over time, we’ve gotten the systems more stable and reliable, and we’ve automated or built UIs to self-serve many frequent support requests. We still have a lot of technical debt, but we’re out of the emergency room. However, I don’t think we’ve ever fully shifted away from that reactive triage mentality to a more proactive, mature, plan-and-execute mentality.
We try to clean up code we’re already changing, and we always desk-test thoroughly. But being careful and diligent is different from proactive refactoring and building the infrastructure needed for good automated tests.
If we don’t start paying down some technical debt, we’ll never meaningfully improve the situation, and it’ll continue to take months for highly experienced, highly competent developers to get oriented enough to make nontrivial changes.
In other words, spending a half day to a full day (4-8 hrs) on this task is roughly 2x-4x the minimum effort, but it could provide a benefit that greatly reduces the effort to make similar changes in the future. If this part of the codebase were cleaner and had good automated test coverage, I’d expect a competent developer to make this change in an hour or less – half the original time or better. And a key point is that it wouldn’t take much longer for a new developer than for a developer experienced in the system.
This shift is one that we are going to need buy-in from you to make. It is a conscious effort to improve how your systems work at a fundamental level, not just how users perceive the system. I know that investments like this can be hard to make, precisely because there aren’t any new visible rewards, but we are happy to sit down with you and put together some hard numbers that can show how these investments would pay off long-term from an engineering perspective.
Loved the article? Hated it? Didn’t even read it?
We’d love to hear from you.
This is a great article. As someone more from the “corporate” side of the house, one thing your analysis should also capture is that most people who make these requests are either 1) not high enough in the org or 2) will not stick around in the company long enough to enjoy the benefit of re-engineering to eliminate technical debt. (Or, they may be senior enough and be long-tenured, but have more pressing considerations like meeting a quarterly number or satisfying a customer’s immediate request).
In this case, the bottleneck is not that the client doesn’t understand why the change took so long, but that they honestly are not paid to care about any effort by the engineer that does not directly benefit them in their current role. Maybe your next piece should address how to incentivize organizations to support engineering best practices – I’m certain many companies would be eager to learn any ideas you could share in that regard.
I couldn’t agree more about the clean code and automated testing. Those two things will return so much value compared to the time invested. They are so worth it in the long run.
Speaking of documentation, you forgot to include the changes to the docs in your time estimate. Engineering and testing may take 4-8 hours, but this is a UI change and someone is also going to have to update the user guides, possibly change a menu or a tooltip or an online Help file, and then push those changes to production as well.
Dave Thomas (of PragProg) proposes a simple test for good software design: is it easy to modify? The details that you so ably explain are all related to this issue.
As someone who is maintaining a code base that is over 15 years old, I can totally relate to this.
I am a technician who programs for fun and to solve problems at work. And I get called on by the boss to solve a few of his issues too.
Thank You, Thank You, Thank You!
I will never flagellate myself again, nor will I accept it!
I think you mean abstruse, not obtuse
Way too long of an explanation. Sometimes you need to explain, but not like this and certainly not over smaller items. It looks like it took a long time to write.
It’s important to take time to reflect with your team and suggest changes to improve efficiency, but producing novels in the name of transparency is just wasteful and doesn’t challenge the other side to exercise some faith and patience. They will have to do that anyway regardless of the thoroughness and frequency of your explanations.
What is far more important is having mutual respect and proper manners. Those go a long way towards trust and carry more value than lengthy explanations that people may not even read.
That reminds me. Remember that concept of a manager that Agile tries to do away with? Yeah, we need those. Guess why? Because their job is to foster respect and communications with other departments and provide an insulation barrier so their people can get their work done.
One minute to make the change, one hour to know where to make the change, three hours to take care of the logistics and other aspects of the life cycle of a product, then eight hours to spend collectively with stakeholders to explain why it took us so long.
Or, management could take 5 min to realize that everything in software is really complicated (and that’s why they chose not to be engineers). If they’re ever not sure, ask them to estimate how many minutes it will take for them to make breakfast for a small group of people with special requirements that you don’t know yet (like food allergies), in a kitchen with 5 chefs all working on different meals at the same time, while that same kitchen is also simultaneously being remodeled, and have them explain why their estimate is +/- an hour or two for “unknowns”.
I swear, we ALL spend more time and money managing non-engineering folks’ expectations than it sometimes takes to just do the damn work.
All of this was well known in the 1980s. In fact, a very good book was written on the subject for managers. Of course, no one actually read the book except for us technical professionals.
As one who has had a very long career in the IT profession, I can categorically say that as well-written and sincere as this article is, the problem is unsolvable. This is because most mid-level managers as well as technical managers operate in the same way sociopaths do (stakeholders get caught up in this mental mix by being associated with such managers as a result of the work hierarchies involved); they all follow the same aberrant moral code, which precludes intelligent technical professionals from working with the majority of managers in a professional and efficient way. If this weren’t the case, then articles such as this one would no longer be written, as working environments would have adapted to such techniques and realities a long time ago.
Unfortunately, human stupidity is a bottomless pit, which forces technical professionals to adopt ways that circumvent their managers and subsequent stakeholders. One such technique comes in the form of what I believe is called the Johnson technique of discussion. Steve McConnell described this manager/user meeting format in his 1996 book, “Rapid Application Development”.
This technique is supposed to allow technical personnel to work with management (and stakeholders) in a way that gets such people to buy into what is being said by the technical personnel. In other words, it is a very sophisticated con. However, according to McConnell, if employed well, the technique is known to work.
In years past, I brought solid proof of the advantages of adopting software engineering standards to one place of employment. This was done by the successful completion of a project (I was the technical lead) that I brought in within 4 days of the expected deadline. When I met with my supervisor so he could understand how I was able to accomplish this, his response was… “I don’t see any reason as to why we need this stuff…” Yet the answer was staring him in the face…
Part of what you are saying is true, but you ignore the elephant in the room. It takes longer because the code was written poorly, is not maintained properly, and has far too many changes which the developer doing the work was “too busy” to spend 30 extra minutes and instead hacked in a solution.
Part of the problem is too many developers are just too sloppy and lazy. They became developers because the pay was good.
While I appreciate and sympathise with the frustration being expressed, I hope you don’t actually send this to your poor clients. It’s hugely dense and technical. I would be amazed if anyone tried to understand this. I glazed over and I’m a developer. This isn’t exactly describing an interesting process either. I highly doubt that they had that level of interest in the question.
If the aim was to empathise with the client then I think you missed the mark.
I would strip this back to something a lot simpler that explains the high-level details, with zero technical jargon. I wouldn’t use the tone that you have used either like “of course this” or “obviously” that – the client has already shown they don’t understand this so it reads like they are stupid for not having any idea about these points.
It should be short, upbeat, and lightly touch on the broadest subjects like:
“We wish it could be done quicker as well; we want to get this into your users’ hands, but our highest priority is delivering software that works well. We do not want to compromise on quality by cutting corners.
The system has a lot of moving parts. A change in one place can affect many other places, often which don’t seem connected at first.
Even for the simplest change, the process still has to be followed.
We need to identify where to make the change, do the actual change, test that the change didn’t cause any problems, document the change so the next developer knows what we did, and then publish the change.
We know this project is important to you, and that it is a cornerstone of your business. This is why we are doing our best work on this; to make sure the quality levels are maintained for many years to come.
Now stop bothering me or I’ll double your quote next time to cover your time wasting.”
And then I would spend another 3 hours tweaking this before I sent it 🙂
Part 2 concept “Actively clean codebase”.
Great concept. But how far do you apply it. Every big shop I’ve worked in had a “Prime Directive”: “If it ain’t broke, don’t fix it!” In simplest terms, no matter how inefficient the code, as long as it spits out the right answer, KEEP YOUR HANDS OFF. Paraphrased: only make the minimum line changes required to implement the requested change, don’t “fix” anything else!
Yes it’s great to clean up “technical deficit”, but in order to do that you have to have support from upper management. They have to buy into the general concept and they have to be willing to pay, both in extra time and money, for this ongoing cleanup. I haven’t encountered a workplace that has embraced this.
You also missed one of the complaints. In your simple scenario, the user comes back and says, “we want the change, but we refuse to approve the 4hrs work, cut it in half”. So then the dilemma is, do you give in or stick to your guns. Sure, you can say “we can cut testing time”, but… invariably in my experience when we do cut the estimate, the real world jumps in and the task runs “over due” back to the original estimate or longer! Then we have management breathing down our necks about cost over-runs.
When I’m feeling butch, my reply to the initial request to cut the estimates goes something like this. The company is paying me $40K / $50K (more?) a year for my expertise. My estimate is how long similar tasks have taken on this system. It is irresponsible for me to cut my estimate. Either approve the estimated time and cost for the task, or don’t do it now. (but how often do we feel crusty enough to make a reply like that).
Once when I was starting at a new company I was given a “graduation” test at the end of the initial training period (it was actually the best onboarding training I got anywhere). The task was simple: add display of 1 character on 1 screen. Initial estimate, make a few lines of program code change in 2 programs and 1 message passed between them. Total time, less than a week. I looked at it and agreed it seemed reasonable. By the time the task was eventually finished, I identified over 20 pieces of program code that had to be checked out of our version control system. I identified that the data flow impacted another subsystem I had not been trained in, requiring additional training and help from another business user analyst. 6 months later, duration not effort, the task was completed by a Sr Analyst who was ordered to take it over from me. Bottom line, at a high level, it sure didn’t make me look competent.
What an awesome write up. Will save it into my Pocket account.
How about a follow up article describing the reaction to this mail and the meeting you’ve had with that customer?
A reaction must have been given even though you wrote that a reaction isn’t necessary and should be saved for the meeting. I’m curious if it resulted in what you hoped it would.
If you considered redevelopment using the “Enigma Method” most of this would go away. That is not as silly as it may seem. The time savings in ongoing maintenance of a black-box structure using generated code are unbelievable; but real.
Also note that for this particular example, “tomorrow” is not well-defined.
Does it mean “next business day” or simply “next day”?
If the former, what about public holidays? What about locale differences in the very concept of public holidays? Some countries have very vague such concepts and quite a few of them (like the US, if I understand correctly), while other countries have fairly well-defined and well-known “public holidays” or “bank holidays”. Does the company already have a set policy on which days are business days in the relevant locations? And then there are the timezone issues (which as such already affect the old code for “today”).