A few weeks ago I was discussing a problem with one of my managers: some of our teams were spending a relatively large percentage of their time on toil, operational work that has to be done but doesn’t generate as much value as adding new features or making software more scalable. The Google SRE Book has a good definition:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
We did a quick back-of-the-envelope calculation. Let’s assume you have a 5-engineer team that spends about 40% of its time on toil. 40% is a lot: not only does it mean that the equivalent of 2 engineers is not being used as productively as it could be, but it can also make team members less excited about the work over time, lowering the team’s morale.
Let’s assume we can get this number down to 20% through improved automation, and that the cost is about 4.5 person-months. Let’s do an ROI calculation to get the payback period. Cutting toil from 40% to 20% on a 5-engineer team frees up 20% × 5 = 1 engineer, so we can get back an additional developer in about 3 months by allocating 1.5 FTEs for a quarter (1.5 × 3 = 4.5 person-months).
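Once the automation is in place, the freed-up developer repays the 4.5 person-months invested in another 4.5 months. This means that in about 7.5 months (3 + 4.5), we will break even on the effort.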
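The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the back-of-the-envelope math: the function name and parameters are made up for this example, and it assumes the savings start only once the automation work is fully finished.

```python
def payback_months(team_size, toil_before, toil_after,
                   cost_person_months, staffing_fte):
    """Return (build_months, recovery_months, break_even_months).

    Hypothetical helper: assumes no toil is saved until the build is done.
    """
    # Capacity freed each month after automation, in FTEs.
    freed_fte = team_size * (toil_before - toil_after)
    # Calendar time needed to build the automation at the given staffing.
    build_months = cost_person_months / staffing_fte
    # Months of freed capacity needed to repay the invested person-months.
    recovery_months = cost_person_months / freed_fte
    return build_months, recovery_months, build_months + recovery_months


# Numbers from the scenario: 5 engineers, 40% -> 20% toil,
# 4.5 person-months of work staffed with 1.5 FTEs.
build, recovery, break_even = payback_months(5, 0.40, 0.20, 4.5, 1.5)
print(f"build: {build:.1f} months, recovery: {recovery:.1f} months, "
      f"break-even: {break_even:.1f} months")
```

With the scenario’s inputs this gives a 3-month build, a 4.5-month recovery, and a break-even point at 7.5 months.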
Great — let’s do it!
Or maybe not… Can you spot the flaw in the logic above?
The implicit assumption above was that we had 1.5 FTE developers just hanging around, and that if they weren’t increasing automation (decreasing toil), they would be sitting on their hands.
The reality, of course, is that we could take the same 4.5 person-months and put them into value-generating projects instead of the cost reduction above. Let’s assume that a new project brings the company $50K of additional value (in annual subscriptions) by accelerating some feature or landing a new sale in that same quarter.
Hmm, we just introduced a new unit: money. To compare the cost and benefit of our automation work to revenue, we need to convert our person-months to money. Let’s assume each developer-month costs us $12.5K. This makes 4.5 person-months worth about $56K.
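As a small sketch of that unit conversion, the snippet below turns person-months into dollars so the automation cost can sit next to the revenue figure. The constant and function names are made up for this example, and the dollar amounts are the assumed values from the scenario, not real data.

```python
# Assumed figures from the scenario (illustrative only).
COST_PER_DEV_MONTH = 12_500          # dollars per developer-month
AUTOMATION_PERSON_MONTHS = 4.5       # effort to cut toil from 40% to 20%
NEW_PROJECT_ANNUAL_REVENUE = 50_000  # dollars per year from the alternative project


def person_months_to_dollars(person_months, cost_per_month=COST_PER_DEV_MONTH):
    """Convert an effort estimate in person-months to a dollar cost."""
    return person_months * cost_per_month


automation_cost = person_months_to_dollars(AUTOMATION_PERSON_MONTHS)
print(f"automation effort:      ${automation_cost:,.0f}")
print(f"revenue we would delay: ${NEW_PROJECT_ANNUAL_REVENUE:,.0f} per year")
```

Both options consume the same 4.5 person-months, so putting them in the same unit makes the trade-off visible: choosing automation means the $50K project waits.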
Now things get more complicated. We certainly don’t want to forgo this new revenue, or delay it by a quarter. New revenue can enable us to hire another person to continue the toil, perhaps somebody who enjoys operations. Alternatively, if the new hire is an engineer, they can help with more automation (after we examine the opportunity cost again). I suspect that at this point, our investment in automation makes less economic sense.
Of course, there is still the issue of team morale, along with other factors such as the long-term scalability of the decisions and attrition risk, which could be harder to quantify.
What’s the takeaway?
Of course, the above numbers are made up, so the actual decision would have to be based on actual values.
In the scenario presented, the value of automation is relatively small. If there were more value to be had (for example, if for the same cost the savings accrued to more engineers, or if the opportunity cost were smaller), we might have made a different decision. If the 20% reduction in operational burden cost 1 person-month as opposed to 4.5, we would also probably move forward.
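To see how sensitive the decision is to these inputs, here is a rough sketch that compares how long the freed-up capacity takes to repay the investment under a few variations. The function and scenario names are hypothetical, and the 10-engineer case is just one example of “savings accruing to more engineers”.

```python
def recovery_months(cost_person_months, team_size, toil_reduction):
    """Months of freed capacity needed to repay the automation effort."""
    freed_fte = team_size * toil_reduction  # FTEs freed per month
    return cost_person_months / freed_fte


# (cost in person-months, team size, toil reduction as a fraction)
scenarios = {
    "original: 4.5 PM, 5 engineers": (4.5, 5, 0.20),
    "cheaper: 1 PM, 5 engineers": (1.0, 5, 0.20),
    "wider: 4.5 PM, 10 engineers": (4.5, 10, 0.20),
}

for name, (cost, team, reduction) in scenarios.items():
    months = recovery_months(cost, team, reduction)
    print(f"{name}: repaid in {months:.2f} months")
```

The cheaper and wider variants pay back much faster than the original, which is why they would be easier to justify against competing revenue work.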
Predicting revenue opportunities is even harder than predicting the savings, but the larger point here is that we must try to take opportunity cost into account, even if the estimate is imprecise.
So when do you do technical improvement projects?
In short, when it makes economic sense, and very much depending on what alternatives you have. Projects start having a much better ROI when they are not purely about freeing up developers, but also add new technical capabilities or affect many developers (a 10% improvement for 20 engineers is huge). Even if you don’t have time for a step-change reduction, there are often smaller improvements you can make that do have a very high ROI.
In the toil-reduction scenario, we may not have considered all the benefits. For example, if in addition to reducing toil we managed to accelerate new feature delivery in the future (not just by freeing up a developer, which we accounted for, but by improving the overall delivery process), then we must take that into account. Here, analysis concepts such as cost of delay can be helpful.
The above was a fairly simple analysis. For a more rigorous treatment, I recommend this article, where Don Reinertsen breaks down technical debt. That one is different in that it discusses whether to strategically incur some debt, whereas this post looks at the situation where you have already incurred it.
Thanks to César Arévalo for feedback on drafts of this post and Ablimit Aji for introducing me to toil as a term for this.