Reducing Toil in Software Operations

A few weeks ago, I was discussing with one of my managers the problem that some of our teams were spending a relatively large percentage of their time on toil: operational work that has to be done but doesn’t generate as much value as adding new features or making the software more scalable. The Google SRE book has a good definition:

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

ROI Calculation

Let’s assume we can get this number (the share of time spent on toil) down to 20% through improved automation, at a cost of about 4.5 person-months, and do an ROI calculation to get the payback period. Allocating 1.5 FTEs for a quarter (1.5 × 3 = 4.5 person-months) basically means we get back an additional developer in about 3 months.

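This means that in about 7.5 months (3 + 4.5), we will break even on the effort: 3 months to do the work, then another 4.5 months for the regained developer to pay back the 4.5 person-months invested.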
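The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a full ROI model: the function name and parameters are mine, and it assumes the benefit (one regained developer) only starts once the automation project is finished.

```python
def breakeven_month(fte_allocated, project_months, fte_regained):
    """Return (investment in person-months, months from project start to break-even).

    Assumption: no capacity is regained until the project is complete.
    """
    if fte_regained <= 0:
        raise ValueError("fte_regained must be positive")
    investment = fte_allocated * project_months      # person-months spent on automation
    recovery_months = investment / fte_regained      # months needed to earn the investment back
    return investment, project_months + recovery_months


# Numbers from the scenario: 1.5 FTEs for a quarter, one developer regained.
investment, breakeven = breakeven_month(fte_allocated=1.5, project_months=3, fte_regained=1.0)
print(f"Investment: {investment} person-months")
print(f"Break-even: {breakeven} months after the project starts")
```

Changing `fte_regained` is a quick way to see how sensitive the payback period is to the estimate of how much toil the automation really removes.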

Great — let’s do it!

Or maybe not… Can you spot the flaw in the logic above?

The implicit assumption above was that we had 1.5 FTE developers just hanging around, and if they weren’t increasing automation (decreasing toil), they would be sitting idly on their hands.

Opportunity Cost

Hmm, we just introduced a new unit, money. To compare the cost/benefit of our automation work to revenue, we need to convert our person-months to money. Let’s assume each developer-month costs us $12.5K. This makes 4.5 person-months worth about $56K (4.5 × $12.5K = $56.25K).
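The unit conversion is simple enough to sketch in code. The helper name below is hypothetical, and valuing a regained developer at their monthly cost is my simplifying assumption; the $12.5K rate and the 4.5 person-months come from the scenario.

```python
COST_PER_DEVELOPER_MONTH = 12_500  # dollars, the rate assumed in the text


def person_months_to_dollars(person_months, rate=COST_PER_DEVELOPER_MONTH):
    """Convert an effort estimate in person-months to a dollar figure."""
    return person_months * rate


automation_cost = person_months_to_dollars(4.5)  # cost of the automation project
monthly_saving = person_months_to_dollars(1.0)   # one regained developer, valued at cost

print(f"Automation cost: ${automation_cost:,.0f}")
print(f"Value of one regained developer: ${monthly_saving:,.0f} per month")
```

With both sides expressed in dollars, the automation project can be put next to any revenue estimate for the work those same developers would otherwise do.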

Now things get more complicated. We certainly don’t want to forgo this new revenue, or delay it by a quarter. New revenue can enable us to hire another person to continue the toil, perhaps somebody who enjoys operations. Alternatively, if the new hire is an engineer, they can help with more automation (after examining opportunity cost again). I suspect that at this point, our investment in automation makes less economic sense.

Of course, there is still the issue of team morale.

Other factors, such as the long-term scalability of the decision, team morale, and attrition risk, could be harder to quantify.

What’s the takeaway?

In the scenario presented, the value of automation is relatively small. If there were more value to be had (for example, if for the same cost the savings accrued to more engineers, or if the opportunity cost were smaller), we might have made a different decision. If the 20% reduction in operational burden cost 1 person-month as opposed to 4.5, we would also probably move forward.
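To see how much the cost estimate matters, here is a small sensitivity sketch comparing the two costs mentioned above. It assumes the benefit stays fixed at one regained developer and reuses the $12.5K developer-month rate; the function and scenario names are illustrative only.

```python
COST_PER_DEVELOPER_MONTH = 12_500  # dollars, the rate assumed earlier in the post


def recovery_months(cost_person_months, fte_regained=1.0):
    """Months of regained capacity needed to repay the automation effort."""
    return cost_person_months / fte_regained


# Automation cost in person-months for each scenario.
scenarios = {"as estimated": 4.5, "cheaper automation": 1.0}

results = {name: recovery_months(cost) for name, cost in scenarios.items()}

for name, cost in scenarios.items():
    dollars = cost * COST_PER_DEVELOPER_MONTH
    print(f"{name}: ${dollars:,.0f} invested, repaid after {results[name]} months of savings")
```

The smaller the investment, the less revenue has to be deferred to fund it, which is why the cheaper variant is much easier to justify against the opportunity cost.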

Predicting revenue opportunities is even harder than predicting the savings, but the larger point here is that we must try to take opportunity cost into account, even if the estimate is imprecise.

So when do you do technical improvement projects?

In the scenario of reducing toil, we may not have considered all benefits. For example, if in addition to reducing toil, we managed to accelerate new feature delivery in the future (not just by freeing up a developer, which we accounted for, but by improving the overall delivery process), then we must take that into account. Here, analysis concepts such as cost of delay can be helpful.

The above was a fairly simple analysis. To get more rigorous, I recommend this article, where Don Reinertsen breaks down technical debt. That one is different in that it discusses whether to strategically incur some debt, whereas this post looks at the situation where you have already incurred it.

Thanks to César Arévalo for feedback on drafts of this post and Ablimit Aji for introducing me to toil as a term for this.

Head of Engineering @ Apprentice.io, Intelligent Manufacturing Execution for Life Sciences, from COVID to Cancer.