CodingBlocks

Site Reliability Engineering – Eliminating Toil

We say “toil” a lot this episode while Joe saw a movie, Michael says something controversial, and Allen’s tip is to figure it out yourself, all while learning how to eliminate toil.

The full show notes for this episode are available at https://www.codingblocks.net/episode184.

Sponsors

  • Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
  • Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.

Reviews

Thank you for the reviews! AA, Franklin MacDunnaduex, BillyVL, DOM3ag3

Want to help out the show? Leave us a review!

Survey Says

Does your job include any toil?
  • Of course it includes some, but it's a reasonable amount.
  • This topic is opening my eyes to how much toil my job has.
  • I think my job includes too much toil but my team won't do anything to change it.
  • OMG, if I removed the toil from my job I'd have no job left.
Cover of the “Site Reliability Engineering” book from O'Reilly, the famous “SRE Book” from Google

Chapter 5: Eliminating Toil

  • Toil is not just work you don’t wanna do, nor is it just administrative work or tedious tasks.
  • Toil is different for every individual.
  • Some administrative work has to be done; it is not toil but overhead.
    • HR needs, trainings, meetings, etc.
    • Even some tedious tasks are not toil if they pay long-term dividends.
      • Cleaning up service configurations was an example of this.
  • More precisely, toil is work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and/or that grows as the service does.
    • Manual – Something a human has to do.
    • Repetitive – Running something once or twice isn’t toil. Having to do it frequently is.
    • Automatable – If a machine can do it, then it should be done by the machine. If the task needs human judgement, it’s likely not toil.
    • Tactical – Interrupt-driven rather than strategy-driven. You may never be able to eliminate it completely, but the goal is to minimize this type of work.
    • No enduring value – If your service didn’t change state after the task was completed, it was likely toil. If there was a permanent improvement in the state of the service then it likely wasn’t toil.
    • O(n) with service growth – If the amount of work grows with the growth of your service usage, then it’s likely toil.
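The six traits above can be treated as a rough checklist: the more of them a task matches, the more likely it is toil. Below is a minimal sketch of that idea. The `TaskTraits` class, the `toil_score` function, and both sample tasks are hypothetical names made up for illustration; they do not come from the book or the episode.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskTraits:
    """Hypothetical checklist: one flag per toil trait listed above."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_score(task: TaskTraits) -> int:
    """Count how many of the six traits the task matches (0 to 6)."""
    return sum(getattr(task, f.name) for f in fields(task))


# Illustrative task: restarting a stuck job by hand every time an alert fires.
manual_restart = TaskTraits(True, True, True, True, True, True)

# Illustrative task: a one-off cleanup of service configurations.
config_cleanup = TaskTraits(
    manual=True,
    repetitive=False,
    automatable=False,
    tactical=False,
    no_enduring_value=False,
    scales_with_service=False,
)

print(toil_score(manual_restart))  # 6
print(toil_score(config_cleanup))  # 1
```

A high score is only a hint, not a verdict; deciding whether something is really toil still takes judgement.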

Why is Less Toil Better?

  • At Google, the goal is to keep each SRE’s toil at less than 50%.
    • At least 50% of each SRE’s time should go to engineering work that either reduces future toil or adds features to a service.
      • Where features mean improving reliability, performance, or utilization.
  • The goal is set at 50% because toil tends to expand if left unchecked and can easily grow to 100% of an SRE’s time.
  • The time spent reducing toil is the “engineering” in the SRE title.
    • This engineering time is what allows the service to scale with less time required by an SRE to keep it running properly and efficiently.
  • When Google hires an SRE, they promise that they don’t run a typical ops organization and mention the 50% rule. This is done to help ensure the group doesn’t turn into a full-time ops team.

Calculating Toil

  • The book gave the example of a 6 person team and a 6 week cycle:
    • Assuming 1 week of primary on-call time and 1 week of secondary on-call time, that means an SRE has 2 of 6 weeks with “interrupt” type of work, or toil, meaning 33% is the lower bound of toil.
  • With an 8 person team, you move to an 8 week cycle, so 2 weeks on call out of 8 weeks means a 25% toil lower bound.
  • At Google, SREs report that their top source of toil is interrupts (non-urgent, service-related messages), followed by urgent on-call responses, then releases and pushes.
  • Surveys of SREs at Google indicate that the average time spent on toil is closer to 33%.
    • Like all averages, this hides the outliers: some people spend no time on toil, while others spend as much as 80% of their time on it.
      • If someone is taking on too much toil, it’s up to the manager to spread that load more evenly.
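The on-call arithmetic above (2 of 6 weeks, 2 of 8 weeks) can be sketched as a small function. This assumes, as the book's example does, that the rotation cycle is one week per team member and that each person takes one primary and one secondary week per cycle. The function name `on_call_toil_lower_bound` and its parameters are made up for illustration.

```python
from fractions import Fraction


def on_call_toil_lower_bound(team_size: int, on_call_weeks: int = 2) -> Fraction:
    """Fraction of each cycle an SRE spends on call.

    Assumes the cycle lasts `team_size` weeks and that each person is on call
    for `on_call_weeks` of them (one primary plus one secondary by default).
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    if on_call_weeks < 0:
        raise ValueError("on_call_weeks must not be negative")
    # A person cannot be on call for more than the whole cycle.
    return min(Fraction(on_call_weeks, team_size), Fraction(1))


for size in (6, 8):
    bound = on_call_toil_lower_bound(size)
    print(f"{size}-person team: {float(bound):.0%} toil lower bound")
# 6-person team: 33% toil lower bound
# 8-person team: 25% toil lower bound
```

The result is only a floor: interrupts, releases, and pushes outside of on-call weeks add to it.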

What Qualifies as Engineering?

  • Work that requires human judgement,
  • Produces a permanent improvement in a service and is guided by a strategy,
  • Takes a design-driven approach, and
  • Is as generic as possible: the more general the solution, the more services it can be applied to, for even greater gains in efficiency and reliability.

Typical SRE Activities

  • Software engineering – Involves writing or modifying code.
  • Systems engineering – Configuring systems, modifying configurations, or documenting systems in ways that provide long-term improvements.
  • Toil – Work that is necessary to run a service but is manual, repetitive, etc.
  • Overhead – Administrative work not directly tied to a service such as hiring, HR paperwork, meetings, peer reviews, training, etc.

The 50% goal is measured over a few quarters or a year. There may be some quarters where toil goes above 50%, but that should not be sustained. If it is, management needs to step in and figure out how to bring it back into the goal range.
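One way to tell a single bad quarter from a sustained problem is to look at a trailing average rather than each quarter on its own. The sketch below assumes a four-quarter window and made-up toil figures; the function name `toil_is_sustained`, the window size, and the sample data are illustrative choices, not something the book prescribes.

```python
from statistics import mean


def toil_is_sustained(quarterly_toil, limit=0.5, window=4):
    """Return True when average toil over the last `window` quarters exceeds `limit`.

    `quarterly_toil` is a sequence of toil fractions (0.0 to 1.0), oldest first.
    """
    recent = list(quarterly_toil)[-window:]
    if not recent:
        return False  # no data yet, so nothing to flag
    return mean(recent) > limit


# One quarter spikes to 60%, but the trailing average stays under the goal.
spiky_year = [0.40, 0.60, 0.45, 0.40]

# Every quarter is above 50%: time for management to step in.
overloaded_year = [0.55, 0.60, 0.65, 0.60]

print(toil_is_sustained(spiky_year))       # False
print(toil_is_sustained(overloaded_year))  # True
```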

“Let’s invent more, and toil less”

Site Reliability Engineering: How Google Runs Production Systems

Is Toil Always Bad?

  • The fact that some amount of toil is predictable and repeatable makes some individuals feel like they’re accomplishing something, i.e. quick wins that may be low risk and low stress.
  • Some amount of toil is expected and unavoidable.
  • When the amount of time spent on toil becomes too large, you should be concerned and “complain loudly”.
  • Potential issues with large amounts of toil:
    • Career stagnation – If you’re not spending enough time on projects, your career progression will suffer.
    • Low morale – Too much toil leads to burnout, boredom, and discontent.
  • Too much time on toil also hurts the SRE team.
    • Creates confusion – The SRE team is supposed to do engineering, and if that’s not happening, then the goal of the team doesn’t match the work being done by the team.
    • Slows progress – The team will be less productive if they’re focused on toil.
    • Sets precedent – If you take on too much toil regularly, others will give you more.
    • Promotes attrition – If your group takes on too much toil, talented engineers in the group may leave for a position with more development opportunities.
    • Causes breach of faith – If someone joins the team but doesn’t get to do engineering, they’ll feel like they were sold a bill of goods.
  • Commit to cleaning up a bit more toil each week with engineering activities.

Resources We Like

  • Links to Google’s free books on Site Reliability Engineering (sre.google)
  • The Greatest Inheritance, uh stars Jaleel White (IMDb)
  • We’re Testing Your Patience… (episode 20)
  • Clean Code – How to Write Amazing Unit Tests (episode 54)
  • DevOps Vs SRE: Enabling Efficiency And Resiliency (harness.io)

Tip of the Week

  • Pandas is a great tool for data analysis. It’s fast, flexible, and easy to use, and it makes it easy to work with data from GCS buckets. (pandas.pydata.org)
  • 7 GUIs you can build to study graphical user interface design. Start with a counter and build up to recreating Excel, programming language agnostic! (eugenkiss.github.io)
  • Did you know there’s a command-line utility for sorting, i.e. sort? (manpages.ubuntu.com)
  • Using Minikube? Did you know you can transfer images with minikube image save from your Minikube environment to Docker easily? Useful for running things in a variety of ways. (minikube.sigs.k8s.io)
  • Ever have a multi-stage Docker build where you only wanted to build one of the intermediary stages? Use docker build --target <stage name> to build just that stage. It’s great for debugging as well as part of your caching strategy. (docs.docker.com)
