Forem: Hannah Culver

SREview Issue #12 April 2021

Hannah Culver — Tue, 20 Apr 2021 14:47:22 +0000

Spring is here! We have rain! We have flowers! We have allergies! We also have some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

Tweets that have us twittering

Alex Elman

@_pkill

We need to stop with the "they need to feel their own pain" framing for service owners being on-call for their products. That's such a counter-productive message. On-call is an opportunity to gain expertise on how the service works in the context of Prod.

14:50 PM - 07 Apr 2021

ca$s:e cage 💫

@akolsuoicauqol

Half of my job is Googling the other half is documentation. There I said it.

11:32 AM - 07 Apr 2021

dan slimmon

@danslimmon

INCIDENT RESOLVED: This outage has been resolved. In investigating the incident, our engineers learned that the "fifth nine" was their friendship all along

23:50 PM - 07 Apr 2021

SREading

SRE Leaders Panel: SRE Adoption as Organizational Transformation: If you missed our recent panel with Tony Hansmann, Vanessa Yiu, and Kurt Andersen, this is the transcript.

Incident analysis as guerrilla case study research: Lorin Hochstein writes about how to use the desire for closure to justify spending time examining how work is really done.

Having On-call Nightmares? Runbooks can Help you Wake Up.: Senior Software Engineer Harry Hull writes about how to use runbooks to improve your incident response, even at 2 AM.

Advice for someone moving from SRE to backend engineering: Charles Cary writes about how dynamic Ops and SRE are, misconceptions about creativity, and on-call duties.

Resilience in Action, Episode 6: Our podcast, Resilience in Action, is back! Host Kurt Andersen speaks with Todd Underwood, ML SRE Lead and Pittsburgh Site Lead for Google.

The Mightiest Monolith: Robert Barron writes about what modern developers, DevOps practitioners, and Site Reliability Engineers can learn from the Space Shuttle program.

Give it a whirl

Here are the improvements we’d love to highlight about the new Blameless bot:

Updated Incident Summary: The incident summary has been updated to display a more concise rundown. The new summary has improved UX/UI design and shares the incident summary, severity, status, and type as well as timestamps and the people involved.

Automated shortcut message: We also added a message that gives users access to documents and shortcuts to help automate commands. This lowers the toil for responders.

Incident help suggestions: When launching a new incident, the Blameless bot will now provide suggestions for commands. The bot will direct the user to the list of commonly used "slash" commands and provide a link to the web site where the commands are all listed.

Task list enhancements: In addition to creating an expanded checklist, we’ve made it possible to check off tasks within the Slack UI for a smoother user experience. To limit noise in the Slack channel, only task owners can now see their assigned tasks. A newly created task appears to its owner in a readable format and updates that person’s main task list.

Lastly, instead of changing a task’s status from a drop-down menu, users now cross tasks off as they are completed. This makes the previous Blameless commands “_mark task as pending” and “/blameless complete task” obsolete. Instead, users should use the command “/blameless show tasks” and cross off completed items from the checklist that appears after the command.

If you’d like to see these upgrades in action, try Blameless today.

Events

DevOps Online Summit April 26-30: DevOps professionals throughout the world come together and share their learnings.

Failover Conf April 27: Learn how teams have adapted over the past year, share your own stories, and engage with others in the reliability community.

99 Percent Visible: DevOps Reliability April 27 9 AM PDT: Kat Cosgrove will give her talk, “Learning to Learn by Teaching” and discuss her experience teaching developers.

Blameless Bi-Weekly Demo April 27 at 8 AM PDT: Check out a live demo of Blameless as we walk you through operations best practices, and get your questions answered.

SRE Leaders Panel: Business Agility & SRE April 29 at 11 AM PDT: Join Chris Hendrix, Garima Bajpai, and Jason Fraser for a discussion on how reliability impacts the flow of value.

Deserted Island DevOps April 30: A single-day virtual event streamed on Twitch. All presentations will take place in the world of Animal Crossing: New Horizons.

Resilience in Action E6: Oversize Coffee Mugs, SLOs, and ML with Todd Underwood

Hannah Culver — Mon, 19 Apr 2021 16:19:36 +0000

What are MTTx Metrics Good For? Let's Find Out.

Hannah Culver — Tue, 13 Apr 2021 15:46:27 +0000

By: Emily Arnott, Failure is Inevitable.

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process.

Yet, MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data. In this blog post, we’ll cover:

What are common MTTx metrics and why are they used?
What are some problems with relying on MTTx metrics?
How can I make MTTx metrics more helpful?
How do I move away from shallow metrics?
How better metrics help build a blameless culture

Common types of MTTx metrics

What are some problems with relying on MTTx metrics?

For each metric, trends can help suggest where to work on improvement. For example, if the MTTD is increasing, you might work to improve your monitoring. But, MTTx metrics alone are insufficient to identify trends in reliability.

In an experiment detailed in the ebook Incident Metrics in SRE, author Štěpán Davidovič ran simulations of multiple systems with varying incident frequencies and durations. He generated sets of hypothetical data and compared the MTTx metrics from each. The goal was to determine if changes made to improve MTTx metrics (such as buying a tool) would reflect in the system.

The findings were conclusive: “MTTx metrics will probably mislead you.” As the experiment stated, “Even though in the simulation the improvement always worked, 38% of the simulations had the MTTR difference fall below zero for Company A, 40% for Company B, and 20% for Company C. Looking at the absolute change in MTTR, the probability of seeing at least a 15-minute improvement is only 49%, 50%, and 64%, respectively. Even though the product in the scenario worked and shortened incidents, the odds of detecting any improvement at all are well outside the tolerance of 10% random flukes.”

This means that even if your tool or process improvement is working, you may not even be able to detect it. This makes it hard to understand what actually improves incident response. And, it doesn’t really tell us anything about the overall system reliability.
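To get a feel for why this happens, here is a minimal simulation sketch. It is not the setup used in the ebook: the log-normal duration distribution, the 50 incidents per period, and the 10% improvement are all assumptions chosen for illustration. Every "after" incident is genuinely shorter, yet the measured MTTR still looks worse in a noticeable share of runs.

```python
import random
import statistics


def simulate_mttr_difference(rng, incidents_per_period=50, improvement=0.10):
    """Return MTTR(before) - MTTR(after), in minutes, for one simulated period pair."""
    # Assumption: incident durations follow a log-normal distribution.
    before = [rng.lognormvariate(4.0, 1.0) for _ in range(incidents_per_period)]
    # Every "after" incident is truly shortened by the improvement factor.
    after = [
        rng.lognormvariate(4.0, 1.0) * (1.0 - improvement)
        for _ in range(incidents_per_period)
    ]
    return statistics.mean(before) - statistics.mean(after)


def fraction_looking_worse(trials=2000, seed=42):
    """Share of simulations where MTTR appears to get worse despite a real improvement."""
    rng = random.Random(seed)
    diffs = [simulate_mttr_difference(rng) for _ in range(trials)]
    return sum(1 for d in diffs if d < 0) / trials


if __name__ == "__main__":
    print(f"MTTR looked worse in {fraction_looking_worse():.0%} of simulations")
```

Changing the assumed distribution or the incident count shifts the exact share, but heavy-tailed durations and small samples keep the signal noisy.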

How can I make MTTx metrics more helpful?

MTTx metrics are more helpful when contextualized with other information about the incident. As Blameless SRE Architect Kurt Andersen suggests, “What can be enlightening is to combine these metrics with some form of incident categorization.” Using your incident classification process, you can analyze MTTx metrics for a smaller subset of incidents.

Here are some ways you can further categorize incidents to work with more meaningful data:

The severity of the incident
How the incident was discovered (internally or via customer report)
The service area disrupted
The resources used in responding to the incident (such as runbooks, backups)
Other monitoring data about the system when the incident occurred (such as server load)

Here are some examples of how these combinations can lead to actionable change:

If the MTTA for customer-reported incidents is much higher than for internally detected incidents, can you create a faster pipeline for processing customer reports? Or, is there a way your monitoring could detect the issue so customer reports are less frequent?
If using a certain runbook leads to lower MTTR metrics, what about that runbook could be adopted into other runbooks?
If one area of service has very high MTTD, what monitoring tools could you implement to catch incidents faster?
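As a rough sketch of the first example, the snippet below splits MTTA by how the incident was discovered. The record layout and the numbers are hypothetical; in practice they would come from your incident classification data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident records: (detection_source, minutes_to_acknowledge)
incidents = [
    ("customer_report", 42.0),
    ("internal_monitoring", 5.0),
    ("customer_report", 58.0),
    ("internal_monitoring", 7.0),
]


def mtta_by_category(records):
    """Group acknowledge times by category and return the mean for each group."""
    grouped = defaultdict(list)
    for category, minutes in records:
        grouped[category].append(minutes)
    return {category: mean(values) for category, values in grouped.items()}


print(mtta_by_category(incidents))
```

The same grouping works for severity, service area, or runbook used: only the first element of each record changes.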

Moving away from shallow metrics

As you conduct deeper analysis on your metrics, you’ll find there’s no single MTTx metric that can tell the whole story. However, there are better ways you can analyze your data to gain insight into your overall reliability and incident response processes.

Focusing on customer impact with SLOs

One of the most important things you want to assess after an incident is customer impact. This can be difficult to determine. Reliability is subjective, based on how customers perceive your service.

To determine the impact on customer happiness, you can use SLIs and SLOs. SLIs, or service level indicators, measure how key areas of your services are performing against customer expectations. SLOs, or service level objectives, mark where customers begin to be pained by unreliability.

How you perform against your SLO is often a better indicator of reliability than MTTx metrics. This is because reliability is determined by your users. SLOs help you understand the effect that incidents have on customer happiness. As SLOs are moveable goals that will change as your customers’ needs change, you should never find yourself or your team held to an arbitrary number. Revision is part of setting good SLOs.

Examining outliers instead of focusing on averages

Kurt also suggests looking at outliers instead of averages: “In general, I don't find the ‘central tendency’ to be as interesting as investigating outliers for a distribution.” Although they may not represent the typical incidents, outliers in your MTTx trends can be valuable.

Discover what was different about the incident that made it an outlier. Is it something that could occur again? You might need to focus on a qualitative rather than quantitative approach. Lorin Hochstein breaks this concept down in a blog post. Rather than relying on metrics to prevent major incidents, Lorin suggests looking for “signals.” Use your team’s expertise to catch and act on noteworthy data.
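One simple way to surface candidates for that closer look is an interquartile-range fence. This is an assumption on our part, not a method Kurt or Lorin prescribes, and the durations below are made up. The flagged incidents are a starting point for qualitative review, not a verdict.

```python
from statistics import quantiles


def find_outliers(durations, fence=1.5):
    """Return durations outside the interquartile-range fences."""
    q1, _median, q3 = quantiles(durations, n=4)
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    return [d for d in durations if d < low or d > high]


# Hypothetical time-to-resolve values in minutes.
resolve_minutes = [30, 35, 40, 42, 45, 50, 55, 600]
print(find_outliers(resolve_minutes))
```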

Look at the story of real work behind your incidents

In a post for Adaptive Capacity Labs, John Allspaw looks at how to move beyond shallow data. His conclusion is that “meaningful insight comes from studying how real people do real work under real conditions.” Metrics alone cannot contain the many complicating factors in real work.

John shows how to build a “thicker” understanding of data. You can map out how an incident developed and was resolved. This is much “messier” than a single metric, but often more insightful. These complicated representations should be examined when they’re a deviation from the mean.

How better metrics help build a blameless culture

When you rely on shallow metrics, it can become desirable to game the system or even give up trying to meet KPIs. Team members may feel that their performance is measured by a particular (and sometimes irrelevant) metric. They could be tempted to work to just improve that metric instead of actually improving the system. This phenomenon exists in many industries, from manufacturing to healthcare.

This causes a multitude of problems:

Employees are hesitant to raise issues that might improve the system if it will negatively affect the metric
When the metric reaches an undesirable level, employees may blame others to avoid being blamed
Employees will hesitate to take risks or innovate if they fear it could negatively impact the metric
Employees may even misreport data to artificially inflate the metric, especially if jobs, promotions, or bonuses depend on it

To empower and encourage employees, you need to cultivate a blameless culture. Moving away from shallow metrics is part of this transformation. Emphasize that everyone has a shared goal of customer satisfaction. Using SLOs as your guiding metric can help teams quantify this.

Emphasize that there is no single “score” for an employee or team’s performance. This encourages teams to see incidents as a chance to learn rather than a major setback.

If you’re looking to get more from your metrics, we can help. Blameless SLOs put incidents in the context of customer satisfaction, and Reliability Insights allows teams to sort MTTx metrics into more informative subsets of data. To see how, feel free to sign up for a demo.

Having On-call Nightmares? Runbooks can Help you Wake Up.

Hannah Culver — Mon, 12 Apr 2021 15:24:34 +0000

By: Harry Hull, Failure is Inevitable

The nightmare

You aren't sure how long you've been here, but the view outside the window sure is soothing. Before you can fully take in your surroundings, a siren rips you back into the conscious world. Slowly, you begin to piece together that you exist, and you are on call.

The ringing, much louder now, pierces through your skull as you begin to open your bleary eyes. You turn over your pillow, grab your phone, and click through the PagerDuty notification. After quickly ACKing, you start to read the alert:

```
alertname = CartService5xxError
```

As fate would have it, you know literally nothing about the cart service or why it might be erroring. Unfazed, you keep reading:

    endpoint = 'CheckoutPromoWeb'

This combination of symbols is totally meaningless to you, but it sounds really scary. You have already worked here for a year, but you acutely remember your first week when the cart service was down for 3 hours. The company lost a lot of money and your boss was really stressed out during the incident retrospective.

You read the rest of the alert message, hoping for a sign for how serious this could be:

```
description = ask harry
```

"Great... I’ll page Harry," you mumble under your breath as you reach for your laptop. What your half-asleep brain fails to realize is that Harry hasn't worked at the company for 4 years.

You will soon realize this, however, as you sit hunched over your laptop staring at a greyed out "deactivated" Slack avatar. No one else is awake either, of course, and in a hazy panic you @channel your entire team and page a few unfortunate people.

In the meantime, you start to open random dashboards searching for any clues to help triage the severity of this mess.

There's a better way

The previous spooky tale is sadly all too real. After a short time on call, every team realizes that having service alerts is only the very first step. There's a huge gap between having well-instrumented services with actionable alerts and having your alerting system so finely tuned that anyone on the team can ack and efficiently act upon an alert, even with a sleep deprived mind.

To get to the latter point, try to get your team to consider the following questions for different scenarios:

Is this currently affecting our customers? Will customers be affected soon? If so, how many and how bad?
Has this happened before? What did we do?
What other context do I need to fully triage this?
How do I know when this has recovered?

In order to bridge this gap and answer these questions, we created Runbook Documentation. Now we link a runbook in the description of all of our alerts, and we don't let a new alert get past the pull request without an attached runbook. This is how we ensure our on-call team feels supported, even during the trickiest of incidents.
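A check like this can be automated in review. The sketch below is not our actual tooling: the alert dictionaries, the `CheckoutLatencyHigh` alert, and the example.com URL are hypothetical, and "has a runbook" is approximated as "the description contains a link".

```python
# Hypothetical alert definitions, e.g. parsed from a rules file in a pull request.
alerts = [
    {"alertname": "CartService5xxError", "description": "ask harry"},
    {
        "alertname": "CheckoutLatencyHigh",
        "description": "Runbook: https://runbooks.example.com/checkout-latency",
    },
]


def alerts_missing_runbook(alert_defs):
    """Return the names of alerts whose description contains no link."""
    missing = []
    for alert in alert_defs:
        description = alert.get("description", "")
        # Assumption: any http(s) URL in the description counts as a runbook link.
        if "http://" not in description and "https://" not in description:
            missing.append(alert["alertname"])
    return missing


print(alerts_missing_runbook(alerts))
```

Failing the build when the returned list is non-empty would keep alerts like the one in the nightmare from being merged.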

Applying a runbook to incident response

At the beginning of our story, the first thing the on-call needed to do was triage the customer impact of the alert. Since it's hard to remember anything at 2:30 AM, this is our first step in the runbook.

This step links to a dashboard showing the general health of the cart service (request / error throughput + latency histograms). Here, the on-call can quickly see some very important context: how many purchases are happening, and which endpoints are erroring at what rate.

Now the on-call can see that, thankfully, checkouts per minute are about the same as this time last week and there are no huge dips. This isn't affecting revenue yet, but we still don't know how these errors are affecting the customer experience.

Step 2 gives context to show the customer impact of this endpoint being down, and links to additional runbooks if necessary. This step could also give an updated severity recommendation for the incident based on impact.

Now we know that customers will not be able to see offers on the checkout page. This, of course, is frustrating to the customer and impacts revenue, but customers are still ordering and the core purchase flow is otherwise healthy. Following the runbook, the on-call creates a SEV3 incident in Blameless and continues on to the next steps.

From here, the on-call sees a ton of experiment-related errors. They notice that all of the error logs seem to be referencing the same experiment ID. This experiment is probably the culprit for the sudden spike in errors. The linked “Promotion Outage Runbook” mentions that misconfigured experiments have caused outages in the past, and has a section on viewing historical data for experiments and steps for disabling specific experiments.
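Spotting a shared experiment ID by eye works for a handful of log lines, but a runbook could also link a small helper. The sketch below assumes a made-up log format with an `experiment_id=` field; the real logs are not shown here, so treat the pattern as a placeholder.

```python
import re
from collections import Counter

# Hypothetical log lines; the real log format is an assumption.
error_logs = [
    "ERROR CheckoutPromoWeb experiment_id=exp-42 failed to render offer",
    "ERROR CheckoutPromoWeb experiment_id=exp-42 failed to render offer",
    "ERROR CheckoutPromoWeb experiment_id=exp-7 timeout",
    "ERROR CheckoutPromoWeb experiment_id=exp-42 failed to render offer",
]


def top_experiment(lines):
    """Return (experiment_id, count) for the most frequent ID, or None if none found."""
    ids = []
    for line in lines:
        match = re.search(r"experiment_id=(\S+)", line)
        if match:
            ids.append(match.group(1))
    if not ids:
        return None
    return Counter(ids).most_common(1)[0]


print(top_experiment(error_logs))
```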

Iterating through failure

In this example, the on-call team is able to successfully resolve an incident with a handy runbook. But how would the outcome have changed if the runbook had been out of date, or simply not detailed enough? A good runbook takes time and several iterations to become maximally informative.

With each outage, runbooks can be tuned and hardened to be more helpful for the people acking the alert. Keeping in line with the blameless ethos, common problems like misestimating severities can be seen as gaps in our runbooks and processes rather than the mistake of the on-calls.

Runbooks are living documents. They’re meant to be helpful. If a particular runbook is not answering the questions you need, it’s time to review it. As your runbooks improve, you’ll be able to eliminate toil from the incident response process. Additionally, some of that on-call dread will dissipate as you know that you have the tools to support you during an incident.

So you Want an SRE Tool. Do you Build, Buy, or Open Source?

Hannah Culver — Mon, 05 Apr 2021 16:37:17 +0000

By: Emily Arnott, Failure is Inevitable

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project?

This is a big decision. Switching methods halfway through adoption is costly and can cause thrash. You’ll want to determine which method is the best fit before taking action. Each choice requires a different type of investment and offers different benefits. We’ll help you decide which solution is your best fit by breaking down the pros and cons. In this blog post, we’ll cover:

Types of SRE tools available
Why SRE tools can be helpful
Pros and cons of buying, building, and open sourcing

Types of SRE Tools

There is a wealth of SRE tools available to you for different areas of SRE best practices. Let’s look at some of the most common categories:

Monitoring and observability tools: collect and present information about your services
SLOs and error budgeting tools: monitor your service level objectives for a holistic view of your reliability
Alerting tools: inform designated people when monitoring detects an issue with your service
Runbook tools: document, execute, and automate step-by-step guides
Incident management tools: automate toil and improve communication through chatbots, checklists, and data aggregation
Incident retrospective tools: create documents that record what you learned in an incident as well as follow-up tasks
Chaos engineering tools: create and execute chaos experiments that simulate failure in your system

While all of these tools help operationalize some facet of SRE best practices, an end-to-end solution incorporates many of these tools and helps them talk to one another. A very important consideration when choosing between building, buying, or open sourcing an SRE tool is how these moving parts connect. You want to ensure that all the pieces in your ecosystem are working together.

Why SRE tools can be helpful

Automation

Some SRE tools allow you to automate parts of your reliability process. The level of automation will be affected by the tool itself. Alerting tools automate notifying incident participants. Incident management tools automate coordination during incident response, as well as data aggregation. Runbook tools automate repetitive routine processes. End-to-end solutions automate many of these processes, and more. Automation reduces toil, freeing up time and energy for more nuanced tasks.

Repeatability

Learning new SRE processes can be daunting. SRE tools guide you through tasks like setting SLOs, building runbooks, and creating incident retrospectives. Following these guidelines reduces the cognitive toil and improves process consistency.

Data analysis

Tooling can help make data about your services more actionable. The reports you generate are consistent and codified, helping you identify patterns. Monitoring tools highlight and consolidate the most important information.

Integrations

An ideal SRE tool would also help you make the most of your other tools. By connecting monitoring and alerting to your incident management, or syncing your retrospectives to your ticketing system, you save time and have a more holistic view of all the individual parts of your system. This makes it easier for teams to make decisions, resolve incidents, and prioritize development. In short, an ideal SRE tool would help you throughout your entire software development life cycle rather than just distinct portions of it.

Reliability

Of course, SRE tools can make your service more reliable, too. By using these tools to codify SRE best practices, establish processes and guardrails, and automate repetitive tasks, you set your teams up for success. These best practices bolster reliability and encourage a blameless culture. Embracing, responding to, and learning from failure makes teams and systems stronger.

Buying, building, or open-sourcing

Now that you’ve seen what tools are available to you, you’ll need to decide how to adopt them. Your major choices are buying an out-of-the-box solution, building your own solution, or adapting an open source solution. Each has pros and cons.

Pros of buying your SRE tool

Reliability: Your vendor will maintain your solution. You’ll have an agreement guaranteeing standards of availability and security. Your vendor will have support processes to ensure your tool works as expected. This will save you from having to devote resources to keeping the tool functional.

Expanding functionality: The vendor will continue to develop the functionality of the tool. Without needing to invest your own resources, the tool will grow more useful and efficient. You can also provide your own input with feature requests to drive the product in the direction you’d like.

Built-in integrations: The vendor will likely have a variety of integrations available to source information from or communicate with. By not having to create these integrations on your own, you save time and money. Additionally, this ensures that the other tools you use can all speak with one another rather than operating in silos.

Cons of buying your SRE tool

Costs: Tooling can be costly up front, and it can be difficult to get budget for. Buying a solution is still often less costly than building your own or maintaining an open source solution. Yet the costs of building or maintaining are harder to see and may not be bubbled up when options are compared.

Pros of building your SRE tool

Customizability: Because you’ll be building it to meet your specific needs, this product will fit your needs like a glove. If your needs change, you’ll have full access to adapt your tool.

Cons of building your SRE tool

Opportunity cost: Time spent working on the tool is time not spent working on other projects. You may have to devote people full-time to maintaining and upgrading the tool.

Responsibility: If the tool breaks, it’s up to you to fix it. If other internal services rely on your tool, this could mean establishing internal SLAs. Also, tribal knowledge can become a problem. If documentation becomes out of date or team members change organizations, crucial knowledge can be lost.

Building complex integrations: You will need to spend time making sure that your SRE tool is able to ingest information from your alerting and monitoring tools and send information to ticketing systems and more. These capabilities can be difficult to build out, especially for teams with other development priorities.

Pros of open sourcing your SRE tool

Cost efficiency: Open source tools are often free to use. There will be some opportunity cost to implementing the tool in your specific environment, but it can be minimal.

Adaptability: As you have access to the source code of your tool, you’ll have the flexibility to add the features you need. But, this could have a large opportunity cost.

Cons of open sourcing your SRE tool

Security concerns: As everyone will have access to the source code of the tool, security issues could emerge. The community behind the tool will make efforts to secure the tool and fix issues, but the responsibility will be yours.

Maintenance and improvements: Updates and fixes for the tool will come from the development community. As community members often take on these projects outside of work, there could be a longer wait for improvements or additional integrations.

Here’s a chart summarizing the pros and cons of each option:

For more analysis on how to choose and acquire SRE tools, check out our buyers’ guide.

If you’ve decided to purchase an SRE tool, check out what Blameless has to offer. Sign up for a demo!

How to Analyze Incidents Better with the Right Metrics

Hannah Culver — Tue, 30 Mar 2021 15:03:27 +0000

Written by: Emily Arnott, Failure is Inevitable

An important SRE best practice is analyzing and learning from incidents. When an incident occurs, you shouldn’t think of it as a setback, but as an opportunity to grow. Good incident analysis involves building an incident retrospective. This document will contain everything from incident metrics to the narrative of those involved. These metrics aren’t the whole story, but they can help teams make data-driven decisions.

But choosing which metrics are best to analyze can be difficult. You need to find the valuable signals among the noise. You’ll want your metrics to reflect how the incident impacted your customers. In this blog post, we’ll cover:

Common metrics in incident response
How to connect your incident metrics to customer happiness
How to measure an incident’s impact on development
How to integrate your metrics into your cycle of learning

Common metrics in incident response

Here are some common categories of metrics and how they can be helpful.

Metrics measuring incident frequency

Number of incidents: This is the most basic thing you’ll want to keep track of. You can make this measure more meaningful by classifying your incidents.

Number of alerts: Different types of incidents will require different levels of alerts. Keeping track of these can help balance on-call loads.

Metrics measuring incident response timelines

Mean time to detect: This tracks the average amount of time it takes for your system to register an incident. To lower this time, consider investing in monitoring tools.

Mean time to acknowledge: This is the average time between the system registering an incident and the team responding. Alerting and on-call policies can impact this indicator.

Mean time to resolve: This covers the average time between the incident response starting and the service returning to full functionality. This number will be highly variable depending on the service, type of incident, and more.
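As a minimal sketch of how these three averages fall out of incident timestamps, consider the snippet below. The timeline tuples are hypothetical, and the intervals follow the definitions above: start to detection, detection to acknowledgement, and acknowledgement to resolution.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident timelines: (started, detected, acknowledged, resolved)
incidents = [
    (
        datetime(2021, 3, 1, 2, 0),
        datetime(2021, 3, 1, 2, 10),
        datetime(2021, 3, 1, 2, 15),
        datetime(2021, 3, 1, 3, 15),
    ),
    (
        datetime(2021, 3, 8, 14, 0),
        datetime(2021, 3, 8, 14, 20),
        datetime(2021, 3, 8, 14, 35),
        datetime(2021, 3, 8, 15, 15),
    ),
]


def minutes_between(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def mttx(records):
    """Return mean time to detect, acknowledge, and resolve, in minutes."""
    return {
        "mttd": mean(minutes_between(s, d) for s, d, _a, _r in records),
        "mtta": mean(minutes_between(d, a) for _s, d, a, _r in records),
        "mttr": mean(minutes_between(a, r) for _s, _d, a, r in records),
    }


print(mttx(incidents))
```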

These metrics, commonly referred to as MTTx metrics, may not always reflect the improvements you’re making in your overall reliability efforts. It’s important to understand which metrics are most indicative of certain areas of improvement. As Štěpán Davidovič noted in Incident Metrics in SRE, “If you are improving one step of the journey, including all other steps in the aggregate makes your ability to understand the impact of the change worse.”

There are alternatives to MTTx metrics that can better depict changes in reliability. As Štěpán also noted, SLOs can help answer the most important question, which is “Is our reliability getting better or worse, as a company?”

Connecting your incident metrics with customer happiness

Metrics can offer insights into your practices, but you need more context. Without this context and deeper analysis, these numbers can be shallow indicators of how well your team is responding to incidents. They’re also very team-centric metrics. Customer-centric metrics can shed more light on the impact of an incident. To reflect customer happiness in your metrics, you can use SRE tools like SLIs and SLOs. Let’s break down the process of how to develop these.

1. Create user journeys to determine what matters most to customers

Get into the mindset of a typical customer. Think about how they engage with your service. What aspects do they rely on? What slowdowns would annoy them most? Think of each action they take in your service as part of a user journey. Partnering with Customer Support or product can be useful during this process.

2. Craft SLIs that are indicative of user happiness

Determine the monitoring data that is most representative of what your customers find valuable. This could include the latency of a search result, the freshness of the search data, or the availability of the search service as a whole. Once you’ve determined which data type best fits your customers’ needs, you can begin measuring your performance.

3. Set SLOs to the customer pain point

Now consider what values of the SLI would be unacceptable for the customer. This is where you’ll set your SLO, or service level objective. It should be comfortably stricter than any legal agreements (SLAs) to provide wiggle room for the team. When incidents occur, determine their impact on customer happiness by looking at how service performance measures against the SLO.

SLOs allow teams to have a better idea of how impactful an incident is. This can be more indicative of service health than many other solitary metrics.
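To make the relationship concrete, here is a small sketch of an availability SLI checked against an SLO. The request counts and the 99.9% target are illustrative assumptions, and availability is only one possible SLI.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests served successfully; no traffic counts as fully available."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the objective."""
    return sli >= slo_target


# Hypothetical numbers for one measurement window.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(sli, meets_slo(sli, slo_target=0.999))
```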

For instance, imagine a team has a very minor incident. If the incident went unresolved for a week or two, it might have little to no impact on the customer. Yet, if you’re setting goals on MTTR, this outlier would “point” to an issue in your incident response process.

But, when you look at the incident through the lens of customer happiness, the response was appropriate. This context is important to note.

Another important consideration is the error budget. This also informs how critical an incident is. Let’s take a look at error budgets and how they help teams prioritize reliability.

Measuring an incident’s impact on development

The complement of the SLO is the error budget: a 99.9% SLO leaves a 0.1% error budget. This reflects how much unreliability the system can experience within a window. As the remaining error budget decreases, certain policies can kick in to preserve the SLO or respond to reliability challenges.

For example, if a service is only halfway through its window but has burned through 85% of the error budget, the policy might implement a localized code freeze to keep from exceeding the error budget.

Or, a team that has exceeded their error budget the last two windows might reallocate development resources to work on reliability needs. These methods can help teams maintain customer happiness and better prioritize development work.
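The two examples above could be written down as a simple policy function. This is only a sketch: the thresholds mirror the examples in this post, and the function name, the action strings, and the extra "warn" rule are hypothetical choices rather than a standard policy.

```python
def policy_action(budget_consumed, window_elapsed, overspent_windows=0):
    """Pick a response from a hypothetical error budget policy.

    budget_consumed: fraction of this window's error budget already spent
    window_elapsed: fraction of the window that has passed
    overspent_windows: consecutive previous windows that exceeded the budget
    """
    if overspent_windows >= 2:
        return "reallocate development time to reliability work"
    if budget_consumed >= 1.0:
        return "budget exhausted: freeze feature releases"
    if budget_consumed >= 0.85 and window_elapsed <= 0.5:
        return "localized code freeze"
    if budget_consumed > window_elapsed:
        return "warn: burning budget faster than the window is passing"
    return "ship as normal"


print(policy_action(budget_consumed=0.85, window_elapsed=0.5))
print(policy_action(budget_consumed=0.30, window_elapsed=0.9, overspent_windows=2))
print(policy_action(budget_consumed=0.20, window_elapsed=0.5))
```

Writing the policy as code, or at least as an explicit table, keeps the decision from being renegotiated in the middle of every incident.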

Integrating metrics into a cycle of learning

After an incident, you should record your insights in an incident retrospective. Some of your incident metrics can provide valuable context for each incident. Were there any outliers? Did an incident take an unusually long time to detect? Is an incident of this severity rare in this service area? Include a discussion section where you analyze potential contributing factors.

Teams should share these incident retrospectives throughout the organization. This helps everyone learn from failure. This can also inform further incident response policies. Look at what worked and what didn’t, and adjust. If a runbook was out of date, it’s time to update it. If a gap in monitoring caused a delay in response, look for ways to fill in this knowledge.

These learnings will benefit the customer, as well. As you get better at analyzing and learning from incidents, your response process will also mature. By looking at metrics through a customer-centric lens, you can hone in on the metrics that matter. SLOs and error budgets are important indicators for your system’s reliability performance. They can act as guides when other metrics appear inconclusive.

If you enjoyed this blog post, check out these resources:

SREview Issue #11 March 2021

Hannah Culver — Tue, 23 Mar 2021 15:01:42 +0000

Is it spring yet? Or spring still? Time sure is strange nowadays. At least we have a ton to look forward to in the next few weeks! Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

Tweets that have us twittering

Lorin Hochstein E_GOAL_CONFLICT

@norootcause

Incident write-ups are both time-sensitive (gotta talk to the people involved before their memories fade!) and timeless (lessons aren't localized to the particular moment in time when the incident happened).

05:51 AM - 15 Mar 2021

Camila Lenis

@camilaleniss

Don't spend 6 minutes doing something by hand when you can spend 6 hours failing to automate it.

23:45 PM - 13 Mar 2021

J. Paul Reed

@jpaulreed

In my observations, a pretty good (and easily measurable!) first order approximation for how "successful" an incident review was: how LITTLE the facilitator speaks.

(This holds even when the incident commander facilitates; yet another reason why your IC shouldn't run your IR.)

23:29 PM - 04 Mar 2021

SREading

SRE2AUX: How Flight Controllers were the first SREs: Geoff White writes about what vintage space lore has to do with site reliability engineering in the 21st century.
The Netflix Cosmos Platform: This article explains why the Netflix team built Cosmos, how it works, and shares some of the things the team learned along the way.
SRE as Organizational Transformation: Lessons from Activist Organizers: Chris Hendrix writes about how we can learn from activist organizers while driving company-wide change.
What is a Canary Deployment?: This post contains a thorough description of canary releases including benefits, visual examples, and how it fits into an effective deployment strategy.
How We Built and Use Runbook Documentation: Alicia Li and Lucas Bartroli write about runbooks. “Even if you don’t notice, you are executing runbooks everyday, all the time.”
Increment’s Reliability Issue: This issue contains articles on reliability from thought leaders such as Tanya Reilly, Mads Hartmann, Ana Margarita Medina, and more.

Events

SRE Thought Leader Panel: SRE Adoption as Organizational Transformation March 25, 11 AM PDT: Hear from experts Kurt Andersen, Vanessa Yiu, and Tony Hansmann. Hosted by Chris Hendrix.
Blameless Bi-Weekly Demo March 30 at 8 AM PDT: Check out a live demo of Blameless as we walk you through operations best practices, and get your questions answered.
DevOps Online Summit April 26-30: DevOps professionals throughout the world come together and share their learnings.
Deserted Island DevOps April 30: A single-day virtual event streamed on Twitch. All presentations will take place in the world of Animal Crossing: New Horizons.

How to Analyze Contributing Factors Blamelessly

Hannah Culver — Tue, 16 Mar 2021 15:39:08 +0000

By: Emily Arnott

Originally published on Failure is Inevitable.

SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation.

Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we’ll look at:

A definition for root cause analysis
A definition for contributing factor analysis
How to choose between RCAs and contributing factor analysis
Best practices for contributing factor analyses
How to incorporate learning from analyses back into development

What is a root cause analysis?

Root cause analysis, or RCA, is a method for finding the reason an incident occurred. Here it is, summarized in four steps:

Identify the incident. You should understand the exact boundary of what is and isn’t considered part of the incident.
Create a timeline. Log all events impacting the system. Start when the aberrant behavior begins and end when the system returns to normal.
Judge the events for causality. Consider the impact of each event leading up to the incident. Did it indirectly or directly cause the incident? Was it necessary for the incident to happen? Was it irrelevant?
Build a causal diagram. A causal diagram or graph is an illustrative tool. It shows how events contribute to the incident. Here is an example:
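A causal diagram can also be captured as a small directed graph. The sketch below is a hypothetical illustration: the event names are invented, loosely following a server outage during a feature launch, and the helper simply walks backwards from the incident to collect every event on a causal path.

```python
# Each key is an event; its value lists the events it directly contributed to.
causes = {
    "feature launch": ["traffic spike"],
    "server update in progress": ["reduced capacity"],
    "traffic spike": ["server overload"],
    "reduced capacity": ["server overload"],
    "server overload": ["outage"],
    "backup server delayed": ["outage"],
    "unrelated config change": [],
}


def contributors(graph, incident):
    """Return every event with a direct or indirect causal path to the incident."""
    # Invert the edges so we can walk backwards from the incident.
    parents = {}
    for event, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(event)

    found, stack = set(), [incident]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


relevant = contributors(causes, "outage")
irrelevant = set(causes) - relevant

print("Contributed to the outage:", sorted(relevant))
print("Judged irrelevant:", sorted(irrelevant))
```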

What is a contributing factor analysis?

A contributing factor analysis is another methodology for examining an incident. Rather than pinpoint a single root cause of an incident, the contributing factor analysis looks for a broader range of factors. This is a more holistic approach. It considers technical, procedural, and cultural factors. For the above example of a server outage, here are some factors you may also consider:

The feature launch schedule doesn’t account for server update timings
No policy to scale up server availability for feature launches
Server architecture could be updated to support more traffic
Incident response team could be overworked with new feature launch, delaying backup server availability

Contributing factor analysis should be part of a larger incident retrospective approach. Teams should try to identify contributing factors that can lead to actionable change.

How do you choose between an RCA and a contributing factor analysis?

RCAs and contributing factor analysis each have use cases. RCAs are often formally required while contributing factor analysis is a useful internal tool. Let’s break down why.

When are RCAs used?

RCAs can be part of an organization’s official response to an incident. Because they are often public-facing, they have strict guidelines for formatting. This standardization can be challenging. In a discussion with Blameless, Nic Benders from New Relic shared his thoughts on RCAs:

“The RCA process is a little bit of a bad word inside of New Relic. We see those letters most often accompanied by ‘Customer X wants an RCA.’ Engineers hate it because they are already embarrassed about the failure and now they need to write about it in a way that can pass Legal review.”

Even if they’re unpleasant, RCAs can be necessary. Customers have come to expect openness around failure. Dheeraj Khanna from Tenable explains:

“Today, the industry has become more tolerant to accepting the fact that if you have a vendor, either a SaaS shop or otherwise, it is okay for them to have technical failures. The one caveat is that you are being very transparent to the customer. That means that you are publishing your community pages, and you have enough meat in your status page or updates.”

When are contributing factor analyses used?

Contributing factor analyses help translate the causes of an incident into actionable changes. Because this document is for internal use, teams can be more open about the failure, which makes it easier to improve.

Nic Benders discusses the shortcomings of RCAs in capturing these areas. “It remains challenging for me to try and find a way to address those people skills and process issues. Technology is the one lever that we pull a lot, so we put a ton of technical fixes in place. But, there are three elements to those incidents. And I worry that we're not doing a good job approaching the other two: people skills and processes.”

When trying to learn the most you can from incidents, looking at all contributing factors is a must. Although you may need both types of analysis, contributing factor analyses are often more useful.

Best practices for blameless contributing factor analysis

Remove the value of blame. While analyzing an incident, blame offers an easy answer. Making an individual at fault removes the responsibility from the system. This means that no changes are necessary to the system; the work is already done. You should not value the solution of blame. By focusing on systemic causes, you can learn more and improve your system further.

Look beyond individuals. Humans aren't perfect. Imagine that, while conducting a retrospective, the team realizes that an alert was triggered but a team member ignored it. Why? It's time to dig deeper than the individual. Are alerts often noisy or irrelevant? Has this person had enough on-call training and experience? Or have they been on call for too long without a break? By asking these questions, you can arrive at meaningful lessons. It is the best way to ensure the mistake doesn’t happen again.

Celebrate failure. When uncovering factors, celebrate each one as an opportunity for learning. It may seem that the more factors you uncover, the more work you’ve made for yourselves. You don’t want this to discourage team members from suggesting other factors. Create a psychologically safe environment for people to brainstorm. Make sure each contribution is valued.

How to feed learning from analyses back into development

One of the key benefits of a contributing factor analysis is generating actionable insights into the system. But how do you ensure that these lessons lead to changes in development and policy? Here are some tips:

Create a central repository of required actions per incident
Invite development teams to incident review meetings
Bake action items into future sprints, working with product when necessary
Link learning and tasks to larger initiatives for the organization
Have review meetings after task completion to ensure the desired changes occurred

Keep a cycle flowing between the causes of incidents and the changes you make. This will help your system continually improve in relevant ways.

Blameless can help your contributing factor analysis process. Blameless incident retrospectives serve as a hub for learning and future changes. Blameless aggregates the data you need to discover systemic causes behind each incident. To see how, check out a demo.

If you enjoyed this blog post, check out these resources:

It's all Chaos! And it Makes for Resilience at Scale

Hannah Culver — Mon, 15 Mar 2021 16:16:45 +0000

By: Emily Arnott

Originally published on Failure is Inevitable.

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.

But integrating chaos engineering with other SRE tools and practices can be challenging. To get the most from your experiments, you’ll need to tie in learnings across all your reliability practices. You’ll also need to adjust your chaos engineering as your organization scales. In this blog post, we’ll look at:

How SRE and chaos engineering intersect
Best practices for chaos engineering
What experiments look like at different maturity levels

Chaos engineering and SRE

It’s clear how the goals of SRE and chaos engineering align. Both practices encourage teams to build resilience into their systems. But the connections don’t stop there. Many SRE practices integrate with chaos engineering to increase the effectiveness of both. Below are a few examples.

SLOs as chaos engineering scoreboards
When running chaos engineering experiments, it's important to determine how impactful the hypothetical failure would be. This can be difficult.

Consider a test that shows that an entire service would go offline if a certain server fails. You estimate that it would take an hour or so to return to normal operation. It sounds frightening, but what if that service was only used by a tiny fraction of your customers?

Another test might show that, when traffic surpasses a certain threshold, a page accessed by every customer loads 3 seconds slower. This scenario could have more customer impact than the other. Teams would want to focus on resolving this issue first.

SLOs allow you to compare these scenarios using the most important metric: customer impact. SLIs, or service level indicators, are built from the service metrics that matter most to customers. SLOs, or service level objectives, show the level of failure customers will tolerate.

When you run chaos experiments, you can determine how the experiment would affect the SLO. This gives you a triaging model for the lessons of different experiments. You can then focus on preventing the incidents that affected the SLO most.
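Here is one way the triage could look in practice. This sketch estimates how much of a window's error budget each of the two scenarios above would consume. The traffic shares, durations, the 99.9% target, the 30-day window, and the assumption that traffic is spread evenly are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    affected_share: float  # share of all requests that hit the failing path
    duration_hours: float  # how long the degraded state would last


def budget_burned(scenario, slo_target, window_hours=30 * 24):
    """Estimate the share of the window's error budget a scenario would consume.

    Assumes traffic is spread evenly over the window, so the fraction of bad
    requests is affected_share * duration / window.
    """
    bad_fraction = scenario.affected_share * scenario.duration_hours / window_hours
    return bad_fraction / (1.0 - slo_target)


scenarios = [
    Scenario("server loss takes niche service offline", affected_share=0.01, duration_hours=1),
    Scenario("shared page 3 s slower at peak traffic", affected_share=1.0, duration_hours=2),
]

# Rank experiments by how hard they would hit the SLO, worst first.
ranked = sorted(scenarios, key=lambda s: budget_burned(s, slo_target=0.999), reverse=True)
for s in ranked:
    print(f"{s.name}: {budget_burned(s, 0.999):.1%} of the error budget")
```

With these made-up numbers, the full outage of the niche service uses a small slice of the budget, while two hours of slow page loads for everyone would use the whole budget several times over.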

Chaos engineering as runbook bootcamp
It's important to simulate the impact of a hypothetical failure and work to prevent it. But there will still be incidents, no matter how many experiments we run. Chaos engineering also gives teams the space to practice response measures. This can help responders work faster and with more confidence during a real incident.

In SRE, incident responses are codified as runbooks. These are guides broken down into modular checks and steps. Where possible, runbooks are automated to save toil. Of course, runbooks can never be perfect. Regular review is necessary to ensure that all information is up-to-date and comprehensive.

Chaos engineering can help improve runbooks by providing more opportunities to evaluate them. Teams will rarely use runbooks that address rare, catastrophic failures. When the failure does occur, your team will need to know it can trust the runbook. By running chaos experiments of this scenario, you’ll find potential stumbling blocks.

Runbooks can also serve as inspiration for chaos experiments. If a runbook has been “gathering dust,” you can design an experiment to put it to use. This will ensure it’s up-to-date with your system and still useful.
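Finding the runbooks that are “gathering dust” can be as simple as comparing last-execution dates against a cutoff. In this sketch, the runbook names, dates, and the 180-day cutoff are hypothetical, and the data is assumed to come from wherever you track runbook usage.

```python
from datetime import date, timedelta


def dusty_runbooks(last_used, today, max_age=timedelta(days=180)):
    """Return names of runbooks not executed within max_age, oldest first."""
    stale = [(used, name) for name, used in last_used.items() if today - used > max_age]
    return [name for used, name in sorted(stale)]


# Hypothetical last-execution dates.
last_used = {
    "restart-search-indexer": date(2021, 3, 1),
    "failover-primary-database": date(2019, 11, 12),
    "rotate-tls-certificates": date(2020, 6, 30),
}

candidates = dusty_runbooks(last_used, today=date(2021, 3, 15))
print(candidates)
```

The oldest entries at the front of the list are natural candidates for the next chaos experiment.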

Building a library of chaos engineering retrospectives
A valuable tool in your SRE tool belt is the incident retrospective. This is a document built by teams responding to an incident. It contains the incident timeline, key communications, follow-up actions, and more. Incident retrospectives form a valuable hub of knowledge. They are invaluable for onboarding and developing a culture of continuous improvement.

Chaos engineering can help build your library of retrospectives. Teams should write retrospectives about chaos experiments as they would for a real incident. Include details about why and how the experiment was conducted to be thorough. Reviewing these retrospectives can provide the same beneficial insights as a real incident.

Conversely, incident retrospectives can motivate good chaos experiments. Imagine your team had difficulty responding to a particular incident. Reviewing the incident retrospective revealed why the team stumbled. Your team creates a plan for incidents like these moving forward. “Replaying” the incident will give you a direct comparison between the new and old methods. It can help you avoid making the same mistakes.

A maturity model of chaos engineering

As organizations grow in maturity, adopting chaos engineering as a practice provides more opportunities. But there are also challenges. This chart breaks down what to expect at each stage.

No matter what maturity your organization is, the best time to try chaos engineering is now. The sooner you can build experimenting into your routines, the more time you’ll have to develop your expertise.

Blameless can help you make the most of your chaos engineering experiments. Our SLO, runbook documentation, and incident retrospective tools can help you get the most from every experiment. To see how, check out a demo.

If you enjoyed this blog post, check out these resources:

How to Build an SRE Team with a Growth Mindset

Hannah Culver — Tue, 09 Mar 2021 16:18:06 +0000

Originally published on Failure is Inevitable.

By: Emily Arnott

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow, and encourage this mindset across the organization.

In this blog post, we’ll cover:

What a growth mindset is and why it helps your SRE team
How to hire for a growth mindset
How to develop people into SREs with a growth mindset
How a blameless culture empowers a growth mindset

What is a growth mindset?

In an article for Harvard Business Review, Carol Dweck breaks down the definition of a growth mindset. She summarizes her findings:

“Individuals who believe their talents can be developed (through hard work, good strategies, and input from others) have a growth mindset. They tend to achieve more than those with a more fixed mindset (those who believe their talents are innate gifts).”

To illustrate this, let’s compare some statements.

Why does having a growth mindset help?

Having a growth mindset in your organization helps with camaraderie, commitment, risk taking, and innovation. In a study conducted for Harvard Business Review, the authors found that:

“Employees in a ‘growth mindset’ company are:

47% likelier to say that their colleagues are trustworthy,
34% likelier to feel a strong sense of ownership and commitment to the company,
65% likelier to say that the company supports risk taking, and
49% likelier to say that the company fosters innovation.”

If you want to improve morale and productivity in your teams, a growth mindset is essential.

Your SRE team: champions of the growth mindset

The best way to instill a growth mindset throughout your engineering organization is to have a dedicated team championing it. Your SREs are the perfect candidates to help drive this change. Here are some reasons why:

SRE is a holistic practice, so SREs interact with teams throughout the organization
SREs create processes that emphasize a growth mindset, such as learning from incidents via blameless retrospectives
SREs create safeguards with SLOs and error budgets to encourage innovation
SREs drive review and revision cycles, reinforcing the importance of feedback
SREs treat all failures as opportunities to learn and improve, making both the team and the system more resilient over time

Wait, what if you don’t have an SRE team right now? If you don’t have an SRE team, these principles can be adopted by each engineer. A growth mindset isn’t produced by an SRE team, only championed. To be successful, all team members will need to be on board. But, if you are looking to build an SRE team from the ground up, here are some tips.

Hiring SREs with a growth mindset

When hiring for your SRE team, finding people with a growth mindset is key. In a discussion with Blameless, New Relic VP and GM Nic Benders discussed how he prioritizes mindset over experience. He says that the predictive power of where someone has worked before is “very poor.” Instead, he says that “I always look for people who are interested in constantly challenging themselves and learning new things.”

But how do you determine someone’s growth mindset while interviewing? Here are some example questions that can be revealing:

Could you tell us about a success you’ve had recently? Why were you successful?
Candidates with a growth mindset will likely attribute some of their success to outside sources. These could be books they’ve read, research they’ve done, or teammates and colleagues that they spoke with. They might also point to previous failures as a reason for recent success. After all, failures should always teach you something.

Could you tell us about a failure you’ve had recently? Why did it go wrong?
Failures can be painful. Yet they give us space to grow. Candidates with a growth mindset will still likely note the sting of failure. But, they’ll also hopefully talk about what they took from the experience. They’ll turn a blameless eye towards the incident and look for ways to improve in the future.

What are you excited to learn about while working here?
Look for candidates who are applying because they see an opportunity to grow at the company. Learning and trying new things is what makes day-to-day work exciting. Candidates may want to learn a new programming language, or a new tool that’s part of your stack. Or they may be interested in seeing how different teams work together cross-functionally. Curiosity is always welcome.

What does a growth mindset mean to you?
Sometimes the direct route is best. When a candidate is enthusiastic about growth and learning, it’s a sign that they could be a good fit for your SRE role.

These questions are not comprehensive. They are a starting point for a conversation about growth. Consider giving these questions to candidates prior to the interview so they have time to think about them before answering. This can also take some of the pressure off the interview process and create a more open and friendly environment.

Internally developing an SRE team with a growth mindset

Alongside hiring, you can grow your SRE team by promoting people to the role internally. When building your SRE team, consider a wide variety of previous positions. Just because someone doesn’t have the title “SRE” doesn’t mean they couldn’t be great in the role.

In his discussion with Blameless, Nic Benders shared his thoughts. “I have to always remind myself that I didn't get to where I am today by doing things that I was qualified to do. In the same way, as a leader I need to be giving work to other people who might not be, or might not seem to be, qualified to do that work, because that is their path to growth.”

If you give them the opportunity, people with a growth mindset will rise to the challenge. Someone showing that they’re ready to learn new things is as encouraging as technical expertise. Keep in mind that this may require some additional support from teammates and leadership while a fledgling SRE finds their footing.

Spreading SRE practices throughout the organization

Ideally, everyone within your engineering organization should be familiar with the SRE best practices. Even if they aren’t working directly with SLOs or writing retrospectives, they should know how these tools work. Most importantly, they should understand why these practices are important. Your SRE team should be the ambassadors of these lessons.

SRE isn’t just for SREs. Blameless incident retrospectives are important for all teams responsible for running code in production. Incident response procedures are key for all teams who own services or carry pagers. This mindset makes for a more resilient organization as a whole. It also improves your pool of potential SREs. If everyone understands best practices, you’re able to focus on promoting based on mindset.

Empowering a growth mindset with a blameless culture

At the heart of growth is a feeling of agency. People need to be free to push their limits and challenge themselves. They need to be empowered to innovate. Most importantly, they need to embrace failure. Even the most growth-oriented person will be stifled if they’re afraid they’ll be punished if they fail.

A blameless culture is the key to encouraging a growth mindset. When something goes wrong, rather than placing blame on an individual, look into the systemic causes. Assume everyone was working in good faith to the best of their abilities. Rather than faulting someone for making an error, look into why they made the mistake. Try to celebrate the fact that you’re improving your system with every incident.

The blameless attitude complements a growth mindset:

Both deal with the causes behind actions, rather than the actors
Both believe that failure leads to growth and future success
Both see feedback as a form of support

Are you ready to empower your team with a growth mindset? Blameless can help! To see how, check out a demo.

If you enjoyed this blog post, check out these resources:

How We Built and Use Runbook Documentation at Blameless

Hannah Culver — Mon, 08 Mar 2021 22:09:19 +0000

By: Alicia Li and Lucas Bartroli

Originally published on Failure is Inevitable.

Why runbooks are important to a fully developed SRE strategy

Even if you don’t notice, you are executing runbooks every day, all the time. When you have an incident in your day-to-day operations, you follow a series of ordered and connected steps to solve it. For instance, if you lose your internet connection, you will follow a series of steps to resolve that issue:

Check if you’re still connected to the WiFi network.
Check for the router status.
Try to restart the router.
Check if connection is back.
In case connection is not back, call the internet provider.

This could be different depending on your method, but you have the idea. Even if you don’t write it down because it is not a complex process, you’re still executing a runbook to achieve a goal or resolve an incident. However, within a more complex socio-technical environment, it becomes crucial to document your runbooks and codify your knowledge.
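Codifying even this tiny runbook makes the idea concrete. The sketch below models each step as a description plus a function that reports whether the step restored the connection. The state keys and the simulated outcomes are assumptions made up for the example.

```python
# Each step: (description, function returning True if the step restored the connection).
RUNBOOK = [
    ("Check the WiFi connection and rejoin if needed", lambda s: not s["on_wifi"]),
    ("Check the router status", lambda s: False),  # diagnostic only, never a fix
    ("Restart the router", lambda s: s["restart_fixes_router"]),
]


def execute(runbook, state):
    """Run steps in order, stopping once the connection is back."""
    log = []
    for description, resolves in runbook:
        log.append(description)
        if resolves(state):
            log.append("Connection is back")
            return log
    # Every step ran without success, so fall through to the last resort.
    log.append("Call the internet provider")
    return log


# Simulated situation: still on WiFi, and a router restart fixes the problem.
fixed = execute(RUNBOOK, {"on_wifi": True, "restart_fixes_router": True})
# Simulated situation: nothing we can do locally helps.
escalated = execute(RUNBOOK, {"on_wifi": True, "restart_fixes_router": False})

print(fixed)
print(escalated)
```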

SRE and engineering teams need a tool to write and store their runbooks because incidents can be way more complex than the one in the above example. Incidents can involve collaboration between different teams, code execution, reuse of metadata across different steps (tokens, names, passwords, etc.), conditional actions based on the result of a step execution, and more. Or teams may just need to write down a personal experience from an edge case they encountered while resolving an incident, which can help others if it happens again in the future.

Most runbooks focus on incident mitigation. However, sometimes the response depends on knowing the cause of the incident first. It is easy to overlook the role a runbook can potentially play in determining a contributing factor of an incident. Instead of a single, large runbook that tries to deal with multiple situations, we recommend breaking it down into multiple runbooks focused on doing one thing well.

For example, imagine your internet isn’t working. There could be multiple reasons why you cannot connect. Your computer might have suffered a hardware failure, the modem might fail, you might be connected to the wrong network, or simply at a place where signal strength isn’t strong enough. Some of these issues might require their own runbooks. You can have an overarching runbook to determine the cause which links to one or more runbooks that can help fix an individual issue.
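One way to picture this structure is a small diagnostic runbook that only decides the likely cause, then hands off to a focused runbook. Everything named below (symptom keys, cause labels, runbook names, and the order of the checks) is hypothetical.

```python
# Hypothetical runbook names; each would be its own focused document.
RUNBOOKS_BY_CAUSE = {
    "hardware_failure": "repair-or-replace-computer",
    "modem_down": "restart-modem",
    "wrong_network": "join-correct-network",
    "weak_signal": "move-closer-to-access-point",
}


def diagnose(symptoms):
    """Overarching runbook: map observed symptoms to a likely cause."""
    if not symptoms["network_adapter_detected"]:
        return "hardware_failure"
    if symptoms["connected_ssid"] != symptoms["expected_ssid"]:
        return "wrong_network"
    if symptoms["signal_bars"] <= 1:
        return "weak_signal"
    if not symptoms["modem_online"]:
        return "modem_down"
    return None  # no known cause matched


def next_runbook(symptoms):
    """Return the focused runbook to follow, or an escalation fallback."""
    return RUNBOOKS_BY_CAUSE.get(diagnose(symptoms), "escalate-to-internet-provider")


symptoms = {
    "network_adapter_detected": True,
    "connected_ssid": "CoffeeShopGuest",
    "expected_ssid": "HomeNetwork",
    "signal_bars": 4,
    "modem_online": True,
}
print(next_runbook(symptoms))
```

Keeping diagnosis separate from the fix means each focused runbook can be updated or tested on its own.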

Well-written runbooks should be clearly broken down into different steps. For each step, in addition to clearly indicating what needs to be done, it’s also helpful to include some context to explain why this step is taken. This helps new engineers onboard quickly and limits tribal knowledge.

Migrating runbooks to a central repository

Runbooks are only helpful if everyone can find them. If your runbooks are scattered across Confluence, Google Docs, or even stored locally on a laptop, they can be difficult to locate when you need them the most. We dealt with a similar problem here at Blameless. So, our team began dogfooding Runbook Documentation for our own runbooks. Here’s what we found the most useful.

Migrating our runbooks to Blameless was a very easy task. We used to have all our runbooks in Confluence, broken down by steps. Runbook Documents currently support 4 types of steps (and we plan to add even more). These are the steps we most commonly use within our own runbooks and they include:

Text Blocks: Log and print any message to the screen.
Rich Text Blocks: Similar to Text Block with rich text capabilities.
Code Snippets: Display a code editor that allows you to select between more than 50 languages with syntax highlighting.
Custom Forms: Create your own form with JSON Schema.

Here is an example of a runbook migrated from Confluence to Blameless:

When we’re trying to find a particular runbook within Blameless later, we also have a sorting function that makes finding the exact runbook we need faster. We provide a search-and-sort functionality in the runbooks list page that allows us to filter them very quickly by name, description, number of steps, and last execution dates.

What makes us excited about Runbook Documentation

Runbook Documentation allows users to document the optimal way to respond to events. This helps teams be consistent in their incident response processes. Users are guided through a series of predefined steps to accomplish a specific outcome via manual tasks. In Blameless, you can also create independent steps that allow you to craft custom flows, and get metadata from each step to use on another step.

Additionally, we built Runbook Documentation using GraphQL Subscriptions. This means that you can interact with runbooks in real time. For example, if someone else executed a runbook, you can see the new instance of the runbook running and take actions if needed.

Another cool feature of Runbook Documentation is that you can write code snippets using Monaco Editor (the code editor that powers VSCode). This means you have no limits when writing a code snippet, as it supports more than 50 languages with syntax highlighting.

Another feature that we love about Runbook Documentation is the ability to attach individual runbooks to an incident. This integration allows all stakeholders to see exactly which steps are being taken to mitigate this incident. Plus, you can track runbook usage. This helps teams understand which runbooks are most commonly consulted, which are most useful, and which might need a little tidying up.

Additionally, what was run at the time of the incident is preserved as-is, even if the runbook changes in the future. This is much better than an ad-hoc comment linking to a document or Confluence page that may have already been edited, as it gives a clearer view of what responders were working with. Furthermore, we’re able to see the audit log history of individual runbooks that have been invoked on the runbook history page.
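The general idea of preserving what was run can be sketched with a snapshot taken at execution time. This is not how Blameless implements it; the class, field names, and incident ID below are hypothetical and only illustrate the pattern.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Runbook:
    name: str
    steps: list
    executions: list = field(default_factory=list)

    def execute(self, incident_id):
        """Record a frozen copy of the steps as they were at execution time."""
        record = {
            "incident_id": incident_id,
            "executed_at": datetime.now(timezone.utc),
            "steps": tuple(deepcopy(self.steps)),  # snapshot, detached from later edits
        }
        self.executions.append(record)
        return record


runbook = Runbook("restart-search-indexer", ["Drain traffic", "Restart the indexer"])
record = runbook.execute(incident_id="INC-42")  # hypothetical incident ID

# The runbook is revised after the incident...
runbook.steps.append("Verify index freshness")

# ...but the record still shows what responders actually worked with.
print(record["steps"])
print(runbook.steps)
```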

Runbooks are more than a guide to resolving incidents. They’re a way to collaborate with your team and find the best way to respond. These documents are well-loved and well-worn. With Runbook Documentation, we’re able to keep them up-to-date, monitor usage, and create a team-based approach to crafting and revising.

If you’d like to learn more about runbooks, here are some additional resources:

SRE as Organizational Transformation: Lessons from Activist Organizers

Hannah Culver — Wed, 03 Mar 2021 16:18:16 +0000

By: Chris Hendrix, Failure is Inevitable

In the software industry’s recent past, the biggest disruptive wave was Agile methodologies. While Site Reliability Engineering is still early in its adoption, those of us who experienced the disruptive transformation of Agile see the writing on the wall: SRE will impact everyone.

Any kind of major transformation like this requires a change in culture, a catch-all term for people’s principles and behaviors. As your organization grows, this will extend beyond product and engineering. At some point you also need to convince the key power-holders in your organization to invest in this transformation.

Folks who’ve been successful at managing these multi-year complex transformations point to a piece of invaluable advice: you must treat the transformation as its own project, with business outcomes, executive buy-in, and a project team. And there is an unexpected place to look for learning, strategy, and tactics to achieve this goal: activist organizing.

Activist organizers are in the business of changing minds and behaviors, leading decision-makers and traditional power holders in new directions. Here’s a curated list of their tips and practices that you can use to bolster your company’s transformation efforts.

Spectrum of Allies

Photo courtesy of 350.org

The main principle behind the spectrum of allies is that some people are more aligned with your cause than others. People range from active allies through passive allies, neutrals, and passive opposition to active opposition. There are a few lessons to take away from this concept:

It’s most efficient to try to move someone only “one slice” closer to active allyship at a time.
It’s not worth the energy to try to influence people who are actively opposed to your efforts! Target passive opposition at most.
Tailor your message and your “ask” to where someone is on the spectrum. An active ally can be asked to amplify your efforts within your company. However, you’ll need to pay special attention to how you frame the outcomes of SRE adoption to someone in passive opposition.

People Power / Stay on Message

Executives and other decision-makers are examples of concentrated power. The major alternative to concentrated power is people power, or the power of numbers and organization. People power exists when many people are all organized to make the same request.

Your campaign’s passive and active allies should all be trained on the elevator pitch that answers “What business value will Site Reliability Engineering give us?” Those allies should then repeat that pitch, and any other messaging, in every venue available, amplifying each other until you create a level of support that builds heat on the decision-makers. At some point it will come to a boil and leaders will be forced to address the growing calls for SRE transformation.

Supporting Limbs of Power

Photo courtesy of 350.org

While framing your message appropriately and having intimate one-on-one conversations with those in charge can go a long way to build relationships and influence leaders, you will inevitably encounter someone who holds reservations about SRE adoption.

The traditional view of power pictures a CEO at the top, giving orders to a VP who passes orders on to a director, then a manager, and finally an IC. This is a convenient mental model, but in the world of activism and organizing this view is disheartening. Instead, organizers have reframed the idea of concentrated power as being held up by various pillars of support. This support system can empower leaders to make choices that are beneficial to the organization as a whole, if they can be convinced of the campaign’s merit.

For example, while your SRE transformation effort might be targeting your CTO, a passive opponent, they might have the following pillars of support:

An executive mentor from outside of the company
An SVP who they rely on for guidance
A COO who “executes orders”
An executive assistant who controls their calendar
A thought leader they regularly quote or follow on Twitter
An HR, finance, or legal business partner who holds them accountable

Get creative and you will quickly realize that there are many avenues for influence! You can first try meeting with those pillars of support. Any one of them you can bring into your campaign can have an outsized impact on amplifying your message with your target.

Start Small

Changing an organization doesn’t happen overnight, and while you work on influencing those in power, the best way to drive change is to start with what you can impact.

While SLIs and SLOs often require a more substantial level of buy-in, it is very easy to start running blameless retrospectives after a production incident. You can begin to build the culture of SRE by reflecting on an incident and looking at the failures as systemic instead of individual.

Another practice that works on a team level is writing and using runbooks for common incident responses. Once you’ve shown the repeatability and consistency that runbooks have brought to your team’s process, you can leverage that experience when trying to convince others to adopt the same practice!

Book Clubs are Practical

Even after reading this post you may still feel daunted by the prospect of helping or leading your company to adopt SRE. It can be a long and arduous process but here’s one practical way you can kick it off: start a book club!

Book clubs are a great way to:

Organize a group of people and build community
Learn about a new skill, a new technology, or a new way of working
“Get on message”

Book clubs, especially long-running ones that read multiple books in sequence, provide the seed that can germinate into a much larger effort. Make sure to stay in contact with participants, and utilize a chat or message group to strategize and execute your broader campaigns!

At Blameless, we’ve run book clubs for “Implementing Service Level Objectives” by Alex Hidalgo and “Shape Up” by Basecamp.

One final piece of advice is to lean on other people’s experiences! You aren’t alone in your journey and the people power behind SRE exists beyond the boundaries of your organization.

If you’re interested in learning more, we’ll be hosting a new SRE Thought Leader panel with industry experts who have experienced and helped drive this transformation. They’ve championed SRE adoption in companies like Goldman Sachs, LinkedIn, and Pivotal. Panelists include:

Kurt Andersen, SRE Architect at Blameless
Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs
Tony Hansmann, Former Global CTO at Pivotal Software, Inc.
Chris Hendrix (Host), Staff Software Engineer at Blameless