<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anu-angie</title>
    <description>The latest articles on Forem by Anu-angie (@anuangie).</description>
    <link>https://forem.com/anuangie</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F312493%2Fc835ee65-89c3-47fa-bc0b-ca88c2b3776c.jpeg</url>
      <title>Forem: Anu-angie</title>
      <link>https://forem.com/anuangie</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anuangie"/>
    <language>en</language>
    <item>
      <title>Towards More Effective Incident Postmortems</title>
      <dc:creator>Anu-angie</dc:creator>
      <pubDate>Wed, 03 Jun 2020 12:01:48 +0000</pubDate>
      <link>https://forem.com/squadcast/towards-more-effective-incident-postmortems-9kd</link>
      <guid>https://forem.com/squadcast/towards-more-effective-incident-postmortems-9kd</guid>
      <description>&lt;p&gt;&lt;em&gt;An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As our systems grow in scale and complexity, outages are inevitable, no matter how hard we try to provide uninterrupted services. When an outage occurs, the most important and immediate step is, of course, fixing the underlying issue and keeping the relevant stakeholders and customers informed. Many incidents can be quickly rectified with tools like infrastructure automation, runbooks, feature flags, version control and continuous delivery, and people can be kept in the loop with ChatOps and &lt;a href="https://www.squadcast.com/reliability-orchestration#status-pages" rel="noopener noreferrer"&gt;status pages&lt;/a&gt;. These actions, though beneficial for fixing the situation at hand, do not really help you understand what failed and why. And understanding what failed and why is a crucial step towards preventing similar occurrences going forward.&lt;/p&gt;

&lt;p&gt;This is where incident postmortems come in - the next logical step after any incident is to dissect and analyze the why, the how and the what of the incident. And ideally, this should really be done for &lt;em&gt;every&lt;/em&gt; single incident, not just the high-severity or high-impact ones.&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://www.squadcast.com/reliability-orchestration#insights" rel="noopener noreferrer"&gt;incident postmortem&lt;/a&gt; is a report that records the details of an incident, the impact it has on the service, the team that was assembled to address the event, the immediate steps taken to mitigate the damage, the actions taken to resolve the incident and the lessons learnt that can help the team minimize the impact of future incidents. These lessons can in turn affect how you think about a particular component of your system, or sometimes just how mitigation steps could be done faster in specific cases. That is a big deal, to say the least.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Importance of Incident Postmortems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.&lt;/p&gt;

&lt;p&gt;There are several reasons why incident postmortems are incredibly important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Serves as a documentation tool: It lets team members record the nitty-gritty details of an incident, ensuring that it won’t be forgotten. A well-documented incident becomes invaluable to a team, since it not only includes a description of what happened but also details of the actions that were taken, which serve as a reference point for remediation and mitigation of future incidents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Helps build &lt;a href="https://www.squadcast.com/blog/transparency-in-incident-response" rel="noopener noreferrer"&gt;&lt;em&gt;trust and transparency&lt;/em&gt;&lt;/a&gt; with customers and relevant stakeholders of a particular service or application when posted publicly. This also helps build confidence amongst users that necessary steps are taken to prevent any future disruptions to the services provided&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://landing.google.com/sre/sre-book/chapters/postmortem-culture/" rel="noopener noreferrer"&gt;&lt;em&gt;Instills a culture of learning&lt;/em&gt;&lt;/a&gt;. As rightly said, “The cost of failure is education”. It also helps shift the focus from the immediate now to the future. This is why conducting blameless postmortems becomes crucial. More on being blameless is covered later in this blog.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Serves as an opportunity to gain insights that drive infrastructure improvement: when services and applications fail in new and interesting ways, we learn which areas need attention&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Incident Postmortem - What does it consist of?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Incident postmortems are also called RCAs (Root Cause Analysis) or incident reviews. At Squadcast, we prefer the term Incident Reviews, but to keep this easier to digest, we are going to refer to them by the more popular “Incident Postmortem” for the rest of this article. When it comes to an incident postmortem, there is no one-size-fits-all approach or even a universally accepted standard. The postmortem process varies across organizations, and sometimes even within companies, from casual to highly formal, depending on the size and culture of the teams, the nature of the product and the severity of the incident.&lt;/p&gt;

&lt;p&gt;Regardless of the name and the approach, the end goal remains the same - to keep relevant stakeholders informed and to learn, not only fixing a weakness but making systems more resilient as a whole. The whole incident postmortem process can take considerable time and effort to gather information, and the postmortem meeting (if needed) might occur days, or even weeks, after the actual incident, depending on its severity.&lt;/p&gt;

&lt;p&gt;A typical postmortem process covers the below-outlined aspects, in no particular order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;A high-level outline or summary&lt;/em&gt;: This covers the ‘what’ and ‘why’ of the incident, the severity and business impact on customers or users, people involved in the response process and the resolution of the incident. This is particularly beneficial to managers and application owners who need to communicate details of the incident to the top management and relevant outside stakeholders.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Causes&lt;/em&gt;: This part of the postmortem addresses the more technical and operational aspects of the incident, starting with the causes and triggers, explaining the origins of the failure and highlighting the underlying cause - what made the system break. A popular method to get to the root cause is the &lt;a href="https://open.buffer.com/5-whys-process/" rel="noopener noreferrer"&gt;5 Whys Process&lt;/a&gt;, first made popular by Toyota.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Effects&lt;/em&gt;: After analyzing the causes of the incident in depth, the team is tasked with measuring and analyzing the effects on the business, services and users. This step of the postmortem process also analyzes the extent and severity of the incident - for instance, the impact on the business when a payment service was down on an e-commerce website, affecting its customers’ purchase experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Resolution&lt;/em&gt;: This step starts with a diagnostic dissection into the details of the &lt;a href="https://www.squadcast.com/reliability-orchestration#insights" rel="noopener noreferrer"&gt;Incident Timeline&lt;/a&gt;, covering the time of failure, the time when the incident was recognized and handled, the team involved in the process, and the procedures taken to remediate the problem. This part can also include a review of failed attempts, which can serve as a reference to the team when a similar incident occurs, saving valuable time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Conclusion&lt;/em&gt;: Outlines the key takeaways, recommendations and next steps to ensure prevention of the same or similar incident in the future.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
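&lt;p&gt;As an illustration, the sections above can be assembled into a minimal postmortem skeleton. This is a hypothetical helper for the sake of example, not an official Squadcast or Google template:&lt;/p&gt;

```python
from datetime import datetime

# Sections follow this article's outline: summary, causes, effects,
# resolution and conclusion.
SECTIONS = ["Summary", "Causes", "Effects", "Resolution", "Conclusion"]

def postmortem_skeleton(title: str, date: datetime) -> str:
    """Render a blank Markdown postmortem covering the standard sections."""
    lines = [f"# Postmortem: {title}", f"Date: {date:%Y-%m-%d}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "_TODO_", ""]
    return "\n".join(lines)

print(postmortem_skeleton("Payment service outage", datetime(2020, 6, 3)))
```

&lt;p&gt;Starting every incident from the same skeleton keeps reports consistent and makes gaps (an empty Causes section, say) immediately visible.&lt;/p&gt;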

&lt;h3&gt;
  
  
  &lt;strong&gt;Successful postmortems are blameless&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;“Blameless postmortems are a tenet of SRE culture. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A critical factor in making incident postmortems successful is that they are &lt;a href="https://landing.google.com/sre/sre-book/chapters/postmortem-culture/" rel="noopener noreferrer"&gt;&lt;em&gt;blameless&lt;/em&gt;&lt;/a&gt;. A culture that seeks to point fingers at the person who may have caused an outage through error or omission is unlikely to get truthful answers during a review, negating the intent behind the whole exercise of having an incident postmortem in the first place.&lt;/p&gt;

&lt;p&gt;Through blameless postmortems, the aim is to have a nurturing environment where every “mistake” is seen as an opportunity to strengthen the system. Blameless postmortems shift the focus from allocating blame to investigating the underlying causes and reasons why an individual or team faced an outage, while also emphasizing the effective prevention plans that can be put in place.&lt;/p&gt;

&lt;p&gt;Many teams, including us here at &lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;, have, like &lt;a href="https://landing.google.com/sre/sre-book/chapters/postmortem-culture/" rel="noopener noreferrer"&gt;Google&lt;/a&gt;, adopted a blameless postmortem culture, which paves the way to building resilience in teams and systems.&lt;/p&gt;

&lt;p&gt;Blameless postmortems can be challenging to write, since the postmortem format clearly identifies the actions that led to the incident. However, removing blame from a postmortem gives the team the confidence to escalate issues without fear. The next section outlines the steps that can be taken to conduct effective blameless postmortems.&lt;/p&gt;

&lt;p&gt;To help teams develop a culture around blameless incident postmortem reviews, empower them with an easy, automated way to capture incident information and publish the final report: reusable checklists and templates can make incident postmortem meetings less dreadful. In fact, having an &lt;a href="https://www.squadcast.com/reliability-orchestration#insights" rel="noopener noreferrer"&gt;automated timeline&lt;/a&gt; and templates that are auto-populated with incident metrics and other details as part of your &lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;incident management tool&lt;/a&gt; can make the process more consistent and productive for every incident that occurs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conducting effective Incident Postmortems - The process&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For postmortems to be blameless and effective at reducing recurring incidents, the review process should incentivize teams to identify root causes and fix them. A well-run postmortem allows your team to come together in a less stressful environment to achieve several goals. The exact method can depend on team culture.&lt;/p&gt;

&lt;p&gt;Here are a few best practices that can ensure the effectiveness of postmortems‍:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Start with an incident timeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prior to conducting an effective postmortem meeting, the premise of the meeting should be the timeline of significant activity - chat conversations, incident details and more. You can streamline the entire postmortem process with automated incident timeline building, collaborative editing and actionable insights, and formalize your own postmortem process to make it as easy as possible for your team to respond to issues.&lt;/p&gt;

&lt;p&gt;The goal is to understand all contributing root causes and document the incident for pattern discovery, which allows you to set better context during the postmortem meeting. This step also plays a key role in enacting effective preventative actions that reduce the likelihood or impact of recurrence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Conduct a postmortem meeting with anyone internal to the team who was affected by the incident&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bringing together the people affected by an incident in a structured, collaborative way allows for more cohesive contributions to the postmortem meeting in terms of what they learnt from the incident. This also helps in building trust and resiliency within teams. The formal incident postmortem document that records the details of the incident, along with how the team remedied it, can help teams in handling future incidents.&lt;/p&gt;

&lt;p&gt;At this step, a &lt;a href="https://github.com/dastergon/postmortem-templates" rel="noopener noreferrer"&gt;formal template&lt;/a&gt; can help you record all key details and build consistency across all your incident postmortems.&lt;/p&gt;

&lt;p&gt;At Squadcast, we use our own &lt;a href="https://squadcast.us20.list-manage.com/track/click?u=dd520141295cc7251872fc2c2&amp;amp;id=85fff05337&amp;amp;e=9282141a71" rel="noopener noreferrer"&gt;incident postmortem&lt;/a&gt; feature, which helps build an insightful timeline in a matter of minutes. This is especially useful as automation ensures that you can quickly have a system-generated postmortem for pretty much any incident, big or small. There are also a few predefined postmortem templates available from the likes of Google, Azure and others, and you can create new templates or modify existing ones. What’s more, these are available to download in Markdown and PDF formats!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F5e9fe2657fb48a787287ef60_tKhQcfhI33Mru1LVnTk_Inhjq-2xtTPuWiK0TOror4s_gN73O2nEN3rTzouNuTJAuIu8KCIxa61TFbpibsrOot98bDVEGPZ_YaT3bQXv7R55M1q6Kf7VvACcZM-b7Vh21MHaBuVe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F5e9fe2657fb48a787287ef60_tKhQcfhI33Mru1LVnTk_Inhjq-2xtTPuWiK0TOror4s_gN73O2nEN3rTzouNuTJAuIu8KCIxa61TFbpibsrOot98bDVEGPZ_YaT3bQXv7R55M1q6Kf7VvACcZM-b7Vh21MHaBuVe.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F5e9fe5967fb48a9b8687fe59_screenshot-border.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F5e9fe5967fb48a9b8687fe59_screenshot-border.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F5e9fe2f09b20f9d53c2a7aab_lE45MDZHJ6yVGLfKD8kAPU9XYTTsa9JNY_uCxAINtTtYvxRi1UxgVlB_tBkrSfKcoIPKFCqYFORb4kykknfmentjLsOnzfmcCVe2alIPlg83Ml9WfMvX9iCgEJI17vP2dzWzeFF0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F5e9fe2f09b20f9d53c2a7aab_lE45MDZHJ6yVGLfKD8kAPU9XYTTsa9JNY_uCxAINtTtYvxRi1UxgVlB_tBkrSfKcoIPKFCqYFORb4kykknfmentjLsOnzfmcCVe2alIPlg83Ml9WfMvX9iCgEJI17vP2dzWzeFF0.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Define roles and owners along with having a moderator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another key aspect to keep in mind during a postmortem meeting is to have well-defined roles and owners, along with a moderator who can keep the meeting on track and avoid any hint of a “blamestorming” session. It is helpful to have guidelines for the owners of the postmortem process on how the meetings should be run.&lt;/p&gt;

&lt;p&gt;The owner of the review is tasked with managing the meeting and chronicling the subsequent report. The owner should be someone with sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact. Usually, the owner of the incident review also acts as the moderator, responsible for maintaining order and giving every participant the chance to speak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Determine the urgency of an incident by setting the right thresholds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all incidents are equal. Each incident in an organization should be associated with a &lt;a href="https://www.squadcast.com/blog/optimizing-your-alerts-to-reduce-alert-noise" rel="noopener noreferrer"&gt;measurable severity level&lt;/a&gt; based on the impact it has on the business and customers. Associating incidents with the right severity level can help you prioritize your postmortem process. For instance, Sev 1 or higher incidents definitely require a postmortem, while for less severe incidents, postmortems can be automated with a tool like Squadcast.&lt;/p&gt;

&lt;p&gt;That said, if need be, teams should also be provided with an option to request a postmortem for any incident that doesn't meet the threshold.&lt;/p&gt;
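&lt;p&gt;As a sketch of such a policy (the severity names and thresholds here are illustrative, not Squadcast defaults), a simple mapping can decide when a postmortem is required, while still letting teams request one below the threshold:&lt;/p&gt;

```python
# Hypothetical severity policy: map severity levels to postmortem handling.
POSTMORTEM_POLICY = {
    "SEV1": "mandatory",   # full postmortem with a review meeting
    "SEV2": "mandatory",
    "SEV3": "automated",   # system-generated report; review optional
    "SEV4": "on-request",  # only if someone asks for it
}

def postmortem_required(severity: str, requested: bool = False) -> bool:
    """Return True if a postmortem should be produced for this incident."""
    policy = POSTMORTEM_POLICY.get(severity, "on-request")
    # Teams can always request a postmortem below the threshold.
    return policy in ("mandatory", "automated") or requested

print(postmortem_required("SEV1"))                   # True
print(postmortem_required("SEV4"))                   # False
print(postmortem_required("SEV4", requested=True))   # True
```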

&lt;p&gt;&lt;strong&gt;5. Devil’s in the Details - incident metrics and other key information captured&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Capturing as many details as possible about what happened and what was done during the incident removes ambiguity for teams. Details such as links to tickets, status updates and incident state documents, along with monitoring charts, screenshots and relevant graphics or dashboards, become a powerful data set that captures the fine details of an incident.&lt;/p&gt;

&lt;p&gt;It is also crucial that, along with summarizing key details, important incident-related metrics are captured; these help you associate hard, numeric data with incidents and their impact. Metrics such as &lt;a href="https://www.squadcast.com/blog/how-squadcast-actions-help-you-reduce-mttr" rel="noopener noreferrer"&gt;Mean Time to Resolution (MTTR)&lt;/a&gt;, SLOs, the extent of an SLO breach, error budget consumed, incident severity and minutes of downtime can be considered for postmortem tracking. With consistent measurement of these metrics, you can analyze incident trends over time.&lt;/p&gt;
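&lt;p&gt;As a minimal sketch (with made-up timestamps), MTTR is simply the mean of resolution time minus detection time across a set of incidents:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Illustrative incident data: (detected, resolved) timestamp pairs.
incidents = [
    (datetime(2020, 6, 1, 10, 0), datetime(2020, 6, 1, 10, 45)),  # 45 min
    (datetime(2020, 6, 2, 14, 0), datetime(2020, 6, 2, 15, 30)),  # 90 min
    (datetime(2020, 6, 3, 9, 0),  datetime(2020, 6, 3, 9, 15)),   # 15 min
]

def mttr(incidents) -> timedelta:
    """Mean time to resolution over a list of (detected, resolved) pairs."""
    total = sum(((resolved - detected) for detected, resolved in incidents),
                timedelta())
    return total / len(incidents)

print(mttr(incidents))  # 0:50:00
```

&lt;p&gt;Tracking this number per postmortem, alongside severity and downtime minutes, is what makes trend analysis over time possible.&lt;/p&gt;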

&lt;p&gt;The key to conducting effective incident postmortems that help you improve your team and systems is to have a process and stick to it. And making sure it is effective requires commitment at all levels of the organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Publish and track postmortems promptly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the postmortem review meeting is completed, the final but important step is to publish the postmortem promptly and distribute it as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings along with a link to the full report.&lt;/p&gt;

&lt;p&gt;Google states that “A &lt;a href="https://landing.google.com/sre/workbook/chapters/postmortem-culture/" rel="noopener noreferrer"&gt;prompt postmortem&lt;/a&gt; tends to be more accurate because information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!”&lt;/p&gt;

&lt;p&gt;Regular application of these practices results in better system design, less downtime, and more effective and happier engineers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Related Reading&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you are interested in learning more about how to conduct effective postmortems, there are many resources worth checking out. Here are a few of our suggestions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://landing.google.com/sre/book/chapters/postmortem-culture.html" rel="noopener noreferrer"&gt;Chapter 15 of the SRE book&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/mlafeldt/6e02ea0caeebef1205b47f31c2647966" rel="noopener noreferrer"&gt;An adaptation of the original Google Site Reliability Engineering book template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dastergon/postmortem-templates/tree/master/templates" rel="noopener noreferrer"&gt;A list of four templates hosted on GitHub‍&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://landing.google.com/sre/book/chapters/accelerating-sre-on-call.html#xref_training_disaster-rpg" rel="noopener noreferrer"&gt;Wheel of Misfortune&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal &amp;amp; external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c51758c58939b30a6fd3d73%2F5e9fe9ea9b20f917492aca2f_image2.png"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>incidentmanagement</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>How Squadcast Actions help you reduce MTTR</title>
      <dc:creator>Anu-angie</dc:creator>
      <pubDate>Wed, 15 Jan 2020 03:42:47 +0000</pubDate>
      <link>https://forem.com/anuangie/how-squadcast-actions-help-you-reduce-mttr-4j5n</link>
      <guid>https://forem.com/anuangie/how-squadcast-actions-help-you-reduce-mttr-4j5n</guid>
      <description>&lt;p&gt;Did you land here searching for a way to reduce MTTR as a DevOps/SRE or reliability engineer? If yes, then you are in the right place. If not, you should still read on if you care about the reliability of the system you are building.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V1CIsSlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038b0858b1ac3f3dfda8_ezgif.com-optimize.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V1CIsSlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038b0858b1ac3f3dfda8_ezgif.com-optimize.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MTTR, or Mean Time To Resolve, is a widely used metric in the realm of systems reliability. However, people tend to interpret MTTR differently. A temporary patch to get systems up and running may be considered a resolution in some teams, even if the root cause requires a more long-term fix. Regardless of its different definitions, MTTR is a crucial metric because it's a measure of operational resilience and is closely linked to your uptime. And most importantly, there is a universal need to keep this number down, as it has a direct impact on revenue and customer happiness.&lt;/p&gt;

&lt;p&gt;A recent study conducted by &lt;a href="https://devops.com/real-cost-downtime/"&gt;devops.com&lt;/a&gt; tries to measure the impact of downtime, and the numbers are quite staggering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  For the Fortune 1000, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion.&lt;/li&gt;
&lt;li&gt;  The average hourly cost of an infrastructure failure is $100,000 per hour.&lt;/li&gt;
&lt;li&gt;  The average cost of a critical application failure per hour is $500,000 to $1 million.&lt;/li&gt;
&lt;/ul&gt;
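&lt;p&gt;To put those figures in perspective, here is a back-of-the-envelope calculation using the $100,000-per-hour infrastructure-failure figure quoted above: every minute shaved off MTTR is worth real money.&lt;/p&gt;

```python
# Hourly cost of an infrastructure failure, from the devops.com study above.
HOURLY_COST = 100_000  # USD per hour

def downtime_cost(minutes: float) -> float:
    """Estimated cost of an outage lasting the given number of minutes."""
    return HOURLY_COST * minutes / 60

print(downtime_cost(90))  # 150000.0 - a 90-minute outage
print(downtime_cost(60))  # 100000.0 - one hour
```

&lt;p&gt;At this rate, cutting MTTR by just 30 minutes saves $50,000 per incident.&lt;/p&gt;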

&lt;p&gt;It stands to reason, then, that engineering teams should strive to decrease their overall MTTR. But one of the biggest challenges that DevOps and IT teams face today is the inability to quickly take obvious mitigation actions when an incident has been detected - this, in turn, leads to increased TTR.&lt;/p&gt;

&lt;p&gt;The time taken to detect a problem or an incident depends on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The variety of logs, monitoring tools and other solutions in place&lt;/li&gt;
&lt;li&gt; The efficacy and accessibility of these tools, and&lt;/li&gt;
&lt;li&gt; Dependencies on other teams and systems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once an incident is detected, taking the right actions automatically and immediately is the easiest step to make a sustainable and measurable improvement to your MTTR.&lt;/p&gt;

&lt;p&gt;Now this means not just alerting the right responder on time, but also triggering certain scripts and failsafes based on incident severity and context to minimize end-user impact.&lt;/p&gt;

&lt;p&gt;So, we thought - what if you could get push notification alerts for events and simply swipe those notifications to acknowledge and take basic mitigation actions? No need to get to your laptop, run stuff on the terminal, or log in to a bunch of other tools like CI/CD, infrastructure automation or testing platforms. Sound intriguing? Check out &lt;a href="https://support.squadcast.com/docs/what-are-squadcast-actions"&gt;how it works&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Squadcast Actions
&lt;/h2&gt;

&lt;p&gt;Often, despite the DevOps/SRE/on-call team being alerted immediately about a major incident, and despite them knowing in a matter of seconds what actions need to be taken to minimize end-user impact, it still takes minutes and sometimes hours to recover due to human factors. This is especially true if the incident occurs outside work hours, or the SRE/on-call engineer is away from their computer.&lt;/p&gt;

&lt;p&gt;Needless to say, actual incident resolution can take many hours or even days depending on the triage time, access to key data/information and cooperation from other colleagues. But in cases like this, a quick recovery to a state where the end-user impact is inconsequential should be the only acceptable behavior. Empowering on-call teams to &lt;em&gt;quickly&lt;/em&gt; take obvious and necessary actions can save the day (and, most importantly, avoid those dreadful 3 AM calls!)&lt;/p&gt;

&lt;p&gt;This is why we built &lt;a href="https://support.squadcast.com/docs/what-are-squadcast-actions?utm_source=Squadcast&amp;amp;utm_campaign=3182b33321-EMAIL_CAMPAIGN_5_8_2019_16_24_COPY_01&amp;amp;utm_medium=email&amp;amp;utm_term=0_bee1d96c98-3182b33321-54409373"&gt;Squadcast Actions&lt;/a&gt; - a convenient and practical way to respond to incidents on time. At &lt;a href="https://www.squadcast.com/"&gt;Squadcast&lt;/a&gt;, we obsess about improving the on-call experience and reducing the inherent stress of dealing with incidents.&lt;/p&gt;

&lt;p&gt;Squadcast Actions allow you to take action directly from within the platform. You can take quick actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Acknowledging or resolving an incident&lt;/li&gt;
&lt;li&gt;  Rebuilding a project&lt;/li&gt;
&lt;li&gt;  Rebooting a server&lt;/li&gt;
&lt;li&gt;  Rolling back a feature&lt;/li&gt;
&lt;li&gt;  Running custom scripts and much more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All this with just a tap, thus making it easy to do tasks that are otherwise manual and repetitive. Or in other words - &lt;em&gt;Reducing toil for your team&lt;/em&gt;.&lt;/p&gt;
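&lt;p&gt;Conceptually, predefined actions like these behave like a dispatch table: an action name maps to a callable that operates on the incident, so a single tap (or API call) can trigger it. The sketch below is purely illustrative - Squadcast actions are configured through the platform and its integrations, not through code like this:&lt;/p&gt;

```python
from typing import Callable, Dict

# Hypothetical action registry: names map to callables taking an incident ID.
# The actions and their behavior are made up for illustration.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "acknowledge": lambda incident_id: f"acknowledged {incident_id}",
    "resolve":     lambda incident_id: f"resolved {incident_id}",
    "rollback":    lambda incident_id: f"rolled back release for {incident_id}",
}

def run_action(name: str, incident_id: str) -> str:
    """Dispatch a predefined action by name against an incident."""
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")
    return ACTIONS[name](incident_id)

print(run_action("acknowledge", "INC-42"))  # acknowledged INC-42
```

&lt;p&gt;The point of the pattern is that the responder picks from a fixed, vetted menu of actions rather than improvising commands under pressure.&lt;/p&gt;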

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_DKep7pK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038ac793810de1b79cd0_Screenshot-2019-08-27-at-10.49.07-AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_DKep7pK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038ac793810de1b79cd0_Screenshot-2019-08-27-at-10.49.07-AM.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, one of the actions that you can take is &lt;em&gt;rebuilding CircleCI projects&lt;/em&gt; directly from the incident page by clicking on the More actions button. (Note that in order to do this, the CircleCI integration with Squadcast must first be completed.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--suqj4vXY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038b1cfa69d6a80dbae2_Screenshot-2019-09-01-at-9.54.13-PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--suqj4vXY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038b1cfa69d6a80dbae2_Screenshot-2019-09-01-at-9.54.13-PM.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also see the actions performed listed in chronological order as part of the Incident timeline. The incident timeline is intended to serve as your single source of truth of who did what and when, while the incident was live.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nHQ_q_BA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038a0858b10f1c3dfda7_Screenshot-2019-09-01-at-10.01.09-PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nHQ_q_BA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038a0858b10f1c3dfda7_Screenshot-2019-09-01-at-10.01.09-PM.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident response on the go - Squadcast Actions on Mobile
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--myz_hGbc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038c341f7e825366f5ce_Linkedn2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--myz_hGbc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038c341f7e825366f5ce_Linkedn2.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best part about taking actions is doing it on the go - be it while you are enjoying a scrumptious meal with your colleagues at lunch or during your tiring commute to and from work. Our fully functional native apps on both Android and iOS make it easy to respond to critical incidents with pre-defined actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q6uY8oyP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038b717c096e6ed0f9b1_Untitled-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q6uY8oyP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5e16038b717c096e6ed0f9b1_Untitled-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a quick &lt;a href="https://www.youtube.com/watch?v=IEpSIwgzzwc"&gt;sneak peek&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Effective incident management requires not only sending the right information to the right on-call responders but also enabling your team with the right tools to act swiftly. Combining Squadcast with an existing incident management workflow allows DevOps/SRE professionals to efficiently track, analyze, and resolve incidents.&lt;/p&gt;

&lt;p&gt;Enjoyed this? If you have come this far, you should definitely check out some cool new features that we are currently working on, listed on our &lt;a href="http://bit.ly/squadcast-roadmap"&gt;product roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We love your comments. What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization?&lt;/p&gt;

&lt;p&gt;We would love to hear from you! Leave us a comment or reach out over a DM via &lt;a href="https://twitter.com/squadcasthq?lang=en"&gt;Twitter&lt;/a&gt; and let us know your thoughts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SkXl9MQe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Learn more about &lt;a href="https://www.squadcast.com/"&gt;Squadcast&lt;/a&gt;&lt;/p&gt;

</description>
      <category>incidentmanagement</category>
      <category>devops</category>
      <category>reducemttr</category>
      <category>oncall</category>
    </item>
    <item>
      <title>Transparency in Incident Response</title>
      <dc:creator>Anu-angie</dc:creator>
      <pubDate>Wed, 15 Jan 2020 03:31:07 +0000</pubDate>
      <link>https://forem.com/anuangie/transparency-in-incident-response-44ep</link>
      <guid>https://forem.com/anuangie/transparency-in-incident-response-44ep</guid>
      <description>&lt;h3&gt;
  
  
  An often overlooked bedrock of Site Reliability Engineering (SRE)
&lt;/h3&gt;

&lt;p&gt;When your production systems are hit with a critical issue, you can trust your DevOps team, your sysadmins or your SREs to get the system back on track. This is a no-brainer.&lt;/p&gt;

&lt;p&gt;And in turn, these folks need to be able to trust the rest of the team, be it engineering, customer support or product management, to let them do their jobs. But where does this trust come from? It comes from understanding: the more you understand, the more you can trust. Obscurity, on the other hand, severely impedes understanding.&lt;/p&gt;

&lt;p&gt;This is why transparency matters.&lt;/p&gt;

&lt;p&gt;In most organizations, the pursuit of reliability is often blended with obscurity. There is strict access control, and sometimes this means the relevant people lack access to even basic observability metrics. This leads to increased stress when systems go down, with people pointing fingers and assigning blame to things they may not fully understand.&lt;/p&gt;

&lt;p&gt;As a general rule of thumb, higher transparency not only improves the incident management and response process but, more importantly, also increases trust between team members and gives them a way to calmly figure out what went wrong before fixing it for good.&lt;/p&gt;

&lt;p&gt;In this article, we are going to outline how you can cultivate transparency in your team and benefit from it.&lt;/p&gt;

&lt;h2&gt;Evolution of transparency in tech teams&lt;/h2&gt;

&lt;p&gt;Transparency, although not a traditional goal of tech teams handling incidents, has evolved over the years into an auxiliary objective. The most recent developments enabling transparency are the widespread adoption of incident management and alert notification tools that help you plan, track, and work faster and more efficiently in an ever-changing environment. By offering increased visibility into tasks and who owns them, these tools facilitate better collaboration. And if transparency is treated as one of the primary objectives of teams doing incident response, the productivity gains multiply.&lt;/p&gt;

&lt;h2&gt;Transparency as a primary objective&lt;/h2&gt;

&lt;p&gt;To make transparency a primary objective, it is important to think about what the milestones in your journey would be. You can choose these milestones based on the four progressive levels of transparency we've seen many tech teams use as a reference. These levels are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Level 1 - Engineering Transparency&lt;br&gt;
The first level of transparency is purely internal to the engineering team. Once you have selected the metrics that are most crucial and their target ranges of values, i.e. your Service Level Indicators (SLIs) and &lt;a href="https://support.squadcast.com/docs/slo-dashboard"&gt;Service Level Objectives (SLOs)&lt;/a&gt;, you can share these across the entire engineering team (instead of restricting access to specific folks) through (i) &lt;a href="https://support.squadcast.com/docs/statuspage"&gt;status pages&lt;/a&gt; private to your team, (ii) &lt;a href="https://support.squadcast.com/docs/incident-timeline"&gt;centralized incident timelines&lt;/a&gt; accessible to every developer, and (iii) open incident response documentation such as &lt;a href="https://support.squadcast.com/docs/postmortems"&gt;post-mortems&lt;/a&gt;, &lt;a href="https://support.squadcast.com/docs/squadcast-runbooks"&gt;runbooks&lt;/a&gt; and other best practices. This level of transparency is gated and serves to help the engineering team collaborate better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Level 2 - Organizational Transparency&lt;br&gt;
Taking it a notch higher, you can expose this same information to the entire organization, including product, support and business teams. You can start by setting your SLOs in collaboration with customer-facing teams. The outcome at this level of transparency is increased trust in the engineering team and better communication with external stakeholders like customers, partners, resellers, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Level 3 - Stakeholder Transparency&lt;br&gt;
The third level of transparency is where you expose your incident management practices and your SLOs to all external stakeholders such as your customers, partners, resellers, vendors, or anybody else you're working with. This can be achieved with a public SLO dashboard, public status pages, and open post-mortems. The benefits at this level of transparency are higher customer loyalty and improved brand perception.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Level 4 - Universal Transparency&lt;br&gt;
The final level of transparency is the holy grail where you really bare all. It's where you are public about your metrics not only to existing stakeholders but also future potential stakeholders. This is the level at which many teams tend to live stream their response to outages. Businesses at this level can be very confident about their metrics constantly improving because they are being fully transparent about them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each metric or event, you can choose the level of transparency you want to go with.&lt;/p&gt;

&lt;p&gt;Often we need to iterate on our SLOs before we settle on what works best for particular situations. So, it's crucial that this information is made transparent at least within the engineering team, making it easier to reflect and understand if these SLOs are indeed the right ones. When you are transparent about your SLOs, you also have a better understanding of the dependencies between these SLOs. This further allows you to have better policies around your &lt;a href="https://blog.squadcast.com/managing-technical-risk-effectively-with-error-budgets/"&gt;error budgets&lt;/a&gt;, and have a good understanding of how these SLOs interact with each other.&lt;/p&gt;
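
&lt;p&gt;The arithmetic behind an error budget is simple enough to sketch. The following is an illustrative example only; the 99.9% target and 30-day window are made-up values, not anything prescribed by Squadcast:&lt;/p&gt;

```python
# Illustrative sketch: derive an error budget from an availability SLO.
# The 99.9% target and 30-day window below are example values.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime (in minutes) for the window at the given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO over 30 days allows ~43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, 10.0), 2))
```

&lt;p&gt;Making even this small calculation visible across the team is a concrete step toward the engineering-level transparency described above.&lt;/p&gt;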

&lt;p&gt;That being said, being transparent about your SLOs doesn't mean that everyone gets a say in what your SLOs should be. It just means that you're communicating what is important to people across the organization. There is also a common assumption that things get more complicated if you have to be transparent about SLOs. That's far from the truth: if you want to be transparent, the idea is to make your SLOs simple enough that even non-engineering teams can understand them. Another myth is that transparency slows down processes because everybody needs to understand the SLOs. On the contrary, processes become more streamlined and more effective, because transparency removes blind spots in the metrics you're tracking.&lt;/p&gt;

&lt;h2&gt;Effective incident response is a team effort with the right tool&lt;/h2&gt;

&lt;p&gt;Once the important metrics have been identified and their target levels defined, it is imperative that the collection of these metrics is handled carefully. Different metrics can be monitored with tools like Prometheus and Datadog, which collect and visualize the data. When a metric goes outside its target range of values, these tools generate an alert. Any organization with well-defined SLOs will need multiple monitoring tools to track the underlying metrics, or SLIs. A proper incident management tool centralizes the alerts from these different monitoring tools and does a lot more than just alerting: it allows your team to put a robust incident response plan in place, and helps teams perform retrospectives so that repeat incidents can be resolved faster. A well-designed, dynamic incident management tool can save the day, with the ability to automate a number of incident response activities.&lt;/p&gt;
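
&lt;p&gt;At its core, the alerting step these monitoring tools perform is a comparison of each SLI sample against its target range. Here is a minimal sketch of that logic, with made-up SLI names and thresholds (this is not Prometheus, Datadog, or Squadcast API code):&lt;/p&gt;

```python
# Minimal sketch of what a monitoring tool does under the hood: compare each
# SLI sample against its target range and raise an alert on a breach.
# The SLI names and thresholds here are hypothetical examples.

from typing import NamedTuple

class Sli(NamedTuple):
    name: str
    lower: float   # lowest acceptable value
    upper: float   # highest acceptable value

def check(slis: list[Sli], samples: dict[str, float]) -> list[str]:
    """Return an alert message for every SLI outside its target range."""
    alerts = []
    for sli in slis:
        value = samples.get(sli.name)
        if value is not None and not (sli.lower <= value <= sli.upper):
            alerts.append(f"ALERT {sli.name}={value} outside "
                          f"[{sli.lower}, {sli.upper}]")
    return alerts

slis = [Sli("availability_pct", 99.9, 100.0), Sli("p99_latency_ms", 0.0, 300.0)]
# Availability is within target; p99 latency is over its 300 ms ceiling,
# so exactly one alert is produced.
print(check(slis, {"availability_pct": 99.95, "p99_latency_ms": 420.0}))
```

&lt;p&gt;An incident management platform then sits on top of alerts like these, deduplicating and routing them to the right on-call responders.&lt;/p&gt;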

&lt;p&gt;Each incident will have unique requirements: the data to be verified, recorded and tracked; the runbooks and processes to be followed; the stakeholders to be notified; and the reports to be filed.&lt;/p&gt;

&lt;p&gt;A holistic incident response tool can address all of these. While this level of flexibility allows for individualized workflows based on the type of incident, a well-designed technology solution for incident management can also aid in providing greater transparency for all incident types.&lt;/p&gt;

&lt;p&gt;SLOs are much more effective when the cycle that starts with creating the objectives ends with evaluating them against the SLO breaches that have occurred. Reevaluating service level objectives is a must: take corrective action either by refining the indicators and their target ranges or by making the services more robust. Design your service level objectives keeping in mind that services will fail, because they inevitably will.&lt;/p&gt;

&lt;h2&gt;Implementation of transparency&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.squadcast.com/"&gt;Squadcast&lt;/a&gt; is an incident management tool that's purpose-built for SRE. Its innovative design enables true transparency and minimizes friction in the incident response process. With transparency comes the ability to resolve incidents faster, create and iterate on SLOs and calculate error budgets to implement policies around them. This prevents re-occurrences of similar incidents and allows for faster innovation and enhanced customer satisfaction.&lt;/p&gt;

&lt;p&gt;Squadcast helps you achieve transparency through the following built-in aspects of the platform.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://support.squadcast.com/docs/statuspage"&gt;Status Pages&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transparency is one of the cornerstones of SRE, and status pages help you communicate the status of your services at all times, internally to other teams or externally to your customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://support.squadcast.com/docs/statuspage"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O3cxSuQU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.squadcast.com/content/images/2019/11/Status-Page.png" alt="Status Page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;    ❖  Public Status Page: You can configure your public-facing services and their dependent components and show their status in real-time directly within Squadcast itself. Customers can subscribe to real-time email updates by entering their contact information in the status page.&lt;/p&gt;

&lt;p&gt;    ❖  Private Status Page: You can expose the status of your internal services privately to other internal teams. You can check who is working on an incident if the service is facing issues. You can also page teams responsible for specific services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://support.squadcast.com/docs/slo-dashboard"&gt;Centralized SLO dashboard&lt;/a&gt; and SLI management for services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See all your configured SLOs on a single dashboard and analyze breaches instantly with a quick snapshot of SLIs rolled up for all your services. Squadcast allows you to track Service Level Indicators (SLIs) like uptime, latency, throughput volume, and availability. Set custom thresholds to get notified when they are breached. The SLO dashboard is accessible to all users of Squadcast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://support.squadcast.com/docs/slo-dashboard"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T3g_oUTv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.squadcast.com/content/images/2019/11/Centralized-Dashboard.png" alt="SLO Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Collaboration with transparency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Communicate and collaborate to resolve incidents quickly, irrespective of your location, with the help of:&lt;/p&gt;

&lt;p&gt;    ❖  &lt;a href="https://support.squadcast.com/docs/warroom"&gt;Virtual War Rooms&lt;/a&gt;: Incident-specific war rooms within the Squadcast platform enable real-time collaboration between all responders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://support.squadcast.com/docs/warroom"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PBoGx-xJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.squadcast.com/content/images/2019/11/warrom.png" alt="War Room"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;    ❖  &lt;a href="https://support.squadcast.com/docs/incident-timeline"&gt;Incident Timeline&lt;/a&gt;: All incident response activities are recorded in real time in chronological order. The incident timeline can be downloaded in CSV and PDF formats and can be shared with the rest of your team.&lt;/p&gt;

&lt;p&gt;    ❖  &lt;a href="https://support.squadcast.com/docs/slack"&gt;ChatOps integration&lt;/a&gt;: Bidirectional integration with collaboration tools like Slack allows teams to track incident-specific conversations in the war room. All responses from a Slack channel are reflected in the incident war room and vice versa.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Doing incident management the SRE way helps you develop operational transparency across your organization, and you can choose different levels of transparency for different metrics and events. With a single source of truth for metrics, logs, events, traces, and incident information and response, your team is empowered to quickly access the information they need with sufficient context, and to collaborate to resolve incidents fast.&lt;/p&gt;

&lt;p&gt;With better collaboration and transparency, the overall reliability of your service improves significantly.&lt;/p&gt;

&lt;p&gt;This article was originally a talk at SREcon'19 titled "&lt;a href="https://t.sidekickopen81.com/s1t/c/5/f18dQhb0S7lM8dDMPbW2n0x6l2B9nMJN7t5XWPfhMynW4X9VHb3LjtnlW56dJgC5KxpPj102?te=W3R5hFj4cm2zwW4cP2RX3H3_VSW4fdKyf1S1n1zW3T3qFD3_R5CyW3P0nTB43T4P8W49HR8w1Lw38_W3F8Lby2f75-kW1S1ngM1Q3HqNW3F5nHl1YZrV9W1SrYhM3DKZYXW3GP63L1SpNFdW1Xmmwc1pMWrVW1S4K8t1W-Z-m1X3&amp;amp;si=8000000001616609&amp;amp;pi=2fb5291f-7901-4219-eb29-a0a882639a73"&gt;Transparency - How Much Is Too Much&lt;/a&gt;". Slides are available &lt;a href="https://www.slideshare.net/squadcastHQ/transparency-in-incident-response-206295175"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We love your comments. What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization?&lt;/p&gt;

&lt;p&gt;We would be thrilled to hear from you! Leave us a comment or reach out over a DM via &lt;a href="https://twitter.com/squadcasthq?lang=en"&gt;Twitter&lt;/a&gt; and let us know your thoughts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D3L5Q5Kn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.squadcast.com/content/images/2019/11/image--5--1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>incidentresponse</category>
    </item>
  </channel>
</rss>
