Forem: Madeline Webster

Little Hands, Big Thoughts

Madeline Webster — Tue, 10 Feb 2026 13:32:00 +0000

I have had the pleasure many times over the years of running STEM workshops for different Girl Guide units in my area. Recently I found myself sitting amongst 26 Embers (aged 7-8) debriefing after a challenging activity and it struck me that we were having conversations many teams neglect.

Earlier, I had given each group of girls a set of plastic cups, and a rubber band with strings attached to it. I challenged them to work together to stack the cups in various formations using the rubber band and strings. They were told that they couldn't touch the cups with their hands, but I intentionally left the instructions vague and open to interpretation. Immediately the other adults in the room attempted to step in, thinking the girls needed more help to navigate the challenge, but I assured them that the girls would find their way on their own... and they did.

It was loud while the girls talked over each other, and quiet while the girls wrestled internally with the challenge at hand. There were exasperated sighs, grumbles of frustration, and eventually cheers of joy as they accomplished their task. Throughout it all, I kept a watchful eye and helped the girls learn to navigate their frustration.

Afterwards we all sat together talking about what worked well, and what went wrong. At first the girls wanted to focus on the physical mechanics of the task, like how to right a cup that was tipped over. One girl had put her shoes on her hands to right a fallen cup, and mentioned that her leaders had told her that was against the rules... was it? I had simply said they couldn't touch the cups with their hands. I applauded her creativity and ability to think outside the box!

I pushed the girls to think about the thing they couldn't see, the biggest tool they needed to use during the exercise - communication. What was happening when they were successful? What was happening when they weren't? It turns out that taking turns being the leader, listening to each other's ideas, and accepting that we aren't always right were important themes. Sound familiar? They talked about how to be kind even when frustrated, how they like to be talked to, and what makes them feel heard.

We all know these things are important to effective communication as a team. Nothing I've said is revolutionary. Yet I think it's easy to lose sight of these fundamental principles over time, and start to lead with our egos. Pause, take a breath, and approach each other human to human. If 7 and 8 year olds can do it, so can we! The better we understand each other and how we operate, the easier it becomes to work together as a team.

We also need to set our inner child free more often when ideating. Somewhere along the way, many of us lose the creativity that we had as children and we stop thinking critically about the boundaries enforced on us. We start to think and hear "no" before we've even asked a question - especially girls and young women. Innovation comes from pushing the boundaries of what is, and asking "what if?" instead. How would you think about the problem if you questioned its constraints or worked backwards from "yes"? What would happen if you put your shoes on your hands?

Lean into the discomfort. You'll be surprised what you might be able to learn from it.

Enabling Communication

Madeline Webster — Wed, 29 Jan 2025 15:57:35 +0000

Like many people, I've been reflecting on my experiences over the past year as we kick off 2025. What stuck with me the most was the value of communication.

In tech, we often see individuals rewarded and recognized for their technical capabilities, or a team for achieving an outcome. It is more rare to see people rewarded for their "soft skills", like communication, or leadership. Now, I also don't think they should be called "soft" skills because they are some of the hardest skills to develop... but that's a different conversation.

How to play the game

Have you ever played a game where you didn't know what the rules were, what the goal was, or what your position was? Most games have clear and concise instructions for the rules of play. They tell you what your objective is, and what is within or out of bounds for your position. Of course, you'll have to come up with your own strategy for success, but specific expectations are in place and clearly defined. How many of us could say the same about our jobs?

Too often, teams are being pushed to deliver quickly without really understanding what they're working towards or how the piece they're building fits into the overall puzzle. Priorities are changed quickly and projects scrapped, leaving team members scratching their heads and wondering what they should be focusing on. In order to create a productive environment, everyone contributing to a project should understand the problem they're trying to solve, the constraints they have to work within, and most importantly how solving the problem provides value to their customer and the company. Why would a team be motivated to work hard and deliver quickly, if they don't understand the value or importance of the work?

Many companies that have shifted to using a DevOps mindset have done so without resetting expectations for employees and their roles. In doing so, they've created teams that are confused about their responsibilities or not meeting expectations, which can lead to demotivation, poor product quality, and higher attrition rates.

Communicating with the other players

Many software developers and engineers are in the profession because we like to solve problems, so it should come as no surprise that when we encounter an issue many of us jump right into problem-solving mode... often without pausing to fully understand the problem. In an engineering organization, this can lead to solving the same problem multiple times, in multiple ways, across multiple teams. From the outside looking in, it's easy to point this out and say, "what a waste of resources and time!" but in practice it's much harder to get these groups to communicate. Why?

"Why won't people just talk to each other?!" - Me, every day.

I've given this a lot of thought over the course of the past year, and I noticed some trends in barriers to communication based on my experiences. There are many more reasons why we fail to communicate with each other, but these are the ones that stood out to me this year.

Ownership: figuring out who to talk to is one of the biggest issues I've seen. If teams don't know who owns what, or what other teams are working on, how are they supposed to know who to reach out to or if another team might have already solved their problem?
Functional: often different functions within an organization, like product management, developers, and site reliability engineering (SRE) are managed independently of one another. If their leaders aren't communicating across functions, why should the rest of the employees?
Geographic:
- Workforce: many organizations have employees spread across the globe. Between time zones, and a mix of in-office, hybrid, and fully-remote workers, it's difficult to connect to the right person.
- Culture: people in different countries have different customs and ways of doing business. Across the globe we think, speak, and act differently.
- Language: for the majority of people around the world, English is not their first language. Language should also take into account your word choice, tone, and expression.

Breaking barriers

A wise person once told me:

"It is the responsibility of the person trying to relay the information to communicate it in a way that the recipient understands." - My Mom, probably.

We need to communicate better within our teams, and across our organizations. In order to do that we have to first consider:

Who is the audience? Or, in other words, who are we communicating with?
What are we trying to get them to understand?
Why should they care about what we're telling them?

The answers to these questions can (and should!) impact the way you approach the conversation. If you're trying to explain what you're working on to someone on your team who's knowledgeable in the same areas, you might explain it differently than if you were talking to your CEO, or your non-techie friend. Each person would care about it for a different reason. Your teammate could be working on the project with you, deep in the details. Your CEO could be looking to understand the value this project provides to the company, and your friend might be curious about the new role you took on recently. Using the same approach for all 3 most likely wouldn't yield the results you're looking for.

Many of us forget to take this a step further to preempt some of the barriers we may have in communicating with our audience. Thinking a little deeper about who we're talking to can help significantly.

For example:

Ownership:
- Do you have the right audience for what you're trying to achieve?
Functional:
- What does the audience know already?
- What might they need to know in order to understand your point?
Geographic:
- Where are the people in the audience from?
- What might affect how my tone of voice, facial expressions, or word choice are interpreted?

Bringing intention into 2025

Communication goes beyond words. It is in the way we interact with one another, our posture, our facial expressions. Even the act of not talking to someone says something. We need to think about who we're trying to communicate with, and how we can do so most effectively. We need to consider the barriers in our organizations and how to help our teams overcome them... and we need to get people talking to each other.

After all of this reflection (writing is a great way to do that!), my biggest take-away from the past year is simply this: we should all be more intentional in the way that we communicate with others, and how we set up our organizations to enable communication.

Separating Release from Deployment

Madeline Webster — Thu, 23 Nov 2023 22:20:37 +0000

Release and deployment are important parts of the software development life cycle (SDLC). In fact they're critical because they are responsible for how a piece of software is delivered to its end users. These two steps in the SDLC are often treated as one, coupling together release and deployment. However, just because a code change has been deployed, does not mean it has been released to customers, or is ready to be released to customers.

When working with a DevOps mindset, there is no separate team managing release and deployment. The development team is responsible for their application throughout the entire SDLC - including deployment and release. Many modern teams have automated pipelines that trigger after merging changes into their main code branch. These pipelines deploy their code changes to any number of pre-production and production environments automatically, running tests and checks as they go. When the deployment is successfully completed to production, the changes are immediately available to release to customers. In most cases, they are immediately released to customers without further thought.

What about big features?

One of the main tenets of DevOps is to deliver value to customers quickly and iteratively with shortened feedback loops, as opposed to having long release cycles. However, big features still exist and long-lived development branches are awfully big pains to maintain, code review, and test. The "big bang" style of deploying these large changes also leads to a lot of developer pain, bugs, and incidents.

If deployment and release were separated from each other, the need for big, long-lived development branches would go away. Code changes could be done in smaller, iterative pieces that build on each other. These smaller pieces are easier to test, code review, and deploy. So how can we deploy changes iteratively without releasing?

Decoupling deployment

A simple option to separate deployment from release is using feature flags to turn features or code blocks on and off. When code changes that are not ready for immediate release to customers are merged into main, they can be deployed behind feature flags which would determine whether or not that feature was visible in a given environment. The feature flag could be turned on in test environments to use as the rest of the feature is developed, while it could remain off in production to completely hide the feature from customers and continue running the current software.
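
As a rough illustration of the idea above, here is a minimal sketch of a feature flag check. The `FeatureFlags` class, the `new_checkout` flag, and the environment names are all made up for this example; a real team would more likely use a flag service or configuration store than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store keyed by (environment, flag name)."""

    _flags: dict = field(default_factory=dict)

    def set_flag(self, environment: str, flag: str, enabled: bool) -> None:
        self._flags[(environment, flag)] = enabled

    def is_enabled(self, environment: str, flag: str) -> bool:
        # Unknown flags default to off, so unfinished code stays hidden.
        return self._flags.get((environment, flag), False)


def checkout(flags: FeatureFlags, environment: str) -> str:
    # Both code paths are deployed; the flag decides which one is released.
    if flags.is_enabled(environment, "new_checkout"):
        return "new checkout flow"
    return "current checkout flow"


flags = FeatureFlags()
flags.set_flag("test", "new_checkout", True)  # on while the feature is built
print(checkout(flags, "test"))
print(checkout(flags, "production"))  # never turned on, so customers see no change
```

Defaulting unknown flags to "off" is a deliberate choice in this sketch: forgetting to configure a flag hides the feature rather than releasing it by accident.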

More complex solutions, such as traffic shifting, can also help to decouple release from deployment. With traffic shifting, a small percentage of traffic is directed to the new code path while the majority remains on the current code path. As the new code path proves to be effective, the proportion of traffic sent to it is increased until all traffic is directed through the new code path and the old one is unused. This can be automated using a variety of tools, or even set up as part of the infrastructure as code (IaC) configuration for many resources, like AWS Lambda.
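
To make the traffic shifting idea concrete, here is a small sketch of percentage-based routing. It is not how any particular tool or cloud service implements it; the `bucket` and `choose_path` functions and the request ID format are assumptions made for illustration.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_path(request_id: str, new_path_percent: int) -> str:
    """Send roughly new_path_percent of requests to the new code path."""
    return "new" if bucket(request_id) < new_path_percent else "current"


# Start small, then raise the percentage as the new path proves itself.
for percent in (10, 50, 100):
    routed = sum(
        1 for i in range(1000) if choose_path(f"request-{i}", percent) == "new"
    )
    print(f"{percent}% target: {routed} of 1000 requests on the new path")
```

Hashing the request ID (rather than picking at random) means the same request always lands in the same bucket, so anything routed to the new path at 10% stays there as the percentage grows.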

There are many other approaches a team could take to decouple deploying software from releasing it, and each team will need to find what fits their processes, tooling, and team the best. However, decoupling deployment from release should not introduce manual toil. Wherever possible, codify and automate these processes!

Code Freezes

In the past, companies have implemented code freezes to prevent introducing changes to the codebase for a period of time. This could be before a big release so that the quality assurance (QA) team could perform tests, during holidays when there was little support, or times of the year when their business is sensitive to change (for example, Black Friday for an American e-commerce company).

It's simple to say that a code freeze would go against the very concept of continuous integration and continuous delivery (CI/CD) and the DevOps methodology of continuously delivering customer value. However, it is also simple to understand that deploying code or releasing features without adequate support available is irresponsible. When operating with a DevOps mindset and using automated pipelines that deploy changes as soon as they're merged to main, it is easy to forget that the code is also being released to customers. In this case, a code freeze means preventing teams from merging any code to main and essentially piling up changes to be delivered later on in quick succession, or even using a dreaded long-lived development branch. This brings with it the same issues that "big bang" deployments have.

It also brings in a new set of problems. Top-down organization-wide decisions that impact how a team manages their application, like implementing an organization-wide change freeze, undermine the very essence of autonomy and trust. If teams are operating with a DevOps mindset and a company is truly embracing a culture of team autonomy, then they need to enable teams to make good decisions for the things that they own, and support them when mistakes are made.

Responsible Application Ownership

Instead of implementing an organization-wide code freeze, why not give teams the autonomy, tools, and knowledge to be responsible application owners?

A responsible application owner considers the implications of their actions, and is able to determine the impact of deploying or releasing a code change. They should be able to identify and understand the potential risks associated with a change to their application, and how to mitigate them. Teams should always take into consideration who is available to monitor a deployment or release when either action is taken, and that includes knowing who is available to support customers. For example, instead of merging a code change at 5pm on a Friday when they're about to take off for the weekend, a responsible team might decide to wait until Monday morning. By then they will be there to deal with any fallout from the deployment or release of that change, and customer support engineers and the SRE team will be around to assist should something go wrong.

With ownership comes responsibility. It shouldn't matter what time of the year it is: teams should be empowered to own their application's software development life cycle end to end - and that includes making conscious decisions about deployment and release.

Conclusion

When it comes to deployment and release of an application remember that they are in fact two distinct phases of the software development life cycle. Empowering development teams to own their applications throughout the entire SDLC means enabling them to make conscious choices when it comes to deploying and releasing code changes.

Next time you're about to merge a code change, pause and ask yourself:

"What could happen when this is merged?"
"What is the impact if this is released to customers immediately?"
"Who will support the application if something goes wrong during deployment?"
"How will the team know if this was deployed successfully?"
"How will the team know if this was released successfully?"

Take responsibility, and make conscious decisions when it comes to deploying a code change and when releasing it to your customers.

The Great Testing Transition

Madeline Webster — Wed, 13 Sep 2023 19:39:19 +0000

Blurry lines

As companies transition to operating with a DevOps culture and mindset, individuals may find that their roles expand, change and shift. With DevOps, the traditional approach of handing work over to a quality assurance (QA) or operations (Ops) person is thrown out the window. Silos between these different functions are broken down, and once-clear role definitions become blurry.

One of the key elements of DevOps is autonomy. Teams should be able to deliver on their objectives without depending on other teams to test, deliver, or operate their code. This means that people in traditional Development, QA, or Ops roles may find they need to dive deeper into other areas of the software development lifecycle.

Making a mindset shift

Recently I was approached by an individual who found themself needing to know more about quality assurance and testing strategies. They wanted to learn how to test their microservice and were soliciting advice from various groups and individuals throughout the company.

I was asked:

How do we build confidence (because ultimately this is what tests do, build a developer's confidence) for developers of Microservice A that its dependency, Microservice B, will do what it’s supposed to?

Microservice architectures contain many smaller, independently deployable applications that work together to create a larger application. In order to work together, microservices often have dependencies on other microservices in the ecosystem. With this in mind, their question made a lot of sense. However, their approach to the issue gave me the impression that they were thinking about testing from a monolithic perspective - where the entire system is part of a unit that you can test end to end.

...but microservice architectures aren't like that! Microservice B may be built by another team, or even a third party solution like an API from another company. You can't control Microservice B. Instead, the focus should be on what we can control - our microservice, Microservice A.

Thinking about microservice architectures, the question should instead be:

How do we build confidence in our microservice (Microservice A) such that our solution will react gracefully to the state of Microservice B?

If Microservice B can't do what it needs to do, for whatever reason, how does Microservice A respond? Different from the traditional testing strategy, quality assurance for microservices can be broken into 3 parts:

Testing - ensuring your service does what it is supposed to do and only what it's supposed to do
Monitoring - checking in on metrics for things you know might happen, and alarming when action needs to be taken
Observability - having the tools in place to debug when something you didn't expect to happen, happens.

Testing

Testing an application builds confidence that it will do what it's supposed to do, and only what it's supposed to do when it is deployed to production. There is a lot of terminology when it comes to types of tests, and it's very easy to confuse them. Call them what you'd like but essentially you need a set of tests that validate your application's functionality including the backend code, and the connective infrastructure.

Small tests, often called unit tests, can be helpful for testing your runtime code. They isolate and test only one independent "unit" - generally a method or function. These small tests are typically cheap to build and quick to run, which means they can provide fast feedback to developers when run locally or as part of an automated pipeline.
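
Here is a small sketch of what such tests might look like for the reframed question from earlier: how does Microservice A react when Microservice B can't do its job? Everything here is hypothetical, including the `get_recommendations` function, the `ServiceBUnavailable` exception, and the choice to fall back to an empty list.

```python
import unittest


class ServiceBUnavailable(Exception):
    """Raised by a (hypothetical) client when Microservice B cannot respond."""


def get_recommendations(fetch_from_b) -> list:
    """Microservice A logic: use B's answer, or fall back to an empty list."""
    try:
        return fetch_from_b()
    except ServiceBUnavailable:
        return []


class GetRecommendationsTest(unittest.TestCase):
    def test_returns_b_result_when_b_is_healthy(self):
        self.assertEqual(get_recommendations(lambda: ["a", "b"]), ["a", "b"])

    def test_falls_back_when_b_is_down(self):
        def failing_call():
            raise ServiceBUnavailable()

        self.assertEqual(get_recommendations(failing_call), [])
```

Saved as a test module, this can be run with `python -m unittest`. The tests never touch the real Microservice B; they only check the one thing we control, which is how Microservice A behaves in each case.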

However, many microservices are built in the Cloud and use different pieces of cloud infrastructure (hopefully taking advantage of infrastructure as code). These infrastructure resources need to be configured properly such that they are secure and they can work together to enable your application. This means that testing the runtime code isn't enough. We need to make sure that the microservice can be deployed properly to the cloud, that it runs, and can do what it is supposed to do (and only what it's supposed to do!). While there are ways to mock these resources for local testing, I have found this to be tedious and unhelpful as you still haven't tested that the real resources can work together.

Instead, development workflows should be set up such that every developer can deploy their version of the microservice on-demand, in an environment that's safe for testing, fast iteration, and learning. Every time a developer pushes code to their branch, it should update their deployed version of the application. Rather than messing around with mocked resources, a developer can experiment with real resources in the cloud and see how they really interact with each other to build confidence in the application. Automated tests can also be run against this version of the application as part of the deployment pipeline. This will help reduce manual work, and create consistency in the testing done in pull requests for each change being merged to production.

Tests should also be run post-deployment, to ensure the active application is still working properly in each environment that an update has been deployed to. Your application may be deployed to Alpha, Staging, or other pre-production environments in addition to your customer-facing Production environment. The same automated tests should be run in all environments to ensure consistency, and catch issues early. If automated post-deployment tests fail in your Staging environment the deployment pipeline can stop before rolling out the change to production, preventing your customers from being impacted by the issue.
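
The stop-before-production behaviour described above can be sketched in a few lines. This is only an outline under assumed names: `roll_out`, the `deploy` and `smoke_test` callables, and the environment order stand in for whatever your pipeline tooling actually provides.

```python
from typing import Callable, List, Sequence

# Assumed rollout order, ending with the customer-facing environment.
ENVIRONMENTS = ("alpha", "staging", "production")


def roll_out(
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
    environments: Sequence[str] = ENVIRONMENTS,
) -> List[str]:
    """Deploy to each environment in order; stop at the first failed test.

    Returns the environments that were deployed and passed their tests.
    """
    passed = []
    for env in environments:
        deploy(env)
        if not smoke_test(env):
            break  # later environments, including production, are never touched
        passed.append(env)
    return passed


deployed = []
healthy = roll_out(deployed.append, lambda env: env != "staging")
print("deployed to:", deployed)
print("passed in:", healthy)
```

In this example run the post-deployment test fails in staging, so the change reaches alpha and staging but is never deployed to production.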

Monitoring

Even with a plethora of testing, issues can still arise in production. Quality Assurance doesn't stop when the change is deployed. We need to maintain a high standard of quality throughout the software development lifecycle, which includes while it is actively being used. A team that truly owns the quality of their microservice needs to know when issues happen and be prepared to respond to them. With alarms to monitor their application in place, teams will be notified when something they predicted might happen actually happens.

Monitoring doesn't just magically get set up for an application, though. Teams have to put in the time to think about possible issues that might arise with the application, create indicators for those issues, and plan responses to those indicators. This generally includes things like creating metrics or service level indicators to monitor specific situations such as the duration of an API call, or the number of errors and invocations for a Lambda. With a metric in place, there needs to be an alarm to notify someone when that metric enters an undesirable state and action needs to be taken - perhaps a sustained spike in the API call duration, or a drop in availability of the application. The person responding to the alarm needs to know what's going on and how to respond to it. This means that alarms should contain information such as:

Details about the metric (what's going on?)
Details about the environment such as AWS account, region, etc. (where's the issue happening?)
Details or links to observability tools like logs, metrics, dashboard, etc. (where can I find more information to help debug?)
Details or link to a runbook that describes the procedure to follow for managing the incident (what action should I take?)
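
One way to keep those four pieces of information together is to treat them as required fields of the alarm itself. The sketch below is a plain data structure, not the format of any real alerting tool; the field names and the example.com links are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmNotification:
    metric: str         # what's going on?
    environment: str    # where's the issue happening? (account, region, etc.)
    dashboard_url: str  # where can I find more information to help debug?
    runbook_url: str    # what action should I take?

    def is_actionable(self) -> bool:
        """An alarm without a runbook gives the responder nothing to act on."""
        return bool(self.runbook_url.strip())

    def render(self) -> str:
        return (
            f"ALARM: {self.metric}\n"
            f"Where: {self.environment}\n"
            f"Debug: {self.dashboard_url}\n"
            f"Runbook: {self.runbook_url}"
        )


alarm = AlarmNotification(
    metric="API call duration above 2s for 15 minutes",
    environment="production / us-east-1",
    dashboard_url="https://example.com/dashboards/api",
    runbook_url="https://example.com/runbooks/api-latency",
)
print(alarm.render())
```

Making every field required means an alarm can't be created without someone at least thinking about where to debug and what to do.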

Remember that alarms can wake people up in the middle of the night, so they need to be actionable. If you're having trouble writing the runbook describing what actions to take when an alarm goes off, it might indicate that the alarm is unnecessary and the metric is something you want to passively monitor or check in on from time to time instead. Alarm thresholds may also need tweaking - I've never got them 100% right the first time personally, so don't worry if they aren't perfect, but do listen to feedback and adjust as you go.
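
One common way to make an alarm less noisy is to require the metric to stay above its threshold for several consecutive periods, so a single blip doesn't page anyone. Here is a minimal sketch of that check; the `should_alarm` function, the 500 ms threshold, and the sample datapoints are invented for illustration.

```python
from typing import Sequence


def should_alarm(datapoints: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True if the most recent `periods` datapoints all exceed the threshold."""
    if periods <= 0 or len(datapoints) < periods:
        return False
    return all(value > threshold for value in datapoints[-periods:])


one_blip = [120, 900, 130, 125]         # one slow call, then back to normal
sustained_spike = [120, 900, 950, 980]  # slow for three periods in a row

print(should_alarm(one_blip, threshold=500, periods=3))
print(should_alarm(sustained_spike, threshold=500, periods=3))
```

Both `threshold` and `periods` are exactly the kind of values you'd expect to adjust over time as you learn what normal looks like for your application.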

Observability

Observability comes in when you need to investigate something you didn't anticipate happening. Observability uses tools like logs, metrics, dashboards, and traces - anything that can provide more insight into how the system is behaving. When an incident occurs, you need visibility in order to debug the issue and find out what's contributing to it. The more visibility you build into the system prior to an incident, the easier it will be to debug.

There is a balance to be struck here, so be sure to think about permissions and who should have access to specific information. Personally Identifiable Information (PII) and customer data should be handled with care and not be exposed in logs, traces, or other observability tools.
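
As one small example of handling this with care, a logging filter can scrub values before they are ever written. The sketch below uses Python's standard `logging` module and only catches things that look like email addresses; the `RedactEmails` name and the pattern are assumptions, and real PII covers far more than this.

```python
import logging
import re

# Deliberately simple pattern: it will not catch every possible address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactEmails(logging.Filter):
    """Replace anything that looks like an email address before it is logged."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_PATTERN.sub("[REDACTED]", record.getMessage())
        record.args = None  # the message is already fully formatted
        return True


handler = logging.StreamHandler()
handler.addFilter(RedactEmails())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the unredacted record away from other handlers

logger.info("Password reset requested for %s", "jane.doe@example.com")
```

Attaching the filter to the handler means every record passing through that handler is scrubbed, whichever part of the code logged it.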

Automation

Remember how we talked about shifting roles and team autonomy? When transitioning to working with a DevOps mindset and full team ownership for microservices, automation will become your best friend. Taking advantage of Infrastructure as Code (IaC) will help your team provision and maintain the infrastructure with less guesswork. (Even your dashboards and alarms can be created as code!) Utilizing automated deployment pipelines for deploying your development branches and production code will ensure consistency in the deployment process, the level of testing and quality. Automating your tests frees developers to add value to the product in other ways... I can keep going on, but I think you get the point.

Conclusion

It can feel like a lot to take on as your team transitions to a DevOps mindset and owning their application completely - and it is! Quality assurance is no longer handled by another team. It is up to your team to ensure that your application is tested and operating properly at all times. Try to take it one step at a time. Each iterative improvement you make will get your application closer to your quality and operational goals!

Making Work Visible

Madeline Webster — Mon, 15 May 2023 15:10:08 +0000

Ever wondered why a work item that seems small takes so long to complete? Often the work itself doesn't, but there are competing priorities, or too much work in progress for anything to actually get done. To really understand where your time sinks are, you first need to #makeWorkVisible!

A few years ago I read Making Work Visible: Exposing Time Theft to Optimize Work & Flow by Dominica DeGrandis. When I picked up the book, I didn't realize the impact it would have on my teams and the way we work. If you're looking to improve flow and expose your time thieves, I highly recommend reading a copy! Meanwhile, here's what I've learned from trying to apply DeGrandis' thinking with the teams I've worked with over the past 3 or 4 years.

Celebrate the small stuff

Technical debt is a huge problem which we all face. In my estimation, it's the sneakiest of DeGrandis' time thieves. How many times have you caught yourself thinking, "Oh I might as well fix that too while I'm in here," then gone down a rabbit hole like Alice in Wonderland? Suddenly your day is gone, and you've fixed a few other things but haven't accomplished your original task for the day. While those other fixes might have been good and necessary, according to your Kanban board you haven't accomplished anything!

We've all said to ourselves: "But it's such a simple task. It would take more time to make a ticket than it would to do the work!" So often a small or simple task snowballs into something larger. Or, we're asked to context switch and do something else. In time this can make us feel overloaded and under-appreciated, and cause us to burn out quite quickly.

Making work visible allows others to see what you're working on, which helps them make priority calls when another request comes in. It also enables your teammates to pick up work you might have had to abandon during a context switch. For many people, completing small incremental tasks can be much more rewarding than being stuck on one large task for many weeks. Celebrate the "small" stuff! Some of those small tasks are the biggest wins!

Save the context

I used to be in the habit of writing down "to do's" on sticky notes so I didn't lose track of them. If only there was a system for that! When I went to make tickets for all of my sticky notes I found that some had already been done (yay!), while others I couldn't remember writing down or didn't think were as important anymore. Now, when I come across a piece of technical debt or I have a bright idea, I pause to take a step back and look at the larger picture.

Do we need to do this right now?
Do I need to be the one to do this?

Most of the time the answer to both of those questions is "no" and creating a ticket takes just as long as finding my stack of sticky notes and writing down my idea. Usually, there's a lot more context in a ticket which can help your team remember what it was and why you thought it was important. You can only fit so much on a sticky note! Later, all that context can help your team decide if a task is worth doing.

Create a blameless culture

"Oops! Maybe I can fix this before anyone else notices!"

To put it simply, creating a blameless culture enables teams to work together to solve problems more effectively without fearing failure. Without blame, teammates are more inclined to seek help, ask questions, and discover diverse solutions to their problems.

Making work visible requires just that – for the work to be visible! If teammates are fearful or worried that they might be blamed for an issue, they may be more inclined to fix it quietly. Instead, celebrate mistakes and learn from them as a team! My teams started talking about our "oops" moments at the end of each week. We share our mistakes, large and small. We celebrate them, and foster a culture of trust and collaboration.

Label your work

DeGrandis identifies different types of work in her book to help identify time thieves. We took a very literal approach to this concept and started tagging each of our tickets with one of "Debt," "Risk," "Feature," "Defect," or "Incident." We used these tags to create metrics and see where the team was spending most of its time.

We were able to see when the whole team dropped everything to work on an incident, or when we started a new project. Best of all, we could see how much work in progress we had over time. This allowed us to identify thresholds above which we were overloaded, and below which we were underutilized. Eventually, we even added metrics to see how thinly the team was spread across different projects.
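
For anyone curious how tags can turn into numbers, here is a rough sketch of the two simplest metrics: tickets per work type, and work in progress on a given day. The `Ticket` fields and the sample data are invented; in practice this information would come from your ticketing tool's export or API.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional

WORK_TYPES = {"Debt", "Risk", "Feature", "Defect", "Incident"}


@dataclass
class Ticket:
    work_type: str
    started: date
    finished: Optional[date] = None  # None means still in progress

    def __post_init__(self) -> None:
        if self.work_type not in WORK_TYPES:
            raise ValueError(f"unknown work type: {self.work_type}")


def count_by_type(tickets: Iterable[Ticket]) -> Counter:
    """Number of tickets carrying each work type tag."""
    return Counter(t.work_type for t in tickets)


def work_in_progress(tickets: Iterable[Ticket], day: date) -> int:
    """Count tickets that were open at some point on `day`."""
    return sum(
        1
        for t in tickets
        if t.started <= day and (t.finished is None or t.finished >= day)
    )


tickets = [
    Ticket("Feature", date(2023, 5, 1), date(2023, 5, 3)),
    Ticket("Debt", date(2023, 5, 2)),
    Ticket("Incident", date(2023, 5, 2), date(2023, 5, 2)),
    Ticket("Feature", date(2023, 5, 4)),
]
print(count_by_type(tickets))
print(work_in_progress(tickets, date(2023, 5, 2)))
```

Plotting `work_in_progress` for each day over a few months is one way to spot the point above which a team starts to feel overloaded.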

Having these metrics helped us to correlate how we were feeling and the work we were doing. It also gave us data to use when other teams and managers came with requests for features or other work. We were not only able to identify if the team had capacity to take on more work, but also how long a request might have to wait in the queue. With this data we could talk about priorities and determine the best course of action.

Communicate with your team

Having metrics allowed us to set expectations about the priority of tasks, but it didn't help when it came to navigating why some tasks were taking longer than expected.

Our team tried posting in Slack to communicate our daily goals. After a while, we realized that we had fallen into a rut of saying "did X, plan to do Y" and not really sharing what we were blocked on or what we needed help with, which was the whole point of the post! Switching up the questions we asked helped us re-focus and made these posts way more valuable! Some of the questions we tried were:

What would make today a success?
What are you currently blocked by?
How could the team help you achieve your goals?

We also discovered that setting our status in Slack was a great way to communicate when we were heads-down working, poring over a code review, or in meetings. Most of all, setting our status enabled us to communicate when we'd be available again to reply to messages, so people weren't waiting for an immediate response from someone who was out grabbing a coffee.

Sharing what you're working on, where you might need help, and setting expectations for when you'll be available to help others has become super important to the way we work, especially when collaborating with remote colleagues!

Be open to continuous learning

Over the course of the past few years my teams have tried many different things to find what works best for us. As our teams, projects and goals changed, so did our ways of working. Be open to continuous improvement and learning as you go.

"Learning is not compulsory... neither is survival." – W. Edwards Deming

Owning Operations

Madeline Webster — Mon, 06 Mar 2023 22:54:23 +0000

The phrase "you build it, you run it" has become a mantra for many as we build more microservice architectures, and switch to building with a DevOps mindset.

Moving from monolithic applications to a distributed set of microservices comes with many challenges, one of which is operations. Maybe you have (or had) an operations team that was responsible for operating and maintaining your monolithic application in production. What happens when that monolith is broken down into hundreds of microservices, all operating independently? How is one team supposed to operate all of those services?

What does "You run it" really mean?

Now that we're adjusting to building microservices, we also need to adjust to thinking with a DevOps mindset in our development teams. Running an application requires more than just deploying it successfully to Production. We need to put the "Ops" in "DevOps" and take ownership when it comes to operating our services.

Running or operating a service includes many things. Among them are:

Responding to incidents (we'll save this one for another day)
Monitoring the service
Ensuring the service is meeting its service level objectives (SLOs)
Conducting operational reviews
Managing disaster recovery

How can we prepare to take on these responsibilities within our teams as we learn to think with a DevOps mindset?

We need to learn about operations, then automate and document how to operate our microservices. Your team may not own your service forever, and there could be turnover or new hires, so how to operate your service needs to be documented in a way that's easy for those outside of your immediate team to understand.

A great way to start is by creating an Operations Guide.

What is an Operations Guide?

An Operations Guide should contain everything a person would need to know to operate your microservice. Let's dive into that a bit further.

Note: The questions and methods below are reflective of what I have experimented with, and what has worked well for the teams that I have worked with. They are not exhaustive. I encourage you to take them as inspiration, and add your own twists to create an Operations Guide that is applicable to your team, organization, and architecture.

Deployment

This section should contain everything someone needs to know about where and how the service is deployed. Having it contained in a single, easy to reference place is helpful when responding to an incident so that you don't have to parse different files or lots of text to find the information you need. Consider using tables to make data easy to read.

Consider the following questions:

How is the service deployed?
Where is it deployed to?
If the service is deployed in the Cloud, which regions is it deployed to?
What environments is it deployed to? (Ex. Staging, Production)

Contact info

Anyone who is operating the service will need to know how to escalate issues to the team or community supporting the service.

Consider the following questions:

Where can the team/community be found?
What is the best way to contact the team/community for a general inquiry?
What is the best way to contact the team/community during an incident?
What are the team/community's time zone and hours of operation?
Who should they escalate to if the team/community doesn't respond in a timely manner?

Service levels

Service Level Objective (SLO): a target value or range of values for a service level that is measured by an SLI.

Service Level Indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided.

Service Level Agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.

📚 The above definitions are quoted from Google's SRE Book, chapter 4 - Service Level Objectives. Check out the resources section below for recommended reading.

Your microservice may not have a customer-facing service level agreement (SLA), though the product it is a part of might. Either way, we should always be striving to set and maintain a high level of service in all of our microservices.

Think about your service...

What would indicate that it is highly available?
What should the response time of your service be?
Does availability look different across different environments?

Your answers to these questions are examples of Service Level Objectives (SLOs). Next, how can we measure and understand if our service is meeting those objectives? Those metrics and measurements are the Service Level Indicators (SLIs). Every SLO must have an associated SLI to indicate how the objective will be measured. Different environments might have different SLOs - for example, Development may be slightly less stable than Production so it might have a slightly lower SLO for availability.
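The relationship between an SLO and its SLI can be sketched in a few lines of Python. This is only an illustration: the request-based availability measure, the target values, and the environment names are assumptions, not recommendations for your service.

```python
# Hypothetical availability targets (SLOs) per environment.
SLO_TARGETS = {"production": 0.999, "development": 0.99}


def availability_sli(successful_requests, total_requests):
    """SLI: fraction of requests served successfully in a measurement window."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return successful_requests / total_requests


def meets_slo(sli, environment, targets=SLO_TARGETS):
    """Compare a measured SLI against the SLO target for an environment."""
    return sli >= targets[environment]


if __name__ == "__main__":
    sli = availability_sli(successful_requests=9985, total_requests=10000)
    for env in SLO_TARGETS:
        print(env, sli, meets_slo(sli, env))
```

In this made-up window the same measurement meets the lower Development target but misses the stricter Production one, which is why each environment needs its own explicit objective.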

What SLOs and SLIs would be meaningful to track for your service? How will you track them?

Additional resources

Observability and monitoring

Now that we have service level objectives and can measure them with service level indicators, how do we know if we're meeting them? In fact, how do we know our service is operational at all?

This is where Observability and Monitoring come in.

Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.

Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.

📚 The above definitions are quoted from DevOps measurement: Monitoring and observability.

Essentially, observability includes things such as logs and traces. These tools give you the ability to debug and work through unknowns (things we aren't aware of) when they appear. Monitoring on the other hand helps you manage knowns (things we are aware of), and would include things like dashboards, pre-defined metrics, and alarms.

Consider the following questions:

What is your observability strategy?
What tools are incorporated for observability?
What is your monitoring strategy?
What types of dashboards are available for monitoring the service?
- Where are the dashboards?
- What sort of information do they contain?
- How should the information on the dashboards be used?
What alarms exist?
- Who do they notify? (during and outside of working hours)
- Where are the playbooks for responding to these alarms? (tip - put them in the repo with the service as markdown docs!)
- How does someone know which playbook to use when responding to an alarm? (Ex. Is a link included in the alarm description?)
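One lightweight way to keep the answers to the alarm questions honest is to describe your alarms as data and check that each one points at a playbook. The sketch below is a minimal Python example; the `Alarm` fields, alarm names, contacts, and file paths are all hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    name: str
    notifies: str           # who is notified when the alarm fires
    playbook_url: str = ""  # link to the playbook, ideally shown in the alarm description


def alarms_missing_playbook(alarms):
    """Return the names of alarms that do not link to a playbook."""
    return [a.name for a in alarms if not a.playbook_url.strip()]


# Hypothetical alarm inventory for a single service.
ALARMS = [
    Alarm("high-error-rate", "on-call engineer", "docs/playbooks/high-error-rate.md"),
    Alarm("queue-backlog", "team channel"),
]

if __name__ == "__main__":
    print(alarms_missing_playbook(ALARMS))
```

A check like this could run alongside your other tests so that a new alarm without a playbook is noticed before an incident, not during one.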

Operational reviews

Once we have observability, monitoring and alarms in place, how do we know that our service is operating as we intend it to, and only as we intend it to?

Monitoring lets us know when things that we expect might happen actually occur. When our service starts behaving outside of what we know to be normal we should get alerts. What about when our service is behaving, or being used, in a way that we didn't predict? Observability allows us to see things that we might not have anticipated or predicted... but we must be looking in order to see them!

This is why it's important to have regularly scheduled operational reviews. We need to check in on the service even when nothing has alerted us that something has gone wrong. We need to analyze data to understand how the service is behaving, and how our consumers are using it.

Consider the following questions:

How often will an operational review be run?
Who will lead the operational reviews?
Who will be involved in your operational reviews?
What will you look at during your operational reviews?
What questions might you ask during your operational reviews?

Disaster recovery (DR)

Disasters can strike at any time, and can take on many forms. We need to be prepared and know how to respond when a disaster strikes - such as an AWS service becoming unresponsive in a region, a server going down, or an account being compromised in a security breach.

Consider the following questions:

What else is running where your service operates?
What sort of redundancies does your service have?
What other services depend on this service being available?
What services does this service depend on in order to operate?
What else might be impacted if your account is compromised?
What happens if the service stops working, or is compromised in a single region in production?
What happens if production is compromised?
How have you mitigated the impact on customers or other teams, should something happen to part or all of this service?
How would you redeploy the service to a new AWS account if needed?
How would you redirect traffic from the affected service to a version that's working?
What does your deployment strategy look like? (Ex. blue/green deployments, rolling updates, etc.)
Where are the runbooks located to handle these disaster scenarios?
- How do you know the runbooks work?
- How often will you practice recovering from a disaster scenario?
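The questions about dependencies can be explored with a small dependency map. The Python sketch below walks a hypothetical map of services (the service names and the `DEPENDS_ON` structure are invented for illustration) to find everything that would be affected, directly or indirectly, if one service became unavailable.

```python
from collections import deque

# Hypothetical map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": ["auth"],
    "auth": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can look up who depends on a given service.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("auth", DEPENDS_ON)))
```

Even a rough map like this can help decide which disaster scenarios deserve a runbook first.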

Tips and tricks

There is a lot to think about when writing an Operations Guide, and defining operating procedures for your service. Take it one step at a time - start small and build up!
Don't worry about getting things perfect the first time. As unknowns become knowns, iterate and improve upon your service.
Be clear and concise. Provide information in a way that's easy to read, and straight to the point. Make use of tables, bullet points, and other formatting to help.
Ask for feedback from your team; everyone should be aligned when operating a service.

What have we missed?

After writing your Operations Guide take a step back and ask your team, "what have we missed?" Reflect and align as a team. Try to picture yourselves while debugging a production outage. What else might you need to know?

There is always room for improvement, and as I mentioned earlier, this guide is not exhaustive. So now I ask you...

What have I missed?

Let me know in the comments!