<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Derek Berger</title>
    <description>The latest articles on Forem by Derek Berger (@derekberger).</description>
    <link>https://forem.com/derekberger</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1198543%2Fc921e844-6606-447c-b26b-0bba2391fe1c.jpeg</url>
      <title>Forem: Derek Berger</title>
      <link>https://forem.com/derekberger</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/derekberger"/>
    <language>en</language>
    <item>
      <title>Swapping out microservices gracefully with the help of AWS</title>
      <dc:creator>Derek Berger</dc:creator>
      <pubDate>Mon, 07 Oct 2024 15:10:58 +0000</pubDate>
      <link>https://forem.com/derekberger/swapping-out-microservices-gracefully-with-the-help-of-aws-4n7d</link>
      <guid>https://forem.com/derekberger/swapping-out-microservices-gracefully-with-the-help-of-aws-4n7d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html" rel="noopener noreferrer"&gt;AWS load balancer controller&lt;/a&gt; is a key enabler for running services in &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html" rel="noopener noreferrer"&gt;Amazon EKS&lt;/a&gt;, using AWS APIs to provision load balancer resources.  But this controller can help with more than just everyday management of load balancers.  For instance, it greatly simplified how my team released APIs during a major project to rewrite our services in a new language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Our application follows a &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.6/how-it-works/" rel="noopener noreferrer"&gt;common pattern&lt;/a&gt; for running microservices in EKS.  Outside requests come into our clusters through application load balancers (ALBs).  The ALBs’ &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html" rel="noopener noreferrer"&gt;target groups&lt;/a&gt; forward requests according to &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html#path-conditions" rel="noopener noreferrer"&gt;path-based rules&lt;/a&gt; that correspond to services’ endpoints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g0gxjh0nnuqpccxooo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g0gxjh0nnuqpccxooo.png" alt="Image description" width="502" height="336"&gt;&lt;/a&gt;ALBs fronting EKS services&lt;/p&gt;

&lt;p&gt;The load balancer controller manages our ALBs based on Ingress resources defined in services’ Helm manifests.  We keep these manifests in our version control system, and deploy them through pull requests. &lt;/p&gt;

&lt;p&gt;Here’s an abbreviated example from one of our services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/group.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/subnets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subnet-a,subnet-b,subnet-c'&lt;/span&gt;
&lt;span class="na"&gt;alb.ingress.kubernetes.io/healthcheck-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/help-i-am-alive'&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/success-codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200,404'&lt;/span&gt;
&lt;span class="na"&gt;alb.ingress.kubernetes.io/target-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ip'&lt;/span&gt;
&lt;span class="na"&gt;alb.ingress.kubernetes.io/backend-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTPS'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an Ingress is deployed, the controller provisions the ALB, applies the path-based rules, and creates the target group that points to the service’s pods. It handles additional behaviors for  &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.5/guide/ingress/annotations/#certificate-arn" rel="noopener noreferrer"&gt;certificates&lt;/a&gt;, &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.5/guide/ingress/annotations/#load-balancer-attributes" rel="noopener noreferrer"&gt;ELB access logs&lt;/a&gt;, &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.5/guide/ingress/annotations/#healthcheck-path" rel="noopener noreferrer"&gt;health checks&lt;/a&gt; and more through annotations.  If we deploy updates to an ingress, the controller keeps the ALB in sync with its definition.&lt;/p&gt;
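
&lt;p&gt;For orientation, the Helm values above ultimately render to a standard Kubernetes Ingress resource.  Here is a minimal sketch of what the controller consumes; the resource name, backend Service name, and port are illustrative assumptions, not our actual manifest:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: login                        # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/group.name: login
    alb.ingress.kubernetes.io/target-type: 'ip'
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /api/login/
            pathType: Prefix
            backend:
              service:
                name: login-service  # assumed backend Service
                port:
                  number: 443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;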

&lt;h2&gt;
  
  
  In with the new (but not quite out with the old)
&lt;/h2&gt;

&lt;p&gt;During the project to rewrite our microservices, we continued to define service and ingress resources in Helm manifests.  A new challenge was running old and new services side by side while we incrementally rewrote and released individual APIs.  We wanted requests for rewritten APIs forwarded to the new service, while requests for all other APIs continued to flow to its older counterpart.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.7/guide/ingress/annotations/#ingressgroup" rel="noopener noreferrer"&gt;Ingress Group&lt;/a&gt; feature made this possible in part by consolidating old and new Ingress resources under the same ALB with the original &lt;code&gt;group.name&lt;/code&gt; annotation.  When the team released an API, we just added a &lt;code&gt;pathType: Exact&lt;/code&gt; rule for that endpoint and deployed its ingress.&lt;/p&gt;

&lt;p&gt;Here is an excerpt from a new service’s ingress, with some &lt;code&gt;pathType: Exact&lt;/code&gt; path-based rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
 &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;alb.ingress.kubernetes.io/group.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/login/path1'&lt;/span&gt;
   &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exact&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/login/path2'&lt;/span&gt;
   &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exact&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here again is the original service’s ingress, which has a single &lt;code&gt;pathType: Prefix&lt;/code&gt; rule, catching anything that does not match the new service’s path-based rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
 &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;alb.ingress.kubernetes.io/group.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/login/'&lt;/span&gt;
   &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because we defined both ingresses with &lt;code&gt;alb.ingress.kubernetes.io/group.name: login&lt;/code&gt;, the controller would apply both sets of rules to the original ALB, letting the new service steal requests from the original service, or so we hoped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not so fast
&lt;/h2&gt;

&lt;p&gt;The problem was that the &lt;code&gt;pathType: Prefix&lt;/code&gt; rule would match &lt;em&gt;every&lt;/em&gt; request to &lt;code&gt;/api/login/&lt;/code&gt;, including &lt;code&gt;/api/login/path1&lt;/code&gt; and &lt;code&gt;/api/login/path2&lt;/code&gt;. We had no guarantee that requests for those would be forwarded to the new service.&lt;/p&gt;

&lt;p&gt;To solve this, we could have just replaced the &lt;code&gt;Prefix&lt;/code&gt; path with &lt;code&gt;Exact&lt;/code&gt; paths for all the APIs we still wanted forwarded to the old service.  That would have spared us from creating a new ALB, but it would have added complexity and friction to our releases, requiring changes to two ingresses with every release.&lt;/p&gt;
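
&lt;p&gt;That rejected approach would have meant enumerating every remaining endpoint on the old service’s ingress, along these lines (a sketch; the path names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;paths:
  - path: '/api/login/path3'   # every API still served by the old service,
    pathType: Exact            # updated on every release
  - path: '/api/login/path4'
    pathType: Exact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;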

&lt;h2&gt;
  
  
  Help from AWS
&lt;/h2&gt;

&lt;p&gt;We found a more elegant solution in a subtle but powerful controller feature called &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.7/guide/ingress/annotations/#group.order" rel="noopener noreferrer"&gt;group.order&lt;/a&gt;.  By assigning a smaller order number to the new service’s ingress, we ensured the controller would match its path rules first.&lt;/p&gt;

&lt;p&gt;Here's the new service's Ingress again, now with &lt;code&gt;alb.ingress.kubernetes.io/group.order&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
 &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;alb.ingress.kubernetes.io/group.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login&lt;/span&gt;
   &lt;span class="na"&gt;alb.ingress.kubernetes.io/group.order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/login/rewritten-path1'&lt;/span&gt;
   &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exact&lt;/span&gt;
 &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/login/rewritten/path2'&lt;/span&gt;
   &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exact&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, we could set a higher &lt;code&gt;group.order&lt;/code&gt; value for the original Ingress and leave it alone until every endpoint had been rewritten. Then we simply replaced all the &lt;code&gt;pathType: Exact&lt;/code&gt; rules in the new service’s manifest with a single &lt;code&gt;pathType: Prefix&lt;/code&gt; rule and deleted the old service.  The same approach worked for all of our services with Ingress resources.&lt;/p&gt;
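
&lt;p&gt;For completeness, here is a sketch of the new service’s end state, which mirrors the original service’s single catch-all rule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;ingress:
  ingressClassName: alb
  enabled: true
  annotations:
    alb.ingress.kubernetes.io/group.name: login
  paths:
    - path: '/api/login/'
      pathType: Prefix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;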

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AWS load balancer controller's &lt;code&gt;group.order&lt;/code&gt; feature has made it trivial for my team to release new APIs. The experience reminds me that maintaining infrastructure as code provides benefits beyond everyday management of infrastructure. Features like &lt;code&gt;group.order&lt;/code&gt; allow engineers to spend more time on features and less time managing the infrastructure that they run on.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudskills</category>
      <category>aws</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Taking Your Releases Into Overdrive with GitHub Actions</title>
      <dc:creator>Derek Berger</dc:creator>
      <pubDate>Wed, 14 Aug 2024 20:04:36 +0000</pubDate>
      <link>https://forem.com/devsatasurion/taking-your-releases-into-overdrive-with-github-actions-3edj</link>
      <guid>https://forem.com/devsatasurion/taking-your-releases-into-overdrive-with-github-actions-3edj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;GitHub Actions’ seamless integration with version control simplifies creating and executing operations and infrastructure workflows. Two key features of Actions for building efficient workflows are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.github.com/en/actions/creating-actions/creating-a-composite-action" rel="noopener noreferrer"&gt;Composite actions&lt;/a&gt;.  Composite actions let you create combinations of steps that you can reuse across different kinds of workflows.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/defining-outputs-for-jobs" rel="noopener noreferrer"&gt;Job outputs&lt;/a&gt;.  Outputs make values derived from one job's steps available to downstream jobs' steps.&lt;/li&gt;
&lt;/ul&gt;
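
&lt;p&gt;As a rough illustration of the first feature, a composite action lives in its own action.yml, declaring inputs and a sequence of steps.  The sketch below is hypothetical, not our actual action; the input names and the script it runs are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/actions/dns-change/action.yml (hypothetical sketch)
name: dns-change
description: Reroute or restore DNS traffic for a region
inputs:
  dns:
    required: true
  region:
    required: true
  action:
    required: true    # e.g. stop or restore
runs:
  using: composite
  steps:
    - name: Apply the DNS change
      shell: bash
      run: ./scripts/update-dns.sh "${{ inputs.dns }}" "${{ inputs.region }}" "${{ inputs.action }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A workflow step can then call it with &lt;code&gt;uses: './.github/actions/dns-change'&lt;/code&gt;.&lt;/p&gt;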

&lt;p&gt;In this article, I’ll share how Actions’ integration with version control, composite actions, and job outputs helped my team advance our use of Actions to automate production deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mostly manual automation
&lt;/h2&gt;

&lt;p&gt;The workflow that &lt;a href="https://dev.to/devsatasurion/optimizing-devops-automation-in-the-aws-cloud-with-github-actions-42o8"&gt;my team built for disaster recovery&lt;/a&gt; uses a composite action that cuts off DNS traffic to the impaired region. It lets us execute failover in just one step, and the same workflow can be used for any operation that requires rerouting production traffic for an extended time.  We trigger it manually, exactly as we would during a failover, specifying which DNS to change and which region to cut off.&lt;/p&gt;

&lt;p&gt;While that can be helpful for major changes like infrastructure upgrades, major changes are not common.  More often we simply deploy application or configuration updates via pull requests, following &lt;a href="https://github.blog/enterprise-software/devops/applying-gitops-principles-to-your-operations" rel="noopener noreferrer"&gt;GitOps practices&lt;/a&gt;.  To keep our uptime as high as possible and avoid disrupting customers, we tried using the failover workflow to reroute DNS during these everyday changes.&lt;/p&gt;

&lt;p&gt;That at least rerouted traffic how we wanted, but it required multiple manual steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Trigger failover workflow to change DNS.&lt;/li&gt;
&lt;li&gt;Merge pull request to deploy application change.&lt;/li&gt;
&lt;li&gt;Verify pods roll out.&lt;/li&gt;
&lt;li&gt;Trigger failover workflow to restore DNS.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the workflow also required deciding which DNS needed to change, then manually selecting the right DNS and region options, the procedure was really more like five steps. Despite the automation for applying DNS changes and verifying pods, deploying anything became a drudgery.&lt;/p&gt;

&lt;p&gt;Even worse, whenever we wanted to apply the change to multiple clusters, we’d have to repeat every step multiple times. &lt;/p&gt;

&lt;p&gt;This was counterproductive, inefficient, and discouraged us from deploying as frequently as we could have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eradicating the toil
&lt;/h2&gt;

&lt;p&gt;What we needed was a workflow to deploy everyday changes in one step, not &lt;em&gt;four or five&lt;/em&gt;, so we set out to build a new workflow that uses the same composite actions as the failover workflow, but makes better overall use of GitHub Actions' automation capabilities.&lt;/p&gt;

&lt;p&gt;First, since all production changes are deployed via pull request, instead of &lt;code&gt;workflow_dispatch&lt;/code&gt; we use &lt;a href="https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#pull_request" rel="noopener noreferrer"&gt;pull request triggers&lt;/a&gt; based on branch, path, and event type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production deployment&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
   &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/region1/releases/namespace1/"&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/region1/releases/namespace2/"&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/region2/releases/namespace1/"&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/region2/releases/namespace2/"&lt;/span&gt;
   &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;closed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We broke the new workflow down into four jobs, which correspond to the old manual steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Determine what changed.&lt;/li&gt;
&lt;li&gt;Make DNS change to stop traffic.&lt;/li&gt;
&lt;li&gt;Verify services’ pods roll out successfully.&lt;/li&gt;
&lt;li&gt;Make DNS change to restore traffic.&lt;/li&gt;
&lt;/ol&gt;
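
&lt;p&gt;Sketched as a skeleton, the jobs form a simple dependency chain through &lt;code&gt;needs&lt;/code&gt; (abbreviated; the &lt;code&gt;job4&lt;/code&gt; line is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  job1:                          # determine what changed; exposes outputs
  job2:                          # stop traffic
    needs: [ job1 ]
  job3:                          # verify pods roll out
    needs: [ job1, job2 ]
  job4:                          # restore traffic
    needs: [ job1, job2, job3 ]  # assumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;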

&lt;p&gt;In the first job, we start by ensuring the workflow only proceeds upon &lt;em&gt;merged&lt;/em&gt; pull requests, and not just any pull request closed event, by adding this &lt;code&gt;if&lt;/code&gt; condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;job1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(github.event_name == 'pull_request') &amp;amp;&amp;amp; (github.event.pull_request.merged == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike the failover workflow, the new workflow must determine the right region and DNS to cut off automatically. The &lt;a href="https://github.com/dorny/paths-filter" rel="noopener noreferrer"&gt;dorny paths-filter action&lt;/a&gt; gave us exactly that.&lt;/p&gt;

&lt;p&gt;We first use it to determine region based on change path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Determine region&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dorny/paths-filter@v3&lt;/span&gt;
       &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;region-filter&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
           &lt;span class="s"&gt;region-1: 'path/to/cluster/account/region1/**'&lt;/span&gt;
           &lt;span class="s"&gt;region-2: 'path/to/cluster/account/region2/**'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we have a filter for DNS, which is important to get right because the DNS we want to reroute will vary depending on which services change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Determine DNS to change&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dorny/paths-filter@v3&lt;/span&gt;
       &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dns-filter&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
           &lt;span class="s"&gt;dns1: &lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/1'&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/2'&lt;/span&gt;
           &lt;span class="s"&gt;dns2:&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/3'&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/4'&lt;/span&gt;
           &lt;span class="s"&gt;dns3:&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/1'&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/2'&lt;/span&gt;
           &lt;span class="s"&gt;dns4:&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/1'&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/2'&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/5'&lt;/span&gt;
             &lt;span class="s"&gt;- 'path/to/services/6'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final filter determines which pods to validate, also based on which services have changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Determine services&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dorny/paths-filter@v3&lt;/span&gt;
       &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-filter&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
           &lt;span class="s"&gt;service-1:&lt;/span&gt;
             &lt;span class="s"&gt;- path/to/services/1&lt;/span&gt;
           &lt;span class="s"&gt;service-2:&lt;/span&gt;
             &lt;span class="s"&gt;- path/to/services/2&lt;/span&gt;
           &lt;span class="s"&gt;service-3:&lt;/span&gt;
             &lt;span class="s"&gt;- path/to/services/3&lt;/span&gt;
           &lt;span class="s"&gt;service-4:&lt;/span&gt;
             &lt;span class="s"&gt;- path/to/services/4 &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, &lt;a href="https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/defining-outputs-for-jobs" rel="noopener noreferrer"&gt;job outputs&lt;/a&gt; make all these filtered values available to downstream jobs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.dns-filter.outputs.changes }}&lt;/span&gt;
     &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.service-filter.outputs.changes }}&lt;/span&gt;
     &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.region-filter.outputs.region-1 == 'true' &amp;amp;&amp;amp; 'region-1' || 'region-2' }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second job applies the DNS change, but includes a condition to only proceed if it finds values in the outputs for DNS and region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;job2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;job1&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
   &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.job1.outputs.dns != '[]' &amp;amp;&amp;amp; needs.job1.outputs.dns != '' &amp;amp;&amp;amp; needs.job1.outputs.region != '[]' &amp;amp;&amp;amp; needs.job1.outputs.region != '' }}&lt;/span&gt;
   &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ fromJSON(needs.job1.outputs.dns) }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It calls the same composite action as the failover procedure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stop traffic to ${{ needs.job1.outputs.region }} ${{ matrix.dns }}&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./.github/actions/dns-change'&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ matrix.dns }}&lt;/span&gt;
         &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stop&lt;/span&gt;
         &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.job1.outputs.region }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third job verifies pods roll out, applying the values from the &lt;code&gt;services&lt;/code&gt; output to a matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;job3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;job1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;job2&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
   &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.job1.outputs.services != '[]' &amp;amp;&amp;amp; needs.job1.outputs.services != '' }}&lt;/span&gt;
   &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate pods&lt;/span&gt;
   &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ fromJSON(needs.job1.outputs.services) }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It then calls the composite action that executes the pod validation steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="s"&gt;- name&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;validate deployments and pods&lt;/span&gt;
       &lt;span class="s"&gt;uses&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./.github/actions/pod-validation'&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ matrix.service }}&lt;/span&gt;
         &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.job1.outputs.region == region-1' &amp;amp;&amp;amp; '1' || '2' }}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If every pod rolls out successfully, the workflow proceeds to restore the traffic cut off in job 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;job4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore traffic to ${{ needs.determine-changes.outputs.region }} in ${{ matrix.dns }}&lt;/span&gt;
   &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ fromJSON(needs.job1.outputs.dns) }}&lt;/span&gt;
   &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore traffic to ${{ needs.job1.outputs.region }} ${{ matrix.dns }}&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./.github/actions/dns-change'&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ matrix.dns }}&lt;/span&gt;
         &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;start&lt;/span&gt;
         &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.job1.outputs.region }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any step fails, the workflow fails and traffic remains routed away from the failed cluster while the team investigates.  If necessary, we can open a pull request to revert the change. Merging it will trigger the workflow again, effectively validating the rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Success!
&lt;/h2&gt;

&lt;p&gt;By combining the composite actions we created for the failover workflow with path filters and job outputs, we built a new workflow that reduces deployments to a single manual step: merging a pull request.&lt;/p&gt;

&lt;p&gt;The workflow takes over from there, automatically making the proper DNS changes, verifying impacted pods roll out, restoring DNS traffic, and notifying us of results.&lt;/p&gt;

&lt;p&gt;Unburdened by manual steps, our team deployed 32 changes to production in the first month of using the workflow, up from 17 the month before.  The results so far have been promising, and we'll continue looking for ways to make our release practices even better with Actions.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>automation</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Optimizing DevOps automation in the AWS cloud with GitHub Actions</title>
      <dc:creator>Derek Berger</dc:creator>
      <pubDate>Fri, 09 Feb 2024 13:26:35 +0000</pubDate>
      <link>https://forem.com/devsatasurion/optimizing-devops-automation-in-the-aws-cloud-with-github-actions-42o8</link>
      <guid>https://forem.com/devsatasurion/optimizing-devops-automation-in-the-aws-cloud-with-github-actions-42o8</guid>
      <description>&lt;p&gt;DevOps is not just about automation, but automation &lt;em&gt;is&lt;/em&gt; core to an effective DevOps practice, driving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Greater consistency and repeatability.&lt;/li&gt;
&lt;li&gt;Faster and more efficient workflows.&lt;/li&gt;
&lt;li&gt;Increased traceability and visibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AWS CLI is an essential automation tool for DevOps tasks in AWS cloud environments.  In this article you'll see how it automates DevOps tasks, and how it becomes even more capable when combined with GitHub Actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power of the AWS CLI
&lt;/h2&gt;

&lt;p&gt;As I’ve written &lt;a href="https://dev.to/devsatasurion/building-a-multi-region-highly-available-identity-provider-with-the-aws-cloud-and-ory-hydra-5c5e"&gt;previously&lt;/a&gt;, my team at Asurion handles disaster recovery/failover with the &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/creating-disaster-recovery-mechanisms-using-amazon-route-53/" rel="noopener noreferrer"&gt;Secondary Takes Over Primary (STOP)&lt;/a&gt; pattern, following a typical STOP implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DNS records are associated with Route 53 health checks, which are associated with specific S3 objects. &lt;/li&gt;
&lt;li&gt;Health check status is controlled by uploading (or deleting) its associated S3 object.&lt;/li&gt;
&lt;li&gt;A failing health check cuts off requests to its DNS target. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AWS CLI is the obvious tool for executing the failover procedure. The &lt;code&gt;s3api&lt;/code&gt; commands that trigger DNS changes under the STOP failover pattern are &lt;code&gt;put-object&lt;/code&gt; and &lt;code&gt;delete-object&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To upload the object and cut off traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api put-object &lt;span class="nt"&gt;--bucket&lt;/span&gt; &amp;lt;bucket-name&amp;gt; &lt;span class="nt"&gt;--key&lt;/span&gt; &amp;lt;object-name&amp;gt; &lt;span class="nt"&gt;--body&lt;/span&gt; &amp;lt;file-to-upload&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To reverse the operation and restore DNS traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api delete-object &lt;span class="nt"&gt;--bucket&lt;/span&gt; &amp;lt;your-bucket-name&amp;gt; &lt;span class="nt"&gt;--key&lt;/span&gt; &amp;lt;your-object-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the response from the S3 API is &lt;code&gt;OK&lt;/code&gt;, it’s probably safe to assume that the health check change has been triggered. But Route 53 also has a rich API, which lets us extend the script to verify the health check status has flipped.&lt;/p&gt;

&lt;p&gt;The commands for Route 53 aren’t as simple as those for S3, but the steps are straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query for the relevant health check ID with the &lt;code&gt;aws route53 list-health-checks&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;Get all &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-determining-health-of-endpoints.html#dns-failover-determining-health-of-endpoints-monitor-endpoint" rel="noopener noreferrer"&gt;Route 53 health checkers&lt;/a&gt; for that health check ID with &lt;code&gt;aws route53 get-health-check-status&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Repeat step 2 until every health checker has the expected status, &lt;code&gt;HEALTHY&lt;/code&gt; or &lt;code&gt;UNHEALTHY&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
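
&lt;p&gt;Those three steps can be sketched as a short shell function. The health check's FQDN and the matching on status text beginning with "Success" or "Failure" are assumptions for illustration; Route 53 reports a free-text status per checker rather than a bare HEALTHY/UNHEALTHY value.&lt;/p&gt;

```shell
# Succeed only when every line of stdin contains the expected token.
all_have_status() { ! grep -qv "$1"; }

# Poll Route 53 until every health checker reports the expected status.
# Usage: wait_for_status api.example.com Failure   (names are illustrative)
wait_for_status() {
  local fqdn="$1" expected="$2" hc_id
  # Step 1: look up the health check ID by the FQDN it monitors.
  hc_id=$(aws route53 list-health-checks \
    --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='${fqdn}'].Id" \
    --output text)
  # Steps 2-3: re-query until all checkers agree on the expected status.
  until aws route53 get-health-check-status --health-check-id "$hc_id" \
          --query 'HealthCheckObservations[].StatusReport.Status' --output text \
        | tr '\t' '\n' | all_have_status "$expected"; do
    sleep 10
  done
}
```

&lt;p&gt;Calling &lt;code&gt;wait_for_status api.example.com Failure&lt;/code&gt; after the upload would block until every checker agrees that traffic is cut off.&lt;/p&gt;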

&lt;p&gt;This approach works anytime we want to stop traffic to an endpoint, whether for disaster recovery, testing major infrastructure changes without disrupting users, or regular patching cycles.&lt;/p&gt;

&lt;p&gt;Automating these steps in a shell script checked into version control alleviates drudgery and provides consistency and repeatability for our failover procedure.  The procedures become even more efficient and transparent when executed with GitHub Actions.   &lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Actions in action
&lt;/h2&gt;

&lt;p&gt;What makes GitHub Actions unique, compared to other platforms for building, testing, and deploying infrastructure, is its seamless integration with version control. This simplifies triggering jobs to execute tasks, like our scripted failover procedure. &lt;/p&gt;

&lt;p&gt;One very powerful feature of GitHub Actions is its &lt;a href="https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs" rel="noopener noreferrer"&gt;matrix strategy&lt;/a&gt;, which makes it trivial to execute the same procedure with different configurations. In our scenario, multiple health check changes can be triggered and verified in parallel with a single job.&lt;/p&gt;

&lt;p&gt;First, following the &lt;a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch" rel="noopener noreferrer"&gt;example from GitHub's own documentation&lt;/a&gt;, the workflow is defined with inputs and &lt;code&gt;workflow_dispatch&lt;/code&gt;, which let the job be triggered with the GitHub API, CLI, or browser interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS CLI Execution&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;config1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Config&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;boolean'&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;config2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Config&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;boolean'&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;config3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Config&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3'&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;boolean'&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow has two jobs. The first job creates a matrix of configurations based on the inputs; any combination of the three inputs can be added to the matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;create-matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-18.04&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.config.outputs.matrix }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;echo "matrix=[$(echo '${{ inputs.config1 &amp;amp;&amp;amp; '"config1"' || '' }}&lt;/span&gt;
          &lt;span class="s"&gt;${{ inputs.config2 &amp;amp;&amp;amp; '"config2"' || '' }} &lt;/span&gt;
          &lt;span class="s"&gt;${{ inputs.config3 &amp;amp;&amp;amp; '"config3"' || '' }} &lt;/span&gt;
          &lt;span class="s"&gt;| sed 's/ *$//; s/^ *//; s/  */,/g')]" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second job executes the script with the 1, 2, or 3 configurations added to the matrix in the first job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;execute-script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-18.04&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;create-matrix&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ fromJson(needs.create-matrix.outputs.config) }}&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Executing script&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Executing script with ${{ matrix.config }}&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./.github/actions/script-name'&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ matrix.config }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The AWS CLI is a powerful tool for automating DevOps tasks in the AWS cloud. GitHub Actions provides a platform for executing workflows seamlessly and transparently from version control, all without the overhead of a separate build system. &lt;/p&gt;

&lt;p&gt;Combined, they enable transparent, frequent, incremental, reversible changes in production at scale.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>productivity</category>
      <category>automation</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Building a multi-region highly available identity provider with the AWS cloud and Ory Hydra</title>
      <dc:creator>Derek Berger</dc:creator>
      <pubDate>Tue, 07 Nov 2023 15:35:22 +0000</pubDate>
      <link>https://forem.com/devsatasurion/building-a-multi-region-highly-available-identity-provider-with-the-aws-cloud-and-ory-hydra-5c5e</link>
      <guid>https://forem.com/devsatasurion/building-a-multi-region-highly-available-identity-provider-with-the-aws-cloud-and-ory-hydra-5c5e</guid>
      <description>&lt;p&gt;AsurionID is an OpenID Connect (OIDC) compatible identity provider.  It allows Asurion developers to easily integrate identity and access management into their applications using a standard protocol (OIDC) and open-source libraries.  Our team worked from specific requirements, including custom user experience and low cost, so we decided to build a homegrown solution instead of using an off-the-shelf solution.  We built AsurionID on AWS using open-source Ory Hydra and custom microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  High availability using multi-AZ in a single region
&lt;/h2&gt;

&lt;p&gt;As shown in the diagram below, AsurionID's initial architecture ran its microservices on Amazon Elastic Kubernetes Service (EKS) across three Availability Zones (AZs) in a single region.  Amazon ElastiCache for Redis, used for storing temporary session data, was deployed across two AZs (primary in one, replica in another).  We used Amazon Aurora multi-AZ features to protect the database against AZ-level failures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmv9o8ac6tlhmzh3oars.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmv9o8ac6tlhmzh3oars.jpg" alt="Multi-AZ high availability" width="800" height="644"&gt;&lt;/a&gt;Multi-AZ high availability&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
This provided AsurionID with up to three nines (99.9%) of availability in a single region.  As more and more applications adopted AsurionID for identity and access management, it became increasingly critical to our business.  We wanted to protect AsurionID against region-level service disruptions, which are less frequent but can be more impactful.  That’s what led us to a multi-region architecture. &lt;/p&gt;

&lt;h2&gt;
  
  
  Designed for protection against regional service disruptions
&lt;/h2&gt;

&lt;p&gt;In our latest architecture, all microservices now run in active-active mode, in two EKS clusters, across two AWS regions. With active-active, both regions' services are always live and taking traffic, and we use Route 53 weighted routing to distribute customer traffic between the two regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfecub6240h2j2eh4gja.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfecub6240h2j2eh4gja.jpg" alt="Multi-region, active-active microservices" width="800" height="451"&gt;&lt;/a&gt;Multi-region, active-active microservices&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
We leverage Route 53 inverted health checks, following the &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/creating-disaster-recovery-mechanisms-using-amazon-route-53/" rel="noopener noreferrer"&gt;Secondary Takes Over Primary (STOP)&lt;/a&gt; pattern, to handle failover if microservices encounter region-level disruption.&lt;/p&gt;

&lt;p&gt;In our implementation of STOP, we associate the weighted DNS records with the inverted health checks, and those health checks with S3 objects. We invoke health check failure for a particular DNS record by uploading its associated object. The failing health check stops Route 53 from forwarding requests to its associated regional ALB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhecymf47njj83v486z2f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhecymf47njj83v486z2f.jpg" alt="STOP pattern for failing over microservices" width="800" height="516"&gt;&lt;/a&gt;STOP pattern for failing over microservices&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
With this approach, we have achieved &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/static-stability.html" rel="noopener noreferrer"&gt;static stability&lt;/a&gt; and independence from the Route 53 &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/control-planes-and-data-planes.html" rel="noopener noreferrer"&gt;control plane&lt;/a&gt; for failing over our microservices, which has resulted in higher availability for AsurionID microservices, up to four nines (99.99%).&lt;/p&gt;

&lt;p&gt;We have taken a slightly different approach for the caching layer.  Since we cache only ephemeral data like one-time passcodes (OTP), we aren’t replicating this data to the secondary region.  But we have another ElastiCache for Redis cluster always running in the secondary region, and in case our caching layer is impaired by an AWS regional service interruption, we would invoke failover using STOP, just like our microservices. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07pzye0qa6vsgu3mqkk2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07pzye0qa6vsgu3mqkk2.jpg" alt="Multi-region caching architecture" width="800" height="405"&gt;&lt;/a&gt;Multi-region caching architecture&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
This new architecture has helped us achieve static stability and control plane independence for the caching layer as well as the application layer. &lt;/p&gt;

&lt;p&gt;For the database, we are using &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html" rel="noopener noreferrer"&gt;Aurora Global database&lt;/a&gt; with a read replica in the secondary region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t7d314mdtns0mx3tyun.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t7d314mdtns0mx3tyun.jpg" alt="Aurora Global database" width="800" height="383"&gt;&lt;/a&gt;Aurora Global database&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
In case of a region-level Aurora impairment, we would promote the secondary region's cluster to primary.&lt;/p&gt;
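
&lt;p&gt;A minimal sketch of that promotion, assuming detach-and-promote via &lt;code&gt;remove-from-global-cluster&lt;/code&gt; (one approach for an unplanned regional outage; all identifiers below are illustrative):&lt;/p&gt;

```shell
# Detach the secondary cluster from the global database, which promotes it
# to a standalone, writable cluster. All identifiers here are illustrative.
promote_secondary() {
  aws rds remove-from-global-cluster \
    --region us-west-2 \
    --global-cluster-identifier asurionid-global \
    --db-cluster-identifier arn:aws:rds:us-west-2:123456789012:cluster:asurionid-secondary
}
```

&lt;p&gt;Promotion alone doesn't move traffic; the CNAME update described in the next section is still required.&lt;/p&gt;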

&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;p&gt;We now strive for the same static stability and control-plane independence in the database layer as we have for our microservices and caching layers.  In our current database architecture, the promotion of the read replica triggers a Lambda that updates Route 53 CNAME values (a control plane function) to route all application traffic to the new primary database cluster.  We are looking for new approaches to database failover that use data plane operations.&lt;/p&gt;

&lt;p&gt;One potential option is &lt;a href="https://aws.amazon.com/route53/application-recovery-controller/" rel="noopener noreferrer"&gt;AWS Route 53 Application Recovery Controller (ARC)&lt;/a&gt;.  Route 53 ARC works with Route 53 health checks to enable failover using the data plane, with the extra capability of checking the standby database to ensure it is ready for failover.  ARC can also fail over an entire application stack in one operation, making it expandable to our cache and microservice layers. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have walked you through how AsurionID started out with a multi-AZ approach to high availability and how we further improved availability with a multi-region architecture. Our architecture protects AsurionID against regional AWS service disruptions, achieves static stability, and uses data plane functions for failing over the microservices and caching layers.&lt;/p&gt;

&lt;p&gt;While the primary goals of our multi-region architecture were improved availability and resiliency, the architecture has provided the team with even more benefits.  We can now perform releases and infrastructure upgrades during business hours without impacting customers by routing traffic to one region while performing tasks in the other.  The ability to perform critical operations during the day has improved the quality of life for the engineers.  Of course, we could have realized these capabilities with a single-region architecture, but for us, they became additional benefits of a multi-region architecture.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://www.asurion.com/" rel="noopener noreferrer"&gt;Asurion&lt;/a&gt; is a leading tech care company that provides device protection, tech support, repair, and replacement services to 300 million customers worldwide.  It partners with mobile carriers, retailers, and device manufacturers to deliver innovative solutions for smartphones, tablets, computers, and home appliances in over 20 countries worldwide.  Asurion is headquartered in Nashville, TN.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>resiliency</category>
      <category>cloudskills</category>
      <category>highavailability</category>
    </item>
  </channel>
</rss>
