Forem: Verifa crew

How to use the AWS Load Balancer Controller to connect multiple EKS clusters with existing Application Load Balancers

Verifa crew — Thu, 24 Oct 2024 14:05:33 +0000

This was originally posted on Verifa's blog, written by Jacob Lärfors.

Exposing services in AWS EKS clusters via Load Balancers can be done in many ways. In this post we explore using the AWS Load Balancer Controller to dynamically bind nodes to existing Application Load Balancers.

Background

In a project I am working on we manage the AWS infrastructure with Terraform (Application Load Balancers, Elastic Kubernetes Service clusters, Security Groups, etc.). We also have one requirement; the Application Load Balancers (ALBs) need to be treated like pets, primarily because another team manages DNS records. This is why we cannot use Kubernetes controllers to dynamically manage the ALBs and update DNS records. Thus, the problem statement can be summarised as: how to manage AWS EKS clusters and ALBs with Terraform, and attach EKS nodes to ALB TargetGroups.

Connect nodes to ALBs using Terraform

Our initial implementation used Terraform to attach the AWS AutoScalingGroups to TargetGroups.

When using self-managed node groups you can pass a list of target_group_arns to have any nodes part of the AutoScalingGroup to auto-register themselves as targets to the given TargetGroup ARNs. Nice and easy. Check the docs. We use the community AWS EKS Terraform module which supports this argument.

When using EKS managed node groups, the option to pass target_group_arns is not available, and EKS will dynamically generate AutoScalingGroups based on your EKS node group definitions. The side effect here is that the AutoScalingGroup IDs are not known until they have been created. When working with Terraform this becomes a problem. It requires you to run Terraform apply with the -target option to first provision the EKS node group before you can reference the AutoScalingGroup and attach it to a TargetGroup.

Here’s a lovely GitHub issue with more details: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1539

This really is the core source of the problem we wanted to address; how to use EKS managed node groups and not have a hacky solution. And the solution we chose was the AWS Load Balancer Controller.

AWS Load Balancer Controller

The AWS Load Balancer Controller is a Kubernetes controller that can manage the lifecycle of AWS Load Balancers, TargetGroups, Listeners (and Rules), and connect them with nodes (and pods) in your Kubernetes cluster. Check out the how it works page for details on the design.

Looking back at our use case, we want to use an existing Application Load Balancer that is managed by Terraform. If you search for this online, you will most certainly find another lovely GitHub issue: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/228

It’s a long issue, with lots of suggestions. Personally, I was not concerned with how much of the infrastructure we manage with Terraform vs Kubernetes; the primary goal was a simple solution that did not require too much customisation, and I found the idea of the TargetGroupBinding Custom Resource Definition (CRD) quite appealing.

TargetGroupBinding Custom Resource Definition

TargetGroupBinding is a CRD that the AWS Load Balancer Controller installs. If you follow the most common use case and manage your ALBs with the AWS Load Balancer controller, it will create TargetGroupBindings under the hood even if you do not interact with them directly. Good news; it is a core feature of the AWS LB Controller, not an extension for people with an edge case. That gave me some confidence.

It requires an existing TargetGroup and IAM policies to lookup and attach/detach targets to the TargetGroup. There’s a mention of the required IAM policies required here.

Here’s a sample TargetGroupBinding manifest taken from the docs:

apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: ingress-nginx-binding
spec:
  # The service we want to connect to.
  # We use the Ingress Nginx Controller so let's point to that service.
  # NOTE: it was necessary for this TargetGroupBinding to be in the same namespace as the service.
  serviceRef:
    name: ingress-nginx
    port: 80
  # NOTE: need the ARN of the TargetGroup... Bit of a PITA.
  # Would be nice to use tags to look up the TargetGroup, for example.
  targetGroupARN: arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/eks-abcdef/73e2d6bc24d8a067

For our case, this means managing the ALBs, Listeners, Rules and TargetGroups with Terraform. The AWS Load Balancer Controller would only be responsible for attaching nodes to the specified TargetGroups. This sounds like a clean separation of concerns.

Multiple EKS clusters, same ALB

Expanding on our particular use case, we manage multiple EKS clusters that share Application Load Balancers. We already use ArgoCD ApplicationSets to manage applications across clusters. We have a “root” cluster that runs core services, like ArgoCD, which connects to multiple other clusters. The below diagram is a high-level simplified illustration of the setup we want to achieve. It will be ArgoCD’s job to deploy the AWS Load Balancer Controller and TargetGroupBinding manifests to the different clusters.

Let’s look at how we implemented this with the AWS Load Balancer Controller next.

Implementation

Terraform

We use Terraform to manage (amongst other things) the EKS clusters, ALBs and TargetGroups. For implementing the AWS Load Balancer Controller all we needed to do was create the necessary IAM role that can be assumed by a Kubernetes ServiceAccount. The following snippet show this:

#
# Create IAM role policy granting the kubernetes service account AssumeRoleWithWebIdentity
#
data "aws_iam_policy_document" "aws_lb_controller" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type = "Federated"
      identifiers = [
        "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${local.cluster_oidc_issuer}"
      ]
    }

    condition {
      test     = "StringEquals"
      variable = "${local.cluster_oidc_issuer}:sub"
      values   = ["system:serviceaccount:aws-lb-controller:aws-lb-controller"]
    }
  }
}

#
# Create an AWS IAM role that will be assumed by our kubernetes service account
#
resource "aws_iam_role" "aws_lb_controller" {
  name               = "${local.cluster_name}-aws-lb-controller"
  assume_role_policy = data.aws_iam_policy_document.aws_lb_controller.json
  inline_policy {
    name = "${local.cluster_name}-aws-lb-controller"
    policy = jsonencode(
      {
        "Version" : "2012-10-17",
        "Statement" : [
          {
            "Action" : [
              "ec2:DescribeVpcs",
              "ec2:DescribeSecurityGroups",
              "ec2:DescribeInstances",
              "elasticloadbalancing:DescribeTargetGroups",
              "elasticloadbalancing:DescribeTargetHealth",
              "elasticloadbalancing:ModifyTargetGroup",
              "elasticloadbalancing:ModifyTargetGroupAttributes",
              "elasticloadbalancing:RegisterTargets",
              "elasticloadbalancing:DeregisterTargets"
            ],
            "Effect" : "Allow",
            "Resource" : "*"
          }
        ]
      }
    )
  }
}

This has some dependencies so cannot be run “as is”. But if you need help configuring IAM Roles for Service Accounts (IRSA) then I already wrote a post on the topic which you can find here.

The TargetGroupBinding Kubernetes Custom Resource we need to create requires the TargetGroup ARN which is non deterministic. In our setup, we use Terraform to create the Kubernetes secret that informs ArgoCD about connected clusters. Within that secret we can attach additional labels that can be accessed by the ArgoCD ApplicationSets, which means we have a very primitive way of passing data from Terraform to ArgoCD without ay extra tools. Note that Kubernetes labels have some restrictions on the syntax and character set that can be used, so we can’t just pass in arbitrary data.

Here is how we create the Kubernetes secret that essentially “register” a cluster with ArgoCD, and how we pass the TargetGroup name and ID via labels.

locals {
 #
 # Extract the target group name and ID to use in ArgoCD secret
 #
 aws_lb_controller = merge(coalesce(local.apps.aws_lb_controller, { enabled = false }), {
    targetgroup_name = split("/", data.aws_lb_target_group.this.arn_suffix)[1]
    targetgroup_id   = split("/", data.aws_lb_target_group.this.arn_suffix)[2]
  })
}

#
# Get cluster TargetGroup ARNs
#
data "aws_lb_target_group" "this" {
  name = "madeupname-${local.cluster_name}"
}

#
# Create Kubernetes secret in root cluster where ArgoCD is running.
#
# The secret tells ArgoCD about a cluster and how to connect (e.g. credentials).
#
resource "kubernetes_secret" "argocd_cluster" {
  provider = kubernetes.root

  metadata {
    name      = "cluster-${local.cluster_name}"
    namespace = "argocd"
    labels = {
      # Tell ArgoCD that this secret defines a new cluster
      "argocd.argoproj.io/secret-type" = "cluster"
      "environment"                    = var.environment
      "aws-lb-controller/enabled"      = local.aws_lb_controller.enabled
      # Kubernetes labels do not allow ARN values, so pass the name and ID separately
      "aws-lb-controller/targetgroup-name" = local.aws_lb_controller.targetgroup_name
      "aws-lb-controller/targetgroup-id"   = local.aws_lb_controller.targetgroup_id
   ...
   ...

    }
  }

  data = {
    name   = local.cluster_name
    server = data.aws_eks_cluster.this.endpoint
    config = jsonencode({
      awsAuthConfig = {
        clusterName = data.aws_eks_cluster.this.id
        # Provide the rolearn that was created for this cluster, and which the
        # root ArgoCD role should be able to assume
        roleARN = aws_iam_role.argocd_access.arn
      }
      tlsClientConfig = {
        insecure = false
        caData   = data.aws_eks_cluster.this.certificate_authority.0.data
      }
    })
  }

  type = "Opaque"
}

That’s it for the Terraform config. We use these snippets in Terraform modules that get called for each cluster we create, keeping things DRY.

ArgoCD

Let’s first look at the directory structure and we can work through the relevant files (note that in our code this exists alongside many other applications inside a Git repository).

.
├── appset-aws-lb-controller.yaml
├── appset-targetgroupbindings.yaml
├── chart
│   ├── Chart.yaml
│   ├── README.md
│   ├── templates
│   │   └── targetgroupbindings.yaml
│   └── values.yaml
└── kustomization.yaml

2 directories, 7 files

We use Kustomize to connect our ArgoCD applications together (minimising the number of “app of apps” connections needed) and that’s what the kustomization.yaml file is for, and here it contains the two top-level appset-*.yaml files.

The appset-aws-lb-controller.yaml file contains the AWS Load Balancer Controller ApplicationSet which uses the Helm chart to install the controller on each of our clusters.

# File: appset-aws-lb-controller.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: aws-lb-controller
  namespace: argocd
spec:
  generators:
  # This is a little trick we use for nearly all our apps to control which apps
  # should be installed on which clusters. Terraform creates the secret that
  # contains these labels, so that's where the logic is controlled for setting
  # these labels to true/false.
    - clusters:
        selector:
          matchLabels:
            aws-lb-controller/enabled: "true"
  syncPolicy:
    preserveResourcesOnDeletion: false
  template:
    metadata:
      name: "aws-lb-controller-{{ name }}"
      namespace: argocd
    spec:
      project: madeupname
      destination:
        name: "{{ name }}"
        namespace: aws-lb-controller
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - PruneLast=true

      source:
        repoURL: "https://aws.github.io/eks-charts"
        chart: aws-load-balancer-controller
        targetRevision: 1.4.4
        helm:
          releaseName: aws-lb-controller
          values: |
            clusterName: {{ name }}
            serviceAccount:
              create: true
              annotations:
                "eks.amazonaws.com/role-arn": "arn:aws:iam::123456789012:role/{{ name }}-aws-lb-controller"
              name: aws-lb-controller
            # We won't be using ingresses with this controller.
            createIngressClassResource: false
            disableIngressClassAnnotation: true

            resources:
              limits:
                cpu: 100m
                memory: 128Mi
              requests:
                cpu: 100m
                memory: 128Mi

Next up we have the appset-targetgroupbindings.yaml ApplicationSet which creates the TargetGroupBinding on each of our clusters. For this, we needed to template some values based on the cluster and the most straightforward way I have found with ArgoCD is to create a minimalistic Helm chart for our purpose. This is what the chart/ directory is, which contains a single template file targetgroupbindings.yaml.

Let’s first look at the ApplicationSet which installs the Helm Chart:

# File: appset-targetgroupbindings.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: targetgroupbindings
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            aws-lb-controller/enabled: "true"
  syncPolicy:
    preserveResourcesOnDeletion: false
  template:
    metadata:
      name: "targetgroupbindings-{{ name }}"
      namespace: argocd
    spec:
      project: madeupname
      destination:
        name: "{{ name }}"
        # TargetGroupBindings need to be in the same namespace as the service
        # they bind to
        namespace: ingress-nginx
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - PruneLast=true

      source:
        repoURL: "<url-of-this-repo>"
        path: "apps/aws-lb-controller/chart"
        targetRevision: "master"
        helm:
          releaseName: targetgroupbinding
          values: |
            # AWS ARNs are not valid Kubernetes label values, so we had to split the ARN up and glue it back together here.
            targetGroupArn: "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/{{ metadata.labels.aws-lb-controller/targetgroup-name }}/{{ metadata.labels.aws-lb-controller/targetgroup-id }}"

Next we can look at the single template in our minimal Helm chart:

# File: targetgroupbindings.yaml
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: ingress-nginx
spec:
  targetType: instance
  serviceRef:
    name: ingress-nginx-controller
    port: 80
  targetGroupARN: {{ required "targetGroupArn required" .Values.targetGroupArn }}
  # By default, add all nodes to the cluster unless they have the label
  # exclude-from-lb-targetgroups set
  nodeSelector:
    matchExpressions:
      - key: exclude-from-lb-targetgroups
        operator: DoesNotExist

It’s not an ideal situation to use Helm charts for this purpose, but the inspiration came from the ArgoCD app of apps example repository. Anyway, I am fairly happy with this implementation and it works dynamically for any new clusters that we add to ArgoCD.

Conclusion

In this post we looked at a fairly specific problem; binding EKS cluster nodes to existing Application Load Balancers using the TargetGroupBinding CRD from the AWS Load Balancer Controller. The motivation to make this write-up came from the number of people asking about this on GitHub, and I think this is quite a simple and elegant approach.

A point worth noting is that using the AWS Load Balancer Controller decouples your node management with your cluster management. Let’s say we wanted to use Karpenter for autoscaling instead of the defacto cluster-autoscaler. Karpenter will not use AWS AutoScalingGroups but will instead create standalone EC2 instances based on the Provisioners you define. This means our previous approach of attaching AutoScalingGroups with TargetGroups will not work as the EC2 instances Karpenter manages will not belong to the AutoScalingGroup and therefore not be automatically attached to the TargetGroup. The AWS Load Balancer Controller doesn’t care how the nodes are created; only that they belong to the cluster and match the label selectors defined. Probably we will look into Karpenter again in the near future for our project now that it supports pod anti-affinity, as this was previously a blocker for us.

How to secure Terraform code with Trivy

Verifa crew — Wed, 14 Aug 2024 07:10:39 +0000

This was originally posted on Verifa's blog, written by Mike Vainio.

In this blog post we will look at securing an AWS Terraform configuration using Trivy to check for known security issues. We will explore different ways of using Trivy, integrating it into your CI pipelines, practical issues you might face and solutions to those issues to get you started with improving the security of your IaC codebases.

Terraform is a powerful tool with a thriving community that makes it easy to find ready-made modules and providers for practically any cloud platform or service that exposes an API. Also internally many companies have a great deal of modules available. One of the strengths of Terraform is that modules provide an abstraction. You don’t have to worry about what is underneath the module’s variables (interface); you provide the necessary values and off you go, but this might lead into some trouble security-wise. Especially in public cloud platforms, it’s easy to expose a VM, load balancer or an object storage bucket publicly to the internet, and when using an abstraction this can happen without you truly acknowledging it. If you are familiar with Terraform, you might say, “Well I will just review the plan before applying”. But if the module is presenting you a plan of creating/modifying 200 resources, are you really confident you can eyeball that information and catch a misconfiguration that would expose your infrastructure to an attacker?

At the time of writing there are around 16,000 modules available from HashiCorp’s public registry. In this post we will pick a couple of AWS modules and check for insecure configurations. For this “check”, we will use an open source tool called Trivy.

Security Scanners for Terraform

One of the big upsides of maintaining your infrastructure using an IaC approach is the fact that your infrastructure can be analysed by static analysis tools since your infrastructure is in plain text files. We can analyse the infrastructure before creating any resources to get quick feedback on the security posture and fix any issues before deployment. The only problem is that there are so many tools! After trying few alternatives, however, I have settled on a favourite that is both easy to use and effective at finding issues with built-in checks. In the past this favourite tool was tfsec , but quite recently the development efforts of the Tfsec project have been migrated into the Trivy project. Thus, it’s time to move over to Trivy although it’s not specialised to Terraform like Tfsec was.

[!NOTE]
I noticed a few differences between Tfsec and Trivy when comparing their results and I will make note of these later in the hands-on section. Based on the GitHub issues and PRs, both open and closed, I am confident Trivy will eventually match and surpass Tfsec in features and accuracy as the development team looks very keen on closing gaps between the two tools.

Worth noting that there are some great open-source alternatives to Trivy, but overall we have found Trivy to be both easy to use locally and to integrate into build pipelines.

There’s of course nobody stopping you from using multiple tools, and when you automate the checks, that might not be a big deal to implement in the end. However, the goal of this blog post is not to focus on comparing different tools. What we want to focus on is that you should use a tool like this to perform security checks on your Terraform code, and it’s quite trivial to accomplish in the end.

Introduction to Trivy

Trivy is a Swiss army knife type of tool for security scanning of various types of artifacts and code. It can scan different targets such as your local filesystem or a container image from a container registry. It can also check for many kinds of security issues such as known vulnerabilities, exposed secrets and most relevant to this blog post; misconfigurations.

At the time of writing Trivy supports scanning of various IaC configurations such as Terraform, CloudFormation and Azure Resource Manager. So even if your organisation uses different tools across teams, Trivy might just be the right tool. Trivy comes with built-in checks for various cloud platforms and in this blog post we will only use the built-in checks, but you can also define your own custom checks/policies.

Trivy can also scan for secrets which you should also use in the IaC context, but this is not really specific to the Terraform use-case. I suggest looking into the Trivy documentation to discover all of it’s power beyond what I already covered here as this will naturally evolve over time.

Now it’s time to get our hands dirty and look at an example of how Trivy can save you from doing things that might put your organisation in jeopardy.

Installing Trivy

For installation I suggest checking out the installation guide in the documentation that covers all supported platforms. But for a quick start, here are a couple of commands that work for most folks:

brew install trivy

For Debian/Ubuntu:

apt install trivy

For Windows:

choco install trivy

There are also pre-built packages available for various Linux distros, or grab the binary from GitHub releases: https://github.com/aquasecurity/trivy/releases

I highly suggest verifying the signature when installing, especially when you are using Trivy in your production build pipelines.

Scanning an Example Terraform Module

Let’s create an example Terraform root module in order to get something to point Trivy at. Like I mentioned earlier, there are many open-source modules for Terraform that we can utilise in order to quickly build infrastructure. The AWS modules are especially popular, so I thought let’s write an example by utilising a couple of these modules with mostly their default configuration. Here’s what I came up with:

#main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

locals {
  common_tags = {
    Terraform = "true"
    Environment = "dev"
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "my-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = false

  tags = local.common_tags
}

data "aws_ami" "amazon_linux" {
  most_recent = true

  owners = ["amazon"]

  filter {
    name = "name"

    values = [
      "amzn2-ami-hvm-*-x86_64-gp2",
    ]
  }

  filter {
    name = "owner-alias"

    values = [
      "amazon",
    ]
  }
}

resource "aws_instance" "this" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.nano"
  subnet_id     = element(module.vpc.private_subnets, 0)
}

module "alb" {
  source  = "terraform-aws-modules/alb/aws"
  version = "8.7.0"

  name = "my-alb"

  load_balancer_type = "application"

  vpc_id             = module.vpc.vpc_id
  subnets            = module.vpc.public_subnets

  target_groups = [
    {
      name_prefix      = "pref-"
      backend_protocol = "HTTP"
      backend_port     = 80
      target_type      = "instance"
      targets = {
        my_ec2 = {
          target_id = aws_instance.this.id
          port      = 8080
        }
      }
    }
  ]

  http_tcp_listeners = [
    {
      port               = 80
      protocol           = "HTTP"
      target_group_index = 0
    }
  ]

  tags = local.common_tags
}

This configuration is ~100 LoC and it will create a VPC, an EC2 instance and an ALB. Naturally, the ALB also targets the EC2 instance. After creating the file, let’s initialise Terraform to download the external modules:

terraform init

[!NOTE]
If you are working with local modules then there is no need to run terraform init before the scan as all files are already present, but remote modules must be fetched in to the .terraform folder before a scan.

Simplest way to run a Trivy misconfiguration scan is to point it at your current folder:

trivy config .

Like mentioned earlier, we can also scan for secrets at the same time with Trivy:

trivy fs --scanners misconfig,secret .

Due to the focus on Terraform, I’ll use the config subcommand for the rest of the blog post, but in a CI pipeline I would run the secrets scanning definitely for the whole repository as well, not only in IaC folders.

Before showing the full results, I noticed there are some example configurations picked up by Trivy from the remote modules, such as this:

HIGH: IAM policy document uses sensitive action 'logs:CreateLogStream' on wildcarded resource '*'
═══════════════════════════════════════════════════════════════════════════════════════════════════════
You should use the principle of least privilege when defining your IAM policies.
This means you should specify each exact permission required without using wildcards,
as this could cause the granting of access to certain undesired actions, resources and principals.

See https://avd.aquasec.com/misconfig/avd-aws-0057
───────────────────────────────────────────────────────────────────────────────────────────────────────
 modules/vpc/vpc-flow-logs.tf:112
   via modules/vpc/vpc-flow-logs.tf:100-113 (data.aws_iam_policy_document.vpc_flow_log_cloudwatch[0])
    via modules/vpc/vpc-flow-logs.tf:97-114 (data.aws_iam_policy_document.vpc_flow_log_cloudwatch[0])
     via modules/vpc/examples/complete/main.tf:25-82 (module.vpc)
───────────────────────────────────────────────────────────────────────────────────────────────────────
  97   data "aws_iam_policy_document" "vpc_flow_log_cloudwatch" {
  ..
 112 [     resources = ["*"]
 ...
 114   }

If you look closely you notice that the source of the finding is a main.tf file in the examples folder of the VPC module: via modules/vpc/examples/complete/main.tf:25-82 (module.vpc)

This is not really our configuration and we should not include these files in the scan. This is also a difference between Tfsec and Trivy, when running Tfsec it does not pickup the examples folder.

However, we can easily resolve this by skipping all files under examples folders and then we should get proper report of findings:

trivy config . --skip-dirs '**/examples'

Here are the results:

.terraform/modules/alb/main.tf (terraform)
==========================================
Tests: 3 (SUCCESSES: 1, FAILURES: 2, EXCEPTIONS: 0)
Failures: 2 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 2, CRITICAL: 0)

HIGH: Application load balancer is not set to drop invalid headers.
════════════════════════════════════════════════════════════════════════════════════════════════
Passing unknown or invalid headers through to the target poses a potential risk of compromise.

By setting drop_invalid_header_fields to true, anything that doe not conform to well known,
defined headers will be removed by the load balancer.

See https://avd.aquasec.com/misconfig/avd-aws-0052
────────────────────────────────────────────────────────────────────────────────────────────────
 .terraform/modules/alb/main.tf:23
   via .terraform/modules/alb/main.tf:5-63 (aws_lb.this[0])
────────────────────────────────────────────────────────────────────────────────────────────────
   5   resource "aws_lb" "this" {
   .
  23 [   drop_invalid_header_fields                  = var.drop_invalid_header_fields
  ..
  63   }
────────────────────────────────────────────────────────────────────────────────────────────────

HIGH: Load balancer is exposed publicly.
════════════════════════════════════════════════════════════════════════════════════════════════
There are many scenarios in which you would want to expose a load balancer to the wider internet,
but this check exists as a warning to prevent accidental exposure of internal assets.
You should ensure that this resource should be exposed publicly.

See https://avd.aquasec.com/misconfig/avd-aws-0053
────────────────────────────────────────────────────────────────────────────────────────────────
 .terraform/modules/alb/main.tf:12
   via .terraform/modules/alb/main.tf:5-63 (aws_lb.this[0])
────────────────────────────────────────────────────────────────────────────────────────────────
   5   resource "aws_lb" "this" {
   .
  12 [   internal           = var.internal
  ..
  63   }
────────────────────────────────────────────────────────────────────────────────────────────────

.terraform/modules/vpc/main.tf (terraform)
==========================================
Tests: 1 (SUCCESSES: 0, FAILURES: 1, EXCEPTIONS: 0)
Failures: 1 (UNKNOWN: 0, LOW: 0, MEDIUM: 1, HIGH: 0, CRITICAL: 0)

MEDIUM: VPC Flow Logs is not enabled for VPC
════════════════════════════════════════════════════════════════════════════════════════════════
VPC Flow Logs provide visibility into network traffic that traverses the VPC and can be used to
detect anomalous traffic or insight during security workflows.

See https://avd.aquasec.com/misconfig/avd-aws-0178
────────────────────────────────────────────────────────────────────────────────────────────────
 .terraform/modules/vpc/main.tf:29-52
────────────────────────────────────────────────────────────────────────────────────────────────
  29 ┌ resource "aws_vpc" "this" {
  30 │   count = local.create_vpc ? 1 : 0
  31 │
  32 │   cidr_block          = var.use_ipam_pool ? null : var.cidr
  33 │   ipv4_ipam_pool_id   = var.ipv4_ipam_pool_id
  34 │   ipv4_netmask_length = var.ipv4_netmask_length
  35 │
  36 │   assign_generated_ipv6_cidr_block     = var.enable_ipv6 && !var.use_ipam_pool ? true : null
  37 └   ipv6_cidr_block                      = var.ipv6_cidr
  ..
────────────────────────────────────────────────────────────────────────────────────────────────

main.tf (terraform)
===================
Tests: 3 (SUCCESSES: 1, FAILURES: 2, EXCEPTIONS: 0)
Failures: 2 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 2, CRITICAL: 0)

HIGH: Instance does not require IMDS access to require a token
════════════════════════════════════════════════════════════════════════════════════════════════

IMDS v2 (Instance Metadata Service) introduced session authentication tokens which improve
security when talking to IMDS.
By default <code>aws_instance</code> resource sets IMDS session auth tokens to be optional.
To fully protect IMDS you need to enable session tokens by using <code>metadata_options</code>
block and its <code>http_tokens</code> variable set to <code>required</code>.

See https://avd.aquasec.com/misconfig/avd-aws-0028
────────────────────────────────────────────────────────────────────────────────────────────────
 main.tf:55-59
────────────────────────────────────────────────────────────────────────────────────────────────
  55 ┌ resource "aws_instance" "this" {
  56 │   ami           = data.aws_ami.amazon_linux.id
  57 │   instance_type = "t3.nano"
  58 │   subnet_id     = element(module.vpc.private_subnets, 0)
  59 └ }
────────────────────────────────────────────────────────────────────────────────────────────────

HIGH: Root block device is not encrypted.
════════════════════════════════════════════════════════════════════════════════════════════════
Block devices should be encrypted to ensure sensitive data is held securely at rest.

See https://avd.aquasec.com/misconfig/avd-aws-0131
────────────────────────────────────────────────────────────────────────────────────────────────
 main.tf:55-59
────────────────────────────────────────────────────────────────────────────────────────────────
  55 ┌ resource "aws_instance" "this" {
  56 │   ami           = data.aws_ami.amazon_linux.id
  57 │   instance_type = "t3.nano"
  58 │   subnet_id     = element(module.vpc.private_subnets, 0)
  59 └ }
────────────────────────────────────────────────────────────────────────────────────────────────

For brevity I shortened some of the long lines.

Inspecting the Results

Unfortunately Trivy does not print a summary in the end like tfsec does which makes it nice to read the output from bottom to top. Trivy does offer different ways to modify the resulting report, but for the needs of this blog I quickly used grep to find a short summary of each finding:

HIGH: Application load balancer is not set to drop invalid headers.
HIGH: Load balancer is exposed publicly.
MEDIUM: VPC Flow Logs is not enabled for VPC
HIGH: Instance does not require IMDS access to require a token
HIGH: Root block device is not encrypted.

Looking at the list, I think we want to change the configuration before deploying the resources. Next, let’s look at our options for resolving these findings.

Resolving the Issues

We have two choices when it comes to resolving these issues so that we can have a nice clean report (until we change the configuration again). We can either resolve the issues by modifying our configuration or we can choose to accept the finding as something that is not relevant for our requirements and ignore the findings for future scans.

Since we use the public AWS modules, we cannot easily make changes besides the inputs without forking the source module, however we can change the EC2 instance which is defined directly in the main.tf . So let’s look into the issues related to the EC2 instance first.

There is a finding related to the AWS Instance Metadata Service (IMDS). The finding is related to making sure the instance uses the IMDSv2 instead of the legacy IMDSv1. Looking at the full report above, you can see that Trivy explicitly tells us what the problem is and how to resolve it:

IMDS v2 (Instance Metadata Service) introduced session authentication tokens which improve
security when talking to IMDS.
By default <code>aws_instance</code> resource sets IMDS session auth tokens to be optional.
To fully protect IMDS you need to enable session tokens by using <code>metadata_options</code>
block and its <code>http_tokens</code> variable set to <code>required</code>.

You can read more of the security benefits and scenarios where this configuration matters in the IMDSv2 announcement blog post by AWS.

To resolve this, we are going to change the configuration in the following way, like the description suggested:

resource "aws_instance" "this" {
   ami           = data.aws_ami.amazon_linux.id
   instance_type = "t3.nano"
   subnet_id     = element(module.vpc.private_subnets, 0)
+
+  metadata_options {
+    http_tokens = "required"
+  }
 }

If you run the scan again you will notice the finding is gone:

trivy config . --skip-dirs '**/examples'

Let’s move onto the next issue that is related to the root disk being unencrypted for this EC2 instance. In reality I would configure encryption because it is a common compliancy requirement and AWS makes it very easy, but for the sake of the example let’s ignore this instead, saying that we are ok with running unencrypted root disk:

+#trivy:ignore:avd-aws-0131
 resource "aws_instance" "this" {
   ami           = data.aws_ami.amazon_linux.id
   instance_type = "t3.nano"
   subnet_id     = element(module.vpc.private_subnets, 0)

   metadata_options {
     http_tokens = "required"
   }
 }

Using the inline method for ignoring findings is the most intuitive way in my opinion, this might be familiar to you if you have worked with just about any code linter in the past.

Now the only remaining issues are related to the AWS modules which we did not author. Unfortunately, right now Trivy can’t figure out that the remote modules are downloaded under different path (.terraform/modules) than what is declared when specifying the source for the modules:

module "alb" {
  source  = "terraform-aws-modules/alb/aws"
  version = "8.7.0"
...

Again, not an issue when using local modules since the paths nicely match between the findings report and the declaration in the Terraform configuration. This issue is actively being discussed in the Trivy GitHub repository, so when you read this it might be fixed and I should update this blog.

Luckily Trivy has a cure for this even without us waiting for a fix. I quickly brewed a solution using an advanced filtering mechanism in Trivy that uses the Rego language:

package trivy

import data.lib.trivy

default ignore = false

ignore_avdid := {"AVD-AWS-0052", "AVD-AWS-0053"}

ignore_severities := {"LOW", "MEDIUM"}

ignore {
 input.AVDID == ignore_avdid[_]
}

ignore {
 input.Severity == ignore_severities[_]
}

This will resolve the issue with the VPC module because all MEDIUM findings are ignored, and the findings in the ALB module are ignored explicitly by the finding IDs.

Now we can run a scan and include this policy from a local file:

trivy config . --skip-dirs '**/examples' --ignore-policy custom-policy.rego

Now all the findings should be resolved (of course if you try this yourself, there might be new built-in checks and you get a different list of findings).

While authoring the custom ignore policy, I couldn’t figure out how to connect the source of the finding (the ALB module) to the AVDID for a more fine-grained rule, but that does not seem like a big deal to me. You should not place all your IaC configuration into a single root module after all, in production I would separate the VPC creation from this Terraform root module and use an existing one that is part of a different root module.

Scanning Terraform Plans

Another way to run the scan is to first create a plan and then convert the plan from the default binary format to JSON, and then point Trivy to scan this plan that contains a list of changes:

terraform plan --out tf.plan
terraform show -json tf.plan > tfplan.json
trivy config tfplan.json

I noticed that there’s one additional finding when running Trivy against the plan:

CRITICAL: Listener for application load balancer does not use HTTPS.
═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Plain HTTP is unencrypted and human-readable. This means that if a malicious actor was to eavesdrop on your connection,
they would be able to see all of your data flowing back and forth.

You should use HTTPS, which is HTTP over an encrypted (TLS) connection, meaning eavesdroppers cannot read your traffic.

See https://avd.aquasec.com/misconfig/avd-aws-0054
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 main.tf:36
   via main.tf:34-48 (aws_lb_listener.frontend_http_tcp_ffdb4db32d85be4b5cd7539e4d3c6d16)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ..
  36 [  protocol = "HTTP"
  ..
  48   }
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

I found it odd that this was not found in the previous scan and after some testing this seems to be caused by using remote modules (again!). I also noticed that running tfsec instead of Trivy will catch this issue. When using local modules Trivy can right away pick up this finding. Right now it seems necessary to scan both your plan and your configuration for the best accuracy, but I might revisit this in the future and see if the issue is resolved given the active discussion around remote modules support in Trivy.

Luckily, we can work around this also by scanning everything at once, so generate the plan and then scan the entire folder (then Trivy processes the configuration AND the plan):

terraform plan --out tf.plan
terraform show -json tf.plan > tfplan.json
trivy config . --skip-dirs '**/examples'

Trivy is smart enough to not duplicate the findings - awesome! With Tfsec it was not possible to scan Terraform plans, so this is great if scanning the plan fits your workflow better.

As we saw above, scanning the Terraform plan is more accurate than scanning just the files, but the downside is that you need to generate a plan for each Terraform root module and it takes much more time to generate a plan prior to running Trivy. You also need to be able to connect and authenticate to the providers for Terraform to generate the plan, although that’s not typically an issue.

One more thing to note about plans is that your inline #trivy:ignore comments will be ignored since that information will not make it into the plan, so if you are using plans primarily for your scanning, then you might need to get comfortable defining the Rego ignore policies instead.

Running Trivy in CI

Including Trivy scans in your IaC repositories’ CI pipelines is a must. If you don’t have CI pipelines for your IaC… Well you should! Trivy offers integrations with many CI/CD tools, IDEs and other systems, see the documentation for an up-to-date list.

When using Trivy in CI it’s wise to use a configuration file instead of the command line flags, this makes it easy to reproduce the scan using same configuration locally if you need to investigate some new findings. If you are using GitHub Actions, there’s an official Action that you can use to integrate Trivy into your CI pipeline, here’s a simple example which uses a configuration file:

name: build
on:
  push:
    branches:
    - main
  pull_request:
jobs:
  build:
    name: Build
    runs-on: ubuntu-20.04
    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Run Trivy vulnerability scanner in fs mode
      uses: aquasecurity/trivy-action@master
      with:
        scan-type: 'fs'
        scan-ref: '.'
        trivy-config: trivy.yaml

As seen in the above example, in CI you likely want to run the fs scan which includes by default all the scanners, meaning Trivy will also scan for secrets and vulnerabilities, not only for misconfigurations.

However, keep in mind this excellent blog post by my colleague Thierry. The Trivy action only really wraps the GitHub workflow YAML inputs to CLI flags. If you are using another CI/CD system, you can simply install and invoke the CLI as well, making transition between CI/CD tools extremely simple.

Bonus for GitHub Users

If you have an open-source project in GitHub or you pay GitHub for the advanced security features, then you can also upload the Trivy scan results into the GitHub code scanning which you should be using if you are not already. Refer to the Trivy Action’s README to view a sample configuration of uploading the results. This will help you to track the findings in addition to gatekeeping with a pipeline that must pass always before merging code (or however your team works).

Summary

In this blog post we rolled up our sleeves and looked into how to secure Terraform configuration by using static analysis. As an example and recommended tool we explored Trivy, but honestly the tool choice isn’t as important as the principle of integrating such checks into your workflow. I hope you can see the value of running a simple scan over your configuration. Thanks to extensive builtin checks in Trivy you can get actionable findings without spending time reviewing the configuration manually and magically knowing all the security intricacies of AWS infrastructure.

Demystifying Service Level acronyms and Error Budgets

Verifa crew — Wed, 12 Jun 2024 13:31:46 +0000

This was originally posted on Verifa's blog, written by Lauri Suomalainen

Availability, fault tolerance, reliability, resilience. These are some of the terms that pop up when delivering digital services to users at scale. Acronyms related to Service Levels tend to pop up as well. Most developers have at least seen SLA, SLO and SLI and some even know what they mean. However, based on personal experience, not many people who work in the intersection of writing, delivering and maintaining software necessarily know how to make use of them in their software delivery process.

In this fundamental level blog post I will explain what different Service Level concepts mean and how to use them effectively in the software delivery process.

I also did a talk at DevOps Finland on this topic, Service Levels, Error budgets, and why your dev teams should care.

What is a Service Level and why does it matter?

Depending on the source, I have seen claims that anywhere from 40% to a whopping 90% of a software systems lifetime costs consist of operational and maintenance costs, making the development costs of the software pale in comparison. Additionally, costs of even short service breaks and unplanned downtime are significant and getting more expensive still. This goes to show that being able to maintain your service availability and preferably being able to preemptively react to service degradation is not just a matter of convenience, but carries a very real price tag with business consequences.

Service Level embodies the overall performance of your software system. It consists of goals that, when met, indicate that your system is performing at the desired level, and measurements which tell if those goals are being met, exceeded or if the system is underperforming. There are three distinct concepts associated with the Service Level: Service Level Agreement (SLA), Service Level Objective (SLO) and Service Level Indicator (SLI).

Service Level Agreements

Service Level Agreements are the base level of performance and functionality you promise to your users, be those paying customers or developers using your internal tooling platform and databases. Typically SLAs are seen to relate to service availability and is expressed as a percentage like 99,9 (colloquially ‘three-nines’, ‘four-nines’ for 99,99% and so on), but soon we will see that this is a simplification. Especially in public cloud computing, failing to meet a set SLA carries a contractual penalty for the provider and a compensation to their clients, such as refunds or discounts, so it is in a provider’s best interest to react preemptively when the software system shows signs of degradation. That brings us to Service Level Objectives (SLOs).

Service Level Objectives

Service Level Objectives are goal values you set for your software system. They are not contractually bound like SLA values, but they still define the minimum baseline for your software system to be considered functional. It is a good practice to set SLOs slightly stricter than the thresholds defined in your SLAs; if your SLA promises 99,5% availability, 99,7% SLO gives you some leeway to fix problems in your software before they manifest for your users and start incurring sanctions. Obviously, you want to detect the symptoms before you start violating your SLOs and you do that by monitoring and measuring your Service Level Indicators (SLIs).

Service Level Indicators

Service Level Indicators are metrics you collect about your software system’s health. Often coined under general term of ‘availability’, SLIs are specific technical measurements the system produces. However, what constitutes of availability and unavailability varies from system to system. Straightforward simplifications on availability such as ‘my server is on and reachable 99,9% of the time’ can easily hide symptoms of a badly behaving system. SLIs should be values that actually matter to the users and their experience with the software system.

What to measure and how?

So, general ‘availability’ is not a good enough metric. Why?

A very heavy-handed example would be an Industrial Control System (ICS) that is only used during the day when the work is done on the factory floor. However, there’s a bug in this ICS that causes random disconnections and freezes when a certain load is reached in the system. This happens frequently during the day, but never outside working hours. If you would only monitor for server health or network connectivity (instead of HTTP error codes for example), your metrics would never reveal the problem affecting your users. In this scenario it does not matter if your SLO is met as it does not give you insight to how your users interact with the systems and how they experience using it. In the worst case scenario your SLI is just plain wrong, but even a good SLI may hide bad behaviour if the measurement window is too wide.

A simplistic way to measure availability is to look at the ‘good time’ your system experiences divided by total time. In the example above, you could have a health check or a liveness probe periodically checking on the server health and everything would look fine based on it. A more refined approach to measuring availability would be to measure the ratio of good interactions against the total number of interactions. Metrics like latency, error ratio, throughput and correctness might matter more to your users than just raw liveness. Server availability is the basic requirement, serving requests correctly and in a timely manner is what brings value.

As always with complex systems, there is no silver bullet to choosing correct SLIs. In some cases we could for example tolerate some number of false positives or incomplete data as long as we get it fast whereas in other cases we could be willing to tolerate a system with notable latency or subpar throughput if we can be sure that the data we get is always correct.

When you have identified your SLIs, you have to set SLOs and SLAs. As a rule of thumb, every system breaks somehow sometime. Even if you managed to build an infallible system, external forces like network congestion and hardware failure could hinder your performance. That’s why it is unrealistic to aim for 100% availability. The goals you set for your system are also not static. When developing and launching new software, you probably want to set your goals modestly for starters. As you gain more data on user interactions and loads your system experiences, you can re-evaluate the goals while you keep improving the software. This brings us to our next topic.

Why should you care about service level in software development?

I have said it before and I will say it again: software development is a customer service job. No commercial software system exists for its own sake. There are end users, your clients, who get something out of the software you build and your software should meet their needs constantly if you want to succeed. While user research might tell you what features you should build next, your service level tells how the features you already built are performing. With the ‘you build it, you run it’ approach becoming more prevalent within the industry, maintaining existing products increasingly becomes an exercise in the realms of software development processes rather than just an operational task. Best of all, monitoring your service levels allows you to make data-driven decisions when working on your software system.

I had an interesting discussion about service levels with a colleague who is managing a software team in a product company. I asked if they had SLAs and SLOs in place and he assured me they do. I also queried about their working practices and he told me they work in two week sprints building new features, but every now and then, usually after major releases, they have so-called ‘cooldown’ sprints where they work on improving the existing code base, refactoring and erasing technical debt. I said that’s just great, fantastic even. Technical debt will stifle the productivity and development speed in the long run, so I applaud any formal efforts taken trying to fight it.

Then I asked a few harder questions that revealed some room for improvement. The first one was: “What do you do if your SLOs are not being met?” He told me, that their SLOs were regarded more like key performance indicators: something they should strive for but is not actively acted upon. The second question I asked was: “How do they determine when to have a cooldown sprint”. From the answer I deduced that the decision was made somewhat at whim and when the feature backlog was not actively bursting from the seams with high priority stuff.

My main gripe with these answers is that breaking an SLO should always warrant action. If there are no procedures tied to it, an SLO becomes hollow fluff. That does not mean you should treat all SLO violations as major incidents; it is as unrealistic to expect 100% availability as it is to meet SLOs 100% of the time. Failing to meet an SLO should at least cause the software development team to stop and consider if they should prioritise their work or, say, have a cooldown sprint.

Enter error budgets. It took a while to get here from the title. One could say that error budgets are a tool and indicator on how much you can… muck around before you have to start finding out. But what are they and how do they work?

How to use error budgets in software development?

Consider your service has some availability SLO of 95% tied to a monthly aggregated SLI (which, as a side note, allows for terribly long outages of 1,52 days per month!). Now you are doing some top notch software development and consistently manage to achieve 97% availability (your software is uncooperative only some 43 minutes each day…). That means, you have 97%-95%=2% budget to do risky stuff that can break your software before you are breaking your SLO. In minutes, that is an additional 28,8 on top of the current downtime.

Now talking about doing risky things in the software development context might evoke thoughts about deploying very experimental features, prototypes or even untested changes (and if you considered that, it’s OK. It is called an intrusive thought and everyone has them), but these are quite extreme examples. One should bear in mind, that in software development any change carries inherent risk in complex interconnected systems. You can use error budgets to release more frequently and with more confidence. If you do canary deployments or A/B testing, you can roll out new features faster to a wider audience because your error budget gives you this leeway. You could plan and perform maintenance breaks knowing you will not violate your SLOs. I think one of the most important things is that you get a data-driven indicator which allows you to make informed choices when balancing between system reliability and innovating new features.

Building on the previous example, consider you introduce a new change to your software system. Everything seems fine until in a couple of days you find out that your daily downtime has gone from 43 minutes to some 58 minutes. You realise that the feature you shipped has caused some extra instability in your system and that this single feature just made a dent to your availability: from 97% to 96%. You are still not violating the SLO, but just this new feature is now taking 50% of your error budget, leaving you with less freedom to develop new features. If your outage time would have gone over 72 minutes per day, the error budget would show that you will run out of it before the end of the month: time to immediately switch over to maintenance mode before our end users start complaining!

Now you are sitting there with your error (and in a sense, development) budget cut in half, gnashing your teeth, even realising that maybe 95% SLO is not that high and some improvements must be made. What can be done before we spend the rest of our budget? That is when you should realise, that the error budget is there for you to spend! You look at your gutted budget and realise that even if you would not optimise the feature you just shipped, you could still afford 14 minutes and 24 seconds a day, or a whopping 7,2 hours per month downtime without breaking your SLOs. Encouraged, you and your team of developers and operations people (hopefully a somewhat overlapping group) can schedule a safe and informed downtime where you perform some much needed reliability improvements.

In conclusion

When building and serving software you care both about evolving it, but also about its availability and reliability. Focusing too much on the former can result in software robust on features, but brittle in architecture and maintainability, eventually slowing down the development as the majority of time is spent firefighting yet another failure. Focusing too much on the latter grinds the development to a halt as the best way to ensure reliability is to avoid making changes.

Using Service Level Agreements, Objectives and Indicators and Error Budgets effectively in your software development process enables you to strike the right balance between change versus stability. They define common goals to your developers and operations, promoting co-operation and data-driven decision making. They give your teams more ownership and agenda over the products they build and make it easier to react to problems before they can take effect.

I also did a talk at DevOps Finland on this topic, Service Levels, Error budgets, and why your dev teams should care.

How to assume an AWS IAM role from a Service Account in EKS with Terraform

Verifa crew — Wed, 15 May 2024 12:45:15 +0000

This was originally posted on Verifa's blog, written by Jacob Lärfors

When working with AWS Elastic Kubernetes Service (EKS) clusters, your pods will likely want to interact with other AWS services and possibly other EKS clusters. In a recent project we were setting up ArgoCD with multiple EKS clusters and our goal was to use Kubernetes Service Accounts to assume an AWS IAM role to authenticate with other EKS clusters. This led to some learning and discovering that we'd like to share with you.

When running workloads in EKS, the running pods will operate under a service account which allows us to enforce RBAC within a Kubernetes cluster. Well, we are not going to talk more about that in this post, we want to talk about how we can do things outside of our cluster and interact with other AWS services. The AWS documentation for this is fairly good if you want a reference point. There is also a workshop on this topic which might be useful to run through.

In our case, we were trying to communicate across EKS clusters to allow ArgoCD to manage multiple clusters and there is a pretty mammoth GitHub issue with people struggling (and succeeding!) with this. That GitHub issue partly inspired this blog post - if it was an easy topic people would not struggle and a blog would not be necessary ;)

The Plan

The simple breakdown of what we need:

An EKS cluster with an IAM OIDC provider
A Kubernetes Service Account in the EKS cluster
An AWS IAM role which we are going to assume (meaning we can do whatever that role is able to do)
An AWS IAM role policy that allows our Service Account (2.) to assume our AWS IAM role (3.)

We won't bore you with creating an EKS cluster and an IAM OIDC provider... Pick your poison for how you want to do this... We personally use Terraform and the awesome EKS module that has a convenient input enable_irsa which creates the OIDC provider for us.

Basic Deployment with Terraform

Before we create the Service Account and the IAM role we need to define the names of these as there's a bit of a cyclic dependency - the Service Account needs to know the role ARN, and the role policy needs to know the Service Account name and namespace (if we want to limit scope, which we do!).

Locals

So let's define some locals to keep things simple and DRY.

# locals.tf

locals {
  k8s_service_account_name      = "iam-role-test"
  k8s_service_account_namespace = "default"

  # Get the EKS OIDC Issuer without https:// prefix
  eks_oidc_issuer = trimprefix(data.aws_eks_cluster.eks.identity[0].oidc[0].issuer, "https://")
}

IAM

And let's define the Terraform code that creates the IAM role with a policy allowing the service account to assume that role.

# iam.tf

#
# Get the caller identity so that we can get the AWS Account ID
#
data "aws_caller_identity" "current" {}

#
# Get the EKS cluster we want to target
#
data "aws_eks_cluster" "eks" {
  name = "<cluster-name>"
}

#
# Create the IAM role that will be assumed by the service account
#
resource "aws_iam_role" "iam_role_test" {
  name               = "iam-role-test"
  assume_role_policy = data.aws_iam_policy_document.iam_role_test.json
}

#
# Create IAM policy allowing the k8s service account to assume the IAM role
#
data "aws_iam_policy_document" "iam_role_test" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type = "Federated"
      identifiers = [
        "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${local.eks_oidc_issuer}"
      ]
    }

    # Limit the scope so that only our desired service account can assume this role
    condition {
      test     = "StringEquals"
      variable = "${local.eks_oidc_issuer}:sub"
      values = [
        "system:serviceaccount:${local.k8s_service_account_namespace}:${local.k8s_service_account_name}"
      ]
    }
  }
}

Service Account and Pod

Then we need to create some Kubernetes resources. When working with Terraform it can make a lot of sense to use the Terraform Kubernetes Provider to apply our Kubernetes resources, especially as Terraform knows the ARN of the role and we can reuse our locals. However, if you don't want yet another provider dependency in Terraform you can easily do this with vanilla Kubernetes.

Terraform Kubernetes

NOTE: you will need to configure the Kubernetes provider if you want to do this via Terraform

# kubernetes.tf

#
# Create the Kubernetes service account which will assume the AWS IAM role
#
resource "kubernetes_service_account" "iam_role_test" {
  metadata {
    name      = local.k8s_service_account_name
    namespace = local.k8s_service_account_namespace
    annotations = {
      # This annotation is needed to tell the service account which IAM role it
      # should assume
      "eks.amazonaws.com/role-arn" = aws_iam_role.iam_role_test.arn
    }
  }
}

#
# Deploy Kubernetes Pod with the Service Account that can assume an AWS IAM role
#
resource "kubernetes_pod" "iam_role_test" {
  metadata {
    name      = "iam-role-test"
    namespace = local.k8s_service_account_namespace
  }

  spec {
    service_account_name = local.k8s_service_account_name
    container {
      name  = "iam-role-test"
      image = "amazon/aws-cli:latest"
      # Sleep so that the container stays alive
      # #continuous-sleeping
      command = ["/bin/bash", "-c", "--"]
      args    = ["while true; do sleep 5; done;"]
    }
  }
}

Vanilla Kubernetes

And now the same as above with vanilla Kubernetes YAML.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: iam-role-test
  namespace: default
  annotations:
    # TODO: replace ACCOUNT_ID with your account id
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/iam-role-test
---
apiVersion: v1
kind: Pod
metadata:
  name: iam-role-test
  namespace: default
spec:
  serviceAccountName: iam-role-test
  containers:
    - name: iam-role-test
      image: amazon/aws-cli:latest
      # Sleep so that the container stays alive
      # #continuous-sleeping
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 5; done;"]

Verify the setup

We can describe the pod (i.e. kubectl describe pod iam-role-test) and check the volumes, mounts and environment variables attached to the pod, but seeing as we way launched a pod with the AWS CLI, let's just get in there and check! Exec into the running container and execute the aws CLI:

# Exec into the running pod
kubectl exec -ti iam-role-test -- /bin/bash

# Check the AWS Security Token Service identity
bash-4.2# aws sts get-caller-identity
{
    "UserId": "AROA46FON4H773JH4MPJD:botocore-session-1637837863",
    "Account": "123456789101",
    "Arn": "arn:aws:sts::123456789101:assumed-role/iam-role-test/botocore-session-1637837863"
}

# Check the AWS environment variables
bash-4.2# env | grep "AWS_"
AWS_ROLE_ARN=arn:aws:iam::<ACCOUNT_ID>:role/iam-role-test
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=eu-west-1
AWS_REGION=eu-west-1

As you can see, the AWS Service Token Service (STS) confirms that we have successfully assumed the role we wanted to! And if we check our environment variables we can see that these have been in injected when we started the pod, and the AWS_WEB_IDENTITY_TOKEN_FILE file is the part that is sensitive and mounted when we run the container.

Alternative Approaches

If we remove the service account from the pod and use the default service account (which exists per namespace), we can see who AWS STS thinks we are:

# Exec into the running pod
kubectl exec -ti iam-role-test -- /bin/bash

# Check the AWS Security Token Service identity
bash-4.2# aws sts get-caller-identity
{
    "UserId": "AROA46FON4H72Q3SPL6SC:i-0d0aff479cf2e2405",
    "Account": "123456789101",
    "Arn": "arn:aws:sts::<ACCOUNT_ID>:assumed-role/<cluster-name>XXXXXXXXX/i-<node-instance-id>"
}

# Check the AWS environment variables
bash-4.2# env | grep "AWS_"
# ... it's empty!

Having created the cluster with the EKS Terraform module, it has created a role for our autoscaling node group... and what we could do is allow this role to assume another role which grants us the access that we might need...

However, I find managing Service Accounts in Kubernetes much easier than assigning roles to node groups and it seems this is the recommended approach based on searches online. However I thought it worth mentioning here anyway.

Next Steps

Now that you can run a Pod with a service account that can assume an IAM role, just give that IAM role the permissions it needs to do what you want.

In our case, we needed the role to access another EKS cluster and so the role does not need any more policies in AWS, but it needs to be added to the aws-auth ConfigMap that controls the RBAC of the target cluster. But let's not delve into that here, there's already a ton of posts around that :)

To add a cluster in ArgoCD you can either use the argocd CLI, or do it with Kubernetes Secrets. Of course we did it with Kubernetes Secrets, and of course we did that with Terraform after we create the cluster!

#
# Get the target cluster details to use in our secret
#
data "aws_eks_cluster" "target" {
  name = "<cluster_name>"
}

#
# Create a secret that represents a new cluster in ArgoCD.
#
# ArgoCD will use the provided config to connect and configure the target cluster
#
resource "kubernetes_secret" "cluster" {
  metadata {
    name      = "argocd-cluster-name"
    namespace = "argocd"
    labels = {
      # Tell ArgoCD that this secret defines a new cluster
      "argocd.argoproj.io/secret-type" = "cluster"
    }
  }

  data = {
  # Just a display name
    name   = data.aws_eks_cluster.target.id
    server = data.aws_eks_cluster.target.endpoint
    config = jsonencode({
      awsAuthConfig = {
        clusterName = data.aws_eks_cluster.target.id
        # NOTE: roleARN not needed as ArgoCD will already assume the role that
        # has access to the target cluster (added to aws-auth ConfigMap)
      }
      tlsClientConfig = {
        insecure = false
        caData   = data.aws_eks_cluster.target.certificate_authority.0.data
      }
    })
  }

  type = "Opaque"
}

And boom! We can now create EKS clusters with Terraform and register them with ArgoCD using our Service Account that can assume an AWS IAM Role that is added to the target cluster RBAC... Sometimes it's confusing just to write this stuff, but happy Terraform, Kubernetes and AWS'ing (and GitOps'ing with ArgoCD perhaps)!

Useful Links

AWS EKS IAM Roles for Service Accounts (IRSA): https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
Workshop on IAM Roles for Service Accounts (IRSA): https://www.eksworkshop.com/beginner/110_irsa/preparation/
Create AWS IAM OIDC Provider for EKS: https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html
Managing users or IAM roles in EKS: https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html
AWS EKS Terraform Module: https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest
ArgoCD: https://argo-cd.readthedocs.io/en/stable/
Related GitHub issue on ArgoCD: https://github.com/argoproj/argo-cd/issues/2347

How to Debug Failing Build Agent Pods in Kubernetes-enabled Jenkins

Verifa crew — Fri, 03 May 2024 08:59:31 +0000

This was originally posted on Verifa's blog, written by Andreas Lärfors

Running Jenkins in a Kubernetes cluster is a great way to enable auto-scaling of the infrastructure hosting your Jenkins Build Agents. However, when the Build Agent Pods fail to start correctly, it can be difficult to troubleshoot. In this short article we look at some simple things you can do to figure out what's going wrong.

Our Jenkins system is deployed on Kubernetes, with both Jenkins Master and the Jenkins Build Agents running as Pods. We recently had an issue where Build Agent Pods were not starting correctly. Because of our auto-scaling setup, this led to hundreds or thousands of Build Agent Pods being created in the cluster, and the infrastructure (Nodes) scaling up accordingly. This was first discovered when we reached our Cost Cap limit on our EKS cluster. Cleaning up required the deletion of around 7,000 Pods.

It was clear that we had to identify the cause of this issue. The cause itself is not that interesting in itself, but the method of debugging is, and might help you if you are facing similar issues.

Step 0: Inspect your Jenkins Build Logs

You've probably already done this, but you should of course start by looking at the build logs in Jenkins. Here's what ours were saying:

15:58:40 Started by user andreasverifa

15:58:40 Running in Durability level: MAX_SURVIVABILITY

15:58:40 [Pipeline] Start of Pipeline

15:58:41 [Pipeline] podTemplate

15:58:41 [Pipeline] {

15:58:41 [Pipeline] node

15:58:47 Created Pod: kubernetes staging/example-projects-andreas-test-1-x34f1-qzgqg-3cggn

15:58:56 Still waiting to schedule task

15:58:56 All nodes of label 'Example-Projects_andreas-test_1-x34f1' are offline

15:58:57 Created Pod: kubernetes staging/example-projects-andreas-test-1-x34f1-qzgqg-1yvt8

15:59:07 Created Pod: kubernetes staging/example-projects-andreas-test-1-x34f1-qzgqg-l3t8l

15:59:17 Created Pod: kubernetes staging/example-projects-andreas-test-1-x34f1-qzgqg-w784y

Oh dear, a new Pod created every 10 seconds. It's easy to see why our cluster was filling up with failed Pods.

Step 1: Inspect your Pods

So why are the Pods failing to start? Let's (kubectl) describe them:

$ kubectl describe pod example-projects-andreas-test-...

Containers:
  jnlp:
    Container ID:   ...
    Image:          jenkins/inbound-agent:4.3-4-jdk11
    Image ID:       ...
    Port:           <none>
    Host Port:      <none>
    State:          Failed
      Started:      ...
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  256Mi
    Environment:
      ...
    Mounts:
      ...
  build:
    Container ID:   ...
    Image:          ...
    Image ID:       ...
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      ...
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        600m
      memory:     768Mi
    Environment:  <none>
    Mounts:
      ...

Our Pipeline defines a Pod that contains two Containers:

"jnlp" - the container which runs the Jenkins Agent
"build" - the container in which we build our software and run the pipeline steps

From the description above we can see that the jnlp container has Failed, so let's get the logs and investigate why:

$ kubectl logs example-projects-andreas-test-... jnlp

INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name:
...
Jun 18, 2021 12:04:41 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: None of the protocols were accepted
java.lang.Exception: The server rejected the connection: None of the protocols were accepted

From these logs we can see that the connection from the jnlp Agent Container to the Jenkins Master failed. And the reason for the failure is that the server (Master) rejected the connection.

The next step in this process is in hindsight unnecessary, but it documents our troubleshooting process and might be useful for you.

Step 2: Manually Override Pods (Optional)

We've established that the jnlp Container fails to connect to the Jenkins Master. One of the issues with containerized systems is that we often cannot debug what is going wrong on startup without modifying the container. That's what we're going to do now.

Our Jenkins build containers are defined as inline-YAML in our pipeline scripts. Let's take an example pipeline and make the jnlp container sleep for 10000 seconds instead of completing the normal startup routine (where the connection would usually fail):

pipeline {
    agent {
        kubernetes {
            yaml '''
apiVersion: v1
kind: Pod
metadata:
  name: andreas-test
spec:
  containers:
  - name: jnlp
    image: jenkins/inbound-agent:4.3-4-jdk11
    command: ["sleep", "10000"]
'''

        }
    }

    stages {
        stage('Say Hello World') {
            steps {
                container("jnlp") {
                    echo "Hello World!"
                }
            }
        }
    }
}

On startup, the jnlp container does not register with the Jenkins Master (as we've overridden the default startup/entrypoint with the command value). So the Jenkins Job finds no available Nodes and continues spawning Pods... So let's cancel the Jenkins Job.

But we are now left with at least one Pod which is in state "Running" instead of "Failed". This allows us to debug the connection process.
Let's exec into the running container and try running the entrypoint. We can determine the entrypoint by looking at the Dockerfile for this image, for example:

ENTRYPOINT ["/usr/local/bin/jenkins-agent"]

$ kubectl exec -it example-projects-andreas-test-... -- /bin/sh

$ sh -x /usr/local/bin/jenkins-agent
...
exec /usr/local/openjdk-11/bin/java -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless -tunnel jenkins-agent:50000 -url http://jenkins:8080/ -workDir /home/jenkins/agent ... example-projects-andreas-test-...
...

Ok, there's our connect command! We can now run this manually to test the connection.

We have seen that the server is refusing the connection because of "Unknown client name", which suggests the server is not expecting the client (jnlp Agent) to connect.

We manually added a Node in the Jenkins GUI with the name of our Pod. Once this was done, running the entrypoint/Java connect command succeeded in connecting the jnlp Agent to the Jenkins Master!

So we have now deduced that the Agent is failing to connect to the Master because the Master is not expecting the Agent. This suggests that the Kubernetes plugin is failing to "register" the Pod with the Master.

Step 3: Inspect Kubernetes Plugin Logs

We probably should have jumped straight from Step 1 to Step 3. However, Step 2 helped confirm that there are no external causes for why the Agent is failing to connect to the Master. It is just conditions in Jenkins which are causing the issue.

Suspecting the Kubernetes plugin, we want to see the logs of the Kubernetes plugin. The Jenkins plugin page for the Kubernetes plugin contains the following information:

For more detail, configure a new Jenkins log recorder for org.csanchez.jenkins.plugins.kubernetes at ALL level.

Once done, we could see the following output in the new log recorder:

Created Pod: kubernetes staging/example-projects-andreas-test-...
... WARNING org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Error in provisioning; agent=KubernetesSlave name: example-projects-andreas-test-...
java.lang.NoSuchMethodError: 'java.lang.Object io.fabric8.kubernetes.client.dsl.PodResource.watch(java.lang.Object)'
  at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:170)
  at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:294)
  at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
  at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:829)

Aha! "NoSuchMethodError" is typical of version incompatibilities between plugin dependencies!

Conclusion

Skipping over some details, we concluded that the error is caused by a version conflict; the "kubernetes" pluginhas a dependency on the "kubernetes-client-api" plugin. We are using fixed version numbers for our plugins, and our kubernetes plugin version was fixed at 1.29.2. Looking at the pom.xml for the source of this version, we could see that the dependency of the kubernetes-client-api plugin was for version 4.13.2-1.

Our Jenkins image had recently been updated and this seemed to have updated all the non-fixed versions of plugins, i.e. the transitive dependencies. Upon inspection, we were running version 5.4.1 of the kubernetes-client-api plugin!

The quick solution was to upgrade the kubernetes plugin to the latest version, 1.30.0. This immediately resolved the issue and jobs began running as normal again.

The long-term solution to this problem, to prevent it from occurring again, is to build our own tagged & versioned Jenkins image. Each change then means incrementing the tag/version, and so we can more easily roll back changes. This is, in other words, a "complete" Jenkins image, which is bundled with the specific plugin versions we want to use.