<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Lucien Boix</title>
    <description>The latest articles on Forem by Lucien Boix (@lboix).</description>
    <link>https://forem.com/lboix</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F265599%2Fe696ad61-bae6-4fb0-81e6-d567b3ce3530.jpg</url>
      <title>Forem: Lucien Boix</title>
      <link>https://forem.com/lboix</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lboix"/>
    <language>en</language>
    <item>
      <title>AWS IAM : how to list unused access keys in your account</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Tue, 24 Sep 2024 20:06:34 +0000</pubDate>
      <link>https://forem.com/lboix/aws-iam-how-to-list-unused-access-keys-in-your-account-3kcf</link>
      <guid>https://forem.com/lboix/aws-iam-how-to-list-unused-access-keys-in-your-account-3kcf</guid>
      <description>&lt;p&gt;You have two options here.&lt;/p&gt;

&lt;p&gt;The best one is to activate the IAM &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access-analyzer-getting-started.html" rel="noopener noreferrer"&gt;unused access analyzer&lt;/a&gt;, if you are willing to pay around 50 USD monthly for this service.&lt;br&gt;
It constantly scans your whole IAM section and lists warning events such as unused roles, unused permissions, unused passwords and, what interests us most here, unused access keys.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can even use EventBridge to be notified about these findings through an email or a Lambda (one that writes to your Slack channel, for example)&lt;/li&gt;
&lt;li&gt;Or simply add this check to your morning routine at work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise you can launch the simple bash script I made &lt;a href="https://gist.github.com/lboix/f7981a3e573d110fbc01da99b9500a1a" rel="noopener noreferrer"&gt;here&lt;/a&gt;: it lists the active access keys that have not been used for more than 90 days.&lt;br&gt;
You can confidently start by deactivating them, then remove them after a few days.&lt;/p&gt;
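&lt;p&gt;The gist linked above is a bash script around the AWS CLI; the core staleness check can be sketched in Python like this (an illustrative sketch, not the author's script: the &lt;code&gt;keys&lt;/code&gt; mapping stands in for the output of &lt;code&gt;aws iam get-access-key-last-used&lt;/code&gt;):&lt;/p&gt;

```python
from datetime import datetime, timezone

STALE_AFTER_DAYS = 90

def stale_keys(keys, now=None):
    """Return the access key ids not used for more than STALE_AFTER_DAYS days.

    `keys` maps an access key id to its last-used datetime
    (None if the key was never used, which we also treat as stale).
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for key_id, last_used in keys.items():
        # range(STALE_AFTER_DAYS) holds 0..89, so "not in range" means 90 days or more
        if last_used is None or (now - last_used).days not in range(STALE_AFTER_DAYS):
            stale.append(key_id)
    return stale

# Example: one recently used key, one key untouched for half a year
keys = {
    "AKIA-FRESH": datetime(2024, 9, 1, tzinfo=timezone.utc),
    "AKIA-OLD": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
print(stale_keys(keys, now=datetime(2024, 9, 24, tzinfo=timezone.utc)))  # ['AKIA-OLD']
```

The deactivation itself stays an AWS CLI call (&lt;code&gt;aws iam update-access-key --status Inactive&lt;/code&gt;), as in the gist.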

</description>
      <category>aws</category>
      <category>iam</category>
    </item>
    <item>
      <title>Datadog : how to filter metrics on tag "team"</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Tue, 17 Sep 2024 18:57:27 +0000</pubDate>
      <link>https://forem.com/lboix/datadog-how-to-filter-metrics-on-tag-team-2f83</link>
      <guid>https://forem.com/lboix/datadog-how-to-filter-metrics-on-tag-team-2f83</guid>
      <description>&lt;p&gt;We created a Datadog dashboard to monitor, across our organization, basic metrics about the health of our apps: error logs by service, Kubernetes container restarts, APM errors by service, etc.&lt;/p&gt;

&lt;p&gt;A few weeks ago, I wanted to add a "Team" filter to it: the goal was to help our different teams use it during their "morning routine" (a daily check of their application metrics).&lt;/p&gt;

&lt;p&gt;It turned out to be more challenging than I expected, but with the help of Datadog Support we managed to figure it out. I am sharing this knowledge here in case it helps you achieve the same goal.&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The most important link is this one, listing all the available metrics in your Datadog account:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://app.datadoghq.com/metric/summary" rel="noopener noreferrer"&gt;https://app.datadoghq.com/metric/summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If in the "Tags" section of your metric (the one you want to use in your dashboard) you do not see the one you want to filter on (in our case "team"), then it means it has not been propagated correctly.&lt;/p&gt;

&lt;p&gt;We discovered with Datadog Support that the tagging is different given the nature of the metric you want to use in your dashboard.&lt;/p&gt;

&lt;p&gt;Here are the 3 usecases we identified, but first let's do some preparation if you plan to filter your metrics on teams.&lt;/p&gt;

&lt;h2&gt;Preparation&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;make sure all your different teams are described here: &lt;a href="https://app.datadoghq.com/teams" rel="noopener noreferrer"&gt;https://app.datadoghq.com/teams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;make sure all of your services have the right team assigned: &lt;a href="https://app.datadoghq.com/services" rel="noopener noreferrer"&gt;https://app.datadoghq.com/services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;make sure all your pods have a "team" label defined in their Deployment or DaemonSet manifest:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  template:
    metadata:
      labels:
        team: your-team-name
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;Logs metric&lt;/h2&gt;

&lt;p&gt;Make sure your Datadog agent has this environment variable in its configuration: it will map the "team" label of your pods to the "team" tag on the metrics collected from them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: DD_KUBERNETES_POD_LABELS_AS_TAGS
  value: '{"team":"team"}'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Important: if you use a &lt;strong&gt;custom&lt;/strong&gt; logs metric in your dashboard (meaning it is defined &lt;a href="https://app.datadoghq.com/logs/pipelines/generate-metrics" rel="noopener noreferrer"&gt;here&lt;/a&gt;), then edit it and make sure to add the "team" tag in the "Group By" section like this:&lt;br&gt;
&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4gkjjq4v868gc5qarqxw.png" alt="Image description"&gt;&lt;br&gt;
Click on "Update Metric".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wait and see if it populates correctly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://app.datadoghq.com/logs" rel="noopener noreferrer"&gt;https://app.datadoghq.com/logs&lt;/a&gt; (click on one recent log to see its details)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.datadoghq.com/metric/summary" rel="noopener noreferrer"&gt;https://app.datadoghq.com/metric/summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Kubernetes metric&lt;/h2&gt;

&lt;p&gt;If you want to filter your &lt;code&gt;kubernetes_state.container.*&lt;/code&gt; metrics for example, make sure this option is activated in your Datadog &lt;strong&gt;Cluster&lt;/strong&gt; agent configuration.&lt;/p&gt;

&lt;p&gt;If you set it up through &lt;a href="https://github.com/DataDog/helm-charts/blob/bc09ff3950999aeea1ee142e055b6be452902feb/charts/datadog/values.yaml#L194" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datadog:
  kubeStateMetricsCore:
    labelsAsTags:
      pod:
        team: team
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;If you set it up manually through a YAML manifest, make sure to update this &lt;a href="https://github.com/DataDog/datadog-agent/blob/main/Dockerfiles/manifests/kubernetes_state_core/cluster-agent-confd-configmap.yaml#L38" rel="noopener noreferrer"&gt;ConfigMap&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;labels_as_tags:
  pod:
    team: team
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Wait and see if it populates correctly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://app.datadoghq.com/metric/summary" rel="noopener noreferrer"&gt;https://app.datadoghq.com/metric/summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;APM metric&lt;/h2&gt;

&lt;p&gt;If you want to filter on &lt;code&gt;trace.servlet.request.errors.by_http_status&lt;/code&gt; for example, you will need to add this environment variable to your Datadog agent configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: DD_APM_FEATURES
  value: 'enable_cid_stats'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Then go &lt;a href="https://app.datadoghq.com/apm/settings" rel="noopener noreferrer"&gt;here&lt;/a&gt; and "Aggregate APM metrics" by "team" like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwqftztjkkby2uf7r5b0.png" alt="Image description"&gt;&lt;/p&gt;

&lt;p&gt;Wait and see if it populates correctly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://app.datadoghq.com/metric/summary" rel="noopener noreferrer"&gt;https://app.datadoghq.com/metric/summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Plugging the team filter into your dashboard&lt;/h2&gt;

&lt;p&gt;Finally it's time to use this new tag you populated! On the upper right of your dashboard, click on the &lt;strong&gt;+&lt;/strong&gt; ("Add Variable") and specify it like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pb2a8eefeglyset3yl5q.png" alt="Image description"&gt;&lt;/p&gt;

&lt;p&gt;Then edit all the sections to add it to the scope of the metric displayed in each one of them, and save, like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mrx8g736r0xcundfe3c7.png" alt="Image description"&gt;&lt;/p&gt;

&lt;p&gt;Hope it helped, have a great day and happy monitoring!&lt;/p&gt;

&lt;h2&gt;Sources&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/containers/kubernetes/tag/?tab=manualdaemonset#pod-labels-as-tags" rel="noopener noreferrer"&gt;https://docs.datadoghq.com/containers/kubernetes/tag/?tab=manualdaemonset#pod-labels-as-tags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/tracing/guide/setting_primary_tags_to_scope/?tab=kuberneteswithouthelm#container-based-second-primary-tags" rel="noopener noreferrer"&gt;https://docs.datadoghq.com/tracing/guide/setting_primary_tags_to_scope/?tab=kuberneteswithouthelm#container-based-second-primary-tags&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>datadog</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Testing Flux V2 (or migrating from Flux V1) : TLDR</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Wed, 11 Oct 2023 19:51:17 +0000</pubDate>
      <link>https://forem.com/lboix/testing-flux-v2-or-migrating-from-flux-v1-tldr-3ih5</link>
      <guid>https://forem.com/lboix/testing-flux-v2-or-migrating-from-flux-v1-tldr-3ih5</guid>
      <description>&lt;p&gt;Whether you are migrating from deprecated &lt;a href="https://github.com/fluxcd/flux"&gt;FluxV1&lt;/a&gt; or decided to go GitOps by testing &lt;a href="https://fluxcd.io/flux/"&gt;FluxV2&lt;/a&gt;, the existing documentation can be intimidating. You may want to quickly test FluxV2 without using the CLI and its default behaviour of bootstrapping a new repo containing your Flux setup.&lt;/p&gt;

&lt;p&gt;That was my case, so I created &lt;a href="https://github.com/lboix/flux2-lite"&gt;this repo&lt;/a&gt; to help you get hands-on quickly by providing the simplest manifest templates.&lt;/p&gt;

&lt;p&gt;I hope it will help you discover the GitOps philosophy and start great things following it!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
    </item>
    <item>
      <title>How many pods can run by default on an EKS node ?</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Wed, 12 Jul 2023 16:46:55 +0000</pubDate>
      <link>https://forem.com/lboix/how-many-pods-can-run-by-default-on-an-eks-node--3gh6</link>
      <guid>https://forem.com/lboix/how-many-pods-can-run-by-default-on-an-eks-node--3gh6</guid>
      <description>&lt;p&gt;As you know, in EKS each of your pods has a private IP assigned. This means that the maximum number of pods a node can handle is directly linked to the maximum number of &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html"&gt;ENIs&lt;/a&gt; (Elastic Network Interfaces) possible, and their IP addresses, for the EC2 instance type of the node you are using.&lt;/p&gt;

&lt;p&gt;I recently discovered that there are two hard limits applicable here, so I am sharing them in this small post in case it saves you some time.&lt;/p&gt;

&lt;p&gt;First take the number from &lt;a href="https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt"&gt;this file&lt;/a&gt; regarding your node instance type.&lt;/p&gt;

&lt;p&gt;If your node is inside a managed node group where the AMI is pinned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number above is your max&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your node is inside a managed node group where the AMI is NOT pinned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if the number above is &amp;lt; 110 then this is your max&lt;/li&gt;
&lt;li&gt;if the number above is &amp;gt; 110 and your instance type has less than 30 vCPU, then 110 is your max (explained &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html#:~:text=maximum%20number%20is-,110,-.%20For%20instances%20with"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;if the number above is &amp;gt; 110 and your instance type has more than 30 vCPU, then 250 is your max (explained &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html#:~:text=number%20jumps%20to-,250,-.%20These%20numbers%20are"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So technically you multiply this final number by the number of nodes in your cluster (assuming they all have the same instance type) and you get the maximum number of pods that can run inside your EKS cluster.&lt;/p&gt;

&lt;p&gt;If you want to double-check this value for a node, you can simply run these kubectl commands:&lt;br&gt;
&lt;code&gt;kubectl get nodes&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl describe node NODE_NAME | grep 'pods\|PrivateIP'&lt;/code&gt;&lt;/p&gt;
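&lt;p&gt;The decision tree above can be summed up in a small helper (an illustrative Python sketch: &lt;code&gt;eni_file_number&lt;/code&gt; is the value you read from the eni-max-pods.txt file for your instance type):&lt;/p&gt;

```python
def eks_max_pods(eni_file_number, vcpus, ami_pinned):
    """Max pods per node for an EKS managed node group.

    eni_file_number: the value from eni-max-pods.txt for the instance type
    vcpus: vCPU count of the instance type
    ami_pinned: True if the node group pins a specific AMI
    """
    if ami_pinned:
        # Pinned AMI: the ENI-based number applies as-is
        return eni_file_number
    # Unpinned AMI: EKS caps the value at 110 (less than 30 vCPUs)
    # or 250 (30 vCPUs or more); range(30) holds 0..29
    cap = 110 if vcpus in range(30) else 250
    return min(eni_file_number, cap)

# m5.large: 29 in eni-max-pods.txt, 2 vCPUs
print(eks_max_pods(29, 2, ami_pinned=False))    # 29
# m5.24xlarge: 737 in eni-max-pods.txt, 96 vCPUs
print(eks_max_pods(737, 96, ami_pinned=False))  # 250
```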

</description>
      <category>kubernetes</category>
      <category>aws</category>
    </item>
    <item>
      <title>A CloudFront Function to remove a specific value at the beginning of an URL</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Wed, 25 Jan 2023 22:48:38 +0000</pubDate>
      <link>https://forem.com/lboix/a-cloudfrontfunction-to-remove-a-specific-value-at-the-beginning-of-an-url-3f3d</link>
      <guid>https://forem.com/lboix/a-cloudfrontfunction-to-remove-a-specific-value-at-the-beginning-of-an-url-3f3d</guid>
      <description>&lt;p&gt;My use case was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the CDN was delivering images from a S3 bucket&lt;/li&gt;
&lt;li&gt;the images URL pattern was &lt;code&gt;https://URL/images/something.jpg&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;there was no "images" folder at the root of the S3 bucket: images like &lt;code&gt;something.jpg&lt;/code&gt; were directly there&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I used this function and associated it with the right Behavior ("Viewer request" option) of my CloudFront Distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function handler(event) {

    var request = event.request;

    if (request.uri.startsWith("/images/")) {
        // substring(7) drops the "/images" prefix (7 characters)
        // while keeping the leading slash of the object key
        request.uri = request.uri.substring(7);
    }

    return request;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me know if that helped you or if you have suggestions for improvement. Take care!&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>writing</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>How to do a thread dump on a pod running a Java app ?</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Tue, 24 Jan 2023 22:46:59 +0000</pubDate>
      <link>https://forem.com/lboix/how-to-do-a-thread-dump-on-a-pod-running-a-java-app--1cl8</link>
      <guid>https://forem.com/lboix/how-to-do-a-thread-dump-on-a-pod-running-a-java-app--1cl8</guid>
      <description>&lt;p&gt;If your Java app is struggling with busy threads piling up, there is nothing better than having a look at the state of those threads to see what their last action was before they hung.&lt;/p&gt;

&lt;p&gt;Here is a simple TODO to achieve that if your app is running inside a Kubernetes pod (we will assume the pod runs only one container).&lt;/p&gt;

&lt;p&gt;Open your terminal and tail the logs of your pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get po |grep "YOUR_APP"
kubectl logs -f POD_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open a new tab in your terminal, and launch the thread dump:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# connect to your pod's container
kubectl exec -it POD_NAME -- sh
# find the PID of your Java process (it should be 1)
ps aux
# force a thread dump to stdout (do not worry : this will not kill the application)
kill -3 YOUR_PID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go back to your first tab and analyse the results.&lt;/p&gt;
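&lt;p&gt;With large dumps, a quick tally of the thread states helps you spot piles of BLOCKED or WAITING threads; here is an illustrative Python sketch (assuming the usual &lt;code&gt;java.lang.Thread.State:&lt;/code&gt; lines of a HotSpot thread dump):&lt;/p&gt;

```python
from collections import Counter

def thread_state_summary(dump_text):
    """Count thread states in a HotSpot thread dump."""
    counter = Counter()
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith("java.lang.Thread.State:"):
            # Keep only the state keyword, e.g. "BLOCKED" from "BLOCKED (on object monitor)"
            state = line.split(":", 1)[1].strip().split(" ")[0]
            counter[state] += 1
    return counter

dump = """
"http-nio-8080-exec-1" #42 daemon prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"http-nio-8080-exec-2" #43 daemon prio=5
   java.lang.Thread.State: RUNNABLE
"""
print(thread_state_summary(dump))  # Counter({'BLOCKED': 1, 'RUNNABLE': 1})
```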

&lt;p&gt;For example, it once allowed me to quickly find out that I had a locked key in my Redis instance. What else have you discovered through thread dumps? Please share your experiences in the comments.&lt;/p&gt;

&lt;p&gt;Take care and have a great day!&lt;/p&gt;

</description>
      <category>node</category>
      <category>oauth</category>
      <category>backend</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ElasticSearch cluster sanity check and first-aid kit</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Fri, 14 Oct 2022 18:46:44 +0000</pubDate>
      <link>https://forem.com/lboix/elasticsearch-cluster-first-aid-kit-484d</link>
      <guid>https://forem.com/lboix/elasticsearch-cluster-first-aid-kit-484d</guid>
      <description>&lt;p&gt;Here are some useful commands I have used in the past to help you fix your yellow or red cluster, especially when you have unassigned shards. If you have suggestions for improvements, please let me know in the comments. I wish you a great day!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# see cluster health
GET _cluster/health?pretty

# see nodes status
GET _cat/nodes?pretty&amp;amp;v=true

# see a summary of the JVM statistics (memory usage, is the GC triggering a lot, etc.)
GET /_nodes/stats/jvm?pretty

# see shards status
GET /_cat/shards?v

# see shards allocation (useful to detect if a node has a disk space full)
GET /_cat/allocation?v

# get detailed reason for the first unassigned shard
GET /_cluster/allocation/explain

# get the reason for any unhealthy shard
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

# the detail of an unhealthy shard can be found here : https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html#_example_with_reasons_for_unassigned_shards:~:text=unassigned.-,reason,-%2C%20ur

# if the unassigned shard belongs to an index you can get rid of (logs of a past day for example), the easiest fix is to remove the related index
GET _cat/indices?v
DELETE /your_index

# if the unassigned shard belongs to an index you can NOT get rid of (production data), then try to reroute it to another node (if it fails the precise reason will be described) : example for primary shard #2 (use "allow_primary": false for a replica shard) of your-index (remove the ?dry_run parameter to actually reroute the shard)
POST _cluster/reroute?dry_run
{
    "commands" : [
        {
          "allocate" : {
              "index" : "your-index", "shard" : 2, "node" : "new-node-name", "allow_primary": true
          }
        }
    ]
}

# if your stuck shard is not in UNASSIGNED status but rather in INITIALIZING status
## if you are with ES7+ then you can force the reassignment of the shard with the command above, but replace allocate with allocate_stale (I never tested it myself actually, only read about this)
## if not and you are comfortable, you can try to reboot the node currently assigned to this shard : after the restart, the shard should be back to UNASSIGNED status and you will be able to use the command above (I never tested it myself actually, only read about this)

# check your cluster settings (allocation rules for example)
GET _cluster/settings

# exclude the IP of a bad node for the shard allocation
PUT _cluster/settings
{
  "transient" :{
      "cluster.routing.allocation.exclude._ip" : "your-node-ip"
  }
}

# check your index settings (shards and replicas number for example)
GET /your-index/_settings

# if you have a replica unassigned shard, a known workaround is to put to 0 the number of replicas (it will delete replica shards) then put it back to its original value (it will recreate them). But I recommend to AVOID doing this as it will put a big load on the cluster, and it's a risky procedure especially if the state of the cluster is red
PUT /your-index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
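&lt;p&gt;To go through a long shard list faster, you can also filter the output of the &lt;code&gt;GET _cat/shards?h=index,shard,prirep,state,unassigned.reason&lt;/code&gt; command above with a few lines of Python (an illustrative sketch, assuming the column order given in the &lt;code&gt;h&lt;/code&gt; parameter):&lt;/p&gt;

```python
def unassigned_shards(cat_shards_output):
    """Filter _cat/shards rows (index shard prirep state [reason]) to the UNASSIGNED ones."""
    result = []
    for line in cat_shards_output.strip().splitlines():
        fields = line.split()
        # range(4) holds 0..3, so this skips rows with fewer than 4 columns
        if len(fields) in range(4):
            continue
        if fields[3] == "UNASSIGNED":
            result.append({"index": fields[0], "shard": fields[1], "reason": fields[-1]})
    return result

sample = """
logs-2022.10.13 0 p STARTED
logs-2022.10.14 0 p UNASSIGNED NODE_LEFT
logs-2022.10.14 0 r UNASSIGNED REPLICA_ADDED
"""
print(unassigned_shards(sample))
```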



</description>
    </item>
    <item>
      <title>TODO for smoothly upgrading Kubernetes version</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Fri, 07 Oct 2022 21:54:46 +0000</pubDate>
      <link>https://forem.com/lboix/todo-for-upgrading-kubernetes-version-lnk</link>
      <guid>https://forem.com/lboix/todo-for-upgrading-kubernetes-version-lnk</guid>
      <description>&lt;p&gt;During the last months, I tried to come up with a simple TODO to optimize the upgrade process and make it as smooth as possible for your workload, with no downtime (by avoiding too many pods starting at the same time, hitting your container registry rate limit, etc.).&lt;/p&gt;

&lt;p&gt;So I am sharing it here in case it can help you with your first upgrade:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preparation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First start reading this article for potential breaking changes (especially regarding deleted apiVersion : you need to update them before going further!) : &lt;a href="https://kubernetes.io/docs/reference/using-api/deprecation-guide/"&gt;https://kubernetes.io/docs/reference/using-api/deprecation-guide/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;If you have one, always start by upgrading your testing / staging cluster before your production one (I strongly suggest it)&lt;/li&gt;
&lt;li&gt;Monitor it for a few days, just to be sure that there is no bad side effect on your workload with this upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Upgrading master nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start by upgrading the Kubernetes version of your Control Plane aka your master nodes, depending on the tool you are using (kops, EKS, AKS, etc.)&lt;/li&gt;
&lt;li&gt;If you are using cluster-autoscaler, scale it down :
&lt;code&gt;kubectl scale --replicas=0 deployment/cluster-autoscaler -n kube-system&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you are using a GitOps agent (like flux in this example), scale it down :
&lt;code&gt;kubectl scale --replicas=0 deployment/flux -n flux&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a new node group with the same new Kubernetes version&lt;/li&gt;
&lt;li&gt;Set a 2-hour maintenance window in your monitoring tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rolling out worker nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drain each node one by one on the old node group, use &lt;code&gt;kubectl get nodes -o wide&lt;/code&gt; to pick the right ones (running old Kubernetes version) :
&lt;code&gt;kubectl drain node_name --ignore-daemonsets --delete-emptydir-data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;After one drain, wait for all Evicted pods to restart correctly by running this command to get unhealthy pods across the cluster :
&lt;code&gt;kubectl get po -A | grep "0/" | grep -v "Completed"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wait a few minutes until you only have a few lines left, then move to the next node&lt;/li&gt;
&lt;li&gt;After the last node, once the command above returns no results at all, you are good to continue!&lt;/li&gt;
&lt;/ul&gt;
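&lt;p&gt;If you prefer to script the wait loop, the &lt;code&gt;kubectl get po -A | grep "0/" | grep -v "Completed"&lt;/code&gt; check above can be expressed as a small Python filter over the kubectl output (an illustrative sketch of the same logic):&lt;/p&gt;

```python
def unhealthy_pods(kubectl_output):
    """Return (namespace, pod) pairs with 0 ready containers, ignoring Completed pods.

    Mirrors: kubectl get po -A | grep "0/" | grep -v "Completed"
    """
    unhealthy = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        namespace, name, ready, status = fields[0], fields[1], fields[2], fields[3]
        if ready.startswith("0/") and status != "Completed":
            unhealthy.append((namespace, name))
    return unhealthy

sample = """NAMESPACE   NAME        READY   STATUS      RESTARTS   AGE
default     api-1       0/1     Pending     0          1m
default     api-2       1/1     Running     0          5d
kube-system job-x       0/1     Completed   0          2h
"""
print(unhealthy_pods(sample))  # [('default', 'api-1')]
```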




&lt;ul&gt;
&lt;li&gt;Once ALL nodes of old node group have been drained, you can delete it&lt;/li&gt;
&lt;li&gt;Check the completed deletion using again &lt;code&gt;kubectl get nodes -o wide&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Wrapping up&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you are using cluster-autoscaler, upgrade the version used:&lt;/li&gt;
&lt;li&gt;find the latest release number that matches the new Kubernetes version of your cluster: &lt;a href="https://github.com/kubernetes/autoscaler/releases"&gt;https://github.com/kubernetes/autoscaler/releases&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;type the major version number in the search field at the top right to filter easily&lt;/li&gt;
&lt;li&gt;update the Docker image tag used by cluster-autoscaler:
&lt;code&gt;kubectl -n kube-system set image deployment.apps/cluster-autoscaler cluster-autoscaler=k8s.gcr.io/autoscaling/cluster-autoscaler:v1.MAJOR.minor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;scale it back up and make sure the pod starts correctly by checking its logs:
&lt;code&gt;kubectl scale --replicas=1 deployment/cluster-autoscaler -n kube-system&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;ul&gt;
&lt;li&gt;If you are using a GitOps agent (like flux in this example), scale it up, check the logs and make sure that it syncs well:
&lt;code&gt;kubectl scale --replicas=1 deployment/flux -n flux&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;ul&gt;
&lt;li&gt;Check your monitoring tools and resolve muted alerts that may have been triggered by the rollout&lt;/li&gt;
&lt;li&gt;Announce to your team that the rollout is done and all went well :)&lt;/li&gt;
&lt;li&gt;Commit and push all the version modifications you made in your cluster repo, if you have one (I strongly suggest it)&lt;/li&gt;
&lt;li&gt;That's it!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have any suggestion to improve this TODO, do not hesitate to let me know in the comments below. Thanks for reading and I wish you a great day!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Filebeat config on k8s after switching to containerd</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Thu, 18 Aug 2022 20:55:00 +0000</pubDate>
      <link>https://forem.com/lboix/filebeat-config-on-k8s-after-switching-to-containerd-1p6o</link>
      <guid>https://forem.com/lboix/filebeat-config-on-k8s-after-switching-to-containerd-1p6o</guid>
      <description>&lt;p&gt;You cannot have missed it: dockershim (the layer that lets Kubernetes use the Docker runtime) &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/migrating-from-dockershim/"&gt;will be removed starting with 1.24&lt;/a&gt;. Do not worry, it is a pretty seamless change and your images built with Docker will remain fully functional.&lt;/p&gt;

&lt;p&gt;But if your current cluster nodes are running on the Docker runtime, you most likely have some hardcoded configuration tied to Docker.&lt;/p&gt;

&lt;p&gt;In this article we will focus on a &lt;strong&gt;filebeat&lt;/strong&gt; configuration originally setup for Docker Runtime, and what needs to be done after the switch to &lt;strong&gt;containerd&lt;/strong&gt; in order to keep getting your precious logs.&lt;/p&gt;

&lt;p&gt;The main steps are updating your &lt;strong&gt;filebeat&lt;/strong&gt; config file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;activate the &lt;code&gt;symlinks&lt;/code&gt; option&lt;/li&gt;
&lt;li&gt;update the path of the log files&lt;/li&gt;
&lt;li&gt;use the &lt;code&gt;dissect&lt;/code&gt; and &lt;code&gt;drop_fields&lt;/code&gt; processors together to parse and keep only what is necessary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then update the &lt;em&gt;volumeMounts&lt;/em&gt; section of your &lt;strong&gt;filebeat&lt;/strong&gt; &lt;em&gt;DaemonSet&lt;/em&gt; definition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each existing &lt;em&gt;mountPath&lt;/em&gt; or &lt;em&gt;path&lt;/em&gt; with value &lt;code&gt;/var/lib/docker/containers&lt;/code&gt; will need to be changed to &lt;code&gt;/var/log/containers&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a snippet of a &lt;strong&gt;filebeat&lt;/strong&gt; config file that worked for me; do not hesitate to let us know if it helped you in some way or if you have a suggestion for improvement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
data:
  filebeat.yml: |-
    setup.ilm.enabled: false
    filebeat.inputs:
    - type: log
      symlinks: true
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            in_cluster: true
            default_matchers.enabled: false
            matchers:
            - logs_path:
                logs_path: /var/log/containers/

    processors:
      - add_cloud_metadata:
      - drop_event:
          when:
            equals:
              kubernetes.namespace: "kube-system"
      - dissect:
          tokenizer: "%{timestamp} %{std} %{capital-letter} %{parsed-message}"
          field: "message"
          target_prefix: ""
      - decode_json_fields:
          fields: ["message","log","logs.log","parsed-message"]
          target: "logs"
          process_array: true
      - drop_fields:
          when:
            regexp:
              message: "^{\""
          fields: ["message"]
          ignore_missing: true
      - drop_fields:
          fields: ["log.file.path","timestamp","std","capital-letter","parsed-message"]
          ignore_missing: true

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
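&lt;p&gt;To make the &lt;code&gt;dissect&lt;/code&gt; processor in the config above more concrete, here is an illustrative Python equivalent of what its tokenizer does to a containerd log line (a sketch of the behaviour, not how filebeat is implemented):&lt;/p&gt;

```python
def dissect(line):
    """Split a containerd log line the way the dissect tokenizer in the config does.

    Tokenizer: "%{timestamp} %{std} %{capital-letter} %{parsed-message}"
    The first three tokens are space-delimited; the rest is the message.
    """
    timestamp, std, letter, message = line.split(" ", 3)
    return {
        "timestamp": timestamp,
        "std": std,
        "capital-letter": letter,
        "parsed-message": message,
    }

# A containerd-style log line: timestamp, stream, partial/full flag, payload
line = '2022-08-18T20:55:00.000000000Z stdout F {"level":"info","msg":"started"}'
print(dissect(line)["parsed-message"])  # {"level":"info","msg":"started"}
```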



&lt;p&gt;Have a great day!&lt;/p&gt;

</description>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Go snippet for creating an Ingress rule</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Thu, 18 Aug 2022 16:16:00 +0000</pubDate>
      <link>https://forem.com/lboix/go-snippet-for-creating-an-ingress-rule-2ici</link>
      <guid>https://forem.com/lboix/go-snippet-for-creating-an-ingress-rule-2ici</guid>
      <description>&lt;p&gt;You probably need to migrate your Ingress rules to apiVersion &lt;code&gt;networking.k8s.io/v1&lt;/code&gt; (as of Kubernetes 1.22, the old apiVersions &lt;code&gt;extensions/v1beta1&lt;/code&gt; and &lt;code&gt;networking.k8s.io/v1beta1&lt;/code&gt; simply disappear). If you are managing your Ingress rules through Go, here is a snippet to generate a valid Ingress rule, in case that can help you (I struggled a little to find the correct template, so I am sharing this post).&lt;/p&gt;

&lt;p&gt;Please let me know if this snippet was useful or if you see some improvements we can make to it.&lt;/p&gt;

&lt;p&gt;Have a great day!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    v1Networking "k8s.io/api/networking/v1"
    v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func MapIngress(ingressName string, hostName string) *v1Networking.Ingress {

    annotations := map[string]string{}
    annotations["kubernetes.io/ingress.provider"] = "nginx"
    annotations["kubernetes.io/ingress.class"] = "yourIngressClass"
    annotations["kubernetes.io/tls-acme"] = "true"
    // add other annotations you need

    meta := v1.ObjectMeta{
        Name:        ingressName,
        Annotations: annotations,
    }

    pathTypeImplementationSpecific := v1Networking.PathTypeImplementationSpecific

    return &amp;amp;v1Networking.Ingress{
        ObjectMeta: meta,
        Spec: v1Networking.IngressSpec{
            TLS: []v1Networking.IngressTLS{
                v1Networking.IngressTLS{
                    Hosts:      []string{hostName},
                    SecretName: "yourSecretName",
                },
            },
            Rules: []v1Networking.IngressRule{
                v1Networking.IngressRule{
                    Host: hostName,
                    IngressRuleValue: v1Networking.IngressRuleValue{
                        HTTP: &amp;amp;v1Networking.HTTPIngressRuleValue{
                            Paths: []v1Networking.HTTPIngressPath{
                                v1Networking.HTTPIngressPath{
                                    Path: "/",
                                    PathType: &amp;amp;pathTypeImplementationSpecific,
                                    Backend: v1Networking.IngressBackend{
                                        Service: &amp;amp;v1Networking.IngressServiceBackend{
                                            Name: "yourServiceName",
                                            Port: v1Networking.ServiceBackendPort{
                                                Number: 80,
                                            },
                                        },
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>kubernetes</category>
      <category>go</category>
    </item>
    <item>
      <title>EKS : migrate your Service to a Network Load Balancer</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Wed, 17 Aug 2022 22:59:00 +0000</pubDate>
      <link>https://forem.com/lboix/eks-migrate-your-service-to-a-network-load-balancer-nkh</link>
      <guid>https://forem.com/lboix/eks-migrate-your-service-to-a-network-load-balancer-nkh</guid>
      <description>&lt;p&gt;Whatever your use case (better performance, a slightly lower AWS bill, etc.), switching to a Network Load Balancer for your Kubernetes cluster is often a good call. You get a performance gain because traffic is received on a more basic layer (layer 4 of the &lt;a href="https://en.wikipedia.org/wiki/OSI_model#Layer_architecture" rel="noopener noreferrer"&gt;OSI model&lt;/a&gt;), and since the routing logic is often already handled at the application level by your Ingress Controller or a service mesh like Istio, you lose nothing by dropping down a layer.&lt;/p&gt;

&lt;p&gt;In my case, for example, I specifically needed the possibility to use static IPs for my Network Load Balancer (through the Amazon Elastic IP feature).&lt;/p&gt;

&lt;p&gt;After hours of tests and digging, here is a snippet that can be a good starting point for your switch. You will first create a new Service exposing the same Deployment as your existing Service, so you end up with two load balancers reachable and forwarding traffic to the same app. This is really useful to gracefully switch the traffic through DNS, test things, and roll back quickly if needed (a TTL of 300 seconds is acceptable for that).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kind: Service
apiVersion: v1
metadata:
  name: public-ingress-nginx-nlb
  namespace: prod
  labels:
    app: public-ingress-nginx-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: '60'
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
    service.beta.kubernetes.io/aws-load-balancer-type: 'nlb'
    service.beta.kubernetes.io/aws-load-balancer-eip-allocations: "eipalloc-AAA,eipalloc-BBB,eipalloc-CCC"
    service.beta.kubernetes.io/aws-load-balancer-subnets: "subnet-AAA,subnet-BBB,subnet-CCC"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true
    # unless you use the AWS Load Balancer Controller, this last annotation needs to be activated manually in the Target groups / Attributes tab
spec:
  type: LoadBalancer
  selector:
    app: public-ingress-nginx
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: http
    - name: https
      port: 443
      protocol: TCP
      targetPort: https


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notes :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The annotations &lt;code&gt;service.beta.kubernetes.io/aws-load-balancer-eip-allocations&lt;/code&gt; and &lt;code&gt;service.beta.kubernetes.io/aws-load-balancer-subnets&lt;/code&gt; are optional: you only need them if you want to attach static IPs to your Network Load Balancer. If you do, you first need to allocate the Elastic IPs in EC2 (they must belong to your account and not be currently in use). You do not need three; one will work, but I recommend three for production traffic. For redundancy, AWS forces you to define one public subnet per Elastic IP, and each subnet must be in a different Availability Zone of the Region you are using.&lt;/li&gt;
&lt;li&gt;To be able to use the &lt;a href="https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#use-proxy-protocol" rel="noopener noreferrer"&gt;PROXY protocol&lt;/a&gt; correctly, note that the annotation &lt;code&gt;service.beta.kubernetes.io/aws-load-balancer-target-group-attributes&lt;/code&gt; will not work if you have not set up the &lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/deploy/installation/" rel="noopener noreferrer"&gt;AWS Load Balancer Controller&lt;/a&gt; on your cluster. Until you have, do not forget to activate this option through EC2: edit each Target Group of your Network Load Balancer and check this option:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj5o60w7ks1piebrbyrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj5o60w7ks1piebrbyrs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few days or weeks, if everything is working as expected, do not forget to delete your original Service: this will automatically tear down the old Classic or Application Load Balancer you were using, with no downtime or impact on the Service linked to your Network Load Balancer. &lt;strong&gt;Also do not forget to update the &lt;code&gt;--publish-service&lt;/code&gt; argument of your NGINX Ingress Controller containers managed by your DaemonSet or Deployment specs.&lt;/strong&gt;&lt;/p&gt;
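&lt;p&gt;For reference, the switch itself can be sketched like this with &lt;code&gt;kubectl&lt;/code&gt; (the old Service name below is a placeholder, adapt it to your setup):&lt;/p&gt;

```shell
# Create the new Service next to the existing one
kubectl apply -f public-ingress-nginx-nlb.yaml

# Wait until the new Network Load Balancer hostname shows up under EXTERNAL-IP
kubectl get service public-ingress-nginx-nlb --namespace prod --watch

# Days or weeks later, once DNS points to the new NLB and everything works,
# delete the old Service (this tears down the old load balancer)
kubectl delete service your-old-service --namespace prod
```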

&lt;p&gt;Let me know if this page helped you in some way or if you have some suggestions for improvements.&lt;/p&gt;

&lt;p&gt;Have a great day!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
    </item>
    <item>
      <title>How to fix an npm install issue?</title>
      <dc:creator>Lucien Boix</dc:creator>
      <pubDate>Mon, 08 Aug 2022 21:31:00 +0000</pubDate>
      <link>https://forem.com/lboix/how-to-fix-a-npm-install-issue--37kp</link>
      <guid>https://forem.com/lboix/how-to-fix-a-npm-install-issue--37kp</guid>
      <description>&lt;p&gt;Sometimes it can be good to start from scratch, especially if you are opening an old or legacy project. Here are the steps to follow to get a working dependency graph again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;delete the existing &lt;strong&gt;package-lock.json&lt;/strong&gt; file&lt;/li&gt;
&lt;li&gt;run the following commands:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;node -v&lt;/code&gt; (make sure you are using the same Node version as the pipeline that will build and deploy your project; if not, see below)&lt;br&gt;
&lt;code&gt;npm cache clean -f&lt;/code&gt;&lt;br&gt;
&lt;code&gt;npm install&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;All should be good now.&lt;/p&gt;
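&lt;p&gt;If you prefer to run those steps in one go, they can be sketched as a small script (run it from your project root):&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -e

# start from a clean dependency state
rm -f package-lock.json

# check the Node version against the one used by your pipeline
node -v

# force-clean the npm cache and rebuild the dependency graph
npm cache clean -f
npm install
```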

&lt;blockquote&gt;
&lt;p&gt;If you need to quickly switch to a different Node version from the one currently set up on your workstation, you can use the handy package &lt;a href="https://www.npmjs.com/package/n"&gt;n&lt;/a&gt;:&lt;br&gt;
&lt;code&gt;npm install -g n&lt;/code&gt;&lt;br&gt;
&lt;code&gt;sudo n stable&lt;/code&gt;&lt;br&gt;
&lt;code&gt;sudo n&lt;/code&gt; (choose a version and press Enter)&lt;br&gt;
&lt;code&gt;node -v&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Note: the same remark applies if you need to run an &lt;code&gt;npm audit fix&lt;/code&gt;: make sure you are using the same Node version as the pipeline that will build and deploy your project.&lt;/p&gt;

</description>
      <category>node</category>
    </item>
  </channel>
</rss>
