<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sagar Parmar</title>
    <description>The latest articles on Forem by Sagar Parmar (@sagar0419).</description>
    <link>https://forem.com/sagar0419</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F979074%2F32b5c956-a3a8-487a-a7ef-a3c85b7fbceb.jpeg</url>
      <title>Forem: Sagar Parmar</title>
      <link>https://forem.com/sagar0419</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sagar0419"/>
    <language>en</language>
    <item>
      <title>NVIDIA GPU Operator Explained: Simplifying GPU Workloads on Kubernetes</title>
      <dc:creator>Sagar Parmar</dc:creator>
      <pubDate>Wed, 05 Nov 2025 13:28:19 +0000</pubDate>
      <link>https://forem.com/aws-builders/nvidia-gpu-operator-explained-simplifying-gpu-workloads-on-kubernetes-479b</link>
      <guid>https://forem.com/aws-builders/nvidia-gpu-operator-explained-simplifying-gpu-workloads-on-kubernetes-479b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdozo5dmd9t2f2dl1vvns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdozo5dmd9t2f2dl1vvns.png" alt="Integrating NVIDIA GPU’s with Kubernetes.&amp;lt;br&amp;gt;
" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;While GPUs have long been a staple in industries like gaming, video editing, CAD, and 3D rendering, their role has evolved dramatically over the years. Originally designed to handle graphics-intensive tasks, GPUs have proven to be powerful tools for a wide range of computationally demanding applications. Today, their ability to perform massive parallel processing has made them indispensable in modern fields such as data science, artificial intelligence and machine learning (AI/ML), robotics, cryptocurrency mining, and scientific computing. This shift was catalysed by the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA in 2007, which unlocked the potential of GPUs for general-purpose computing. As a result, GPUs are no longer just graphics accelerators; they’re now at the heart of cutting-edge innovation across industries.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll discuss the NVIDIA GPU Operator and how to deploy it on a Kubernetes cluster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why run GPU workloads on Kubernetes?
&lt;/h2&gt;

&lt;p&gt;Running GPU workloads on Kubernetes offers significant advantages: developers can seamlessly schedule and run GPU-powered applications, and deploying and scaling these workloads becomes much simpler. With Kubernetes, workloads can be easily scaled up or down based on demand, while features like Role-Based Access Control (RBAC) provide isolation and multi-tenancy for secure, shared environments. Additionally, Kubernetes supports the creation of multi-cloud GPU clusters, allowing organizations to leverage GPU resources across different cloud providers with consistent orchestration and control.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore the GPU-Kubernetes integration stack in depth with the help of NVIDIA GPU Operator. From the host operating system to the Kubernetes control plane, we’ll peel back each layer to understand the components required to make GPUs work seamlessly within a Kubernetes environment. More importantly, we’ll uncover why each component matters and how they interact with one another.&lt;/p&gt;
&lt;h2&gt;
  
  
  How are GPUs integrated into Kubernetes without using the GPU Operator?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfy761jrzcbboryl1vqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfy761jrzcbboryl1vqa.png" alt="Three Foundational Layers" width="720" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes excels at managing standard compute workloads, but orchestrating high-performance hardware like GPUs introduces unique challenges. Before diving into the GPU Operator, it’s important to understand the &lt;strong&gt;three foundational layers&lt;/strong&gt; required to run GPU workloads in Kubernetes. Think of it as a recipe: each step must be correctly configured for the GPU to function seamlessly within the cluster.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: The Host Operating System
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxrpj1n81phrw9pmy3j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxrpj1n81phrw9pmy3j0.png" alt="CUDA Toolkit" width="720" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything begins at the host level. The &lt;strong&gt;NVIDIA device driver&lt;/strong&gt; is the critical software that communicates directly with the GPU hardware. A key requirement here is &lt;strong&gt;version compatibility&lt;/strong&gt; between the driver and the &lt;strong&gt;CUDA toolkit&lt;/strong&gt; embedded in your container image. This compatibility matrix must be respected; any mismatch can break GPU functionality.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: The Container Runtime (e.g., Docker, containerd, CRI-O, runc)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyyioo7wigxl7h88rq5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyyioo7wigxl7h88rq5f.png" alt="Bridging the gap between container runtime and the GPU." width="720" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we need a bridge between the container runtime (e.g., Docker, containerd, CRI-O) and the host GPU. This is where the &lt;strong&gt;NVIDIA Container Toolkit&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Functions of the Toolkit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU Access Enablement&lt;/strong&gt;: Provides essential libraries like &lt;code&gt;libnvidia-container&lt;/code&gt; and &lt;code&gt;nvidia-container-cli&lt;/code&gt; to configure runtimes for GPU access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runtime Configuration&lt;/strong&gt;: Injects GPU device files, drivers, and environment variables into containers via runtime hooks (e.g., updates to &lt;code&gt;/etc/containerd/config.toml&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Device Plugin Dependency&lt;/strong&gt;: The NVIDIA Device Plugin relies on the toolkit to expose GPU resources to Kubernetes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Abstraction Layer&lt;/strong&gt;: Allows containers to use GPUs without bundling drivers or CUDA libraries inside the image, keeping containers lightweight and portable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this toolkit, containers remain unaware of the GPU hardware on the node.&lt;/p&gt;
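&lt;p&gt;As an illustration, the runtime hook registered by the toolkit (for example via &lt;code&gt;nvidia-ctk runtime configure&lt;/code&gt;) results in a containerd configuration roughly like the sketch below. This is an assumption-laden example: exact plugin paths and keys vary by containerd and toolkit version, so treat it as illustrative rather than authoritative:&lt;/p&gt;

```toml
# Sketch of the containerd changes made by the NVIDIA Container Toolkit
# (containerd config v2 schema; exact keys vary by version).
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      # The NVIDIA runtime wraps runc and injects GPU device files
      # and driver libraries into containers at start time.
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

&lt;p&gt;When this runtime is set as the default, every container started through the CRI gets the GPU hook applied automatically.&lt;/p&gt;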
&lt;h3&gt;
  
  
  Step 3: The Kubernetes Orchestration Layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8ae6dgg494mmwa9pv85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8ae6dgg494mmwa9pv85.png" alt="Device Plugin and it’s role." width="720" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, Kubernetes needs to recognize and schedule GPU resources. This is achieved through the &lt;strong&gt;NVIDIA Device Plugin&lt;/strong&gt;, which runs as a DaemonSet on GPU-enabled nodes.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Core Functions of the Device Plugin:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU Discovery &amp;amp; Advertising&lt;/strong&gt;: Detects available GPUs and registers them with the Kubelet as extended resources (e.g., &lt;code&gt;nvidia.com/gpu&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Allocation&lt;/strong&gt;: When a pod requests a GPU, the plugin ensures the container receives the correct device files, drivers, and environment variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health Monitoring&lt;/strong&gt;: Continuously checks GPU health and updates Kubernetes to prevent scheduling on faulty devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU Sharing &amp;amp; Partitioning:&lt;/strong&gt; It maximizes utilization via advanced features:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;• Time-Slicing:&lt;/strong&gt; Allows multiple containers to share a single GPU’s compute power.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;• Multi-Instance GPU (MIG):&lt;/strong&gt; Partitions high-end GPUs (like the A100) into multiple, fully isolated hardware instances.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;• Virtual GPU (vGPU)&lt;/strong&gt;: Enables the sharing of a single GPU among multiple virtual machines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
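&lt;p&gt;Time-slicing, for example, is configured declaratively through a ConfigMap consumed by the device plugin. The sketch below follows the documented sharing schema, but the ConfigMap name and namespace are illustrative, and you should verify the schema against the device plugin version you run:&lt;/p&gt;

```yaml
# Illustrative time-slicing config for the NVIDIA device plugin.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU is advertised as 4 schedulable resources
```

&lt;p&gt;With a config like this, a node with one physical GPU would report &lt;code&gt;nvidia.com/gpu: 4&lt;/code&gt;, letting four pods time-share the card.&lt;/p&gt;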
&lt;h2&gt;
  
  
  Why Scaling GPU Workloads in Kubernetes Is Hard, and How Operators Help
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanmgwkke9r78tllaqlxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanmgwkke9r78tllaqlxm.png" alt="The Challenge of Scale: Moving from One Machine to a Fleet." width="720" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three-layer setup we discussed works well on a single machine. But things get complicated when you scale to a production-grade Kubernetes cluster with hundreds or thousands of nodes. That’s when the manual approach starts to fall apart and the real operational pain begins.&lt;/p&gt;

&lt;p&gt;Manually managing an entire fleet introduces a massive operational challenge that can bring projects to a grinding halt. You’re navigating a minefield of issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Driver Compatibility:&lt;/strong&gt; Different GPU models require different, specific driver versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration Drift:&lt;/strong&gt; Nodes inevitably fall out of sync over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Risky Upgrades:&lt;/strong&gt; The upgrade process becomes a high-risk nightmare.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doubled Workload:&lt;/strong&gt; You often end up managing two completely separate software stacks, one for CPU nodes and another for GPU nodes, effectively doubling your workload.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve these scaling challenges, the Kubernetes community embraced a powerful cloud-native pattern: the &lt;strong&gt;Operator&lt;/strong&gt;. Think of it as an automated expert, a robotic administrator that continuously monitors your cluster and handles all the tedious, error-prone tasks for you. It brings consistency, reliability, and automation to GPU management at scale.&lt;/p&gt;

&lt;p&gt;The GPU Operator works in a &lt;strong&gt;control loop&lt;/strong&gt;, constantly observing the state of your nodes and ensuring they match the &lt;strong&gt;desired configuration&lt;/strong&gt; you’ve defined. This means no more manual setup, no more configuration drift, and no more juggling separate software stacks for CPU and GPU nodes. Instead, you get consistency, reliability, and automation at scale.&lt;/p&gt;

&lt;p&gt;This shift from manual management to automated orchestration is what makes the Operator pattern so transformative. It turns GPU infrastructure from a fragile, high-maintenance setup into a resilient, self-healing system.&lt;/p&gt;
&lt;h2&gt;
  
  
  How the NVIDIA GPU Operator Works
&lt;/h2&gt;

&lt;p&gt;The Operator establishes a consistent, automated workflow for every node in your cluster, eliminating manual intervention through a streamlined process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; It first identifies which nodes physically possess GPUs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Installation and Configuration:&lt;/strong&gt; In the required order, it automatically installs the necessary containerised drivers, configures the Container Toolkit, and deploys the device plugin along with monitoring tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;: This final step is critical: the Operator validates that every component is working perfectly before allowing Kubernetes to schedule any AI workloads on that node.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process guarantees reliability and prevents misconfigured nodes from disrupting GPU-intensive applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installing the NVIDIA GPU Operator
&lt;/h2&gt;

&lt;p&gt;Installing the &lt;strong&gt;NVIDIA GPU Operator&lt;/strong&gt; in Kubernetes is straightforward with Helm. The Operator automates the deployment and configuration of all essential GPU components including drivers, the container toolkit, and device plugins across your cluster. To ensure a smooth setup, follow a step-by-step approach.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites:
&lt;/h2&gt;

&lt;p&gt;Before proceeding please make sure that you have met the following prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Operating System Requirements for the GPU Operator:&lt;br&gt;&lt;br&gt;
• To use the NVIDIA GPU Driver container for your workloads, all GPU-enabled worker nodes must share the &lt;strong&gt;same operating system version&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
• If you need to mix different operating systems across GPU nodes, you must &lt;strong&gt;pre-install the NVIDIA GPU Driver manually&lt;/strong&gt; on each respective node instead of using the containerized driver.&lt;br&gt;&lt;br&gt;
• &lt;strong&gt;CPU-only nodes&lt;/strong&gt; have no OS restrictions, as the GPU Operator does not manage or configure them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Helm is installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You have permission to execute &lt;code&gt;kubectl&lt;/code&gt; commands against the target cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Installation steps:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Add NVIDIA Helm Repository
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \&lt;/span&gt;
 &lt;span class="s"&gt;&amp;amp;&amp;amp; helm repo update&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;2. Install the GPU Operator&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;helm install --wait --generate-name \&lt;/span&gt;
 &lt;span class="s"&gt;-n gpu-operator --create-namespace \&lt;/span&gt;
 &lt;span class="s"&gt;nvidia/gpu-operator \&lt;/span&gt;
 &lt;span class="s"&gt;--version=v25.10.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the NVIDIA driver or toolkit is already installed on your nodes, you can disable either or both during GPU Operator deployment by using the following flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;--set driver.enabled=false&lt;/span&gt;
&lt;span class="s"&gt;--set toolkit.enabled=false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. Verify the installation by checking the status of the deployed resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;kubectl get pods -n gpu-operator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the GPU operator components running in the namespace.&lt;/p&gt;

&lt;p&gt;4. We can also inspect the node configuration to confirm that nodes with GPUs are configured correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;kubectl describe nodes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;               &lt;span class="s"&gt;sagar.rajput27@live.com&lt;/span&gt;
&lt;span class="na"&gt;Roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;              &lt;span class="s"&gt;worker&lt;/span&gt;
&lt;span class="na"&gt;Labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="s"&gt;node-role.kubernetes.io/worker=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.count=1&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.container-toolkit=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.dcgm=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.dcgm-exporter=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.device-plugin=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.driver=pre-installed&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.gpu-feature-discovery=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.mig-manager=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.node-status-exporter=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.nvsm=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.deploy.operator-validator=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.mode=compute&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.present=true&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.product=NVIDIA-H100-PCIe&lt;/span&gt;
                    &lt;span class="s"&gt;nvidia.com/gpu.replicas=1&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;Annotations:        nvidia.com/gpu-driver-upgrade-enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
                    &lt;span class="s"&gt;projectcalico.org/IPv4Address&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.*.*.*/*&lt;/span&gt;
                    &lt;span class="s"&gt;projectcalico.org/IPv4VXLANTunnelAddr&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.*.*.*&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;Capacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;ephemeral-storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;32758Mi&lt;/span&gt;
  &lt;span class="na"&gt;hugepages-1Gi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;hugepages-2Mi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="s"&gt;527533864Ki&lt;/span&gt;
  &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;               &lt;span class="m"&gt;110&lt;/span&gt;
&lt;span class="na"&gt;Allocatable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;ephemeral-storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;32631789953&lt;/span&gt;
  &lt;span class="na"&gt;hugepages-1Gi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;hugepages-2Mi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="s"&gt;527533864Ki&lt;/span&gt;
  &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;               &lt;span class="m"&gt;110&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the node with GPU hardware attached has GPU-related labels and annotations added to it. Additionally, the GPU resources are visible under the &lt;strong&gt;Capacity&lt;/strong&gt; and &lt;strong&gt;Allocatable&lt;/strong&gt; sections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verification by running sample GPU application
&lt;/h3&gt;

&lt;p&gt;We can test the setup by deploying the CUDA &lt;strong&gt;vectoradd&lt;/strong&gt; application provided by NVIDIA on our cluster. This image is an NVIDIA CUDA sample that demonstrates vector addition, a basic GPU computation.&lt;/p&gt;

&lt;p&gt;Under the &lt;strong&gt;resources → limits&lt;/strong&gt; section of this manifest, you’ll notice &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt;. This instructs Kubernetes to schedule the Pod on a node equipped with an NVIDIA GPU.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;cat &amp;lt;&amp;lt; EOF | kubectl create -f -&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda-vectoradd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda-vectoradd&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="s"&gt;EOF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;pod/cuda-vectoradd created&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can check the logs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;kubectl logs pod/cuda-vectoradd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logs Output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Vector addition of 50000 elements&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="s"&gt;Copy input data from the host memory to the CUDA device&lt;/span&gt;
&lt;span class="s"&gt;CUDA kernel launch with 196 blocks of 256 threads&lt;/span&gt;
&lt;span class="s"&gt;Copy output data from the CUDA device to the host memory&lt;/span&gt;
&lt;span class="s"&gt;Test PASSED&lt;/span&gt;
&lt;span class="s"&gt;Done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our cluster is now ready to run GPU workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Sharing to Maximize GPU Utilization
&lt;/h2&gt;

&lt;p&gt;GPUs are expensive, high-performance hardware, and leaving them idle is a waste of valuable resources. Once your GPUs are up and running in Kubernetes, the real challenge becomes &lt;strong&gt;efficient sharing&lt;/strong&gt;. The goal is to extract maximum value from every single card.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;GPU Operator&lt;/strong&gt; makes this easy by allowing you to configure advanced sharing strategies declaratively. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MIG (Multi-Instance GPU)&lt;/strong&gt;: Physically partitions a single GPU into multiple, fully isolated instances, each with dedicated memory and compute.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MPS (Multi-Process Service)&lt;/strong&gt;: Enables concurrent execution of multiple GPU processes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time-Slicing&lt;/strong&gt;: Ideal for development workloads that only need occasional GPU access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
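&lt;p&gt;With the GPU Operator, these strategies are selected declaratively through its Helm values. The fragment below is a hedged sketch: the value names follow recent GPU Operator releases, and the referenced ConfigMap name is illustrative, so confirm both against the documentation for the version you deploy:&lt;/p&gt;

```yaml
# Illustrative GPU Operator Helm values for GPU sharing strategies.
devicePlugin:
  config:
    name: time-slicing-config   # hypothetical ConfigMap holding a sharing config
    default: any                # config key applied to nodes without an explicit label
mig:
  strategy: single              # or "mixed" for heterogeneous MIG profiles per node
```

&lt;p&gt;Such values are applied at install time or later with &lt;code&gt;helm upgrade -n gpu-operator -f values.yaml&lt;/code&gt;, and the Operator reconciles the nodes to match.&lt;/p&gt;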

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahs61ib9twwur520bjfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahs61ib9twwur520bjfw.png" alt="GPU Sharing Strategy" width="693" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The optimal GPU sharing strategy depends entirely on your specific workload requirements and operational goals. The choice involves balancing factors like performance isolation, dynamic flexibility, and raw utilization. A workload that demands predictable performance in a multi-tenant cluster has very different needs than an interactive development workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional GPU Operator Components: Streamlining Data Movement
&lt;/h2&gt;

&lt;p&gt;The GPU Operator includes additional components that are not enabled by default, such as GPUDirect RDMA and GPUDirect Storage. These tools are designed to streamline data movement between GPUs and other system components, effectively bypassing traditional bottlenecks like the CPU and system memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPUDirect RDMA (Remote Direct Memory Access)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GPUDirect RDMA&lt;/strong&gt; enables direct memory access between GPUs and PCIe devices (such as NICs or storage adapters), without involving the CPU or system RAM. This is ideal for High-Performance Computing (HPC) and AI training, where latency is critical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fses9wfgi32gbk1mtybx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fses9wfgi32gbk1mtybx5.png" alt="Direct Communication between NVIDIA GPUs&amp;lt;br&amp;gt;
" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lower latency&lt;/strong&gt;: Data moves directly between the GPU and the device.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced CPU load&lt;/strong&gt;: Frees up CPU cycles for compute tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Higher bandwidth&lt;/strong&gt;: Enables faster data transfer for distributed workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU-to-GPU communication across nodes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time inference at the edge&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High-speed networking in HPC clusters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
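&lt;p&gt;In the GPU Operator, GPUDirect RDMA support is typically switched on through the driver section of the Helm values. Treat the names below as a sketch based on recent releases, not a guaranteed interface, and verify them against the Operator documentation for your version:&lt;/p&gt;

```yaml
# Illustrative Helm values enabling GPUDirect RDMA in the GPU Operator.
driver:
  rdma:
    enabled: true
    useHostMofed: false   # set true if MLNX_OFED drivers are pre-installed on the host
```

&lt;p&gt;Note that RDMA also requires compatible NICs and, in many setups, the NVIDIA Network Operator alongside the GPU Operator.&lt;/p&gt;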

&lt;h3&gt;
  
  
  GPUDirect Storage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GPUDirect Storage&lt;/strong&gt; allows GPUs to read data directly from NVMe or other storage devices, again bypassing the CPU and system memory. This is essential for AI/ML workloads that need to access and process large datasets quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmct56gbc6tpbjlhbgaeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmct56gbc6tpbjlhbgaeo.png" alt="GPUDirect Storage A Direct Path Between Storage and GPU Memory&amp;lt;br&amp;gt;
" width="720" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster data ingestion&lt;/strong&gt;: Minimizes I/O bottlenecks during training or inference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient data pipeline&lt;/strong&gt;: Direct flow from storage to GPU memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplified architecture&lt;/strong&gt;: Eliminates unnecessary memory copies and CPU involvement.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Large-scale deep learning training&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data analytics pipelines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scientific simulations with massive datasets&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both technologies are part of NVIDIA’s strategy to optimize data movement for GPU workloads. By enabling direct communication paths between GPUs and external devices, they unlock higher performance, lower latency, and better resource utilization in Kubernetes environments where scalability and efficiency are critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Integrating NVIDIA GPUs into Kubernetes typically involves a complex, three-layer manual setup: host drivers, the container toolkit, and the Kubernetes device plugin. This approach works for single machines but creates massive operational challenges like configuration drift and incompatible drivers at scale.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;NVIDIA GPU Operator&lt;/strong&gt; is the solution. It uses the Operator pattern to automate the entire lifecycle, acting as a “robotic administrator” that discovers GPUs, installs the necessary software stack in the correct order, validates the setup, and streamlines maintenance.&lt;/p&gt;

&lt;p&gt;The core benefit? Simplifying your infrastructure so you can focus on AI workloads, not operational headaches.&lt;/p&gt;
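&lt;p&gt;Once the Operator has prepared the nodes, consuming a GPU is a one-line resource request in a pod spec. The sketch below is a minimal smoke test, not taken from this article; the image tag is an assumption, so substitute any CUDA image available in your environment.&lt;/p&gt;

```yaml
# Minimal sketch of a pod requesting one GPU. The resource name
# nvidia.com/gpu is advertised by the device plugin that the GPU
# Operator installs; the image tag below is an assumption.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

&lt;p&gt;If the pod logs show the &lt;code&gt;nvidia-smi&lt;/code&gt; table, the whole driver, toolkit, and device-plugin stack is working.&lt;/p&gt;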

&lt;p&gt;I hope you found this post informative and engaging. I would love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://twitter.com/sagarrajput27" rel="noopener noreferrer"&gt;&lt;strong&gt;Twitter&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/sagar-parmar-834403a6/" rel="noopener noreferrer"&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4clkle3rl9y1iyeqw8pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4clkle3rl9y1iyeqw8pw.png" alt="Bye-Bye" width="574" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html" rel="noopener noreferrer"&gt;&lt;strong&gt;About the NVIDIA GPU Operator — NVIDIA GPU Operator&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
*Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other…*docs.nvidia.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.nvidia.com/gpudirect" rel="noopener noreferrer"&gt;&lt;strong&gt;GPUDirect&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
*Enhancing Data Movement and Access for GPUs Whether you are exploring mountains of data, researching scientific…*developer.nvidia.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>nvidia</category>
      <category>kubernetes</category>
      <category>ai</category>
    </item>
    <item>
      <title>Creating Infra Using Backstage Templates, Terraform and GitHub actions.</title>
      <dc:creator>Sagar Parmar</dc:creator>
      <pubDate>Sun, 14 Jan 2024 09:58:05 +0000</pubDate>
      <link>https://forem.com/aws-builders/creating-infra-using-backstage-templates-terraform-and-github-actions-27gg</link>
      <guid>https://forem.com/aws-builders/creating-infra-using-backstage-templates-terraform-and-github-actions-27gg</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepsrwfmlner8bjvfq1mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepsrwfmlner8bjvfq1mn.png" alt="Platform Engineering Using Backstage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Backstage is an open-source platform for building Internal Developer Portals (IDPs). An IDP serves as a one-stop shop that provides a unified view of all our resources, enabling us to seamlessly create, manage, monitor, and document our software resources from a single location. The primary goal of an IDP is to reduce developers' day-to-day dependence on the DevOps team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Backstage is up and running. If you need assistance with deploying Backstage, please follow this &lt;a href="https://backstage.io/docs/features/software-templates/" rel="noopener noreferrer"&gt;getting started guide&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backstage is integrated with GitHub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working Terraform code is uploaded to a GitHub repo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A GitHub Actions workflow file is created to execute the Terraform code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the sake of this blog, we are using GitHub and AWS, but you can use any CI/CD tool and cloud provider.&lt;/p&gt;
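&lt;p&gt;Prerequisite 4 above refers to a GitHub Actions workflow that runs Terraform. Below is a hedged sketch of what such a workflow could look like; the trigger, job name, and secret names are illustrative assumptions, not the exact file used in this demo.&lt;/p&gt;

```yaml
# Illustrative Terraform workflow sketch; trigger and secret names
# are assumptions. Place under .github/workflows/ in the repo.
name: terraform-apply
on: workflow_dispatch
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```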

&lt;h2&gt;
  
  
  &lt;strong&gt;Templates&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Backstage provides a software catalog which is used to manage all of our software resources, including CI-CD, docs, APIs, K8s workloads, websites, microservices, etc. To create a new component (entity) in the software catalog we need templates.&lt;/p&gt;

&lt;p&gt;The structure of a template file is very similar to a Kubernetes manifest. This similarity makes it easier for users familiar with Kubernetes to work with Backstage templates.&lt;/p&gt;

&lt;p&gt;Below is a backstage template that you can use to create an EC2 instance on AWS.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: sagar
  title: Backstage automation
  description: creating ec2 using backstage and terraform.
spec:
  owner: guest
  type: service

  parameters:
    - title: backstage demo template
      required:
        - name
      properties:
        name:
          type: string

  steps:
    - id: test
      name: backstage-blog
      action: debug:log
      input:
        message: 'Hello, ${{ parameters.name }}!'

  output:
    links:
      - title: Open in catalog
        icon: catalog
        entityRef: ${{ steps['register'].output.entityRef }}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;apiVersion: -&lt;/em&gt;&lt;/strong&gt; is the required field and the latest value of the API version is &lt;code&gt;scaffolder.backstage.io/v1beta3&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;kind: -&lt;/em&gt;&lt;/strong&gt; Template is the entity kind in Backstage; to configure it, set the value of kind to Template.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;metadata: -&lt;/em&gt;&lt;/strong&gt; Information you add in metadata will appear on the template card.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;spec: -&lt;/em&gt;&lt;/strong&gt; It contains the owner/group of the template and the type of template, which could be anything, for example service, website, app, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;parameters: -&lt;/em&gt;&lt;/strong&gt; The list of options you can add to the template to collect information from the user. Parameters must contain these components: -&lt;br&gt;
&lt;strong&gt;5.1.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;title&lt;/em&gt;&lt;/strong&gt;: - contains the title of the template.&lt;br&gt;
&lt;strong&gt;5.2.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;required&lt;/em&gt;&lt;/strong&gt;: - lists the parameters that are mandatory, so that a user cannot skip them.&lt;br&gt;
&lt;strong&gt;5.3.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;properties&lt;/em&gt;&lt;/strong&gt;: - defines the input fields presented to the user.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;steps&lt;/em&gt;&lt;/strong&gt;: - Here we define the actions the template performs when it is executed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;output&lt;/em&gt;&lt;/strong&gt;: - This is the final section in the template file. After all the actions specified in steps have completed, this part is executed. It is used to display the result of the template execution, although it is not mandatory.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
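&lt;p&gt;For the EC2 demo later in this post, the parameters section collects instanceName, region and instanceType. A sketch of what that section might look like is shown below; the titles, enum values, and default are illustrative assumptions, not the exact demo file.&lt;/p&gt;

```yaml
# Illustrative sketch only: the parameter names match the demo
# (instanceName, region, instanceType); titles, enum values and
# the default are assumptions.
parameters:
  - title: EC2 instance details
    required:
      - instanceName
      - region
      - instanceType
    properties:
      instanceName:
        title: Instance Name
        type: string
        description: Name tag for the EC2 instance
      region:
        title: AWS Region
        type: string
        enum: [us-east-1, us-west-2, eu-west-1]
      instanceType:
        title: Instance Type
        type: string
        default: t2.micro
```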

&lt;p&gt;You can find out more about the template format &lt;a href="https://backstage.io/docs/features/software-catalog/descriptor-format" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying Custom Template
&lt;/h2&gt;

&lt;p&gt;Now let's look at the files created for this demo. In my Backstage code, I have created a directory ec2-demo/template, and inside the template directory a template.yaml file. I have also created a content directory containing three files: component-info.yaml, index.json and package.json.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AvlPtr8j957Noy02ZXp0vsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AvlPtr8j957Noy02ZXp0vsw.png" alt="Screenshot of the template directory."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have added the component-info.yaml, index.json and package.json files below. In the component-info.yaml file I have given the reference to the GitHub repo where my Terraform workflow is present.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

&lt;p&gt;After adding all these files, navigate to your app-config.yaml file and insert the template path under the catalog section, as illustrated in the screenshot below. Adjust the target path according to your requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARLf5TAFIqxkhgU99pJth9Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARLf5TAFIqxkhgU99pJth9Q.png" alt="screenshot of app-config.yaml"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adding the path, execute the yarn dev command. Now, on the Backstage portal, under the Create option, you will notice the newly created template.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKh1qBiYM9oAGXXYI81jKGg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKh1qBiYM9oAGXXYI81jKGg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the template.yaml file, under the parameters section, I have added instanceName, region and instanceType. Once you open the ec2-template, it will ask you to add these values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AB8GNte78-LZaLBSi0jDfcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AB8GNte78-LZaLBSi0jDfcw.png" alt="screenshot of the template (Parameter Part)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adding the values, the next page will ask you for a repository name and owner. You can give the repo any name; whatever name you add here, the template will create a repo on GitHub with that name. The owner can be a Backstage group or user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkPaKM_MS-_B5ibJCXZ2qIg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkPaKM_MS-_B5ibJCXZ2qIg.png" alt="Screenshot of the repository location"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all the details are filled in, the template will ask you to verify the details you have added.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3258%2F1%2AGvK4jnEinN3YNsweIRx2tQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3258%2F1%2AGvK4jnEinN3YNsweIRx2tQ.png" alt="Screenshot of review page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After reviewing, click on ‘Create.’ It will take a couple of seconds to generate your catalog. In the background, Backstage will trigger the GitHub workflow. This workflow will execute the ‘terraform apply’ command, which will then create an EC2 instance. Below is the screenshot of the GitHub action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3680%2F1%2Au58Ni9eiFFDSBeWNLir1Yw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3680%2F1%2Au58Ni9eiFFDSBeWNLir1Yw.png" alt="Screenshot of GitHub workflow file."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you check the status of the GitHub action Job you will see the status of the current job and its progress.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIe5DB6yZmIhKIxilyirSBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIe5DB6yZmIhKIxilyirSBQ.png" alt="Screenshot of the GitHub Action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Coming back to the Backstage UI, you will see that your catalog has been created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3074%2F1%2Agf1j1bkEPGeJsSeNCSETgA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3074%2F1%2Agf1j1bkEPGeJsSeNCSETgA.png" alt="Screenshot of the output window."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once done you can click on the catalog link. It will take you to the newly created catalog window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa88LF58Tue6SrUj0ZYWj8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa88LF58Tue6SrUj0ZYWj8w.png" alt="Screenshot of the Catalog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you will see a CI-CD option; under it you can check the status and progress of the GitHub workflow job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2270%2F1%2Aa1MssDTrU66f18AS2DpddQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2270%2F1%2Aa1MssDTrU66f18AS2DpddQ.png" alt="Screenshot of the CI-CD option."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the pipeline is completed, your EC2 instance will be up and running.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this blog post, we have seen how to use a Backstage template to create an AWS EC2 instance. You can also use templates to spin up infrastructure anywhere, to create a CI-CD pipeline to deploy your application, and more.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. I would love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://twitter.com/sagarrajput27" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/sagar-parmar-834403a6/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.devgenius.io/integration-of-opentelemetry-auto-instrumentation-with-jaeger-4de147a64f38" rel="noopener noreferrer"&gt;&lt;strong&gt;Integration of Opentelemetry (Auto Instrumentation) with Jaeger&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.devgenius.io/vcluster-architecture-overview-and-installation-d41b6262b2f8" rel="noopener noreferrer"&gt;&lt;strong&gt;Vcluster — Architecture Overview and Installation&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.devgenius.io/backup-restore-and-migrate-kubernetes-cluster-resources-using-velero-a9b6997e4b54" rel="noopener noreferrer"&gt;&lt;strong&gt;Backup, Restore and Migrate Kubernetes Cluster resources using Velero.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.devgenius.io/sealed-secret-in-kubernetes-d10fed2da964" rel="noopener noreferrer"&gt;&lt;strong&gt;Sealed Secret in Kubernetes&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>platformenginnering</category>
      <category>backstage</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Monitoring of AWS EKS using AWS Distro for OpenTelemetry (ADOT) and Amazon Managed Service for Prometheus (AMP)</title>
      <dc:creator>Sagar Parmar</dc:creator>
      <pubDate>Sat, 08 Jul 2023 08:41:54 +0000</pubDate>
      <link>https://forem.com/sagar0419/monitoring-of-aws-eks-using-aws-distro-for-opentelemetry-adot-and-amazon-managed-service-for-prometheus-amp-41d</link>
      <guid>https://forem.com/sagar0419/monitoring-of-aws-eks-using-aws-distro-for-opentelemetry-adot-and-amazon-managed-service-for-prometheus-amp-41d</guid>
      <description>&lt;p&gt;In this tutorial, we will be using &lt;strong&gt;AWS Distro for OpenTelemetry&lt;/strong&gt; to capture the metrics from AWS EKS and send them to &lt;strong&gt;Amazon managed service for Prometheus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Distro for OpenTelemetry (ADOT): -&lt;/strong&gt; An AWS-supported distribution of the upstream OpenTelemetry Collector. It supports selected components from the OpenTelemetry community and is fully compatible with AWS compute platforms including EC2, ECS, and EKS. It enables users to send telemetry data to AWS CloudWatch Metrics, Traces, and Logs backends as well as other supported backends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Managed Service for Prometheus (AMP): -&lt;/strong&gt; A Prometheus-compatible monitoring and alerting service offered by AWS that makes it easy to monitor containerized applications and infrastructure at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisite: -&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cert-manager is installed and running. If it is not installed, follow this &lt;strong&gt;&lt;a href="https://cert-manager.io/docs/installation/#default-static-install" rel="noopener noreferrer"&gt;URL&lt;/a&gt;&lt;/strong&gt; to install it.&lt;/li&gt;
&lt;li&gt;An AMP workspace is created. Guides for this can be found &lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html" rel="noopener noreferrer"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you are setting up the ADOT Collector on AWS EKS, you will need to set up &lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/set-up-irsa.html#set-up-irsa-ingest" rel="noopener noreferrer"&gt;&lt;strong&gt;IAM roles for service accounts for the ingestion&lt;/strong&gt;&lt;/a&gt; of metrics from Amazon EKS clusters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;3.1. Open the IAM &lt;a href="https://console.aws.amazon.com/iam/" rel="noopener noreferrer"&gt;&lt;strong&gt;Console&lt;/strong&gt;&lt;/a&gt; and edit the trust policy.&lt;br&gt;
3.2. In the left navigation pane, choose Roles and find the amp-iamproxy-ingest-role that you created in Step 3.&lt;br&gt;
3.3. Choose the Trust Relationships tab and choose Edit trust relationship.&lt;br&gt;
3.4. In the trust relationship policy JSON, replace aws-amp with adot-col and choose Update Trust Policy. Your resulting trust policy should look like the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::account-id:oidc-provider/oidc.eks.aws_region.amazonaws.com/id/openid"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.aws_region.amazonaws.com/id/openid:sub": "system:serviceaccount:adot-col:amp-iamproxy-ingest-service-account"
        }
      }
    }
  ]
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3.5. Choose the Permissions tab and make sure that the following permissions policy is attached to the role.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aps:RemoteWrite",
                "aps:GetSeries",
                "aps:GetLabels",
                "aps:GetMetricMetadata"
            ],
            "Resource": "*"
        }
    ]
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  ADOT Installation
&lt;/h3&gt;

&lt;p&gt;Assuming that you have all the prerequisites installed or created and you are ready to deploy the ADOT on your cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before installing ADOT, we need to make sure it is configured to send its metrics data to Amazon Managed Service for Prometheus (AMP). To do this, first download the sample Prometheus configuration file by running the following command.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

wget https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/examples/eks/aws-prometheus/prometheus-sample-app.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once you have downloaded this file, you need to change a few parameters, mentioned below: -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Replace &lt;strong&gt;YOUR_ENDPOINT&lt;/strong&gt; with the &lt;strong&gt;remote_write&lt;/strong&gt; endpoint of your Amazon Managed Service for Prometheus workspace, and &lt;strong&gt;YOUR_REGION&lt;/strong&gt; with your Region. You can get the remote_write URL from the AMP workspace you have just created. Below is a screenshot of the configmap where you need to add your details.&lt;/li&gt;
&lt;/ul&gt;
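&lt;p&gt;For reference, the exporter section of the sample collector configuration looks roughly like the following; the placeholders are the ones you replace above, and the exact exporter name depends on your ADOT version (newer releases use &lt;code&gt;prometheusremotewrite&lt;/code&gt; with the sigv4auth extension instead):&lt;/p&gt;

```yaml
# Sketch of the AMP exporter section from the sample file; replace
# YOUR_ENDPOINT and YOUR_REGION. Newer ADOT releases use the
# prometheusremotewrite exporter with the sigv4auth extension.
exporters:
  awsprometheusremotewrite:
    endpoint: "YOUR_ENDPOINT"   # AMP remote_write URL from the workspace summary
    aws_auth:
      region: "YOUR_REGION"
      service: "aps"
```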

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrnxlsrj3q05b202yx11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrnxlsrj3q05b202yx11.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above screenshot, we are using adot as the namespace, so when you check the metrics exported by your ADOT, their names will start with “adot”. Below is a screenshot showing how your metrics will look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3v16xjy6x36bjzf52tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3v16xjy6x36bjzf52tb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For metrics named like this, you have to create your own custom dashboard in Grafana. If you want to use an already available dashboard instead, leave the namespace empty (&lt;code&gt;namespace: ""&lt;/code&gt;); then no prefix will be added in front of your metrics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You'll also need to change YOUR_ACCOUNT_ID in the service account section of the Kubernetes configuration to your AWS account ID.&lt;/li&gt;
&lt;/ul&gt;
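&lt;p&gt;The service account section in question follows the standard IRSA pattern. A sketch is shown below; the service account name and namespace are taken from the trust policy earlier in this post, while &lt;code&gt;YOUR_ACCOUNT_ID&lt;/code&gt; remains the placeholder you substitute:&lt;/p&gt;

```yaml
# IRSA-style service account sketch; replace YOUR_ACCOUNT_ID with
# your AWS account ID. The name and namespace match the trust
# policy shown in the prerequisites above.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: amp-iamproxy-ingest-service-account
  namespace: adot-col
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::YOUR_ACCOUNT_ID:role/amp-iamproxy-ingest-role
```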

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztn88foigc8l7cxngfwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztn88foigc8l7cxngfwr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ADOT Prometheus Receiver supports the full set of Prometheus scraping and re-labelling configurations described in the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/" rel="noopener noreferrer"&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/a&gt; section of the Prometheus documentation, so you can paste these configurations directly into your ADOT Collector configuration. The configuration for the Prometheus Receiver includes your service discovery, scraping, and re-labelling configurations, and looks like the following.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

receivers:
  prometheus:
    config:
      [Your Prometheus configuration]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
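&lt;p&gt;As a concrete (illustrative) example, a minimal pod-scraping configuration dropped into that placeholder might look like this; the job name and scrape interval are assumptions, not values from the demo file:&lt;/p&gt;

```yaml
# Illustrative receiver config; job_name and scrape_interval are
# assumptions. Any valid Prometheus scrape config can go here.
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 15s
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
```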

&lt;p&gt;You can download the file we used for this demo with the command below.&lt;br&gt;
&lt;code&gt;You can customise this file or use your own file.&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

wget https://raw.githubusercontent.com/sagar0419/Adot-Configuration/master/adot.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create the namespace in Kubernetes where you are going to deploy the downloaded file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
kubectl create ns adot-col

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the namespace is created you can deploy your prometheus configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
kubectl apply -f adot.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can verify the configuration once it is deployed on the cluster with the following command: -&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
kubectl get all -n adot-col

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the configuration is deployed successfully, then you will get an output like this: -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffnsnwhviugyvs0sybzj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffnsnwhviugyvs0sybzj8.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Now ADOT is deployed on the cluster. To check whether it is sending telemetry data to AMP, run the command below, but first replace the required parameters: &lt;strong&gt;AMP_ENDPOINT&lt;/strong&gt; and &lt;strong&gt;AMP_REGION&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;(Note: AMP is a separate AWS service that does not run on EKS, so we cannot check it by running kubectl commands.)&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

awscurl --service="aps" --region="AMP_REGION" "https://AMP_ENDPOINT/api/v1/query?query=adot_test_gauge0"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You will get output similar to this.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&lt;p&gt;{"status":"success","data":{"resultType":"vector","result":[{"metric":{"&lt;strong&gt;name&lt;/strong&gt;":"adot_test_gauge0"},"value":[1606512592.493,"16.87214000011479"]}]}}&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Grafana for ADOT visualisation
&lt;/h3&gt;

&lt;p&gt;Now we need to install Grafana on our Kubernetes cluster. You can deploy it anywhere, or you can choose Grafana Cloud; for this tutorial we are going to deploy it on Kubernetes.&lt;/p&gt;

&lt;p&gt;To deploy Grafana on your cluster you can use this &lt;strong&gt;&lt;a href="https://github.com/grafana/helm-charts/tree/main/charts/grafana" rel="noopener noreferrer"&gt;helm chart&lt;/a&gt;&lt;/strong&gt;. Please follow this &lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-query-grafana-7.3.html" rel="noopener noreferrer"&gt;document&lt;/a&gt;&lt;/strong&gt; to enable SigV4 and IRSA in your Grafana helm chart.&lt;/p&gt;

&lt;p&gt;Once Grafana is deployed, log in to the Grafana console. Navigate to Settings and select Data sources to add Amazon Managed Service for Prometheus as a data source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkt9pnp4fgmwvm2eaecs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkt9pnp4fgmwvm2eaecs.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select Add data source, then Prometheus from the list as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t37iqe7v2uj71xzyezf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t37iqe7v2uj71xzyezf.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we paste the AMP Endpoint query URL (find this under the Summary tab on the AMP workspace) leaving out the api/v1/query portion (for example, &lt;a href="https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-3aa5f57b-yy11-xx00-12ab-ea86005d6dd7/" rel="noopener noreferrer"&gt;https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-3aa5f57b-yy11-xx00-12ab-ea86005d6dd7/&lt;/a&gt;) in the URL field under HTTP. We need to enable SigV4 auth in the Auth section.&lt;/p&gt;

&lt;p&gt;We also need to ensure that AWS SDK Default is selected in Authentication Provider under the Sigv4 Auth Details section, then select the AWS Region in which the AMP workspace was created earlier in the Default Region drop-down. See the following screenshot for details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhdeo54ds0gvzfhnjhd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhdeo54ds0gvzfhnjhd.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we can choose Save &amp;amp; Test. We should see a green banner that says “Data source is working” as shown in the following.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs4hdecklvjjt1ltfn0h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs4hdecklvjjt1ltfn0h.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Query the metrics from AMP to verify the setup
&lt;/h3&gt;

&lt;p&gt;Next, we’ll create a new Dashboard from the left navigation bar by choosing the + sign.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dea3xew1fzkwu2vyfe8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dea3xew1fzkwu2vyfe8.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then add a new panel and select the new AMP data source configured previously.&lt;/p&gt;

&lt;p&gt;We can write a simple PromQL query in the Metrics textbox, and we should see the metrics in the panel as shown in the screenshot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt64q2wdv1dlsz4pc0bd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt64q2wdv1dlsz4pc0bd.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also create or upload a custom dashboard. In this demo, we are using the &lt;strong&gt;Node Exporter for Prometheus&lt;/strong&gt; dashboard. Below is a screenshot of the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F354swiyztvydme5pvnyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F354swiyztvydme5pvnyy.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Your cluster metrics are now available on the Grafana dashboard, collected by ADOT and stored in AMP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In this blog post, we saw how we can monitor our EKS cluster using AWS ADOT and AMP. We also saw how to visualise the metrics generated by EKS in Grafana.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. I would love to hear your thoughts on this post, so do start a conversation on &lt;strong&gt;&lt;a href="https://twitter.com/sagarrajput27" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/sagar-parmar-834403a6/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/strong&gt; .&lt;/p&gt;

&lt;p&gt;Here are some of my other articles that you may find interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.devgenius.io/integration-of-opentelemetry-auto-instrumentation-with-jaeger-4de147a64f38" rel="noopener noreferrer"&gt;Monitoring your application using OpenTelemetry and Jaeger.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.devgenius.io/vcluster-architecture-overview-and-installation-d41b6262b2f8" rel="noopener noreferrer"&gt;Multitenancy in Kubernetes cluster using vCluster.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://medium.com/dev-genius/backup-restore-and-migrate-kubernetes-cluster-resources-using-velero-a9b6997e4b54" rel="noopener noreferrer"&gt;Backup and Restore Kubernetes Cluster.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Until Next time....&lt;/p&gt;

</description>
      <category>aws</category>
      <category>opentelemetry</category>
      <category>kubernetes</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>AWS Patch Management</title>
      <dc:creator>Sagar Parmar</dc:creator>
      <pubDate>Thu, 02 Mar 2023 08:44:12 +0000</pubDate>
      <link>https://forem.com/sagar0419/aws-patch-management-28g</link>
      <guid>https://forem.com/sagar0419/aws-patch-management-28g</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction: -&lt;/strong&gt;&lt;br&gt;
AWS Patch Manager automates the patching process for AWS-managed Linux and Windows instances, applying both security and non-security updates. With Patch Manager, we can scan instances for missing patches, or scan for and install all missing patches on our AWS-managed VMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:-&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS VM is up and running (in our scenario, we are using CentOS). The operating systems supported by Patch Manager can be checked on this link.&lt;/li&gt;
&lt;li&gt;You have SSH access to the VM.&lt;/li&gt;
&lt;li&gt;You have AWS access to create S3 buckets, IAM roles, and policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Assuming that you have all the prerequisites, we can now move forward with installing Patch Manager on our machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation: -&lt;/strong&gt;&lt;br&gt;
To install the AWS Systems Manager Agent on our machine, we need to run the SSM agent install command. If you are using CentOS 7, you can run the below-mentioned command on your machine; otherwise, you can get the command for your OS from this link.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the agent is installed, run the following command: -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload &amp;amp;&amp;amp; sudo systemctl restart amazon-ssm-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you will get an output like this: -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ny3LetVX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wd8tjuskbae5uydval8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ny3LetVX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wd8tjuskbae5uydval8x.png" alt="SSM-Agent-Service" width="669" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you get an output showing that the service is inactive: -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;● amazon-ssm-agent.service - amazon-ssm-agent
 Loaded: loaded (/etc/systemd/system/amazon-ssm-agent.service; enabled; vendor preset: disabled)
 Active: inactive (dead) since Tue 2022-04-19 15:58:44 UTC; 2s ago
 - truncated -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To activate the agent, run the below-mentioned commands: -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl enable amazon-ssm-agent
sudo systemctl daemon-reload &amp;amp;&amp;amp; sudo systemctl restart amazon-ssm-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IAM Instance Role:-&lt;/strong&gt;&lt;br&gt;
Assuming that your SSM agent is up and running, we now need to create an IAM instance profile so that our machine can communicate with Patch Manager.&lt;/p&gt;

&lt;p&gt;To create the IAM role, log in to your AWS console and navigate to the IAM section. Once there, click on "Roles" under "Access Management".&lt;/p&gt;

&lt;p&gt;Click on "Create role", A window will appear to select "Select the trusted entity". Select the trusted entity type &lt;strong&gt;"AWS Service"&lt;/strong&gt; and under the use case, select &lt;strong&gt;"EC2."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you are done with the selection, click on "Next."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YCTF7Z9V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x2erbwzuvpw4f5fpj1vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YCTF7Z9V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x2erbwzuvpw4f5fpj1vi.png" alt="Adding trust entity" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now a new window will appear; from this window, you can add permissions to your role. Here, search for &lt;strong&gt;"AmazonEC2RoleforSSM".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--33iarzh3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bqp9t775jjn1b4weha29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--33iarzh3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bqp9t775jjn1b4weha29.png" alt="Adding Permission" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the policy and click &lt;strong&gt;"Next"&lt;/strong&gt;, then give your role a name. In our scenario, we are using the name "demo-Patch-manager".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XSFHUzOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t91i40zurf70t4ye2ck6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XSFHUzOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t91i40zurf70t4ye2ck6.png" alt="Role Name" width="776" height="748"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll down and click on &lt;strong&gt;"Create Role"&lt;/strong&gt; and your role will be created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_G8J01Ch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jtsbmvexafv5er1tv7w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_G8J01Ch--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jtsbmvexafv5er1tv7w6.png" alt="Create Role" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we need to attach the newly created role to the instance. For this, navigate to the EC2 Console and select the instance that you want to add to your patch manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uq5OdK6M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9cleut9g6b8htwzdgtd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uq5OdK6M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9cleut9g6b8htwzdgtd5.png" alt="Instance which needs to be patched" width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After selecting the instance, click on "Actions", then navigate to "Security". Under the security option, select &lt;strong&gt;"Modify IAM role".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the new window, select the IAM role that you created in the previous step, then click on "Update IAM role", and the role will be attached to the machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s8F9DQaU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1hl7vvcl924lxj0gdh6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s8F9DQaU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1hl7vvcl924lxj0gdh6l.png" alt="Attaching IAM Role to the instance" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patching:-&lt;/strong&gt;&lt;br&gt;
To patch your instance, go to &lt;strong&gt;"AWS Systems Manager"&lt;/strong&gt; in the AWS console. Then click on &lt;strong&gt;"Patch Manager"&lt;/strong&gt; under &lt;strong&gt;"Node Management."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--raOQLgIJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ypp8g7p7dbplj42ymrrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--raOQLgIJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ypp8g7p7dbplj42ymrrr.png" alt="Patch Manager" width="766" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under Patch Manager, select &lt;strong&gt;"Configure patching"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bAyLXfQ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rbcajgfxz2hgmmwuw4aj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bAyLXfQ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rbcajgfxz2hgmmwuw4aj.png" alt="Patch Management Window" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A new window will open. Here you need to select the instance that you want to patch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_2X3yviI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vft4ilnd7ckhbnh0imjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_2X3yviI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vft4ilnd7ckhbnh0imjm.png" alt="Configuring Patch" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our scenario, our instance is "demo-patch," so we have selected that instance.&lt;/p&gt;

&lt;p&gt;In the next option, we need to select the patching window and whether we want to install the patches or just scan the machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZuemfP-t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7xm4kquncej5x0pbxkqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZuemfP-t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7xm4kquncej5x0pbxkqc.png" alt="Patching Schedule" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our scenario, we are using the option "&lt;strong&gt;Skip scheduling and patch instance now&lt;/strong&gt;". You can schedule the patch according to your requirements.&lt;/p&gt;

&lt;p&gt;Under "&lt;strong&gt;Patching Operation&lt;/strong&gt;," click on "Scan and Install" so that it can scan the machine and update the patches. If you only want to scan the machine and generate a list of patches that are available for installation, then select "Scan only."&lt;/p&gt;

&lt;p&gt;Once you have selected all the requirements, click on "&lt;strong&gt;Configure Patching&lt;/strong&gt;".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification: -&lt;/strong&gt;&lt;br&gt;
To verify the status of your patch command, navigate to &lt;strong&gt;"Run Command"&lt;/strong&gt; under &lt;strong&gt;"Node Management"&lt;/strong&gt; in the &lt;strong&gt;"AWS Systems Manager"&lt;/strong&gt; window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5zITAGoR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zc1whviddflk6v0ed39p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5zITAGoR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zc1whviddflk6v0ed39p.png" alt="Node Management" width="270" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the command and click on "View details."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--euOQ4QMs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/93j0wc8m50u91wfh744n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--euOQ4QMs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/93j0wc8m50u91wfh744n.png" alt="Command Details" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this window, you can check the status of the patch command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--odo_yRly--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n96ez300xfsjbcuegtgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--odo_yRly--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n96ez300xfsjbcuegtgt.png" alt="Patch Command status" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, our command was successful, which means our instance was patched successfully.&lt;/p&gt;
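&lt;p&gt;If you want to check the same status from the AWS CLI instead of the console, something like the following should work (the JMESPath filter below is only one way to shape the output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List recent Run Command invocations and their status
aws ssm list-command-invocations --details \
  --query 'CommandInvocations[].{Instance:InstanceId,Status:Status,Doc:DocumentName}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;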

&lt;p&gt;I hope you found this post informative and engaging. I would love to hear your thoughts on this post, so do start a conversation on &lt;strong&gt;&lt;a href="https://twitter.com/sagarrajput27"&gt;Twitter&lt;/a&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/sagar-parmar-834403a6/"&gt;LinkedIn&lt;/a&gt;&lt;/strong&gt; :)&lt;/p&gt;

&lt;p&gt;Here are some of my other articles that you may find interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.devgenius.io/integration-of-opentelemetry-auto-instrumentation-with-jaeger-4de147a64f38"&gt;OpenTelemetry Auto Instrumentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.devgenius.io/vcluster-architecture-overview-and-installation-d41b6262b2f8"&gt;vCluster&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Until Next time...&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>patchmanagement</category>
    </item>
    <item>
      <title>OpenTelemetry auto-instrumentation with Jaeger</title>
      <dc:creator>Sagar Parmar</dc:creator>
      <pubDate>Tue, 14 Feb 2023 06:40:42 +0000</pubDate>
      <link>https://forem.com/infracloud/opentelemetry-auto-instrumentation-with-jaeger-27eb</link>
      <guid>https://forem.com/infracloud/opentelemetry-auto-instrumentation-with-jaeger-27eb</guid>
<description>&lt;p&gt;In earlier days, it was easy to detect and debug a problem in monolithic applications because there was only one service running in the backend and front end. Now, we are moving towards a microservices architecture, where applications are divided into multiple independently deployable services. Each of these services has its own goal and logic to serve. In this kind of application architecture, it becomes difficult to observe how one service depends on or affects other services.&lt;/p&gt;

&lt;p&gt;To make the system observable, some logs, metrics, or traces must be emitted from the code, and this data must be sent to an observability backend. This is where OpenTelemetry and Jaeger come into the picture.&lt;/p&gt;

&lt;p&gt;In this blog post, we will see how to monitor application trace data (traces and spans) with the help of OpenTelemetry and Jaeger. A trace is used to observe requests as they propagate through the services in a distributed system, and a span is the basic unit of a trace; it represents a single event within the trace, and a trace can have one or multiple spans. A span consists of log messages, time-related data, and other attributes that provide information about the operation it tracks.&lt;/p&gt;

&lt;p&gt;We will use the distributed tracing method to observe requests moving across microservices, generating data about the request and making it available for analysis. The produced data will have a record of the flow of requests in our microservices, and it will help us understand our application's performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;Telemetry is the collection and transmission of data using agents and protocols from the source in observability. The telemetry data includes logs, metrics, and traces, which help us understand what is happening in our application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; (also known as OTel) is an open source framework comprising a collection of tools, APIs, and SDKs. OpenTelemetry makes generating, instrumenting, collecting, and exporting telemetry data easy. The data collected from OpenTelemetry is vendor-agnostic and can be exported in many formats. OpenTelemetry is formed after merging two projects &lt;a href="https://opensource.googleblog.com/2019/05/opentelemetry-merger-of-opencensus-and.html"&gt;OpenCensus and OpenTracing&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instrumenting
&lt;/h3&gt;

&lt;p&gt;The process of adding observability code to your application is known as instrumentation. Instrumentation helps make our application observable, meaning the code must produce some metrics, traces, and logs.&lt;br&gt;
OpenTelemetry provides two ways to instrument our code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manual Instrumentation&lt;/li&gt;
&lt;li&gt;Auto Instrumentation&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  1. Manual Instrumentation
&lt;/h4&gt;

&lt;p&gt;The user needs to add OpenTelemetry code to the application. Manual instrumentation provides more options for customization of spans and traces. Languages supported for manual instrumentation include C++, .NET, Go, Java, Python, etc. &lt;/p&gt;
&lt;h4&gt;
  
  
  2. Automatic Instrumentation
&lt;/h4&gt;

&lt;p&gt;It is the easiest way of instrumentation as it requires no code changes and no need to recompile the application. It uses an intelligent agent that gets attached to an application, reads its activity, and extracts the traces. Automatic instrumentation supports Java, NodeJS, Python, etc.&lt;/p&gt;
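&lt;p&gt;As an illustration (the jar path, service name, and collector endpoint below are assumptions for the sketch), attaching the OpenTelemetry Java agent requires no changes to the application code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The agent instruments the JVM at startup and exports traces over OTLP
export OTEL_SERVICE_NAME=my-service
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
java -javaagent:./opentelemetry-javaagent.jar -jar app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;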
&lt;h3&gt;
  
  
  Difference between Manual and Automatic Instrumentation
&lt;/h3&gt;

&lt;p&gt;Both manual and automatic instrumentation have advantages and disadvantages that you might consider while writing your code. A few of them are listed below: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Manual Instrumentation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Automatic Instrumentation&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code Changes are required.&lt;/td&gt;
&lt;td&gt;Code Changes are not required.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It supports a wider range of programming languages.&lt;/td&gt;
&lt;td&gt;Currently, .Net, Java, NodeJS and Python are supported.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It consumes a lot of time as code changes are required.&lt;/td&gt;
&lt;td&gt;Easy to implement as we do not need to touch the code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provide more options for the customization of spans and traces. As you have more control over the telemetry data generated by your application.&lt;/td&gt;
&lt;td&gt;Fewer options for customization.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Possibilities of error are high, as manual changes are required.&lt;/td&gt;
&lt;td&gt;Lower possibility of errors, as we don't have to touch the application code.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To make the instrumentation process hassle-free, use automatic instrumentation as it does not require any modification in the code and reduces the possibility of errors. Automatic instrumentation is done by an agent which reads your application's telemetry data, so no manual changes are required.&lt;/p&gt;

&lt;p&gt;For the scope of this post, we will see how you can use automatic instrumentation in a Kubernetes-based microservices environment.&lt;/p&gt;
&lt;h2&gt;
  
  
  Jaeger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.jaegertracing.io/"&gt;Jaeger&lt;/a&gt; is a distributed tracing tool initially built by Uber and released as open source in 2015. Jaeger is also a Cloud Native Computing Foundation graduate project and was influenced by Dapper and OpenZipkin. It is used for monitoring and troubleshooting microservices-based distributed systems. &lt;/p&gt;

&lt;p&gt;The Jaeger components which we have used for this blog are: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jaeger Collector&lt;/li&gt;
&lt;li&gt;Jaeger Query&lt;/li&gt;
&lt;li&gt;Jaeger UI / Console&lt;/li&gt;
&lt;li&gt;Storage Backend&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Jaeger Collector:&lt;/strong&gt; The component of the Jaeger distributed tracing system that receives and stores spans. After receiving spans, the collector adds them to a processing queue. Collectors require a persistent storage backend, so Jaeger provides a pluggable span storage mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jaeger Query:&lt;/strong&gt; A service that retrieves traces from storage and serves them to the web-based user interface. It provides various features and tools to help you understand the performance and behaviour of your distributed application, and enables you to search, filter, and visualise the data gathered by Jaeger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jaeger UI / Console:&lt;/strong&gt; Jaeger UI lets you view and analyse traces generated by your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage Backend:&lt;/strong&gt; Used to store the traces generated by an application for the long term. In this post, we are going to use Elasticsearch to store the traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Trn1izX6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4su4xws4l5pcswrrt0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Trn1izX6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4su4xws4l5pcswrrt0z.png" alt="Architecture of OpenTelemetry" width="880" height="488"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What is the need for integrating OpenTelemetry with Jaeger?
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry and Jaeger are the tools that help us in &lt;a href="https://dev.to/observability-consulting/"&gt;setting the observability in microservices-based distributed systems&lt;/a&gt;, but they are intended to address different issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt; provides an instrumentation layer for the application, which helps us generate, collect and export the telemetry data for analysis. In contrast, &lt;strong&gt;Jaeger&lt;/strong&gt; is used to store and visualize telemetry data.&lt;/p&gt;

&lt;p&gt;OpenTelemetry can only generate and collect the data. It does not have a UI for the visualization. So we need to integrate Jaeger with OpenTelemetry as it has a storage backend and a web UI for the visualization of the telemetry data. With the help of Jaeger UI, we can quickly troubleshoot microservices-based distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: OpenTelemetry can generate logs, metrics, and traces. Jaeger does not support logs and metrics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now you have an idea about OpenTelemetry and Jaeger. Let's see how we can Integrate them with each other to visualize the traces and spans generated by our application.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementing OpenTelemetry auto-instrumentation
&lt;/h2&gt;

&lt;p&gt;We will integrate OpenTelemetry with Jaeger, where OpenTelemetry will act as an instrumentation layer for our application, and Jaeger will act as the backend analysis tool to visualize the trace data.&lt;/p&gt;

&lt;p&gt;Jaeger will get the telemetry data from the OpenTelemetry agent. It will store the data in the storage backend, from where we will query the stored data and visualize it in the Jaeger UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites for this blog are:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The target Kubernetes cluster is up and running.&lt;/li&gt;
&lt;li&gt;You have access to run the &lt;code&gt;kubectl&lt;/code&gt; command against the Kubernetes cluster to deploy resources.&lt;/li&gt;
&lt;li&gt;cert-manager is installed and running. If it is not, you can install it by following the &lt;a href="https://cert-manager.io/docs/installation/#default-static-install"&gt;cert-manager.io&lt;/a&gt; installation docs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We assume that you have all the prerequisites in place and are ready for the installation. The files used in this post are available in this &lt;a href="https://github.com/infracloudio/Opentelemertrywithjaeger"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;The installation consists of three steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elasticsearch Installation&lt;/li&gt;
&lt;li&gt;Jaeger Installation&lt;/li&gt;
&lt;li&gt;OpenTelemetry Installation&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Elasticsearch
&lt;/h3&gt;

&lt;p&gt;By default, Jaeger uses in-memory storage to store spans, which is not a recommended approach for the production environment. There are various tools available to use as a storage backend in Jaeger; you can read about them in the official documentation of &lt;a href="https://www.jaegertracing.io/docs/1.40/deployment/#span-storage-backends"&gt;Jaeger span storage backend&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In this blog post, we will use Elasticsearch as the storage backend. You can deploy Elasticsearch in your Kubernetes cluster using the &lt;a href="https://artifacthub.io/packages/helm/elastic/elasticsearch"&gt;Elasticsearch Helm chart&lt;/a&gt;. While deploying Elasticsearch, ensure that password-based authentication is enabled and that Elasticsearch is deployed in the &lt;strong&gt;observability&lt;/strong&gt; namespace.&lt;/p&gt;

&lt;p&gt;Once Elasticsearch is deployed in the Kubernetes cluster, you can verify it by running the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; observability

NAME                            READY   STATUS    RESTARTS   AGE
pod/elasticsearch-0             1/1     Running   0          17m

NAME                            TYPE           CLUSTER-IP       EXTERNAL-IP          PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;           AGE
service/elasticsearch         ClusterIP          None            &amp;lt;none&amp;gt;         9200/TCP,9300/TCP      17m

NAME                             READY   AGE
statefulset.apps/elasticsearch   1/1     17m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Jaeger Installation
&lt;/h2&gt;

&lt;p&gt;We are going to use Jaeger to visualize the trace data. Let's deploy the Jaeger Operator on our cluster.&lt;/p&gt;

&lt;p&gt;Before proceeding with the installation, we will deploy a &lt;code&gt;ConfigMap&lt;/code&gt; in the observability namespace. In this ConfigMap, we pass the username and password of the Elasticsearch instance deployed in the previous step. Replace the credentials based on your setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; observability apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-configuration
  labels:
    app: jaeger
    app.kubernetes.io/name: jaeger
data:
  span-storage-type: elasticsearch
  collector: |
    es:
      server-urls: http://elasticsearch:9200
      username: elastic
      password: changeme
    collector:
      zipkin:
        http-port: 9411
  query: |
    es:
      server-urls: http://elasticsearch:9200
      username: elastic
      password: changeme
  agent: |
    collector:
      host-port: "jaeger-collector:14267"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you deploy Jaeger in another namespace, or you have changed the Jaeger collector service name, you need to update the host-port value under the agent collector section accordingly.&lt;/p&gt;
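&lt;p&gt;For example, if Jaeger were deployed in a hypothetical &lt;code&gt;tracing&lt;/code&gt; namespace, the agent section would reference the collector service by its namespace-qualified name, roughly like this sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;agent: |
  collector:
    host-port: "jaeger-collector.tracing:14267"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;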

&lt;h3&gt;
  
  
  Jaeger Operator
&lt;/h3&gt;

&lt;p&gt;The Jaeger Operator is a Kubernetes operator for deploying and managing Jaeger, an open source, distributed tracing system. It works by automating the deployment, scaling, and management of Jaeger components on a Kubernetes cluster. The Jaeger Operator uses custom resources and custom controllers to extend the Kubernetes API with Jaeger-specific functionality. It manages the creation, update, and deletion of Jaeger components, such as the Jaeger collector, query, and agent components. When a Jaeger instance is created, the Jaeger Operator deploys the necessary components and sets up the required services and configurations.&lt;/p&gt;

&lt;p&gt;We are going to deploy the Jaeger Operator in the observability namespace. Use the below-mentioned command to deploy the operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/jaegertracing/jaeger-operator/releases/download/v1.38.0/jaeger-operator.yaml &lt;span class="nt"&gt;-n&lt;/span&gt; observability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using the latest version of Jaeger, which is &lt;strong&gt;1.38.0&lt;/strong&gt; at the time of writing this article.&lt;/p&gt;

&lt;p&gt;By default, the Jaeger manifest is provided for cluster-wide mode. If you want the operator to watch only a particular namespace, you need to change the &lt;code&gt;ClusterRole&lt;/code&gt; to &lt;code&gt;Role&lt;/code&gt; and the &lt;code&gt;ClusterRoleBinding&lt;/code&gt; to &lt;code&gt;RoleBinding&lt;/code&gt; in the operator manifest, and set the &lt;code&gt;WATCH_NAMESPACE&lt;/code&gt; env variable on the Jaeger Operator deployment. &lt;/p&gt;
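&lt;p&gt;As a sketch, the env section of the operator Deployment would then look something like this (the &lt;code&gt;observability&lt;/code&gt; value here is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
- name: WATCH_NAMESPACE
  value: "observability"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;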

&lt;p&gt;To verify whether Jaeger is deployed successfully or not, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; observability

NAME                                    READY   STATUS    RESTARTS   AGE
pod/elasticsearch-0                     1/1     Running   0          17m
pod/jaeger-operator-5597f99c79-hd9pw    2/2     Running   0          11m

NAME                                      TYPE           CLUSTER-IP       EXTERNAL-IP             PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                  AGE
service/elasticsearch                     ClusterIP      None             &amp;lt;none&amp;gt;              9200/TCP,9300/TCP            17m
service/jaeger-operator-metrics           ClusterIP      172.20.220.212   &amp;lt;none&amp;gt;                 8443/TCP                  11m
service/jaeger-operator-webhook-service   ClusterIP      172.20.224.23    &amp;lt;none&amp;gt;                 443/TCP                   11m


NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/jaeger-operator     1/1        1            1       11m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/jaeger-operator-5597f99c79       1         1        1     11m

NAME                             READY   AGE
statefulset.apps/elasticsearch   1/1     17m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see in the above output, our Jaeger Operator is deployed successfully and all of its pods are up and running; this means the Jaeger Operator is ready to install Jaeger instances (CRs). The Jaeger instance will contain the Jaeger components (Query, Collector, Agent); later, we will use these components to query the trace data collected by OpenTelemetry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jaeger Instance
&lt;/h3&gt;

&lt;p&gt;A Jaeger Instance is a deployment of the Jaeger distributed tracing system. It is used to collect and store trace data from microservices or distributed applications, and provide a UI to visualize and analyze the trace data. To deploy the Jaeger instance, use the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/infracloudio/Opentelemertrywithjaeger/master/jaeger-production-template.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify the status of the Jaeger instance, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; observability

NAME                                    READY   STATUS    RESTARTS   AGE
pod/elasticsearch-0                     1/1     Running   0          17m
pod/jaeger-agent-27fcp                  1/1     Running   0          14s
pod/jaeger-agent-6lvp2                  1/1     Running   0          15s
pod/jaeger-collector-69d7cd5df9-t6nz9   1/1     Running   0          19s
pod/jaeger-operator-5597f99c79-hd9pw    2/2     Running   0          11m
pod/jaeger-query-6c975459b6-8xlwc       1/1     Running   0          16s

NAME                                      TYPE           CLUSTER-IP       EXTERNAL-IP             PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                                AGE
service/elasticsearch                     ClusterIP      None             &amp;lt;none&amp;gt;              9200/TCP,9300/TCP                          17m
service/jaeger-collector                  ClusterIP      172.20.24.132    &amp;lt;none&amp;gt;             14267/TCP,14268/TCP,9411/TCP,14250/TCP      19s
service/jaeger-operator-metrics           ClusterIP      172.20.220.212   &amp;lt;none&amp;gt;                    8443/TCP                             11m
service/jaeger-operator-webhook-service   ClusterIP      172.20.224.23    &amp;lt;none&amp;gt;                    443/TCP                              11m
service/jaeger-query                      LoadBalancer   172.20.74.114    a567a8de8fd5149409c7edeb54bd39ef-365075103.us-west-2.elb.amazonaws.com   80:32406/TCP                   16s
service/zipkin                            ClusterIP      172.20.61.72     &amp;lt;none&amp;gt;                 9411/TCP                                 18s

NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/jaeger-agent      2         2         2       2            2           &amp;lt;none&amp;gt;       16s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/jaeger-collector    1/1     1            1          21s
deployment.apps/jaeger-operator     1/1     1            1          11m
deployment.apps/jaeger-query        1/1     1            1          18s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/jaeger-collector-69d7cd5df9     1         1         1     21s
replicaset.apps/jaeger-operator-5597f99c79      1         1         1     11m
replicaset.apps/jaeger-query-6c975459b6         1         1         1     18s

NAME                             READY   AGE
statefulset.apps/elasticsearch    1/1    17m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see in the above output, our Jaeger instance is up and running.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;To install OpenTelemetry, we need to install the OpenTelemetry Operator. The OpenTelemetry Operator uses custom resources and custom controllers to extend the Kubernetes API with OpenTelemetry-specific functionality, making it easier to deploy and manage the OpenTelemetry observability stack in a Kubernetes environment.&lt;/p&gt;

&lt;p&gt;The operator manages two things: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Collectors:&lt;/strong&gt; It offers a vendor-agnostic implementation of how to receive, process and export telemetry data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-instrumentation&lt;/strong&gt; of the workload using OpenTelemetry instrumentation libraries. It does not require the end-user to modify the application source code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  OpenTelemetry operator
&lt;/h3&gt;

&lt;p&gt;To implement auto-instrumentation, we need to deploy the OpenTelemetry Operator on our Kubernetes cluster. To do so, follow the &lt;a href="https://opentelemetry.io/docs/k8s-operator/"&gt;Kubernetes operator documentation&lt;/a&gt;.&lt;/p&gt;
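&lt;p&gt;For reference, at the time of writing the documented installation boils down to a single &lt;code&gt;kubectl apply&lt;/code&gt; of the operator manifest from its GitHub releases; check the documentation above for the current, exact command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;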

&lt;p&gt;You can verify the deployment of the OpenTelemetry operator by running the below-mentioned command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt;  opentelemetry-operator-system

NAME                                                             READY   STATUS    RESTARTS    AGE
pod/opentelemetry-operator-controller-manager-7f479c786d-zzfd8    2/2    Running      0        30s

NAME                                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;    AGE
service/opentelemetry-operator-controller-manager-metrics-service   ClusterIP   172.20.70.244    &amp;lt;none&amp;gt;        8443/TCP   32s
service/opentelemetry-operator-webhook-service                      ClusterIP   172.20.150.120   &amp;lt;none&amp;gt;        443/TCP    31s

NAME                                                        READY   UP-TO-DATE   AVAILABLE      AGE
deployment.apps/opentelemetry-operator-controller-manager    1/1        1            1          31s

NAME                                                                   DESIRED   CURRENT   READY    AGE
replicaset.apps/opentelemetry-operator-controller-manager-7f479c786d      1         1        1      31s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see in the above output, the opentelemetry-operator-controller-manager deployment is running in the opentelemetry-operator-system namespace.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenTelemetry Collector
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry facilitates the collection of telemetry data via the OpenTelemetry Collector. The Collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data.&lt;/p&gt;

&lt;p&gt;The collector is made up of the following components: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receivers:&lt;/strong&gt; Manage how data gets into the collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processors:&lt;/strong&gt; Manage the processing of the collected data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exporters:&lt;/strong&gt; Responsible for sending the received data to a backend.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also need to export the telemetry data to the Jaeger instance. Use the following manifest to deploy the collector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:

    exporters:
      logging:
      jaeger:
        endpoint: "jaeger-collector.observability.svc.cluster.local:14250"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging, jaeger]
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above manifest, the Jaeger endpoint is the address of the Jaeger collector service running inside the observability namespace.&lt;/p&gt;

&lt;p&gt;We need to deploy this manifest in the same namespace where our application is deployed, so that it can fetch the traces from the application and export them to Jaeger.&lt;/p&gt;

&lt;p&gt;To verify the deployment of the collector run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy otel-collector

NAME              READY    UP-TO-DATE   AVAILABLE      AGE
otel-collector     1/1         1            1          41s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenTelemetry auto-instrumentation injection
&lt;/h3&gt;

&lt;p&gt;The operator deployed above can inject and configure OpenTelemetry auto-instrumentation libraries into an application at runtime. To enable auto-instrumentation on our cluster, we need to create an Instrumentation resource with the configuration for the SDK and instrumentation.&lt;/p&gt;

&lt;p&gt;Use the below-given manifest to create the auto-instrumentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above manifest, we have configured three things: the &lt;code&gt;exporter&lt;/code&gt;, &lt;code&gt;propagators&lt;/code&gt;, and &lt;code&gt;sampler&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exporter&lt;/strong&gt;: Sends the telemetry data to the OpenTelemetry Collector at the specified endpoint. In our scenario, it is "&lt;a href="http://otel-collector:4317"&gt;http://otel-collector:4317&lt;/a&gt;". &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Propagators&lt;/strong&gt;: Carry trace context and baggage data between distributed tracing systems. We have configured three propagation mechanisms: &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;tracecontext&lt;/strong&gt;: This refers to the W3C Trace Context specification, which defines a standard way to propagate trace context information between services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;baggage&lt;/strong&gt;: This refers to the OpenTelemetry baggage mechanism, which allows for the propagation of arbitrary key-value pairs along with the trace context information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;b3&lt;/strong&gt;: This refers to the B3 header format, which is a popular trace context propagation format used by the Zipkin tracing system.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sampler&lt;/strong&gt;: Uses a "parent-based trace ID ratio" strategy with a sampling ratio of 0.25 (25%). If a request's parent span was sampled, this request is sampled as well; for new (root) traces, the sampling decision is derived from the trace ID with a 25% probability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To verify that our custom resource has been created, we can use the below-mentioned command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get otelinst

NAME                   AGE          ENDPOINT                          SAMPLER                 SAMPLER ARG
my-instrumentation      6s     http://otel-collector:4317       parentbased_traceidratio          0.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means our custom resource is created successfully.&lt;/p&gt;

&lt;p&gt;We are using the OpenTelemetry auto-instrumentation method, so we don’t need to write instrumentation code in our application. All we need to do is add an annotation to our application’s pods. Below are the annotations we need to add to the deployment manifest.  &lt;/p&gt;

&lt;p&gt;As we are going to demo a Java application, the annotation which we will have to use here is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;instrumentation.opentelemetry.io/inject-java: &lt;span class="s2"&gt;"true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note: The annotation can be added to a namespace as well so that all pods within that namespace will get instrumentation, or by adding the annotation to individual PodSpec objects, available as part of Deployment, Statefulset, and other resources.&lt;/strong&gt;&lt;/p&gt;
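&lt;p&gt;As a sketch of the namespace-level approach, you could annotate a hypothetical &lt;code&gt;demo&lt;/code&gt; namespace so that every pod created in it gets instrumented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ kubectl annotate namespace demo instrumentation.opentelemetry.io/inject-java="true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;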

&lt;p&gt;Below is an example of how your manifest will look after adding the annotations. In the below example, we are using annotation for a Java application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-sagar&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
 &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-sagar&lt;/span&gt;
 &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo-sagar&lt;/span&gt;
     &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;instrumentation.opentelemetry.io/inject-java&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
       &lt;span class="na"&gt;instrumentation.opentelemetry.io/container-names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spring"&lt;/span&gt;
   &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spring&lt;/span&gt;
       &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sagar27/petclinic-demo&lt;/span&gt;
       &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have added the “inject-java” and “container-names” instrumentation annotations. If your pod has multiple containers, you can list them in the same “container-names” annotation, separated by commas, e.g. “container-name-1,container-name-2,container-name-3”.&lt;/p&gt;
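&lt;p&gt;As a hypothetical multi-container sketch (the &lt;code&gt;sidecar&lt;/code&gt; container name here is made up), the pod template annotations would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;annotations:
  instrumentation.opentelemetry.io/inject-java: "true"
  instrumentation.opentelemetry.io/container-names: "spring,sidecar"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;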

&lt;p&gt;After adding the annotations, deploy your application and access it in the browser. In our scenario, we are using port-forwarding to access the application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl port-forward service/demo-sagar  8080:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H6hXxkkv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7egjhqty78k6q4ql0rrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H6hXxkkv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7egjhqty78k6q4ql0rrj.png" alt="Demo Application" width="880" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To generate traces either you can navigate through all the pages of this website or you can use the following Bash script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;curl http://localhost:8080/
  curl http://localhost:8080/owners/find
  curl http://localhost:8080/owners?lastName&lt;span class="o"&gt;=&lt;/span&gt;
  curl http://localhost:8080/vets.html
  curl http://localhost:8080/oups
  curl http://localhost:8080/oups
  &lt;span class="nb"&gt;sleep &lt;/span&gt;0.01
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above script makes a curl request to each page of the website, and we will see the traces of those requests in the Jaeger UI. We are making the requests to &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt; because we are using port-forwarding to access the application.&lt;br&gt;
You can adjust the Bash script according to your scenario.&lt;/p&gt;

&lt;p&gt;Now let’s access the Jaeger UI. As our jaeger-query service is of type &lt;code&gt;LoadBalancer&lt;/code&gt;, we can access the Jaeger UI in the browser using the load balancer domain/IP. &lt;/p&gt;
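&lt;p&gt;One way to look up that address (assuming an AWS-style load balancer that exposes a hostname, as in our setup) is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ kubectl get service jaeger-query -n observability -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;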

&lt;p&gt;Paste the load balancer domain/IP into the browser and you will see the Jaeger UI. Select your app from the service list, and it will show the traces it has generated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NIbWPO1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/84de3fm8t2uds4jt9p4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NIbWPO1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/84de3fm8t2uds4jt9p4q.png" alt="Jaeger UI" width="880" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above screenshot, we have selected our app name “demo-sagar” under the services option and its traces are visible on Jaeger UI. We can further click on the traces to get more details about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TrsQhVZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rqnhcdcue1m2emgx3o6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TrsQhVZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rqnhcdcue1m2emgx3o6p.png" alt="Jaeger UI with Trace Details" width="880" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog post, we went through how you can easily instrument your application using the OpenTelemetry auto-instrumentation method. We also learned how this telemetry data can be exported to the Elasticsearch backend and visualized using Jaeger. &lt;/p&gt;

&lt;p&gt;Integrating OpenTelemetry with Jaeger will help you with monitoring and troubleshooting. It also helps with root cause analysis of bugs/issues in your microservices-based distributed systems, performance/latency optimization, service dependency analysis, and more.&lt;/p&gt;

&lt;p&gt;We hope you found this post informative and engaging. We would love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://twitter.com/sagarrajput27"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/sagar-parmar-834403a6/"&gt;LinkedIn&lt;/a&gt; :).&lt;/p&gt;

&lt;p&gt;If you want to implement observability in your microservices, talk to our &lt;a href="https://dev.to/observability-consulting/"&gt;observability consulting and implementation experts&lt;/a&gt;, and for more posts like this one, do visit our &lt;a href="https://dev.to/blogs/"&gt;blogs section&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jaegertracing.io/"&gt;Jaeger Tracing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>opentelemetry</category>
      <category>jaeger</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
