<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Kim</title>
    <description>The latest articles on Forem by Daniel Kim (@lazyplatypus).</description>
    <link>https://forem.com/lazyplatypus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F371344%2F3fecc61d-140e-4cad-bfa3-9e27e11b3a62.jpeg</url>
      <title>Forem: Daniel Kim</title>
      <link>https://forem.com/lazyplatypus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lazyplatypus"/>
    <language>en</language>
    <item>
      <title>How to identify and troubleshoot common Kubernetes errors</title>
      <dc:creator>Daniel Kim</dc:creator>
      <pubDate>Tue, 25 Apr 2023 18:16:44 +0000</pubDate>
      <link>https://forem.com/newrelic/how-to-identify-and-troubleshoot-common-kubernetes-errors-27jk</link>
      <guid>https://forem.com/newrelic/how-to-identify-and-troubleshoot-common-kubernetes-errors-27jk</guid>
      <description>&lt;p&gt;&lt;em&gt;To read this full New Relic article, &lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;click here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Debugging Kubernetes can be stressful and time-consuming. This blog post walks through some common errors at the container, pod, and node level so you can apply these tips to help your clusters run smoothly. Whether you’re new to Kubernetes or have been using it for a while, you’ll learn a comprehensive set of tools and methods for debugging Kubernetes issues. &lt;/p&gt;

&lt;p&gt;This post is part three in a Monitoring Kubernetes series that explains everything you need to quickly set up your Kubernetes clusters and monitor them with New Relic.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://newrelic.com/blog/best-practices/monitoring-kubernetes-part-one?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;part one&lt;/a&gt;, you learned that Kubernetes automates the mundane operational tasks of managing containers.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-two?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;part two&lt;/a&gt;, you learned how to optimize Kubernetes for your application's needs.&lt;/p&gt;

&lt;p&gt;We’ll build on the previous parts, diving into a wide range of topics, from understanding the basics of pods and containers to advanced troubleshooting techniques.&lt;/p&gt;


&lt;h2&gt;
  
  
  Troubleshooting every level of your Kubernetes deployment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8l4r1vqnvao7jxc9n9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8l4r1vqnvao7jxc9n9.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting your nodes
&lt;/h2&gt;

&lt;p&gt;To begin this troubleshooting guide, let’s start at the top layer. If you want to know the health of your entire Kubernetes cluster, you’ll want to look at how the nodes in the cluster are working, at what capacity, the number of applications running on each node, and the resource utilization of the entire cluster. &lt;/p&gt;

&lt;p&gt;You can get all of these metrics and more by running this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl top node


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output of &lt;code&gt;kubectl top node&lt;/code&gt; includes metrics on the number of CPU cores and memory utilized, as well as overall CPU and memory utilization. This is a critical first step in troubleshooting as it allows you to see the current state of your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting your pods
&lt;/h2&gt;

&lt;p&gt;In Kubernetes, a pod is the smallest and simplest unit in the object model. It represents a single process running in a cluster. While a pod is the smallest unit in the Kubernetes object model, it can hold one or more containers, and these containers share the same network namespace, meaning they can communicate with each other using &lt;code&gt;localhost&lt;/code&gt;. Pods also have shared storage volumes, so all containers in a pod can access the same data.&lt;/p&gt;

&lt;p&gt;Pods have various states, including running, pending, failed, succeeded, and unknown. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The running state means that the pod's containers are running and healthy. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The pending state means that the pod has been created but one or more of its containers are not yet running. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The failed state means that one or more of the pod's containers has terminated with an error. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The succeeded state means that all containers in the pod have terminated successfully. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The unknown state means that the pod's state cannot be determined.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get the state of your pods, use the &lt;code&gt;kubectl get pods&lt;/code&gt; command in your terminal. The output displays the current state of all pods in the current namespace. By default, it shows the pod name, the current state (for example, running, pending, and so on), the number of ready containers, and the age of the pod.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl get pods


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can use the &lt;code&gt;-o wide&lt;/code&gt; option to get even more information on each pod, such as the IP address and hostname.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl get pods -o wide


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pods also generate events that can provide valuable information about their status. You can view these events by using the &lt;code&gt;kubectl describe&lt;/code&gt; command, which returns more detailed information about the pod, including its current state, its IP address, and the status of its containers.&lt;/p&gt;

&lt;p&gt;This is useful for understanding why a pod is in a particular state. For example, an event might indicate that a pod was evicted due to a lack of resources, or that a container failed to start due to an error in the application code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl describe pod &amp;lt;pod-name&amp;gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also filter pods based on their state. For example, to check all the running pods, use this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl get pods --field-selector=status.phase=Running



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Additionally, you can use the &lt;code&gt;kubectl top pod&lt;/code&gt; command to get resource usage statistics for the pods.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl top pod



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Are you currently running a Kubernetes cluster? If so, try using some of these commands to see what output you get. &lt;/p&gt;

&lt;h2&gt;
  
  
  Common pod errors
&lt;/h2&gt;

&lt;p&gt;Now that you know the current state of your pods, this section introduces common issues with Kubernetes resources (Pods, Services, or StatefulSets). We’ll cover how to make sense of, troubleshoot, and resolve each of these common issues.&lt;/p&gt;

&lt;p&gt;Although there are more issues you can encounter when working with Kubernetes, the list in this section covers the instances you’re most likely to run into.&lt;/p&gt;

&lt;p&gt;Use this list to jump directly to the section about these common issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three#toc-crashloopbackoff-error?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;CrashLoopBackOff error&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three#toc-imagepullbackoff-errimagepull-error?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;ImagePullBackOff/ErrImagePull error&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three#toc-oomkilled-error?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;OOMKilled error&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three#toc-createcontainerconfigerror-and-createcontainererror?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;CreateContainerConfigError and CreateContainerError&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three#toc-pods-are-stuck-in-pending-or-waiting?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;Pods are stuck in pending or waiting&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CrashLoopBackOff error
&lt;/h2&gt;

&lt;p&gt;One of the most common errors you’ll encounter while working with Kubernetes is the &lt;code&gt;CrashLoopBackOff&lt;/code&gt; error. This error occurs in Kubernetes environments typically when a container in a pod crashes and the pod's restart policy is set to &lt;code&gt;Always&lt;/code&gt;. In this scenario, Kubernetes will keep trying to restart the container, but if it continues to crash, the pod will enter a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquhumkkd97uu7hfj01dj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquhumkkd97uu7hfj01dj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to identify a CrashLoopBackOff error
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;kubectl get pods&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check the output to see if:
&lt;ul&gt;
&lt;li&gt;The pod’s status has a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; error.&lt;/li&gt;
&lt;li&gt;There is more than one restart.&lt;/li&gt;
&lt;li&gt;Pods aren't identified as ready.&lt;/li&gt;
&lt;/ul&gt;
This example shows 0/1 as ready, 2 restarts, and a status of &lt;code&gt;CrashLoopBackOff&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ kubectl get pods 
NAME                 READY   STATUS               RESTARTS   AGE
example-pod           0/1    CrashLoopBackOff     2          4m26s


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  What causes a CrashLoopBackOff error?
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;CrashLoopBackOff&lt;/code&gt; status in the &lt;code&gt;STATUS&lt;/code&gt; column isn't the root cause of the problem—it simply indicates that the pod is experiencing a crash loop. To effectively troubleshoot and fix the issue, you'll need to identify and address the underlying error that’s causing the containers to crash.&lt;/p&gt;

&lt;p&gt;There are several possible causes for the &lt;code&gt;CrashLoopBackOff&lt;/code&gt; error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The container might be running out of memory or CPU resources. You can verify this by checking the resource usage of the container and pod using &lt;code&gt;kubectl&lt;/code&gt; commands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The container might be unable to start due to an issue with the image or configuration. For example, the image might be missing a required dependency, or the container might not have the necessary permissions to access certain resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The container might be crashing due to a bug in the application code. In this case, the logs of the container might provide more information about the cause of the crash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The container might be crashing due to a network issue, for example, the container might be unable to connect to a required service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Troubleshooting a CrashLoopBackOff error
&lt;/h2&gt;

&lt;p&gt;Once you identify the particular pod that is showing the &lt;code&gt;CrashLoopBackOff&lt;/code&gt; error, follow these steps to identify the root cause.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run this command:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl describe pod [name]



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;If the pod is failing due to a liveness probe failure or a back-off restarting failed container error, the command will provide valuable insights.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

From       Message
-----      -----
kubelet    Liveness probe failed: cat: can’t open ‘/tmp/healthy’: No such file or directory
kubelet    Back-off restarting failed container


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;If you get &lt;code&gt;back-off restarting failed container&lt;/code&gt;, and &lt;code&gt;Liveness probe failed&lt;/code&gt; error messages, it’s likely that the pod is experiencing a temporary resource overload caused by a spike in activity.
To resolve this issue, you can adjust the &lt;code&gt;periodSeconds&lt;/code&gt; or &lt;code&gt;timeoutSeconds&lt;/code&gt; parameters to give the application more time to respond. This allows the pod to recover.&lt;/li&gt;
&lt;/ol&gt;
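&lt;p&gt;To see where those parameters fit, here’s a minimal sketch of a liveness probe in a pod spec with more forgiving timing (the pod name, image, and values are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: example/app:1.0
    livenessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 10
      periodSeconds: 20    # probe less often
      timeoutSeconds: 5    # give the app more time to respond
      failureThreshold: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;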

&lt;h2&gt;
  
  
  ImagePullBackOff/ErrImagePull error
&lt;/h2&gt;

&lt;p&gt;In a Kubernetes cluster, there’s an agent on each node called the kubelet that’s responsible for running containers on that node. If a container image doesn’t already exist on a node, the kubelet will instruct the container runtime to pull it. &lt;/p&gt;

&lt;p&gt;When a Kubernetes pod encounters an issue with pulling an image, it will initially generate an &lt;code&gt;ErrImagePull&lt;/code&gt; error. The system will then retry a few times to download the image before ultimately backing off and scheduling another attempt. With each unsuccessful attempt, the delay between retries increases exponentially, up to a maximum delay of five minutes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ImagePullBackOff&lt;/code&gt; and &lt;code&gt;ErrImagePull&lt;/code&gt; errors in Kubernetes environments typically occur when the Kubernetes node is unable to pull the specified image from the container registry. This can happen for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The image might not exist in the specified container registry, or the image name might be misspelled in the pod definition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The image might be private, and the pod doesn’t have the necessary credentials to pull it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The pod's network might not have access to the container registry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The pod might not have enough permissions to pull the image from the container registry.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to identify an ImagePullBackOff error
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Run &lt;code&gt;kubectl get pods&lt;/code&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check the output to see if the pod’s status has an &lt;code&gt;ImagePullBackOff&lt;/code&gt; error:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ kubectl get pods 
NAME                 READY   STATUS             RESTARTS   AGE
example-pod           0/1    ImagePullBackOff   0          4m26s


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  How to troubleshoot an ImagePullBackOff error
&lt;/h2&gt;

&lt;p&gt;To troubleshoot the &lt;code&gt;ImagePullBackOff&lt;/code&gt; error, first run &lt;code&gt;kubectl describe&lt;/code&gt;. Review the specific error under events. Take these recommended actions for each of the errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Repository does not exist or no pull access&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This means that the repository specified in the pod doesn’t exist in the Docker registry the cluster is using.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By default, images are pulled from Docker Hub, but your cluster might be using one or more private registries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The error might occur because the pod doesn’t specify the correct repository name, or doesn’t specify the correct fully qualified image name (for example, username/imagename).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another possible cause is that Docker Hub or another container registry’s rate limits prevent the kubelet from fetching the image.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Manifest not found&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This means that the specific version of the requested image was not found. If you specified a tag, the tag was incorrect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To resolve it, double-check that the tag in the pod specification is correct and that it exists in the repository. Keep in mind that tags in the repo might have changed. If you didn’t specify a tag, check whether the image has a &lt;strong&gt;latest&lt;/strong&gt; tag; images without a &lt;strong&gt;latest&lt;/strong&gt; tag won’t be returned unless you specify a valid tag.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Authorization failed&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In this case, the credentials you provided can’t access the container registry or the specific image you requested.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To resolve this, create a Kubernetes Secret with the appropriate credentials and reference it in the pod specification. If you already have a Secret with credentials, ensure those credentials have permission to access the required image, or grant access in the container registry.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
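&lt;p&gt;For the &lt;code&gt;Authorization failed&lt;/code&gt; case, a minimal sketch of creating such a Secret and referencing it from a pod spec might look like this (the registry, credential, and image values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Create a Secret holding the registry credentials
kubectl create secret docker-registry regcred \
  --docker-server=&amp;lt;your-registry&amp;gt; \
  --docker-username=&amp;lt;your-username&amp;gt; \
  --docker-password=&amp;lt;your-password&amp;gt;

# Reference it from the pod specification
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: &amp;lt;your-registry&amp;gt;/app:1.0
  imagePullSecrets:
  - name: regcred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;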
&lt;h2&gt;
  
  
  OOMKilled error
&lt;/h2&gt;

&lt;p&gt;In Kubernetes, kubelets running on your virtual machines (VMs) have something called a Memory Manager that tracks memory usage for various processes, including out-of-memory (OOM) issues. When the VM comes close to running out of memory, the Memory Manager kills as few pods as necessary to free up enough memory to prevent the entire system from crashing. &lt;/p&gt;

&lt;p&gt;There are two different scenarios that cause an &lt;code&gt;OOMKilled&lt;/code&gt; error.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. The pod was terminated because a container limit was reached.
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Container Limit Reached&lt;/code&gt; error is specific to a single pod. When Kubernetes detects that a pod is using more memory than its set limit, it will terminate the pod with the error message &lt;code&gt;OOMKilled - Container Limit Reached&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv32h27nx1efpa4rpqds8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv32h27nx1efpa4rpqds8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To troubleshoot this error, it's important to check the application logs to understand why the pod was using more memory than its set limit. This could be due to a spike in traffic, a long-running Kubernetes job, or a memory leak in the application.&lt;/p&gt;

&lt;p&gt;Investigate! If you find that the application is running as expected and simply requires more memory to operate, consider increasing the values for the request and limit for that pod. Monitoring the resource usage and performance of the pod, and of the cluster, can also help you identify the problem, and find a way to prevent it in the future.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. The pod was terminated because the node was “overcommitted”
&lt;/h2&gt;

&lt;p&gt;In this scenario, the pods scheduled on the node, taken together, request more memory than is available on that node. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;OOMKilled: Limit Overcommit&lt;/code&gt; error can occur when the aggregate memory requirements of all pods on a node exceed the available memory on that node. You might recall seeing this issue in &lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-two?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;Part 2 of this series&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbjpu1lxo6g94f7pgxy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbjpu1lxo6g94f7pgxy8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, let’s imagine you have a node with 5 GB of memory, and you have five pods running on that node, each with a memory limit of 1 GB. The total allowed memory usage would be 5 GB, exactly the node’s capacity. However, if one of those pods is configured with a higher limit of, say, 1.5 GB, the total memory usage can exceed the available memory, leading to an &lt;code&gt;OOMKilled&lt;/code&gt; error. This can happen when the pod experiences a spike in traffic or an unexpected memory leak, causing Kubernetes to terminate pods to reclaim memory.&lt;/p&gt;

&lt;p&gt;It's important to check the host itself and ensure that there are no other processes running outside of Kubernetes that could be consuming memory resources, leaving less for the pods. It's also important to monitor memory usage and adjust the limits of pods accordingly.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to identify an OOMKilled error
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Run &lt;code&gt;kubectl get pods&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check the output to see if the pod’s status has an &lt;code&gt;OOMKilled&lt;/code&gt; error.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ kubectl get pods 
NAME                 READY   STATUS           RESTARTS   AGE
example-pod           0/1    OOMKilled        0          4m26s



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  How to troubleshoot an OOMKilled error
&lt;/h2&gt;

&lt;p&gt;How you respond to an &lt;code&gt;OOMKilled&lt;/code&gt; error depends on why the pod was terminated. It might have been terminated because of a container limit or an overcommitted node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the pod was terminated because of a container limit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To resolve the &lt;code&gt;OOMKilled: Container Limit Reached&lt;/code&gt; error, it's important to first determine whether the application truly requires more memory. If the application is facing increased load or use, it might require more memory than was originally allocated. In this scenario, you can increase the memory limit for the container in the pod specification to address the error. To check whether this is the case, run &lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt;&lt;/code&gt; for the particular pod and determine if there is a noticeable spike in requests.&lt;/p&gt;

&lt;p&gt;But if the memory usage unexpectedly increases and doesn’t appear to be related to application demand, it could indicate that the application is experiencing a memory leak. In this case, you need to debug the application and identify the source of the leak. Simply increasing the memory limit without addressing the underlying issue just consumes more resources without solving the problem. It’s important to address the root cause of the leak to prevent it from happening again.&lt;/p&gt;
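&lt;p&gt;If you do conclude that the application legitimately needs more memory, the request and limit are set per container in the pod specification; here’s a minimal sketch with hypothetical names and values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: example/app:1.0
    resources:
      requests:
        memory: "256Mi"   # what the scheduler reserves for the container
      limits:
        memory: "512Mi"   # the ceiling; exceeding it triggers OOMKilled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;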

&lt;p&gt;&lt;strong&gt;If the pod was terminated because of an overcommitted node&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pods are scheduled on a node based on their memory request value compared to the available memory on the node. But this can result in overcommitment of memory. To troubleshoot and resolve &lt;code&gt;OOMKilled&lt;/code&gt; errors caused by overcommitment, it's important to understand why Kubernetes terminated the pod. Then you can adjust the memory limits and requests to ensure that the node is not overcommitted.&lt;/p&gt;

&lt;p&gt;To prevent these issues from happening, it is important to monitor your environment constantly, understand the memory behavior of pods and containers, and regularly check your settings. This approach can help you identify potential issues early on and take appropriate action to prevent them from escalating. Having a good understanding of the memory behavior of your pods and containers—and knowing the settings you have configured—allows you to easily diagnose and resolve Kubernetes memory issues.&lt;/p&gt;
&lt;h2&gt;
  
  
  CreateContainerConfigError and CreateContainerError
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;CreateContainerConfigError&lt;/code&gt; and &lt;code&gt;CreateContainerError&lt;/code&gt; errors in Kubernetes typically occur when there’s a problem creating the container configuration for a pod. Some common causes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An invalid image name or tag: Make sure that the image name and tag specified in the pod definition are valid and can be pulled from the specified container registry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Missing image pull secrets: If the image is in a private registry, make sure that the necessary image pull secrets are defined in the pod definition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Insufficient permissions: Ensure that the service account used by the pod has the necessary permissions to pull the specified image from the registry.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How to identify a CreateContainerConfigError or CreateContainerError
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;kubectl get pods&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check the output to see if the pod’s status is &lt;code&gt;CreateContainerConfigError&lt;/code&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ kubectl get pods 
NAME                 READY   STATUS                       RESTARTS   AGE
example-pod           0/1    CreateContainerConfigError   0          4m26s


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  How to troubleshoot a CreateContainerConfigError or CreateContainerError
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check the pod definition for any errors or typos in the image name or tag. If the image doesn’t exist in the specified container registry, you’ll get this error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure that the specified image pull secrets are valid and exist in the namespace. Also make sure that the service account has the necessary permissions to pull the specified image from the registry. You can run the &lt;code&gt;kubectl auth can-i&lt;/code&gt; command to check if a service account has the necessary permissions to perform a specific action. For example, use this command to check if a service account named &lt;code&gt;my-service-account&lt;/code&gt; can pull the NGINX image from the NGINX repository:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl auth can-i pull nginx --as=system:serviceaccount:&amp;lt;namespace&amp;gt;:my-service-account


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Replace &lt;code&gt;&amp;lt;namespace&amp;gt;&lt;/code&gt; with the namespace where the service account is located. If the command returns &lt;code&gt;yes&lt;/code&gt;, the service account has the necessary permissions to pull the NGINX image. If it returns &lt;code&gt;no&lt;/code&gt;, the service account doesn't have the necessary permissions.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Check the Kubernetes logs for more information about the error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the command &lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt; to get more details about the pod and check for any error messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Pods are stuck in pending or waiting
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-two?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;part 2&lt;/a&gt;, we discussed how to “rightsize” your workloads with requests and limits. But what happens if you don’t rightsize correctly? Your pods’ status might be stuck in pending or waiting—because they aren’t able to be scheduled onto a node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pods are stuck in pending
&lt;/h2&gt;

&lt;p&gt;Look at the Events section of the &lt;code&gt;kubectl describe pod&lt;/code&gt; output for messages that indicate why the pod couldn’t be scheduled.&lt;/p&gt;

&lt;p&gt;Examples include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The cluster might have insufficient CPU or memory resources. This means you’ll need to delete some pods, add resources on your nodes, or add more nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The pod might be difficult to schedule due to specific resource requirements. See if you can relax some of those requirements to make the pod eligible for scheduling on additional nodes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
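&lt;p&gt;For example, a pod that can’t be scheduled for lack of CPU typically produces an event like this in the &lt;code&gt;kubectl describe pod&lt;/code&gt; output (the exact wording and node counts will vary with your cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;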

&lt;h2&gt;
  
  
  Pods are stuck in waiting
&lt;/h2&gt;

&lt;p&gt;If a pod’s status is waiting, it has been scheduled on a node but is unable to run. Run &lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt; and look in the Events section for reasons the pod can’t run.&lt;/p&gt;

&lt;p&gt;Most often, pods are stuck in waiting status because of an error when fetching the image. Check for these issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure the image name in the pod manifest is correct.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure the image is actually available in the repository.&lt;br&gt;
Test manually to see if you can retrieve the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run a &lt;code&gt;docker pull&lt;/code&gt; command on your local machine to ensure that you have the appropriate permissions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
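&lt;p&gt;For instance, to rule out a typo or a missing tag, you might try pulling the image manually (substitute your own registry, image, and tag):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Log in first if the registry is private
docker login &amp;lt;registry&amp;gt;

# If this fails locally, the pod will fail to fetch the image too
docker pull &amp;lt;registry&amp;gt;/&amp;lt;image&amp;gt;:&amp;lt;tag&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;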

&lt;h2&gt;
  
  
  Kubernetes troubleshooting with New Relic
&lt;/h2&gt;

&lt;p&gt;The troubleshooting process in Kubernetes is complex. Without the right tools, debugging can be stressful, ineffective, and time-consuming. Some best practices can help minimize the chances of things breaking down, but eventually something will go wrong—simply because it can. &lt;/p&gt;

&lt;p&gt;You can use New Relic as a single source of truth for all of your observability data. It collects all of your metrics, logs, and traces from every part of your Kubernetes stack, from the applications themselves, to the Kubernetes components, all the way down to the infrastructure metrics of your VMs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;To read this full New Relic article, &lt;a href="https://newrelic.com/blog/how-to-relic/monitoring-kubernetes-part-three?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;click here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not an existing New Relic user? &lt;a href="https://newrelic.com/signup?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=amer-fy-24-q1-devto-post" rel="noopener noreferrer"&gt;Sign up for a free account&lt;/a&gt; to get started! 👨‍💻&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to optimize Kubernetes resource configurations for cost and performance</title>
      <dc:creator>Daniel Kim</dc:creator>
      <pubDate>Tue, 17 Jan 2023 19:47:44 +0000</pubDate>
      <link>https://forem.com/newrelic/how-to-optimize-kubernetes-resource-configurations-for-cost-and-performance-ea4</link>
      <guid>https://forem.com/newrelic/how-to-optimize-kubernetes-resource-configurations-for-cost-and-performance-ea4</guid>
      <description>&lt;p&gt;Kubernetes, often abbreviated as K8s, automates the mundane operational tasks of managing the containers that make up the necessary software to run an application. With built-in commands for deploying applications, Kubernetes rolls out changes to your applications, scales your applications up and down to fit changing needs, monitors your applications, and more. Kubernetes orchestrates your containers wherever they run, which makes it easier to deploy across multiple cloud environments and migrate between infrastructure platforms. In short, &lt;strong&gt;Kubernetes makes it easier to manage applications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A properly configured Kubernetes system saves time and money, but configuring your Kubernetes clusters can be difficult. Improper configuration can lead to problems with application availability, performance, or resilience, or to overspending. Here in part two of this Kubernetes guide, you'll learn how to choose appropriate parameter configurations for any cluster you're working with now or in the future. You'll learn about requests and limits, measuring CPU utilization, and how to optimize Kubernetes resource allocation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7o18pmgrts5eu1zie0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7o18pmgrts5eu1zie0z.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Rightsizing your workloads with requests and limits
&lt;/h2&gt;

&lt;p&gt;In an ideal world, your Kubernetes pods would use exactly the amount of resources you requested. But, in the real world, resource usage isn’t predictable. If you have a large application on a node with limited resources, the node might run out of CPU or memory and things can break. And if you’ve been working as an engineer long enough, you know that things breaking in your architecture means frantic messages in the middle of the night and lost revenue for your organization. &lt;/p&gt;

&lt;p&gt;On the flip side, if you allocate too much CPU and memory, those resources remain reserved on the node and go to waste. When utilization is lower than the requested value, the difference is slack cost. When you design and configure a tech stack, the goal is to use the lowest-cost resources that still meet the technical specifications of a specific workload.&lt;/p&gt;

&lt;p&gt;To &lt;em&gt;rightsize&lt;/em&gt; workloads by optimizing the use of resources, it is important to know the historical usage and workload patterns of your system. With this knowledge, you can make informed cost-saving decisions. For instance, let’s say your average CPU utilization is only 40%, and on your highest-traffic day in the last two years CPU utilization spiked to only 60%. Your initially provisioned level of compute is too high! A simple change in configuration can result in large cost savings by reducing underutilized compute resources.&lt;/p&gt;

&lt;p&gt;Applying accurate resource requests and limits to deployments helps prevent both overprovisioning (allocating extra resources, which leads to underutilization and higher cluster costs) and underprovisioning (allocating fewer resources than required, which can lead to errors such as out-of-memory (OOM) events).&lt;/p&gt;

&lt;p&gt;Kubernetes uses requests and limits to control resources like CPU and memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foddjlkimctnk6azyrrcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foddjlkimctnk6azyrrcx.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Requests&lt;/em&gt; are resources a container is guaranteed to get. If a container requests a resource, the Kubernetes scheduler (kube-scheduler) will ensure the container is placed on a node that can accommodate it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Limits&lt;/em&gt; make sure a container never uses more than its allotted maximum.&lt;/p&gt;

&lt;p&gt;You can set requests and limits per container. Each container in the pod can have its own limit and request, but you can also set the values for limits and requests at the pod or namespace level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory allocation and utilization
&lt;/h2&gt;

&lt;p&gt;Memory resources are defined in bytes. You can express memory as a plain integer or a fixed-point number with one of these suffixes: &lt;code&gt;E, P, T, G, M, K, Ei, Pi, Ti, Gi, Mi, Ki&lt;/code&gt;. For example, the following represent approximately the same value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;128974848, 129e6, 129M, 123Mi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory is not a compressible resource and there is no way to throttle memory. If a container goes past its memory limit, it will be killed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory limits and memory utilization per pod
&lt;/h2&gt;

&lt;p&gt;When specified, a memory limit represents the maximum amount of memory a node will allocate to a container. Here are &lt;a href="https://docs.newrelic.com/docs/query-your-data/nrql-new-relic-query-language/get-started/introduction-nrql-new-relics-query-language/" rel="noopener noreferrer"&gt;NRQL&lt;/a&gt; examples of querying memory limits.&lt;/p&gt;

&lt;p&gt;NRQL that targets a New Relic metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT latest(cpuUsedCores/cpuLimitCores) FROM K8sContainerSample FACET podName TIMESERIES SINCE 1 day ago 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NRQL that targets a Prometheus metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT rate(sum(container_cpu_usage_seconds_total), 1 SECONDS) FROM Metric SINCE 1 MINUTES AGO UNTIL NOW FACET pod TIMESERIES LIMIT 20

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a limit is not provided in the manifest and there is no overall configured default, a pod could use the entirety of a node’s available memory. A node might be oversubscribed: the sum of the limits for all pods running on a node might be greater than that node’s total allocatable memory. This is possible because scheduling is based on requests, which must be below the limits. The node’s kubelet will reduce resource allocation to individual pods that use more than they request, as long as each pod still receives at least the amount it requested.&lt;/p&gt;

&lt;p&gt;Tracking pods’ actual memory usage in relation to their specified limits is particularly important because memory is a non-compressible resource. In other words, if a pod uses more memory than its defined limit, the kubelet can’t throttle its memory allocation, so it terminates the processes running on that pod instead. If this happens, the pod will show a status of OOMKilled.&lt;/p&gt;
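&lt;p&gt;One quick way to confirm that a container was OOM-killed is to inspect its last terminated state; this sketch assumes the container of interest is the first in the pod:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prints OOMKilled if the container was terminated for exceeding its memory limit
kubectl get pod &amp;lt;pod-name&amp;gt; -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;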

&lt;p&gt;Comparing your pods’ memory usage to their configured limits will alert you to whether they are at risk of being killed because they are out of memory (OOM), as well as whether their limits make sense. If a pod’s limit is too close to its standard memory usage, the pod may get terminated due to an unexpected spike. On the other hand, you may not want to set a pod’s limit significantly higher than its typical usage because that can lead to poor scheduling decisions. &lt;/p&gt;

&lt;p&gt;For example, a pod with a memory request of 1 gibibyte (GiB) and a limit of 4 GiB can be scheduled on a node with 2 GiB of allocatable memory (more than sufficient to meet its request). But if the pod suddenly needs 3 GiB of memory, it will be killed even though it’s well below its memory limit. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc283u226ms6tudmx1u9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc283u226ms6tudmx1u9k.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory requests and allocatable memory per node
&lt;/h2&gt;

&lt;p&gt;Memory requests are the minimum amounts of memory a node’s kubelet will assign to a container. &lt;/p&gt;

&lt;p&gt;If a request is not provided, it will default to whatever the value is for the container’s limit (which, if also not set, could be all memory on the node). Allocatable memory reflects the amount of memory on a node that is available for pods. Specifically, it takes the overall capacity and subtracts memory requirements for OS and Kubernetes system processes to ensure they won’t compete with user pods for resources.&lt;/p&gt;

&lt;p&gt;Although node memory capacity is a static value, its allocatable memory (the amount of compute resources that are available for pods) is not. Maintaining an awareness of the sum of pod memory requests on each node, versus each node’s allocatable memory, is important for capacity planning. These metrics will inform you if your nodes have enough capacity to meet the memory requirements of all current pods and if the kube-scheduler is able to assign new pods to nodes. To learn more about the difference between node allocatable memory and node capacity, see &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/" rel="noopener noreferrer"&gt;Reserve Compute Resources for System Daemons&lt;/a&gt; in the Kubernetes documentation. &lt;/p&gt;
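&lt;p&gt;You can see capacity versus allocatable resources for any node with &lt;code&gt;kubectl describe node&lt;/code&gt;. The output includes sections along these lines (the values here are illustrative and will differ on your cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe node &amp;lt;node-name&amp;gt;

Capacity:
  cpu:     4
  memory:  16374356Ki
Allocatable:
  cpu:     3920m
  memory:  15223380Ki
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;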

&lt;p&gt;The kube-scheduler uses several levels of criteria to determine if it can place a pod on a specific node. One of the initial tests is whether a node has enough allocatable memory to satisfy the sum of the requests of all the pods running on that node, plus the new pod. To learn more about the scheduling process criteria, see the &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler-implementation" rel="noopener noreferrer"&gt;node selection section&lt;/a&gt; of the Kubernetes scheduler documentation.&lt;/p&gt;

&lt;p&gt;Comparing memory requests to capacity metrics can also help you troubleshoot problems when launching and running the number of pods that you want to run across your cluster. If you notice that your cluster’s count of current pods is significantly less than the number of pods you want, these metrics might show you that your nodes don’t have the resource capacity to host new pods. One straightforward remedy for this issue is to provision more nodes for your cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring CPU utilization
&lt;/h2&gt;

&lt;p&gt;One CPU core is equivalent to 1000m (one thousand millicpu or one thousand millicores). If your container needs one full core to run, specify a value of 1000m or just 1. If your container needs 1⁄4 of a core, specify a value of 250m.&lt;/p&gt;

&lt;p&gt;CPU is a compressible resource. If your container starts hitting your CPU limits, it will be throttled. CPU will be restricted and performance will degrade. But it won’t be killed.&lt;/p&gt;

&lt;p&gt;To get important insight into cluster performance, you’ll need to track two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track the amount of CPU your pods are using compared to their configured requests and limits.&lt;/li&gt;
&lt;li&gt;Track the CPU utilization at the node level. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Much like a pod exceeding its CPU limits, a lack of available CPU at the node level can lead to the node throttling the amount of CPU allocated to each pod.&lt;/p&gt;

&lt;p&gt;Measuring actual utilization compared to requests and limits per pod will help determine whether these are configured appropriately and your pods are requesting enough CPU to run properly. Conversely, consistently higher-than-expected CPU usage might point to problems with the pod that need to be identified and addressed.&lt;/p&gt;

&lt;p&gt;Here's an NRQL query that shows CPU usage as a percentage of CPU requests per node. Try it on your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
filter(sum(`node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`), where true) /
filter(sum(kube_pod_container_resource_requests), WHERE (resource = 'cpu') and job = 'kube-state-metrics') * 100 as 'CPU Request Commitment'
FROM Metric FACET node since 1 minute ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s an NRQL query that shows CPU usage as a percentage of CPU limits per pod. Try it on your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT sum(`node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`) / filter(sum(kube_pod_container_resource_limits), WHERE (resource = 'cpu') and job = 'kube-state-metrics') * 100 as 'CPU Limit Commitment' FROM Metric FACET pod since 1 minute ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to optimize Kubernetes resource allocation
&lt;/h2&gt;

&lt;p&gt;To optimize your resource allocation, you’ll need to define pod specs, resource quotas, and limit range.&lt;/p&gt;

&lt;h3&gt;
  
  
  Define pod specs
&lt;/h3&gt;

&lt;p&gt;Here is a typical pod spec for resources: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc7z03rjxjnh1r1sl04y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc7z03rjxjnh1r1sl04y.png" alt="Image description" width="559" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each container in the pod can set its own requests and limits, which are all additive. So in this example, the pod has a total request of 64 mebibytes (MiB) of memory and a total limit of 128 MiB. Keep in mind that if you set a CPU request above the core count of your biggest node, your pod will never be scheduled. Unless your application is specifically architected to take advantage of multiple cores, it is generally good to keep your CPU request below 1 and leverage replicas to scale horizontally. &lt;/p&gt;
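&lt;p&gt;For reference, a single-container pod spec along the lines of the screenshot above might look like this (the pod name and image are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;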

&lt;h2&gt;
  
  
  Define resource quotas
&lt;/h2&gt;

&lt;p&gt;Without guardrails, developers can allocate any amount of resources to their applications running on Kubernetes. When several teams share a cluster with a fixed number of nodes, this becomes a problem. Kubernetes allows administrators to set hard limits for resource usage in namespaces with &lt;code&gt;ResourceQuotas&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnw8tpyxl7up23pnd1h0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnw8tpyxl7up23pnd1h0.png" alt="Image description" width="292" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you apply this file to a namespace, you’ll set the following requirements for all the containers of the namespace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sum of all the CPU requests can’t be higher than 0.5 cores.&lt;/li&gt;
&lt;li&gt;The sum of all the CPU limits can’t be higher than 0.8 cores.&lt;/li&gt;
&lt;li&gt;The sum of all the memory requests can’t be higher than 200 MiB.&lt;/li&gt;
&lt;li&gt;The sum of all the memory limits can’t be higher than 500 MiB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means you could have 50 containers with 4 MiB requests, five containers with 40 MiB requests, or even a single container with a 200 MiB request. &lt;/p&gt;
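&lt;p&gt;A &lt;code&gt;ResourceQuota&lt;/code&gt; manifest matching the values above would look roughly like this (the quota name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-quota
spec:
  hard:
    requests.cpu: 500m
    requests.memory: 200Mi
    limits.cpu: 800m
    limits.memory: 500Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Apply it to a namespace with &lt;code&gt;kubectl apply -f quota.yaml -n &amp;lt;namespace&amp;gt;&lt;/code&gt;.&lt;/p&gt;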

&lt;h3&gt;
  
  
  Define limit range
&lt;/h3&gt;

&lt;p&gt;You can also create a LimitRange for a namespace. Instead of looking at the namespace as a whole, a LimitRange applies to individual containers. &lt;/p&gt;

&lt;p&gt;Here’s an example of what a LimitRange might look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpomdqgl6xx0tgeqx462p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpomdqgl6xx0tgeqx462p.png" alt="Image description" width="474" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;default&lt;/code&gt; section sets the default limits for a container in a pod. If you use the values in the &lt;code&gt;LimitRange&lt;/code&gt;, any containers that don’t set limits themselves will be assigned the default values.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;defaultRequest&lt;/code&gt; section sets the default requests for a container in a pod. If you use the values in the &lt;code&gt;LimitRange&lt;/code&gt;, any containers that don’t set requests themselves will be assigned the default values.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;max&lt;/code&gt; section sets the maximum limits that a container in a pod can set. The &lt;code&gt;default&lt;/code&gt; section and the limits set on a container cannot be higher than this value. One thing to note: if the &lt;code&gt;max&lt;/code&gt; value is set and the &lt;code&gt;default&lt;/code&gt; is not, any containers that don’t set these values themselves will be assigned the &lt;code&gt;max&lt;/code&gt; value as the limit. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;min&lt;/code&gt; section sets the minimum requests that a container in a pod can set. The &lt;code&gt;defaultRequest&lt;/code&gt; section and the requests set on a container cannot be lower than this value. One thing to note: if this value is set and the &lt;code&gt;defaultRequest&lt;/code&gt; is not, the &lt;code&gt;min&lt;/code&gt; value becomes the &lt;code&gt;defaultRequest&lt;/code&gt; value.&lt;/p&gt;
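&lt;p&gt;Putting the four sections together, a &lt;code&gt;LimitRange&lt;/code&gt; manifest has this general shape (the specific values here are illustrative, not taken from the screenshot):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
spec:
  limits:
  - type: Container
    default:            # default limits
      cpu: 500m
      memory: 256Mi
    defaultRequest:     # default requests
      cpu: 250m
      memory: 128Mi
    max:                # highest limit a container may set
      cpu: "1"
      memory: 512Mi
    min:                # lowest request a container may set
      cpu: 100m
      memory: 64Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;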

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you’ve learned the basics of Kubernetes and why it needs monitoring in part one, and taken a deep dive into resource configuration here in part two, you might want to try out a few things on your own.&lt;/p&gt;

&lt;p&gt;A growing number of tools and frameworks are dedicated to helping visualize Kubernetes infrastructure efficiency. Here are two examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubecost provides real-time cost visibility and insights for teams using Kubernetes, helping you continuously reduce your cloud costs.&lt;/li&gt;
&lt;li&gt;OpenCost is a vendor-neutral open source project for measuring and allocating infrastructure and container costs in real time. (New Relic is a founding contributor.)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>beginners</category>
      <category>devops</category>
    </item>
    <item>
      <title>What is Kubernetes and how should you monitor it?</title>
      <dc:creator>Daniel Kim</dc:creator>
      <pubDate>Tue, 20 Dec 2022 19:13:30 +0000</pubDate>
      <link>https://forem.com/newrelic/what-is-kubernetes-and-how-should-you-monitor-it-bld</link>
      <guid>https://forem.com/newrelic/what-is-kubernetes-and-how-should-you-monitor-it-bld</guid>
      <description>&lt;p&gt;In this blog post, you'll learn what Kubernetes is and what components you’ll need for complete observability. It's the first part in a Monitoring Kubernetes series.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kubernetes?
&lt;/h2&gt;

&lt;p&gt;Kubernetes, often abbreviated as “K8s,” is an open source platform that has established itself as the de facto standard for container orchestration. Usage of Kubernetes has risen globally, particularly in large organizations, with the CNCF in 2021 reporting that there are 5.6 million developers using Kubernetes worldwide, representing 31% of all backend developers.&lt;/p&gt;

&lt;p&gt;As a container orchestration system, it automatically schedules, scales, and maintains the containers that make up the infrastructure of any modern application. The project is the flagship project of the Cloud Native Computing Foundation (CNCF). It’s backed by key players like Google, AWS, Microsoft, IBM, Intel, Cisco, and Red Hat.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can Kubernetes do?
&lt;/h2&gt;

&lt;p&gt;Kubernetes automates the mundane operational tasks of managing the containers that make up the necessary software to run an application. With built-in commands for deploying applications, Kubernetes rolls out changes to your applications, scales your applications up and down to fit changing needs, monitors your applications, and more. Kubernetes orchestrates your containers wherever they run, which facilitates multi-cloud deployments and migrations between infrastructure platforms. In short, Kubernetes makes it easier to manage applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated health checks
&lt;/h3&gt;

&lt;p&gt;Kubernetes continuously runs health checks against your services. For cloud-native apps, this means consistent container management. Using automated health checks, Kubernetes restarts containers that fail or have stalled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated operations
&lt;/h3&gt;

&lt;p&gt;You can automate mundane sysadmin tasks using Kubernetes since it comes with built-in commands that take care of a lot of the labor-intensive aspects of application management. Kubernetes can ensure that your applications are always running as specified in your configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure abstraction
&lt;/h3&gt;

&lt;p&gt;Kubernetes handles the compute, networking, and storage on behalf of your workloads. This allows developers to focus on applications and not worry about the underlying environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Kubernetes changes your monitoring strategy
&lt;/h2&gt;

&lt;p&gt;If you ever meet someone who tells you that Kubernetes is easy to understand, most would agree they’re lying to you!&lt;/p&gt;

&lt;p&gt;Kubernetes requires a new approach to monitoring, especially when you are migrating away from traditional hosts like VMs or on-prem servers. &lt;/p&gt;

&lt;p&gt;Containers may live for only a few minutes at a time, since they are deployed and redeployed as usage demand changes. How can you troubleshoot them if they no longer exist?&lt;/p&gt;

&lt;p&gt;These containers are also spread out across several hosts on physical servers worldwide. It can be hard to connect a failing process to the affected application without the proper context for the metrics you are collecting.&lt;/p&gt;

&lt;p&gt;To monitor a large number of short-lived containers, Kubernetes has built-in tools and APIs that help you understand the performance of your applications. A monitoring strategy that takes advantage of Kubernetes will give you a bird's eye view of your entire application’s performance, even if containers running your applications are continuously moving between hosts or being scaled up and down. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdki4yg25wi4cytmvt6nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdki4yg25wi4cytmvt6nu.png" alt="Image description" width="501" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Increased monitoring responsibilities
&lt;/h2&gt;

&lt;p&gt;To get full visibility into your stack, you need to monitor your infrastructure. Modern tech stacks have made the relationship between applications and their infrastructure more complicated than in the past. &lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional infrastructure
&lt;/h3&gt;

&lt;p&gt;In a traditional infrastructure environment, you only have two things to monitor: your applications and the hosts (servers or VMs) running them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8xesranmo0jtypyeznr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8xesranmo0jtypyeznr.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The introduction of containers
&lt;/h3&gt;

&lt;p&gt;In 2013, Docker introduced containerization to the world. Containers are used to package and run an application, along with its dependencies, in an isolated, predictable, and repeatable way. This adds a layer of abstraction between your infrastructure and your applications. Containers are similar to traditional hosts, in that they run workloads on behalf of the application. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gnnyoxlqmwiqgzf66wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gnnyoxlqmwiqgzf66wd.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes
&lt;/h3&gt;

&lt;p&gt;With Kubernetes, full visibility into your stack means collecting telemetry data on containers that are constantly and automatically being spun up and torn down, while also collecting telemetry data on Kubernetes itself. Gone are the days of checking a few lights on the server sitting in your garage!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gnszu8q4a025zkja2nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gnszu8q4a025zkja2nj.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are four distinct components that need to be monitored in a Kubernetes environment, each with its own specifics and challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure (worker nodes)&lt;/li&gt;
&lt;li&gt;Containers&lt;/li&gt;
&lt;li&gt;Applications &lt;/li&gt;
&lt;li&gt;Kubernetes clusters (control plane)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Correlating application metrics with infrastructure metrics with metadata
&lt;/h2&gt;

&lt;p&gt;While making it easier to build scalable applications, Kubernetes has blurred the lines between application and infrastructure. If you are a developer, your primary focus is on the application and not the cluster's performance, but the cluster's underlying components can have a direct effect on how well your application performs. For example, a bug in a Kubernetes application might be caused by an issue with the physical infrastructure, but it could also result from a configuration mistake or coding problem. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqgt7mvv4c4tgqs6ktsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqgt7mvv4c4tgqs6ktsd.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;When using Kubernetes, monitoring your application isn’t optional, it’s a necessity! &lt;/p&gt;

&lt;p&gt;Most Application Performance Monitoring (APM) language agents don’t care where an application is running. It could be running on an ancient Linux server in a forgotten rack or on the latest Amazon Elastic Compute Cloud (Amazon EC2) instance. However, when monitoring applications managed by an orchestration layer, infrastructure context is very useful for debugging and troubleshooting: it lets you relate an application error trace, for example, to the container, pod, or host it’s running on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring labels in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Kubernetes automates the creation and deletion of containers with varying lifespans. This entire process needs to be monitored. With so many moving pieces, a clear organization-wide labeling policy needs to be in place in order to match metrics to a corresponding application, pod, namespace, node, etc.&lt;/p&gt;
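&lt;p&gt;Labels live in an object's metadata. As a minimal sketch (the pod and label names here are illustrative), a production pod might be labeled like this:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: router-worker
  labels:
    name: prod       # environment label, matched by: kubectl get pods -l name=prod
    team: backend    # owning team
    region: emea     # deployment region
spec:
  containers:
    - name: router-worker
      image: example.com/router-worker:latest
```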

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlgnvk1ie7jwb3uvxs83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlgnvk1ie7jwb3uvxs83.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Consistent labeling of objects in your K8s cluster&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By attaching consistent labels across different objects, you can easily query your Kubernetes cluster for those objects. For example, suppose you get a call from your developers asking if the production environment is down. If the production pods carry a “prod” label, you can run the following kubectl command to list them all:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -l name=prod
NAME                             READY   STATUS         RESTARTS   AGE
router-worker-6db6999875-b8t8m   0/1     ErrImagePull   0          1d4h
router-worker-6db6999875-7fn7z   1/1     Running        0          47s
router-worker-6db6999875-8rl9b   1/1     Running        3          10h
router-worker-6db6999875-c7q2d   1/1     Running        2          11h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, you might spot that one of the prod pods has an issue pulling its image and provide that information to the developers who use it. If you didn’t have labels, you would have to manually grep the output of &lt;code&gt;kubectl get pods&lt;/code&gt;. &lt;/p&gt;
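&lt;p&gt;Label selectors work with most kubectl commands, not just &lt;code&gt;get&lt;/code&gt;, so the same label really can fetch the logs themselves, for example:&lt;/p&gt;

```shell
# Stream logs from every container in every pod carrying the prod label,
# prefixing each line with its pod name
kubectl logs -l name=prod --all-containers --prefix
```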

&lt;h2&gt;
  
  
  Common labeling conventions
&lt;/h2&gt;

&lt;p&gt;In the example above, you saw an instance in which pods are labeled “prod” to identify their use by environment. Every team operates differently, but the following naming conventions are common regardless of the team you work on:&lt;/p&gt;

&lt;h3&gt;
  
  
  Labels by environment
&lt;/h3&gt;

&lt;p&gt;You can label entities by the environment they belong to. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;env: production
env: qa
env: development
env: staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Labels by team
&lt;/h3&gt;

&lt;p&gt;Creating tags for team names can be helpful to understand which team, group, department, or region was responsible for a change that led to a performance issue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### Team tags

team: backend
team: frontend
team: db

### Role tags

roles: architecture
roles: devops
roles: pm

### Region tags

region: emea
region: america
region: asia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Labels by Kubernetes recommended labels
&lt;/h3&gt;

&lt;p&gt;Kubernetes provides a list of recommended labels that allow a baseline grouping of resource objects. The app.kubernetes.io prefix distinguishes between the labels recommended by Kubernetes and the custom labels that you may separately add using a company.com prefix. Some of the most popular recommended Kubernetes labels are listed below.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Name of the application (such as &lt;code&gt;redis&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/instance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unique name for this specific instance of the application (such as &lt;code&gt;redis-department-a&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/component&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A descriptive identifier of what the component is for (such as &lt;code&gt;login-cache&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;app.kubernetes.io/part-of&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The higher-level application using this resource (such as &lt;code&gt;company-auth&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
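&lt;p&gt;These recommended labels slot straight into any object's metadata alongside your custom ones. A minimal sketch, with illustrative names:&lt;/p&gt;

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-department-a
  labels:
    app.kubernetes.io/name: redis
    app.kubernetes.io/instance: redis-department-a
    app.kubernetes.io/component: login-cache
    app.kubernetes.io/part-of: company-auth
    # Custom labels keep their own prefix so they never clash with the recommended set
    company.com/env: production
```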

&lt;p&gt;With all of your Kubernetes objects labeled, you can query your observability data to get a bird’s eye view of your infrastructure and applications. You can examine every layer in your stack by filtering your metrics. And, you can drill into more granular details to find the root cause of an issue.&lt;/p&gt;

&lt;p&gt;Therefore, having a clear, standardized strategy for creating easy-to-understand labels and selectors should be an important part of your monitoring and alerting strategy for Kubernetes. Ultimately, health and performance metrics can only be aggregated by labels that you set. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So far, we’ve covered what Kubernetes is, what it can do, why it requires monitoring, and best practices on how to set up proper Kubernetes monitoring. &lt;/p&gt;

&lt;p&gt;In part two of this multi-part series, we’ll take a deep dive into Kubernetes architecture.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;To read this full blog post from New Relic, click &lt;a href="https://newrelic.com/blog/best-practices/monitoring-kubernetes-part-one?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=global-fy23-q3-devto_k8" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt; 📚&lt;/p&gt;

</description>
      <category>ai</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>Instrumenting Your Node.js Apps with OpenTelemetry</title>
      <dc:creator>Daniel Kim</dc:creator>
      <pubDate>Thu, 24 Jun 2021 16:21:26 +0000</pubDate>
      <link>https://forem.com/newrelic/instrumenting-your-node-js-apps-with-opentelemetry-5flb</link>
      <guid>https://forem.com/newrelic/instrumenting-your-node-js-apps-with-opentelemetry-5flb</guid>
      <description>&lt;p&gt;When I first joined New Relic, I really didn't understand the importance of observability, because I came from a frontend background. As I began learning about why observability was valuable for developers, I started digging deeper into the open source ecosystem and learning what makes it possible for modern apps to maintain uptime. I learned more about OpenTelemetry, a popular open source tool for monitoring your apps and websites, but it was intimidating because I couldn't find any introductory tutorials online guiding me through the process of instrumentation.&lt;/p&gt;

&lt;p&gt;It wasn't until I began instrumenting my own apps using the &lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;OpenTelemetry documentation&lt;/a&gt; that I realized how easy it was to get started. I collaborated with &lt;a href="https://freecodecamp.org" rel="noopener noreferrer"&gt;freeCodeCamp.org&lt;/a&gt; to create a beginner-friendly resource for anyone to begin instrumenting apps with OpenTelemetry. I worked with an amazing technical content creator named Ania Kubów to bring this one-hour video course to life. The course teaches you how to use OpenTelemetry and covers related concepts like microservices, observability, and tracing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting your Node.js apps with OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;As systems become increasingly complex, it’s ever more important to get visibility into their inner workings to improve performance and reliability. Distributed tracing shows how each request passes through the application, giving developers the context to resolve incidents by showing which parts of their system are slow or broken. &lt;/p&gt;

&lt;p&gt;A single trace shows the path a request makes, from the browser or mobile device down to the database. By looking at traces as a whole, developers can quickly discover which parts of their application have the biggest impact on performance and on their users’ experiences.&lt;/p&gt;

&lt;p&gt;That’s pretty abstract, right? So let’s zero in on a specific example to help clarify things. We’ll use OpenTelemetry to generate and view traces from a small sample application.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/r8UvWSX3KA8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Spinning up our Movies App
&lt;/h2&gt;

&lt;p&gt;We have written a simple application consisting of two microservices, movies and dashboard. The &lt;code&gt;movies&lt;/code&gt; service provides the names of movies and their genres in JSON format, while the &lt;code&gt;dashboard&lt;/code&gt; service returns the results from the movies service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/lazyplatypus/Open-Telemetry-Demo" rel="noopener noreferrer"&gt;👉 Clone the repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To spin up the app, run&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ npm i
$ node dashboard.js
$ node movies.js


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice the variable &lt;code&gt;delay&lt;/code&gt; built into the &lt;code&gt;movies&lt;/code&gt; microservice, which causes a random delay before returning the JSON.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="p"&gt;const express = require('express')
const app = express()
const port = 3000
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;app.get('/movies', async function (req, res) {
&lt;/span&gt;   res.type('json')
&lt;span class="gi"&gt;+  var delay = Math.floor( ( Math.random() * 2000 ) + 100);
+  setTimeout((() =&amp;gt; {
&lt;/span&gt;      res.send(({movies: [
         { name: 'Jaws', genre: 'Thriller'},
         { name: 'Annie', genre: 'Family'},
         { name: 'Jurassic Park', genre: 'Action'},
      ]}))
&lt;span class="gi"&gt;+  }), delay)
&lt;/span&gt;})
&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
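&lt;p&gt;The delay expression always lands between 100 and 2099 milliseconds: &lt;code&gt;Math.random()&lt;/code&gt; returns a value in [0, 1), so the scaled value is in [0, 2000), and flooring after adding 100 yields an integer in [100, 2099]. A quick standalone check:&lt;/p&gt;

```javascript
// Same expression the movies service uses for its artificial latency
function randomDelay() {
  return Math.floor((Math.random() * 2000) + 100);
}

// Sample it many times and verify the bounds hold
const samples = Array.from({ length: 5000 }, randomDelay);
const min = Math.min(...samples);
const max = Math.max(...samples);

console.log(`min=${min} max=${max}`); // min is always >= 100, max always <= 2099
if (min < 100 || max > 2099) throw new Error("delay out of expected range");
```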
&lt;h2&gt;
  
  
  Tracing HTTP Requests with OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry traces incoming and outgoing HTTP requests by attaching trace IDs to them. To do this, we need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instantiate a trace provider&lt;/strong&gt; to get data flowing. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure that trace provider with an exporter&lt;/strong&gt; to send telemetry data to another system where you can view, store, and analyze it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install OpenTelemetry plugins&lt;/strong&gt; to automatically instrument specific Node.js modules and frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need to have Docker on your machine to run a Zipkin instance. If you don't have Docker yet, it's easy to install. As for Zipkin, it's an open-source distributed tracing system created by Twitter that helps gather timing data needed to troubleshoot latency problems in service architectures. The OpenZipkin volunteer organization currently runs it. Finally, if you want to export your OpenTelemetry data to New Relic in Step 4,  sign up to analyze, store, and use your telemetry data for free, forever.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Create our trace provider and configure it with an exporter
&lt;/h3&gt;

&lt;p&gt;To create a trace provider, you need to install the following package:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ npm install @opentelemetry/node


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  OpenTelemetry auto-instrumentation package for Node.js
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;@opentelemetry/node&lt;/code&gt; module provides auto-instrumentation for Node.js applications; it automatically identifies frameworks (Express), common protocols (HTTP), databases, and other libraries within your application. The module relies on community-contributed plugins to instrument your application so that it produces spans and provides end-to-end tracing with just a few lines of code.&lt;/p&gt;
&lt;h4&gt;
  
  
  OpenTelemetry Plugins
&lt;/h4&gt;

&lt;p&gt;Install the plugins:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ npm install @opentelemetry/plugin-http
$ npm install @opentelemetry/plugin-express


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When Node.js’s HTTP module handles any API requests, the &lt;code&gt;@opentelemetry/plugin-http&lt;/code&gt; plugin generates trace data. The &lt;code&gt;@opentelemetry/plugin-express&lt;/code&gt; plugin generates trace data from requests sent through the Express framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Adding the Trace Provider and the Span Processor
&lt;/h3&gt;

&lt;p&gt;After tracers are added to an application, they record timing and metadata about the operations that take place (for example, a web server records exactly when it receives a request and when it sends a response).&lt;/p&gt;

&lt;p&gt;Add this code snippet to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create a trace provider&lt;/li&gt;
&lt;li&gt;add a span processor to the trace provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This code gets data from your local application and prints it to the terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeTracerProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SimpleSpanProcessor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/tracing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeTracerProvider&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;consoleExporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;spanProcessor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;consoleExporter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spanProcessor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to learn more about this code, check out the &lt;a href="https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#tracer" rel="noopener noreferrer"&gt;OpenTelemetry docs on tracers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once we add this code snippet, whenever we reload &lt;code&gt;http://localhost:3001/dashboard&lt;/code&gt;, we should see something like this: beautiful span data in the terminal. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffklulsur2tviyd32wlnf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffklulsur2tviyd32wlnf.gif" alt="giphy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Use Docker to install Zipkin and start tracing your application
&lt;/h3&gt;

&lt;p&gt;You instrumented OpenTelemetry in the previous step. Now you move the data that you collected to a running Zipkin instance.&lt;/p&gt;

&lt;p&gt;Let's spin up a Zipkin instance with the &lt;a href="https://hub.docker.com/r/openzipkin/zipkin/" rel="noopener noreferrer"&gt;Docker Hub image&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ docker run -d -p 9411:9411 openzipkin/zipkin


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and you’ll have a Zipkin instance up and running. You’ll be able to load it by pointing your web browser to &lt;a href="http://localhost:9411" rel="noopener noreferrer"&gt;http://localhost:9411&lt;/a&gt;. You’ll see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs7db7lo5gck3i2ty12e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs7db7lo5gck3i2ty12e.png" alt="Screenshot of Zipkin"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h4&gt;
  
  
  Exporting to Zipkin
&lt;/h4&gt;

&lt;p&gt;Although neat, spans in a terminal window are a poor way to gain visibility into a service. You’re not going to want to scroll through JSON data in your terminal. Instead, it’s a lot easier to see a visualization in a dashboard. Let's work on that now. In the previous step, we added a console exporter to the system. Now you ship this data to Zipkin.&lt;/p&gt;

&lt;p&gt;In this code snippet, we are instantiating a Zipkin exporter, and then adding it to the trace provider. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="p"&gt;const { NodeTracerProvider } = require('@opentelemetry/node')
const { ConsoleSpanExporter, SimpleSpanProcessor } = require('@opentelemetry/tracing')
&lt;/span&gt;&lt;span class="gi"&gt;+ const { ZipkinExporter } = require('@opentelemetry/exporter-zipkin')
&lt;/span&gt;&lt;span class="p"&gt;const provider = new NodeTracerProvider()
const consoleExporter = new ConsoleSpanExporter()
const spanProcessor = new SimpleSpanProcessor(consoleExporter)
provider.addSpanProcessor(spanProcessor)
provider.register()
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+ const zipkinExporter = new ZipkinExporter({
+  url: 'http://localhost:9411/api/v2/spans',
+  serviceName: 'movies-service'
&lt;/span&gt;})
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+ const zipkinProcessor = new SimpleSpanProcessor(zipkinExporter)
+ provider.addSpanProcessor(zipkinProcessor)
&lt;/span&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After you make these changes, start the application back up, request some URLs, and then visit our Zipkin instance at &lt;code&gt;localhost:9411&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscuaqg2k6y3zwwzospt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscuaqg2k6y3zwwzospt4.png" alt="Screen Shot 2021-06-23 at 3.54.32 PM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Using the OpenTelemetry Collector to export the data into New Relic
&lt;/h3&gt;

&lt;p&gt;What happens if you want to send your OpenTelemetry data to another backend, so that you don't have to manage all of your own telemetry data? &lt;/p&gt;

&lt;p&gt;Well, the amazing contributors to OpenTelemetry have come up with a solution for this!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2o98se1kb01bsjni2j6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2o98se1kb01bsjni2j6.png" alt="Group 1792"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The OpenTelemetry Collector is a way for developers to receive, process, and export telemetry data to multiple backends. This collector acts as the intermediary, getting data from the instrumentation and sending it to multiple backends to store, process, and analyze the data.&lt;/p&gt;

&lt;p&gt;It supports multiple open source observability data formats, like Zipkin, Jaeger, Prometheus, and Fluent Bit, and can send data to one or more open source or commercial backends.&lt;/p&gt;
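&lt;p&gt;To make this concrete, here is a minimal sketch of a collector configuration that receives spans in Zipkin format and forwards them to New Relic. The &lt;code&gt;newrelic&lt;/code&gt; exporter and its &lt;code&gt;apikey&lt;/code&gt; field are from the collector-contrib distribution of that era; treat the exact names as assumptions and check the repo below for the working config.&lt;/p&gt;

```yaml
# Minimal OpenTelemetry Collector config (sketch)
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411   # same port the apps already report spans to

exporters:
  newrelic:
    apikey: ${NEW_RELIC_API_KEY}

service:
  pipelines:
    traces:
      receivers: [zipkin]
      exporters: [newrelic]
```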

&lt;h4&gt;
  
  
  New Relic
&lt;/h4&gt;

&lt;p&gt;New Relic is a platform for you to analyze, store, and use your telemetry data for Free, forever. &lt;a href="https://newrelic.com/signup?utm_campaign=fy20-q1-amer-obsv-video-free_code_camp-video-&amp;amp;utm_medium=video&amp;amp;utm_source=free_code_camp&amp;amp;utm_content=video&amp;amp;fiscal_year=fy20&amp;amp;quarter=q1&amp;amp;program=obsv&amp;amp;audience=none&amp;amp;creative=none&amp;amp;placement=none&amp;amp;targeting=none&amp;amp;ad_type=none&amp;amp;geo=amer" rel="noopener noreferrer"&gt;Sign up now!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F692fiew3ync926gcfuw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F692fiew3ync926gcfuw1.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuring the OpenTelemetry Collector with New Relic
&lt;/h4&gt;

&lt;p&gt;Clone the &lt;a href="https://github.com/lazyplatypus/OpenTelemetry-NR-Exporter" rel="noopener noreferrer"&gt;OpenTelemetry Collector with New Relic Exporter&lt;/a&gt; and spin up the Docker container, making sure to export the New Relic API key. &lt;/p&gt;

&lt;p&gt;To get a key, go to the New Relic One dashboard and choose API keys from the dropdown menu in the upper right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmo09xbtn40wowmno4rp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmo09xbtn40wowmno4rp.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, from the API keys window, click the Create a key button.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h6l8j2bvt0tfmzemw2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h6l8j2bvt0tfmzemw2a.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When creating the key, make sure you choose the &lt;strong&gt;Ingest - License&lt;/strong&gt; key type. Then click Create a key to generate the key.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthkhp0cqaour3deyab71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthkhp0cqaour3deyab71.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After you have an API key, you need to replace &lt;code&gt;&amp;lt;INSERT-API-KEY-HERE&amp;gt;&lt;/code&gt; in the code snippet below with your API key.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export NEW_RELIC_API_KEY=&amp;lt;INSERT-API-KEY-HERE&amp;gt;
docker-compose -f docker-compose.yaml up


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Make sure to change the reporting URL from &lt;code&gt;http://localhost:9411/api/v2/spans&lt;/code&gt; to &lt;code&gt;http://localhost:9411/&lt;/code&gt; in both &lt;code&gt;dashboard.js&lt;/code&gt; and &lt;code&gt;movies.js&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="err"&gt;

&lt;/span&gt;&lt;span class="p"&gt;const zipkinExporter = new ZipkinExporter({&lt;br&gt;
&lt;/span&gt;&lt;span class="gd"&gt;- url: '&lt;a href="http://localhost:9411/api/v2/spans" rel="noopener noreferrer"&gt;http://localhost:9411/api/v2/spans&lt;/a&gt;',&lt;br&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+ url: '&lt;a href="http://localhost:9411" rel="noopener noreferrer"&gt;http://localhost:9411&lt;/a&gt;',&lt;br&gt;
&lt;/span&gt;  serviceName: 'movies-service'&lt;br&gt;
})&lt;br&gt;
&lt;span class="err"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Step 5: Look at your ✨ beautiful data ✨
&lt;/h3&gt;

&lt;p&gt;Navigate to the "Explorer" tab on &lt;a href="https://one.newrelic.com" rel="noopener noreferrer"&gt;New Relic One&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim53j8pdag4sl034us77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim53j8pdag4sl034us77.png" alt="New Relic One Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you click on the service, you should be able to see some ✨beautiful✨ traces!&lt;/p&gt;

&lt;p&gt;The trace in the dashboard is transmitting data about the random delay that was added to the API calls:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffati4q1acf3pqn675yhz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffati4q1acf3pqn675yhz.png" alt="OTel Traces"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Instrumenting your app with OpenTelemetry makes it easy to figure out what is going wrong when parts of your application are slow, broken, or both. With the collector, you can forward your data anywhere, so you are never locked into a vendor. You can choose to spin up an open source backend, use a proprietary backend like New Relic, or just roll your own backend! Whatever you choose, I wish you well in your journey to instrument EVERYTHING! &lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;You can try out New Relic One with OpenTelemetry by signing up for our always free tier today.&lt;/p&gt;

&lt;p&gt;To learn more about OpenTelemetry, look for our upcoming Understand OpenTelemetry blog series.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>opentelemetry</category>
      <category>observability</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
