<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yaar Naumenko</title>
    <description>The latest articles on Forem by Yaar Naumenko (@ynaumenko).</description>
    <link>https://forem.com/ynaumenko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F546357%2Fe7797afd-d623-4186-a1b9-217d0bee86bd.png</url>
      <title>Forem: Yaar Naumenko</title>
      <link>https://forem.com/ynaumenko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ynaumenko"/>
    <language>en</language>
    <item>
      <title>OpenClaw on GCP Cloud Run: Secure, Serverless, Multi-Tenant</title>
      <dc:creator>Yaar Naumenko</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:47:11 +0000</pubDate>
      <link>https://forem.com/ynaumenko/openclaw-on-gcp-cloud-run-secure-serverless-multi-tenant-1mpl</link>
      <guid>https://forem.com/ynaumenko/openclaw-on-gcp-cloud-run-secure-serverless-multi-tenant-1mpl</guid>
      <description>&lt;p&gt;A few days ago, &lt;a href="https://dev.to/mkreder"&gt;Matias Kreder&lt;/a&gt; published a great &lt;a href="https://dev.to/aws-builders/openclaw-on-aws-agentcore-secure-serverless-production-ready-i8n"&gt;article&lt;/a&gt; on running OpenClaw on AWS Bedrock AgentCore.&lt;br&gt;
The architecture was elegant: ephemeral containers, S3-backed workspace sync, per-user isolation, no always-on VMs.&lt;br&gt;
I was already running OpenClaw on a GKE node, and the bill was… fine, but the node was sitting there 24/7 whether anyone was chatting with the agent or not.&lt;/p&gt;

&lt;p&gt;After reading Matias’s post, I thought: GCP has all the same primitives. Can I replicate this pattern natively on GCP?&lt;/p&gt;

&lt;p&gt;Turns out yes — and in some ways the GCP path is even cleaner. &lt;br&gt;
Cloud Run v2 supports native GCSFuse volume mounts, which means you get a persistent workspace without a sync daemon, a sidecar, or a background timer. &lt;br&gt;
The filesystem just works across container restarts.&lt;br&gt;
This post walks through how I built a multi-tenant OpenClaw deployment on Cloud Run, with full per-tenant isolation, Telegram/Slack support, and a shared router service as the only public endpoint.&lt;br&gt;
The full repo is on GitHub: &lt;a href="https://github.com/cloudon-one/openclaw-serverless" rel="noopener noreferrer"&gt;openclaw-serverless&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekrkgsf0n76ifj3k0697.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekrkgsf0n76ifj3k0697.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCSFuse Workspace Persistence&lt;/strong&gt;&lt;br&gt;
Cloud Run containers are ephemeral — they spin up on demand and disappear when idle. OpenClaw stores everything it knows about a user under &lt;code&gt;.openclaw/&lt;/code&gt; (conversation memory, user profiles, tool outputs). Without a persistence strategy, that all disappears the moment a session ends.&lt;br&gt;
The solution here is simpler than the AWS approach: Cloud Run v2 has built-in GCSFuse support. The agent container gets a &lt;code&gt;/data&lt;/code&gt; volume mount backed by a per-tenant GCS bucket. &lt;br&gt;
The entrypoint writes &lt;code&gt;openclaw.json&lt;/code&gt; to that path on startup, and every file write the agent makes is transparently persisted to GCS. No sync loop, no &lt;code&gt;SIGTERM&lt;/code&gt; handler — it just works. Container restarts pick up exactly where the previous session left off.&lt;/p&gt;
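&lt;p&gt;For reference, the same volume wiring can be done straight from the CLI. This is a hedged sketch — the service, bucket, and volume names are placeholders, and the repo’s Terraform is the canonical source:&lt;/p&gt;

```shell
# Sketch: attach a per-tenant GCS bucket to a Cloud Run v2 service as a
# GCSFuse volume mounted at /data (all names are placeholders).
gcloud run services update openclaw-alice \
  --region us-central1 \
  --execution-environment gen2 \
  --add-volume name=workspace,type=cloud-storage,bucket=openclaw-alice-workspace \
  --add-volume-mount volume=workspace,mount-path=/data
```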

&lt;p&gt;&lt;em&gt;One intentional detail&lt;/em&gt;: config is always overwritten on startup from environment variables. GCSFuse persists agent state; environment variables drive configuration. A new deploy always wins over stale config on disk.&lt;/p&gt;
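&lt;p&gt;A minimal sketch of that startup behaviour (the file layout and env var names here are illustrative, not taken from the repo):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: regenerate openclaw.json from environment variables on every start,
# so a fresh deploy always overrides whatever config the GCSFuse volume kept.
# WORKSPACE defaults to /tmp/openclaw-data for illustration; in the real
# service it would be the /data mount.
WORKSPACE="${WORKSPACE:-/tmp/openclaw-data}"
mkdir -p "$WORKSPACE"
printf '{\n  "model": "%s",\n  "tenant": "%s"\n}\n' \
  "${OPENCLAW_MODEL:-claude-sonnet}" \
  "${TENANT_ID:-demo}" | tee "$WORKSPACE/openclaw.json"
```

&lt;p&gt;Only the config file is rewritten; agent state elsewhere under the mount is untouched.&lt;/p&gt;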

&lt;p&gt;&lt;strong&gt;Multi-Tenant Router&lt;/strong&gt;&lt;br&gt;
Rather than exposing each tenant’s Cloud Run service to the internet, a single lightweight Node.js router sits at the public endpoint. It validates webhook signatures (Telegram secret token, Slack HMAC-SHA256), looks up the tenant by user/channel ID, then forwards the request to the right tenant service using a GCP-issued ID token. Tenant services are deployed with &lt;code&gt;INGRESS_TRAFFIC_INTERNAL_ONLY&lt;/code&gt; — they are completely unreachable except through the router.&lt;br&gt;
Webhook secrets are fetched from Secret Manager and cached for 5 minutes. Both channels are fail-closed: requests without valid signatures are rejected before any tenant code runs.&lt;/p&gt;
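&lt;p&gt;The Slack leg of that check can be sketched with &lt;code&gt;openssl&lt;/code&gt; — the secret and payload below are made-up test values, and the real router does this in Node.js:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: Slack-style HMAC-SHA256 webhook verification. Slack signs the string
# "v0:{timestamp}:{raw body}" with the signing secret and sends the result in
# the X-Slack-Signature header; the router recomputes it and compares.
verify_slack() {
  secret="$1"; ts="$2"; body="$3"; presented="$4"
  computed="v0=$(printf 'v0:%s:%s' "$ts" "$body" | openssl dgst -sha256 -hmac "$secret" -hex | sed 's/^.* //')"
  if [ "$computed" = "$presented" ]; then echo valid; else echo rejected; fi
}

SECRET="test-signing-secret"
TS="1531420618"
BODY='token=abc123'
# Simulate the header a legitimate sender would attach.
GOOD="v0=$(printf 'v0:%s:%s' "$TS" "$BODY" | openssl dgst -sha256 -hmac "$SECRET" -hex | sed 's/^.* //')"

verify_slack "$SECRET" "$TS" "$BODY" "$GOOD"          # prints: valid
verify_slack "$SECRET" "$TS" "tampered-body" "$GOOD"  # prints: rejected
```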
&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt; Tenant Cloud Run services have internal-only ingress. The only public endpoint is the router service. Even within GCP, a caller needs a valid ID token to invoke a tenant service — ambient network access is not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-tenant isolation:&lt;/strong&gt; Each tenant gets its own Cloud Run service, GCS bucket, and service account. The tenant SA has &lt;code&gt;objectAdmin&lt;/code&gt; on its own bucket only — no IAM binding to any other tenant’s resources. Secrets are scoped per-tenant; the SA can access its own secrets plus the shared Anthropic API key, nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least-privilege IAM&lt;/strong&gt;: The router SA has secretAccessor on webhook secrets and &lt;code&gt;run.invoker&lt;/code&gt; on each tenant service. Tenant SAs have &lt;code&gt;secretAccessor&lt;/code&gt; on their own secrets and &lt;code&gt;objectAdmin&lt;/code&gt; on their own bucket. That’s it.&lt;/p&gt;
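&lt;p&gt;In gcloud terms, the full grant set per tenant looks roughly like this (names are placeholders; the repo’s Terraform applies the equivalent bindings):&lt;/p&gt;

```shell
# Sketch: the only three bindings a tenant needs (placeholders throughout).
# 1. Tenant SA reads its own secrets.
gcloud secrets add-iam-policy-binding openclaw-sl-alice-telegram-token \
  --member serviceAccount:openclaw-alice@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/secretmanager.secretAccessor

# 2. Tenant SA manages objects in its own bucket only.
gcloud storage buckets add-iam-policy-binding gs://openclaw-alice-workspace \
  --member serviceAccount:openclaw-alice@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/storage.objectAdmin

# 3. Router SA may invoke the tenant service.
gcloud run services add-iam-policy-binding openclaw-alice \
  --region us-central1 \
  --member serviceAccount:openclaw-router@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/run.invoker
```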

&lt;p&gt;&lt;strong&gt;Secret management:&lt;/strong&gt; Bot tokens, webhook secrets, and the Anthropic API key all live in Secret Manager. Nothing sensitive in environment variables or container images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device pairing bypass:&lt;/strong&gt; OpenClaw normally requires an interactive shell command to approve devices. Cloud Run has no shell. Setting &lt;code&gt;dmPolicy: allowlist&lt;/code&gt; with the tenant’s user ID in &lt;code&gt;allowFrom&lt;/code&gt; bypasses pairing entirely — safe because the router already validated the webhook source before the message arrived.&lt;/p&gt;
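&lt;p&gt;As a sketch, the relevant channel config might look like this — &lt;code&gt;dmPolicy&lt;/code&gt; and &lt;code&gt;allowFrom&lt;/code&gt; come from the text above, but the surrounding JSON shape is an assumption, not copied from the repo:&lt;/p&gt;

```json
{
  "channels": {
    "telegram": {
      "dmPolicy": "allowlist",
      "allowFrom": ["123456789"]
    }
  }
}
```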
&lt;h2&gt;
  
  
  Instructions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GCP project with billing enabled&lt;/li&gt;
&lt;li&gt;gcloud CLI authenticated&lt;/li&gt;
&lt;li&gt;terraform / opentofu installed&lt;/li&gt;
&lt;li&gt;Docker with linux/amd64 build support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the repo&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cloudon-one/openclaw-serverless
&lt;span class="nb"&gt;cd &lt;/span&gt;openclaw-serverless
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure your project&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set your GCP project&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-gcp-project-id
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-docker.pkg.dev/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/openclaw"&lt;/span&gt;
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Create the Artifact Registry repository and enable APIs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;run.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  secretmanager.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  artifactregistry.googleapis.com

gcloud artifacts repositories create openclaw &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository-format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Build and push both container images&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/build.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth configure-docker &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev

docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/agent:latest agent/
docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/router:latest router/
docker push &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/agent:latest
docker push &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/router:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Store your Anthropic API key&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"YOUR_ANTHROPIC_API_KEY"&lt;/span&gt; | gcloud secrets create openclaw-anthropic-api-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;- &lt;span class="nt"&gt;--replication-policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;automatic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Define your first tenant in &lt;code&gt;tenants.yaml&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tenants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;display_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Smith"&lt;/span&gt;
    &lt;span class="na"&gt;telegram_user_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TELEGRAM_USER_ID"&lt;/span&gt;
    &lt;span class="na"&gt;telegram_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;slack_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;min_instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="na"&gt;max_instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Deploy infrastructure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infrastructure
&lt;span class="nb"&gt;cp &lt;/span&gt;terraform.tfvars.example terraform.tfvars
&lt;span class="c"&gt;# Edit terraform.tfvars with your project ID, region, registry URL&lt;/span&gt;

terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform creates: service accounts, GCS buckets, Secret Manager secrets, and Cloud Run services (router + one per tenant).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Create a Telegram bot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message &lt;a href="https://t.me/BotFather"&gt;@BotFather&lt;/a&gt; on Telegram&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;/newbot&lt;/code&gt; command and copy the bot token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;9. Store tenant secrets&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Telegram bot token&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"YOUR_BOT_TOKEN"&lt;/span&gt; | gcloud secrets versions add &lt;span class="se"&gt;\&lt;/span&gt;
  openclaw-sl-alice-telegram-token &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-

&lt;span class="c"&gt;# Webhook validation secret (random)&lt;/span&gt;
openssl rand &lt;span class="nt"&gt;-hex&lt;/span&gt; 32 | gcloud secrets versions add &lt;span class="se"&gt;\&lt;/span&gt;
  openclaw-sl-alice-telegram-webhook-secret &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;10. Register the Telegram webhook&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ROUTER_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infrastructure &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform output &lt;span class="nt"&gt;-raw&lt;/span&gt; router_url&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud secrets versions access latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;openclaw-sl-alice-telegram-webhook-secret&lt;span class="si"&gt;)&lt;/span&gt;

curl &lt;span class="s2"&gt;"https://api.telegram.org/bot&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;YOUR_BOT_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/setWebhook"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"url=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROUTER_URL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/webhook/telegram"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"secret_token=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. Send a message to your bot on Telegram. The first response takes ~15–20 seconds for a cold start; subsequent messages in the same session are fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The solution works well, and the GCSFuse approach is genuinely nicer than S3 sync — one less moving part, no 5-minute flush window, no shutdown race condition. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A few things worth knowing before you deploy:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;cpu_idle: false&lt;/code&gt; adds cost but is required. Agent sessions involve async operations and WebSocket connections that break under CPU throttling. With &lt;code&gt;min_instances: 0&lt;/code&gt;, you’re only paying when the container is actually running, so this is acceptable — but it’s not free.&lt;br&gt;
&lt;strong&gt;Gen2 execution environment is non-negotiable&lt;/strong&gt;. GCSFuse is not available in Gen1. Set &lt;code&gt;execution_environment = "EXECUTION_ENVIRONMENT_GEN2"&lt;/code&gt; in Terraform, or the mount will silently fail.&lt;/p&gt;
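&lt;p&gt;Both settings live on the Cloud Run v2 service resource. A minimal Terraform sketch (resource and image names are illustrative):&lt;/p&gt;

```hcl
# Sketch: the two settings called out above, on a google_cloud_run_v2_service.
resource "google_cloud_run_v2_service" "tenant" {
  name     = "openclaw-alice"
  location = "us-central1"

  template {
    # GCSFuse volume mounts require the gen2 execution environment.
    execution_environment = "EXECUTION_ENVIRONMENT_GEN2"

    containers {
      image = "REGISTRY/agent:latest"
      resources {
        # Keep CPU allocated between requests so async work and
        # WebSocket connections are not throttled.
        cpu_idle = false
      }
    }
  }
}
```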

&lt;p&gt;&lt;strong&gt;Cold starts are real.&lt;/strong&gt; First message to an idle tenant takes 15–20 seconds. For async chat, this is fine; for anything latency-sensitive, it’s a problem. Set &lt;code&gt;min_instances: 1&lt;/code&gt; per tenant if you need it — just budget accordingly.&lt;/p&gt;

&lt;p&gt;Adding a second tenant is genuinely just one YAML entry and a &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The isolation model scales cleanly. Each tenant is a fully independent island with no shared state.&lt;/p&gt;

&lt;p&gt;One operational note: the Terraform state bucket needs to exist before &lt;code&gt;terraform init&lt;/code&gt;. Create it manually or bootstrap it separately — classic chicken-and-egg.&lt;/p&gt;

&lt;p&gt;Compared to the AWS AgentCore approach, the GCP version skips the NAT gateway entirely (Cloud Run has direct internet egress), which removes the ~$32/month baseline AWS cost.&lt;br&gt;
For a single personal agent, this architecture is essentially free at idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to try it?&lt;/strong&gt;&lt;br&gt;
The repo is at &lt;a href="https://github.com/cloudon-one/openclaw-serverless" rel="noopener noreferrer"&gt;https://github.com/cloudon-one/openclaw-serverless&lt;/a&gt;. &lt;br&gt;
If you run into issues or want to add other channels (WhatsApp, Discord), the router is straightforward to extend.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>gcp</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Lambda Fleet Monitoring with OpenSearch: Real-Time Insights at Scale</title>
      <dc:creator>Yaar Naumenko</dc:creator>
      <pubDate>Mon, 17 Feb 2025 10:37:22 +0000</pubDate>
      <link>https://forem.com/ynaumenko/lambda-fleet-monitoring-with-opensearch-real-time-insights-at-scale-ani</link>
      <guid>https://forem.com/ynaumenko/lambda-fleet-monitoring-with-opensearch-real-time-insights-at-scale-ani</guid>
      <description>&lt;p&gt;Do you manage multiple AWS accounts with countless Lambda functions — and feel overwhelmed by the complexity of monitoring them all? &lt;br&gt;
Look no further. The &lt;a href="https://github.com/cloudon-one/opensearch-monitoring" rel="noopener noreferrer"&gt;Lambda Fleet Monitoring Solution&lt;/a&gt; is a fully automated cross-account approach that tracks real-time metrics (invocations, errors, duration, and even cold starts) and funnels them into an OpenSearch cluster for robust analysis and visualization.&lt;br&gt;
This article walks through this solution's architecture, features, and setup. To dive deeper into the code and additional details, check out the opensearch-monitoring &lt;a href="https://github.com/cloudon-one/opensearch-monitoring" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;As serverless adoption grows, monitoring Lambda metrics becomes increasingly challenging, especially if you have multiple AWS accounts.&lt;/p&gt;

&lt;p&gt;With the Lambda Fleet Monitoring Solution, you gain:&lt;br&gt;
• &lt;strong&gt;Visibility&lt;/strong&gt; into every function’s performance and execution patterns.&lt;br&gt;
• &lt;strong&gt;Centralized dashboards&lt;/strong&gt; for easier troubleshooting.&lt;br&gt;
• &lt;strong&gt;Scalability&lt;/strong&gt; that covers as many AWS accounts as you need.&lt;/p&gt;
&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F184gzmk0wkqz98o7wmut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F184gzmk0wkqz98o7wmut.png" alt=" " width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Components:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EventBridge&lt;/strong&gt;: Schedules the monitoring Lambda to run on a configurable interval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Lambda&lt;/strong&gt;: Assumes roles in other AWS accounts to gather CloudWatch metrics and push them to OpenSearch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch Domain&lt;/strong&gt;: Serves as the data store for all metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch Dashboards&lt;/strong&gt;: Provides out-of-the-box (and customizable) visualization tools.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;br&gt;
• &lt;strong&gt;Cross-Account Monitoring&lt;/strong&gt;: Leverage IAM roles to gather data from multiple AWS accounts.&lt;br&gt;
• &lt;strong&gt;Real-Time Metrics&lt;/strong&gt;: Track invocation rates, error counts, memory usage, duration statistics, cold starts, and more.&lt;br&gt;
• &lt;strong&gt;Custom Dashboards&lt;/strong&gt;: Quickly visualize performance trends and identify anomalies.&lt;br&gt;
• &lt;strong&gt;Automated Setup&lt;/strong&gt;: Minimal manual configuration required — Terraform automates resource creation.&lt;br&gt;
• &lt;strong&gt;Customizable Alerts&lt;/strong&gt;: Integrate with AWS services or third-party tools for alerting on critical thresholds.&lt;br&gt;
• &lt;strong&gt;Memory &amp;amp; Timeout Insights&lt;/strong&gt;: Optimize Lambda performance and costs based on usage patterns.&lt;/p&gt;
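&lt;p&gt;To make the cross-account flow concrete, here is what the monitoring Lambda does per account, expressed as CLI calls — the account ID, role, and function names are placeholders:&lt;/p&gt;

```shell
# Sketch: assume the per-account read-only role, then pull CloudWatch metrics.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/lambda-monitoring-readonly \
  --role-session-name fleet-monitor \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)

# Pull one metric for one function; the Lambda iterates over all functions
# and metrics, then indexes the results into OpenSearch.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time 2025-02-17T00:00:00Z \
  --end-time 2025-02-17T01:00:00Z \
  --period 300 \
  --statistics Sum
```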
&lt;h2&gt;
  
  
  Metrics You’ll See
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Invocation Count&lt;/li&gt;
&lt;li&gt;Error Rates&lt;/li&gt;
&lt;li&gt;Duration Statistics&lt;/li&gt;
&lt;li&gt;Memory Utilization&lt;/li&gt;
&lt;li&gt;Cold Start Frequency&lt;/li&gt;
&lt;li&gt;Timeout Proximity&lt;/li&gt;
&lt;li&gt;Runtime Distribution&lt;/li&gt;
&lt;li&gt;Cost Metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
To get started, ensure you have:&lt;br&gt;
• AWS CLI configured with the right permissions.&lt;br&gt;
• Terraform v1.5.0+ installed.&lt;br&gt;
• Python 3.9+ installed.&lt;br&gt;
• Cross-account IAM roles set up in each AWS account you wish to monitor.&lt;br&gt;
• Permission to create Lambda functions, OpenSearch domains, IAM roles and policies, CloudWatch events, and S3 buckets.&lt;/p&gt;
&lt;h2&gt;
  
  
  QuickStart Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the Repository&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/cloudon-one/opensearch-monitoring.git
cd opensearch-monitoring/lambda/terraform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure Variables&lt;/strong&gt;&lt;br&gt;
In a &lt;code&gt;terraform.tfvars&lt;/code&gt; file, define your settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_region                   = "us-west-1"
monitored_accounts           = ["123456789012", "098765432109"]
opensearch_master_user_password = "your-secure-password"
opensearch_instance_type     = "t3.small.search"
opensearch_instance_count    = 1
opensearch_volume_size       = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Initialize Terraform&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;terraform init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Plan &amp;amp; Apply&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform plan
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will provision the OpenSearch domain, monitoring Lambda, IAM roles, and other necessary resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Securing Your Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Regular Rotation
• Rotate access keys and review roles periodically.&lt;/li&gt;
&lt;li&gt;Access Logging
• Enable CloudTrail logging for all AWS API activities.&lt;/li&gt;
&lt;li&gt;Least Privilege
• Minimize permissions where possible and remove unused policies.&lt;/li&gt;
&lt;li&gt;Organization Controls
• Use AWS Organizations Service Control Policies (SCPs) for additional governance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;br&gt;
The Lambda Fleet Monitoring Solution offers a robust, scalable way to track and analyze performance for all your AWS Lambda functions — regardless of how many accounts you manage. By combining real-time CloudWatch metrics with the visualization power of OpenSearch, this solution ensures you stay on top of function behaviour, performance trends, and potential cost optimizations.&lt;br&gt;
For a deeper dive, including best practices, troubleshooting tips, and advanced configuration options, head to the opensearch-monitoring &lt;a href="https://github.com/cloudon-one/opensearch-monitoring" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; and explore the documentation. &lt;/p&gt;

&lt;p&gt;Feel free to fork, submit issues, or contribute enhancements!&lt;br&gt;
Have thoughts or questions?&lt;/p&gt;

&lt;p&gt;Comment below or open an issue on GitHub to share your ideas.&lt;br&gt;
Happy monitoring!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>lambda</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>The Kubernetes Troubleshooting Handbook</title>
      <dc:creator>Yaar Naumenko</dc:creator>
      <pubDate>Wed, 22 Jan 2025 13:11:18 +0000</pubDate>
      <link>https://forem.com/ynaumenko/the-kubernetes-troubleshooting-handbook-3cfn</link>
      <guid>https://forem.com/ynaumenko/the-kubernetes-troubleshooting-handbook-3cfn</guid>
      <description>&lt;p&gt;Debugging Kubernetes applications can feel like navigating a labyrinth. With its distributed nature and myriad components, identifying and resolving issues in Kubernetes requires robust tools and techniques.&lt;/p&gt;

&lt;p&gt;This article will explore various techniques and tools for troubleshooting and debugging Kubernetes. Whether you’re an experienced Kubernetes user or just getting started, this guide will provide valuable insights into efficient debugging practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing Pod Lifecycle Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding a pod's lifecycle is crucial for debugging and maintaining applications running in Kubernetes. Each pod goes through several phases, from creation to termination, and analyzing these events can help you identify and resolve issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pod Lifecycle Phases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pod in Kubernetes goes through the following phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x41ahhmyuyf6ulz2r21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x41ahhmyuyf6ulz2r21.png" alt="Pods Lifecycle Events" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;kubectl get&lt;/code&gt; and &lt;code&gt;kubectl describe&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To analyze the lifecycle events of a pod, you can use the &lt;code&gt;kubectl get&lt;/code&gt; and &lt;code&gt;kubectl describe&lt;/code&gt; commands.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl get&lt;/code&gt; command provides a high-level overview of the status of pods:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME              READY   STATUS    RESTARTS   AGE
web-server-pod    1/1     Running   0          5m
db-server-pod     1/1     Pending   0          2m
cache-server-pod  1/1     Completed 1          10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This output shows each pod's current status, which can help you identify pods that require further investigation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl describe&lt;/code&gt; command provides detailed information about a pod, including its lifecycle events:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:           web-server-pod
Namespace:      default
Node:           node-1/192.168.1.1
Start Time:     Mon, 01 Jan 2025 10:00:00 GMT
Labels:         app=web-server
Status:         Running
IP:             10.244.0.2
Containers:
  web-container:
    Container ID:   docker://abcdef123456
    Image:          nginx:latest
    State:          Running
      Started:      Mon, 01 Jan 2025 10:01:00 GMT
    Ready:          True
    Restart Count:  0
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/web-server-pod to node-1
  Normal  Pulled     9m    kubelet, node-1    Container image "nginx:latest" already present on machine
  Normal  Created    9m    kubelet, node-1    Created container web-container
  Normal  Started    9m    kubelet, node-1    Started container web-container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Analyzing Pod Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Events section in the kubectl describe output provides a chronological log of significant events for the pod. These events can help you understand the lifecycle transitions and identify issues such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduling delays:&lt;/strong&gt; Delays in scheduling the pod can indicate resource constraints or issues with the scheduler.&lt;br&gt;
&lt;strong&gt;Image pull errors:&lt;/strong&gt; Failures in pulling container images can indicate network issues or problems with the container registry.&lt;br&gt;
&lt;strong&gt;Container crashes:&lt;/strong&gt; Repeated container crashes can be diagnosed by examining the events leading up to the crash.&lt;/p&gt;
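&lt;p&gt;These three failure classes can often be surfaced directly with event filters — a sketch, since exact event reasons vary slightly by Kubernetes version:&lt;/p&gt;

```shell
# Scheduling delays: pods the scheduler could not place
kubectl get events --field-selector reason=FailedScheduling

# Image pull problems and other pod-level warnings
kubectl get events --field-selector type=Warning,involvedObject.kind=Pod

# All warnings, oldest first, to trace a crash loop as it developed
kubectl get events --field-selector type=Warning --sort-by=.metadata.creationTimestamp
```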

&lt;p&gt;&lt;strong&gt;Kubernetes Events and Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes generates cluster-wide event resources (&lt;strong&gt;kind&lt;/strong&gt;: Event) that give you a quick overview of what’s happening on the cluster.&lt;/p&gt;

&lt;p&gt;Audit logs (configured via a &lt;strong&gt;kind&lt;/strong&gt;: Policy object), on the other hand, help ensure compliance and security on the cluster. They can show login attempts, pod privilege escalations, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes events provide a timeline of significant occurrences within your cluster, such as pod scheduling, container restarts, and errors. They help understand state transitions and identify the root causes of issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Viewing Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To view events in your cluster, use the kubectl get events command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get events&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LAST SEEN   TYPE      REASON             OBJECT                                   MESSAGE
12s         Normal    Scheduled          pod/web-server-pod                       Successfully assigned default/web-server-pod to node-1
10s         Normal    Pulling            pod/web-server-pod                       Pulling image "nginx:latest"
8s          Normal    Created            pod/web-server-pod                       Created container web-container
7s          Normal    Started            pod/web-server-pod                       Started container web-container
5s          Warning   BackOff            pod/db-server-pod                        Back-off restarting failed container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Filtering Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can filter events to focus on specific namespaces, resource types, or periods. For example, to view events related to a particular pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events --field-selector involvedObject.name=web-server-pod 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Describing Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl describe&lt;/code&gt; command includes events in its output, providing detailed information about a specific resource along with its event history:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod web-server-pod&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/web-server-pod to node-1
  Normal  Pulled     9m    kubelet, node-1    Container image "nginx:latest" already present on machine
  Normal  Created    9m    kubelet, node-1    Created container web-container
  Normal  Started    9m    kubelet, node-1    Started container web-container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kubernetes Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Audit logs provide a detailed record of all API requests made to the Kubernetes API server, including the user, the action performed, and the outcome. They are essential for security auditing and compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enabling Audit Logging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Configure the API server with the appropriate flags and audit policy to enable audit logging. Here’s an example of an audit policy configuration:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;audit-policy.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
- level: RequestResponse
  users: ["admin"]
  verbs: ["update", "patch"]
  resources:
  - group: ""
    resources: ["configmaps"] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuring the API Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specify the audit policy file and log file location when starting the API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube-apiserver --audit-policy-file=/etc/kubernetes/audit-policy.yaml --audit-log-path=/var/log/kubernetes/audit.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Viewing Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Audit logs are typically written to a file. You can use standard log analysis tools to view and filter the logs. Here’s an example of an audit log entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "auditID": "12345",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/pods",
    "verb": "create",
    "user": {
        "username": "admin",
        "groups": ["system:masters"]
    },
    "sourceIPs": ["192.168.1.1"],
    "objectRef": {
        "resource": "pods",
        "namespace": "default",
        "name": "web-server-pod"
    },
    "responseStatus": {
        "metadata": {},
        "code": 201
    },
    "requestReceivedTimestamp": "2025-01-01T12:00:00Z",
    "stageTimestamp": "2025-01-01T12:00:01Z"
} 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
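<p>Because each audit entry is a JSON object, tools like <code>jq</code> make it easy to answer questions such as “who deleted what?”. A minimal sketch, assuming the log path configured in the API server flags above:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Extract user, verb and object for every delete request in the audit log
jq -r 'select(.verb == "delete") | "\(.user.username) \(.verb) \(.objectRef.resource)/\(.objectRef.name)"' /var/log/kubernetes/audit.log
</code></pre>

</div>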



&lt;p&gt;&lt;strong&gt;Kubernetes Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes Dashboard is a web-based UI that provides an easy way to manage and troubleshoot your Kubernetes cluster. It allows you to visualize cluster resources, deploy applications, and perform various administrative tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing the Kubernetes Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please take a look at the Kubernetes documentation for details on installing and accessing the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboctagfa8rieccvcp4ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboctagfa8rieccvcp4ui.png" alt="Kubernetes Dashboard" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using the Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Dashboard provides various features to help manage and troubleshoot your Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster Overview&lt;/strong&gt;: View the overall status of your cluster, including nodes, namespaces, and resource usage.&lt;br&gt;
&lt;strong&gt;Workloads&lt;/strong&gt;: Monitor and manage workloads, such as Deployments, ReplicaSets, StatefulSets, and DaemonSets.&lt;br&gt;
&lt;strong&gt;Services and Ingress&lt;/strong&gt;: Manage services and ingress resources to control network traffic.&lt;br&gt;
&lt;strong&gt;Config and Storage&lt;/strong&gt;: Manage ConfigMaps, Secrets, PersistentVolumeClaims, and other storage resources.&lt;br&gt;
&lt;strong&gt;Logs and Events&lt;/strong&gt;: View logs and events for troubleshooting and auditing purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring Resource Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring resource usage helps you understand how your applications consume resources and identify opportunities for optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools for Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kubectl top&lt;/strong&gt;: Provides real-time resource usage metrics.&lt;br&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;: Collects and stores metrics for detailed analysis.&lt;br&gt;
&lt;strong&gt;Grafana&lt;/strong&gt;: Visualizes metrics and provides dashboards for monitoring.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;kubectl top&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl top&lt;/code&gt; command shows the current CPU and memory usage of pods and nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pods
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME        CPU(cores)   MEMORY(bytes)
my-app-pod  100m         120Mi 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
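<p><code>kubectl top</code> can also sort its output and break usage down per container, which helps spot the heaviest consumers at a glance:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Sort pods by memory and show per-container usage
kubectl top pods --sort-by=memory --containers

# Sort nodes by CPU
kubectl top nodes --sort-by=cpu
</code></pre>

</div>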



&lt;p&gt;&lt;strong&gt;Using kubectl logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;kubectl logs is one of the most essential tools for debugging Kubernetes applications. This command retrieves logs from a specific container in a pod, allowing you to diagnose and troubleshoot issues effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest way to retrieve logs from a pod is by using the kubectl logs command followed by the pod name and namespace. Here’s a basic example for a pod running in a default namespace:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command fetches the logs from the first container in the specified pod. If your pod has multiple containers, you need to specify the container name as well:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time Logs with f Flag&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To stream logs in real-time, similar to &lt;code&gt;tail -f&lt;/code&gt; in Linux, use the &lt;code&gt;-f&lt;/code&gt; flag:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs -f &amp;lt;pod-name&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;This is particularly useful for monitoring logs as your application runs and observing the output of live processes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Some projects enhance the log tailing with additional capabilities, such as &lt;strong&gt;stern&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Retrieving Previous Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a pod has restarted, you can view the logs from the previous instance using the --previous flag:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; --previous&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Examining the logs before the failure helps us understand what caused the pod to restart.&lt;/p&gt;
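<p>Related flags narrow the log window, which keeps the output manageable on chatty pods:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Only the last 100 lines
kubectl logs &amp;lt;pod-name&amp;gt; --tail=100

# Only logs from the past hour, with timestamps prepended
kubectl logs &amp;lt;pod-name&amp;gt; --since=1h --timestamps
</code></pre>

</div>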

&lt;p&gt;&lt;strong&gt;Filtering Logs with Labels&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also filter logs from pods that match specific labels using kubectl along with &lt;code&gt;jq&lt;/code&gt; for advanced filtering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -l &amp;lt;label-selector&amp;gt; -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl logs {} 
Replace &amp;lt;label-selector&amp;gt; with your specific labels, such as app=myapp.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Combining with Other Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can combine kubectl logs with other Linux commands to enhance your debugging process. For example, to search for a specific error message in the logs, you can use grep:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs web-server-pod | grep "Error"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For a continuous search in real-time logs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs -f web-server-pod | grep --line-buffered "Error"&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log Rotation and Retention&lt;/strong&gt;: Please ensure your application handles log rotation to prevent the logs from consuming excessive disk space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Logging:&lt;/strong&gt; Structured logging (e.g., JSON format) can make it easier to parse and analyze logs using tools like &lt;code&gt;jq&lt;/code&gt;.&lt;/p&gt;
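<p>For example, if the application emits one JSON object per line, <code>jq</code> can pull out just the error messages. A sketch; the <code>level</code> and <code>msg</code> field names here are assumptions about your log format:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Print only the message field of error-level JSON log lines
kubectl logs web-server-pod | jq -r 'select(.level == "error") | .msg'
</code></pre>

</div>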

&lt;p&gt;&lt;strong&gt;Centralized Logging:&lt;/strong&gt; Consider setting up a centralized logging system (e.g., Elasticsearch, Fluentd, and Kibana — EFK stack) to aggregate and search logs from all your Kubernetes pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using &lt;code&gt;kubectl exec&lt;/code&gt; for Interactive Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec&lt;/code&gt; allows us to execute commands directly inside a running container. This is particularly useful for interactive troubleshooting, enabling the inspection of the container’s environment, running diagnostic commands, and performing real-time fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The basic syntax of &lt;code&gt;kubectl exec&lt;/code&gt; is as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -- &amp;lt;command&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Use the &lt;code&gt;-c&lt;/code&gt; flag to execute a command in a specific container within a pod. The command runs once and then exits:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- &amp;lt;command&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running an Interactive Shell&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common uses of kubectl exec is to open an interactive shell session within a container. This allows you to run multiple commands interactively. Here’s how to do it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; -- /bin/bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For containers using sh instead of bash:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; -- /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Inspecting Environment Variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To check the environment variables inside a container, you can use the env command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -- env&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;If you need to check environment variables in a specific container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- env&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Checking Configuration Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose you need to inspect a configuration file inside the container. You can use cat or any text editor available inside the container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -- cat /path/to/config/file&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a specific container:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- cat /path/to/config/file 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Copying Files to and from Containers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you don’t have a binary you need inside a container, it’s easy to copy files to and from containers using &lt;code&gt;kubectl cp&lt;/code&gt;. For example, to copy a file from your local machine to a container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl cp /local/path/to/file &amp;lt;pod-name&amp;gt;:/container/path/to/file&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;To copy a file from a container to your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl cp &amp;lt;pod-name&amp;gt;:/container/path/to/file /local/path/to/file 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the &lt;code&gt;-i&lt;/code&gt; and &lt;code&gt;-t&lt;/code&gt; Flags&lt;/strong&gt;: The &lt;code&gt;-i&lt;/code&gt; flag makes the session interactive, and the &lt;code&gt;-t&lt;/code&gt; flag allocates a pseudo-TTY. Together, they enable a fully interactive session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run as a Specific User:&lt;/strong&gt; Note that &lt;code&gt;kubectl exec&lt;/code&gt; does not support a &lt;code&gt;--user&lt;/code&gt; flag. To run commands as a different user, set &lt;code&gt;securityContext.runAsUser&lt;/code&gt; in the pod spec, or switch users inside the container if a tool such as &lt;code&gt;su&lt;/code&gt; is available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; -- su &amp;lt;username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security Considerations:&lt;/strong&gt; Be cautious when running kubectl exec with elevated privileges. Ensure you have appropriate RBAC (Role-Based Access Control) policies in place to prevent unauthorized access.&lt;/p&gt;
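<p>As a sketch of such a policy, a namespaced Role can grant exec access on pods only to the subjects bound to it (the names here are illustrative):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-exec
rules:
# Exec is the "exec" subresource of pods; opening a session is a "create"
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
</code></pre>

</div>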

&lt;p&gt;&lt;strong&gt;Node-Level Debugging with kubectl debug&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most debugging techniques focus on the application level; however, the &lt;code&gt;kubectl debug&lt;/code&gt; node command can also be used to debug a specific Kubernetes node.&lt;/p&gt;

&lt;p&gt;Node-level debugging is crucial for diagnosing issues affecting the Kubernetes nodes, such as resource exhaustion, misconfigurations, or hardware failures.&lt;/p&gt;

&lt;p&gt;When you start a node debugging session this way, the debugging pod can access the node's root filesystem, which is mounted at &lt;code&gt;/host&lt;/code&gt; in the pod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Debugging Session:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the kubectl debug command to start a debugging session on a node. This command creates a pod running a debug container on the specified node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug node/&amp;lt;node-name&amp;gt; -it --image=busybox 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;node-name&amp;gt;&lt;/code&gt; with the name of the node you want to debug. The -it flag opens an interactive terminal, and &lt;code&gt;--image=busybox&lt;/code&gt; specifies the image for the debug container.&lt;/p&gt;
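<p>Once the session is open, the node's filesystem is available under <code>/host</code>; chroot-ing into it lets you run node-level commands much as you would over SSH. A sketch; which tools exist depends on the node's OS:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Inside the debug pod: switch the root to the node's filesystem
chroot /host

# Node-level tools are now available, e.g.:
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago"
</code></pre>

</div>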

&lt;p&gt;For more details, refer to the official Kubernetes documentation on node-level debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-Level Debugging with Debug Containers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more complex issues, consider using a debug container with pre-installed tools. There are a lot of good docker images with tooling and scripts for debugging, one that stands out to me is &lt;a href="https://github.com/nicolaka/netshoot" rel="noopener noreferrer"&gt;https://github.com/nicolaka/netshoot&lt;/a&gt;. It can quickly be created using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot 
Example: Using the debug container as a sidecar

 apiVersion: apps/v1
   kind: Deployment
   metadata:
       name: nginx-netshoot
       labels:
           app: nginx-netshoot
   spec:
   replicas: 1
   selector:
       matchLabels:
           app: nginx-netshoot
   template:
       metadata:
       labels:
           app: nginx-netshoot
       spec:
           containers:
           - name: nginx
           image: nginx:1.14.2
           ports:
               - containerPort: 80
           - name: netshoot
           image: nicolaka/netshoot
           command: ["/bin/bash"]
           args: ["-c", "while true; do ping localhost; sleep 60;done"] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the configuration:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl apply -f debug-pod.yaml&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Restart Policies&lt;/strong&gt;: Ensure that your pod specifications have appropriate restart policies to handle different failure scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated Monitoring&lt;/strong&gt;: Set up automated monitoring and alerting for critical issues such as CrashLoopBackOff using Prometheus and Alertmanager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ephemeral Containers for Debugging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ephemeral containers are temporary and explicitly created for debugging purposes. They are helpful for running diagnostic tools and commands without affecting the running application. This chapter will explore how to create and use ephemeral pods for interactive troubleshooting in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Use Ephemeral Pods?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Debugging in an isolated environment prevents accidental changes to running applications.&lt;br&gt;
&lt;strong&gt;Tool Availability&lt;/strong&gt;: Allows the use of specialized tools that may not be present in the application container.&lt;br&gt;
&lt;strong&gt;Temporary Nature&lt;/strong&gt;: These pods can be easily created and destroyed as needed without leaving a residual impact on the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating Ephemeral Pods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are several ways to create ephemeral pods in Kubernetes. One standard method is to use the &lt;code&gt;kubectl run&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Creating an Ephemeral Pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;kubectl debug&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl debug mypod -it --image=nicolaka/netshoot&lt;/code&gt;&lt;br&gt;
This command adds an ephemeral debug container running the Netshoot image to the pod and opens an interactive shell.&lt;/p&gt;
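<p>When the issue involves another container's processes, the <code>--target</code> flag shares that container's process namespace with the debug container, so tools like <code>ps</code> can see its processes. A sketch; the container name <code>app</code> is illustrative, and <code>--target</code> requires container runtime support:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Attach an ephemeral netshoot container that shares the process
# namespace of mypod's container named "app"
kubectl debug mypod -it --image=nicolaka/netshoot --target=app
</code></pre>

</div>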

&lt;p&gt;&lt;strong&gt;Practical Tips for Using Ephemeral Pods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Availability&lt;/strong&gt;: Ensure the debug container image includes all necessary tools for troubleshooting, such as &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;netcat&lt;/code&gt;, &lt;code&gt;nslookup&lt;/code&gt;, &lt;code&gt;df&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Considerations&lt;/strong&gt;: When creating ephemeral pods, consider security. Ensure they have limited access and are used by authorized personnel only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Advanced Debugging with Custom Debug Container&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of using a custom debug container for advanced debugging tasks.&lt;/p&gt;

&lt;p&gt;Create an Ephemeral Pod with Custom Debug Container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl debug -it redis5 --image=nicolaka/netshoot&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Defaulting debug container name to debugger-v4hfv.&lt;br&gt;
If you don't see a command prompt, try pressing enter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;88d888b. .d8888b. d8888P .d8888b. 88d888b. .d8888b. .d8888b. d8888P
88'  `88 88ooood8   88   Y8ooooo. 88'  `88 88'  `88 88'  `88   88
88    88 88.  ...   88         88 88    88 88.  .88 88.  .88   88
dP    dP `88888P'   dP   `88888P' dP    dP `88888P' `88888P'   dP

Welcome to Netshoot! (github.com/nicolaka/netshoot)
Version: 0.13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Diagnostic Commands:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inside the debug container we can run various commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check DNS resolution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nslookup kubernetes.default.svc.cluster.local

Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test network connectivity&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;curl http://my-service:8080/health&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;By using ephemeral pods, you can effectively debug and troubleshoot Kubernetes applications in an isolated and controlled environment, minimizing the risk of impacting production workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling DNS and Network Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will go through two common troubleshooting scenarios: DNS issues and debugging stateful pods. Let’s see what we have learned in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Network Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS Resolution Failures&lt;/strong&gt;: Issues resolving service names to IP addresses.&lt;br&gt;
&lt;strong&gt;Service Unreachable&lt;/strong&gt;: Services are not accessible within the cluster.&lt;br&gt;
&lt;strong&gt;Pod Communication Issues&lt;/strong&gt;: Pods cannot communicate with each other.&lt;br&gt;
&lt;strong&gt;Network Policy Misconfigurations&lt;/strong&gt;: Incorrect network policies blocking traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools and Commands for Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec&lt;/code&gt;: Run commands in a container to diagnose network issues. &lt;br&gt;
&lt;code&gt;nslookup&lt;/code&gt;: Check DNS resolution. &lt;br&gt;
&lt;code&gt;ping&lt;/code&gt;: Test connectivity between pods and services. &lt;br&gt;
&lt;code&gt;curl&lt;/code&gt;: Verify HTTP connectivity and responses. &lt;br&gt;
&lt;code&gt;traceroute&lt;/code&gt;: Trace the path packets take to reach a destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Diagnosing a DNS Resolution Issue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of diagnosing a DNS resolution issue for a pod named my-app-pod trying to reach a service my-db-service.&lt;/p&gt;

&lt;p&gt;Check DNS Resolution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it my-app-pod -- nslookup my-db-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, we can use a debug pod or an ephemeral container.&lt;br&gt;
Output indicating a problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server: 10.96.0.10
Address:10.96.0.10#53
** server can't find my-db-service: NXDOMAIN 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check CoreDNS Logs&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Inspect the logs of CoreDNS pods to identify any DNS resolution issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -l k8s-app=kube-dns -n kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for errors or warnings indicating DNS resolution failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify Service and Endpoints&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Ensure that the service and endpoints exist and are correctly configured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get svc my-db-service
kubectl get endpoints my-db-service 
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
my-db-serviceClusterIP   10.96.0.11   &amp;lt;none&amp;gt;        5432/TCP   1h 
NAME         ENDPOINTS            AGE
my-db-service10.244.0.5:5432      1h 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Restart CoreDNS Pods&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Restart CoreDNS pods to resolve potential transient issues.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl rollout restart deployment coredns -n kube-system&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify DNS Resolution Again:&lt;/p&gt;

&lt;p&gt;After resolving the issue, verify DNS resolution again:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it my-app-pod -- nslookup my-db-service&lt;/code&gt;&lt;br&gt;
Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server: 10.96.0.10
Address:10.96.0.10#53 
Name:   my-db-service.default.svc.cluster.local
Address:10.96.0.11 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use Network Debug Containers: Use network debug containers like &lt;code&gt;nicolaka/netshoot&lt;/code&gt; for comprehensive network troubleshooting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run netshoot --rm -it --image nicolaka/netshoot -- /bin/bash 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitor Network Metrics&lt;/strong&gt;: Use Prometheus and Grafana to monitor network metrics and set up network-issue alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement Redundancy&lt;/strong&gt;: Configure redundant DNS servers and failover mechanisms to enhance network reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Stateful Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stateful applications in Kubernetes require special debugging considerations due to their reliance on persistent storage and consistent state across restarts. This section will explore techniques for handling and debugging issues specific to stateful applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are Stateful Applications?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stateful applications maintain state information across sessions and restarts, often using persistent storage. Examples include databases, message queues, and other applications that require data persistence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Issues in Stateful Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent Storage Issues&lt;/strong&gt;: Problems with PVCs or PVs can lead to data loss or unavailability.&lt;br&gt;
&lt;strong&gt;Pod Start-up Failures&lt;/strong&gt;: Errors during pod initialization due to state dependencies.&lt;br&gt;
&lt;strong&gt;Network Partitioning&lt;/strong&gt;: Network issues affecting communication between stateful pods.&lt;br&gt;
&lt;strong&gt;Data Consistency Problems&lt;/strong&gt;: Inconsistent data across replicas or restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Debugging a MySQL StatefulSet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of debugging a MySQL StatefulSet named my-mysql.&lt;/p&gt;

&lt;p&gt;Inspect the StatefulSet:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe statefulset my-mysql&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:           my-mysql
Namespace:      default
Selector:       app=my-mysql
Replicas:       3 desired | 3 total
...
Events:
  Type    Reason            Age   From                    Message
  ----    ------            ----  ----                    -------
  Normal  SuccessfulCreate  1m    statefulset-controller  create Pod my-mysql-0 in StatefulSet my-mysql successful
  Normal  SuccessfulCreate  1m    statefulset-controller  create Pod my-mysql-1 in StatefulSet my-mysql successful
  Normal  SuccessfulCreate  1m    statefulset-controller  create Pod my-mysql-2 in StatefulSet my-mysql successful 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check Persistent Volume Claims:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pvc
kubectl describe pvc data-my-mysql-0 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:          data-my-mysql-0
Namespace:     default
Status:        Bound
Volume:        pvc-1234abcd-56ef-78gh-90ij-klmnopqrstuv
... 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check Pod Logs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs my-mysql-0&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-01-01T00:00:00.000000Z 0 [Note] mysqld (mysqld 8.0.23) starting as process 1 ...
2025-01-01T00:00:00.000000Z 1 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 11 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
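&lt;p&gt;The &lt;code&gt;Unable to lock ./ibdata1 error: 11&lt;/code&gt; line usually means another &lt;code&gt;mysqld&lt;/code&gt; process holds a lock on the data directory — commonly two pods mounting the same ReadWriteOnce volume after a node failover, or a process left over from an unclean shutdown. A quick way to check is to list every pod that references the claim (the name &lt;code&gt;data-my-mysql-0&lt;/code&gt; follows the claim-template, StatefulSet name, ordinal pattern from the example; &lt;code&gt;jq&lt;/code&gt; is assumed to be installed):&lt;/p&gt;

```shell
# List every pod in the namespace that mounts the PVC backing my-mysql-0.
# More than one pod here means the ReadWriteOnce volume is double-mounted,
# which produces exactly the InnoDB "Unable to lock ./ibdata1" error.
kubectl get pods -o json \
  | jq -r '.items[]
      | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "data-my-mysql-0")
      | .metadata.name'
```

&lt;p&gt;If exactly one pod is listed, the lock is likely stale from a previous crash and clears once that pod restarts cleanly.&lt;/p&gt;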



&lt;p&gt;Execute Commands in Pods:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it my-mysql-0 -- /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Inside the pod:&lt;/p&gt;

&lt;p&gt;Check mounted volumes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df -h&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify MySQL data directory:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls -l /var/lib/mysql&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Check MySQL status:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mysqladmin -u root -p status&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Check Network Connectivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it my-mysql-0 -- ping my-mysql-1.my-mysql.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PING my-mysql-1.my-mysql.default.svc.cluster.local (10.244.0.6): 56 data bytes
64 bytes from 10.244.0.6: icmp_seq=0 ttl=64 time=0.047 ms 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
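&lt;p&gt;The per-pod DNS name used above only resolves because the StatefulSet is backed by a headless Service whose name matches the StatefulSet’s &lt;code&gt;serviceName&lt;/code&gt;; if the lookup fails with NXDOMAIN, check that Service first. A minimal sketch — the name &lt;code&gt;my-mysql&lt;/code&gt; and the &lt;code&gt;app=my-mysql&lt;/code&gt; label are assumed from the example:&lt;/p&gt;

```yaml
# Hypothetical headless Service assumed to back the my-mysql StatefulSet.
# clusterIP: None makes DNS return individual pod IPs, which is what gives
# each pod its stable my-mysql-N.my-mysql.default.svc.cluster.local name.
apiVersion: v1
kind: Service
metadata:
  name: my-mysql
spec:
  clusterIP: None
  selector:
    app: my-mysql
  ports:
    - name: mysql
      port: 3306
```

&lt;p&gt;Also note that minimal container images often ship without &lt;code&gt;ping&lt;/code&gt;; where it is missing, &lt;code&gt;getent hosts&lt;/code&gt; or &lt;code&gt;nslookup&lt;/code&gt; (if present in the image) can verify DNS resolution instead.&lt;/p&gt;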



&lt;p&gt;&lt;strong&gt;Advanced Debugging Techniques&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advanced debugging techniques in Kubernetes involve using specialized tools and strategies to diagnose and resolve complex issues. This section covers tracing instrumentation and remote debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profiling with Jaeger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Jaeger is an open-source, end-to-end distributed tracing tool that helps monitor and troubleshoot transactions in complex distributed systems. Profiling with Jaeger can provide insights into the performance of your microservices and help identify latency issues.&lt;/p&gt;

&lt;p&gt;You can install Jaeger in your Kubernetes cluster using the Jaeger Operator or Helm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Instrument Your Application:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensure your application is instrumented to send tracing data to Jaeger. This typically involves adding Jaeger client libraries to your application code and configuring them to report to the Jaeger backend.&lt;/p&gt;

&lt;p&gt;Example in a Go application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    "io"
    "log"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)

// initJaeger builds a tracer that reports every span (const sampler, param 1)
// to the Jaeger agent Service inside the cluster.
func initJaeger(service string) (opentracing.Tracer, io.Closer) {
    cfg := config.Configuration{
        ServiceName: service,
        Sampler: &amp;amp;config.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1,
        },
        Reporter: &amp;amp;config.ReporterConfig{
            LogSpans:           true,
            LocalAgentHostPort: "jaeger-agent.default.svc.cluster.local:6831",
        },
    }
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatalf("could not initialize Jaeger tracer: %v", err)
    }
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access the Jaeger UI to view and analyze traces.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward svc/jaeger-query 16686:16686&lt;/code&gt; &lt;br&gt;
Open &lt;code&gt;http://localhost:16686&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remote Debugging with mirrord&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;mirrord is an open-source tool for remote debugging of Kubernetes services: it runs a local process in the context of your cluster, so the process sees the remote environment variables, file system, and incoming traffic while the code itself executes on your machine.&lt;/p&gt;

&lt;p&gt;Setting Up mirrord&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -fsSL https://raw.githubusercontent.com/metalbear-co/mirrord/main/scripts/install.sh | bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Connect to Your Cluster:&lt;/p&gt;

&lt;p&gt;mirrord targets the cluster in your current kubeconfig context, so verify that kubectl is pointing at the right cluster before starting a session.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl config current-context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Target a Deployment:&lt;/p&gt;

&lt;p&gt;Use mirrord to run your local service in the context of a deployment in your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mirrord exec --target-namespace devops-team --target deployment/foo-app-deployment nodemon server.js 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
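&lt;p&gt;Rather than repeating flags on every run, mirrord can also read its target from a configuration file that &lt;code&gt;mirrord exec&lt;/code&gt; picks up. A sketch, assuming the same namespace and deployment names as above and mirrord’s JSON config format (commonly placed at &lt;code&gt;.mirrord/mirrord.json&lt;/code&gt; in the project root):&lt;/p&gt;

```json
{
  "target": {
    "path": "deployment/foo-app-deployment",
    "namespace": "devops-team"
  }
}
```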



&lt;p&gt;This command redirects incoming traffic, environment variables, and file operations from the targeted deployment to your local process, allowing you to debug the service as if it were running in the cluster.&lt;/p&gt;

&lt;p&gt;Once the mirrord session is set up, you can debug the service on your local machine using your favourite debugging tools and IDEs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Breakpoints&lt;/strong&gt;: Use your IDE to set breakpoints and step through the code.&lt;br&gt;
&lt;strong&gt;Inspect Variables&lt;/strong&gt;: Inspect variables and application state to identify issues.&lt;br&gt;
&lt;strong&gt;Make Changes&lt;/strong&gt;: Make code changes and immediately see the effects without redeploying to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In addition to the core Kubernetes commands and open-source tools covered above, several other tools can enhance your troubleshooting capabilities: &lt;code&gt;k9s&lt;/code&gt; for interactive, terminal-based cluster navigation, &lt;code&gt;stern&lt;/code&gt; for tailing logs across multiple pods at once, and &lt;code&gt;kubectx&lt;/code&gt;/&lt;code&gt;kubens&lt;/code&gt; for fast switching between clusters and namespaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Debugging Kubernetes applications can be complex and challenging, but it becomes much more manageable with the right tools and techniques.&lt;/p&gt;

&lt;p&gt;Remember, effective debugging is not just about resolving issues as they arise but also about proactive monitoring, efficient resource management, and a deep understanding of your application’s architecture and dependencies.&lt;/p&gt;

&lt;p&gt;By implementing the strategies and best practices outlined in this guide, you can build a robust debugging framework that empowers you to quickly identify, diagnose, and resolve issues, ensuring the smooth operation of your Kubernetes deployments.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
