<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yaar Naumenko</title>
    <description>The latest articles on Forem by Yaar Naumenko (@ynaumenko).</description>
    <link>https://forem.com/ynaumenko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F546357%2Fe7797afd-d623-4186-a1b9-217d0bee86bd.png</url>
      <title>Forem: Yaar Naumenko</title>
      <link>https://forem.com/ynaumenko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ynaumenko"/>
    <language>en</language>
    <item>
      <title>OpenClaw on GCP Cloud Run: Secure, Serverless, Multi-Tenant</title>
      <dc:creator>Yaar Naumenko</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:47:11 +0000</pubDate>
      <link>https://forem.com/ynaumenko/openclaw-on-gcp-cloud-run-secure-serverless-multi-tenant-1mpl</link>
      <guid>https://forem.com/ynaumenko/openclaw-on-gcp-cloud-run-secure-serverless-multi-tenant-1mpl</guid>
      <description>&lt;p&gt;A few days ago, &lt;a href="https://dev.to/mkreder"&gt;Matias Kreder&lt;/a&gt; published a great &lt;a href="https://dev.to/aws-builders/openclaw-on-aws-agentcore-secure-serverless-production-ready-i8n"&gt;article&lt;/a&gt; on running OpenClaw on AWS Bedrock AgentCore.&lt;br&gt;
The architecture was elegant: ephemeral containers, S3-backed workspace sync, per-user isolation, no always-on VMs.&lt;br&gt;
I was already running OpenClaw on a GKE node, and the bill was… fine, but the node was sitting there 24/7 whether anyone was chatting with the agent or not.&lt;/p&gt;

&lt;p&gt;After reading Matias’s post, I thought: GCP has all the same primitives. Can I replicate this pattern natively on GCP?&lt;/p&gt;

&lt;p&gt;Turns out yes — and in some ways the GCP path is even cleaner. &lt;br&gt;
Cloud Run v2 supports native GCSFuse volume mounts, which means you get a persistent workspace without a sync daemon, a sidecar, or a background timer. &lt;br&gt;
The filesystem just works across container restarts.&lt;br&gt;
This post walks through how I built a multi-tenant OpenClaw deployment on Cloud Run, with full per-tenant isolation, Telegram/Slack support, and a shared router service as the only public endpoint.&lt;br&gt;
The full repo is on GitHub: &lt;a href="https://github.com/cloudon-one/openclaw-serverless" rel="noopener noreferrer"&gt;openclaw-serverless&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekrkgsf0n76ifj3k0697.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekrkgsf0n76ifj3k0697.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCSFuse Workspace Persistence&lt;/strong&gt;&lt;br&gt;
Cloud Run containers are ephemeral — they spin up on demand and disappear when idle. OpenClaw stores everything it knows about a user under &lt;code&gt;.openclaw/&lt;/code&gt; (conversation memory, user profiles, tool outputs). Without a persistence strategy, that all disappears the moment a session ends.&lt;br&gt;
The solution here is simpler than the AWS approach: Cloud Run v2 has built-in GCSFuse support. The agent container gets a &lt;code&gt;/data&lt;/code&gt; volume mount backed by a per-tenant GCS bucket. &lt;br&gt;
The entrypoint writes &lt;code&gt;openclaw.json&lt;/code&gt; to that path on startup, and every file write the agent makes is transparently persisted to GCS. No sync loop, no &lt;code&gt;SIGTERM&lt;/code&gt; handler — it just works. Container restarts pick up exactly where the previous session left off.&lt;/p&gt;
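&lt;p&gt;For reference, the same volume wiring can be done straight from the CLI. This is a hedged sketch — the service, bucket, and volume names are placeholders, and the repo’s Terraform is the canonical source:&lt;/p&gt;

```shell
# Sketch: attach a per-tenant GCS bucket to a Cloud Run v2 service as a
# GCSFuse volume mounted at /data (all names are placeholders).
gcloud run services update openclaw-alice \
  --region us-central1 \
  --execution-environment gen2 \
  --add-volume name=workspace,type=cloud-storage,bucket=openclaw-alice-workspace \
  --add-volume-mount volume=workspace,mount-path=/data
```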

&lt;p&gt;&lt;em&gt;One intentional detail&lt;/em&gt;: config is always overwritten on startup from environment variables. GCSFuse persists agent state; environment variables drive configuration. A new deploy always wins over stale config on disk.&lt;/p&gt;
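&lt;p&gt;A minimal sketch of that startup behaviour (the file layout and env var names here are illustrative, not taken from the repo):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: regenerate openclaw.json from environment variables on every start,
# so a fresh deploy always overrides whatever config the GCSFuse volume kept.
# WORKSPACE defaults to /tmp/openclaw-data for illustration; in the real
# service it would be the /data mount.
WORKSPACE="${WORKSPACE:-/tmp/openclaw-data}"
mkdir -p "$WORKSPACE"
printf '{\n  "model": "%s",\n  "tenant": "%s"\n}\n' \
  "${OPENCLAW_MODEL:-claude-sonnet}" \
  "${TENANT_ID:-demo}" | tee "$WORKSPACE/openclaw.json"
```

&lt;p&gt;Only the config file is rewritten; agent state elsewhere under the mount is untouched.&lt;/p&gt;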

&lt;p&gt;&lt;strong&gt;Multi-Tenant Router&lt;/strong&gt;&lt;br&gt;
Rather than exposing each tenant’s Cloud Run service to the internet, a single lightweight Node.js router sits at the public endpoint. It validates webhook signatures (Telegram secret token, Slack HMAC-SHA256), looks up the tenant by user/channel ID, then forwards the request to the right tenant service using a GCP-issued ID token. Tenant services are deployed with &lt;code&gt;INGRESS_TRAFFIC_INTERNAL_ONLY&lt;/code&gt; — they are completely unreachable except through the router.&lt;br&gt;
Webhook secrets are fetched from Secret Manager and cached for 5 minutes. Both channels are fail-closed: requests without valid signatures are rejected before any tenant code runs.&lt;/p&gt;
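&lt;p&gt;The Slack leg of that check can be sketched with &lt;code&gt;openssl&lt;/code&gt; — the secret and payload below are made-up test values, and the real router does this in Node.js:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: Slack-style HMAC-SHA256 webhook verification. Slack signs the string
# "v0:{timestamp}:{raw body}" with the signing secret and sends the result in
# the X-Slack-Signature header; the router recomputes it and compares.
verify_slack() {
  secret="$1"; ts="$2"; body="$3"; presented="$4"
  computed="v0=$(printf 'v0:%s:%s' "$ts" "$body" | openssl dgst -sha256 -hmac "$secret" -hex | sed 's/^.* //')"
  if [ "$computed" = "$presented" ]; then echo valid; else echo rejected; fi
}

SECRET="test-signing-secret"
TS="1531420618"
BODY='token=abc123'
# Simulate the header a legitimate sender would attach.
GOOD="v0=$(printf 'v0:%s:%s' "$TS" "$BODY" | openssl dgst -sha256 -hmac "$SECRET" -hex | sed 's/^.* //')"

verify_slack "$SECRET" "$TS" "$BODY" "$GOOD"          # prints: valid
verify_slack "$SECRET" "$TS" "tampered-body" "$GOOD"  # prints: rejected
```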
&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt; Tenant Cloud Run services have internal-only ingress. The only public endpoint is the router service. Even within GCP, a caller needs a valid ID token to invoke a tenant service — ambient network access is not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-tenant isolation:&lt;/strong&gt; Each tenant gets its own Cloud Run service, GCS bucket, and service account. The tenant SA has &lt;code&gt;objectAdmin&lt;/code&gt; on its own bucket only — no IAM binding to any other tenant’s resources. Secrets are scoped per-tenant; the SA can access its own secrets plus the shared Anthropic API key, nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least-privilege IAM&lt;/strong&gt;: The router SA has secretAccessor on webhook secrets and &lt;code&gt;run.invoker&lt;/code&gt; on each tenant service. Tenant SAs have &lt;code&gt;secretAccessor&lt;/code&gt; on their own secrets and &lt;code&gt;objectAdmin&lt;/code&gt; on their own bucket. That’s it.&lt;/p&gt;
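&lt;p&gt;In gcloud terms, the full grant set per tenant looks roughly like this (names are placeholders; the repo’s Terraform applies the equivalent bindings):&lt;/p&gt;

```shell
# Sketch: the only three bindings a tenant needs (placeholders throughout).
# 1. Tenant SA reads its own secrets.
gcloud secrets add-iam-policy-binding openclaw-sl-alice-telegram-token \
  --member serviceAccount:openclaw-alice@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/secretmanager.secretAccessor

# 2. Tenant SA manages objects in its own bucket only.
gcloud storage buckets add-iam-policy-binding gs://openclaw-alice-workspace \
  --member serviceAccount:openclaw-alice@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/storage.objectAdmin

# 3. Router SA may invoke the tenant service.
gcloud run services add-iam-policy-binding openclaw-alice \
  --region us-central1 \
  --member serviceAccount:openclaw-router@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/run.invoker
```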

&lt;p&gt;&lt;strong&gt;Secret management:&lt;/strong&gt; Bot tokens, webhook secrets, and the Anthropic API key all live in Secret Manager. Nothing sensitive in environment variables or container images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device pairing bypass:&lt;/strong&gt; OpenClaw normally requires an interactive shell command to approve devices. Cloud Run has no shell. Setting &lt;code&gt;dmPolicy: allowlist&lt;/code&gt; with the tenant’s user ID in &lt;code&gt;allowFrom&lt;/code&gt; bypasses pairing entirely — safe because the router already validated the webhook source before the message arrived.&lt;/p&gt;
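&lt;p&gt;As a sketch, the relevant channel config might look like this — &lt;code&gt;dmPolicy&lt;/code&gt; and &lt;code&gt;allowFrom&lt;/code&gt; come from the text above, but the surrounding JSON shape is an assumption, not copied from the repo:&lt;/p&gt;

```json
{
  "channels": {
    "telegram": {
      "dmPolicy": "allowlist",
      "allowFrom": ["123456789"]
    }
  }
}
```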
&lt;h2&gt;
  
  
  Instructions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GCP project with billing enabled&lt;/li&gt;
&lt;li&gt;gcloud CLI authenticated&lt;/li&gt;
&lt;li&gt;terraform / opentofu installed&lt;/li&gt;
&lt;li&gt;Docker with linux/amd64 build support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the repo&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cloudon-one/openclaw-serverless
&lt;span class="nb"&gt;cd &lt;/span&gt;openclaw-serverless
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure your project&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set your GCP project&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-gcp-project-id
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-docker.pkg.dev/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/openclaw"&lt;/span&gt;
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Create the Artifact Registry repository and enable APIs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;run.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  secretmanager.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
  artifactregistry.googleapis.com

gcloud artifacts repositories create openclaw &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository-format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Build and push both container images&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/build.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth configure-docker &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev

docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/agent:latest agent/
docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/router:latest router/
docker push &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/agent:latest
docker push &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/router:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Store your Anthropic API key&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"YOUR_ANTHROPIC_API_KEY"&lt;/span&gt; | gcloud secrets create openclaw-anthropic-api-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;- &lt;span class="nt"&gt;--replication-policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;automatic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Define your first tenant in &lt;code&gt;tenants.yaml&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tenants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;display_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Smith"&lt;/span&gt;
    &lt;span class="na"&gt;telegram_user_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TELEGRAM_USER_ID"&lt;/span&gt;
    &lt;span class="na"&gt;telegram_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;slack_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;min_instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="na"&gt;max_instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Deploy infrastructure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infrastructure
&lt;span class="nb"&gt;cp &lt;/span&gt;terraform.tfvars.example terraform.tfvars
&lt;span class="c"&gt;# Edit terraform.tfvars with your project ID, region, registry URL&lt;/span&gt;

terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform creates: service accounts, GCS buckets, Secret Manager secrets, and Cloud Run services (router + one per tenant).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Create a Telegram bot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message &lt;a href="https://t.me/BotFather"&gt;@BotFather&lt;/a&gt; on Telegram&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;/newbot&lt;/code&gt; command and copy the bot token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;9. Store tenant secrets&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Telegram bot token&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"YOUR_BOT_TOKEN"&lt;/span&gt; | gcloud secrets versions add &lt;span class="se"&gt;\&lt;/span&gt;
  openclaw-sl-alice-telegram-token &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-

&lt;span class="c"&gt;# Webhook validation secret (random)&lt;/span&gt;
openssl rand &lt;span class="nt"&gt;-hex&lt;/span&gt; 32 | gcloud secrets versions add &lt;span class="se"&gt;\&lt;/span&gt;
  openclaw-sl-alice-telegram-webhook-secret &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;10. Register the Telegram webhook&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ROUTER_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infrastructure &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform output &lt;span class="nt"&gt;-raw&lt;/span&gt; router_url&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud secrets versions access latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;openclaw-sl-alice-telegram-webhook-secret&lt;span class="si"&gt;)&lt;/span&gt;

curl &lt;span class="s2"&gt;"https://api.telegram.org/bot&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;YOUR_BOT_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/setWebhook"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"url=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROUTER_URL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/webhook/telegram"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"secret_token=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. Send a message to your bot on Telegram. The first response takes ~15–20 seconds for a cold start; subsequent messages in the same session are fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The solution works well, and the GCSFuse approach is genuinely nicer than S3 sync — one less moving part, no 5-minute flush window, no shutdown race condition. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A few things worth knowing before you deploy:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;cpu_idle: false&lt;/code&gt; adds cost but is required. Agent sessions involve async operations and WebSocket connections that break under CPU throttling. With &lt;code&gt;min_instances: 0&lt;/code&gt;, you’re only paying when the container is actually running, so this is acceptable — but it’s not free.&lt;br&gt;
&lt;strong&gt;Gen2 execution environment is non-negotiable&lt;/strong&gt;. GCSFuse is not available in Gen1. Set &lt;code&gt;execution_environment = "EXECUTION_ENVIRONMENT_GEN2"&lt;/code&gt; in Terraform, or the mount will silently fail.&lt;/p&gt;
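&lt;p&gt;Both settings live on the Cloud Run v2 service resource. A minimal Terraform sketch (resource and image names are illustrative):&lt;/p&gt;

```hcl
# Sketch: the two settings called out above, on a google_cloud_run_v2_service.
resource "google_cloud_run_v2_service" "tenant" {
  name     = "openclaw-alice"
  location = "us-central1"

  template {
    # GCSFuse volume mounts require the gen2 execution environment.
    execution_environment = "EXECUTION_ENVIRONMENT_GEN2"

    containers {
      image = "REGISTRY/agent:latest"
      resources {
        # Keep CPU allocated between requests so async work and
        # WebSocket connections are not throttled.
        cpu_idle = false
      }
    }
  }
}
```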

&lt;p&gt;&lt;strong&gt;Cold starts are real.&lt;/strong&gt; First message to an idle tenant takes 15–20 seconds. For async chat, this is fine; for anything latency-sensitive, it’s a problem. Set &lt;code&gt;min_instances: 1&lt;/code&gt; per tenant if you need it — just budget accordingly.&lt;/p&gt;

&lt;p&gt;Adding a second tenant is genuinely just one YAML entry and a &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The isolation model scales cleanly. Each tenant is a fully independent island with no shared state.&lt;/p&gt;

&lt;p&gt;One operational note: the Terraform state bucket needs to exist before &lt;code&gt;terraform init&lt;/code&gt;. Create it manually or bootstrap it separately — classic chicken-and-egg.&lt;/p&gt;

&lt;p&gt;Compared to the AWS AgentCore approach, the GCP version skips the NAT gateway entirely (Cloud Run has direct internet egress), which removes the ~$32/month baseline AWS cost.&lt;br&gt;
For a single personal agent, this architecture is essentially free at idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to try it?&lt;/strong&gt;&lt;br&gt;
The repo is at &lt;a href="https://github.com/cloudon-one/openclaw-serverless" rel="noopener noreferrer"&gt;https://github.com/cloudon-one/openclaw-serverless&lt;/a&gt;. &lt;br&gt;
If you run into issues or want to add other channels (WhatsApp, Discord), the router is straightforward to extend.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>gcp</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Lambda Fleet Monitoring with OpenSearch: Real-Time Insights at Scale</title>
      <dc:creator>Yaar Naumenko</dc:creator>
      <pubDate>Mon, 17 Feb 2025 10:37:22 +0000</pubDate>
      <link>https://forem.com/ynaumenko/lambda-fleet-monitoring-with-opensearch-real-time-insights-at-scale-ani</link>
      <guid>https://forem.com/ynaumenko/lambda-fleet-monitoring-with-opensearch-real-time-insights-at-scale-ani</guid>
      <description>&lt;p&gt;Do you manage multiple AWS accounts with countless Lambda functions — and feel overwhelmed by the complexity of monitoring them all? &lt;br&gt;
Look no further. The &lt;a href="https://github.com/cloudon-one/opensearch-monitoring" rel="noopener noreferrer"&gt;Lambda Fleet Monitoring Solution&lt;/a&gt; is a fully automated cross-account approach that tracks real-time metrics (invocations, errors, duration, and even cold starts) and funnels them into an OpenSearch cluster for robust analysis and visualization.&lt;br&gt;
This article walks through this solution's architecture, features, and setup. To dive deeper into the code and additional details, check out the opensearch-monitoring &lt;a href="https://github.com/cloudon-one/opensearch-monitoring" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;As serverless adoption grows, monitoring Lambda metrics becomes increasingly challenging, especially if you have multiple AWS accounts.&lt;/p&gt;

&lt;p&gt;With the Lambda Fleet Monitoring Solution, you gain:&lt;br&gt;
• &lt;strong&gt;Visibility&lt;/strong&gt; into every function’s performance and execution patterns.&lt;br&gt;
• &lt;strong&gt;Centralized dashboards&lt;/strong&gt; for easier troubleshooting.&lt;br&gt;
• &lt;strong&gt;Scalability&lt;/strong&gt; that covers as many AWS accounts as you need.&lt;/p&gt;
&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F184gzmk0wkqz98o7wmut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F184gzmk0wkqz98o7wmut.png" alt=" " width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Components:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EventBridge&lt;/strong&gt;: Schedules the monitoring Lambda to run on a configurable interval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Lambda&lt;/strong&gt;: Assumes roles in other AWS accounts to gather CloudWatch metrics and push them to OpenSearch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch Domain&lt;/strong&gt;: Serves as the data store for all metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenSearch Dashboards&lt;/strong&gt;: Provides out-of-the-box (and customizable) visualization tools.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;br&gt;
• &lt;strong&gt;Cross-Account Monitoring&lt;/strong&gt;: Leverage IAM roles to gather data from multiple AWS accounts.&lt;br&gt;
• &lt;strong&gt;Real-Time Metrics&lt;/strong&gt;: Track invocation rates, error counts, memory usage, duration statistics, cold starts, and more.&lt;br&gt;
• &lt;strong&gt;Custom Dashboards&lt;/strong&gt;: Quickly visualize performance trends and identify anomalies.&lt;br&gt;
• &lt;strong&gt;Automated Setup&lt;/strong&gt;: Minimal manual configuration required — Terraform automates resource creation.&lt;br&gt;
• &lt;strong&gt;Customizable Alerts&lt;/strong&gt;: Integrate with AWS services or third-party tools for alerting on critical thresholds.&lt;br&gt;
• &lt;strong&gt;Memory &amp;amp; Timeout Insights&lt;/strong&gt;: Optimize Lambda performance and costs based on usage patterns.&lt;/p&gt;
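&lt;p&gt;To make the cross-account flow concrete, here is what the monitoring Lambda does per account, expressed as CLI calls — the account ID, role, and function names are placeholders:&lt;/p&gt;

```shell
# Sketch: assume the per-account read-only role, then pull CloudWatch metrics.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/lambda-monitoring-readonly \
  --role-session-name fleet-monitor \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)

# Pull one metric for one function; the Lambda iterates over all functions
# and metrics, then indexes the results into OpenSearch.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time 2025-02-17T00:00:00Z \
  --end-time 2025-02-17T01:00:00Z \
  --period 300 \
  --statistics Sum
```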
&lt;h2&gt;
  
  
  Metrics You’ll See
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Invocation Count&lt;/li&gt;
&lt;li&gt;Error Rates&lt;/li&gt;
&lt;li&gt;Duration Statistics&lt;/li&gt;
&lt;li&gt;Memory Utilization&lt;/li&gt;
&lt;li&gt;Cold Start Frequency&lt;/li&gt;
&lt;li&gt;Timeout Proximity&lt;/li&gt;
&lt;li&gt;Runtime Distribution&lt;/li&gt;
&lt;li&gt;Cost Metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
To get started, ensure you have:&lt;br&gt;
• AWS CLI configured with the right permissions.&lt;br&gt;
• Terraform v1.5.0+ installed.&lt;br&gt;
• Python 3.9+ installed.&lt;br&gt;
• Cross-account IAM roles set up in each AWS account you wish to monitor.&lt;br&gt;
• Permission to create Lambda functions, OpenSearch domains, IAM roles and policies, CloudWatch events, and S3 buckets.&lt;/p&gt;
&lt;h2&gt;
  
  
  QuickStart Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the Repository&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/cloudon-one/opensearch-monitoring.git
cd opensearch-monitoring/lambda/terraform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure Variables&lt;/strong&gt;&lt;br&gt;
In a &lt;code&gt;terraform.tfvars&lt;/code&gt; file, define your settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_region                   = "us-west-1"
monitored_accounts           = ["123456789012", "098765432109"]
opensearch_master_user_password = "your-secure-password"
opensearch_instance_type     = "t3.small.search"
opensearch_instance_count    = 1
opensearch_volume_size       = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Initialize Terraform&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;terraform init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Plan &amp;amp; Apply&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform plan
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will provision the OpenSearch domain, monitoring Lambda, IAM roles, and other necessary resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Securing Your Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Regular Rotation
• Rotate access keys and review roles periodically.&lt;/li&gt;
&lt;li&gt;Access Logging
• Enable CloudTrail logging for all AWS API activities.&lt;/li&gt;
&lt;li&gt;Least Privilege
• Minimize permissions where possible and remove unused policies.&lt;/li&gt;
&lt;li&gt;Organization Controls
• Use AWS Organizations Service Control Policies (SCPs) for additional governance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;br&gt;
The Lambda Fleet Monitoring Solution offers a robust, scalable way to track and analyze performance for all your AWS Lambda functions — regardless of how many accounts you manage. By combining real-time CloudWatch metrics with the visualization power of OpenSearch, this solution ensures you stay on top of function behaviour, performance trends, and potential cost optimizations.&lt;br&gt;
For a deeper dive, including best practices, troubleshooting tips, and advanced configuration options, head to the opensearch-monitoring &lt;a href="https://github.com/cloudon-one/opensearch-monitoring" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; and explore the documentation. &lt;/p&gt;

&lt;p&gt;Feel free to fork, submit issues, or contribute enhancements!&lt;br&gt;
Have thoughts or questions?&lt;/p&gt;

&lt;p&gt;Comment below or open an issue on GitHub to share your ideas.&lt;br&gt;
Happy monitoring!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>lambda</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>The Kubernetes Troubleshooting Handbook</title>
      <dc:creator>Yaar Naumenko</dc:creator>
      <pubDate>Wed, 22 Jan 2025 13:11:18 +0000</pubDate>
      <link>https://forem.com/ynaumenko/the-kubernetes-troubleshooting-handbook-3cfn</link>
      <guid>https://forem.com/ynaumenko/the-kubernetes-troubleshooting-handbook-3cfn</guid>
      <description>&lt;p&gt;Debugging Kubernetes applications can feel like navigating a labyrinth. With its distributed nature and myriad components, identifying and resolving issues in Kubernetes requires robust tools and techniques.&lt;/p&gt;

&lt;p&gt;This article will explore various techniques and tools for troubleshooting and debugging Kubernetes. Whether you’re an experienced Kubernetes user or just getting started, this guide will provide valuable insights into efficient debugging practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing Pod Lifecycle Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding a pod's lifecycle is crucial for debugging and maintaining applications running in Kubernetes. Each pod goes through several phases, from creation to termination, and analyzing these events can help you identify and resolve issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pod Lifecycle Phases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pod in Kubernetes goes through the following phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x41ahhmyuyf6ulz2r21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x41ahhmyuyf6ulz2r21.png" alt="Pods Lifecycle Events" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;kubectl get&lt;/code&gt; and &lt;code&gt;kubectl describe&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To analyze the lifecycle events of a pod, you can use the &lt;code&gt;kubectl get&lt;/code&gt; and &lt;code&gt;kubectl describe&lt;/code&gt; commands.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl get&lt;/code&gt; command provides a high-level overview of the status of pods:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME              READY   STATUS    RESTARTS   AGE
web-server-pod    1/1     Running   0          5m
db-server-pod     1/1     Pending   0          2m
cache-server-pod  1/1     Completed 1          10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This output shows each pod's current status, which can help you identify pods that require further investigation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl describe&lt;/code&gt; command provides detailed information about a pod, including its lifecycle events:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:           web-server-pod
Namespace:      default
Node:           node-1/192.168.1.1
Start Time:     Mon, 01 Jan 2025 10:00:00 GMT
Labels:         app=web-server
Status:         Running
IP:             10.244.0.2
Containers:
  web-container:
    Container ID:   docker://abcdef123456
    Image:          nginx:latest
    State:          Running
      Started:      Mon, 01 Jan 2025 10:01:00 GMT
    Ready:          True
    Restart Count:  0
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/web-server-pod to node-1
  Normal  Pulled     9m    kubelet, node-1    Container image "nginx:latest" already present on machine
  Normal  Created    9m    kubelet, node-1    Created container web-container
  Normal  Started    9m    kubelet, node-1    Started container web-container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Analyzing Pod Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Events section in the kubectl describe output provides a chronological log of significant events for the pod. These events can help you understand the lifecycle transitions and identify issues such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduling delays:&lt;/strong&gt; Delays in scheduling the pod can indicate resource constraints or issues with the scheduler.&lt;br&gt;
&lt;strong&gt;Image pull errors:&lt;/strong&gt; Failures in pulling container images can indicate network issues or problems with the container registry.&lt;br&gt;
&lt;strong&gt;Container crashes:&lt;/strong&gt; Repeated container crashes can be diagnosed by examining the events leading up to the crash.&lt;/p&gt;
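&lt;p&gt;These three failure classes can often be surfaced directly with event filters — a sketch, since exact event reasons vary slightly by Kubernetes version:&lt;/p&gt;

```shell
# Scheduling delays: pods the scheduler could not place
kubectl get events --field-selector reason=FailedScheduling

# Image pull problems and other pod-level warnings
kubectl get events --field-selector type=Warning,involvedObject.kind=Pod

# All warnings, oldest first, to trace a crash loop as it developed
kubectl get events --field-selector type=Warning --sort-by=.metadata.creationTimestamp
```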

&lt;p&gt;&lt;strong&gt;Kubernetes Events and Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes generates cluster-wide event resources (&lt;strong&gt;kind&lt;/strong&gt;: Event) that give you a quick overview of what’s happening on the cluster.&lt;/p&gt;

&lt;p&gt;Audit logs (configured via a &lt;strong&gt;kind&lt;/strong&gt;: Policy object), on the other hand, help ensure compliance and security on the cluster. They can show login attempts, pod privilege escalations, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes events provide a timeline of significant occurrences within your cluster, such as pod scheduling, container restarts, and errors. They help understand state transitions and identify the root causes of issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Viewing Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To view events in your cluster, use the kubectl get events command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get events&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LAST SEEN   TYPE      REASON             OBJECT                                   MESSAGE
12s         Normal    Scheduled          pod/web-server-pod                       Successfully assigned default/web-server-pod to node-1
10s         Normal    Pulling            pod/web-server-pod                       Pulling image "nginx:latest"
8s          Normal    Created            pod/web-server-pod                       Created container web-container
7s          Normal    Started            pod/web-server-pod                       Started container web-container
5s          Warning   BackOff            pod/db-server-pod                        Back-off restarting failed container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Filtering Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can filter events to focus on specific namespaces, resource types, or periods. For example, to view events related to a particular pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events --field-selector involvedObject.name=web-server-pod 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Describing Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl describe&lt;/code&gt; command includes events in its output, providing detailed information about a specific resource along with its event history:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod web-server-pod&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/web-server-pod to node-1
  Normal  Pulled     9m    kubelet, node-1    Container image "nginx:latest" already present on machine
  Normal  Created    9m    kubelet, node-1    Created container web-container
  Normal  Started    9m    kubelet, node-1    Started container web-container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kubernetes Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Audit logs provide a detailed record of all API requests made to the Kubernetes API server, including the user, the action performed, and the outcome. They are essential for security auditing and compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enabling Audit Logging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Configure the API server with the appropriate flags and audit policy to enable audit logging. Here’s an example of an audit policy configuration:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;audit-policy.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
- level: RequestResponse
  users: ["admin"]
  verbs: ["update", "patch"]
  resources:
  - group: ""
    resources: ["configmaps"] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuring the API Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specify the audit policy file and log file location when starting the API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube-apiserver --audit-policy-file=/etc/kubernetes/audit-policy.yaml --audit-log-path=/var/log/kubernetes/audit.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Viewing Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Audit logs are typically written to a file. You can use standard log analysis tools to view and filter the logs. Here’s an example of an audit log entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "auditID": "12345",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/pods",
    "verb": "create",
    "user": {
        "username": "admin",
        "groups": ["system:masters"]
    },
    "sourceIPs": ["192.168.1.1"],
    "objectRef": {
        "resource": "pods",
        "namespace": "default",
        "name": "web-server-pod"
    },
    "responseStatus": {
        "metadata": {},
        "code": 201
    },
    "requestReceivedTimestamp": "2025-01-01T12:00:00Z",
    "stageTimestamp": "2025-01-01T12:00:01Z"
} 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
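<p>Because each audit entry is a JSON object, tools like <code>jq</code> make it easy to answer questions such as “who deleted what?”. A minimal sketch, assuming the log path configured in the API server flags above:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Extract user, verb and object for every delete request in the audit log
jq -r 'select(.verb == "delete") | "\(.user.username) \(.verb) \(.objectRef.resource)/\(.objectRef.name)"' /var/log/kubernetes/audit.log
</code></pre>

</div>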



&lt;p&gt;&lt;strong&gt;Kubernetes Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes Dashboard is a web-based UI that provides an easy way to manage and troubleshoot your Kubernetes cluster. It allows you to visualize cluster resources, deploy applications, and perform various administrative tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing the Kubernetes Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please take a look at the Kubernetes documentation for details on installing and accessing the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboctagfa8rieccvcp4ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboctagfa8rieccvcp4ui.png" alt="Kubernetes Dashboard" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using the Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Dashboard provides various features to help manage and troubleshoot your Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster Overview&lt;/strong&gt;: View the overall status of your cluster, including nodes, namespaces, and resource usage.&lt;br&gt;
&lt;strong&gt;Workloads&lt;/strong&gt;: Monitor and manage workloads, such as Deployments, ReplicaSets, StatefulSets, and DaemonSets.&lt;br&gt;
&lt;strong&gt;Services and Ingress&lt;/strong&gt;: Manage services and ingress resources to control network traffic.&lt;br&gt;
&lt;strong&gt;Config and Storage&lt;/strong&gt;: Manage ConfigMaps, Secrets, PersistentVolumeClaims, and other storage resources.&lt;br&gt;
&lt;strong&gt;Logs and Events&lt;/strong&gt;: View logs and events for troubleshooting and auditing purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring Resource Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring resource usage helps you understand how your applications consume resources and identify opportunities for optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools for Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kubectl top&lt;/strong&gt;: Provides real-time resource usage metrics.&lt;br&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;: Collects and stores metrics for detailed analysis.&lt;br&gt;
&lt;strong&gt;Grafana&lt;/strong&gt;: Visualizes metrics and provides dashboards for monitoring.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;kubectl top&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl top&lt;/code&gt; command shows the current CPU and memory usage of pods and nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pods
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME        CPU(cores)   MEMORY(bytes)
my-app-pod  100m         120Mi 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
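<p><code>kubectl top</code> can also sort its output and break usage down per container, which helps spot the heaviest consumers at a glance:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Sort pods by memory and show per-container usage
kubectl top pods --sort-by=memory --containers

# Sort nodes by CPU
kubectl top nodes --sort-by=cpu
</code></pre>

</div>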



&lt;p&gt;&lt;strong&gt;Using kubectl logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;kubectl logs is one of the most essential tools for debugging Kubernetes applications. This command retrieves logs from a specific container in a pod, allowing you to diagnose and troubleshoot issues effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest way to retrieve logs from a pod is by using the kubectl logs command followed by the pod name and namespace. Here’s a basic example for a pod running in a default namespace:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command fetches the logs from the first container in the specified pod. If your pod has multiple containers, you need to specify the container name as well:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time Logs with f Flag&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To stream logs in real-time, similar to &lt;code&gt;tail -f&lt;/code&gt; in Linux, use the &lt;code&gt;-f&lt;/code&gt; flag:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs -f &amp;lt;pod-name&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;This is particularly useful for monitoring logs as your application runs and observing the output of live processes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Some projects enhance the log tailing with additional capabilities, such as &lt;strong&gt;stern&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Retrieving Previous Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a pod has restarted, you can view the logs from the previous instance using the --previous flag:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; --previous&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Examining the logs before the failure helps us understand what caused the pod to restart.&lt;/p&gt;
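<p>Related flags narrow the log window, which keeps the output manageable on chatty pods:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Only the last 100 lines
kubectl logs &amp;lt;pod-name&amp;gt; --tail=100

# Only logs from the past hour, with timestamps prepended
kubectl logs &amp;lt;pod-name&amp;gt; --since=1h --timestamps
</code></pre>

</div>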

&lt;p&gt;&lt;strong&gt;Filtering Logs with Labels&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also filter logs from pods that match specific labels using kubectl along with &lt;code&gt;jq&lt;/code&gt; for advanced filtering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -l &amp;lt;label-selector&amp;gt; -o json | jq -r '.items[] | .metadata.name' | xargs -I {} kubectl logs {} 
Replace &amp;lt;label-selector&amp;gt; with your specific labels, such as app=myapp.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Combining with Other Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can combine kubectl logs with other Linux commands to enhance your debugging process. For example, to search for a specific error message in the logs, you can use grep:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs web-server-pod | grep "Error"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For a continuous search in real-time logs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs -f web-server-pod | grep --line-buffered "Error"&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log Rotation and Retention&lt;/strong&gt;: Please ensure your application handles log rotation to prevent the logs from consuming excessive disk space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Logging:&lt;/strong&gt; Structured logging (e.g., JSON format) can make it easier to parse and analyze logs using tools like &lt;code&gt;jq&lt;/code&gt;.&lt;/p&gt;
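<p>For example, if the application emits one JSON object per line, <code>jq</code> can pull out just the error messages. A sketch; the <code>level</code> and <code>msg</code> field names here are assumptions about your log format:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Print only the message field of error-level JSON log lines
kubectl logs web-server-pod | jq -r 'select(.level == "error") | .msg'
</code></pre>

</div>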

&lt;p&gt;&lt;strong&gt;Centralized Logging:&lt;/strong&gt; Consider setting up a centralized logging system (e.g., Elasticsearch, Fluentd, and Kibana — EFK stack) to aggregate and search logs from all your Kubernetes pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using &lt;code&gt;kubectl exec&lt;/code&gt; for Interactive Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec&lt;/code&gt; allows us to execute commands directly inside a running container. This is particularly useful for interactive troubleshooting, enabling the inspection of the container’s environment, running diagnostic commands, and performing real-time fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The basic syntax of &lt;code&gt;kubectl exec&lt;/code&gt; is as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -- &amp;lt;command&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Use the &lt;code&gt;-c&lt;/code&gt; flag to execute a command in a specific container within a pod. The command runs once and then exits:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- &amp;lt;command&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running an Interactive Shell&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common uses of kubectl exec is to open an interactive shell session within a container. This allows you to run multiple commands interactively. Here’s how to do it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; -- /bin/bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For containers using sh instead of bash:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; -- /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Inspecting Environment Variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To check the environment variables inside a container, you can use the env command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -- env&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;If you need to check environment variables in a specific container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- env&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Checking Configuration Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose you need to inspect a configuration file inside the container. You can use cat or any text editor available inside the container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -- cat /path/to/config/file&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a specific container:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt; -- cat /path/to/config/file 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Copying Files to and from Containers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you don’t have a binary you need inside a container, it’s easy to copy files to and from containers using &lt;code&gt;kubectl cp&lt;/code&gt;. For example, to copy a file from your local machine to a container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl cp /local/path/to/file &amp;lt;pod-name&amp;gt;:/container/path/to/file&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;To copy a file from a container to your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl cp &amp;lt;pod-name&amp;gt;:/container/path/to/file /local/path/to/file 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the &lt;code&gt;-i&lt;/code&gt; and &lt;code&gt;-t&lt;/code&gt; Flags&lt;/strong&gt;: The &lt;code&gt;-i&lt;/code&gt; flag makes the session interactive, and the &lt;code&gt;-t&lt;/code&gt; flag allocates a pseudo-TTY. Together, they enable a fully interactive session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run as a Specific User:&lt;/strong&gt; Note that &lt;code&gt;kubectl exec&lt;/code&gt; does not support a &lt;code&gt;--user&lt;/code&gt; flag. To run commands as a different user, set &lt;code&gt;securityContext.runAsUser&lt;/code&gt; in the pod spec, or switch users inside the container if a tool such as &lt;code&gt;su&lt;/code&gt; is available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; -- su &amp;lt;username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security Considerations:&lt;/strong&gt; Be cautious when running kubectl exec with elevated privileges. Ensure you have appropriate RBAC (Role-Based Access Control) policies in place to prevent unauthorized access.&lt;/p&gt;
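<p>As a sketch of such a policy, a namespaced Role can grant exec access on pods only to the subjects bound to it (the names here are illustrative):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-exec
rules:
# Exec is the "exec" subresource of pods; opening a session is a "create"
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
</code></pre>

</div>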

&lt;p&gt;&lt;strong&gt;Node-Level Debugging with kubectl debug&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most debugging techniques focus on the application level; however, the &lt;code&gt;kubectl debug&lt;/code&gt; node command can also be used to debug a specific Kubernetes node.&lt;/p&gt;

&lt;p&gt;Node-level debugging is crucial for diagnosing issues affecting the Kubernetes nodes, such as resource exhaustion, misconfigurations, or hardware failures.&lt;/p&gt;

&lt;p&gt;When you start a node debugging session this way, the debugging pod can access the node's root filesystem, which is mounted at &lt;code&gt;/host&lt;/code&gt; in the pod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Debugging Session:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the kubectl debug command to start a debugging session on a node. This command creates a pod running a debug container on the specified node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug node/&amp;lt;node-name&amp;gt; -it --image=busybox 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;node-name&amp;gt;&lt;/code&gt; with the name of the node you want to debug. The -it flag opens an interactive terminal, and &lt;code&gt;--image=busybox&lt;/code&gt; specifies the image for the debug container.&lt;/p&gt;
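<p>Once the session is open, the node's filesystem is available under <code>/host</code>; chroot-ing into it lets you run node-level commands much as you would over SSH. A sketch; which tools exist depends on the node's OS:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Inside the debug pod: switch the root to the node's filesystem
chroot /host

# Node-level tools are now available, e.g.:
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago"
</code></pre>

</div>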

&lt;p&gt;For more details, refer to the official Kubernetes documentation on node-level debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-Level Debugging with Debug Containers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more complex issues, consider using a debug container with pre-installed tools. There are a lot of good docker images with tooling and scripts for debugging, one that stands out to me is &lt;a href="https://github.com/nicolaka/netshoot" rel="noopener noreferrer"&gt;https://github.com/nicolaka/netshoot&lt;/a&gt;. It can quickly be created using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot 
Example: Using the debug container as a sidecar

 apiVersion: apps/v1
   kind: Deployment
   metadata:
       name: nginx-netshoot
       labels:
           app: nginx-netshoot
   spec:
   replicas: 1
   selector:
       matchLabels:
           app: nginx-netshoot
   template:
       metadata:
       labels:
           app: nginx-netshoot
       spec:
           containers:
           - name: nginx
           image: nginx:1.14.2
           ports:
               - containerPort: 80
           - name: netshoot
           image: nicolaka/netshoot
           command: ["/bin/bash"]
           args: ["-c", "while true; do ping localhost; sleep 60;done"] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the configuration:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl apply -f debug-pod.yaml&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Restart Policies&lt;/strong&gt;: Ensure that your pod specifications have appropriate restart policies to handle different failure scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated Monitoring&lt;/strong&gt;: Set up automated monitoring and alerting for critical issues such as CrashLoopBackOff using Prometheus and Alertmanager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ephemeral Containers for Debugging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ephemeral containers are temporary and explicitly created for debugging purposes. They are helpful for running diagnostic tools and commands without affecting the running application. This chapter will explore how to create and use ephemeral pods for interactive troubleshooting in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Use Ephemeral Pods?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Debugging in an isolated environment prevents accidental changes to running applications.&lt;br&gt;
&lt;strong&gt;Tool Availability&lt;/strong&gt;: Allows the use of specialized tools that may not be present in the application container.&lt;br&gt;
&lt;strong&gt;Temporary Nature&lt;/strong&gt;: These pods can be easily created and destroyed as needed without leaving a residual impact on the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating Ephemeral Pods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are several ways to create ephemeral pods in Kubernetes. One standard method is to use the &lt;code&gt;kubectl run&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Creating an Ephemeral Pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;kubectl debug&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl debug mypod -it --image=nicolaka/netshoot&lt;/code&gt;&lt;br&gt;
This command adds an ephemeral debug container running the Netshoot image to the pod and opens an interactive shell.&lt;/p&gt;
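<p>When the issue involves another container's processes, the <code>--target</code> flag shares that container's process namespace with the debug container, so tools like <code>ps</code> can see its processes. A sketch; the container name <code>app</code> is illustrative, and <code>--target</code> requires container runtime support:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code># Attach an ephemeral netshoot container that shares the process
# namespace of mypod's container named "app"
kubectl debug mypod -it --image=nicolaka/netshoot --target=app
</code></pre>

</div>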

&lt;p&gt;&lt;strong&gt;Practical Tips for Using Ephemeral Pods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Availability&lt;/strong&gt;: Ensure the debug container image includes all necessary tools for troubleshooting, such as &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;netcat&lt;/code&gt;, &lt;code&gt;nslookup&lt;/code&gt;, &lt;code&gt;df&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Considerations&lt;/strong&gt;: When creating ephemeral pods, consider security. Ensure they have limited access and are used by authorized personnel only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Advanced Debugging with Custom Debug Container&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of using a custom debug container for advanced debugging tasks.&lt;/p&gt;

&lt;p&gt;Create an Ephemeral Pod with Custom Debug Container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl debug -it redis5 --image=nicolaka/netshoot&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Defaulting debug container name to debugger-v4hfv.&lt;br&gt;
If you don't see a command prompt, try pressing enter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;88d888b. .d8888b. d8888P .d8888b. 88d888b. .d8888b. .d8888b. d8888P
88'  `88 88ooood8   88   Y8ooooo. 88'  `88 88'  `88 88'  `88   88
88    88 88.  ...   88         88 88    88 88.  .88 88.  .88   88
dP    dP `88888P'   dP   `88888P' dP    dP `88888P' `88888P'   dP

Welcome to Netshoot! (github.com/nicolaka/netshoot)
Version: 0.13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Diagnostic Commands:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inside the debug container we can run various commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check DNS resolution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nslookup kubernetes.default.svc.cluster.local

Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test network connectivity&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;curl http://my-service:8080/health&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;By using ephemeral pods, you can effectively debug and troubleshoot Kubernetes applications in an isolated and controlled environment, minimizing the risk of impacting production workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling DNS and Network Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will go through two common troubleshooting scenarios: DNS issues and debugging stateful pods. Let’s see what we have learned in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Network Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS Resolution Failures&lt;/strong&gt;: Issues resolving service names to IP addresses.&lt;br&gt;
&lt;strong&gt;Service Unreachable&lt;/strong&gt;: Services are not accessible within the cluster.&lt;br&gt;
&lt;strong&gt;Pod Communication Issues&lt;/strong&gt;: Pods cannot communicate with each other.&lt;br&gt;
&lt;strong&gt;Network Policy Misconfigurations&lt;/strong&gt;: Incorrect network policies blocking traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools and Commands for Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec&lt;/code&gt;: Run commands in a container to diagnose network issues. &lt;br&gt;
&lt;code&gt;nslookup&lt;/code&gt;: Check DNS resolution. &lt;br&gt;
&lt;code&gt;ping&lt;/code&gt;: Test connectivity between pods and services. &lt;br&gt;
&lt;code&gt;curl&lt;/code&gt;: Verify HTTP connectivity and responses. &lt;br&gt;
&lt;code&gt;traceroute&lt;/code&gt;: Trace the path packets take to reach a destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Diagnosing a DNS Resolution Issue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of diagnosing a DNS resolution issue for a pod named my-app-pod trying to reach a service my-db-service.&lt;/p&gt;

&lt;p&gt;Check DNS Resolution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it my-app-pod -- nslookup my-db-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, we can use a debug pod or an ephemeral container.&lt;br&gt;
Output indicating a problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server: 10.96.0.10
Address:10.96.0.10#53
** server can't find my-db-service: NXDOMAIN 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check CoreDNS Logs&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Inspect the logs of CoreDNS pods to identify any DNS resolution issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -l k8s-app=kube-dns -n kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for errors or warnings indicating DNS resolution failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify Service and Endpoints&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Ensure that the service and endpoints exist and are correctly configured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get svc my-db-service
kubectl get endpoints my-db-service 
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
my-db-serviceClusterIP   10.96.0.11   &amp;lt;none&amp;gt;        5432/TCP   1h 
NAME         ENDPOINTS            AGE
my-db-service10.244.0.5:5432      1h 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Restart CoreDNS Pods&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Restart CoreDNS pods to resolve potential transient issues.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl rollout restart deployment coredns -n kube-system&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify DNS Resolution Again:&lt;/p&gt;

&lt;p&gt;After resolving the issue, verify DNS resolution again:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it my-app-pod -- nslookup my-db-service&lt;/code&gt;&lt;br&gt;
Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server: 10.96.0.10
Address:10.96.0.10#53 
Name:   my-db-service.default.svc.cluster.local
Address:10.96.0.11 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical Tips&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use Network Debug Containers: Use network debug containers like &lt;code&gt;nicolaka/netshoot&lt;/code&gt; for comprehensive network troubleshooting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run netshoot --rm -it --image nicolaka/netshoot -- /bin/bash 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitor Network Metrics&lt;/strong&gt;: Use Prometheus and Grafana to monitor network metrics and set up network-issue alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement Redundancy&lt;/strong&gt;: Configure redundant DNS servers and failover mechanisms to enhance network reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Stateful Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stateful applications in Kubernetes require special debugging considerations due to their reliance on persistent storage and consistent state across restarts. This section will explore techniques for handling and debugging issues specific to stateful applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are Stateful Applications?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stateful applications maintain state information across sessions and restarts, often using persistent storage. Examples include databases, message queues, and other applications that require data persistence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Issues in Stateful Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent Storage Issues&lt;/strong&gt;: Problems with PVCs or PVs can lead to data loss or unavailability.&lt;br&gt;
&lt;strong&gt;Pod Start-up Failures&lt;/strong&gt;: Errors during pod initialization due to state dependencies.&lt;br&gt;
&lt;strong&gt;Network Partitioning&lt;/strong&gt;: Network issues affecting communication between stateful pods.&lt;br&gt;
&lt;strong&gt;Data Consistency Problems&lt;/strong&gt;: Inconsistent data across replicas or restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Debugging a MySQL StatefulSet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of debugging a MySQL StatefulSet named my-mysql.&lt;/p&gt;

&lt;p&gt;Inspect the StatefulSet:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe statefulset my-mysql&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:           my-mysql
Namespace:      default
Selector:       app=my-mysql
Replicas:       3 desired | 3 total
...
Events:
  Type    Reason            Age   From                    Message
  ----    ------            ----  ----                    -------
  Normal  SuccessfulCreate  1m    statefulset-controller  create Pod my-mysql-0 in StatefulSet my-mysql successful
  Normal  SuccessfulCreate  1m    statefulset-controller  create Pod my-mysql-1 in StatefulSet my-mysql successful
  Normal  SuccessfulCreate  1m    statefulset-controller  create Pod my-mysql-2 in StatefulSet my-mysql successful 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check Persistent Volume Claims:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pvc
kubectl describe pvc data-my-mysql-0 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:          data-my-mysql-0
Namespace:     default
Status:        Bound
Volume:        pvc-1234abcd-56ef-78gh-90ij-klmnopqrstuv
... 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check Pod Logs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs my-mysql-0&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-01-01T00:00:00.000000Z 0 [Note] mysqld (mysqld 8.0.23) starting as process 1 ...
2025-01-01T00:00:00.000000Z 1 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 11 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
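&lt;p&gt;The &lt;code&gt;Unable to lock ./ibdata1 error: 11&lt;/code&gt; line usually means another &lt;code&gt;mysqld&lt;/code&gt; process holds a lock on the data directory — commonly two pods mounting the same ReadWriteOnce volume after a node failover, or a process left over from an unclean shutdown. A quick way to check is to list every pod that references the claim (the name &lt;code&gt;data-my-mysql-0&lt;/code&gt; follows the claim-template, StatefulSet name, ordinal pattern from the example; &lt;code&gt;jq&lt;/code&gt; is assumed to be installed):&lt;/p&gt;

```shell
# List every pod in the namespace that mounts the PVC backing my-mysql-0.
# More than one pod here means the ReadWriteOnce volume is double-mounted,
# which produces exactly the InnoDB "Unable to lock ./ibdata1" error.
kubectl get pods -o json \
  | jq -r '.items[]
      | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "data-my-mysql-0")
      | .metadata.name'
```

&lt;p&gt;If exactly one pod is listed, the lock is likely stale from a previous crash and clears once that pod restarts cleanly.&lt;/p&gt;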



&lt;p&gt;Execute Commands in Pods:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it my-mysql-0 -- /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Inside the pod:&lt;/p&gt;

&lt;p&gt;Check mounted volumes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df -h&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify MySQL data directory:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls -l /var/lib/mysql&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Check MySQL status:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mysqladmin -u root -p status&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Check Network Connectivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it my-mysql-0 -- ping my-mysql-1.my-mysql.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PING my-mysql-1.my-mysql.default.svc.cluster.local (10.244.0.6): 56 data bytes
64 bytes from 10.244.0.6: icmp_seq=0 ttl=64 time=0.047 ms 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
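&lt;p&gt;The per-pod DNS name used above only resolves because the StatefulSet is backed by a headless Service whose name matches the StatefulSet’s &lt;code&gt;serviceName&lt;/code&gt;; if the lookup fails with NXDOMAIN, check that Service first. A minimal sketch — the name &lt;code&gt;my-mysql&lt;/code&gt; and the &lt;code&gt;app=my-mysql&lt;/code&gt; label are assumed from the example:&lt;/p&gt;

```yaml
# Hypothetical headless Service assumed to back the my-mysql StatefulSet.
# clusterIP: None makes DNS return individual pod IPs, which is what gives
# each pod its stable my-mysql-N.my-mysql.default.svc.cluster.local name.
apiVersion: v1
kind: Service
metadata:
  name: my-mysql
spec:
  clusterIP: None
  selector:
    app: my-mysql
  ports:
    - name: mysql
      port: 3306
```

&lt;p&gt;Also note that minimal container images often ship without &lt;code&gt;ping&lt;/code&gt;; where it is missing, &lt;code&gt;getent hosts&lt;/code&gt; or &lt;code&gt;nslookup&lt;/code&gt; (if present in the image) can verify DNS resolution instead.&lt;/p&gt;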



&lt;p&gt;&lt;strong&gt;Advanced Debugging Techniques&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advanced debugging techniques in Kubernetes involve using specialized tools and strategies to diagnose and resolve complex issues. This section covers tracing instrumentation and remote debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profiling with Jaeger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Jaeger is an open-source, end-to-end distributed tracing tool that helps monitor and troubleshoot transactions in complex distributed systems. Profiling with Jaeger can provide insights into the performance of your microservices and help identify latency issues.&lt;/p&gt;

&lt;p&gt;You can install Jaeger in your Kubernetes cluster using the Jaeger Operator or Helm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Instrument Your Application:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensure your application is instrumented to send tracing data to Jaeger. This typically involves adding Jaeger client libraries to your application code and configuring them to report to the Jaeger backend.&lt;/p&gt;

&lt;p&gt;Example in a Go application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    "io"
    "log"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)

// initJaeger builds a tracer that reports every span (const sampler, param 1)
// to the Jaeger agent Service inside the cluster.
func initJaeger(service string) (opentracing.Tracer, io.Closer) {
    cfg := config.Configuration{
        ServiceName: service,
        Sampler: &amp;amp;config.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1,
        },
        Reporter: &amp;amp;config.ReporterConfig{
            LogSpans:           true,
            LocalAgentHostPort: "jaeger-agent.default.svc.cluster.local:6831",
        },
    }
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatalf("could not initialize Jaeger tracer: %v", err)
    }
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access the Jaeger UI to view and analyze traces.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward svc/jaeger-query 16686:16686&lt;/code&gt; &lt;br&gt;
Open &lt;code&gt;http://localhost:16686&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remote Debugging with mirrord&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;mirrord is an open-source tool for remote debugging of Kubernetes services: it runs a local process in the context of your cluster, so the process sees the remote environment variables, file system, and incoming traffic while the code itself executes on your machine.&lt;/p&gt;

&lt;p&gt;Setting Up mirrord&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -fsSL https://raw.githubusercontent.com/metalbear-co/mirrord/main/scripts/install.sh | bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Connect to Your Cluster:&lt;/p&gt;

&lt;p&gt;mirrord targets the cluster in your current kubeconfig context, so verify that kubectl is pointing at the right cluster before starting a session.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl config current-context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Target a Deployment:&lt;/p&gt;

&lt;p&gt;Use mirrord to run your local service in the context of a deployment in your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mirrord exec --target-namespace devops-team --target deployment/foo-app-deployment nodemon server.js 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
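&lt;p&gt;Rather than repeating flags on every run, mirrord can also read its target from a configuration file that &lt;code&gt;mirrord exec&lt;/code&gt; picks up. A sketch, assuming the same namespace and deployment names as above and mirrord’s JSON config format (commonly placed at &lt;code&gt;.mirrord/mirrord.json&lt;/code&gt; in the project root):&lt;/p&gt;

```json
{
  "target": {
    "path": "deployment/foo-app-deployment",
    "namespace": "devops-team"
  }
}
```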



&lt;p&gt;This command redirects incoming traffic, environment variables, and file operations from the targeted deployment to your local process, allowing you to debug the service as if it were running in the cluster.&lt;/p&gt;

&lt;p&gt;Once the mirrord session is set up, you can debug the service on your local machine using your favourite debugging tools and IDEs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Breakpoints&lt;/strong&gt;: Use your IDE to set breakpoints and step through the code.&lt;br&gt;
&lt;strong&gt;Inspect Variables&lt;/strong&gt;: Inspect variables and application state to identify issues.&lt;br&gt;
&lt;strong&gt;Make Changes&lt;/strong&gt;: Make code changes and immediately see the effects without redeploying to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In addition to the core Kubernetes commands and open-source tools covered above, several other tools can enhance your troubleshooting capabilities: &lt;code&gt;k9s&lt;/code&gt; for interactive, terminal-based cluster navigation, &lt;code&gt;stern&lt;/code&gt; for tailing logs across multiple pods at once, and &lt;code&gt;kubectx&lt;/code&gt;/&lt;code&gt;kubens&lt;/code&gt; for fast switching between clusters and namespaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Debugging Kubernetes applications can be complex and challenging, but it becomes much more manageable with the right tools and techniques.&lt;/p&gt;

&lt;p&gt;Remember, effective debugging is not just about resolving issues as they arise but also about proactive monitoring, efficient resource management, and a deep understanding of your application’s architecture and dependencies.&lt;/p&gt;

&lt;p&gt;By implementing the strategies and best practices outlined in this guide, you can build a robust debugging framework that empowers you to quickly identify, diagnose, and resolve issues, ensuring the smooth operation of your Kubernetes deployments.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
