<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daksh Goel</title>
    <description>The latest articles on Forem by Daksh Goel (@dakxsh).</description>
    <link>https://forem.com/dakxsh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3887627%2Fb635be94-a28a-42fd-9f60-abcbefa8d37b.jpg</url>
      <title>Forem: Daksh Goel</title>
      <link>https://forem.com/dakxsh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dakxsh"/>
    <language>en</language>
    <item>
      <title>How to Deploy an Open Source LLM Reliably on Kubernetes (Step-by-Step)</title>
      <dc:creator>Daksh Goel</dc:creator>
      <pubDate>Sun, 19 Apr 2026 17:53:35 +0000</pubDate>
      <link>https://forem.com/dakxsh/how-to-deploy-an-open-source-llm-reliably-on-kubernetes-step-by-step-n7f</link>
      <guid>https://forem.com/dakxsh/how-to-deploy-an-open-source-llm-reliably-on-kubernetes-step-by-step-n7f</guid>
      <description>&lt;h1&gt;
  
  
  How to Deploy an Open Source LLM Reliably on Kubernetes
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Running AI models in production requires more than just downloading a &lt;br&gt;
model and running it locally. Anyone can run &lt;code&gt;ollama run mistral&lt;/code&gt; in a &lt;br&gt;
terminal — but what happens when that process crashes at 2am? What happens &lt;br&gt;
when you need to monitor memory usage, restart failed services automatically, &lt;br&gt;
or scale to handle more requests?&lt;/p&gt;

&lt;p&gt;That is exactly what Kubernetes solves.&lt;/p&gt;

&lt;p&gt;In this guide I will walk you through the complete process of deploying &lt;br&gt;
TinyLlama (a real open source LLM) inside a production-grade Kubernetes &lt;br&gt;
cluster on your local machine — with live monitoring via Grafana, a working &lt;br&gt;
chatbot UI, and a head-to-head performance comparison against Claude Haiku &lt;br&gt;
from Anthropic.&lt;/p&gt;

&lt;p&gt;By the end you will have a fully working AI stack that auto-restarts on &lt;br&gt;
failure, shows you real-time health metrics, and costs you nothing to run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full code:&lt;/strong&gt; &lt;a href="https://github.com/daksh777f/llm-on-kubernetes" rel="noopener noreferrer"&gt;https://github.com/daksh777f/llm-on-kubernetes&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What We Are Building
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Next.js Chatbot UI :3001]
        ↓
[Ollama API :11434]
        ↓
[Kubernetes Cluster — k3d]
├── llm namespace
│     └── Ollama Pod (serves TinyLlama)
└── monitoring namespace
      ├── Prometheus (scrapes metrics)
      └── Grafana (visualizes dashboards)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The full stack uses these tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;k3d&lt;/strong&gt; — Lightweight Kubernetes that runs entirely inside Docker. 
No cloud account needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — A tool that serves open source LLMs as a REST API 
inside a container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TinyLlama&lt;/strong&gt; — A 1.1B parameter open source model that runs in 
637MB of RAM. Perfect for 8GB machines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt; — Industry standard monitoring stack. 
Prometheus scrapes metrics, Grafana visualizes them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next.js&lt;/strong&gt; — The chatbot frontend. Calls Ollama through an 
API route.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, you need these installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker Desktop (running)&lt;/li&gt;
&lt;li&gt;kubectl&lt;/li&gt;
&lt;li&gt;k3d&lt;/li&gt;
&lt;li&gt;Helm&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Windows, install kubectl, k3d, Helm, and Node.js with a single Chocolatey command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;choco&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;kubernetes-cli&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;k3d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;kubernetes-helm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;nodejs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-y&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: Create the Kubernetes Cluster
&lt;/h2&gt;

&lt;p&gt;We use k3d because it creates a real multi-node Kubernetes cluster &lt;br&gt;
that runs entirely inside Docker containers. No cloud account, no &lt;br&gt;
VM setup, no cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;k3d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llm-cluster&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--agents&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8080:80@loadbalancer"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a cluster with one server node and one agent node. The &lt;br&gt;
&lt;code&gt;--port&lt;/code&gt; flag maps local port 8080 to port 80 on the cluster's load balancer for later use.&lt;/p&gt;

&lt;p&gt;Verify both nodes are ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;nodes&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsg66ymunay00ynhfwf4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsg66ymunay00ynhfwf4z.png" alt=" " width="800" height="116"&gt;&lt;/a&gt;&lt;br&gt;
You should see two nodes — one control-plane and one agent — both &lt;br&gt;
with status &lt;code&gt;Ready&lt;/code&gt;. That is your Kubernetes cluster running locally &lt;br&gt;
inside Docker.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 2: Deploy Ollama + TinyLlama
&lt;/h2&gt;

&lt;p&gt;Ollama is an open source tool that serves LLMs as a REST API. We &lt;br&gt;
deploy it as a Kubernetes Deployment so that if the pod ever crashes, &lt;br&gt;
Kubernetes automatically restarts it.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code&gt;ollama.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:latest&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;11434&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
        &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/tags&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;11434&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/tags&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;11434&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;11434&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;11434&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;livenessProbe&lt;/code&gt; and &lt;code&gt;readinessProbe&lt;/code&gt; — these are what make &lt;br&gt;
the deployment reliable. Kubernetes will continuously check if Ollama &lt;br&gt;
is responding. If it stops responding, Kubernetes kills the pod and &lt;br&gt;
starts a fresh one automatically.&lt;/p&gt;

&lt;p&gt;Apply the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;apply&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ollama.yaml&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the pod to be running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;pods&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-w&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the pod shows &lt;code&gt;1/1 Running&lt;/code&gt;, pull TinyLlama into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;deployment/ollama&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ollama&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;pull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tinyllama&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads the 637MB TinyLlama model inside the running pod.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1to31sy94091919n9n2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1to31sy94091919n9n2j.png" alt=" " width="768" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test that it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;port-forward&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;svc/ollama-service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;11434:11434&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nv"&gt;$body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'{"model":"tinyllama","prompt":"say hello","stream":false}'&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Invoke-RestMethod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:11434/api/generate"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-Method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ContentType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$body&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a &lt;code&gt;response&lt;/code&gt; field with text from TinyLlama. Your LLM &lt;br&gt;
is running inside Kubernetes.&lt;/p&gt;
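
&lt;p&gt;Incidentally, the &lt;code&gt;livenessProbe&lt;/code&gt; and &lt;code&gt;readinessProbe&lt;/code&gt; defined earlier are doing essentially the &lt;br&gt;
same thing on a schedule: an HTTP GET against &lt;code&gt;/api/tags&lt;/code&gt; that must come back with a 200. Here is a &lt;br&gt;
rough Python equivalent of that check (Python 3 is in the prerequisites; this is not part of the &lt;br&gt;
deployment and assumes the port-forward you already have open):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# probe_check.py - rough stand-in for the httpGet probe Kubernetes runs
import urllib.request

def ollama_is_healthy(url="http://127.0.0.1:11434/api/tags", timeout=5):
    """Return True if Ollama answers /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    print("healthy" if ollama_is_healthy() else "unhealthy")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;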


&lt;h2&gt;
  
  
  Step 3: Monitoring with Prometheus and Grafana
&lt;/h2&gt;

&lt;p&gt;Deploying without monitoring is flying blind. We use the &lt;br&gt;
kube-prometheus-stack Helm chart, which installs Prometheus, Grafana, &lt;br&gt;
and all the necessary exporters with a single command.&lt;/p&gt;

&lt;p&gt;Add the Helm repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;repo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;prometheus-community&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;https://prometheus-community.github.io/helm-charts&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;repo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the monitoring stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;monitoring&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;prometheus-community/kube-prometheus-stack&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;monitoring&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--create-namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;grafana.adminPassword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;admin123&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;alertmanager.enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;prometheus.prometheusSpec.retention&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;prometheus.prometheusSpec.resources.requests.memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="n"&gt;Mi&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;prometheus.prometheusSpec.resources.limits.memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="n"&gt;Mi&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last two &lt;code&gt;--set&lt;/code&gt; flags reduce memory usage — important on 8GB &lt;br&gt;
machines. The &lt;code&gt;alertmanager.enabled=false&lt;/code&gt; flag skips the alert &lt;br&gt;
manager to save another ~200MB.&lt;/p&gt;

&lt;p&gt;Wait for all pods to reach Running status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;pods&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;monitoring&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmcou4xbxr6lzidtwnt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmcou4xbxr6lzidtwnt.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Access Grafana by port-forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;kubectl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;port-forward&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;monitoring&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;svc/monitoring-grafana&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;3000:80&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt; and login with &lt;code&gt;admin&lt;/code&gt; / &lt;code&gt;admin123&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Import the Kubernetes Dashboard
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Left sidebar → Dashboards → New → Import&lt;/li&gt;
&lt;li&gt;Enter ID &lt;code&gt;15661&lt;/code&gt; → Load&lt;/li&gt;
&lt;li&gt;Select Prometheus as data source → Import&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You now have a live dashboard showing cluster CPU, memory, pod counts, &lt;br&gt;
and network traffic — all updating in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftof36malj98hs245nq2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftof36malj98hs245nq2b.png" alt=" " width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Create a Custom Ollama Panel
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Dashboards → New → New Dashboard → Add visualization&lt;/li&gt;
&lt;li&gt;Select Prometheus as data source&lt;/li&gt;
&lt;li&gt;Switch to Code mode and enter: &lt;code&gt;kube_pod_info{namespace="llm"}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change visualization to Stat&lt;/li&gt;
&lt;li&gt;Title: &lt;code&gt;Ollama LLM Pod Status&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Save dashboard as &lt;code&gt;LLM Monitoring&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6ltgomd8ux3je5ed2d3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6ltgomd8ux3je5ed2d3.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This panel shows you at a glance whether your LLM pod is up. In a &lt;br&gt;
production setup, you would wire this to an alert that pages you &lt;br&gt;
when it goes down.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 4: Build the Chatbot UI
&lt;/h2&gt;

&lt;p&gt;The chatbot is built with Next.js and Tailwind CSS. It talks to &lt;br&gt;
TinyLlama through a server-side API route that calls Ollama directly.&lt;/p&gt;

&lt;p&gt;Create the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;npx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;create-next-app&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nx"&gt;latest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;chatbot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--typescript&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--tailwind&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--app&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--yes&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;chatbot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API route (&lt;code&gt;app/api/chat/route.ts&lt;/code&gt;) handles all communication &lt;br&gt;
with Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NextRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;next/server&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;NextRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="s2"&gt;`User: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Assistant: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Assistant:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://127.0.0.1:11434/api/generate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tinyllama&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;NextResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No response from model.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run on port 3001 since Grafana already uses port 3000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;npm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;3001&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;a href="http://localhost:3001" rel="noopener noreferrer"&gt;http://localhost:3001&lt;/a&gt; — you have a working chatbot talking to &lt;br&gt;
your Kubernetes-hosted LLM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfhva492rnk4okgax0z8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfhva492rnk4okgax0z8.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqnhvmw3ybi7q95s3kaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqnhvmw3ybi7q95s3kaf.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Open Source vs Commercial LLM — Real Numbers
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. I wrote a Python script that ran &lt;br&gt;
10 identical prompts through both TinyLlama running in my Kubernetes &lt;br&gt;
cluster and Claude Haiku from Anthropic's API, measuring response &lt;br&gt;
time and cost for every single query.&lt;/p&gt;

&lt;p&gt;The 10 prompts covered a range of task types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identity questions ("Which model are you?")&lt;/li&gt;
&lt;li&gt;Technical explanations ("What is Kubernetes?", "What is Docker?")&lt;/li&gt;
&lt;li&gt;Coding tasks ("Write a Python function to reverse a string")&lt;/li&gt;
&lt;li&gt;Creative tasks ("Write a haiku about programming")&lt;/li&gt;
&lt;li&gt;General knowledge ("What is the capital of Australia?")&lt;/li&gt;
&lt;/ul&gt;
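
&lt;p&gt;The full script is &lt;code&gt;compare.py&lt;/code&gt; in the repo. As a rough sketch of the timing loop, here is the &lt;br&gt;
local half only (it assumes the same port-forward to Ollama used earlier; the Claude side follows the &lt;br&gt;
same pattern through Anthropic's API and is omitted here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# compare_sketch.py - outline of the timing loop (local TinyLlama side only)
import json
import time
import urllib.request

PROMPTS = [
    "Which model are you?",
    "What is Kubernetes?",
    "Write a Python function to reverse a string",
    # ...plus the rest of the ten prompts listed above
]

def ask_tinyllama(prompt):
    """Send one prompt to the port-forwarded Ollama service and return its reply."""
    body = json.dumps({"model": "tinyllama", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

timings = []
for prompt in PROMPTS:
    start = time.perf_counter()
    ask_tinyllama(prompt)
    timings.append(time.perf_counter() - start)

print(f"average response time: {sum(timings) / len(timings):.2f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;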

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;TinyLlama (Local K8s)&lt;/th&gt;
&lt;th&gt;Claude Haiku (API)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Average response time&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20.23 seconds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.41 seconds&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost for 10 queries&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.003&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per query&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;FREE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.0003&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runs on your hardware&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data leaves your machine&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response quality&lt;/td&gt;
&lt;td&gt;Good for simple tasks&lt;/td&gt;
&lt;td&gt;Stronger reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latest knowledge&lt;/td&gt;
&lt;td&gt;Cutoff at training&lt;/td&gt;
&lt;td&gt;More recent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vs3k16fsiike1s43mo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vs3k16fsiike1s43mo4.png" alt=" " width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Story Behind These Numbers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Claude is 49 times faster.&lt;/strong&gt; That gap comes entirely from hardware. &lt;br&gt;
TinyLlama running on a CPU on an 8GB laptop takes 20 seconds. The &lt;br&gt;
same model running on a GPU node in Kubernetes would take under 1 &lt;br&gt;
second. Claude runs on Anthropic's optimized GPU infrastructure and &lt;br&gt;
responds in under half a second consistently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TinyLlama is completely free.&lt;/strong&gt; For 10 queries, 100 queries, or &lt;br&gt;
1 million queries — the cost is exactly zero. You pay for electricity &lt;br&gt;
and hardware you already own. Claude charges per token, which at &lt;br&gt;
Haiku pricing is extremely cheap (~$0.0003 per query) but adds up &lt;br&gt;
at scale. At 1 million queries per day, that is $300/day vs $0/day.&lt;/p&gt;
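
&lt;p&gt;The arithmetic behind that claim, as a quick sanity check (the per-query figure is the approximate &lt;br&gt;
Haiku number from the table above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# back-of-the-envelope: commercial API cost as query volume grows
cost_per_query = 0.0003  # approximate Claude Haiku cost per query, from the table above

for queries_per_day in (1_000, 100_000, 1_000_000):
    daily_cost = cost_per_query * queries_per_day
    print(f"{queries_per_day:,} queries/day costs roughly ${daily_cost:,.2f}/day")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;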

&lt;p&gt;&lt;strong&gt;Privacy is where local wins completely.&lt;/strong&gt; Every query you send &lt;br&gt;
to Claude goes over the internet to Anthropic's servers. Every query &lt;br&gt;
you send to TinyLlama in your Kubernetes cluster never leaves your &lt;br&gt;
machine. For healthcare, legal, or financial applications where data &lt;br&gt;
privacy is non-negotiable, local is the only option.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Each
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use TinyLlama (local Kubernetes) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data privacy is non-negotiable&lt;/li&gt;
&lt;li&gt;You are cost-sensitive at scale&lt;/li&gt;
&lt;li&gt;Tasks are simple: Q&amp;amp;A, summarization, basic code generation&lt;/li&gt;
&lt;li&gt;You want full control over your AI infrastructure&lt;/li&gt;
&lt;li&gt;You are building a product and do not want API dependency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Claude / commercial API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response speed matters (customer-facing, real-time)&lt;/li&gt;
&lt;li&gt;Tasks need strong reasoning or latest knowledge&lt;/li&gt;
&lt;li&gt;You are prototyping and do not want infra overhead&lt;/li&gt;
&lt;li&gt;Reliability SLAs matter more than cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The production answer is usually both.&lt;/strong&gt; Route privacy-sensitive &lt;br&gt;
queries to your local Kubernetes LLM. Route quality-critical queries &lt;br&gt;
to a commercial API. This hybrid architecture gives you the best of &lt;br&gt;
both worlds.&lt;/p&gt;
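
&lt;p&gt;A toy sketch of that routing decision in Python (the local call mirrors the comparison sketch above; &lt;br&gt;
&lt;code&gt;claude_answer&lt;/code&gt; is a stand-in for whichever commercial client you actually use, not a real API call):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# hybrid_router.py - toy sketch of routing between the local and commercial models
import json
import urllib.request

def tinyllama_answer(prompt):
    """Privacy-sensitive path: the query never leaves your infrastructure."""
    body = json.dumps({"model": "tinyllama", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def claude_answer(prompt):
    """Quality-critical path: stand-in for whichever commercial API client you use."""
    raise NotImplementedError("call the commercial API here")

def answer(prompt, privacy_sensitive):
    # Route on whatever signal fits your product: a user setting,
    # a data classification tag, or a simple keyword check.
    return tinyllama_answer(prompt) if privacy_sensitive else claude_answer(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;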




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Kubernetes adds real reliability that you cannot get with raw Docker.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;livenessProbe&lt;/code&gt; in our Ollama deployment means if the model &lt;br&gt;
crashes or hangs, Kubernetes detects it within 30 seconds and &lt;br&gt;
automatically restarts the pod. With a plain Docker container, &lt;br&gt;
you would need to notice the crash yourself and restart it manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. RAM matters more than you think with LLMs.&lt;/strong&gt;&lt;br&gt;
Mistral 7B requires 4.5GB of RAM and completely failed on my 8GB &lt;br&gt;
machine because the OS and Kubernetes overhead left only 3.1GB free. &lt;br&gt;
TinyLlama at 637MB ran perfectly. Always check model requirements &lt;br&gt;
before deploying — a model that cannot load is worse than no model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. One Helm command beats hours of configuration.&lt;/strong&gt;&lt;br&gt;
The entire Prometheus + Grafana monitoring stack — which would take &lt;br&gt;
hours to configure manually — installed in a single &lt;code&gt;helm install&lt;/code&gt; &lt;br&gt;
command. This is the power of the Kubernetes ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The speed gap closes dramatically with a GPU.&lt;/strong&gt;&lt;br&gt;
TinyLlama takes 20 seconds on a CPU. On an NVIDIA T4 GPU (available &lt;br&gt;
on GKE for ~$0.35/hour), the same model runs in under 1 second. &lt;br&gt;
If you need local + fast, the answer is a GPU node, not a bigger &lt;br&gt;
CPU model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Port-forwarding is for development only.&lt;/strong&gt;&lt;br&gt;
In production you would use a Kubernetes Ingress with a real domain &lt;br&gt;
name and TLS certificate — not port-forwarding. Everything in this &lt;br&gt;
guide is the right foundation for production, but swap port-forwards &lt;br&gt;
for a proper Ingress before going live.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Try Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a GPU node&lt;/strong&gt; — Deploy to GKE with a T4 GPU and watch 
TinyLlama go from 20s to under 1s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try larger models&lt;/strong&gt; — With a GPU, Mistral 7B or Llama 3 8B 
will fit comfortably&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add streaming responses&lt;/strong&gt; — Ollama supports streaming; the 
chatbot can show tokens as they generate instead of waiting (see the sketch 
after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up real alerting&lt;/strong&gt; — Configure Grafana alerts to send 
a Slack message when the Ollama pod goes down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy to cloud&lt;/strong&gt; — Replace k3d with GKE or EKS for a 
production cluster with real uptime guarantees&lt;/li&gt;
&lt;/ul&gt;
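
&lt;p&gt;On the streaming point: when &lt;code&gt;stream&lt;/code&gt; is set to true, Ollama returns newline-delimited JSON &lt;br&gt;
chunks instead of one final object. A minimal Python sketch of consuming that stream (again assuming &lt;br&gt;
the port-forward to &lt;code&gt;127.0.0.1:11434&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# stream_demo.py - watch TinyLlama stream tokens, one JSON chunk per line
import json
import urllib.request

body = json.dumps({
    "model": "tinyllama",
    "prompt": "Explain Kubernetes in one sentence.",
    "stream": True,
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:                      # Ollama sends one JSON object per line
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;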




&lt;h2&gt;
  
  
  Full Code
&lt;/h2&gt;

&lt;p&gt;Everything in this guide is available on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/daksh777f/llm-on-kubernetes" rel="noopener noreferrer"&gt;https://github.com/daksh777f/llm-on-kubernetes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ollama.yaml&lt;/code&gt; — Kubernetes deployment for Ollama&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chatbot/&lt;/code&gt; — Complete Next.js chatbot&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;compare.py&lt;/code&gt; — The comparison script&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying an open source LLM reliably is not just about running &lt;br&gt;
a model — it is about building infrastructure that handles failures &lt;br&gt;
gracefully, gives you visibility into what is happening, and scales &lt;br&gt;
when you need it to.&lt;/p&gt;

&lt;p&gt;The stack we built today — k3d + Ollama + Prometheus + Grafana + &lt;br&gt;
Next.js — is a genuine foundation for a production AI system. &lt;br&gt;
TinyLlama is free, private, and good enough for a wide range of &lt;br&gt;
tasks. When you need more power, swap it for Mistral or Llama 3 &lt;br&gt;
on a GPU node.&lt;/p&gt;

&lt;p&gt;The future of AI is not just API calls to commercial providers. &lt;br&gt;
It is open source models running on infrastructure you own and &lt;br&gt;
control. Now you know exactly how to build it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this guide helped you, drop a star on the &lt;br&gt;
&lt;a href="https://github.com/daksh777f/llm-on-kubernetes" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; &lt;br&gt;
and share it with someone building with LLMs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
