<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rost</title>
    <description>The latest articles on Forem by Rost (@rosgluk).</description>
    <link>https://forem.com/rosgluk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3544400%2F04dd81bf-749e-4055-971f-316c0134e76c.jpg</url>
      <title>Forem: Rost</title>
      <link>https://forem.com/rosgluk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rosgluk"/>
    <language>en</language>
    <item>
      <title>16 GB VRAM LLM benchmarks with llama.cpp (speed and context)</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sat, 04 Apr 2026 12:42:13 +0000</pubDate>
      <link>https://forem.com/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg</link>
      <guid>https://forem.com/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg</guid>
      <description>&lt;p&gt;Here I am comparing speed of several LLMs running on GPU with 16GB of VRAM, and choosing the best one for self-hosting.&lt;/p&gt;

&lt;p&gt;I ran these LLMs on llama.cpp with 19K, 32K, and 64K token context windows.&lt;/p&gt;

&lt;p&gt;For the broader performance picture (throughput versus latency, VRAM limits, parallel requests, and how benchmarks fit together across hardware and runtimes), see &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;LLM Performance in 2026: Benchmarks, Bottlenecks &amp;amp; Optimization&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The quality of the response is analysed in other articles, for instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/ai-devtools/opencode/llms-comparison/" rel="noopener noreferrer"&gt;Best LLMs for OpenCode - Tested Locally&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/llm-hosting/ollama/translation-quality-comparison-llms-on-ollama/" rel="noopener noreferrer"&gt;Comparison of Hugo Page Translation quality - LLMs on Ollama&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran a similar test for LLMs on Ollama: &lt;a href="https://www.glukhov.org/llm-performance/benchmarks/choosing-best-llm-for-ollama-on-16gb-vram-gpu/" rel="noopener noreferrer"&gt;Best LLMs for Ollama on 16GB VRAM GPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this post I record my attempts to squeeze out as much speed as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM speed comparison table (tokens per second and VRAM)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size (GB)&lt;/th&gt;
&lt;th&gt;19K VRAM (GB)&lt;/th&gt;
&lt;th&gt;19K Load&lt;/th&gt;
&lt;th&gt;19K T/s&lt;/th&gt;
&lt;th&gt;32K VRAM (GB)&lt;/th&gt;
&lt;th&gt;32K Load&lt;/th&gt;
&lt;th&gt;32K T/s&lt;/th&gt;
&lt;th&gt;64K VRAM (GB)&lt;/th&gt;
&lt;th&gt;64K Load&lt;/th&gt;
&lt;th&gt;64K T/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-35B-A3B-UD-IQ3_S&lt;/td&gt;
&lt;td&gt;13.6&lt;/td&gt;
&lt;td&gt;14.3&lt;/td&gt;
&lt;td&gt;93/100&lt;/td&gt;
&lt;td&gt;136.4&lt;/td&gt;
&lt;td&gt;14.6&lt;/td&gt;
&lt;td&gt;93/100&lt;/td&gt;
&lt;td&gt;138.5&lt;/td&gt;
&lt;td&gt;14.9&lt;/td&gt;
&lt;td&gt;88/115&lt;/td&gt;
&lt;td&gt;136.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-27B-UD-IQ3_XXS&lt;/td&gt;
&lt;td&gt;11.5&lt;/td&gt;
&lt;td&gt;12.9&lt;/td&gt;
&lt;td&gt;98/100&lt;/td&gt;
&lt;td&gt;45.3&lt;/td&gt;
&lt;td&gt;13.7&lt;/td&gt;
&lt;td&gt;98/100&lt;/td&gt;
&lt;td&gt;45.1&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;45/410&lt;/td&gt;
&lt;td&gt;22.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-122B-A10B-UD-IQ3_XXS&lt;/td&gt;
&lt;td&gt;44.7&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;30/470&lt;/td&gt;
&lt;td&gt;22.3&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;30/480&lt;/td&gt;
&lt;td&gt;21.8&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;28/490&lt;/td&gt;
&lt;td&gt;21.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nvidia Nemotron-Cascade-2-30B IQ4_XS&lt;/td&gt;
&lt;td&gt;18.2&lt;/td&gt;
&lt;td&gt;14.6&lt;/td&gt;
&lt;td&gt;60/305&lt;/td&gt;
&lt;td&gt;115.8&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;57/311&lt;/td&gt;
&lt;td&gt;113.6&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;55/324&lt;/td&gt;
&lt;td&gt;103.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma-4-26B-A4B-it-UD-IQ4_XS&lt;/td&gt;
&lt;td&gt;13.4&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;95/100&lt;/td&gt;
&lt;td&gt;121.7&lt;/td&gt;
&lt;td&gt;14.9&lt;/td&gt;
&lt;td&gt;95/115&lt;/td&gt;
&lt;td&gt;114.9&lt;/td&gt;
&lt;td&gt;14.9&lt;/td&gt;
&lt;td&gt;75/190&lt;/td&gt;
&lt;td&gt;96.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma-4-31B-it-UD-IQ3_XXS&lt;/td&gt;
&lt;td&gt;11.8&lt;/td&gt;
&lt;td&gt;14.8&lt;/td&gt;
&lt;td&gt;68/287&lt;/td&gt;
&lt;td&gt;29.2&lt;/td&gt;
&lt;td&gt;14.8&lt;/td&gt;
&lt;td&gt;41/480&lt;/td&gt;
&lt;td&gt;18.4&lt;/td&gt;
&lt;td&gt;14.8&lt;/td&gt;
&lt;td&gt;18/634&lt;/td&gt;
&lt;td&gt;8.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-4.7-Flash-IQ4_XS&lt;/td&gt;
&lt;td&gt;16.3&lt;/td&gt;
&lt;td&gt;15.0&lt;/td&gt;
&lt;td&gt;66/240&lt;/td&gt;
&lt;td&gt;91.8&lt;/td&gt;
&lt;td&gt;14.9&lt;/td&gt;
&lt;td&gt;62/262&lt;/td&gt;
&lt;td&gt;86.1&lt;/td&gt;
&lt;td&gt;14.9&lt;/td&gt;
&lt;td&gt;53/313&lt;/td&gt;
&lt;td&gt;72.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-4.7-Flash-REAP-23B IQ4_XS&lt;/td&gt;
&lt;td&gt;12.6&lt;/td&gt;
&lt;td&gt;13.7&lt;/td&gt;
&lt;td&gt;92/100&lt;/td&gt;
&lt;td&gt;122.0&lt;/td&gt;
&lt;td&gt;14.4&lt;/td&gt;
&lt;td&gt;95/102&lt;/td&gt;
&lt;td&gt;123.2&lt;/td&gt;
&lt;td&gt;14.9&lt;/td&gt;
&lt;td&gt;71/196&lt;/td&gt;
&lt;td&gt;97.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;19K, 32K, and 64K are the context sizes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Load&lt;/code&gt; columns above show &lt;code&gt;GPU/CPU load&lt;/code&gt;.&lt;br&gt;
A low GPU number in this column means the model is running mostly on CPU and cannot reach any decent speed on this hardware. That pattern matches what people see when too little of the model fits on the GPU or when a larger context pushes work back to the host.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why context length changes tokens per second
&lt;/h2&gt;

&lt;p&gt;As you move from 19K to 32K or 64K tokens, the KV cache grows and VRAM pressure rises. Some rows show a big drop in tokens per second at 64K while others stay flat, which is the signal to revisit quants, context limits, or layer offload rather than assuming the model is “slow” in general.&lt;/p&gt;
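&lt;p&gt;A rough back-of-the-envelope for the paragraph above: KV-cache size grows linearly with context length. A minimal sketch, with illustrative layer and head counts that are not taken from any model in the table:&lt;/p&gt;

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
# Illustrative architecture numbers; fp16 cache (2 bytes per element).
layers=48; kv_heads=8; head_dim=128; ctx=65536; bytes_per_elem=2
kv_mib=$(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024 / 1024 ))
echo "approx KV cache at ${ctx} tokens: ${kv_mib} MiB"
```

&lt;p&gt;Halving the context halves the result, which is why a model that flies at 19K can crawl at 64K once the cache spills past VRAM.&lt;/p&gt;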

&lt;p&gt;I chose these models and quants so that I could run them myself and see whether they give a good cost/benefit gain on this equipment. So no q8 quants with 200K context here :) ...&lt;/p&gt;

&lt;p&gt;GPU/CPU load was measured with &lt;code&gt;nvitop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When auto-configuring how many layers to offload to the GPU, llama.cpp tries to keep 1GB of VRAM free.&lt;br&gt;
This can be set manually via the command-line parameter &lt;code&gt;-ngl&lt;/code&gt;, but I am not fine-tuning it here;&lt;br&gt;
the point is that if there is a significant performance drop when growing the context window from 32k to 64k, we can try to recover speed at 64k by tuning the number of offloaded layers.&lt;/p&gt;
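&lt;p&gt;As a sketch, the run shape looks like the command composed below; it only prints the invocation so the flag layout is visible. The model filename and layer count are placeholders, and available flags can differ between llama.cpp builds:&lt;/p&gt;

```shell
# Compose the benchmark command: -c sets the context size,
# -ngl sets how many layers stay on the GPU (placeholder values throughout).
MODEL="model-IQ3_XXS.gguf"
CTX=65536
NGL=40   # lower this if VRAM overflows at large contexts
CMD="llama-cli -m $MODEL -c $CTX -ngl $NGL"
echo "$CMD"
```

&lt;p&gt;Dropping &lt;code&gt;-ngl&lt;/code&gt; entirely lets llama.cpp pick the layer split itself, which is what the runs in the table above did.&lt;/p&gt;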

&lt;h2&gt;
  
  
  Test hardware and llama.cpp setup
&lt;/h2&gt;

&lt;p&gt;I tested the LLM speed on a PC with this config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel i7-14700&lt;/li&gt;
&lt;li&gt;RAM: 64GB 6000 MT/s (2x32GB)&lt;/li&gt;
&lt;li&gt;GPU: RTX 4080 (16GB VRAM)&lt;/li&gt;
&lt;li&gt;Ubuntu with NVidia drivers&lt;/li&gt;
&lt;li&gt;llama.cpp/llama-cli, with no offloaded-layer count (&lt;code&gt;-ngl&lt;/code&gt;) specified&lt;/li&gt;
&lt;li&gt;Initial VRAM used before starting llama-cli: 300MB&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extra runs at 128K context (Qwen3.5 27B and 122B)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;128K Load&lt;/th&gt;
&lt;th&gt;128K T/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-27B-UD-IQ3_XXS&lt;/td&gt;
&lt;td&gt;16/625&lt;/td&gt;
&lt;td&gt;9.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-122B-A10B-UD-IQ3_XXS&lt;/td&gt;
&lt;td&gt;27/496&lt;/td&gt;
&lt;td&gt;19.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Takeaways for 16 GB VRAM builds
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;My current favorite, Qwen3.5-27B-UD-IQ3_XXS, looks good at its 50K-context sweet spot (I get approximately 36 t/s)&lt;/li&gt;
&lt;li&gt;Qwen3.5-122B-A10B-UD-IQ3_XXS overtakes Qwen3.5 27B performance-wise at contexts above 64K.&lt;/li&gt;
&lt;li&gt;I can push Qwen3.5-35B-A3B-UD-IQ3_S to a 100K-token context, and it still fits into VRAM, so there is no performance drop&lt;/li&gt;
&lt;li&gt;I will not use gemma-4-31B on 16GB VRAM, but gemma-4-26B might land in the middle; I still need to test it.&lt;/li&gt;
&lt;li&gt;I also need to test how well Nemotron Cascade 2 and GLM-4.7 Flash REAP 23B work. Will they beat Qwen3.5-35B q3? I doubt it, but testing would confirm the suspicion.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ai</category>
      <category>hardware</category>
    </item>
    <item>
<title>RTX 5090 in Australia, March 2026: Pricing and Stock Reality</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Wed, 01 Apr 2026 02:51:36 +0000</pubDate>
      <link>https://forem.com/rosgluk/rtx-5090-in-australia-march-2026-pricing-stock-reality-55ai</link>
      <guid>https://forem.com/rosgluk/rtx-5090-in-australia-march-2026-pricing-stock-reality-55ai</guid>
      <description>&lt;p&gt;Australia has RTX 5090 stock.&lt;br&gt;
Barely.&lt;br&gt;
And if you find one, you will pay a premium that feels detached from reality.&lt;/p&gt;

&lt;p&gt;For GPUs, CPUs, memory, AI workstations, and wider compute trends, see &lt;a href="https://www.glukhov.org/hardware/" rel="noopener noreferrer"&gt;Compute Hardware in 2026: GPUs, CPUs, Memory &amp;amp; AI Workstations&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RTX 5090 stock availability in Australia
&lt;/h2&gt;

&lt;p&gt;Let us be blunt: the RTX 5090 is not "out of stock everywhere", but it is effectively scarce.&lt;/p&gt;

&lt;p&gt;Across major Australian retailers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some models are listed as in stock, but only a handful at a time&lt;/li&gt;
&lt;li&gt;Most SKUs are sold out or flip between in stock and gone within days&lt;/li&gt;
&lt;li&gt;High end variants dominate availability, not entry models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scorptec shows multiple RTX 5090 listings, but most are sold out with only occasional stock appearing &lt;/li&gt;
&lt;li&gt;Umart and Mwave still carry limited inventory, but only a few models are actually purchasable at any given time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a strange situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technically available&lt;/li&gt;
&lt;li&gt;Practically hard to buy at will&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real wait times
&lt;/h2&gt;

&lt;p&gt;If you are not lucky:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restock cycles feel like 2 to 6 weeks&lt;/li&gt;
&lt;li&gt;Popular models disappear within hours or days&lt;/li&gt;
&lt;li&gt;Preorders have quietly returned, with little visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors global trends, where RTX 5090 stock vanished quickly after demand spikes in early 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  RTX 5090 pricing in Australia right now
&lt;/h2&gt;

&lt;p&gt;Here is where things get painful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Typical street prices (March 2026)
&lt;/h3&gt;

&lt;p&gt;From real retailer listings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entry level AIB cards: ~5999 AUD
&lt;/li&gt;
&lt;li&gt;Mid range premium cards: 6300 to 6500 AUD
&lt;/li&gt;
&lt;li&gt;High end OC or liquid cooled: 6500 to 7500 AUD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even within a single retailer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same GPU silicon&lt;/li&gt;
&lt;li&gt;Price swings of 1000+ AUD depending on cooling and branding&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compared to official pricing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The official starting price was about 4039 AUD&lt;/li&gt;
&lt;li&gt;Real world pricing is now 50 to 80 percent higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not normal inflation. This is structural scarcity plus margin stacking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why RTX 5090 is still expensive and scarce
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AI demand is eating supply
&lt;/h3&gt;

&lt;p&gt;Blackwell is not just a gaming GPU.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI workloads&lt;/li&gt;
&lt;li&gt;Local LLM inference&lt;/li&gt;
&lt;li&gt;Enterprise spillover demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gamers are competing with developers and businesses.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. GDDR7 and bleeding edge silicon constraints
&lt;/h3&gt;

&lt;p&gt;5090 depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New memory supply chains&lt;/li&gt;
&lt;li&gt;Advanced packaging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not scaling fast enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. AIB partner pricing strategy
&lt;/h3&gt;

&lt;p&gt;Board partners learned from previous cycles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch high&lt;/li&gt;
&lt;li&gt;Stay high&lt;/li&gt;
&lt;li&gt;Discount later (maybe)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now, there is zero incentive to reduce prices.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Australia tax and logistics penalty
&lt;/h3&gt;

&lt;p&gt;Australia always pays more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import costs&lt;/li&gt;
&lt;li&gt;Smaller allocation pools&lt;/li&gt;
&lt;li&gt;Currency effects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So shortages hit harder locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Should you buy RTX 5090 in March 2026
&lt;/h2&gt;

&lt;p&gt;Opinionated answer: only if you absolutely need it.&lt;/p&gt;

&lt;p&gt;You are paying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Early adopter tax&lt;/li&gt;
&lt;li&gt;Supply chain tax&lt;/li&gt;
&lt;li&gt;Hype tax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only rational buyers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI developers needing local compute&lt;/li&gt;
&lt;li&gt;High end creators&lt;/li&gt;
&lt;li&gt;People upgrading from very old GPUs (like 20 series or earlier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everyone else is subsidising Nvidia margins.&lt;/p&gt;




&lt;h2&gt;
  
  
  RTX 5080 Super and RTX 6000 series expectations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  RTX 5080 Super
&lt;/h3&gt;

&lt;p&gt;Rumors and historical patterns suggest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Release window: late 2026&lt;/li&gt;
&lt;li&gt;Likely improvements:

&lt;ul&gt;
&lt;li&gt;Higher clocks&lt;/li&gt;
&lt;li&gt;Faster GDDR7 bins&lt;/li&gt;
&lt;li&gt;Better efficiency&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;But do not expect miracles. It will be a refinement, not a revolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  RTX 6000 consumer series
&lt;/h3&gt;

&lt;p&gt;Looking further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected timeframe: 2027&lt;/li&gt;
&lt;li&gt;Likely direction:

&lt;ul&gt;
&lt;li&gt;Stronger AI acceleration baked in&lt;/li&gt;
&lt;li&gt;Better perf per watt&lt;/li&gt;
&lt;li&gt;Possible shift toward chiplet style GPUs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The bigger question is pricing, not performance.&lt;/p&gt;

&lt;p&gt;If Nvidia keeps current strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTX 6090 could launch even higher than current 5090 street prices&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The uncomfortable prognosis
&lt;/h2&gt;

&lt;p&gt;The GPU market has changed.&lt;/p&gt;

&lt;p&gt;What used to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch spike&lt;/li&gt;
&lt;li&gt;Then price normalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch spike&lt;/li&gt;
&lt;li&gt;Then sustained high plateau&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unless one of these happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AMD delivers real high end competition&lt;/li&gt;
&lt;li&gt;AI demand cools down&lt;/li&gt;
&lt;li&gt;Supply massively increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not expect RTX 5090 prices in Australia to drop meaningfully in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related GPU and RAM pricing in Australia
&lt;/h2&gt;

&lt;p&gt;Earlier checks on the same cards and retailers, plus paired RAM moves for build totals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/hardware/memory/ram-and-gpu-price-increase/" rel="noopener noreferrer"&gt;GPU and RAM Prices Surge in Australia: RTX 5090 Up 15%, RAM Up 38% - January 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/hardware/gpu/nvidia-rtx-5080-rtx-5090-australia-prices-november-2025/" rel="noopener noreferrer"&gt;NVidia RTX 5080 and RTX 5090 prices in Australia - November 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/hardware/gpu/nvidia-rtx-5080-rtx-5090-prices-october-2025/" rel="noopener noreferrer"&gt;NVidia RTX 5080 and RTX 5090 prices in Australia - October 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/hardware/gpu/nvidia-rtx-5080-rtx-5090-prices-july-2025/" rel="noopener noreferrer"&gt;NVidia RTX 5080 and RTX 5090 prices in Australia - July 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/hardware/gpu/nvidia-rtx-5080-rtx-5090-prices-australia/" rel="noopener noreferrer"&gt;Nvidia RTX 5080 and RTX 5090 Prices in Australia - June 2025&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For global DDR5 pressure and how it stacks next to GPU line items, see &lt;a href="https://www.glukhov.org/hardware/memory/ram-price-increase/" rel="noopener noreferrer"&gt;RAM Price Surge: Up to 619% in 2025&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RTX 5090 is available in Australia, but barely
&lt;/li&gt;
&lt;li&gt;Prices are consistently 6000 to 7500 AUD
&lt;/li&gt;
&lt;li&gt;Supply is unstable and unpredictable
&lt;/li&gt;
&lt;li&gt;Waiting might save money, but that is not guaranteed
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are asking whether now is a good time to buy:&lt;/p&gt;

&lt;p&gt;It is not.&lt;/p&gt;

&lt;p&gt;It is just the least bad time if you need the performance right now.&lt;/p&gt;

</description>
      <category>selfhosting</category>
      <category>llm</category>
      <category>hardware</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>Remote Ollama access via Tailscale or WireGuard, no public ports</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 31 Mar 2026 05:06:46 +0000</pubDate>
      <link>https://forem.com/rosgluk/remote-ollama-access-via-tailscale-or-wireguard-no-public-ports-3n40</link>
      <guid>https://forem.com/rosgluk/remote-ollama-access-via-tailscale-or-wireguard-no-public-ports-3n40</guid>
      <description>&lt;p&gt;Ollama is at its happiest when it is treated like a local daemon: the CLI and your apps talk to a loopback HTTP API, and the rest of the network never finds out it exists. &lt;/p&gt;

&lt;p&gt;By default, that is exactly what happens: the common local base address is on localhost port 11434.&lt;/p&gt;

&lt;p&gt;This article is about the moment you want remote access (laptop, another office machine, maybe a phone), but you do not want to publish an unauthenticated model runner to the whole internet. That intent matters, because the easiest scaling move (open a port, forward it, done) is also the move that creates the mess.&lt;/p&gt;

&lt;p&gt;A practical north star is simple: keep the Ollama API private, then make the private network path boring. Tailscale and WireGuard are two common ways to do that, and the rest is making sure the host listens only where it should and the firewall agrees.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Remote device
  |
  | (private VPN path: tailscale or wireguard)
  v
VPN interface on host (tailscale0 or wg0)
  |
  | (local hop)
  v
Ollama server (HTTP API on localhost or VPN IP)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Threat model and who should reach the API
&lt;/h2&gt;

&lt;p&gt;How can Ollama be accessed remotely without exposing it to the public internet? The answer is less about a specific tool and more about being explicit on "who is allowed to connect" and "from where".&lt;/p&gt;

&lt;p&gt;A useful mental model is three concentric rings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local only: only processes on the box can call the API.&lt;/li&gt;
&lt;li&gt;LAN only: devices on the same local network can call the API.&lt;/li&gt;
&lt;li&gt;VPN only: selected devices and users on a private overlay network can call the API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first ring is the default. Many guides (and tools like Postman) assume the base URL is localhost 11434, which is both convenient and a surprisingly strong safety boundary.&lt;/p&gt;

&lt;p&gt;The reason the rings matter is that Ollama is commonly described as having no built-in authentication for its local HTTP API, meaning network exposure and access control become your job if you move beyond localhost.&lt;/p&gt;

&lt;p&gt;The other reason is cost and abuse: even a "private" LLM endpoint is still an API endpoint. The OWASP API Security Top 10 calls out categories like security misconfiguration and unrestricted resource consumption; a model runner is practically a poster child for "resource consumption" if exposed casually.&lt;/p&gt;

&lt;p&gt;So the basic threat model is not only "an attacker reads my data". It is also "someone can drive my CPU and GPU like a rented car" and "unintended users discover it and start building against it".&lt;/p&gt;

&lt;h2&gt;
  
  
  OLLAMA_HOST and bind semantics in 90 seconds
&lt;/h2&gt;

&lt;p&gt;What does OLLAMA_HOST do and what is the safest default value? OLLAMA_HOST is the switch that controls where the Ollama server listens. In &lt;code&gt;ollama serve&lt;/code&gt;, the environment variable is described as the IP address and port for the server, with a default of 127.0.0.1 and port 11434.&lt;/p&gt;

&lt;p&gt;In plain terms, the bind address decides which networks can even attempt a TCP connection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;127.0.0.1 means localhost only.&lt;/li&gt;
&lt;li&gt;A LAN IP (like 192.168.x.y) means the LAN can reach it.&lt;/li&gt;
&lt;li&gt;0.0.0.0 means all interfaces (LAN, VPN, everything) can reach it unless a firewall blocks them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why most "make it accessible" how-tos suggest switching from 127.0.0.1 to 0.0.0.0, but that advice is incomplete without an interface-aware firewall.&lt;/p&gt;

&lt;p&gt;Here is the cheat sheet I keep in my head:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local only (baseline)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;127.0.0.1:11434

&lt;span class="c"&gt;# All interfaces (powerful, easy to regret)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0:11434

&lt;span class="c"&gt;# VPN interface only (preferred when the VPN has a stable IP on the host)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.64.0.10:11434   &lt;span class="c"&gt;# example tailscale IP&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.10.10.1:11434    &lt;span class="c"&gt;# example wireguard IP&lt;/span&gt;

&lt;span class="c"&gt;# Different port (useful when 11434 is already taken)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;127.0.0.1:11435
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "different port" case is explicitly discussed in the Ollama issue tracker as an example of using OLLAMA_HOST to alter the listen port.&lt;/p&gt;

&lt;p&gt;One operational footnote that bites people: if Ollama runs as a managed service, setting environment variables in an interactive shell does not necessarily change the service configuration. This is why many "it worked in my terminal but not after reboot" stories end up in systemd unit overrides or service manager configuration.&lt;/p&gt;
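&lt;p&gt;On systemd-based distros, the persistent fix is a drop-in override rather than a shell export. A sketch of what the drop-in might contain (the tailscale IP is the example value from above, not a required address):&lt;/p&gt;

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# created with: sudo systemctl edit ollama
[Service]
Environment="OLLAMA_HOST=100.64.0.10:11434"
```

&lt;p&gt;Follow it with &lt;code&gt;sudo systemctl daemon-reload&lt;/code&gt; and a service restart so the new bind address actually takes effect.&lt;/p&gt;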

&lt;h2&gt;
  
  
  Pattern A VPN first with Tailscale
&lt;/h2&gt;

&lt;p&gt;Can Tailscale restrict access to only one service port on a machine? Yes, and that is a big part of why Tailscale is a good fit for "remote access without publishing".&lt;/p&gt;

&lt;p&gt;Tailscale gives you a private network (a tailnet) with centrally managed access controls (ACLs). ACLs exist specifically to manage device permissions and secure the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  No public port means no router choreography
&lt;/h3&gt;

&lt;p&gt;The cleanest pattern is to avoid opening any internet-facing port for Ollama at all and treat the VPN as the only ingress. With Tailscale, devices attempt to connect directly peer-to-peer when possible, and can fall back to relay mechanisms when direct connectivity is not possible.&lt;/p&gt;

&lt;p&gt;This is not magic security by itself, but it radically shrinks the blast radius compared to "I forwarded 11434 on my router".&lt;/p&gt;

&lt;h3&gt;
  
  
  Split horizon and naming with MagicDNS
&lt;/h3&gt;

&lt;p&gt;A second question that shows up in real life is "do I connect via LAN IP when I am at home and via VPN IP when I am away". That is basically a split-horizon problem.&lt;/p&gt;

&lt;p&gt;Tailscale MagicDNS helps by giving each device a stable tailnet hostname. Under the hood, MagicDNS generates a FQDN for every device that combines the machine name and your tailnet DNS name, and modern tailnet names end in &lt;code&gt;.ts.net&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The opinionated take is that using a name is usually better than hard-coding an IP, because the name follows the device even if your tailnet IP changes. But it is also fine to be intentionally boring and keep a small hosts file or a single internal DNS record if you prefer. MagicDNS exists so you do not have to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direct port versus tailnet-only proxying
&lt;/h3&gt;

&lt;p&gt;There are two common Tailscale ways to reach a service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct port access, where the service listens on the tailnet interface and clients connect to that IP and port.&lt;/li&gt;
&lt;li&gt;Tailscale Serve, where Tailscale routes traffic from other tailnet devices to a local service on the host.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Serve is explicitly described as routing traffic from other tailnet devices to a local service running on your device.&lt;/p&gt;

&lt;p&gt;For Ollama, Serve can be attractive because it lets you keep Ollama on localhost and expose only a controlled ingress path through Tailscale. It also pairs naturally with HTTPS inside the tailnet if you want browser-friendly endpoints.&lt;/p&gt;
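&lt;p&gt;As a sketch, the Serve pattern can look like the command composed below; the exact flag syntax has changed between Tailscale releases, so treat it as illustrative and check &lt;code&gt;tailscale serve --help&lt;/code&gt; on your version:&lt;/p&gt;

```shell
# Compose a Serve command that publishes local port 11434 to the tailnet only.
# Ollama itself keeps listening on 127.0.0.1; nothing touches the public internet.
TARGET="http://127.0.0.1:11434"
SERVE_CMD="tailscale serve --bg $TARGET"
echo "$SERVE_CMD"
```

&lt;p&gt;On the host you would run the printed command; &lt;code&gt;tailscale serve status&lt;/code&gt; shows the mapping and &lt;code&gt;tailscale serve reset&lt;/code&gt; removes it.&lt;/p&gt;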

&lt;p&gt;A related feature worth naming and then mentally parking is Funnel. Funnel is designed to route traffic from the broader internet to a service on a tailnet device and is explicitly for "anyone to access even if they do not use Tailscale". That is the opposite of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern B WireGuard for those who want the raw primitives
&lt;/h2&gt;

&lt;p&gt;WireGuard is the underlying primitive that powers many VPN products, and it is deliberately minimal: you configure an interface, define peers, and decide what traffic is allowed to flow.&lt;/p&gt;

&lt;p&gt;The WireGuard quick start shows the basic shape: create an interface such as &lt;code&gt;wg0&lt;/code&gt;, assign IPs, and configure peers with &lt;code&gt;wg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The key concept for scoping access is AllowedIPs. In the Red Hat documentation, WireGuard reads the destination IP from a packet and compares it to the list of allowed IP addresses; if the peer is not found, WireGuard drops the packet.&lt;/p&gt;

&lt;p&gt;For an Ollama host, the practical translation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put the host on a private WireGuard subnet.&lt;/li&gt;
&lt;li&gt;Bind Ollama either to localhost and forward to it, or bind directly to the WireGuard IP.&lt;/li&gt;
&lt;li&gt;Only peers that have the correct keys and AllowedIPs can route traffic to that private IP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fewer moving parts than a commercial overlay, but it also means you are responsible for key distribution, peer lifecycle, and how remote peers reach your network.&lt;/p&gt;
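&lt;p&gt;For concreteness, here is a minimal &lt;code&gt;wg-quick&lt;/code&gt;-style sketch of the host side; the keys and the 10.10.10.0/24 subnet are placeholders:&lt;/p&gt;

```ini
# /etc/wireguard/wg0.conf on the Ollama host (placeholder keys and subnet)
[Interface]
Address = 10.10.10.1/24
ListenPort = 51820
PrivateKey = HOST_PRIVATE_KEY

[Peer]
# the one laptop allowed to reach the Ollama API
PublicKey = LAPTOP_PUBLIC_KEY
AllowedIPs = 10.10.10.2/32
```

&lt;p&gt;Only the peer holding the matching private key can source traffic from 10.10.10.2, which is exactly the AllowedIPs cryptokey-routing behaviour described above.&lt;/p&gt;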

&lt;h2&gt;
  
  
  Firewall allow only VPN interface or tailnet
&lt;/h2&gt;

&lt;p&gt;How can a firewall limit Ollama to only VPN interface traffic? The goal is to prevent accidental exposure even if the bind address becomes broader than intended.&lt;/p&gt;

&lt;p&gt;The general pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow the Ollama TCP port only on the VPN interface (tailscale0 or wg0).&lt;/li&gt;
&lt;li&gt;Deny the same port on everything else.&lt;/li&gt;
&lt;li&gt;Prefer "default deny inbound" if you operate that way for the host.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tailscale has explicit guidance on using UFW to restrict non-Tailscale traffic to a server, which is essentially the "lock down everything except the tailnet" approach.&lt;/p&gt;

&lt;p&gt;One nuance that matters for Tailscale specifically is that host firewall expectations may not match reality if you assume UFW will block tailnet traffic. The Tailscale project has discussed that it intentionally installs a rule to allow traffic on &lt;code&gt;tailscale0&lt;/code&gt; and relies on an ACL-controlled filter inside &lt;code&gt;tailscaled&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is not an argument against a host firewall. It is an argument for being deliberate about which control plane is actually enforcing policy. If you want "only these devices can reach port 11434", Tailscale ACLs are designed for that job.&lt;/p&gt;
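&lt;p&gt;A minimal tailnet policy sketch for that job; &lt;code&gt;ollama-host&lt;/code&gt; is a hosts alias you would define yourself, and the user is a placeholder:&lt;/p&gt;

```json
{
  "hosts": { "ollama-host": "100.64.0.10" },
  "acls": [
    { "action": "accept", "src": ["alice@example.com"], "dst": ["ollama-host:11434"] }
  ]
}
```

&lt;p&gt;Once a policy file defines rules, anything it does not match is denied, so everyone else on the tailnet loses access to port 11434 by default.&lt;/p&gt;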

&lt;p&gt;If you do want interface-level host controls anyway, the examples tend to look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# UFW style logic (illustrative)&lt;/span&gt;
ufw allow &lt;span class="k"&gt;in &lt;/span&gt;on tailscale0 to any port 11434 proto tcp
ufw deny  &lt;span class="k"&gt;in &lt;/span&gt;to any port 11434 proto tcp

&lt;span class="c"&gt;# Or for wireguard&lt;/span&gt;
ufw allow &lt;span class="k"&gt;in &lt;/span&gt;on wg0 to any port 11434 proto tcp
ufw deny  &lt;span class="k"&gt;in &lt;/span&gt;to any port 11434 proto tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if you rely primarily on VPN policy, the host firewall still provides a useful "seatbelt" against misbinding to 0.0.0.0 or unexpected service wrappers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional reverse proxy only on VPN ingress
&lt;/h2&gt;

&lt;p&gt;When is a reverse proxy useful for remote Ollama access? A proxy is useful when you want one or more of these properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A standard authentication gate (basic auth, OIDC, client certs).&lt;/li&gt;
&lt;li&gt;TLS termination with a certificate clients trust.&lt;/li&gt;
&lt;li&gt;Request limits and timeouts.&lt;/li&gt;
&lt;li&gt;Cleaner URLs for tools that dislike raw ports.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the "do not publish to the internet" intent should still stay true: the reverse proxy is reachable only via the VPN, not on the public WAN interface.&lt;/p&gt;

&lt;p&gt;Is TLS needed when traffic already goes through a VPN? Not always for cryptography, but often for ergonomics. Tailscale points out that connections between nodes are already end-to-end encrypted, but browsers are not aware of that because they rely on TLS certificates to establish HTTPS trust.&lt;/p&gt;

&lt;p&gt;If you are in the Tailscale world, you can enable HTTPS certificates for your tailnet, which requires MagicDNS and explicitly notes that machine names and the tailnet DNS name will be published on a public ledger (certificate transparency logs).&lt;/p&gt;

&lt;p&gt;That public-ledger detail is not a reason to avoid TLS, but it is a reason to name machines like an adult (avoid embedding private project names or customer identifiers in hostnames).&lt;/p&gt;

&lt;p&gt;This article intentionally does not include full reverse-proxy configuration; that topic deserves its own write-up. The only idea that matters here is placement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ollama listens on localhost or VPN IP.&lt;/li&gt;
&lt;li&gt;Reverse proxy listens on the VPN interface only.&lt;/li&gt;
&lt;li&gt;Proxy forwards to Ollama.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security checklist for remote Ollama API access
&lt;/h2&gt;

&lt;p&gt;This is the checklist I use to keep "remote" from silently becoming "public".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binding and reachability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm the server listens where you think it listens. The documented default is &lt;code&gt;127.0.0.1:11434&lt;/code&gt;, and &lt;code&gt;OLLAMA_HOST&lt;/code&gt; changes that.&lt;/li&gt;
&lt;li&gt;Treat 0.0.0.0 as a deliberate choice, not a convenience toggle.&lt;/li&gt;
&lt;li&gt;Prefer binding to a VPN interface IP when it is stable and fits the topology.&lt;/li&gt;
&lt;/ul&gt;
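&lt;p&gt;One quick way to verify the first point from the host itself (illustrative; flags and output format vary slightly by distro):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show the address the Ollama port is actually bound to
ss -tlnp | grep 11434

# 127.0.0.1:11434  -&gt; loopback only
# 100.x.y.z:11434  -&gt; a specific VPN interface IP
# 0.0.0.0:11434 or [::]:11434 -&gt; every interface, including WAN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;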

&lt;p&gt;&lt;strong&gt;Access control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If using Tailscale, implement ACLs that allow only the specific users or tagged devices to the Ollama port. ACLs exist to manage device permissions.&lt;/li&gt;
&lt;li&gt;If using WireGuard, keep &lt;code&gt;AllowedIPs&lt;/code&gt; tight and treat keys as the real identity boundary. WireGuard drops packets that do not match a valid peer's &lt;code&gt;AllowedIPs&lt;/code&gt; mapping.&lt;/li&gt;
&lt;/ul&gt;
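&lt;p&gt;A tight server-side peer entry looks like this (illustrative addresses; the key is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/wireguard/wg0.conf on the server: one client, one /32
[Peer]
PublicKey = &lt;client-public-key&gt;
# Only packets sourced from exactly this address are accepted
# from this peer; anything else is dropped.
AllowedIPs = 10.8.0.2/32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;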

&lt;p&gt;&lt;strong&gt;Firewall&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a host-level rule that allows the Ollama port only on &lt;code&gt;tailscale0&lt;/code&gt; or &lt;code&gt;wg0&lt;/code&gt; and blocks it everywhere else.&lt;/li&gt;
&lt;li&gt;If you expect UFW to block tailnet traffic, verify how Tailscale interacts with your firewall. Tailscale has discussed allowing &lt;code&gt;tailscale0&lt;/code&gt; traffic and relying on ACL filtering inside &lt;code&gt;tailscaled&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TLS and proxying&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use TLS when clients are browsers or when tooling expects HTTPS, even if the VPN already encrypts transport. Tailscale documents this gap between VPN encryption and browser HTTPS trust.&lt;/li&gt;
&lt;li&gt;If you enable Tailscale HTTPS certs, remember the certificate transparency implication for hostnames.&lt;/li&gt;
&lt;li&gt;If you add a reverse proxy, keep it VPN-only and use it for auth and limits, not for internet exposure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Avoid accidental public exposure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be wary of features explicitly designed to publish services to the internet. Tailscale Funnel routes traffic from the broader internet to a tailnet device, which is not the default-safe path for an Ollama API.&lt;/li&gt;
&lt;li&gt;If anything ends up internet-reachable, do not leave an anonymous &lt;code&gt;/api&lt;/code&gt; surface. At that point, the OWASP API "security misconfiguration" and "resource consumption" risk categories stop being theoretical.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability and damage control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log requests at the ingress layer (VPN policy logs, proxy logs, or both).&lt;/li&gt;
&lt;li&gt;Add request and concurrency limits if your proxy supports them, because model inference is a resource event, not a normal API call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The consistent theme is boring on purpose: keep the Ollama API private by default, add a private path for remote access, then enforce that policy twice (VPN identity plus host firewall) so a single misstep does not turn into a public endpoint.&lt;/p&gt;

</description>
      <category>hosting</category>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Structured Logging in Go with slog for Observability and Alerting</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sat, 28 Mar 2026 10:34:14 +0000</pubDate>
      <link>https://forem.com/rosgluk/structured-logging-in-go-with-slog-for-observability-and-alerting-3fnm</link>
      <guid>https://forem.com/rosgluk/structured-logging-in-go-with-slog-for-observability-and-alerting-3fnm</guid>
      <description>&lt;p&gt;Logs are a debugging interface you can still use when the system is on fire.&lt;br&gt;
The problem is that plain text logs age poorly: as soon as you need filtering,&lt;br&gt;
aggregation, and alerting, you start parsing sentences.&lt;/p&gt;



&lt;p&gt;Structured logging is the antidote.&lt;br&gt;
It turns every log line into a small event with stable fields, so tools can search and aggregate reliably.&lt;br&gt;
For how logs connect to metrics, dashboards, and alerting in the wider stack, see the &lt;a href="https://www.glukhov.org/observability/" rel="noopener noreferrer"&gt;Observability: Monitoring, Metrics, Prometheus &amp;amp; Grafana Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  What structured logging is and why it scales
&lt;/h2&gt;

&lt;p&gt;Structured logging is logging where a record is not just a string, but a&lt;br&gt;
message plus typed key-value attributes. The idea is boring in the best way:&lt;br&gt;
once logs are machine-readable, an incident stops being a grep contest.&lt;/p&gt;

&lt;p&gt;A quick comparison:&lt;/p&gt;

&lt;p&gt;Plain text (human-first, tool-hostile)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;failed to charge card user=42 amount=19.99 ms=842 err=timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structured (tool-first, still readable)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"failed to charge card"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;19.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;842&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, it helps to think of logs as an event stream emitted by the&lt;br&gt;
process, while routing and storage live outside the application. That mental&lt;br&gt;
model pushes you toward writing one event per line and keeping events easy to&lt;br&gt;
ship and re-process.&lt;/p&gt;
&lt;h2&gt;
  
  
  Slog in Go as a shared logging front end
&lt;/h2&gt;

&lt;p&gt;Go has had the classic log package since forever, but modern services need&lt;br&gt;
levels and fields. The log/slog package (Go 1.21 and later) brings structured&lt;br&gt;
logging into the standard library and formalises a common shape for log&lt;br&gt;
records: time, level, message, and attributes.&lt;br&gt;
For a compact language and command refresher alongside this guide, see the &lt;a href="https://www.glukhov.org/post/2022/12-golang-cheatsheet/" rel="noopener noreferrer"&gt;Go Cheatsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The key parts of the model are:&lt;/p&gt;
&lt;h3&gt;
  
  
  Record
&lt;/h3&gt;

&lt;p&gt;A record is what happened. In slog terms, it contains time, level, message,&lt;br&gt;
and a set of attributes. You create records via methods like Info and Error,&lt;br&gt;
or via Log when you want to supply the level explicitly.&lt;/p&gt;
&lt;h3&gt;
  
  
  Attributes
&lt;/h3&gt;

&lt;p&gt;Attributes are the key-value pairs that make logs queryable. If you log the&lt;br&gt;
same concept under three different keys (user, userId, uid), you get three&lt;br&gt;
different datasets. Consistent keys are where the real value hides.&lt;/p&gt;
&lt;h3&gt;
  
  
  Handler
&lt;/h3&gt;

&lt;p&gt;A handler is how records become bytes. The built-in TextHandler writes&lt;br&gt;
key=value output, while JSONHandler writes line-delimited JSON. Handlers are&lt;br&gt;
also where redaction, key renaming, and output routing tend to happen.&lt;/p&gt;

&lt;p&gt;One under-rated feature is that slog can sit in front of existing code. When&lt;br&gt;
you set a default slog logger, top-level slog functions use it, and the classic&lt;br&gt;
log package can be redirected to it too. That makes incremental migration&lt;br&gt;
possible.&lt;/p&gt;
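<p>A minimal sketch of that migration path:<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "log"
    "log/slog"
    "os"
)

func main() {
    // Install a JSON slog logger as the process-wide default.
    slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, nil)))

    // Top-level slog calls use it...
    slog.Info("server starting", "port", 8080)

    // ...and so does the classic log package: this line is routed
    // through the default slog handler at INFO level.
    log.Printf("legacy call from old code")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;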
&lt;h3&gt;
  
  
  Groups
&lt;/h3&gt;

&lt;p&gt;Groups solve the "every subsystem uses id" problem. You can group a set of&lt;br&gt;
attributes for a request (request.method, request.path) or namespace an entire&lt;br&gt;
subsystem with WithGroup so keys do not collide.&lt;/p&gt;
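<p>A short sketch of both forms:<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "log/slog"
    "os"
)

func main() {
    l := slog.New(slog.NewTextHandler(os.Stdout, nil))

    // Inline group: the output keys become request.method and request.path.
    l.Info("handled",
        slog.Group("request",
            slog.String("method", "GET"),
            slog.String("path", "/api/v1/items"),
        ),
    )

    // WithGroup namespaces every later attribute: db.id rather than id.
    l.WithGroup("db").Info("query done", "id", 42)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;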
&lt;h2&gt;
  
  
  A production shaped slog setup
&lt;/h2&gt;

&lt;p&gt;The following setup hits the usual goals.&lt;br&gt;
The examples use a small &lt;code&gt;logx&lt;/code&gt; package; for where packages like that usually live in a real module, see &lt;a href="https://www.glukhov.org/post/2025/12/go-project-structure/" rel="noopener noreferrer"&gt;Go Project Structure: Practices &amp;amp; Patterns&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one JSON event per line&lt;/li&gt;
&lt;li&gt;logs written to stdout for collection&lt;/li&gt;
&lt;li&gt;stable service metadata attached once&lt;/li&gt;
&lt;li&gt;context-aware logging for request and trace IDs&lt;/li&gt;
&lt;li&gt;central redaction for sensitive keys
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;logx&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"log/slog"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LevelVar&lt;/span&gt; &lt;span class="c"&gt;// defaults to INFO&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerOptions&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// can be changed at runtime&lt;/span&gt;
        &lt;span class="n"&gt;AddSource&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c"&gt;// include file and line when available&lt;/span&gt;
        &lt;span class="n"&gt;ReplaceAttr&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Attr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Attr&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// Centralised redaction: consistent and hard to bypass by accident.&lt;/span&gt;
            &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"api_key"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"[redacted]"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewJSONHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;With&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SERVICE_NAME"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ENV"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"VERSION"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;SetLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A tiny detail with large consequences: the built-in JSON handler uses standard&lt;br&gt;
keys (&lt;code&gt;time&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;, &lt;code&gt;msg&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;). When your log backend expects a different&lt;br&gt;
schema, &lt;code&gt;ReplaceAttr&lt;/code&gt; is the pressure-release valve that lets you normalise keys&lt;br&gt;
without rewriting call sites.&lt;/p&gt;
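<p>A sketch of that normalisation, renaming the built-in keys to hypothetical backend names (<code>message</code>, <code>timestamp</code>):<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "log/slog"
    "os"
)

func main() {
    opts := &amp;slog.HandlerOptions{
        ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
            // Built-in keys only appear at the top level (no groups).
            if len(groups) == 0 {
                switch a.Key {
                case slog.MessageKey: // "msg"
                    a.Key = "message"
                case slog.TimeKey: // "time"
                    a.Key = "timestamp"
                }
            }
            return a
        },
    }
    slog.New(slog.NewJSONHandler(os.Stdout, opts)).Info("renamed keys demo")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;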
&lt;h2&gt;
  
  
  Schema matters more than the logger
&lt;/h2&gt;

&lt;p&gt;Most "structured logging" failures are schema failures.&lt;/p&gt;
&lt;h3&gt;
  
  
  Essential fields that keep paying rent
&lt;/h3&gt;

&lt;p&gt;Every log backend will store a timestamp, level, and message. In practice, a&lt;br&gt;
useful application schema often adds a small set of stable fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service, env, version&lt;/li&gt;
&lt;li&gt;component (or subsystem)&lt;/li&gt;
&lt;li&gt;event (a stable name for the thing that happened)&lt;/li&gt;
&lt;li&gt;request_id (when a request exists)&lt;/li&gt;
&lt;li&gt;trace_id and span_id (when tracing exists)&lt;/li&gt;
&lt;li&gt;error (string) and error_kind (stable bucket)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the pattern: these fields answer operational questions, not developer&lt;br&gt;
curiosity.&lt;/p&gt;
&lt;h3&gt;
  
  
  Semantic conventions are a cheap consistency hack
&lt;/h3&gt;

&lt;p&gt;If you already use OpenTelemetry, its semantic conventions provide a standard&lt;br&gt;
vocabulary for attributes across telemetry signals. Even if you do not export&lt;br&gt;
logs via OpenTelemetry, borrowing attribute names reduces the "what did we call&lt;br&gt;
this field in service B" tax.&lt;/p&gt;
&lt;h3&gt;
  
  
  High cardinality and why logs get expensive
&lt;/h3&gt;

&lt;p&gt;High cardinality means "too many unique values". It is fine inside a JSON&lt;br&gt;
payload, but it becomes painful when a backend treats some fields as indexed&lt;br&gt;
labels or stream keys. User IDs, IP addresses, random request tokens, and full&lt;br&gt;
URLs tend to explode combinations.&lt;/p&gt;

&lt;p&gt;The practical outcome is simple: keep labels and index keys boring (service,&lt;br&gt;
environment, region), and keep high-cardinality fields inside the structured&lt;br&gt;
payload for filtering at query time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Correlation with request IDs and traces
&lt;/h2&gt;

&lt;p&gt;Correlation is the point where logs stop being just text and start behaving&lt;br&gt;
like telemetry.&lt;/p&gt;
&lt;h3&gt;
  
  
  Request ID as the lowest-friction correlation key
&lt;/h3&gt;

&lt;p&gt;A request ID is the simplest bridge between an incoming request and everything&lt;br&gt;
that happens because of it. It tends to work even without distributed tracing,&lt;br&gt;
and it is still useful when traces are sampled.&lt;/p&gt;

&lt;p&gt;A common pattern is to attach a per-request logger to the context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;logx&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"log/slog"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ctxKey&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;WithLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctxKey&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;FromContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctxKey&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trace correlation with W3C Trace Context and OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;W3C Trace Context defines a standard way to propagate trace identity (for HTTP,&lt;br&gt;
via traceparent and tracestate). OpenTelemetry builds on that so trace IDs and&lt;br&gt;
span IDs can be extracted from context.&lt;/p&gt;

&lt;p&gt;This middleware example logs both request_id and trace identifiers when&lt;br&gt;
available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;middleware&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"crypto/rand"&lt;/span&gt;
    &lt;span class="s"&gt;"encoding/hex"&lt;/span&gt;
    &lt;span class="s"&gt;"net/http"&lt;/span&gt;

    &lt;span class="s"&gt;"go.opentelemetry.io/otel/trace"&lt;/span&gt;
    &lt;span class="s"&gt;"log/slog"&lt;/span&gt;

    &lt;span class="s"&gt;"example.com/project/logx"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;requestID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;WithRequestLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Request-Id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;With&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanContextFromContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsValid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;With&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="s"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TraceID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="s"&gt;"span_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;logx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once correlation fields exist, the log line becomes an index into other data.&lt;br&gt;
The difference in a live incident is not subtle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning structured logs into monitoring and alerting signals
&lt;/h2&gt;

&lt;p&gt;Logs are great at answering "what happened". Alerting is usually about "how&lt;br&gt;
often and how bad".&lt;/p&gt;

&lt;p&gt;A practical approach is to treat certain log events as counters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event=payment_failed&lt;/li&gt;
&lt;li&gt;event=db_timeout&lt;/li&gt;
&lt;li&gt;event=cache_miss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many platforms can derive log-based metrics by counting matching records over a&lt;br&gt;
window. Structured logs make that count resilient, because it is based on a&lt;br&gt;
field value rather than a brittle text match.&lt;br&gt;
When you are ready to visualise and explore those signals, &lt;a href="https://www.glukhov.org/observability/grafana-installing-using-in-ubuntu/" rel="noopener noreferrer"&gt;Install and Use Grafana on Ubuntu: Complete Guide&lt;/a&gt; walks through a full Grafana setup you can point at common log and metrics backends.&lt;/p&gt;

&lt;p&gt;This is also where log levels start to matter. Debug logs are often valuable,&lt;br&gt;
but they are also where cost and noise hide. Using a dynamic level (LevelVar)&lt;br&gt;
lets the system stay quiet by default, while still allowing targeted detail&lt;br&gt;
when needed.&lt;/p&gt;
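
&lt;p&gt;A minimal sketch of that pattern with the standard &lt;code&gt;log/slog&lt;/code&gt; API: the handler reads its level through a &lt;code&gt;LevelVar&lt;/code&gt;, so verbosity can be raised at runtime (for example from an admin endpoint) without rebuilding the logger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log/slog"
    "os"
)

func main() {
    // Info by default; debug records are dropped cheaply.
    var lv slog.LevelVar
    lv.Set(slog.LevelInfo)

    logger := slog.New(slog.NewJSONHandler(os.Stderr, &amp;slog.HandlerOptions{Level: &amp;lv}))
    logger.Debug("not emitted")

    // Flip to debug at runtime, then back when done.
    lv.Set(slog.LevelDebug)
    logger.Debug("now emitted", "event", "debug_enabled")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;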

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Structured logging in Go is no longer a library debate. The interesting part is&lt;br&gt;
whether your log records are consistent, correlatable, and affordable to store.&lt;/p&gt;

&lt;p&gt;When your logs carry stable fields like event, request_id, and trace_id, they&lt;br&gt;
stop being "strings someone wrote" and start being a dataset you can operate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notes
&lt;/h2&gt;

&lt;p&gt;The Go team introduced log/slog in Go 1.21 and emphasised that structured logs&lt;br&gt;
use key-value pairs so they can be parsed, filtered, searched, and analysed&lt;br&gt;
reliably, and also noted the motivation of providing a common framework shared&lt;br&gt;
across the ecosystem.&lt;/p&gt;

&lt;p&gt;The log/slog package documentation defines the record model (time, level,&lt;br&gt;
message, key-value pairs) and the built-in handlers (TextHandler for key=value&lt;br&gt;
and JSONHandler for line-delimited JSON), and documents SetDefault integration&lt;br&gt;
with the classic log package.&lt;/p&gt;

&lt;p&gt;For distributed correlation, the W3C Trace Context specification standardises&lt;br&gt;
traceparent and tracestate propagation, and OpenTelemetry specifies that its&lt;br&gt;
SpanContext conforms to W3C Trace Context and exposes TraceId and SpanId,&lt;br&gt;
making log-trace correlation straightforward when a span is present.&lt;/p&gt;

&lt;p&gt;For log storage cost and performance, Grafana Loki documentation strongly&lt;br&gt;
recommends bounded, static labels and warns about high cardinality labels&lt;br&gt;
creating too many streams and a huge index, which is directly relevant when&lt;br&gt;
deciding what becomes a label vs what stays as an unindexed JSON field.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>dev</category>
    </item>
    <item>
      <title>Ollama in Docker Compose with GPU and Persistent Model Storage</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 27 Mar 2026 09:57:49 +0000</pubDate>
      <link>https://forem.com/rosgluk/ollama-in-docker-compose-with-gpu-and-persistent-model-storage-224h</link>
      <guid>https://forem.com/rosgluk/ollama-in-docker-compose-with-gpu-and-persistent-model-storage-224h</guid>
      <description>&lt;p&gt;Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or not.&lt;/p&gt;

&lt;p&gt;This post focuses on one goal: a reproducible local or single-node Ollama "server" using Docker Compose, with GPU acceleration and persistent model storage.&lt;/p&gt;

&lt;p&gt;It intentionally skips generic Docker and Compose basics. When you need a compact list of the commands you reach for most often (images, containers, volumes, &lt;code&gt;docker compose&lt;/code&gt;), the &lt;a href="https://www.glukhov.org/developer-tools/containers/docker-cheatsheet/" rel="noopener noreferrer"&gt;Docker Cheatsheet&lt;/a&gt; is a good companion.&lt;/p&gt;

&lt;p&gt;When you want HTTPS in front of Ollama, correct streaming and WebSocket proxying, and edge controls (auth, timeouts, rate limits), see &lt;a href="https://www.glukhov.org/llm-hosting/ollama/ollama-behind-reverse-proxy/" rel="noopener noreferrer"&gt;Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For how Ollama fits alongside vLLM, Docker Model Runner, LocalAI, and cloud hosting trade-offs, see&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM Hosting in 2026: Local, Self-Hosted &amp;amp; Cloud Infrastructure Compared&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  When Compose beats a bare metal install
&lt;/h2&gt;

&lt;p&gt;A native install is frictionless for one developer on one machine. The moment your setup grows beyond that, Compose starts to win on ergonomics.&lt;/p&gt;

&lt;p&gt;A team setup benefits because the service definition is a file you can review, version, and share. A single-node server benefits because upgrades turn into an image tag bump and a restart, while your model storage stays put (as long as it is on a volume). Ollama also tends to live next to sidecars: a Web UI, a reverse proxy, an auth gateway, a vector DB, or an agent runtime. Compose is good at "one command to start the whole stack", without turning your host into a snowflake.&lt;/p&gt;

&lt;p&gt;This approach aligns well with how the official Ollama container is designed: the image runs &lt;code&gt;ollama serve&lt;/code&gt; by default, exposes port 11434, and is meant to keep state under a mountable directory.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Compose skeleton that is actually useful for Ollama
&lt;/h2&gt;

&lt;p&gt;Start with two decisions:&lt;/p&gt;

&lt;p&gt;First, how you will pin versions. The Docker Hub image is &lt;code&gt;ollama/ollama&lt;/code&gt;, so you can pin a specific tag in &lt;code&gt;.env&lt;/code&gt; instead of relying on &lt;code&gt;latest&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Second, where the model data will live. The official docs mount a volume to &lt;code&gt;/root/.ollama&lt;/code&gt; so models are not re-downloaded each time the container is replaced.&lt;/p&gt;

&lt;p&gt;Here is a Compose file that bakes those decisions in, and keeps the "knobs" close to the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:${OLLAMA_IMAGE_TAG:-latest}&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

    &lt;span class="c1"&gt;# Keep it local by default, expose it later if you need to.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OLLAMA_BIND_IP:-127.0.0.1}:11434:11434"&lt;/span&gt;

    &lt;span class="c1"&gt;# Persistent models and server state.&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama:/root/.ollama&lt;/span&gt;

    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# The official image already defaults to 0.0.0.0:11434 inside the container,&lt;/span&gt;
      &lt;span class="c1"&gt;# but keeping it explicit helps when you override things later.&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_HOST=0.0.0.0:11434&lt;/span&gt;

      &lt;span class="c1"&gt;# Service tuning.&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_KEEP_ALIVE=${OLLAMA_KEEP_ALIVE:-5m}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_NUM_PARALLEL=${OLLAMA_NUM_PARALLEL:-1}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_MAX_LOADED_MODELS=${OLLAMA_MAX_LOADED_MODELS:-1}&lt;/span&gt;

      &lt;span class="c1"&gt;# Optional, but relevant when a browser-based UI talks to Ollama directly.&lt;/span&gt;
      &lt;span class="c1"&gt;# See the Networking section for why this exists.&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_ORIGINS=${OLLAMA_ORIGINS:-}&lt;/span&gt;

    &lt;span class="c1"&gt;# GPU reservation is a separate section below.&lt;/span&gt;
    &lt;span class="c1"&gt;# Add it only on hosts that actually have NVIDIA GPUs.&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A matching &lt;code&gt;.env&lt;/code&gt; keeps upgrades boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin the image version you have tested.&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_IMAGE_TAG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;latest

&lt;span class="c"&gt;# Local by default. Change to 0.0.0.0 when you intentionally expose it.&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_BIND_IP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;127.0.0.1

&lt;span class="c"&gt;# Keep-alive tweaks cold-start latency vs memory footprint.&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_KEEP_ALIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5m

&lt;span class="c"&gt;# Concurrency knobs.&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_NUM_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nv"&gt;OLLAMA_MAX_LOADED_MODELS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Leave empty unless you are serving browser clients that hit Ollama directly.&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_ORIGINS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A small but important nuance: Ollama itself has a default host bind of &lt;code&gt;127.0.0.1:11434&lt;/code&gt; in the general configuration, but the official container image sets &lt;code&gt;OLLAMA_HOST=0.0.0.0:11434&lt;/code&gt; so the service is reachable through published ports.&lt;/p&gt;

&lt;p&gt;If you want a quick sanity check without involving any client SDKs, the Ollama API includes a "list local models" endpoint at &lt;code&gt;GET /api/tags&lt;/code&gt;.&lt;/p&gt;
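
&lt;p&gt;For example, from the Docker host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Lists locally available models; even an empty "models" array
# proves the server is up and the port mapping works.
curl http://localhost:11434/api/tags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;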

&lt;h2&gt;
  
  
  Persistent model storage and the least painful way to move it
&lt;/h2&gt;

&lt;p&gt;If you only remember one thing, make it this: the container must have persistent storage, otherwise every rebuild is a re-download.&lt;/p&gt;

&lt;p&gt;Ollama lets you choose the models directory using &lt;code&gt;OLLAMA_MODELS&lt;/code&gt;. In the reference implementation, the default is &lt;code&gt;$HOME/.ollama/models&lt;/code&gt;, and setting &lt;code&gt;OLLAMA_MODELS&lt;/code&gt; overrides that.&lt;/p&gt;

&lt;p&gt;Inside the official Docker image, &lt;code&gt;$HOME&lt;/code&gt; maps naturally to the &lt;code&gt;/root&lt;/code&gt; layout used by the documented volume mount (&lt;code&gt;/root/.ollama&lt;/code&gt;), which is exactly why the official &lt;code&gt;docker run&lt;/code&gt; examples mount that directory.&lt;/p&gt;

&lt;p&gt;There are two storage patterns that tend to work well in practice:&lt;/p&gt;

&lt;p&gt;A named Docker volume is simplest and portable. It is also easy to accidentally orphan, so it is worth naming it intentionally (for example &lt;code&gt;ollama&lt;/code&gt;) and keeping it stable across Compose refactors.&lt;/p&gt;

&lt;p&gt;A bind mount to a dedicated disk is better when model sizes start to dominate your root filesystem. In that case, you either mount the whole &lt;code&gt;/root/.ollama&lt;/code&gt; to that disk, or you mount a custom directory and point &lt;code&gt;OLLAMA_MODELS&lt;/code&gt; at it.&lt;/p&gt;
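
&lt;p&gt;A sketch of the second pattern, assuming the dedicated disk is mounted on the host at &lt;code&gt;/data/ollama-models&lt;/code&gt; (the path is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  ollama:
    environment:
      # Point Ollama at the custom models directory inside the container.
      - OLLAMA_MODELS=/models
    volumes:
      # Bind-mount the dedicated disk into that directory.
      - /data/ollama-models:/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;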

&lt;p&gt;If you are actively reorganising storage, this is where an explicit "move models" playbook helps. See &lt;a href="https://www.glukhov.org/llm-hosting/ollama/move-ollama-models/" rel="noopener noreferrer"&gt;move-ollama-models&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA GPU support with Compose and the NVIDIA Container Toolkit
&lt;/h2&gt;

&lt;p&gt;Ollama can use NVIDIA GPUs in Docker, but the image cannot magic a GPU into existence. The host needs working NVIDIA drivers and the NVIDIA Container Toolkit, and Docker must be configured to use it. The Ollama Docker docs explicitly call out installing &lt;code&gt;nvidia-container-toolkit&lt;/code&gt;, configuring the runtime via &lt;code&gt;nvidia-ctk runtime configure --runtime=docker&lt;/code&gt;, and restarting Docker.&lt;/p&gt;
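
&lt;p&gt;On the host, that preparation boils down to three commands (the package install step is shown for apt-based distros after adding NVIDIA's repository; adjust for yours):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install the NVIDIA Container Toolkit.
sudo apt-get install -y nvidia-container-toolkit

# Configure the Docker runtime, then restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;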

&lt;p&gt;On the Compose side, the clean, modern way is device reservations. Docker documents GPU access in Compose using &lt;code&gt;deploy.resources.reservations.devices&lt;/code&gt;, with &lt;code&gt;capabilities: [gpu]&lt;/code&gt;, &lt;code&gt;driver: nvidia&lt;/code&gt;, and either &lt;code&gt;count&lt;/code&gt; (including &lt;code&gt;all&lt;/code&gt;) or &lt;code&gt;device_ids&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Add this to the &lt;code&gt;ollama&lt;/code&gt; service when you are on an NVIDIA host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia&lt;/span&gt;
          &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
          &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have multiple GPUs and want to keep Ollama on specific devices, switch from &lt;code&gt;count&lt;/code&gt; to &lt;code&gt;device_ids&lt;/code&gt; as documented by Docker (they are mutually exclusive).&lt;/p&gt;

&lt;p&gt;You will sometimes see legacy Compose examples that use &lt;code&gt;runtime: nvidia&lt;/code&gt;. That can fail on newer setups with errors like "unknown or invalid runtime name: nvidia", which is a strong hint that you should move to the supported device reservation pattern and make sure the toolkit is configured on the host.&lt;/p&gt;

&lt;p&gt;A useful detail hiding in plain sight: the official &lt;code&gt;ollama/ollama&lt;/code&gt; image sets &lt;code&gt;NVIDIA_VISIBLE_DEVICES=all&lt;/code&gt; and &lt;code&gt;NVIDIA_DRIVER_CAPABILITIES=compute,utility&lt;/code&gt;. These are standard knobs recognised by the NVIDIA container runtime, and they are already present unless you overwrite them.&lt;/p&gt;

&lt;p&gt;To confirm whether you are actually getting GPU inference (not just a container that starts), Ollama recommends using &lt;code&gt;ollama ps&lt;/code&gt; and checking the "Processor" column, which shows whether the model is on GPU memory.&lt;/p&gt;
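
&lt;p&gt;In practice that check looks like this (assuming the container is named &lt;code&gt;ollama&lt;/code&gt; as in the Compose file above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Confirm the container sees the GPU at all.
docker exec -it ollama nvidia-smi

# Confirm a loaded model actually sits in GPU memory
# (check the PROCESSOR column while a model is running).
docker exec -it ollama ollama ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;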

&lt;p&gt;Platform reality check: Ollama notes that GPU acceleration in Docker is available on Linux (and Windows with WSL2), and not available on Docker Desktop for macOS due to the lack of GPU passthrough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Networking choices: host vs bridge, ports, and CORS
&lt;/h2&gt;

&lt;p&gt;Networking is where most "it runs but my app cannot connect" bugs come from.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bridge networking with published ports
&lt;/h3&gt;

&lt;p&gt;The default Compose network is a bridge network. In this setup, publishing &lt;code&gt;11434:11434&lt;/code&gt; makes Ollama reachable from the host on port 11434, while other containers should talk to it using the service name &lt;code&gt;ollama&lt;/code&gt; (not &lt;code&gt;localhost&lt;/code&gt;). A lot of people trip on this because &lt;code&gt;localhost&lt;/code&gt; inside a container means "this container", not "the Ollama container".&lt;/p&gt;

&lt;p&gt;Ollama itself runs an HTTP server on port 11434 (the image exposes it), and the common convention is that clients use &lt;code&gt;http://localhost:11434&lt;/code&gt; on the host when ports are published.&lt;/p&gt;
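
&lt;p&gt;For example, a sidecar UI in the same Compose file reaches Ollama by service name; here is a sketch using Open WebUI's &lt;code&gt;OLLAMA_BASE_URL&lt;/code&gt; setting (image tag and host port are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "127.0.0.1:3000:8080"
    environment:
      # Service name, not localhost, on the shared bridge network.
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;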

&lt;h3&gt;
  
  
  Host networking
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;network_mode: host&lt;/code&gt; can be tempting on a single-node server because it removes port publishing and makes &lt;code&gt;localhost&lt;/code&gt; semantics simpler. The trade-off is you lose the isolation and namespacing benefits of a bridge network, and you are more likely to hit port conflicts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exposing Ollama intentionally
&lt;/h3&gt;

&lt;p&gt;Ollama on a normal install binds to &lt;code&gt;127.0.0.1&lt;/code&gt; by default, and the documented way to change the bind address is &lt;code&gt;OLLAMA_HOST&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Docker, you have two layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Ollama bind address, controlled by &lt;code&gt;OLLAMA_HOST&lt;/code&gt; (the container image defaults to binding on all interfaces inside the container).&lt;/li&gt;
&lt;li&gt;Reachability from outside the container, controlled by Compose &lt;code&gt;ports&lt;/code&gt; and the host firewall.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A pattern I like is "bind locally by default" via &lt;code&gt;127.0.0.1:11434:11434&lt;/code&gt;, then switch to &lt;code&gt;0.0.0.0:11434:11434&lt;/code&gt; only when I have a reason to expose it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser clients and OLLAMA_ORIGINS
&lt;/h3&gt;

&lt;p&gt;If a browser-based UI or extension calls Ollama directly, you are in CORS territory. Ollama allows cross-origin requests from &lt;code&gt;127.0.0.1&lt;/code&gt; and &lt;code&gt;0.0.0.0&lt;/code&gt; by default, and you can configure additional origins using &lt;code&gt;OLLAMA_ORIGINS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This matters even on a single node, because "it works with curl" does not mean "it works from a browser app".&lt;/p&gt;
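
&lt;p&gt;For example, to allow a browser app served from another origin (the domain is illustrative), set it in &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Comma-separate multiple origins if needed.
OLLAMA_ORIGINS=https://app.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;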

&lt;h2&gt;
  
  
  Upgrade and rollback patterns that fit a single-node server
&lt;/h2&gt;

&lt;p&gt;Ollama evolves quickly. Your Compose file can make that a calm process instead of a late-night surprise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upgrade by bumping a tag, not by hoping "latest" behaves
&lt;/h3&gt;

&lt;p&gt;The most practical upgrade strategy is to pin the image to a known-good tag in &lt;code&gt;.env&lt;/code&gt;, and bump it intentionally. The image is published as &lt;code&gt;ollama/ollama&lt;/code&gt; on Docker Hub.&lt;/p&gt;

&lt;p&gt;Because model data and server state are stored under a mounted directory (&lt;code&gt;/root/.ollama&lt;/code&gt; in the official docs), replacing the container does not imply re-downloading models.&lt;/p&gt;
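
&lt;p&gt;With the tag pinned in &lt;code&gt;.env&lt;/code&gt;, an upgrade is two steps (the version number is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Bump the pinned tag to the release you want to test.
sed -i 's/^OLLAMA_IMAGE_TAG=.*/OLLAMA_IMAGE_TAG=0.6.2/' .env

# 2. Pull and recreate; the model volume stays in place.
docker compose pull ollama
docker compose up -d ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;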

&lt;h3&gt;
  
  
  Rollback is just switching the tag back
&lt;/h3&gt;

&lt;p&gt;Rollback is the same mechanism in reverse: set the previous tag, recreate the container, keep the same volume. This is where pinning pays for itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data migration is mostly about storage paths
&lt;/h3&gt;

&lt;p&gt;Most "migrations" in a single-node setup are not about database schemas. They are about disk layout. If you change the models directory (via &lt;code&gt;OLLAMA_MODELS&lt;/code&gt;) or move the mounted volume to a new disk, you are doing a data migration whether you call it that or not.&lt;/p&gt;

&lt;p&gt;If you want a practical guide for reorganising the model directory on real machines, see &lt;a href="https://www.glukhov.org/llm-hosting/ollama/move-ollama-models/" rel="noopener noreferrer"&gt;move-ollama-models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A final note that is easy to miss: Ollama's API documentation explicitly says the API is expected to be stable and backwards compatible, with rare deprecations announced in release notes. That makes "upgrade the server, keep clients working" a reasonable default expectation for a single-node service endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common failures: GPU permissions, driver mismatch, and OOM
&lt;/h2&gt;

&lt;p&gt;This section is deliberately symptom-driven. The goal is not "every possible Docker error", only the failures that show up specifically in Ollama + GPU + persistent storage setups.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU visible on the host, missing in the container
&lt;/h3&gt;

&lt;p&gt;If the host has a working NVIDIA driver but the container does not see a GPU, the common causes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The NVIDIA Container Toolkit is not installed, or the Docker runtime is not configured via &lt;code&gt;nvidia-ctk&lt;/code&gt;. Ollama's Docker docs call this out directly.&lt;/li&gt;
&lt;li&gt;Compose is not reserving a GPU device. The supported way is &lt;code&gt;deploy.resources.reservations.devices&lt;/code&gt; with the &lt;code&gt;gpu&lt;/code&gt; capability as documented by Docker.&lt;/li&gt;
&lt;li&gt;A legacy &lt;code&gt;runtime: nvidia&lt;/code&gt; configuration is being used on a daemon that does not recognise it, producing "unknown or invalid runtime name: nvidia".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For validation, &lt;code&gt;ollama ps&lt;/code&gt; gives you a pragmatic check: it shows whether a model is loaded in GPU memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Permission denied on GPU devices
&lt;/h3&gt;

&lt;p&gt;The "permission denied" flavour of GPU failures typically points to environment constraints rather than Ollama itself. Examples include running rootless Docker, security policies, or device nodes not being exposed as expected. The Docker Compose GPU support docs are explicit that the host must have GPU devices and that the Docker daemon must be set accordingly.&lt;/p&gt;

&lt;p&gt;When in doubt, reduce the variables: confirm the toolkit configuration (host), then confirm GPU reservation (Compose), then confirm GPU usage (&lt;code&gt;ollama ps&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrong driver, wrong expectation
&lt;/h3&gt;

&lt;p&gt;Ollama in Docker relies on the host driver stack. If the host driver is missing, too old, or misconfigured, you will see failures that look like "Ollama is broken" but are really "CUDA stack is not usable". The official docs place the container toolkit and Docker daemon configuration as prerequisites for NVIDIA GPU usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Out of memory: VRAM or RAM disappears fast
&lt;/h3&gt;

&lt;p&gt;OOM is the most predictable failure mode for local inference, and it is usually self-inflicted by configuration.&lt;/p&gt;

&lt;p&gt;Ollama supports concurrent processing through multiple loaded models and parallel request handling, but it is constrained by available memory (system RAM on CPU inference, VRAM on GPU inference). When GPU inference is used, new models must fit in VRAM to allow concurrent model loads.&lt;/p&gt;

&lt;p&gt;Two configuration details are worth treating as first-class "server settings":&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OLLAMA_NUM_PARALLEL&lt;/code&gt; increases parallel request processing per model, but required memory scales with &lt;code&gt;OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OLLAMA_KEEP_ALIVE&lt;/code&gt; controls how long models remain loaded (default is 5 minutes). Keeping models loaded reduces cold-start latency, but it also pins memory.&lt;/p&gt;

&lt;p&gt;If you are stabilising a single-node service under load, the non-dramatic fixes usually look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower parallelism and context defaults before you change anything else.&lt;/li&gt;
&lt;li&gt;Limit how many models are allowed to remain loaded concurrently.&lt;/li&gt;
&lt;li&gt;Consider memory-reduction features like Flash Attention (&lt;code&gt;OLLAMA_FLASH_ATTENTION=1&lt;/code&gt;) and lower-precision K/V cache types (&lt;code&gt;OLLAMA_KV_CACHE_TYPE&lt;/code&gt;) when your bottleneck is memory, not raw compute.&lt;/li&gt;
&lt;/ul&gt;
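
&lt;p&gt;A sketch of those knobs in the service's &lt;code&gt;environment&lt;/code&gt; block (values are illustrative starting points, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;environment:
  # Concurrency: memory scales with parallelism times context length.
  - OLLAMA_NUM_PARALLEL=1
  - OLLAMA_MAX_LOADED_MODELS=1
  - OLLAMA_CONTEXT_LENGTH=8192
  # Memory-reduction features.
  - OLLAMA_FLASH_ATTENTION=1
  - OLLAMA_KV_CACHE_TYPE=q8_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;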

&lt;h3&gt;
  
  
  When it is not Ollama: choosing Docker Model Runner instead
&lt;/h3&gt;

&lt;p&gt;Sometimes the "failure" is really a tooling mismatch. If your organisation already standardises on Docker-native artifacts and workflows, Docker Model Runner (DMR) can be a better fit than running Ollama as a long-lived service container.&lt;/p&gt;

&lt;p&gt;Docker positions DMR as a way to manage, run, and serve models directly via Docker, pulling from Docker Hub or other OCI registries, and serving OpenAI-compatible and Ollama-compatible APIs.&lt;/p&gt;

&lt;p&gt;It also supports multiple inference engines (including llama.cpp, and vLLM on Linux with NVIDIA GPUs), which can matter if you care about throughput characteristics, not just "run one model locally".&lt;/p&gt;

&lt;p&gt;If you want a practical command reference and a deeper comparison angle, see: &lt;a href="https://www.glukhov.org/llm-hosting/docker-model-runner/docker-model-runner-cheatsheet/" rel="noopener noreferrer"&gt;Docker Model Runner Cheatsheet: Commands &amp;amp; Examples&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ollama</category>
      <category>devops</category>
    </item>
    <item>
      <title>Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 27 Mar 2026 09:57:32 +0000</pubDate>
      <link>https://forem.com/rosgluk/ollama-behind-a-reverse-proxy-with-caddy-or-nginx-for-https-streaming-20b5</link>
      <guid>https://forem.com/rosgluk/ollama-behind-a-reverse-proxy-with-caddy-or-nginx-for-https-streaming-20b5</guid>
      <description>&lt;p&gt;Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.&lt;/p&gt;

&lt;p&gt;This post focuses on Caddy and Nginx ingress for the Ollama API, not on client code.&lt;/p&gt;

&lt;p&gt;If you already have Python or Go clients talking to Ollama, this post is the missing piece: ingress and transport for the same API.&lt;/p&gt;

&lt;p&gt;For how Ollama fits alongside vLLM, Docker Model Runner, LocalAI, and cloud hosting trade-offs, see&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM Hosting in 2026: Local, Self-Hosted &amp;amp; Cloud Infrastructure Compared&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For request examples and client code, see&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/ollama/ollama-cheatsheet/" rel="noopener noreferrer"&gt;Ollama CLI Cheatsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For UI and multi-user layers, see&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/llm-frontends/open-webui-overview-quickstart-and-alternatives/" rel="noopener noreferrer"&gt;Open WebUI overview, quickstart and alternatives&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the bigger picture on self-hosting and data control, see&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/self-hosting/llm-selfhosting-and-ai-sovereignty/" rel="noopener noreferrer"&gt;LLM self-hosting and AI sovereignty&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a reproducible single-node Ollama service in Docker Compose (persistent volumes, &lt;code&gt;OLLAMA_HOST&lt;/code&gt;, NVIDIA GPUs, upgrades), see&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/ollama/ollama-in-docker-compose/" rel="noopener noreferrer"&gt;Ollama in Docker Compose with GPU and Persistent Model Storage&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why you should proxy Ollama instead of exposing port 11434
&lt;/h2&gt;

&lt;p&gt;Ollama is designed to run locally first. Out of the box it binds to localhost on port 11434, which is great for a developer workstation and a not-so-subtle hint that the raw port is not meant to be internet-facing.&lt;/p&gt;

&lt;p&gt;I treat port 11434 as an internal, high-cost API. If it is reachable from the public internet, anyone who finds it can burn your CPU or GPU time, fill your disk by pulling models, or just keep connections open until something times out. A reverse proxy does not make Ollama safer by magic, but it gives you a place to put the controls that matter at the edge: TLS, authentication, timeouts, rate limits, and logs.&lt;/p&gt;

&lt;p&gt;This matters because the local Ollama API does not ship with a built-in authentication layer. If you expose it, you typically add auth at the edge or keep it private and reachable only over a trusted network.&lt;/p&gt;

&lt;p&gt;The second reason is UX. Ollama streams responses by default. If the proxy buffers or compresses in the wrong spot, streaming feels broken and UIs look like they are "thinking" with no output.&lt;/p&gt;
&lt;h2&gt;
  
  
  Minimal architecture and binding strategy
&lt;/h2&gt;

&lt;p&gt;A clean minimum looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client (curl, Python, Go, UI)
        |
        | HTTPS (optional Basic Auth or SSO)
        v
Reverse proxy (Caddy or Nginx)
        |
        | HTTP (private LAN, localhost, or Docker network)
        v
Ollama server (ollama serve on 127.0.0.1:11434)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two practical rules keep this boring in the best way.&lt;/p&gt;

&lt;p&gt;First, keep Ollama private and move exposure to the proxy. If Caddy or Nginx runs on the same host, proxy to 127.0.0.1:11434 and do not change Ollama's bind address. If the proxy runs elsewhere (separate host, separate VM, or a container network), bind Ollama to a private interface, not 0.0.0.0 on the public NIC, and lean on a firewall.&lt;/p&gt;

&lt;p&gt;Second, decide early whether browsers will call Ollama directly. If a browser-based tool hits Ollama from a different origin, you may need to deal with CORS. If everything is served from one domain via the proxy (recommended for sanity), you can often avoid CORS entirely and keep Ollama strict.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse proxy configs for streaming and WebSockets
&lt;/h2&gt;

&lt;p&gt;Ollama's API is regular HTTP, and its streaming is newline-delimited JSON (NDJSON). That means you want a proxy that can do three things well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not buffer streaming responses.&lt;/li&gt;
&lt;li&gt;Do not kill long-running requests just because the model took a while to speak.&lt;/li&gt;
&lt;li&gt;If a UI uses WebSockets (some do), forward Upgrade cleanly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can keep this simple. In many cases, "correct WebSockets handling" is just having a config that is Upgrade-safe even if the upstream does not use WebSockets today.&lt;/p&gt;
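&lt;p&gt;To make the NDJSON framing concrete, here is a minimal Python sketch of what a streaming consumer does: buffer bytes, split on newlines, and decode each complete line as one JSON object. The chunk contents below are illustrative; a real client would read them from the HTTP response body.&lt;/p&gt;

```python
import json

def iter_ndjson(chunks):
    """Yield one parsed JSON object per complete NDJSON line.

    `chunks` is any iterable of bytes, e.g. a streaming HTTP body.
    Chunk boundaries need not align with line boundaries.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final object without a trailing newline
        yield json.loads(buffer)

# Simulated stream: chunk boundaries fall mid-object, as on the wire.
chunks = [b'{"response": "Hel', b'lo"}\n{"response": " world"', b'}\n{"done": true}\n']
events = list(iter_ndjson(chunks))
```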

&lt;h3&gt;
  
  
  Caddy Caddyfile example
&lt;/h3&gt;

&lt;p&gt;Caddy is the "less config, more defaults" option. If you put a public domain name in the site address, Caddy will typically obtain and renew certificates automatically.&lt;/p&gt;

&lt;p&gt;Minimal reverse proxy, HTTPS, and streaming-friendly settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ollama.example.com A/AAAA -&amp;gt; your proxy host
ollama.example.com {

    # Optional Basic Auth at the edge.
    # Generate a password hash with:
    #   caddy hash-password --algorithm bcrypt
    #
    # basic_auth {
    #   alice $2a$12$REDACTED...
    # }

    reverse_proxy 127.0.0.1:11434 {

        # Some setups prefer pinning the upstream Host header.
        # Ollama's own docs show this pattern for Nginx.
        header_up Host localhost:11434

        # For streaming or chat-like workloads, prefer low latency.
        # NDJSON streaming usually flushes immediately anyway, but this makes it explicit.
        flush_interval -1

        transport http {
            # Avoid upstream gzip negotiation if it interferes with streaming.
            compression off

            # Give Ollama time to load a model and produce the first chunk.
            response_header_timeout 10m
            dial_timeout 10s
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you already have an SSO gateway (oauth2-proxy, Authelia, authentik outpost, etc.), Caddy has an opinionated forward auth directive. The pattern is "auth first, then proxy":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama.example.com {
    forward_auth 127.0.0.1:4180 {
        uri /oauth2/auth
        # Copy the identity headers your gateway returns, if you need them.
        copy_headers X-Auth-Request-User X-Auth-Request-Email Authorization
    }

    reverse_proxy 127.0.0.1:11434
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Nginx server block example
&lt;/h3&gt;

&lt;p&gt;Nginx gives you a bit more rope. The upside is that the knobs are explicit, and it has built-in primitives for rate limiting and connection limiting. The footgun is buffering: Nginx buffers proxied responses by default, which is the opposite of what you want for NDJSON streaming.&lt;/p&gt;

&lt;p&gt;This example includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP to HTTPS redirect&lt;/li&gt;
&lt;li&gt;TLS certificate paths (Certbot style)&lt;/li&gt;
&lt;li&gt;WebSocket-safe Upgrade forwarding&lt;/li&gt;
&lt;li&gt;Streaming-friendly proxy_buffering off&lt;/li&gt;
&lt;li&gt;Longer timeouts than the default 60s
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/nginx/conf.d/ollama.conf&lt;/span&gt;

&lt;span class="c1"&gt;# WebSocket-safe Connection header handling&lt;/span&gt;
&lt;span class="k"&gt;map&lt;/span&gt; &lt;span class="nv"&gt;$http_upgrade&lt;/span&gt; &lt;span class="nv"&gt;$connection_upgrade&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;default&lt;/span&gt; &lt;span class="s"&gt;upgrade&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;""&lt;/span&gt;      &lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Optional request rate limiting (IP-based)&lt;/span&gt;
&lt;span class="c1"&gt;# limit_req_zone $binary_remote_addr zone=ollama_rate:10m rate=10r/s;&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;ollama.example.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;301&lt;/span&gt; &lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="nv"&gt;$host$request_uri&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="s"&gt;ssl&lt;/span&gt; &lt;span class="s"&gt;http2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;ollama.example.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;ssl_certificate&lt;/span&gt;     &lt;span class="n"&gt;/etc/letsencrypt/live/ollama.example.com/fullchain.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_certificate_key&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/ollama.example.com/privkey.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Optional Basic Auth at the edge.&lt;/span&gt;
    &lt;span class="c1"&gt;# auth_basic "Ollama";&lt;/span&gt;
    &lt;span class="c1"&gt;# auth_basic_user_file /etc/nginx/.htpasswd;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Optional rate limit&lt;/span&gt;
        &lt;span class="c1"&gt;# limit_req zone=ollama_rate burst=20 nodelay;&lt;/span&gt;

        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://127.0.0.1:11434&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# Match Ollama docs pattern when proxying to localhost.&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nf"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;11434&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# WebSocket Upgrade handling (harmless if unused).&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_http_version&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Upgrade&lt;/span&gt; &lt;span class="nv"&gt;$http_upgrade&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Connection&lt;/span&gt; &lt;span class="nv"&gt;$connection_upgrade&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# Critical for NDJSON streaming.&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_buffering&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# Prevent 60s idle timeouts while waiting for tokens.&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;3600s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_send_timeout&lt;/span&gt; &lt;span class="s"&gt;3600s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want an SSO-style gate in Nginx, the equivalent pattern is auth_request. Nginx sends a subrequest to your auth service, and only proxies to Ollama when auth returns 2xx.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLS automation and renewal gotchas
&lt;/h2&gt;

&lt;p&gt;For TLS, the operational split is simple.&lt;/p&gt;

&lt;p&gt;With Caddy, TLS is usually "part of the reverse proxy". Automatic HTTPS is one of its flagship features, so certificate issuance and renewal are coupled to keeping Caddy running, having working DNS, and exposing ports 80 and 443.&lt;/p&gt;

&lt;p&gt;With Nginx, TLS is usually "a separate ACME client plus Nginx". The common failure mode is not crypto, it is plumbing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Port 80 not reachable for HTTP-01 challenges.&lt;/li&gt;
&lt;li&gt;Certificates stored in a container but not persisted.&lt;/li&gt;
&lt;li&gt;Rate limits when doing repeated fresh installs or test deploys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A subtle point that matters for long-lived services is that certificate lifetimes are short by design. Treat renewals as a background automation requirement, not an annual calendar event.&lt;/p&gt;
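&lt;p&gt;Because lifetimes are short, it is worth monitoring expiry independently of the renewal job. A minimal Python sketch of just the date arithmetic (the &lt;code&gt;notAfter&lt;/code&gt; string uses the format Python's &lt;code&gt;ssl.getpeercert()&lt;/code&gt; returns; fetching the live certificate is left out, and the 20-day threshold is an illustrative choice):&lt;/p&gt;

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days remaining before a certificate expires.

    `not_after` uses the format ssl.getpeercert() returns,
    e.g. "Jun  1 12:00:00 2026 GMT".
    """
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).days

now = datetime(2026, 5, 2, 12, 0, 0, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun  1 12:00:00 2026 GMT", now)
# Alert well before the deadline, e.g. at 20 days out.
needs_attention = remaining < 20
```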

&lt;h2&gt;
  
  
  Authentication, abuse control, and verification
&lt;/h2&gt;

&lt;p&gt;This is the part that makes an internet-facing LLM endpoint feel professional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication options, from blunt to elegant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Basic Auth at the proxy is blunt, but surprisingly effective for a private endpoint. It is also easy to apply to both HTTP requests and WebSocket upgrades.&lt;/p&gt;

&lt;p&gt;If you want browser-friendly login flows, forward auth and auth_request are the common pattern. Your proxy stays stateless, and an auth gateway owns sessions and MFA. The trade-off is more moving parts.&lt;/p&gt;

&lt;p&gt;If you are already running Open WebUI, you can also rely on its app-level authentication and keep Ollama itself private. The proxy then protects Open WebUI, not Ollama directly.&lt;/p&gt;

&lt;p&gt;If you do not need public access at all, a network-only approach can be cleaner. For example, Tailscale Serve can expose a local service inside your tailnet without opening inbound ports on your router.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abuse basics for an expensive API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ollama is a powerful local API, and its surface goes beyond generation. It has endpoints for chat, embeddings, listing models, and version checks. Treat the whole API as sensitive.&lt;/p&gt;

&lt;p&gt;Official API reference (endpoints and streaming): &lt;a href="https://docs.ollama.com/api" rel="noopener noreferrer"&gt;https://docs.ollama.com/api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the proxy layer, there are three low-effort controls that reduce day-one pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate limiting per IP on generation endpoints.&lt;/li&gt;
&lt;li&gt;Connection limits to stop a small number of clients holding everything open.&lt;/li&gt;
&lt;li&gt;Conservative timeouts that match your model and hardware reality, not generic web defaults.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the Ollama layer, the server can also reject overload with 503, and it has server-side knobs for queueing. Proxy rate limiting keeps you from hitting that ceiling as often.&lt;/p&gt;
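&lt;p&gt;Even with proxy limits in place, clients should treat 503 as retryable back-pressure rather than a hard failure. A minimal exponential backoff sketch (the base and cap are illustrative choices, not Ollama defaults):&lt;/p&gt;

```python
def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Exponential backoff schedule for retrying 503 responses.

    The delay doubles each attempt and is capped so long outages
    do not produce absurd waits. Real clients usually add jitter.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

delays = backoff_delays(7)
# 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, then capped at 30.0
```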

&lt;p&gt;&lt;strong&gt;Verification checklist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the same checks you would use for any streaming API.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Basic connectivity and TLS&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;curl -sS https://ollama.example.com/api/version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;curl -sS https://ollama.example.com/api/tags | head&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Streaming works end to end (no buffering)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;curl -N https://ollama.example.com/api/generate -H "Content-Type: application/json" -d '{"model":"mistral","prompt":"Write 10 words only.","stream":true}'&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are behind Basic Auth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;curl -N -u alice:REDACTED https://ollama.example.com/api/generate -H "Content-Type: application/json" -d '{"model":"mistral","prompt":"Write 10 words only.","stream":true}'&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Browser UI sanity

&lt;ul&gt;
&lt;li&gt;Load your chat UI and trigger a response.&lt;/li&gt;
&lt;li&gt;If the UI uses WebSockets, confirm you do not see 400 or 426 errors and the connection stays open during generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the curl output only appears at the end, it is almost always buffering at the proxy. Re-check &lt;code&gt;proxy_buffering off&lt;/code&gt; in Nginx, and confirm low-latency flushing (&lt;code&gt;flush_interval -1&lt;/code&gt;) in the Caddy site block for Ollama.&lt;/p&gt;
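&lt;p&gt;A quick way to tell buffering from slow generation is to look at inter-chunk arrival times: a buffered stream arrives as one burst at the end, while a live stream trickles. A small sketch of that timing heuristic (the timestamps would come from wrapping your HTTP read loop; the 90% thresholds are illustrative):&lt;/p&gt;

```python
def looks_buffered(arrival_times, min_chunks=3):
    """Heuristic: if nearly all chunks arrive in the final moments of
    the request, something upstream was buffering.

    `arrival_times` are seconds since the request started,
    one entry per received chunk.
    """
    if len(arrival_times) < min_chunks:
        return False  # too few chunks to judge
    duration = arrival_times[-1]
    # Fraction of chunks that arrived in the final 10% of the request.
    late = sum(1 for t in arrival_times if t > 0.9 * duration)
    return late / len(arrival_times) > 0.9

streaming = looks_buffered([0.2, 0.5, 0.9, 1.3, 1.8, 2.2])      # spread out
buffered = looks_buffered([5.0, 5.01, 5.02, 5.03, 5.04, 5.05])  # one burst
```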

</description>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ollama</category>
      <category>devops</category>
    </item>
    <item>
      <title>Neo4j graph database for GraphRAG, install, Cypher, vectors, ops</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 24 Mar 2026 10:59:46 +0000</pubDate>
      <link>https://forem.com/rosgluk/neo4j-graph-database-for-graphrag-install-cypher-vectors-ops-mlp</link>
      <guid>https://forem.com/rosgluk/neo4j-graph-database-for-graphrag-install-cypher-vectors-ops-mlp</guid>
      <description>&lt;p&gt;Neo4j is what you reach for when the relationships &lt;em&gt;are&lt;/em&gt; the data. If your domain looks like a whiteboard of circles and arrows, forcing it into tables is painful.&lt;/p&gt;

&lt;p&gt;Neo4j models that picture as a &lt;strong&gt;property graph&lt;/strong&gt; and queries it with &lt;strong&gt;Cypher&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This guide covers &lt;strong&gt;what Neo4j is used for&lt;/strong&gt;, &lt;strong&gt;ACID behaviour&lt;/strong&gt;, &lt;strong&gt;Neo4j vs Amazon Neptune vs TigerGraph&lt;/strong&gt; (and peers), &lt;strong&gt;GraphRAG&lt;/strong&gt; with &lt;strong&gt;vector indexes&lt;/strong&gt;, &lt;strong&gt;local and production install&lt;/strong&gt; paths, &lt;strong&gt;ports and neo4j.conf&lt;/strong&gt;, and copy-paste &lt;strong&gt;Cypher and Python&lt;/strong&gt; patterns.&lt;/p&gt;

&lt;p&gt;For broader context on data infrastructure choices, see the &lt;a href="https://www.glukhov.org/data-infrastructure/" rel="noopener noreferrer"&gt;Data Infrastructure for AI Systems&lt;/a&gt; pillar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Neo4j used for in production graph workloads
&lt;/h2&gt;

&lt;p&gt;Neo4j is for connected data where you need to ask connected questions, repeatedly, under production constraints. That is the direct answer to &lt;strong&gt;what is Neo4j used for&lt;/strong&gt; in most teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Property graph data model with nodes, relationships, and properties
&lt;/h3&gt;

&lt;p&gt;Neo4j uses the property graph model: nodes represent entities, relationships connect nodes, and both can have properties. Labels and relationship types give structure without locking you into a brittle schema.&lt;/p&gt;

&lt;p&gt;You can start with a thin model, ship value, and evolve the graph as new questions appear.&lt;/p&gt;
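&lt;p&gt;As a mental model only (this illustrates the property graph shape, not Neo4j's actual storage), nodes and relationships are records that each carry labels or a type plus a property map. A tiny Python sketch with a one-hop traversal over it:&lt;/p&gt;

```python
# Hypothetical in-memory property graph: nodes and relationships
# carrying property maps. Illustrates the model, not Neo4j internals.
nodes = {
    1: {"labels": ["Person"], "props": {"name": "Ada"}},
    2: {"labels": ["Person"], "props": {"name": "Grace"}},
    3: {"labels": ["Company"], "props": {"name": "Initech"}},
}
relationships = [
    {"type": "WORKS_AT", "from": 1, "to": 3, "props": {"since": 2021}},
    {"type": "WORKS_AT", "from": 2, "to": 3, "props": {"since": 2019}},
]

# One-hop traversal: "who works at node 3", the kind of question Cypher
# would express as a (p:Person)-[:WORKS_AT]->(c:Company) pattern.
employees = [nodes[r["from"]]["props"]["name"]
             for r in relationships
             if r["type"] == "WORKS_AT" and r["to"] == 3]
```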

&lt;h3&gt;
  
  
  Cypher graph query language for pattern matching without join soup
&lt;/h3&gt;

&lt;p&gt;Cypher is declarative and built around pattern matching. You describe subgraph shapes and let the planner execute them.&lt;/p&gt;

&lt;p&gt;If SQL is about sets, Cypher is about subgraphs. That matters for multi-hop traversal, path queries, recommendations, provenance, and “who touched what via which system” questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Neo4j ACID compliant and why you should care
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Is Neo4j ACID compliant?&lt;/strong&gt; Yes. Neo4j provides full ACID transactions: creating or updating nodes and relationships commits atomically, and the database keeps that structure consistent under failure and concurrency.&lt;/p&gt;

&lt;p&gt;Design graph apps around strong transactional guarantees unless you are forced otherwise. That makes debugging and reasoning about behaviour much easier than assuming vague eventual consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neo4j vs Amazon Neptune vs TigerGraph: a senior engineer comparison
&lt;/h2&gt;

&lt;p&gt;A “Neo4j vs X” question is usually “Which ecosystem will we live in for years?”&lt;/p&gt;

&lt;p&gt;Here is a short, opinionated view, focused on engineering time rather than benchmark slides.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Core model and query style&lt;/th&gt;
&lt;th&gt;Where it wins&lt;/th&gt;
&lt;th&gt;Where it bites&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Neo4j&lt;/td&gt;
&lt;td&gt;Property graph and Cypher&lt;/td&gt;
&lt;td&gt;Strong ergonomics for connected data, mature tooling, graph plus vector retrieval&lt;/td&gt;
&lt;td&gt;Graph modelling is a skill you must invest in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Neptune&lt;/td&gt;
&lt;td&gt;Managed graph on AWS (Gremlin, openCypher, SPARQL for RDF)&lt;/td&gt;
&lt;td&gt;AWS-centric contracts and operations&lt;/td&gt;
&lt;td&gt;Query language mix can feel platform-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TigerGraph&lt;/td&gt;
&lt;td&gt;GSQL and OpenCypher-related patterns&lt;/td&gt;
&lt;td&gt;Analytics-style workloads and compiled query approaches&lt;/td&gt;
&lt;td&gt;Different mental model; not drop-in Cypher everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JanusGraph&lt;/td&gt;
&lt;td&gt;Distributed graph with external storage backends&lt;/td&gt;
&lt;td&gt;Open source with pluggable backends&lt;/td&gt;
&lt;td&gt;You operate the backend stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArangoDB&lt;/td&gt;
&lt;td&gt;Multi-model (documents, KV, graph)&lt;/td&gt;
&lt;td&gt;One database for mixed shapes&lt;/td&gt;
&lt;td&gt;Graph depth varies versus graph-first engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memgraph&lt;/td&gt;
&lt;td&gt;Property graph, Cypher compatible&lt;/td&gt;
&lt;td&gt;Streaming and fresh-data workflows&lt;/td&gt;
&lt;td&gt;Engine behaviour differs; compatibility is not identity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What to decide before you pick a graph database
&lt;/h3&gt;

&lt;p&gt;Pick &lt;strong&gt;query language&lt;/strong&gt; and &lt;strong&gt;operations model&lt;/strong&gt; first.&lt;/p&gt;

&lt;p&gt;If your team wants Cypher and a graph-first workflow, Neo4j is a strong default. If you already have Gremlin expertise, Neptune or JanusGraph can fit. If you want one multi-model store, ArangoDB can reduce moving parts.&lt;/p&gt;

&lt;p&gt;Be honest about operations. “We will run a distributed storage backend” is easy to say until you are paged about compactions or JVM pressure at 03:00.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neo4j for RAG and GraphRAG: vector search plus graph context
&lt;/h2&gt;

&lt;p&gt;Many RAG stacks start as &lt;strong&gt;vector search plus prompt&lt;/strong&gt;. That works until you need provenance, entity resolution, multi-hop context, or disambiguation—then you risk rebuilding a knowledge graph &lt;strong&gt;in application code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does GraphRAG improve retrieval augmented generation?&lt;/strong&gt; It uses the graph to pull structured context—entities, relationships, neighbourhoods—that similarity alone often misses, which helps &lt;strong&gt;grounding&lt;/strong&gt; and &lt;strong&gt;trustworthiness&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Neo4j vector index for embedding similarity search
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Can Neo4j do vector search for RAG?&lt;/strong&gt; Yes. Neo4j supports &lt;strong&gt;vector indexes&lt;/strong&gt; for similarity over embeddings (commonly HNSW-style approximate nearest neighbour search).&lt;/p&gt;

&lt;p&gt;Vectors find “things that look similar”. They do not by themselves encode “how they relate” in your domain. Neo4j lets you combine similarity with traversals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the SEARCH subclause for vector-constrained pattern matching
&lt;/h3&gt;

&lt;p&gt;Neo4j’s &lt;strong&gt;SEARCH&lt;/strong&gt; subclause lets you constrain a Cypher &lt;code&gt;MATCH&lt;/code&gt; pattern using approximate nearest neighbour hits from a vector index. That is the ergonomic bridge for &lt;strong&gt;hybrid retrieval&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Practical pattern: vector retrieval for candidates, then graph expansion for context, filters, and explanation.&lt;/p&gt;
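&lt;p&gt;That two-phase shape is easy to sketch independently of the database: similarity search produces ranked candidates, and the graph step attaches related context to each. Everything below is an illustrative in-memory stand-in for the vector index and graph, not Neo4j API calls:&lt;/p&gt;

```python
# Stand-in data: precomputed similarity scores and an adjacency map,
# playing the roles of the vector index and the graph.
vector_hits = [("doc-7", 0.91), ("doc-2", 0.84), ("doc-9", 0.80)]
mentions = {
    "doc-7": ["Neo4j", "Cypher"],
    "doc-2": ["GraphRAG"],
    "doc-9": [],
}

def expand(hits, graph, top_k=2):
    """Phase 1: keep the top-k vector candidates.
    Phase 2: expand each with its graph neighbourhood."""
    out = []
    for doc_id, score in sorted(hits, key=lambda h: -h[1])[:top_k]:
        out.append({"doc_id": doc_id, "score": score,
                    "entities": graph.get(doc_id, [])})
    return out

context = expand(vector_hits, mentions)
```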

&lt;h3&gt;
  
  
  GraphRAG in Python with neo4j-graphrag
&lt;/h3&gt;

&lt;p&gt;Neo4j’s &lt;strong&gt;neo4j-graphrag&lt;/strong&gt; package for Python wires a driver, retriever, and LLM interface into a GraphRAG flow. You can still use external vector stores if you want to split responsibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to install Neo4j locally and in production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do you install Neo4j locally?&lt;/strong&gt; Match the option to your risk profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Neo4j with Docker for local development
&lt;/h3&gt;

&lt;p&gt;Docker is the fastest path to a repeatable server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Minimal run. Data is NOT persisted between restarts.&lt;/span&gt;
docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--publish&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7474:7474 &lt;span class="nt"&gt;--publish&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7687:7687 &lt;span class="se"&gt;\&lt;/span&gt;
  neo4j:5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For real work, set an initial password and mount a data volume.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--publish&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7474:7474 &lt;span class="nt"&gt;--publish&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7687:7687 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="nv"&gt;NEO4J_AUTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;neo4j/your_password &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;/neo4j/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  neo4j:5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose for a team-friendly setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;neo4j&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;neo4j:5&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7474:7474"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7687:7687"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NEO4J_AUTH=neo4j/your_password&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$HOME/neo4j/logs:/logs&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$HOME/neo4j/config:/config&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$HOME/neo4j/data:/data&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$HOME/neo4j/plugins:/plugins&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Neo4j Desktop
&lt;/h3&gt;

&lt;p&gt;Neo4j Desktop is strong for prototyping and teaching—projects, GUI, local instances. For CI and integration tests, Docker usually wins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux, Windows, or macOS servers
&lt;/h3&gt;

&lt;p&gt;For long-running hosts, follow official OS install paths. You will eventually care about service management, logs, memory, backups, and upgrades.&lt;/p&gt;

&lt;h3&gt;
  
  
  Neo4j AuraDB (managed)
&lt;/h3&gt;

&lt;p&gt;If you prefer shipping product to running databases, &lt;strong&gt;AuraDB&lt;/strong&gt; is the managed Neo4j cloud option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes with Helm
&lt;/h3&gt;

&lt;p&gt;If the platform is Kubernetes, use the Helm-based deployment and expose Bolt and HTTP through services. Only deploy databases on K8s if your organisation can run state reliably there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neo4j configuration essentials: ports, connectors, and neo4j.conf
&lt;/h2&gt;

&lt;p&gt;Settings live in &lt;strong&gt;neo4j.conf&lt;/strong&gt; (key=value, &lt;code&gt;#&lt;/code&gt; comments). Strict validation helps catch typos before you serve traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default Neo4j ports and connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What are the default Neo4j ports?&lt;/strong&gt; &lt;strong&gt;Bolt&lt;/strong&gt; 7687, &lt;strong&gt;HTTP&lt;/strong&gt; 7474, &lt;strong&gt;HTTPS&lt;/strong&gt; 7473 by default. In production, expose only what you need; often Bolt on a private network and HTTP UI restricted.&lt;/p&gt;

&lt;p&gt;Example hardening (adapt IPs and TLS to your environment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;server.bolt.listen_address&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10.0.1.10:7687&lt;/span&gt;
&lt;span class="py"&gt;server.http.listen_address&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:7474&lt;/span&gt;
&lt;span class="py"&gt;server.https.enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;server.https.listen_address&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10.0.1.10:7473&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Transaction settings that limit unbounded damage
&lt;/h3&gt;

&lt;p&gt;Two useful levers: &lt;strong&gt;db.transaction.timeout&lt;/strong&gt; caps runaway queries, and &lt;strong&gt;db.transaction.concurrent.maximum&lt;/strong&gt; limits thundering herds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;db.transaction.timeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10s&lt;/span&gt;
&lt;span class="py"&gt;db.transaction.concurrent.maximum&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Cypher and vector index examples for RAG
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create a vector index and store embeddings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VECTOR&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;doc_embeddings&lt;/span&gt;
&lt;span class="n"&gt;FOR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;d:&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d.embedding&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;OPTIONS&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;indexConfig:&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;
  &lt;span class="sb"&gt;`vector.dimensions`&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="sb"&gt;`vector.similarity_function`&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"cosine"&lt;/span&gt;
&lt;span class="ss"&gt;}};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vector retrieval then graph expansion
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Vector search for candidate nodes.&lt;/li&gt;
&lt;li&gt;Traverse for neighbours, provenance, and constraints.&lt;/li&gt;
&lt;li&gt;Format context for the LLM with clear boundaries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example using SEARCH inside MATCH (syntax may vary by Neo4j version; check the manual for your server release):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;d:&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;SEARCH&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;VECTOR&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;doc_embeddings&lt;/span&gt;
    &lt;span class="n"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;$queryEmbedding&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
  &lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SCORE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:MENTIONS&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;e:&lt;/span&gt;&lt;span class="n"&gt;Entity&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;d.id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;e.name&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;entities&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
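&lt;p&gt;Step 3 of the pattern, formatting context for the LLM, deserves more care than it usually gets: delimit each retrieved document so the model can tell sources apart and you can trace answers back. A minimal formatting sketch over the columns the retrieval query returns (the delimiter style is an illustrative choice):&lt;/p&gt;

```python
def format_context(results):
    """Render retrieval results as clearly delimited context blocks.

    Each result carries doc_id, score, and the entities the
    document mentions.
    """
    blocks = []
    for r in results:
        blocks.append(
            f"--- source: {r['doc_id']} (score {r['score']:.2f}) ---\n"
            f"entities: {', '.join(r['entities'])}"
        )
    return "\n\n".join(blocks)

context = format_context([
    {"doc_id": "doc-7", "score": 0.912, "entities": ["Neo4j", "Cypher"]},
    {"doc_id": "doc-2", "score": 0.841, "entities": ["GraphRAG"]},
])
```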



&lt;h3&gt;
  
  
  Minimal GraphRAG in Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;neo4j&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GraphDatabase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;neo4j_graphrag.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;neo4j_graphrag.embeddings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;neo4j_graphrag.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAILLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;neo4j_graphrag.generation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GraphRAG&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GraphDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neo4j://localhost:7687&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neo4j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GraphRAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I do similarity search in Neo4j?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retriever_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-world Neo4j use cases: fraud, recommendations, and knowledge graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fraud detection and risk graphs
&lt;/h3&gt;

&lt;p&gt;Fraud is rarely one row. It is patterns across accounts, devices, IPs, merchants, identities, and time. Graphs express neighbourhoods and multi-hop paths without ten-way join mazes.&lt;/p&gt;
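
&lt;p&gt;As a toy illustration of the multi-hop idea (plain Python over made-up accounts and devices, standing in for what a Cypher traversal over a real graph would do), a breadth-first walk finds accounts linked only through shared devices:&lt;/p&gt;

```python
from collections import deque

# Toy bipartite graph of accounts and the devices they used; the data is
# invented for illustration, standing in for the account/device/IP
# relationships a graph database would store.
edges = {
    "acct:A": ["dev:1"],
    "acct:B": ["dev:1", "dev:2"],
    "acct:C": ["dev:2"],
    "dev:1": ["acct:A", "acct:B"],
    "dev:2": ["acct:B", "acct:C"],
}

def linked_accounts(start, max_hops):
    """BFS up to max_hops edges; returns other accounts in the neighbourhood."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return sorted(n for n in seen if n.startswith("acct:") and n != start)

# acct:A reaches acct:C only via two shared devices, a 4-hop path.
print(linked_accounts("acct:A", 4))
```

&lt;p&gt;In Neo4j the same neighbourhood is one variable-length pattern match rather than a hand-written BFS, which is the whole point.&lt;/p&gt;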

&lt;h3&gt;
  
  
  Recommendations with behaviour and explicit relationships
&lt;/h3&gt;

&lt;p&gt;Production recommendations combine &lt;strong&gt;scored candidates&lt;/strong&gt; with &lt;strong&gt;inventory, constraints, hierarchies&lt;/strong&gt;, and &lt;strong&gt;explainability&lt;/strong&gt;. Graphs help you return paths people can reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge graphs for RAG and agents
&lt;/h3&gt;

&lt;p&gt;RAG needs grounding; agents need memory, provenance, and constraints. A knowledge graph stores entities, relationships, sources, and embeddings in one model, which makes it a natural fit for GraphRAG.&lt;/p&gt;

&lt;h2&gt;
  
  
  When should you choose Neo4j over Amazon Neptune or TigerGraph?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When should you choose Neo4j over Amazon Neptune or TigerGraph?&lt;/strong&gt; Choose Neo4j for a &lt;strong&gt;Cypher-first&lt;/strong&gt; graph and &lt;strong&gt;vector + traversal&lt;/strong&gt; in one product. Choose Neptune when &lt;strong&gt;AWS&lt;/strong&gt; and &lt;strong&gt;Gremlin or RDF&lt;/strong&gt; line up with your organisation. Choose TigerGraph when &lt;strong&gt;GSQL&lt;/strong&gt; and &lt;strong&gt;analytics-style&lt;/strong&gt; workloads are the primary bet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/" rel="noopener noreferrer"&gt;Neo4j graph concepts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/cypher-manual/current/introduction/" rel="noopener noreferrer"&gt;Cypher manual&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/" rel="noopener noreferrer"&gt;Indexes including vector indexes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/cypher-manual/current/clauses/search/" rel="noopener noreferrer"&gt;SEARCH clause&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/operations-manual/current/docker/introduction/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/operations-manual/current/configuration/neo4j-conf/" rel="noopener noreferrer"&gt;neo4j.conf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neo4j.com/docs/neo4j-graphrag-python/current/" rel="noopener noreferrer"&gt;Neo4j GraphRAG Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/neptune/" rel="noopener noreferrer"&gt;Amazon Neptune&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.tigergraph.com/gsql-ref/4.2/intro/" rel="noopener noreferrer"&gt;TigerGraph GSQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>docker</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Netlify for Hugo &amp; static sites: pricing, free tier, and alternatives</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 24 Mar 2026 10:59:34 +0000</pubDate>
      <link>https://forem.com/rosgluk/netlify-for-hugo-static-sites-pricing-free-tier-and-alternatives-2470</link>
      <guid>https://forem.com/rosgluk/netlify-for-hugo-static-sites-pricing-free-tier-and-alternatives-2470</guid>
      <description>&lt;p&gt;Netlify is one of the most developer-friendly ways to ship &lt;strong&gt;Hugo sites&lt;/strong&gt; and &lt;strong&gt;modern web apps&lt;/strong&gt; with a production-grade workflow: preview URLs for every pull request, atomic deploys, a global CDN, and optional serverless and edge capabilities.&lt;/p&gt;

&lt;p&gt;This guide explains &lt;strong&gt;how Netlify works&lt;/strong&gt;, how &lt;strong&gt;credit-based pricing&lt;/strong&gt; affects real deployments, what you can do on the &lt;strong&gt;Free plan&lt;/strong&gt;, and when an alternative like &lt;strong&gt;Vercel&lt;/strong&gt; or &lt;strong&gt;Cloudflare Pages&lt;/strong&gt; is a better fit.&lt;/p&gt;

&lt;p&gt;For a broader view of static site deployment options, see the &lt;a href="https://www.glukhov.org/web-infrastructure/" rel="noopener noreferrer"&gt;Web Infrastructure&lt;/a&gt; cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Netlify is used for
&lt;/h2&gt;

&lt;p&gt;Netlify is a deployment platform (often described as WebOps or a modern JAMstack platform) that connects to your repository, runs a build, and publishes the output behind a global CDN. The practical outcome is a workflow where &lt;strong&gt;every change can be previewed&lt;/strong&gt;, and production releases are &lt;strong&gt;repeatable&lt;/strong&gt;, &lt;strong&gt;reversible&lt;/strong&gt;, and &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you run a Hugo-based technical blog, Netlify’s sweet spot is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static sites&lt;/strong&gt; built by Hugo, Astro, Eleventy, and similar generators.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-page applications&lt;/strong&gt; where the build produces static assets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sites with lightweight backend needs&lt;/strong&gt;, implemented via serverless functions (APIs, webhooks, auth glue) or edge logic (routing, geo-based content, experiments).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The core deploy model in one sentence
&lt;/h3&gt;

&lt;p&gt;Netlify deploys are &lt;strong&gt;atomic&lt;/strong&gt;: a new deploy becomes live only after the whole new version is uploaded, so visitors do not see inconsistent intermediate states.&lt;/p&gt;
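
&lt;p&gt;A filesystem analogy (not how Netlify's CDN is actually implemented) makes the property concrete: build the whole new version off to the side, then make it live with a single atomic swap:&lt;/p&gt;

```python
import os
import tempfile

# Build the full new version in its own directory first...
root = tempfile.mkdtemp()
build = os.path.join(root, "deploy-2")
os.makedirs(build)
with open(os.path.join(build, "index.html"), "w") as f:
    f.write("v2")

# ..."current" points at the old version while the build runs...
current = os.path.join(root, "current")
os.symlink(os.path.join(root, "deploy-1"), current)

# ...then one atomic step flips traffic: visitors see either the old
# version or the complete new one, never a half-uploaded mix.
tmp_link = os.path.join(root, "current.tmp")
os.symlink(build, tmp_link)
os.replace(tmp_link, current)  # rename(2) is atomic on POSIX
```

&lt;p&gt;Rollbacks fall out of the same model: point &lt;code&gt;current&lt;/code&gt; back at any previous complete deploy.&lt;/p&gt;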

&lt;h2&gt;
  
  
  Why developers pick Netlify
&lt;/h2&gt;

&lt;p&gt;Netlify’s popularity is less about “static hosting” and more about the workflow and platform primitives around it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy Previews for pull requests
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Deploy Previews&lt;/strong&gt; generate a unique preview URL for each pull or merge request in a connected Git repository. Reviewers can validate content, layout, and performance without publishing to production. That is &lt;strong&gt;how Deploy Previews work on Netlify&lt;/strong&gt; in practice—per-PR preview environments with their own URLs and deploy contexts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Branch deploys for long-lived environments
&lt;/h3&gt;

&lt;p&gt;For stable environments like &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;qa&lt;/code&gt;, or &lt;code&gt;release/*&lt;/code&gt;, Netlify supports &lt;strong&gt;branch deploys&lt;/strong&gt;. Configure branch deploys for specific branches (or for all new branches) when you want a permanent staging URL independent of PR previews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serverless Functions for web apps
&lt;/h3&gt;

&lt;p&gt;Netlify Functions run on-demand code without provisioning servers. A “static site” can still handle webhooks, small API endpoints, scheduled automation, and form-driven notifications. Functions deploy with your site, so previews and rollbacks apply to those endpoints too.&lt;/p&gt;

&lt;p&gt;If your “dynamic” work is &lt;strong&gt;model inference&lt;/strong&gt; (tokens, GPUs, long-running jobs) rather than short HTTP handlers, you will usually run a dedicated inference stack outside Netlify Functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge Functions for low-latency logic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Edge Functions&lt;/strong&gt; move selected logic to the edge. Typical uses include geolocation-based content, redirects, auth checks, and response modification close to the user—useful for global audiences and first-hit performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in forms and basic protections
&lt;/h3&gt;

&lt;p&gt;For many Hugo sites, a contact form is the last reason to keep a separate server. &lt;strong&gt;Netlify Forms&lt;/strong&gt; can handle submissions as part of the deploy pipeline, with spam-protection options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying a Hugo site on Netlify
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Netlify auto-detects for Hugo
&lt;/h3&gt;

&lt;p&gt;When you link a repo, Netlify can detect Hugo and suggest defaults such as build command &lt;code&gt;hugo&lt;/code&gt; and publish directory &lt;code&gt;public&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pin your Hugo version for repeatable builds
&lt;/h3&gt;

&lt;p&gt;The most common CI failure is &lt;strong&gt;Hugo version drift&lt;/strong&gt;. Pin the version with an environment variable.&lt;/p&gt;

&lt;p&gt;A minimal &lt;code&gt;netlify.toml&lt;/code&gt; pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[build]&lt;/span&gt;
  &lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hugo"&lt;/span&gt;
  &lt;span class="py"&gt;publish&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"public"&lt;/span&gt;

&lt;span class="nn"&gt;[build.environment]&lt;/span&gt;
  &lt;span class="py"&gt;HUGO_VERSION&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"YOUR_HUGO_VERSION"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pattern is central to &lt;strong&gt;the best way to deploy a Hugo site on Netlify&lt;/strong&gt;—reproducible builds that match local development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make Deploy Previews render correctly
&lt;/h3&gt;

&lt;p&gt;Deploy Previews use their own URLs. If your Hugo config relies on absolute URLs (canonical links, sitemap, assets), set the base URL during preview builds. Netlify exposes &lt;code&gt;DEPLOY_PRIME_URL&lt;/code&gt; for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[context.deploy-preview]&lt;/span&gt;
  &lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hugo --gc --minify --buildFuture -b $DEPLOY_PRIME_URL"&lt;/span&gt;

&lt;span class="nn"&gt;[context.branch-deploy]&lt;/span&gt;
  &lt;span class="py"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hugo --gc --minify -b $DEPLOY_PRIME_URL"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Themes and submodules
&lt;/h3&gt;

&lt;p&gt;If you use a Hugo theme, treat it as a CI dependency—typically a &lt;strong&gt;Git submodule&lt;/strong&gt; so Netlify can fetch it on build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Netlify pricing and plan model
&lt;/h2&gt;

&lt;p&gt;Separate two ideas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plan features&lt;/strong&gt; (collaboration, security, team workflows).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measured usage&lt;/strong&gt; (what you consume while deploying and serving).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Credit-based plans
&lt;/h3&gt;

&lt;p&gt;Many newer accounts use &lt;strong&gt;credit-based pricing&lt;/strong&gt;. Credits cover production deploys, bandwidth, requests, function compute, form usage, and related consumption. Older blog posts that only discuss “build minutes” may be outdated for your account type—check Netlify’s own billing docs for your team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plans at a glance
&lt;/h3&gt;

&lt;p&gt;Self-serve tiers are commonly listed as &lt;strong&gt;Free&lt;/strong&gt;, &lt;strong&gt;Personal&lt;/strong&gt;, &lt;strong&gt;Pro&lt;/strong&gt;, and &lt;strong&gt;Enterprise&lt;/strong&gt;, each with a monthly credit allowance (Free has a hard cap; paid plans can add credits).&lt;/p&gt;

&lt;h3&gt;
  
  
  How credits are consumed
&lt;/h3&gt;

&lt;p&gt;Credits map to real cost drivers—&lt;strong&gt;how Netlify pricing works with credits&lt;/strong&gt; in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;production deploys&lt;/li&gt;
&lt;li&gt;bandwidth&lt;/li&gt;
&lt;li&gt;web requests&lt;/li&gt;
&lt;li&gt;serverless function compute&lt;/li&gt;
&lt;li&gt;form submissions&lt;/li&gt;
&lt;li&gt;optional platform features you enable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat credits as a &lt;strong&gt;monthly budget&lt;/strong&gt;, not a single number you ignore until the dashboard complains.&lt;/p&gt;
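
&lt;p&gt;A back-of-envelope sketch of that budget thinking. The per-unit rates below are invented for illustration; real rates live in Netlify's billing docs for your plan:&lt;/p&gt;

```python
# Hypothetical credit rates, purely illustrative.
RATES = {
    "production_deploys": 2.0,      # credits per production deploy
    "bandwidth_gb": 0.1,            # credits per GB served
    "function_gb_seconds": 0.001,   # credits per GB-second of compute
}

def credits_used(usage):
    """Multiply each usage counter by its (made-up) rate and sum."""
    return sum(RATES[key] * amount for key, amount in usage.items())

month = {"production_deploys": 40, "bandwidth_gb": 120, "function_gb_seconds": 5000}
print(round(credits_used(month), 1))  # 80 + 12 + 5 = 97.0 credits
```

&lt;p&gt;Even with invented numbers, the exercise shows which line item dominates before the dashboard tells you.&lt;/p&gt;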

&lt;h3&gt;
  
  
  Team seats vs reviewers
&lt;/h3&gt;

&lt;p&gt;Netlify distinguishes people who manage and deploy projects from people who only review. Using &lt;strong&gt;reviewer&lt;/strong&gt; roles for stakeholders can control cost without blocking feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much you can achieve on the Free plan
&lt;/h2&gt;

&lt;p&gt;The Free plan can genuinely support production sites, but only if you respect its limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  What you get on Free
&lt;/h3&gt;

&lt;p&gt;Typical Free-tier benefits include custom domains and TLS, &lt;strong&gt;unlimited Deploy Previews&lt;/strong&gt; (previews are the main collaboration win), and access to CDN, functions, and related primitives. The hard constraint is the &lt;strong&gt;monthly credit limit&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick mental models for planning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Many &lt;strong&gt;production&lt;/strong&gt; deploys to main can burn credits quickly.&lt;/li&gt;
&lt;li&gt;Viral traffic or large assets can dominate &lt;strong&gt;bandwidth&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Asset-heavy pages can increase &lt;strong&gt;request&lt;/strong&gt; volume.&lt;/li&gt;
&lt;li&gt;Serverless APIs add &lt;strong&gt;compute&lt;/strong&gt;—track it if you add backends.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Realistic Free-plan scenarios
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A — Hugo blog, few production releases, optimised images, moderate traffic&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Usually a good fit. Previews absorb most review load; production deploys stay low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B — Docs site with constant merges to main&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Production deploys can consume the budget. Batching merges, leaning on PR previews, or controlling release timing helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C — Static front end plus a small API&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Often workable, but watch function compute. Heavy work belongs elsewhere—same story as for GPU-backed inference workloads, where you monitor latency, cost, and production signals on the serving tier, not inside Netlify’s function sandbox.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens when you hit the limit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens when you run out of Netlify credits?&lt;/strong&gt; On Free, Netlify aims to avoid surprise charges by enforcing the cap—projects can be &lt;strong&gt;paused&lt;/strong&gt; until the next cycle or until you upgrade or add credits on an eligible plan. Verify the exact behaviour for your account in Netlify’s current billing documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Netlify competitors and alternatives
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How does Netlify compare with Vercel and Cloudflare Pages?&lt;/strong&gt; Roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vercel&lt;/strong&gt; — Strong for modern frontend apps and preview-centric workflows; evaluate usage-based scaling for your traffic profile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare Pages&lt;/strong&gt; — Pairs static hosting with Cloudflare’s edge; often attractive when bandwidth and edge integration matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Pages&lt;/strong&gt; — Minimal moving parts for simple static sites; stricter limits and fewer platform features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Static Web Apps&lt;/strong&gt; — Fits teams already on Azure; path from static hosting to Azure Functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Amplify Hosting&lt;/strong&gt; — Makes sense when you want AWS-native integration and are comfortable with AWS billing models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For CLI-first AWS workflows, see &lt;a href="https://www.glukhov.org/web-infrastructure/hugo/hugo-website-deployment-to-aws-s3-with-aws-cli/" rel="noopener noreferrer"&gt;Deploy Hugo Site to AWS S3 with AWS CLI&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final recommendations
&lt;/h2&gt;

&lt;p&gt;Pick Netlify when you want Git-centric &lt;strong&gt;Deploy Previews&lt;/strong&gt;, atomic deploys, rollbacks, and optional functions or edge logic—&lt;strong&gt;what Netlify is used for&lt;/strong&gt; in most successful Hugo teams.&lt;/p&gt;

&lt;p&gt;Before you rely on Free for production, estimate &lt;strong&gt;monthly production deploy count&lt;/strong&gt; and &lt;strong&gt;bandwidth or request volume&lt;/strong&gt; (especially for large media). If you outgrow the free budget, pricing becomes part of architecture—not an afterthought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Netlify free for commercial use?&lt;/strong&gt; Yes, within plan limits; high traffic or deploy-heavy workflows usually need a paid tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.netlify.com/pricing/" rel="noopener noreferrer"&gt;Netlify pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.netlify.com/manage/accounts-and-billing/billing/billing-for-credit-based-plans/how-credits-work/" rel="noopener noreferrer"&gt;How credits work (credit-based plans)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.netlify.com/deploy/deploy-overview/" rel="noopener noreferrer"&gt;Deploy overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.netlify.com/build/frameworks/framework-setup-guides/hugo/" rel="noopener noreferrer"&gt;Hugo on Netlify&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hugo</category>
      <category>deployment</category>
      <category>hosting</category>
    </item>
    <item>
      <title>Apache Flink on K8s and Kafka: PyFlink, Go, ops, and managed pricing</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 24 Mar 2026 10:57:30 +0000</pubDate>
      <link>https://forem.com/rosgluk/apache-flink-on-k8s-and-kafka-pyflink-go-ops-and-managed-pricing-93j</link>
      <guid>https://forem.com/rosgluk/apache-flink-on-k8s-and-kafka-pyflink-go-ops-and-managed-pricing-93j</guid>
      <description>&lt;p&gt;Apache Flink is a framework for &lt;em&gt;stateful computations&lt;/em&gt; over unbounded and bounded data streams.&lt;/p&gt;

&lt;p&gt;Teams adopt it for &lt;strong&gt;correct&lt;/strong&gt;, &lt;strong&gt;low-latency&lt;/strong&gt; streaming with event-time semantics (watermarks), fault tolerance (checkpoints), controlled upgrades (savepoints), and operational surfaces (metrics and REST).&lt;/p&gt;

&lt;p&gt;This guide targets DevOps and Go/Python developers. It compares deployment models (self-managed vs managed), explains core architecture, covers Kubernetes (Helm and Operator) and standalone setups, contrasts Flink with Spark, Kafka Streams, Beam, and streaming databases, and shows PyFlink plus Go integration patterns including LLM and AI-oriented pipelines.&lt;/p&gt;

&lt;p&gt;For broader context on data infrastructure patterns including object storage, databases, and messaging, see &lt;a href="https://www.glukhov.org/data-infrastructure/" rel="noopener noreferrer"&gt;Data Infrastructure for AI Systems: Object Storage, Databases, Search &amp;amp; AI Data Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Flink and why teams use it for real-time processing
&lt;/h2&gt;

&lt;p&gt;Apache Flink is explicitly positioned as a &lt;em&gt;stateful stream processing&lt;/em&gt; engine: you model your logic as a pipeline of operators and Flink runs it as a distributed dataflow with managed state and time semantics. In modern Flink documentation, the project describes itself as a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. &lt;/p&gt;

&lt;p&gt;From a practical DevOps/software engineering perspective, Flink is a good fit when you need at least one of the following properties.&lt;/p&gt;

&lt;p&gt;If you need &lt;strong&gt;join/aggregate/enrich at low latency with correctness guarantees&lt;/strong&gt;, you typically use Flink’s event-time processing, where “time” is when the event happened (not when it arrived), and watermarks communicate event-time progress through the pipeline. &lt;/p&gt;

&lt;p&gt;If you need &lt;strong&gt;stateful computation at scale&lt;/strong&gt; (rolling counters, sessions, fraud rules, feature engineering), Flink treats state as a first-class part of the programming model and makes it fault tolerant via checkpointing. &lt;/p&gt;

&lt;p&gt;If you need &lt;strong&gt;operationally robust streaming&lt;/strong&gt; (failures, rolling upgrades, restarts), Flink checkpoints state and stream positions so the job can recover and continue with the same semantics “as a failure-free execution”. &lt;/p&gt;

&lt;h3&gt;
  
  
  Typical use cases for DevOps, Go, Python, and AI teams
&lt;/h3&gt;

&lt;p&gt;Flink is widely used for “data pipelines &amp;amp; ETL”, “streaming analytics”, and “event-driven applications” (the categories used by the Flink docs). &lt;/p&gt;

&lt;p&gt;For a DevOps + Go/Python stack, typical patterns look like this:&lt;/p&gt;

&lt;p&gt;A Go service produces events to Kafka; Flink consumes those events, performs stateful processing (e.g., dedupe, windowed aggregation, enrichment), then writes derived facts back to Kafka or a database. Flink’s operator and checkpointing mechanisms exist to make these stateful pipelines production-safe. &lt;/p&gt;
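
&lt;p&gt;A hand-rolled sketch of that stateful core (dedupe plus tumbling-window aggregation) in plain Python. In a real job this state lives inside Flink operators, is partitioned by key, and is checkpointed for you; the event shape here is hypothetical:&lt;/p&gt;

```python
from collections import defaultdict

WINDOW_MS = 60_000                    # 1-minute tumbling windows

seen_ids = set()                      # dedupe state
counts = defaultdict(int)             # (key, window_start) aggregate state

def process(event):
    """Drop redeliveries, then count events per key per event-time window."""
    if event["id"] in seen_ids:
        return                        # duplicate delivery from the broker
    seen_ids.add(event["id"])
    window_start = event["ts"] - event["ts"] % WINDOW_MS
    counts[(event["key"], window_start)] += 1

events = [
    {"id": "e1", "key": "user-1", "ts": 10_000},
    {"id": "e2", "key": "user-1", "ts": 20_000},
    {"id": "e1", "key": "user-1", "ts": 10_000},   # redelivered
    {"id": "e3", "key": "user-1", "ts": 70_000},   # lands in the next window
]
for event in events:
    process(event)

print(dict(counts))
```

&lt;p&gt;The hard parts Flink takes off your hands are exactly the ones this sketch ignores: surviving crashes without losing &lt;code&gt;seen_ids&lt;/code&gt; and &lt;code&gt;counts&lt;/code&gt;, and knowing when a window can safely close.&lt;/p&gt;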

&lt;p&gt;For ML/LLM teams, PyFlink explicitly calls out scenarios like “machine learning prediction” and loading machine learning models inside Python UDFs as a dependency-management motivation, which is a direct endorsement of “Flink job as online inference / feature engineering runtime” patterns. &lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Flink architecture and core features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Apache Flink cluster architecture for production deployments
&lt;/h3&gt;

&lt;p&gt;Flink’s runtime consists of two process types: &lt;strong&gt;JobManager&lt;/strong&gt; and &lt;strong&gt;TaskManagers&lt;/strong&gt;. The docs emphasise that clients submit the dataflow to the JobManager; the client can then disconnect (detached mode) or stay connected (attached mode). &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;JobManager&lt;/strong&gt; coordinates distributed execution: scheduling, reacting to task completion/failure, coordinating checkpoints, and coordinating recovery. Internally, it includes: ResourceManager (slots/resources), Dispatcher (REST + Web UI + per-job JobMaster creation), and JobMaster (manages one job). &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;TaskManagers&lt;/strong&gt; execute the operators/tasks, and exchange/buffer data streams. The smallest scheduling unit is the &lt;strong&gt;task slot&lt;/strong&gt;; multiple operators can execute in one slot (operator chaining and slot sharing affect this). &lt;/p&gt;

&lt;h3&gt;
  
  
  Operator chaining and task slots for performance and cost control
&lt;/h3&gt;

&lt;p&gt;Flink chains operator subtasks into &lt;strong&gt;tasks&lt;/strong&gt;, where each task is executed by a single thread. This is described as a performance optimisation that reduces thread handover and buffering overhead, increasing throughput and decreasing latency. &lt;/p&gt;

&lt;p&gt;Slots matter operationally because they are the unit of resource scheduling/isolation. Flink notes that each TaskManager may have one or more task slots; slotting reserves &lt;em&gt;managed memory&lt;/em&gt; per slot, but &lt;strong&gt;does not isolate CPU&lt;/strong&gt;. &lt;/p&gt;
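
&lt;p&gt;The capacity arithmetic implied by the slot model is worth doing before sizing a cluster (the numbers below are illustrative):&lt;/p&gt;

```python
# taskmanager.numberOfTaskSlots times the number of TaskManagers bounds
# how many parallel subtask "slices" can be scheduled at once.
task_managers = 4
slots_per_tm = 3
total_slots = task_managers * slots_per_tm

# With slot sharing, one slot can hold one parallel slice of the whole
# pipeline, so effective job parallelism is capped by total slots.
requested_parallelism = 10
schedulable = min(requested_parallelism, total_slots)
print(total_slots, schedulable)  # 12 10
```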

&lt;h3&gt;
  
  
  Event-time processing, watermarks, and late data
&lt;/h3&gt;

&lt;p&gt;Flink supports multiple notions of time—event time, ingestion time, processing time—and uses &lt;strong&gt;watermarks&lt;/strong&gt; to model progress in event time. &lt;/p&gt;

&lt;p&gt;To work with event time, Flink needs timestamps assigned to events and watermarks generated; the official “Generating Watermarks” documentation explains timestamp assignment and watermark generation as the core building blocks, with WatermarkStrategy being the standard way to configure common strategies. &lt;/p&gt;
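
&lt;p&gt;The bounded-out-of-orderness idea behind the most common WatermarkStrategy can be re-implemented in a few lines. This is a sketch of the semantics, not Flink's actual API:&lt;/p&gt;

```python
class BoundedOutOfOrderness:
    """Watermark trails the max event timestamp seen so far by a fixed delay."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_ts = 0

    def on_event(self, ts):
        self.max_ts = max(self.max_ts, ts)

    def current_watermark(self):
        # "No event older than this is expected any more"; the -1 mirrors
        # Flink's convention of watermarking just below the threshold.
        return self.max_ts - self.max_delay_ms - 1

wm = BoundedOutOfOrderness(max_delay_ms=5_000)
for ts in [1_000, 7_000, 4_000]:      # 4_000 arrives late but within bound
    wm.on_event(ts)
print(wm.current_watermark())  # 7000 - 5000 - 1 = 1999
```

&lt;p&gt;Windows downstream close once the watermark passes their end, which is how the pipeline trades a little latency (the delay) for tolerance of out-of-order events.&lt;/p&gt;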

&lt;h3&gt;
  
  
  Fault tolerance: checkpoints versus savepoints in real systems
&lt;/h3&gt;

&lt;p&gt;Checkpointing exists because “every function and operator in Flink can be stateful”; state must be checkpointed to become fault tolerant. Checkpoints enable recovery of both state and stream positions so execution can resume with failure-free semantics. &lt;/p&gt;

&lt;p&gt;Flink is very explicit that &lt;strong&gt;savepoints&lt;/strong&gt; are “a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism”, used to stop-and-resume, fork, or update jobs. Savepoints live on stable storage (e.g., HDFS, S3). &lt;/p&gt;

&lt;p&gt;The official “Checkpoints vs Savepoints” page frames the difference like backups vs recovery logs: checkpoints are frequent, lightweight, managed by Flink for failure recovery; savepoints are user-managed and used for controlled operations like upgrades. &lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Flink deployment options and pricing plans
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Free/self-managed Apache Flink option
&lt;/h3&gt;

&lt;p&gt;The open-source Flink runtime is “free” in the licensing sense, but in production you pay for infrastructure and operational effort.&lt;/p&gt;

&lt;p&gt;Flink is designed to integrate with common resource managers (e.g., YARN and Kubernetes) and can also run as a standalone cluster or as a library. &lt;/p&gt;

&lt;h4&gt;
  
  
  Self-managed cost drivers for Apache Flink
&lt;/h4&gt;

&lt;p&gt;Compute and memory costs are driven by JobManager and TaskManagers, and by your parallelism/slot layout. Flink’s configuration documentation explicitly calls out &lt;code&gt;jobmanager.memory.process.size&lt;/code&gt;, &lt;code&gt;taskmanager.memory.process.size&lt;/code&gt;, &lt;code&gt;taskmanager.numberOfTaskSlots&lt;/code&gt;, and &lt;code&gt;parallelism.default&lt;/code&gt; as core knobs for distributed setups. &lt;/p&gt;

&lt;p&gt;Local disk is a frequent hidden cost for stateful jobs. Flink notes that &lt;code&gt;io.tmp.dirs&lt;/code&gt; stores local data including &lt;strong&gt;RocksDB files&lt;/strong&gt;, spilled intermediate results, and cached JARs; if this data is deleted, it can force “a heavyweight recovery operation”, so it should live on storage that is not periodically purged. &lt;/p&gt;

&lt;p&gt;Durable object/file storage cost is driven by checkpoint/savepoint directories. In Flink 2.x config, checkpoints and savepoints are configured via &lt;code&gt;execution.checkpointing.dir&lt;/code&gt; and &lt;code&gt;execution.checkpointing.savepoint-dir&lt;/code&gt; and accept URIs like &lt;code&gt;s3://…&lt;/code&gt; or &lt;code&gt;hdfs://…&lt;/code&gt;. &lt;/p&gt;
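
&lt;p&gt;Pulling the knobs from the last few paragraphs into one place, a configuration fragment might look like this. Sizes, slot counts, and bucket names are placeholders; the key names follow the Flink docs referenced above, but check the configuration file format for your Flink version:&lt;/p&gt;

```yaml
# Illustrative Flink configuration fragment; all values are placeholders.
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 4096m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 4
execution.checkpointing.dir: s3://my-bucket/flink/checkpoints
execution.checkpointing.savepoint-dir: s3://my-bucket/flink/savepoints
```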

&lt;h3&gt;
  
  
  Managed Apache Flink plans and typical billing models
&lt;/h3&gt;

&lt;p&gt;Managed services reduce operational cost but add platform fees and constraints. The specifics are provider-dependent.&lt;/p&gt;

&lt;p&gt;Amazon Managed Service for Apache Flink bills by &lt;strong&gt;KPUs&lt;/strong&gt; (1 vCPU + 4 GB memory per KPU) and charges by duration and number of KPUs in one-second increments. AWS also charges an additional “orchestration” KPU per application and separate storage/backups fees. &lt;/p&gt;

&lt;p&gt;Confluent Cloud for Apache Flink is usage-based and serverless: you create a compute pool, and you’re billed for &lt;strong&gt;CFUs consumed per minute&lt;/strong&gt; while statements are running. The billing page includes an example CFU price of &lt;strong&gt;$0.21 per CFU-hour&lt;/strong&gt; (region-dependent) and emphasises that you can limit spend via compute pool maximums. &lt;/p&gt;
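&lt;p&gt;For rough budgeting, both billing models reduce to units × rate × hours. A back-of-envelope sketch using the illustrative example rates quoted in this article ($0.11/KPU-hour, $0.21/CFU-hour; real prices vary by region and over time):&lt;/p&gt;

```python
# Back-of-envelope monthly compute cost for the two managed billing models.
# The default rates are the illustrative examples from this article;
# they are region-dependent and subject to change.

HOURS_PER_MONTH = 730  # average hours in a month

def aws_mskf_monthly(kpus: int, rate_per_kpu_hour: float = 0.11) -> float:
    """AWS bills the processing KPUs plus one 'orchestration' KPU per app."""
    return (kpus + 1) * rate_per_kpu_hour * HOURS_PER_MONTH

def confluent_flink_monthly(cfus: int, rate_per_cfu_hour: float = 0.21) -> float:
    """Confluent bills CFUs while statements run; this assumes 24/7 usage."""
    return cfus * rate_per_cfu_hour * HOURS_PER_MONTH
```

&lt;p&gt;Storage, backup, and networking fees come on top of these compute figures on both platforms.&lt;/p&gt;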

&lt;p&gt;Aiven and Alibaba Cloud are notable managed Flink providers in the market, but their public pricing and billing details vary by plan/region and may require calculators or sales contact; treat exact costs as &lt;strong&gt;unspecified unless you quote a region+plan&lt;/strong&gt; from their current docs. &lt;/p&gt;

&lt;p&gt;Ververica offers both self-managed and managed deployment options around Flink; public pages emphasise deployment choices and managed service positioning, while exact pricing is typically handled via “contact/pricing details” flows (so specific numbers are often &lt;strong&gt;unspecified&lt;/strong&gt; publicly). &lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment options table for Apache Flink in production
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment option&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Operational complexity&lt;/th&gt;
&lt;th&gt;Key benefits&lt;/th&gt;
&lt;th&gt;Key risks / trade-offs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standalone cluster (VMs/bare metal)&lt;/td&gt;
&lt;td&gt;Small teams, fixed capacity&lt;/td&gt;
&lt;td&gt;Medium–High&lt;/td&gt;
&lt;td&gt;Full control; simplest mental model&lt;/td&gt;
&lt;td&gt;HA, autoscaling, upgrades are DIY (more toil)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes with Flink Kubernetes Operator&lt;/td&gt;
&lt;td&gt;Most modern platform teams&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Declarative deployments; lifecycle management via control loop; operator supports Application/Session/Job deployments&lt;/td&gt;
&lt;td&gt;Kubernetes + operator expertise required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Kubernetes (without operator)&lt;/td&gt;
&lt;td&gt;K8s teams wanting direct integration&lt;/td&gt;
&lt;td&gt;Medium–High&lt;/td&gt;
&lt;td&gt;Direct resource integration; dynamic TaskManager allocation/deallocation described in Flink-on-K8s docs&lt;/td&gt;
&lt;td&gt;More bespoke automation than operator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YARN&lt;/td&gt;
&lt;td&gt;Hadoop-centric platforms&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Integrates with YARN resource mgmt&lt;/td&gt;
&lt;td&gt;Hadoop stack complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Managed Service for Apache Flink&lt;/td&gt;
&lt;td&gt;AWS-native data stacks&lt;/td&gt;
&lt;td&gt;Low–Medium&lt;/td&gt;
&lt;td&gt;Managed orchestration + scaling options; predictable billing unit (KPU)&lt;/td&gt;
&lt;td&gt;Platform coupling; extra per-app overhead KPU + storage fees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confluent Cloud for Apache Flink&lt;/td&gt;
&lt;td&gt;Kafka-first shops, SQL-first stream apps&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Serverless usage billing; CFU-minute accounting; compute pools to cap spend&lt;/td&gt;
&lt;td&gt;CFU costs + Kafka networking costs; service-specific APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ververica managed offerings&lt;/td&gt;
&lt;td&gt;Enterprises needing Flink expert ops&lt;/td&gt;
&lt;td&gt;Low–Medium&lt;/td&gt;
&lt;td&gt;“Flink experts” managed service positioning&lt;/td&gt;
&lt;td&gt;Pricing often not transparent (unspecified)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Managed providers and costs table
&lt;/h3&gt;

&lt;p&gt;Prices vary by region and change over time; treat this table as a starting point and verify against each provider’s current pricing pages (unquoted regions are &lt;strong&gt;unspecified&lt;/strong&gt;).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;“Plan” shape&lt;/th&gt;
&lt;th&gt;Billing unit&lt;/th&gt;
&lt;th&gt;Example compute price&lt;/th&gt;
&lt;th&gt;Notable additional cost drivers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Managed Service for Apache Flink&lt;/td&gt;
&lt;td&gt;Managed runtime&lt;/td&gt;
&lt;td&gt;KPU (1 vCPU + 4 GB)&lt;/td&gt;
&lt;td&gt;Example shown: $0.11 per KPU-hour (US East N. Virginia)&lt;/td&gt;
&lt;td&gt;+1 orchestration KPU per app; running storage; optional durable backups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confluent Cloud for Apache Flink&lt;/td&gt;
&lt;td&gt;Serverless SQL/processing&lt;/td&gt;
&lt;td&gt;CFU-minute/CFU-hour&lt;/td&gt;
&lt;td&gt;Example shown: $0.21 per CFU-hour (region varies)&lt;/td&gt;
&lt;td&gt;Kafka networking rates still apply; compute pool max to cap spend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ververica (managed)&lt;/td&gt;
&lt;td&gt;Managed “Unified Streaming Data Platform”&lt;/td&gt;
&lt;td&gt;Unspecified (public pages)&lt;/td&gt;
&lt;td&gt;Unspecified&lt;/td&gt;
&lt;td&gt;Platform features/SLAs; pricing typically via sales (unspecified)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aiven for Apache Flink&lt;/td&gt;
&lt;td&gt;Managed service&lt;/td&gt;
&lt;td&gt;Hourly usage billing model (platform-wide)&lt;/td&gt;
&lt;td&gt;Unspecified without plan/region&lt;/td&gt;
&lt;td&gt;Plan tier + cloud region + add-ons (unspecified)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alibaba Cloud Realtime Compute for Apache Flink&lt;/td&gt;
&lt;td&gt;Managed/serverless&lt;/td&gt;
&lt;td&gt;Hybrid billing (pay-as-you-go + subscription mix)&lt;/td&gt;
&lt;td&gt;Unspecified without region/workspace details&lt;/td&gt;
&lt;td&gt;CU-based limits and workspace model (details vary; unspecified here)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Apache Flink vs competitors comparison
&lt;/h2&gt;

&lt;p&gt;Flink sits in a busy ecosystem. The “best” choice depends on latency, statefulness, operational preferences, and authoring model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Competitor comparison table: Flink vs Spark vs Kafka Streams vs Beam and newer options
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;Streaming execution model&lt;/th&gt;
&lt;th&gt;State &amp;amp; exactly-once story&lt;/th&gt;
&lt;th&gt;Where it shines&lt;/th&gt;
&lt;th&gt;Typical pain points&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Flink&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed stream processing engine for stateful computations&lt;/td&gt;
&lt;td&gt;Continuous streaming + event time via watermarks&lt;/td&gt;
&lt;td&gt;Checkpoint-based fault tolerance; savepoints for controlled upgrades&lt;/td&gt;
&lt;td&gt;Low-latency stateful pipelines; complex event-time logic&lt;/td&gt;
&lt;td&gt;Operating state, checkpoints, upgrades correctly takes discipline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Spark Structured Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spark’s streaming engine built around DataFrames/Datasets&lt;/td&gt;
&lt;td&gt;Default &lt;strong&gt;micro-batch&lt;/strong&gt; model (with a continuous mode discussed separately)&lt;/td&gt;
&lt;td&gt;Strong for analytical pipelines; state exists but often higher latency&lt;/td&gt;
&lt;td&gt;Unified batch+stream APIs; Spark ecosystem&lt;/td&gt;
&lt;td&gt;Micro-batch latency and “streaming as incremental batches” mental model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kafka Streams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Library to build stream-processing apps on Kafka&lt;/td&gt;
&lt;td&gt;Record-at-a-time processing&lt;/td&gt;
&lt;td&gt;Supports exactly-once processing semantics (EOS)&lt;/td&gt;
&lt;td&gt;Simple Kafka-native apps; embed in JVM service&lt;/td&gt;
&lt;td&gt;JVM-only; less flexible for large distributed compute patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Beam&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unified programming model + SDKs; executed via runners (Flink, Spark, Dataflow, etc.)&lt;/td&gt;
&lt;td&gt;Depends on runner; Beam pipelines translate to runner jobs&lt;/td&gt;
&lt;td&gt;Semantics depend on runner capability matrix (runner-specific)&lt;/td&gt;
&lt;td&gt;Portability, multi-language pipelines; avoid engine lock-in&lt;/td&gt;
&lt;td&gt;Operational tuning still ends up being runner-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Materialize&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Live data layer” / streaming SQL DB; incrementally updates results as data arrives&lt;/td&gt;
&lt;td&gt;Continuous incremental view maintenance&lt;/td&gt;
&lt;td&gt;Strong consistency claims in product docs (details are product-specific)&lt;/td&gt;
&lt;td&gt;Serving fresh derived views to apps/AI agents&lt;/td&gt;
&lt;td&gt;Different operational model than Flink jobs; not a general operator API runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RisingWave&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Streaming database where stream processing is expressed as materialised views&lt;/td&gt;
&lt;td&gt;Continuous materialised view maintenance&lt;/td&gt;
&lt;td&gt;SQL-first; engine-specific semantics&lt;/td&gt;
&lt;td&gt;SQL-centric streaming apps without building Flink jobs&lt;/td&gt;
&lt;td&gt;Less flexible for arbitrary code-heavy pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A useful heuristic: if you want a &lt;em&gt;runtime&lt;/em&gt; for complex stateful streaming jobs with deep control over event-time, operator logic, and deployments, Flink is a primary candidate. If you want &lt;em&gt;SQL-first incremental views&lt;/em&gt; for serving, streaming databases may be alternatives. If you want &lt;em&gt;a library embedded in a service&lt;/em&gt;, Kafka Streams is competitive. If you want &lt;em&gt;one portable pipeline definition across engines&lt;/em&gt;, Beam is compelling.&lt;/p&gt;

&lt;p&gt;For cloud-native event-driven architectures using AWS, &lt;a href="https://www.glukhov.org/data-infrastructure/stream-processing/service-oriented-and-microservices-with-aws-kinesis/" rel="noopener noreferrer"&gt;Building Event-Driven Microservices with AWS Kinesis&lt;/a&gt; covers Kinesis Data Streams patterns for real-time processing and service decoupling. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Apache Flink in custom-made systems
&lt;/h2&gt;

&lt;p&gt;This section is intentionally practical: configuration, deployment, and how your Go/Python services typically interact with Flink.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended architecture pattern: Go services + Kafka + Flink + serving layer
&lt;/h3&gt;

&lt;p&gt;Flink is often the “stateful middle” that turns high-volume events into durable signals (counters, sessions, anomalies, enriched records). Checkpoints and state backends are what make that middle reliable in production. &lt;/p&gt;

&lt;h3&gt;
  
  
  Standalone configuration example for Apache Flink 2.x
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Important version note:&lt;/strong&gt; starting with &lt;strong&gt;Flink 2.0&lt;/strong&gt;, the supported configuration file is &lt;code&gt;conf/config.yaml&lt;/code&gt;; the previous &lt;code&gt;flink-conf.yaml&lt;/code&gt; is “no longer supported”. &lt;/p&gt;

&lt;p&gt;A minimal (illustrative) &lt;code&gt;conf/config.yaml&lt;/code&gt; for a small self-managed cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conf/config.yaml (Flink 2.x style)&lt;/span&gt;
&lt;span class="na"&gt;rest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink-jobmanager.example.internal&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8081&lt;/span&gt;

&lt;span class="na"&gt;jobmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink-jobmanager.example.internal&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6123&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2048m&lt;/span&gt;

&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4096m&lt;/span&gt;
  &lt;span class="na"&gt;numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# Checkpointing defaults (jobs can still override in code)&lt;/span&gt;
&lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rocksdb&lt;/span&gt;
&lt;span class="na"&gt;execution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;checkpointing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://my-bucket/flink/checkpoints&lt;/span&gt;
    &lt;span class="na"&gt;savepoint-dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://my-bucket/flink/savepoints&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60 s&lt;/span&gt;

&lt;span class="c1"&gt;# Avoid tmp dirs that get purged (RocksDB files, cached jars, etc.)&lt;/span&gt;
&lt;span class="na"&gt;io&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tmp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dirs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/lib/flink/tmp"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why these keys: Flink’s configuration reference explicitly documents the &lt;code&gt;rest.*&lt;/code&gt; and &lt;code&gt;jobmanager.rpc.*&lt;/code&gt; discovery details, the process memory keys, the slot/parallelism keys, and the default checkpoint settings including &lt;code&gt;state.backend.type&lt;/code&gt;, &lt;code&gt;execution.checkpointing.dir&lt;/code&gt;, &lt;code&gt;execution.checkpointing.savepoint-dir&lt;/code&gt;, and &lt;code&gt;execution.checkpointing.interval&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;io.tmp.dirs&lt;/code&gt; choice is operationally important because Flink uses it for local RocksDB files and cached artefacts; deleting it can cause heavyweight recovery. &lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy standalone config example for Flink 1.x
&lt;/h3&gt;

&lt;p&gt;If you are on Flink 1.x (still common in some managed environments), you’ll see &lt;code&gt;flink-conf.yaml&lt;/code&gt; in the wild. This is &lt;strong&gt;legacy&lt;/strong&gt; for Flink 2.x users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conf/flink-conf.yaml (legacy 1.x style; NOT supported in Flink 2.x)&lt;/span&gt;
&lt;span class="na"&gt;jobmanager.rpc.address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink-jobmanager&lt;/span&gt;
&lt;span class="na"&gt;rest.port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8081&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;parallelism.default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# Legacy checkpoint keys differ by version; treat as illustrative.&lt;/span&gt;
&lt;span class="na"&gt;state.backend.type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rocksdb&lt;/span&gt;
&lt;span class="na"&gt;state.checkpoints.dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://my-bucket/flink/checkpoints&lt;/span&gt;
&lt;span class="na"&gt;state.savepoints.dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://my-bucket/flink/savepoints&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re migrating, Flink provides a migration script (&lt;code&gt;bin/migrate-config-file.sh&lt;/code&gt;) to convert &lt;code&gt;flink-conf.yaml&lt;/code&gt; to &lt;code&gt;config.yaml&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes/Helm deployment with the Flink Kubernetes Operator
&lt;/h3&gt;

&lt;p&gt;The Flink Kubernetes Operator acts as a control plane for Flink application lifecycle management and is installed using Helm. &lt;/p&gt;

&lt;p&gt;From the official operator Helm docs, you can install either from the source tree chart, or from the Apache-hosted chart repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install from bundled chart in source tree&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;flink-kubernetes-operator helm/flink-kubernetes-operator

&lt;span class="c"&gt;# install from Apache downloads Helm repository (replace &amp;lt;OPERATOR-VERSION&amp;gt;)&lt;/span&gt;
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-&amp;lt;OPERATOR-VERSION&amp;gt;/
helm &lt;span class="nb"&gt;install &lt;/span&gt;flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These exact commands are shown in the operator’s Helm installation documentation. &lt;/p&gt;

&lt;h4&gt;
  
  
  Example FlinkDeployment CR (illustrative)
&lt;/h4&gt;

&lt;p&gt;This is a simplified example to show the integration points you’ll typically customise (image, resources, checkpoint locations, logging/metrics). The operator reconciles this desired state via its control loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink.apache.org/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FlinkDeployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;realtime-sessions&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-registry.example.com/flink/realtime-sessions:2026-03-06&lt;/span&gt;
  &lt;span class="na"&gt;flinkVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2_2&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink&lt;/span&gt;
  &lt;span class="na"&gt;flinkConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;state.backend.type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rocksdb"&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/flink/checkpoints/realtime-sessions"&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.savepoint-dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/flink/savepoints/realtime-sessions"&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;s"&lt;/span&gt;
    &lt;span class="na"&gt;rest.port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8081"&lt;/span&gt;
  &lt;span class="na"&gt;jobManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2048m"&lt;/span&gt;
  &lt;span class="na"&gt;taskManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4096m"&lt;/span&gt;
  &lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;jarURI&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local:///opt/flink/usrlib/realtime-sessions.jar&lt;/span&gt;
    &lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;upgradeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;savepoint&lt;/span&gt;
    &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;upgradeMode: savepoint&lt;/code&gt; pattern is common when you want safe stateful upgrades; savepoints are designed for stop/resume/fork/update workflows and point to stable storage locations. &lt;/p&gt;
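&lt;p&gt;Outside Kubernetes, the same stop/resume workflow is driven from the Flink CLI. A sketch (replace &lt;code&gt;&amp;lt;job-id&amp;gt;&lt;/code&gt; with your own job ID; the savepoint path is illustrative):&lt;/p&gt;

```shell
# Trigger a savepoint for a running job (find <job-id> via `flink list`):
./bin/flink savepoint <job-id> s3://my-bucket/flink/savepoints/realtime-sessions

# Or stop the job and take a final savepoint in one step before an upgrade:
./bin/flink stop --savepointPath s3://my-bucket/flink/savepoints/realtime-sessions <job-id>
```

&lt;p&gt;On resume, pass the savepoint path back via &lt;code&gt;flink run -s &amp;lt;savepoint-path&amp;gt; …&lt;/code&gt;.&lt;/p&gt;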

&lt;h3&gt;
  
  
  PyFlink development: realistic Kafka streaming job with checkpoints and RocksDB state
&lt;/h3&gt;

&lt;p&gt;PyFlink is the Python API for Apache Flink and is explicitly pitched for scalable batch/stream workloads including ML pipelines and ETL. &lt;/p&gt;

&lt;h4&gt;
  
  
  Dependency packaging for PyFlink Kafka jobs
&lt;/h4&gt;

&lt;p&gt;When you use JVM connectors (Kafka, JDBC, etc.) from PyFlink, you must ensure the relevant JARs are available to the job. Flink’s Python “Dependency Management” docs describe three standard mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setting &lt;code&gt;pipeline.jars&lt;/code&gt; (Table API)&lt;/li&gt;
&lt;li&gt;calling &lt;code&gt;add_jars()&lt;/code&gt; (DataStream API)&lt;/li&gt;
&lt;li&gt;passing &lt;code&gt;--jarfile&lt;/code&gt; on the CLI at submission time&lt;/li&gt;
&lt;/ul&gt;
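&lt;p&gt;For example, the &lt;code&gt;pipeline.jars&lt;/code&gt; value is a semicolon-separated list of URIs. A tiny hypothetical helper to build it (the connector JAR path below is illustrative — match it to your Flink and connector versions):&lt;/p&gt;

```python
# Build the semicolon-separated URI list that pipeline.jars expects.
# Hypothetical helper; the JAR path in the example is illustrative.

def pipeline_jars_value(paths):
    """Turn local JAR paths into the 'file:///a.jar;file:///b.jar' form."""
    return ";".join(f"file://{p}" for p in paths)

jars = pipeline_jars_value([
    "/opt/flink/lib/flink-sql-connector-kafka.jar",
])
# Table API:      t_env.get_config().set("pipeline.jars", jars)
# DataStream API: env.add_jars(*jars.split(";"))
```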

&lt;h4&gt;
  
  
  PyFlink Kafka job example (DataStream API + event time + state + checkpointing)
&lt;/h4&gt;

&lt;p&gt;This example reads JSON events from Kafka, assigns event-time timestamps (with bounded out-of-orderness), maintains a per-user rolling count in keyed state, and writes an enriched event to an output topic.&lt;/p&gt;

&lt;p&gt;Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KafkaSource is built via &lt;code&gt;KafkaSource.builder()&lt;/code&gt; and requires bootstrap servers, topics, and a deserialiser.
&lt;/li&gt;
&lt;li&gt;Exactly-once Kafka sink configuration in PyFlink requires setting delivery guarantee &lt;strong&gt;and&lt;/strong&gt; a transactional ID prefix.
&lt;/li&gt;
&lt;li&gt;Checkpoint defaults can be configured in Flink config (&lt;code&gt;execution.checkpointing.*&lt;/code&gt;) and/or in code; the config keys are documented in the Flink configuration reference.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.common&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.common.serialization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleStringSchema&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.common.watermark_strategy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimestampAssigner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.common.configuration&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Configuration&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamExecutionEnvironment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KeyedProcessFunction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RuntimeContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream.state&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ValueStateDescriptor&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream.connectors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeliveryGuarantee&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream.connectors.kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;KafkaSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;KafkaOffsetsInitializer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;KafkaSink&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;KafkaRecordSerializationSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EventTimeFromJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TimestampAssigner&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Extract event_time_ms from the JSON payload.
    Expect: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:1710000000000,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,...}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# fallback: use record timestamp (ingestion) if malformed
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record_timestamp&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RollingCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KeyedProcessFunction&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;runtime_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RuntimeContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ValueStateDescriptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolling_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LONG&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runtime_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KeyedProcessFunction.Context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# emit enriched event
&lt;/span&gt;        &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolling_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;
        &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_env&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Cluster/job defaults (can also be set in config.yaml)
&lt;/span&gt;    &lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state.backend.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rocksdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution.checkpointing.dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/flink/checkpoints/realtime-sessions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution.checkpointing.interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60 s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_execution_environment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# In PyFlink, connector jars must be available; use env.add_jars(...) if needed.
&lt;/span&gt;    &lt;span class="c1"&gt;# env.add_jars("file:///opt/flink/lib/flink-connector-kafka-&amp;lt;VERSION&amp;gt;.jar")
&lt;/span&gt;
    &lt;span class="c1"&gt;# Enable checkpointing explicitly as well (jobs can override defaults)
&lt;/span&gt;    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable_checkpointing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parallelism&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_env&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;KafkaSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_topics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events.raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_group_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realtime-sessions-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_value_only_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SimpleStringSchema&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_starting_offsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KafkaOffsetsInitializer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;earliest&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;watermark_strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;for_bounded_out_of_orderness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;of_seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_timestamp_assigner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;EventTimeFromJson&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;watermark_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;watermark_strategy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka-events-raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RollingCount&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;record_serializer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;KafkaRecordSerializationSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_topic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events.enriched&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_value_serialization_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SimpleStringSchema&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;sink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;KafkaSink&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_bootstrap_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_record_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_serializer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_delivery_guarantee&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DeliveryGuarantee&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EXACTLY_ONCE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_transactional_id_prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realtime-sessions-txn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sink_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sink&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realtime-sessions-pyflink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API calls above follow PyFlink’s KafkaSource builder pattern and set its required fields.&lt;br&gt;&lt;br&gt;
For delivery guarantees, PyFlink’s KafkaSinkBuilder documentation is explicit: with &lt;code&gt;DeliveryGuarantee.EXACTLY_ONCE&lt;/code&gt; you must also set a transactional ID prefix.&lt;br&gt;&lt;br&gt;
For event time, Flink’s watermark documentation describes timestamp assignment and watermark generation as the mechanism behind event-time processing, and PyFlink’s WatermarkStrategy API mirrors that model. &lt;/p&gt;
&lt;h3&gt;Go integration: Kafka producer/consumer + Flink REST job submission&lt;/h3&gt;

&lt;p&gt;Go does not have a native Flink job authoring API like Java/Python, so Go systems typically integrate with Flink through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka (or other brokers) as ingestion/egress.&lt;/li&gt;
&lt;li&gt;The Flink REST API for operational actions (uploading JARs, starting jobs, querying job status, triggering savepoints, rescaling).&lt;/li&gt;
&lt;/ul&gt;
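&lt;p&gt;As a small illustration of the second point, here is a hedged sketch of querying job status from Go. It assumes a JobManager reachable at &lt;code&gt;http://localhost:8081&lt;/code&gt; and decodes only a few fields of the &lt;code&gt;GET /jobs/overview&lt;/code&gt; response; the &lt;code&gt;JobStatus&lt;/code&gt; type and helper names are illustrative, not from any library.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// JobStatus holds the fields we care about from GET /jobs/overview.
type JobStatus struct {
	JID   string `json:"jid"`
	Name  string `json:"name"`
	State string `json:"state"`
}

// parseJobsOverview decodes the {"jobs":[...]} payload returned by /jobs/overview.
func parseJobsOverview(body []byte) ([]JobStatus, error) {
	var payload struct {
		Jobs []JobStatus `json:"jobs"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return nil, err
	}
	return payload.Jobs, nil
}

// fetchJobs queries a running JobManager, e.g. fetchJobs("http://localhost:8081").
func fetchJobs(baseURL string) ([]JobStatus, error) {
	resp, err := http.Get(baseURL + "/jobs/overview")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseJobsOverview(body)
}

func main() {
	// Deterministic demo on a sample payload; point fetchJobs at your cluster instead.
	sample := []byte(`{"jobs":[{"jid":"a1b2","name":"realtime-sessions","state":"RUNNING"}]}`)
	jobs, err := parseJobsOverview(sample)
	if err != nil {
		panic(err)
	}
	for _, j := range jobs {
		fmt.Printf("job %s (%s): %s\n", j.JID, j.Name, j.State)
	}
}
```

&lt;p&gt;Keeping the JSON decoding in a standalone function makes it easy to unit-test against a captured payload before pointing it at a real cluster.&lt;/p&gt;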

&lt;p&gt;For Kafka setup and local development patterns, see &lt;a href="https://www.glukhov.org/data-infrastructure/stream-processing/apache-kafka/" rel="noopener noreferrer"&gt;Apache Kafka Quickstart - Install Kafka 4.2 with CLI and Local Examples&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Go Kafka producer/consumer example (kafka-go)&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/segmentio/kafka-go"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Producer: write raw events&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Writer&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Addr&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;         &lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka:9092"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Topic&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;"events.raw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;RequiredAcks&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequireAll&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user:u1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`{"user_id":"u1","event_time_ms":1710000000000,"event":"click"}`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Consumer: read enriched events&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReaderConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Brokers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"kafka:9092"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;Topic&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"events.enriched"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;GroupID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"go-debug-consumer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MinBytes&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MaxBytes&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"enriched key=%s value=%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is “plumbing” code, but it’s the most common practical integration surface: Kafka topics are the boundary between Flink and custom services.&lt;/p&gt;
&lt;h4&gt;Flink REST API: upload and run jobs from Go&lt;/h4&gt;

&lt;p&gt;Flink’s REST API is part of the JobManager web server and listens on port &lt;code&gt;8081&lt;/code&gt; by default (configurable via &lt;code&gt;rest.port&lt;/code&gt;). &lt;/p&gt;
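&lt;p&gt;Before uploading anything, it is worth verifying that the endpoint is reachable. A hedged sketch, assuming the default &lt;code&gt;http://localhost:8081&lt;/code&gt; base URL and the JobManager’s &lt;code&gt;GET /config&lt;/code&gt; endpoint (the one the web UI uses to report the Flink version); the &lt;code&gt;parseClusterConfig&lt;/code&gt; helper is illustrative, not a library function.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterConfig holds a couple of fields from GET /config on the JobManager.
type clusterConfig struct {
	FlinkVersion    string `json:"flink-version"`
	RefreshInterval int    `json:"refresh-interval"`
}

// parseClusterConfig decodes the /config payload.
func parseClusterConfig(body []byte) (clusterConfig, error) {
	var c clusterConfig
	err := json.Unmarshal(body, &c)
	return c, err
}

func main() {
	// In practice the body comes from http.Get("http://localhost:8081/config");
	// a sample payload keeps this demo deterministic.
	sample := []byte(`{"refresh-interval":3000,"flink-version":"1.20.0"}`)
	c, err := parseClusterConfig(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("Flink version:", c.FlinkVersion)
}
```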

&lt;p&gt;The official OpenAPI spec for the dispatcher includes &lt;code&gt;/jars/upload&lt;/code&gt; and explicitly states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the JAR must be sent as &lt;strong&gt;multi-part&lt;/strong&gt; form data;&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;Content-Type&lt;/code&gt; header must be set to &lt;code&gt;application/x-java-archive&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;a curl example is provided using &lt;code&gt;-F jarfile=@path/to/flink-job.jar&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical Go snippet to upload a JAR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;flink&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"bytes"&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"io"&lt;/span&gt;
    &lt;span class="s"&gt;"mime/multipart"&lt;/span&gt;
    &lt;span class="s"&gt;"net/http"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;UploadJar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flinkBaseURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jarPath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jarPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;multipart&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewWriter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateFormFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jarfile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"job.jar"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequestWithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MethodPost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flinkBaseURL&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"/jars/upload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Important: multipart boundary&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FormDataContentType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c"&gt;// Some clients also set "Expect:" similarly to the curl example in the spec.&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Expect"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"upload failed: %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is guided by the REST API OpenAPI description for &lt;code&gt;/jars/upload&lt;/code&gt; including its multipart requirement and curl reference. &lt;/p&gt;

&lt;p&gt;To run a previously uploaded JAR, Flink exposes &lt;code&gt;/jars/{jarid}/run&lt;/code&gt; and supports passing program args via query parameters (and/or JSON body). &lt;/p&gt;

&lt;p&gt;Operationally valuable endpoints you’ll likely automate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/jobs&lt;/code&gt; and &lt;code&gt;/jobs/{jobid}&lt;/code&gt; to list and inspect job state
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/jobs/{jobid}/savepoints&lt;/code&gt; to trigger savepoints (async trigger + polling)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/jobs/{jobid}/rescaling&lt;/code&gt; to trigger rescaling
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code snippets comparison table: PyFlink vs Go in a Flink-based platform
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;PyFlink (Python jobs)&lt;/th&gt;
&lt;th&gt;Go (services around Flink)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Authoring Flink logic&lt;/td&gt;
&lt;td&gt;Native authoring via DataStream/Table APIs; supports state + timers&lt;/td&gt;
&lt;td&gt;No native Flink API; implement logic in Flink (Java/Python) and integrate externally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connectors/dependencies&lt;/td&gt;
&lt;td&gt;Must ship connector JARs via &lt;code&gt;pipeline.jars&lt;/code&gt;, &lt;code&gt;add_jars&lt;/code&gt;, or &lt;code&gt;--jarfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Not applicable (you’re not running inside Flink), but you manage Kafka/DB clients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion/egress&lt;/td&gt;
&lt;td&gt;KafkaSource/KafkaSink builders in PyFlink&lt;/td&gt;
&lt;td&gt;Kafka producer/consumer libraries; standard microservice patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ops automation&lt;/td&gt;
&lt;td&gt;Can call Flink REST endpoints too&lt;/td&gt;
&lt;td&gt;Often owns automation: upload JAR, deploy, rescale, trigger savepoint via REST&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  DevOps guide: monitoring, scaling, backups, and CI/CD for Apache Flink
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Monitoring Apache Flink in Kubernetes and on VMs
&lt;/h3&gt;

&lt;p&gt;Flink supports exporting metrics by configuring &lt;strong&gt;metric reporters&lt;/strong&gt; in the Flink configuration file; these reporters are instantiated on JobManager and TaskManagers. &lt;/p&gt;

&lt;p&gt;For Prometheus, Flink exposes Prometheus-format metrics when configured with &lt;code&gt;metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory&lt;/code&gt; (in Flink versions that ship the Prometheus reporter).&lt;/p&gt;

&lt;p&gt;You generally combine that with Kubernetes ServiceMonitors (Prometheus Operator) or with your managed monitoring stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling: parallelism, slots, and operator-based autoscaling
&lt;/h3&gt;

&lt;p&gt;Flink’s scheduling model defines execution resources via task slots, and each slot can run a pipeline of parallel tasks. &lt;/p&gt;

&lt;p&gt;For manual scaling, the REST API provides a rescaling endpoint for a job (&lt;code&gt;/jobs/{jobid}/rescaling&lt;/code&gt;) as an async operation. &lt;/p&gt;

&lt;p&gt;If you’re on Kubernetes with the Flink Kubernetes Operator, the operator project advertises a “Flink Job Autoscaler” as part of its feature set, which is worth evaluating if your workloads vary substantially. &lt;/p&gt;

&lt;h3&gt;
  
  
  Backups and safe upgrades: checkpoints and savepoints
&lt;/h3&gt;

&lt;p&gt;Checkpoints are for automated recovery and are managed by Flink; savepoints are for user-driven lifecycle operations (stop/resume/fork/upgrade). &lt;/p&gt;

&lt;p&gt;From an SRE standpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use checkpoints for “keep the pipeline running through failures”.&lt;/li&gt;
&lt;li&gt;Use savepoints for “deploy a new version without losing state”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flink’s REST API also supports triggering savepoints asynchronously, which is useful for GitOps-style “deploy → trigger savepoint → upgrade” workflows. &lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD: GitOps + Helm + REST job submission
&lt;/h3&gt;

&lt;p&gt;For Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep the operator installation and your FlinkDeployment CRs in Git, deploy via Argo CD/Flux, and version container images per build. The operator Helm docs explicitly discuss “Working with Argo CD”. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For standalone/session clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the Flink REST API JAR upload and run endpoints for immutable artefact deployments. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also note a subtle but valuable security/ops toggle: &lt;code&gt;web.submit.enable&lt;/code&gt; governs uploads via the Web UI, but the docs note that even when disabled, session clusters still accept job submissions through REST requests; this is relevant when hardening UI surfaces while retaining CI/CD automation. &lt;/p&gt;

&lt;h2&gt;
  
  
  LLM/AI integration patterns with Apache Flink for real-time pipelines
&lt;/h2&gt;

&lt;p&gt;LLM systems are often only as good as their real-time context. Flink fits into LLM/AI stacks as the component that produces “always fresh” features, embeddings, and behavioural aggregates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time embeddings pipeline with Flink
&lt;/h3&gt;

&lt;p&gt;A common pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest user actions/events,&lt;/li&gt;
&lt;li&gt;aggregate sessions and preferences,&lt;/li&gt;
&lt;li&gt;produce embedding-generation tasks,&lt;/li&gt;
&lt;li&gt;write embeddings to a vector store and/or feature store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PyFlink’s dependency management documentation explicitly calls out “machine learning prediction” and loading ML models inside Python UDFs (for remote cluster execution), which maps directly to “online inference inside Flink operators” approaches. &lt;/p&gt;

&lt;h3&gt;
  
  
  Online feature store updates for recommendation and ranking
&lt;/h3&gt;

&lt;p&gt;Flink’s keyed state and checkpointing model is built to maintain operator state across events and recover it reliably. That’s a natural match for continuous feature computation (rolling rates, counts, time-decayed metrics) that downstream recommenders need. &lt;/p&gt;

&lt;h3&gt;
  
  
  Practical latency/consistency trade-offs for AI pipelines
&lt;/h3&gt;

&lt;p&gt;If your architecture requires exactly-once semantics end-to-end (e.g., avoid duplicate feature updates or duplicate billing events), you’ll structure sinks and sources around checkpointing and transactional guarantees.&lt;/p&gt;

&lt;p&gt;In Kafka-based stacks specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flink’s Kafka connector can deliver exactly-once guarantees when checkpointing is enabled and delivery guarantee options are configured.
&lt;/li&gt;
&lt;li&gt;Kafka Streams also supports exactly-once semantics (EOS), which is relevant if your “AI feature pipeline” is small enough to live inside application code rather than a Flink cluster.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture view for “Flink as the real-time AI context builder”
&lt;/h3&gt;

&lt;p&gt;This architecture view is grounded in Flink’s core primitives: event-time processing (watermarks), state backends (&lt;code&gt;state.backend.type&lt;/code&gt; and system-managed local state), and checkpoint/savepoint mechanisms for fault tolerance and operations.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>go</category>
      <category>devops</category>
      <category>microservices</category>
    </item>
    <item>
      <title>IndexNow explained - notify search engines when you publish</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:51:42 +0000</pubDate>
      <link>https://forem.com/rosgluk/indexnow-explained-notify-search-engines-when-you-publish-4dim</link>
      <guid>https://forem.com/rosgluk/indexnow-explained-notify-search-engines-when-you-publish-4dim</guid>
      <description>&lt;p&gt;Static sites and blogs change whenever you deploy. &lt;a href="https://www.indexnow.org/" rel="noopener noreferrer"&gt;Search engines&lt;/a&gt; that support &lt;strong&gt;IndexNow&lt;/strong&gt; can learn about those changes without waiting for the next blind crawl.&lt;/p&gt;

&lt;p&gt;This page covers why that matters, what the protocol does, and how to wire it into a real workflow, including patterns you can reuse in your own automation or a small Go CLI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use IndexNow on a static or Hugo site
&lt;/h2&gt;

&lt;p&gt;If you &lt;a href="https://www.glukhov.org/web-infrastructure/hugo/deploy-hugo-s3/" rel="noopener noreferrer"&gt;deploy Hugo to S3 or similar&lt;/a&gt; you already ship HTML and a &lt;code&gt;sitemap.xml&lt;/code&gt;. Crawlers will eventually read the sitemap but &lt;strong&gt;timing is not under your control&lt;/strong&gt;. After a migration or a batch of new posts you care about &lt;strong&gt;fresh indexing&lt;/strong&gt; more than "sometime next week."&lt;/p&gt;

&lt;p&gt;IndexNow is a &lt;strong&gt;push&lt;/strong&gt; channel. You POST a list of canonical URLs you care about. Participating engines (including Microsoft Bing and others listed on &lt;a href="https://www.indexnow.org/" rel="noopener noreferrer"&gt;indexnow.org&lt;/a&gt;) can prioritize fetching those URLs. It does not replace good URLs, redirects, or internal linking, but it closes the loop between &lt;strong&gt;git push&lt;/strong&gt; and &lt;strong&gt;search engine awareness&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What IndexNow does
&lt;/h2&gt;

&lt;p&gt;At a high level, each submission is an HTTPS &lt;strong&gt;POST&lt;/strong&gt; whose JSON body conceptually contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;host&lt;/code&gt;&lt;/strong&gt; - your site hostname (for example &lt;code&gt;www.example.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;key&lt;/code&gt;&lt;/strong&gt; - your pre-generated secret string&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;keyLocation&lt;/code&gt;&lt;/strong&gt; (optional) - full URL of the verification file if it is not at the default path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;urlList&lt;/code&gt;&lt;/strong&gt; - one or more absolute URLs on that host you want to signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engines reject bad keys, wrong hosts, or malformed payloads. Success is usually HTTP 200 or 202, depending on the endpoint.&lt;/p&gt;

&lt;p&gt;You can read the full rules and partner list on the official site. The important mental model is &lt;strong&gt;domain ownership proof via a text file&lt;/strong&gt; plus &lt;strong&gt;explicit URL list&lt;/strong&gt; not keywords or page content.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to prepare your site
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key file and hostname
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick a key&lt;/strong&gt; - a long random string (treat it like a secret).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; &lt;code&gt;https://your-domain/&amp;lt;key&amp;gt;.txt&lt;/code&gt; with &lt;strong&gt;only&lt;/strong&gt; the key as the body (one line).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the same key&lt;/strong&gt; in your CLI or automation when you POST.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Submit only URLs&lt;/strong&gt; on that host that you want recrawled (new posts, updated pages, or redirect targets).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After you move many URLs at once, you may want to batch-notify them. IndexNow accepts multiple URLs in one request, subject to each engine’s limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ways to submit URLs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual POST&lt;/strong&gt; - fine for debugging; use &lt;code&gt;curl&lt;/code&gt; with a JSON body.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugins&lt;/strong&gt; - some CMS and hosting panels include IndexNow toggles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your deploy script&lt;/strong&gt; - after &lt;code&gt;hugo&lt;/code&gt; builds and the upload completes, call a small binary with the list of changed URLs or your sitemap URL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a &lt;strong&gt;Hugo&lt;/strong&gt; workflow, the natural triggers are "after build" or "after sync to bucket." Pass &lt;strong&gt;full HTTPS URLs&lt;/strong&gt; that match your live site, including &lt;code&gt;www&lt;/code&gt; vs apex if that is what you serve.&lt;/p&gt;

&lt;h2&gt;
  
  
  A small Go CLI (optional)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Features you might implement
&lt;/h3&gt;

&lt;p&gt;A minimal &lt;strong&gt;Go&lt;/strong&gt; command-line tool fits IndexNow well because the payload is a small JSON &lt;strong&gt;POST&lt;/strong&gt; and you can wire it into deploy scripts. A typical design includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single or multiple URLs&lt;/strong&gt; as positional arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--sitemap&lt;/code&gt;&lt;/strong&gt; to fetch a &lt;code&gt;sitemap.xml&lt;/code&gt; and submit every &lt;code&gt;&amp;lt;loc&amp;gt;&lt;/code&gt; (with optional &lt;strong&gt;&lt;code&gt;--limit&lt;/code&gt;&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Several engines in parallel&lt;/strong&gt; via &lt;strong&gt;&lt;code&gt;--engines&lt;/code&gt;&lt;/strong&gt; (for example &lt;code&gt;indexnow&lt;/code&gt; for the global aggregator, or per-provider endpoints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flags or environment variables&lt;/strong&gt; such as &lt;code&gt;INDEXNOW_KEY&lt;/code&gt;, &lt;code&gt;INDEXNOW_WEBSITE_URL&lt;/code&gt;, and &lt;code&gt;INDEXNOW_ENGINES&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbose output&lt;/strong&gt; with &lt;code&gt;-v&lt;/code&gt; for debugging 403 or 422 responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build with &lt;code&gt;go build&lt;/code&gt; or &lt;code&gt;go install&lt;/code&gt;, install the binary on your &lt;code&gt;PATH&lt;/code&gt;, then call it after publish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;indexnow &lt;span class="nt"&gt;--key&lt;/span&gt; YOUR_KEY &lt;span class="nt"&gt;--website&lt;/span&gt; https://www.example.com https://www.example.com/new-post/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a full site refresh after deploy you can pass &lt;strong&gt;&lt;code&gt;--sitemap&lt;/code&gt;&lt;/strong&gt; with your public sitemap URL. Document response codes and engine lists in your own README, and keep a &lt;strong&gt;publish-then-index&lt;/strong&gt; shell snippet next to whatever triggers your static deploy.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.glukhov.org/ai-devtools/opencode/llms-comparison/" rel="noopener noreferrer"&gt;Best LLMs for OpenCode - tested locally&lt;/a&gt; post used "implement an IndexNow notifier in Go" as a coding benchmark - useful if you want to see how different models handle the same spec and structured tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prefer the &lt;strong&gt;&lt;code&gt;indexnow&lt;/code&gt;&lt;/strong&gt; engine target when you want one submission to fan out through the shared infrastructure (see &lt;a href="https://www.indexnow.org/searchengines.json" rel="noopener noreferrer"&gt;searchengines.json&lt;/a&gt; and mirror that list in your own client if you support multiple endpoints).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;429&lt;/strong&gt; means slow down. &lt;strong&gt;403&lt;/strong&gt; usually means key or host mismatch. Fix the key file location or hostname first.&lt;/li&gt;
&lt;li&gt;IndexNow does &lt;strong&gt;not&lt;/strong&gt; replace &lt;strong&gt;301 redirects&lt;/strong&gt; when you rename paths. Notify after redirects are live.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  See also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/web-infrastructure/" rel="noopener noreferrer"&gt;Web Infrastructure&lt;/a&gt; - the full cluster for static site deployment and indexing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-devtools/opencode/llms-comparison/" rel="noopener noreferrer"&gt;Best LLMs for OpenCode - tested locally&lt;/a&gt; - includes a real-world coding benchmark around this protocol&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/web-infrastructure/hugo/deploy-hugo-s3/" rel="noopener noreferrer"&gt;Deploy Hugo site to AWS S3&lt;/a&gt; - deploy flow where post-publish hooks fit&lt;/li&gt;
&lt;li&gt;Official protocol - &lt;a href="https://www.indexnow.org/" rel="noopener noreferrer"&gt;indexnow.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hugo</category>
      <category>seo</category>
      <category>devops</category>
      <category>selfhosting</category>
    </item>
    <item>
      <title>Hosted email for custom domains compared - Workspace, Microsoft 365, Zoho, Proton, WorkMail</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:51:40 +0000</pubDate>
      <link>https://forem.com/rosgluk/hosted-email-for-custom-domains-compared-workspace-microsoft-365-zoho-proton-workmail-14a7</link>
      <guid>https://forem.com/rosgluk/hosted-email-for-custom-domains-compared-workspace-microsoft-365-zoho-proton-workmail-14a7</guid>
      <description>&lt;p&gt;Putting &lt;strong&gt;email on your own domain&lt;/strong&gt; sounds like a weekend DNS task. In practice it is a small distributed system with a twenty-year legacy.&lt;/p&gt;

&lt;p&gt;MX routes inbound mail. &lt;strong&gt;SPF&lt;/strong&gt; lists who may send for your domain. &lt;strong&gt;DKIM&lt;/strong&gt; signs messages. &lt;strong&gt;DMARC&lt;/strong&gt; ties the story together and tells receivers what to do when something does not line up. Skip or miswire any layer and you get silent failures, spam-folder burial, or both.&lt;/p&gt;

&lt;p&gt;This article compares the five hosted options I see engineers actually choose - &lt;strong&gt;Google Workspace&lt;/strong&gt;, &lt;strong&gt;Microsoft 365&lt;/strong&gt;, &lt;strong&gt;Zoho Mail&lt;/strong&gt;, &lt;strong&gt;Proton Mail&lt;/strong&gt;, and &lt;strong&gt;AWS WorkMail&lt;/strong&gt; - with rough &lt;strong&gt;USD&lt;/strong&gt; prices, setup friction, and honest &lt;strong&gt;when to pick what&lt;/strong&gt; advice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why providers beat rolling your own
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; run Postfix or Exim on a VPS. You &lt;em&gt;should not&lt;/em&gt; unless you sell email infrastructure. &lt;strong&gt;IP reputation&lt;/strong&gt;, &lt;strong&gt;PTR&lt;/strong&gt;, &lt;strong&gt;bounce handling&lt;/strong&gt;, &lt;strong&gt;backscatter&lt;/strong&gt;, and &lt;strong&gt;greylisting&lt;/strong&gt; eat weekends. Managed hosts amortize that pain across millions of mailboxes. For a personal domain or a small team, &lt;strong&gt;paying a few dollars per seat&lt;/strong&gt; is cheaper than your hourly rate debugging why Gmail drops your replies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Workspace
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; - roughly &lt;strong&gt;6-12 USD&lt;/strong&gt; per user per month depending on tier and region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get&lt;/strong&gt; - Gmail UI, strong &lt;strong&gt;spam filtering&lt;/strong&gt;, predictable &lt;strong&gt;deliverability&lt;/strong&gt;, shared calendar and drive if you use them, and admin tooling that "just works" for most SMB setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt; - The default when you want &lt;strong&gt;reliability over novelty&lt;/strong&gt;. Setup is boring in the good way. If your only goal is "&lt;code&gt;you@yourdomain.com&lt;/code&gt; works forever," start here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Microsoft 365
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; - roughly &lt;strong&gt;8-22 USD&lt;/strong&gt; per user per month depending on bundle (often bundled with Office apps).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get&lt;/strong&gt; - Outlook, deep &lt;strong&gt;enterprise&lt;/strong&gt; features, Entra ID (formerly Azure AD) integration, and a natural fit if your company already lives in &lt;strong&gt;Excel&lt;/strong&gt;, &lt;strong&gt;Teams&lt;/strong&gt;, and &lt;strong&gt;SharePoint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt; - Match made in heaven for Microsoft-heavy shops. Slightly heavier admin surface than Google for tiny teams, but &lt;strong&gt;reliability&lt;/strong&gt; and &lt;strong&gt;deliverability&lt;/strong&gt; are solid once DNS is correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zoho Mail
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; - &lt;strong&gt;free&lt;/strong&gt; tiers exist for tiny teams; paid plans often land around &lt;strong&gt;1-5 USD&lt;/strong&gt; per user per month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get&lt;/strong&gt; - Straightforward &lt;strong&gt;hosted mail&lt;/strong&gt;, control panels that get the job done, fewer bells than the giants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt; - Best when &lt;strong&gt;budget&lt;/strong&gt; matters more than polish. Fine for side projects and low-stakes mail. Expect more rough edges in UX and occasionally &lt;strong&gt;messier&lt;/strong&gt; deliverability stories than Workspace or Microsoft.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proton Mail
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; - roughly &lt;strong&gt;5-10 USD&lt;/strong&gt; per user per month for paid plans with custom domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get&lt;/strong&gt; - &lt;strong&gt;Privacy-first&lt;/strong&gt; design, &lt;strong&gt;encryption&lt;/strong&gt; oriented workflows, Swiss jurisdiction story that matters to some buyers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt; - Choose Proton when &lt;strong&gt;privacy&lt;/strong&gt; is a requirement, not a vibe. You may accept more &lt;strong&gt;integration friction&lt;/strong&gt; (calendar, third-party clients, automation) than with mainstream suites.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS WorkMail
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; - list pricing is modest (on the order of &lt;strong&gt;4 USD&lt;/strong&gt; per user per month) before add-ons and data transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get&lt;/strong&gt; - Mailboxes in &lt;strong&gt;AWS&lt;/strong&gt;, IAM-flavored admin, and hooks into the broader AWS universe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt; - &lt;strong&gt;Paper&lt;/strong&gt; looks cheap. &lt;strong&gt;Reality&lt;/strong&gt; includes &lt;strong&gt;SES&lt;/strong&gt; identities, &lt;strong&gt;receipt rules&lt;/strong&gt;, routing surprises, and debugging sessions that belong in a ticket queue. Pick WorkMail only when &lt;strong&gt;AWS-native&lt;/strong&gt; email is a product requirement. Otherwise you trade dollars for &lt;strong&gt;engineering hours&lt;/strong&gt; you will not get back.&lt;/p&gt;

&lt;h2&gt;
  
  
  At a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Rough $/user/mo&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google Workspace&lt;/td&gt;
&lt;td&gt;6-12&lt;/td&gt;
&lt;td&gt;Easiest&lt;/td&gt;
&lt;td&gt;Default choice, minimum drama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft 365&lt;/td&gt;
&lt;td&gt;8-22&lt;/td&gt;
&lt;td&gt;Easy-medium&lt;/td&gt;
&lt;td&gt;Microsoft-centric orgs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zoho Mail&lt;/td&gt;
&lt;td&gt;1-5 or free tier&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Tight budgets, side projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proton Mail&lt;/td&gt;
&lt;td&gt;5-10&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Privacy requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS WorkMail&lt;/td&gt;
&lt;td&gt;~4 + AWS overhead&lt;/td&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;Deep AWS integration (rare cases)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prices move with region, currency, and promotions - treat the table as &lt;strong&gt;order-of-magnitude&lt;/strong&gt;, not a quote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Want it to work and move on&lt;/strong&gt; → &lt;strong&gt;Google Workspace&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Already on Office and Entra&lt;/strong&gt; → &lt;strong&gt;Microsoft 365&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Money is tight and mail is not mission-critical&lt;/strong&gt; → &lt;strong&gt;Zoho&lt;/strong&gt; (eyes open on deliverability).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threat model or policy demands privacy&lt;/strong&gt; → &lt;strong&gt;Proton&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need mail inside AWS for architectural reasons you can defend in a design review&lt;/strong&gt; → &lt;strong&gt;WorkMail&lt;/strong&gt;; otherwise skip.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Email rewards &lt;strong&gt;boring&lt;/strong&gt; choices. The clever move is rarely "my own Postfix box at a discount VPS." It is picking a &lt;strong&gt;reputable host&lt;/strong&gt;, nailing &lt;strong&gt;DNS&lt;/strong&gt; (see the &lt;a href="https://www.glukhov.org/web-infrastructure/" rel="noopener noreferrer"&gt;web infrastructure cluster&lt;/a&gt; for more on DNS), and spending your creativity on software that is not SMTP.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>hosting</category>
    </item>
    <item>
      <title>SGLang QuickStart: Install, Configure, and Serve LLMs via OpenAI API</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 22 Mar 2026 12:19:25 +0000</pubDate>
      <link>https://forem.com/rosgluk/sglang-quickstart-install-configure-and-serve-llms-via-openai-api-3l77</link>
      <guid>https://forem.com/rosgluk/sglang-quickstart-install-configure-and-serve-llms-via-openai-api-3l77</guid>
      <description>&lt;p&gt;SGLang is a high-performance serving framework for large language models and multimodal models, built to deliver low-latency and high-throughput inference across everything from a single GPU to distributed clusters.&lt;/p&gt;

&lt;p&gt;For a broader comparison of self-hosted and cloud LLM hosting options — including Ollama, vLLM, llama-swap, LocalAI, and managed cloud providers — see the &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM hosting guide for 2026&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you already have apps wired to the OpenAI API shape, SGLang is especially appealing because it can expose OpenAI-compatible endpoints for chat completions and completions, helping you migrate from hosted APIs to self-hosted models with minimal client-side changes. When you need to route requests across multiple backends (llama.cpp, vLLM, SGLang, etc.) with hot-swap and TTL-based unloading, &lt;a href="https://www.glukhov.org/llm-hosting/llama-swap/" rel="noopener noreferrer"&gt;llama-swap&lt;/a&gt; provides a transparent proxy layer that keeps a single &lt;code&gt;/v1&lt;/code&gt; URL stable while swapping upstreams on demand.&lt;/p&gt;

&lt;p&gt;This QuickStart walks through installation (multiple methods), practical configuration patterns, and a clean "install → serve → verify → integrate → tune" workflow, with working examples for both HTTP serving and offline batch inference.&lt;/p&gt;

&lt;p&gt;If you need multimodal support (text, embeddings, images, audio) with a built-in Web UI and maximum OpenAI API drop-in compatibility, &lt;a href="https://www.glukhov.org/llm-hosting/local-ai/" rel="noopener noreferrer"&gt;LocalAI&lt;/a&gt; offers a broader feature set with more model format support.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is SGLang for high-throughput LLM and multimodal model serving
&lt;/h2&gt;

&lt;p&gt;At its core, SGLang is designed for efficient inference and scalable serving. The “fast runtime” stack includes RadixAttention for prefix caching, a zero-overhead CPU scheduler, speculative decoding, continuous batching, paged attention, multiple parallelism strategies (tensor, pipeline, expert, data parallelism), structured outputs, chunked prefill, and multiple quantisation options (for example FP4, FP8, INT4, AWQ, GPTQ).&lt;/p&gt;

&lt;p&gt;It targets broad-platform deployment: NVIDIA GPUs, AMD GPUs, Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.&lt;/p&gt;

&lt;p&gt;The PyPI package requires Python 3.10 or newer. As of 20 March 2026, the latest published release was 0.5.9 (released 23 February 2026); pin a specific version, or check for the current one when installing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to install SGLang on Linux GPU hosts with uv pip, source builds, or Docker
&lt;/h2&gt;

&lt;p&gt;Install options include uv or pip, source builds, Docker images, Kubernetes manifests, Docker Compose, SkyPilot, and AWS SageMaker. Most walkthroughs assume common NVIDIA GPU setups; other accelerators have their own setup notes elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install SGLang quickly with uv or pip on Python 3.10+
&lt;/h3&gt;

&lt;p&gt;For a straightforward local install, &lt;strong&gt;uv&lt;/strong&gt; is usually the fastest path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install &lt;/span&gt;uv
uv pip &lt;span class="nb"&gt;install &lt;/span&gt;sglang
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  CUDA 13 notes
&lt;/h4&gt;

&lt;p&gt;For CUDA 13, Docker is the easiest route because it avoids host-side PyTorch/CUDA mismatches. Without Docker, install a CUDA 13 PyTorch build first, then &lt;code&gt;sglang&lt;/code&gt;, then the matching &lt;code&gt;sglang-kernel&lt;/code&gt; wheel from the published wheel releases (the kernel version must match the rest of the stack).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1) Install PyTorch with CUDA 13 support (replace X.Y.Z as needed)&lt;/span&gt;
uv pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;X.Y.Z torchvision torchaudio &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu130

&lt;span class="c"&gt;# 2) Install SGLang&lt;/span&gt;
uv pip &lt;span class="nb"&gt;install &lt;/span&gt;sglang

&lt;span class="c"&gt;# 3) Install the matching CUDA 13 sglang-kernel wheel (replace X.Y.Z)&lt;/span&gt;
uv pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sglang_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_x86_64.whl"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install and run SGLang with Docker Hub images
&lt;/h3&gt;

&lt;p&gt;For containerised deployments—or to sidestep host CUDA/PyTorch pairing—use the published Docker Hub images. A typical &lt;code&gt;docker run&lt;/code&gt; mounts the Hugging Face cache and passes &lt;code&gt;HF_TOKEN&lt;/code&gt; when pulling gated models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--shm-size&lt;/span&gt; 32g &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 30000:30000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.cache/huggingface:/root/.cache/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="s2"&gt;"HF_TOKEN=&amp;lt;secret&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  lmsysorg/sglang:latest &lt;span class="se"&gt;\&lt;/span&gt;
  python3 &lt;span class="nt"&gt;-m&lt;/span&gt; sglang.launch_server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model-path&lt;/span&gt; meta-llama/Llama-3.1-8B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--port&lt;/span&gt; 30000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production-style images, &lt;strong&gt;&lt;code&gt;latest-runtime&lt;/code&gt;&lt;/strong&gt; drops build tools and dev dependencies, so the image stays much smaller than the default &lt;code&gt;latest&lt;/code&gt; variant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--shm-size&lt;/span&gt; 32g &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 30000:30000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.cache/huggingface:/root/.cache/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env&lt;/span&gt; &lt;span class="s2"&gt;"HF_TOKEN=&amp;lt;secret&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  lmsysorg/sglang:latest-runtime &lt;span class="se"&gt;\&lt;/span&gt;
  python3 &lt;span class="nt"&gt;-m&lt;/span&gt; sglang.launch_server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model-path&lt;/span&gt; meta-llama/Llama-3.1-8B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--port&lt;/span&gt; 30000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install from source and other deployment methods
&lt;/h3&gt;

&lt;p&gt;To develop against SGLang or carry local patches, clone a release branch and install the Python package in editable mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;-b&lt;/span&gt; v0.5.9 https://github.com/sgl-project/sglang.git
&lt;span class="nb"&gt;cd &lt;/span&gt;sglang

pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For orchestration, the repo includes Kubernetes manifests (single- and multi-node) and a minimal Docker Compose layout—reasonable starting points before custom wiring.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to configure SGLang server arguments with YAML config files and environment variables
&lt;/h2&gt;

&lt;p&gt;SGLang configuration is driven by server arguments and environment variables. Flags cover model selection, parallelism, memory, and optimisation knobs; the full set is listed with &lt;code&gt;python3 -m sglang.launch_server --help&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Environment variables use two prefixes: &lt;code&gt;SGL_&lt;/code&gt; and &lt;code&gt;SGLANG_&lt;/code&gt; (many flags accept either CLI or env form—&lt;code&gt;launch_server --help&lt;/code&gt; shows the mapping).&lt;/p&gt;

&lt;p&gt;Some commonly relevant env vars include host and port controls such as &lt;code&gt;SGLANG_HOST_IP&lt;/code&gt; and &lt;code&gt;SGLANG_PORT&lt;/code&gt;.&lt;/p&gt;
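&lt;p&gt;As a quick sketch (a hypothetical wrapper, not SGLang code; the exact variable set varies by version, so check &lt;code&gt;launch_server --help&lt;/code&gt;), these variables can be set from Python before starting the server:&lt;/p&gt;

```python
import os

# Hypothetical wrapper sketch: set SGLang env vars before launching the server.
# SGLANG_HOST_IP / SGLANG_PORT are read by the server process at startup;
# CLI flags such as --host/--port may interact with them, so check --help.
os.environ["SGLANG_HOST_IP"] = "0.0.0.0"
os.environ["SGLANG_PORT"] = "30000"

print(os.environ["SGLANG_PORT"])
```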

&lt;h3&gt;
  
  
  Use a YAML config file for reproducible SGLang server launches
&lt;/h3&gt;

&lt;p&gt;For repeatable deployments and shorter command lines, pass a YAML file with &lt;code&gt;--config&lt;/code&gt;. CLI arguments override values from the file when both set the same option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create config.yaml&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; config.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
model-path: meta-llama/Meta-Llama-3-8B-Instruct
host: 0.0.0.0
port: 30000
tensor-parallel-size: 2
enable-metrics: true
log-requests: true
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Launch server with config file&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; sglang.launch_server &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few configuration and tuning essentials to keep in mind:&lt;/p&gt;

&lt;p&gt;SGLang's &lt;code&gt;--model-path&lt;/code&gt; can point to a local folder or a Hugging Face repo ID, which makes it easy to switch between local weights and Hub-hosted models without changing your serving code.&lt;/p&gt;

&lt;p&gt;For multi-GPU, enable tensor parallelism with &lt;code&gt;--tp&lt;/code&gt;. If startup fails with &lt;strong&gt;“peer access is not supported between these two devices”&lt;/strong&gt;, add &lt;code&gt;--enable-p2p-check&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If serving hits OOM, reduce KV cache pressure with a smaller &lt;code&gt;--mem-fraction-static&lt;/code&gt; (default is &lt;strong&gt;&lt;code&gt;0.9&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;If long prompts OOM during prefill, lower &lt;code&gt;--chunked-prefill-size&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run an OpenAI-compatible SGLang server and call it from the OpenAI Python client
&lt;/h2&gt;

&lt;p&gt;A practical "happy path" workflow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install SGLang (uv/pip or Docker).&lt;/li&gt;
&lt;li&gt;Start the server with your chosen model and port.&lt;/li&gt;
&lt;li&gt;Verify basic serving via OpenAI-compatible endpoints.&lt;/li&gt;
&lt;li&gt;Integrate your application by pointing the OpenAI SDK &lt;code&gt;base_url&lt;/code&gt; at the local server.&lt;/li&gt;
&lt;li&gt;Tune throughput and memory with server args once you have real traffic.&lt;/li&gt;
&lt;/ol&gt;
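
&lt;p&gt;For step 3, a minimal scripted health check can probe the server over HTTP (a sketch assuming the default port 30000 and the server's &lt;code&gt;/health&lt;/code&gt; route; adjust the URL to your deployment):&lt;/p&gt;

```python
from urllib.request import urlopen
from urllib.error import URLError

def server_is_up(base_url="http://127.0.0.1:30000"):
    """Return True if the server answers its /health route with HTTP 200."""
    try:
        with urlopen(base_url + "/health", timeout=5) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, etc.
        return False

if __name__ == "__main__":
    print("up" if server_is_up() else "down")
```

If this reports the server as down right after launch, the model may still be loading; retry once the startup logs say the server is ready.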

&lt;h3&gt;
  
  
  Send a local chat completion request to SGLang using OpenAI SDK
&lt;/h3&gt;

&lt;p&gt;For OpenAI-compatible usage, two details matter:&lt;/p&gt;

&lt;p&gt;The server implements the OpenAI HTTP surface and, when the tokenizer provides one, applies the Hugging Face chat template automatically. Override with &lt;code&gt;--chat-template&lt;/code&gt; at launch if needed.&lt;/p&gt;

&lt;p&gt;Point an OpenAI client at the server’s &lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt;&lt;/strong&gt; prefix (&lt;code&gt;base_url&lt;/code&gt; → &lt;code&gt;http://&amp;lt;host&amp;gt;:&amp;lt;port&amp;gt;/v1&lt;/code&gt;), then call &lt;code&gt;client.chat.completions.create(...)&lt;/code&gt; as usual.&lt;/p&gt;

&lt;p&gt;Start the server with either entrypoint: &lt;code&gt;python -m sglang.launch_server&lt;/code&gt; still works, but &lt;strong&gt;&lt;code&gt;sglang serve&lt;/code&gt;&lt;/strong&gt; is the preferred CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Recommended CLI entrypoint&lt;/span&gt;
sglang serve &lt;span class="nt"&gt;--model-path&lt;/span&gt; qwen/qwen2.5-0.5b-instruct &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 30000

&lt;span class="c"&gt;# Still supported&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; sglang.launch_server &lt;span class="nt"&gt;--model-path&lt;/span&gt; qwen/qwen2.5-0.5b-instruct &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 30000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then call it with the OpenAI Python client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:30000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen2.5-0.5b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List 3 countries and their capitals.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to run batch inference with the SGLang Offline Engine API and native endpoints
&lt;/h2&gt;

&lt;p&gt;SGLang supports multiple "API surfaces" depending on what you're building:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;/generate&lt;/code&gt;&lt;/strong&gt; endpoint is the low-level runtime API. Prefer &lt;strong&gt;&lt;code&gt;/v1/...&lt;/code&gt;&lt;/strong&gt; OpenAI-compatible routes when you want chat templates and the usual client ecosystem handled for you.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Offline Engine&lt;/strong&gt; runs inference in-process, without any HTTP server, which suits batch jobs and custom services. It supports sync/async and streaming/non-streaming combinations; pick the mode that matches your call pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example using the native /generate endpoint
&lt;/h3&gt;

&lt;p&gt;Minimal pattern: run a server, then POST &lt;code&gt;/generate&lt;/code&gt; with &lt;code&gt;temperature&lt;/code&gt; and &lt;code&gt;max_new_tokens&lt;/code&gt; (and any other sampling fields you need).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:30000/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The capital of France is&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sampling_params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_new_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;temperature = 0&lt;/code&gt; is greedy sampling; higher values increase diversity.&lt;/p&gt;
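
&lt;p&gt;To make that concrete, here is a small self-contained sketch (illustrative only, not SGLang internals) of how temperature reshapes a toy next-token distribution before sampling:&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits: token 0 is the most likely.
logits = [2.0, 1.0, 0.5]

# As temperature approaches 0, the distribution collapses onto the argmax
# (greedy decoding); higher temperatures flatten it, increasing diversity.
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```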

&lt;h3&gt;
  
  
  Example using the Offline Engine API for in-process batch inference
&lt;/h3&gt;

&lt;p&gt;Typical flow: construct &lt;code&gt;sgl.Engine(model_path=...)&lt;/code&gt;, run &lt;code&gt;llm.generate(...)&lt;/code&gt; over a batch of prompts, then &lt;strong&gt;&lt;code&gt;llm.shutdown()&lt;/code&gt;&lt;/strong&gt; to release GPU and other resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sglang&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sgl&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sgl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen2.5-0.5b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a concise self-introduction.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain what prefix caching is in one paragraph.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sampling_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROMPT:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OUTPUT:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>cheatsheet</category>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
