<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ArshTechPro</title>
    <description>The latest articles on Forem by ArshTechPro (@arshtechpro).</description>
    <link>https://forem.com/arshtechpro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3258664%2F7a2cc61a-0b4d-4cf8-884e-52f33905cac3.png</url>
      <title>Forem: ArshTechPro</title>
      <link>https://forem.com/arshtechpro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/arshtechpro"/>
    <language>en</language>
    <item>
      <title>Gemma 4: A Practical Guide for Developers</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Fri, 03 Apr 2026 15:23:32 +0000</pubDate>
      <link>https://forem.com/arshtechpro/gemma-4-a-practical-guide-for-developers-2co5</link>
      <guid>https://forem.com/arshtechpro/gemma-4-a-practical-guide-for-developers-2co5</guid>
      <description>&lt;p&gt;Google DeepMind released Gemma 4 on April 2, 2026. It is their most capable open model family to date, built from the same research behind Gemini 3, and shipped under the Apache 2.0 license. That means no usage caps, no restrictive policies, and full commercial freedom.&lt;/p&gt;

&lt;p&gt;This article breaks down what Gemma 4 is, what it can do, and how to actually run it in your projects. No fluff. Just the parts that matter if you are building something.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Gemma 4?
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is a family of open-weight multimodal models designed for reasoning, code generation, and agentic workflows. It comes in four sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.3B effective (5.1B total)&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Phones, Raspberry Pi, IoT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E4B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.5B effective (8B total)&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Edge devices, fast inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;26B A4B&lt;/strong&gt; (MoE)&lt;/td&gt;
&lt;td&gt;26B total, 4B active&lt;/td&gt;
&lt;td&gt;256K tokens&lt;/td&gt;
&lt;td&gt;Low-latency server inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;31B&lt;/strong&gt; (Dense)&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;256K tokens&lt;/td&gt;
&lt;td&gt;Maximum quality, fine-tuning base&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each size comes in both a base variant and an instruction-tuned (IT) variant. For most developer use cases, you want the IT variant.&lt;/p&gt;

&lt;p&gt;The "E" prefix on the smaller models stands for "effective parameters." These models use a technique called Per-Layer Embeddings (PLE) that feeds a secondary embedding signal into every decoder layer, which means the model activates fewer parameters at inference time, saving RAM and battery.&lt;/p&gt;

&lt;p&gt;The 26B model is a Mixture of Experts (MoE) architecture: it has 26 billion total parameters but activates only about 3.8 billion during inference (hence the "A4B" in its name). This keeps inference fast while the model still scores near the top of the LMArena leaderboard.&lt;/p&gt;
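&lt;p&gt;To make the MoE idea concrete, here is a toy top-k routing sketch in plain NumPy. It is illustrative only and makes no claim about the real Gemma 4 architecture; the names (&lt;code&gt;gate&lt;/code&gt;, &lt;code&gt;experts&lt;/code&gt;, &lt;code&gt;top_k&lt;/code&gt;) are mine:&lt;/p&gt;

```python
import numpy as np

# Toy mixture-of-experts routing: a gate scores every expert per token,
# a softmax is taken over the top-k scores only, and just those experts
# run. This is why a model with many total parameters can spend only a
# small fraction of them in compute on each forward pass.
rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16

token = rng.standard_normal(d)
gate = rng.standard_normal((n_experts, d))
experts = rng.standard_normal((n_experts, d, d))

scores = gate @ token                 # one routing score per expert
chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
weights = np.exp(scores[chosen])
weights = weights / weights.sum()     # softmax over the chosen experts only

# Only the selected experts' weight matrices are ever touched.
output = sum(w * (experts[i] @ token) for w, i in zip(weights, chosen))
print(output.shape)  # (16,)
```

&lt;p&gt;Note that the unselected experts contribute nothing and their weights are never read, which is where the latency win comes from.&lt;/p&gt;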




&lt;h2&gt;
  
  
  What It Can Do
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is not just a text chatbot. Here is what the model family supports out of the box:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text generation and reasoning.&lt;/strong&gt; Multi-step planning, deep logic, math. The 31B model scores 85.2% on MMLU-Pro and 80.0% on LiveCodeBench v6.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision.&lt;/strong&gt; All four model sizes accept image and video input. The vision encoder supports variable aspect ratios and configurable token budgets (70, 140, 280, 560, or 1120 tokens per image). A larger budget captures more detail at the cost of more compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio.&lt;/strong&gt; The E2B and E4B models accept audio input natively. They handle speech recognition and speech-to-translated-text across multiple languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code generation.&lt;/strong&gt; All models can generate, complete, and correct code. The 31B model is strong enough to function as an offline code assistant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function calling.&lt;/strong&gt; Native support for structured JSON output, function-calling syntax, and system instructions. This is the foundation for building agents.&lt;/p&gt;
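&lt;p&gt;In practice, function calling means the model emits structured JSON that your code parses and dispatches. A minimal sketch of that loop (the exact schema and output format depend on your serving stack, and the &lt;code&gt;get_weather&lt;/code&gt; tool here is a made-up example):&lt;/p&gt;

```python
import json

# A hypothetical tool definition in the common JSON-schema convention.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A function-calling model answers with structured JSON instead of
# prose when it decides the tool should run:
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_output)
assert call["name"] == get_weather["name"]
print(call["arguments"]["city"])  # Berlin
```

&lt;p&gt;Your agent loop then executes the named function with those arguments and feeds the result back to the model as another message.&lt;/p&gt;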

&lt;p&gt;&lt;strong&gt;140+ languages.&lt;/strong&gt; Pre-trained on over 140 languages with strong support for 35+.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Pick Your Model
&lt;/h2&gt;

&lt;p&gt;Start by deciding which model fits your hardware and use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are running on a phone, Raspberry Pi, or Jetson Nano:&lt;/strong&gt; Use &lt;code&gt;gemma-4-E2B-it&lt;/code&gt; or &lt;code&gt;gemma-4-E4B-it&lt;/code&gt;. These are designed for edge devices and run fully offline with low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you have a single GPU (A100 or H100):&lt;/strong&gt; Use &lt;code&gt;gemma-4-26B-A4B-it&lt;/code&gt;. The MoE model fits on a single GPU and gives you excellent latency because it activates only 4B parameters per forward pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you have two GPUs or want maximum quality:&lt;/strong&gt; Use &lt;code&gt;gemma-4-31B-it&lt;/code&gt;. This is the dense model. It needs tensor parallelism across two 80GB GPUs for full bfloat16 inference, but quantized versions run on consumer GPUs.&lt;/p&gt;
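&lt;p&gt;A quick back-of-the-envelope check explains this hardware guidance: weights-only memory is roughly parameter count times bytes per parameter, before any KV cache or activations. These are my own rough estimates, not published figures:&lt;/p&gt;

```python
def weight_gb(params_billion, bytes_per_param):
    """Weights-only footprint in GB (decimal); excludes KV cache and activations."""
    return params_billion * bytes_per_param

print(weight_gb(31, 2.0))  # bfloat16: 62.0 GB, so KV cache pushes you to two 80GB GPUs
print(weight_gb(31, 0.5))  # 4-bit:    15.5 GB, within reach of a 24GB consumer GPU
print(weight_gb(26, 0.5))  # 26B MoE:  13.0 GB at 4-bit
```

&lt;p&gt;One caveat on the MoE: sparse activation saves compute, not weight memory. All 26B parameters must stay resident even though only about 4B are active per token.&lt;/p&gt;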

&lt;p&gt;&lt;strong&gt;If you just want to try it out first:&lt;/strong&gt; Open Google AI Studio at &lt;code&gt;aistudio.google.com&lt;/code&gt; and select the Gemma 4 model. No setup required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Install Dependencies
&lt;/h2&gt;

&lt;p&gt;Gemma 4 requires &lt;code&gt;transformers&lt;/code&gt; version 5.5.0 or later. Install the core packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; transformers torch accelerate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you plan to work with images, also install &lt;code&gt;timm&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; timm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want 4-bit quantization to run larger models on smaller GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;bitsandbytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Run Inference with Transformers
&lt;/h2&gt;

&lt;p&gt;The fastest way to get started is with the Hugging Face pipeline API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text-only generation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;any-to-any&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-E2B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain dependency injection in three sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_full_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Image + text (vision)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;any-to-any&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-E4B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/your-image.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe what you see in this image.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_full_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lower-level control with AutoModel
&lt;/h3&gt;

&lt;p&gt;If you need more control over generation parameters, load the model and processor directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageTextToText&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-E4B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageTextToText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function that reverses a linked list.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_dict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;input_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;input_len&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Enable Thinking Mode
&lt;/h2&gt;

&lt;p&gt;Gemma 4 supports chain-of-thought reasoning. When enabled, the model outputs its internal reasoning before the final answer.&lt;/p&gt;

&lt;p&gt;To turn it on, include the &lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt; token at the start of your system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|think|&amp;gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 127 * 43?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model will output a thinking block followed by the final answer. If you are using the &lt;code&gt;processor.parse_response()&lt;/code&gt; method, you can separate the thinking from the content automatically.&lt;/p&gt;

&lt;p&gt;To disable thinking, simply remove the &lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt; token from the system prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Serve It with vLLM
&lt;/h2&gt;

&lt;p&gt;For production workloads, you will want to serve Gemma 4 behind an OpenAI-compatible API using vLLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install vLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; vllm &lt;span class="nt"&gt;--pre&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://wheels.vllm.ai/nightly/cu129 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://download.pytorch.org/whl/cu129 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--index-strategy&lt;/span&gt; unsafe-best-match
pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;transformers&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;5.5.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start the server
&lt;/h3&gt;

&lt;p&gt;For the 26B MoE on a single A100/H100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve google/gemma-4-26B-A4B-it &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the 31B dense model on two GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve google/gemma-4-31B-it &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the E4B edge model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve google/gemma-4-E4B-it &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 131072
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Query it
&lt;/h3&gt;

&lt;p&gt;Once the server is running, hit it with a standard OpenAI-compatible request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
            {"role": "user", "content": "Explain quantum entanglement in simple terms."}
        ],
        "max_tokens": 512,
        "temperature": 0.7
    }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means you can swap Gemma 4 into any application that already talks to an OpenAI-compatible API. No code changes beyond the model name and endpoint URL.&lt;/p&gt;
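&lt;p&gt;The same request built from Python with only the standard library. The payload is constructed and verified here but not sent; uncomment the last two lines once the vLLM server from Step 5 is running:&lt;/p&gt;

```python
import json
from urllib import request

# Mirror of the curl request above, against the local vLLM endpoint.
payload = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.get_full_url())
```

&lt;p&gt;Any OpenAI-compatible client library works the same way if you point its base URL at &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;.&lt;/p&gt;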




&lt;h2&gt;
  
  
  Step 6: Run It Locally with Ollama
&lt;/h2&gt;

&lt;p&gt;If you want to run Gemma 4 on your laptop without any server setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Ollama handles downloading the quantized weights, setting up the runtime, and exposing a local API. This is the easiest path for local development and testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Fine-Tune for Your Use Case
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is strong out of the box, but fine-tuning lets you specialize it for your domain. The recommended approach is QLoRA through the TRL library.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install fine-tuning dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;trl peft datasets bitsandbytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load with 4-bit quantization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageTextToText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-E2B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;quantization_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageTextToText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, you attach LoRA adapters using PEFT, prepare your dataset, and train with TRL's &lt;code&gt;SFTTrainer&lt;/code&gt;. The E2B model can be fine-tuned on a free Google Colab T4 GPU. The larger models need proportionally more memory.&lt;/p&gt;

&lt;p&gt;You can also fine-tune on Vertex AI or with Unsloth for additional optimizations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Function Calling
&lt;/h2&gt;

&lt;p&gt;Gemma 4 supports native function calling, which is what makes it useful for building agents. The model can output structured JSON that specifies which function to call and with what arguments.&lt;/p&gt;

&lt;p&gt;Here is the general pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define your available functions in the system prompt as a JSON schema.&lt;/li&gt;
&lt;li&gt;Send the user's message.&lt;/li&gt;
&lt;li&gt;The model responds with a function call in structured JSON.&lt;/li&gt;
&lt;li&gt;You execute the function and return the result.&lt;/li&gt;
&lt;li&gt;The model uses the result to generate its final answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works across all four model sizes. Combined with the long context windows (up to 256K tokens), you can pass entire codebases or document collections alongside your tool definitions in a single prompt.&lt;/p&gt;
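&lt;p&gt;A minimal sketch of steps 3 and 4 -- parsing the model's structured call and dispatching it -- might look like the following. The JSON shape and the &lt;code&gt;get_weather&lt;/code&gt; tool are hypothetical stand-ins, not Gemma 4's exact schema:&lt;/p&gt;

```python
import json

# Hypothetical tool registry -- the name and return value are illustrative stubs.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "conditions": "clear"}

TOOLS = {"get_weather": get_weather}

def handle_model_turn(model_output: str) -> dict:
    """Steps 3-4 of the loop: parse the model's structured call and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    result = fn(**call["arguments"])
    # Step 5: this message goes back to the model so it can write its final answer.
    return {"role": "tool", "name": call["name"], "content": json.dumps(result)}

# A model turn that chooses to call a function might look like:
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
tool_message = handle_model_turn(model_output)
```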




&lt;h2&gt;
  
  
  Where to Get the Models
&lt;/h2&gt;

&lt;p&gt;All model weights are available for download:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hugging Face:&lt;/strong&gt; &lt;code&gt;huggingface.co/collections/google/gemma-4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kaggle:&lt;/strong&gt; &lt;code&gt;kaggle.com/models/google/gemma-4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama:&lt;/strong&gt; &lt;code&gt;ollama.com/library/gemma4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google AI Studio (browser):&lt;/strong&gt; &lt;code&gt;aistudio.google.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hugging Face model IDs you will use most often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;google/gemma-4-E2B-it&lt;/code&gt; (smallest, edge)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;google/gemma-4-E4B-it&lt;/code&gt; (small, edge)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;google/gemma-4-26B-A4B-it&lt;/code&gt; (MoE, fast server inference)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;google/gemma-4-31B-it&lt;/code&gt; (dense, maximum quality)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Architecture Details (if you care)
&lt;/h2&gt;

&lt;p&gt;A few things worth knowing about how Gemma 4 works under the hood:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternating attention.&lt;/strong&gt; Layers alternate between local sliding-window attention (512-1024 tokens) and global full-context attention. This is how it stays efficient while still handling long context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dual RoPE.&lt;/strong&gt; Standard rotary position embeddings for sliding-window layers, proportional RoPE for global layers. This is what enables the 256K context window without quality degradation at long distances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared KV cache.&lt;/strong&gt; The last N layers reuse key-value tensors from earlier layers instead of computing their own. This cuts both memory and compute during inference.&lt;/p&gt;
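&lt;p&gt;The sharing idea is easy to picture with a toy sketch. The layer count and reuse pattern below are made up for illustration; the real mapping is an implementation detail of the model:&lt;/p&gt;

```python
# Toy illustration of a shared KV cache: the last SHARED_TAIL layers point at
# earlier layers' KV tensors instead of allocating their own.
NUM_LAYERS, SHARED_TAIL = 12, 4

caches = []
for layer in range(NUM_LAYERS):
    if layer >= NUM_LAYERS - SHARED_TAIL:
        caches.append(caches[layer - SHARED_TAIL])  # reuse an earlier layer's cache
    else:
        caches.append({"k": [], "v": []})           # this layer's own allocation

unique = len({id(c) for c in caches})
print(unique)  # 8 allocations serve 12 layers
```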

&lt;p&gt;&lt;strong&gt;Vision encoder.&lt;/strong&gt; Learned 2D position encoder with multidimensional RoPE. Preserves original aspect ratios. Token budgets are configurable from 70 to 1120 tokens per image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio encoder.&lt;/strong&gt; USM-style conformer architecture (same as Gemma-3n). Handles speech recognition and translation with up to 30 seconds of audio on the smaller models.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed from Gemma 3
&lt;/h2&gt;

&lt;p&gt;If you have used Gemma 3 before, here is what is different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;License.&lt;/strong&gt; Gemma 3 used a custom Google license with restrictions. Gemma 4 uses Apache 2.0. This is a significant change for commercial use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE model.&lt;/strong&gt; The 26B A4B is the first Mixture of Experts model in the Gemma family.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-Layer Embeddings.&lt;/strong&gt; The E2B and E4B models use PLE for better parameter efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared KV cache.&lt;/strong&gt; New efficiency optimization not present in Gemma 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio input.&lt;/strong&gt; The E2B and E4B models handle audio natively. Gemma 3 did not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roles.&lt;/strong&gt; Gemma 4 uses standard &lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, and &lt;code&gt;assistant&lt;/code&gt; roles in chat templates. Gemma 3 had a different role structure.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Gemma 4 gives you a complete open model stack: four sizes covering everything from phones to multi-GPU servers, multimodal input (text, image, video, audio), native function calling for agents, up to 256K context, and an Apache 2.0 license that lets you ship products without restrictions.&lt;/p&gt;

&lt;p&gt;The fastest path from zero to running code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;pip install -U transformers torch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Load &lt;code&gt;google/gemma-4-E2B-it&lt;/code&gt; with the pipeline API&lt;/li&gt;
&lt;li&gt;Start prompting&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>programming</category>
    </item>
    <item>
      <title>Hermes Agent: A Self-Improving AI Agent That Runs Anywhere</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Mon, 30 Mar 2026 14:20:05 +0000</pubDate>
      <link>https://forem.com/arshtechpro/hermes-agent-a-self-improving-ai-agent-that-runs-anywhere-2b7d</link>
      <guid>https://forem.com/arshtechpro/hermes-agent-a-self-improving-ai-agent-that-runs-anywhere-2b7d</guid>
      <description>&lt;p&gt;Most AI agents today are chatbots with extra steps. You talk to them, they forget everything, and you start over next time. Hermes Agent, built by Nous Research, takes a different approach. It remembers what it learns, creates reusable skills from experience, and runs on your own infrastructure instead of someone else's cloud.&lt;/p&gt;

&lt;p&gt;This article covers what Hermes Agent actually is, why it matters for developers, and how to get it running.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Hermes Agent?
&lt;/h2&gt;

&lt;p&gt;Hermes Agent is an open-source AI agent (MIT licensed) with a built-in learning loop. That phrase sounds like marketing, but it describes something specific: after completing a complex task, the agent can save the approach as a reusable "skill" for next time. It maintains persistent memory across sessions. It builds a model of who you are and how you work.&lt;/p&gt;

&lt;p&gt;It is not a wrapper around a single API. You can plug in whatever LLM provider you want -- OpenAI, Anthropic, OpenRouter (which gives you access to 200+ models), or your own self-hosted endpoint running Ollama, vLLM, or SGLang. Switching providers is a single command. No code changes.&lt;/p&gt;

&lt;p&gt;The project has about 8,700 stars on GitHub, 142 contributors, and 2,293 commits as of late March 2026. It is written primarily in Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Should a Developer Care?
&lt;/h2&gt;

&lt;p&gt;There are many agent frameworks out there. Here is what makes Hermes Agent worth looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is not tied to your laptop.&lt;/strong&gt; You can run the agent on a $5 VPS, inside a Docker container, over SSH to a remote server, or on serverless infrastructure like Modal or Daytona that hibernates when idle. You talk to it from Telegram, Discord, Slack, WhatsApp, Signal, or the terminal. The conversation continues across platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It has real memory, not a hack.&lt;/strong&gt; The agent maintains two small, curated files: MEMORY.md (for environment facts, conventions, and lessons learned) and USER.md (for your preferences and communication style). These are injected into the system prompt at session start. The agent also has full-text search over all past sessions stored in SQLite, so it can recall conversations from weeks ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It learns from its own work.&lt;/strong&gt; After a complex task (typically 5+ tool calls), the agent can autonomously create a skill -- a structured markdown document with procedures, pitfalls, and verification steps. Next time a similar task comes up, it loads the skill instead of figuring things out from scratch. Skills can also self-improve during use when the agent discovers a better approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It supports MCP out of the box.&lt;/strong&gt; You can connect any Model Context Protocol server by adding a few lines to the config file. This means the agent can interact with GitHub, databases, or any service that exposes an MCP endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is research-ready.&lt;/strong&gt; If you are working on training better tool-calling models, Hermes includes batch trajectory generation, Atropos RL environments, and trajectory compression. This is not just a user-facing product; it is also infrastructure for building the next generation of agent models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Concepts
&lt;/h2&gt;

&lt;p&gt;Before setting it up, it helps to understand the main building blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Agent Loop
&lt;/h3&gt;

&lt;p&gt;The core of Hermes is &lt;code&gt;AIAgent&lt;/code&gt; in &lt;code&gt;run_agent.py&lt;/code&gt;. It handles provider selection, prompt construction, tool execution, retries, compression, and persistence. It is a synchronous orchestration engine -- one loop that drives everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills
&lt;/h3&gt;

&lt;p&gt;Skills are on-demand knowledge documents stored in &lt;code&gt;~/.hermes/skills/&lt;/code&gt;. They follow a progressive disclosure pattern to minimize token usage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Level 0: the agent sees a list of skill names and descriptions (about 3,000 tokens)&lt;/li&gt;
&lt;li&gt;Level 1: the agent loads the full content of a specific skill when needed&lt;/li&gt;
&lt;li&gt;Level 2: the agent loads a specific reference file within a skill&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each skill is a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; file and optional reference materials, templates, and scripts. The format uses YAML front matter for metadata and markdown for the actual instructions.&lt;/p&gt;
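&lt;p&gt;As a concrete (hypothetical) example, a hand-written &lt;code&gt;SKILL.md&lt;/code&gt; might look like this -- the front-matter field names are illustrative, since the exact schema isn't reproduced here:&lt;/p&gt;

```markdown
---
name: deploy-docs
description: Build and deploy the documentation site to gh-pages
---

## Procedure
1. Run the site build and fail fast on broken links.
2. Commit the build output and push to the gh-pages branch.
3. Verify the deployed site loads before reporting success.

## Pitfalls
- The build silently skips pages with malformed front matter; check the page count.
```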

&lt;p&gt;The agent creates skills automatically after complex tasks, but you can also write them by hand, install them from the Skills Hub (which aggregates multiple registries including skills.sh and GitHub repos), or share them with a team via external skill directories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;p&gt;Memory is bounded and curated. MEMORY.md gets 2,200 characters. USER.md gets 1,375 characters. That is roughly 1,300 tokens total -- small enough that it does not bloat the context window, but large enough to hold 15-20 useful entries.&lt;/p&gt;

&lt;p&gt;The agent manages memory itself. It adds entries when it learns something useful, replaces entries when information changes, and consolidates entries when memory gets full. There is also security scanning on memory entries to prevent prompt injection.&lt;/p&gt;

&lt;p&gt;For deeper recall, the agent can search all past sessions using SQLite full-text search and LLM summarization. This is not in the system prompt -- it is on-demand, called when the agent needs to find something from a previous conversation.&lt;/p&gt;
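&lt;p&gt;The mechanism is easy to picture with the stdlib &lt;code&gt;sqlite3&lt;/code&gt; module. This is a toy sketch, assuming an FTS5-enabled SQLite build; the table name and schema are illustrative, not Hermes's actual storage layout:&lt;/p&gt;

```python
import sqlite3

# Toy version of full-text recall over past sessions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE sessions USING fts5(started_at, transcript)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        ("2026-03-01", "Debugged the nginx config; root cause was a missing trailing slash."),
        ("2026-03-15", "Set up the Telegram gateway and a 9am news digest cron job."),
    ],
)

def recall(query: str) -> list:
    """Return transcripts matching the query, best match first."""
    rows = conn.execute(
        "SELECT transcript FROM sessions WHERE sessions MATCH ? ORDER BY rank", (query,)
    ).fetchall()
    return [r[0] for r in rows]

hits = recall("nginx")
```

&lt;p&gt;An LLM summarization pass over the matching transcripts then turns raw hits into a usable answer.&lt;/p&gt;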

&lt;h3&gt;
  
  
  Terminal Backends
&lt;/h3&gt;

&lt;p&gt;Hermes supports six ways to execute commands: local, Docker, SSH, Daytona, Singularity, and Modal. Docker and SSH give you sandboxed execution. Daytona and Modal give you serverless persistence -- the environment hibernates when idle and wakes on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Messaging Gateway
&lt;/h3&gt;

&lt;p&gt;The gateway is a long-running process that connects the agent to messaging platforms. You configure it with &lt;code&gt;hermes gateway setup&lt;/code&gt;, start it with &lt;code&gt;hermes gateway start&lt;/code&gt;, and then talk to the agent from your phone. The same slash commands work across all platforms.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Set It Up
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You need git installed. That is it. The installer handles Python, Node.js, and all dependencies.&lt;/p&gt;

&lt;p&gt;Windows is not natively supported. Use WSL2 instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;

&lt;p&gt;Run the one-line installer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After it finishes, reload your shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc   &lt;span class="c"&gt;# or source ~/.zshrc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure a Provider
&lt;/h3&gt;

&lt;p&gt;Run the model selection command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This walks you through choosing an LLM provider. Your options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nous Portal&lt;/strong&gt; -- subscription-based, zero-config, uses OAuth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt; -- multi-provider routing, supports 200+ models, needs an API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; -- uses Codex models via device code auth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; -- Claude models directly, via Claude Code auth or an API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Endpoint&lt;/strong&gt; -- any OpenAI-compatible API (Ollama, vLLM, SGLang, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also providers for Z.AI, Kimi/Moonshot, MiniMax, Alibaba Cloud, Hugging Face, and several others.&lt;/p&gt;

&lt;p&gt;You can switch providers at any time by running &lt;code&gt;hermes model&lt;/code&gt; again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Start Chatting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire startup command. You will see a welcome banner showing your active model, available tools, and installed skills. Type a message and press Enter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Set Up Sandboxed Execution (Recommended)
&lt;/h3&gt;

&lt;p&gt;By default, the agent runs commands on your local machine. For safety, use a sandboxed backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes config &lt;span class="nb"&gt;set &lt;/span&gt;terminal.backend docker    &lt;span class="c"&gt;# Docker isolation&lt;/span&gt;
hermes config &lt;span class="nb"&gt;set &lt;/span&gt;terminal.backend ssh       &lt;span class="c"&gt;# Remote server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Connect Messaging (Optional)
&lt;/h3&gt;

&lt;p&gt;If you want to talk to Hermes from Telegram, Discord, Slack, or another platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes gateway setup    &lt;span class="c"&gt;# Interactive configuration&lt;/span&gt;
hermes gateway start    &lt;span class="c"&gt;# Start the gateway process&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Useful Commands to Know
&lt;/h2&gt;

&lt;p&gt;Once Hermes is running, these commands cover most of what you need day-to-day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes              &lt;span class="c"&gt;# Start a conversation&lt;/span&gt;
hermes &lt;span class="nt"&gt;-c&lt;/span&gt;           &lt;span class="c"&gt;# Resume the last session&lt;/span&gt;
hermes model        &lt;span class="c"&gt;# Switch LLM provider/model&lt;/span&gt;
hermes tools        &lt;span class="c"&gt;# Configure enabled tools&lt;/span&gt;
hermes doctor       &lt;span class="c"&gt;# Diagnose issues&lt;/span&gt;
hermes update       &lt;span class="c"&gt;# Update to latest version&lt;/span&gt;
hermes gateway      &lt;span class="c"&gt;# Manage messaging platforms&lt;/span&gt;
hermes skills search &amp;lt;query&amp;gt;   &lt;span class="c"&gt;# Find skills to install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside a conversation, type &lt;code&gt;/&lt;/code&gt; to see all available slash commands. A few highlights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/model              &lt;span class="c"&gt;# Switch models mid-conversation&lt;/span&gt;
/tools              &lt;span class="c"&gt;# List available tools&lt;/span&gt;
/skills             &lt;span class="c"&gt;# Browse and manage skills&lt;/span&gt;
/personality pirate &lt;span class="c"&gt;# Try a fun personality&lt;/span&gt;
/save               &lt;span class="c"&gt;# Save the conversation&lt;/span&gt;
/compress           &lt;span class="c"&gt;# Compress context when it gets long&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Adding MCP Servers
&lt;/h2&gt;

&lt;p&gt;To connect external tools via MCP, add entries to &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-github"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GITHUB_PERSONAL_ACCESS_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ghp_xxx"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent will then have access to whatever tools that MCP server exposes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Scheduled Tasks
&lt;/h2&gt;

&lt;p&gt;Hermes has a built-in cron scheduler that delivers results to any connected platform. You set up tasks in natural language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Every morning at 9am, check Hacker News for AI news and send me a summary on Telegram.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent creates a cron job that runs automatically via the gateway. No crontab editing required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;If you want to contribute to the project, here is the development setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NousResearch/hermes-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;hermes-agent
git submodule update &lt;span class="nt"&gt;--init&lt;/span&gt; mini-swe-agent
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
uv venv .venv &lt;span class="nt"&gt;--python&lt;/span&gt; 3.11
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
uv pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[all,dev]"&lt;/span&gt;
uv pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"./mini-swe-agent"&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-q&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mini-swe-agent&lt;/code&gt; submodule is a required terminal backend. If you want to work on the RL/training side, also initialize the &lt;code&gt;tinker-atropos&lt;/code&gt; submodule.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Is Not
&lt;/h2&gt;

&lt;p&gt;Hermes Agent is not a managed service. There is no hosted version you sign up for. You run it on your own machine or server, bring your own API keys, and own all the data. If that is what you want -- full control over an agent that gets better the more you use it -- this is worth trying.&lt;/p&gt;

&lt;p&gt;It is also not a simple chatbot wrapper. The codebase includes multiple API modes, prompt caching and compression, gateway-specific session management, RL environment infrastructure, and an editor integration layer via ACP. It is a substantial project with real architectural depth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;github.com/NousResearch/hermes-agent&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>TurboQuant: What Developers Need to Know About Google's KV Cache Compression</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Sat, 28 Mar 2026 18:26:46 +0000</pubDate>
      <link>https://forem.com/arshtechpro/turboquant-what-developers-need-to-know-about-googles-kv-cache-compression-eeg</link>
      <guid>https://forem.com/arshtechpro/turboquant-what-developers-need-to-know-about-googles-kv-cache-compression-eeg</guid>
      <description>&lt;p&gt;If you've ever run a large language model on your own hardware and watched your GPU memory vanish as the context window grows, TurboQuant is built for exactly that problem.&lt;/p&gt;

&lt;p&gt;Published by Google Research on March 24, 2026 and headed to ICLR 2026, TurboQuant is a compression algorithm that shrinks the KV cache -- the biggest memory bottleneck during LLM inference -- down to 3-4 bits per element without any retraining or fine-tuning. The result is roughly a 4-6x reduction in KV cache memory with negligible quality loss.&lt;/p&gt;

&lt;p&gt;This article breaks down what TurboQuant actually does, why it matters for anyone deploying or experimenting with LLMs, and how to start using community implementations right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: KV Cache Is Eating Your VRAM
&lt;/h2&gt;

&lt;p&gt;When a transformer model generates text, it computes key and value vectors for every token in the context and stores them so it doesn't have to recompute them on subsequent steps. This is the key-value (KV) cache.&lt;/p&gt;

&lt;p&gt;The issue is simple: it grows linearly with context length, and it stores everything in full precision (typically FP16). For an 8B parameter model at 32K context, the KV cache alone can consume around 4.6 GB of VRAM. Scale that to multiple concurrent users or longer contexts, and you're out of memory before the model weights themselves become the bottleneck.&lt;/p&gt;
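&lt;p&gt;The arithmetic behind that figure is worth seeing once. This sketch assumes a Llama-style 8B configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128); exact numbers vary by architecture, which is why published figures for an 8B model at 32K context land in the 4-5 GB range:&lt;/p&gt;

```python
# Back-of-envelope KV cache size. The architecture numbers are assumptions
# (a Llama-style 8B config with grouped-query attention), used to show the formula:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):  # 2 = FP16
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)
print(f"FP16 KV cache at 32K context: {fp16 / 1e9:.1f} GB")  # ~4.3 GB with these numbers

# At ~4 bits per element instead of 16, the same cache is ~4x smaller:
print(f"4-bit KV cache at 32K context: {fp16 / 4 / 1e9:.1f} GB")
```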

&lt;p&gt;Existing approaches to this problem -- like FP8 quantization in vLLM or the q4_0/q8_0 cache types in Ollama -- either don't compress aggressively enough or introduce quality trade-offs that are hard to predict. TurboQuant aims to do better on both fronts.&lt;/p&gt;




&lt;h2&gt;
  
  
  How TurboQuant Works (The Short Version)
&lt;/h2&gt;

&lt;p&gt;TurboQuant is a two-stage compression pipeline. It doesn't need any training data, calibration, or model-specific tuning. It operates on arbitrary vectors rather than exploiting model-specific structure, which means it slots into any transformer architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: PolarQuant (b-1 bits)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step is a random orthogonal rotation applied to each KV vector. This rotation spreads the energy of the vector uniformly across all coordinates, which transforms the problem: after rotation, each coordinate follows a predictable statistical distribution (approximately Beta or Gaussian depending on the head dimension). Because the distribution is known in advance, you can compute a mathematically optimal set of quantization buckets (using the Lloyd-Max algorithm) &lt;em&gt;once&lt;/em&gt;, ahead of time. No per-model or per-dataset calibration needed. PolarQuant then converts coordinates into polar form -- radius and angle rather than Cartesian x/y/z -- which eliminates the costly per-block normalization constants that traditional quantizers need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: QJL Residual Correction (1 bit)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second stage takes the tiny quantization error left over from Stage 1, projects it through a random Gaussian matrix (a Johnson-Lindenstrauss transform), and stores only the sign bit (+1 or -1) of each resulting value. This single-bit sketch acts as a bias correction that makes the inner product estimates (i.e., attention scores) mathematically unbiased. The overhead is just 1 extra bit per coordinate.&lt;/p&gt;

&lt;p&gt;The combined result: b bits total per coordinate, with provably near-optimal distortion bounds and zero memory overhead from normalization constants.&lt;/p&gt;
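&lt;p&gt;To make the two-stage structure concrete, here is a deliberately simplified pure-Python toy. It keeps two of the ideas -- a quantization grid fixed ahead of time, plus a 1-bit sign correction on the residual -- and omits the rotation, the polar transform, the Lloyd-Max codebook, and the JL projection, so treat it as an illustration of the shape of the algorithm, not the algorithm itself:&lt;/p&gt;

```python
import random
import statistics

random.seed(0)

# Stage 1 stand-in: a quantization grid fixed in advance (the real algorithm uses
# a precomputed Lloyd-Max codebook matched to the post-rotation distribution).
LEVELS = [i / 4 for i in range(-8, 9)]  # 17 fixed levels on [-2, 2]

def quantize(x: float) -> float:
    return min(LEVELS, key=lambda q: abs(q - x))

values = [random.gauss(0.0, 1.0) for _ in range(256)]
quantized = [quantize(v) for v in values]
residuals = [v - q for v, q in zip(values, quantized)]

# Stage 2 stand-in: keep only the sign of each residual plus one shared scale
# (the mean absolute residual), and add it back at reconstruction time.
scale = statistics.mean(abs(r) for r in residuals)
signs = [1 if r >= 0 else -1 for r in residuals]

err_stage1 = sum(abs(r) for r in residuals)
err_stage2 = sum(abs(r - s * scale) for r, s in zip(residuals, signs))
print(f"residual error without sign bit: {err_stage1:.2f}")
print(f"residual error with sign bit:    {err_stage2:.2f}")  # smaller
```

&lt;p&gt;With the seed above, the single sign bit noticeably shrinks the residual error, which is the role the QJL stage plays at full scale -- except there it also makes the attention-score estimates provably unbiased.&lt;/p&gt;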




&lt;h2&gt;
  
  
  Why This Matters for Developers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;It's training-free and model-agnostic.&lt;/strong&gt; TurboQuant doesn't require fine-tuning, calibration datasets, or model-specific configuration. The rotation matrix and codebook are derived from math, not data. Point it at any transformer's KV cache and it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression scales with context length.&lt;/strong&gt; The benefit is proportional to how much KV cache you have. At 512 tokens the savings are modest (tens of megabytes). At 4K tokens you start saving over 1 GB. At 8K+ tokens the savings reach 2 GB or more on a single model -- and that's when it starts changing what you can actually run on your hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It enables longer contexts on existing hardware.&lt;/strong&gt; If you're hitting OOM at 16K context on a 16 GB GPU, TurboQuant can push that boundary significantly without buying new hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed gains under memory pressure.&lt;/strong&gt; When FP16 KV cache pushes your GPU into swap territory, inference speed collapses. Community benchmarks show TurboQuant maintaining 2-3x higher token throughput in these regimes because the compressed cache stays in fast GPU memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It applies beyond LLMs.&lt;/strong&gt; The same algorithm works for vector search / nearest-neighbor retrieval, compressing high-dimensional embedding indices with state-of-the-art recall.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started: The pip-Installable Path
&lt;/h2&gt;

&lt;p&gt;The fastest way to try TurboQuant today is the &lt;code&gt;turboquant&lt;/code&gt; Python package, a community implementation that provides a drop-in replacement for HuggingFace's KV cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;turboquant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines to compress your model's KV cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;turboquant&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TurboQuantCache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create compressed cache -- that's it
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TurboQuantCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;past_key_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's also a built-in OpenAI-compatible inference server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;turboquant-server &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-3B-Instruct &lt;span class="nt"&gt;--bits&lt;/span&gt; 4 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you can use the core quantizer directly on any vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;turboquant&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TurboQuantMSE&lt;/span&gt;

&lt;span class="n"&gt;tq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TurboQuantMSE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# vectors: (N, 128)
&lt;/span&gt;&lt;span class="n"&gt;vectors_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dequantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# reconstruct
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The llama.cpp Path
&lt;/h2&gt;

&lt;p&gt;If you're running models locally through llama.cpp, there are active community implementations integrating TurboQuant as a KV cache type. One notable fork (&lt;code&gt;turboquant_plus&lt;/code&gt;) already works end-to-end on Apple Silicon with Metal GPU kernels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Server mode&lt;/span&gt;
./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; models/your-model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-c&lt;/span&gt; 262144 &lt;span class="nt"&gt;-fa&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080

&lt;span class="c"&gt;# CLI mode&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; models/your-model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-c&lt;/span&gt; 2048 &lt;span class="nt"&gt;-fa&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 100 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Hello world"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's also an open feature request on the vLLM project to integrate TurboQuant as a native KV cache quantization option. Google's official implementation is expected around Q2 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Notes and Gotchas
&lt;/h2&gt;

&lt;p&gt;A few things that benchmarks and community experiments have surfaced, and that the paper doesn't emphasize:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4-bit is the sweet spot for most use cases.&lt;/strong&gt; At 4 bits, quality is essentially indistinguishable from FP16 on 3B+ parameter models. At 3 bits, you get more compression but quality starts degrading noticeably on models smaller than 8B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small models are more sensitive.&lt;/strong&gt; On 0.5B-1.6B parameter models, quantization noise from TurboQuant can produce repetitive or degraded output, especially at 3-bit. If you're running something under 3B parameters, test carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keys and values have different sensitivities.&lt;/strong&gt; Community experiments have found that value quantization tends to be the bottleneck -- 2-bit values cause significant cosine similarity degradation (around 0.94), while 4-bit values maintain 0.997. If you're tuning bit allocation, give values more bits than keys.&lt;/p&gt;
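&lt;p&gt;This effect is easy to reproduce in miniature, with plain uniform scalar quantization standing in for TurboQuant's quantizer (so the exact numbers will differ from the figures above, but the gap between 2-bit and 4-bit shows the same shape):&lt;/p&gt;

```python
import random

def roundtrip_cosine(vec, bits):
    # Uniformly quantize to signed codes, reconstruct, and measure how
    # well the direction of the vector survives.
    scale = max(abs(x) for x in vec)
    levels = 2 ** (bits - 1) - 1
    step = scale / levels
    rec = [round(x / step) * step for x in vec]
    num = sum(a * b for a, b in zip(vec, rec))
    n1 = sum(x * x for x in vec) ** 0.5
    n2 = sum(x * x for x in rec) ** 0.5
    return num / (n1 * n2)

random.seed(1)
value_vec = [random.gauss(0, 1) for _ in range(256)]
cos2 = roundtrip_cosine(value_vec, bits=2)  # coarse: codes in {-1, 0, 1}
cos4 = roundtrip_cosine(value_vec, bits=4)
# cos4 sits near 1.0; cos2 is visibly degraded
```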

&lt;p&gt;&lt;strong&gt;Short contexts don't benefit much.&lt;/strong&gt; Below 1K tokens, the KV cache is small enough that compression savings are negligible and the overhead of rotation + quantization can even be a net negative. TurboQuant really shines at 4K+ tokens.&lt;/p&gt;
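&lt;p&gt;Back-of-the-envelope arithmetic makes this concrete. The cache geometry below (32 layers, 8 KV heads, head dimension 128) is illustrative of a 7B-class model, not any specific checkpoint:&lt;/p&gt;

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Two tensors (K and V) per layer, one vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

short = kv_cache_bytes(512)       # 64 MiB in FP16: little to save
long = kv_cache_bytes(32_768)     # 4 GiB in FP16
long_4bit = long // 4             # ~1 GiB at 4 bits (metadata ignored)
```

&lt;p&gt;Saving three quarters of 64 MB barely registers; saving three quarters of 4 GB changes what hardware you can run on.&lt;/p&gt;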

&lt;p&gt;&lt;strong&gt;The residual window matters.&lt;/strong&gt; Most implementations keep the most recent 128-256 tokens in full FP16 precision and only compress older tokens. This is important for output quality since attention focuses heavily on recent context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Community Implementations at a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Integration&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;back2matching/turboquant&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;HuggingFace drop-in&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install turboquant&lt;/code&gt;, includes OpenAI-compatible server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tonbistudio/turboquant-pytorch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Python/PyTorch&lt;/td&gt;
&lt;td&gt;Standalone&lt;/td&gt;
&lt;td&gt;From-scratch implementation with detailed validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0xSero/turboquant&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;vLLM adapter&lt;/td&gt;
&lt;td&gt;Triton kernels, vLLM monkey-patch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TheTom/turboquant_plus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;C/Python&lt;/td&gt;
&lt;td&gt;llama.cpp + Metal&lt;/td&gt;
&lt;td&gt;Apple Silicon optimized, 500+ tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RecursiveIntell/turbo-quant&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Standalone lib&lt;/td&gt;
&lt;td&gt;Embedding + KV cache, no runtime dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ggml-org/llama.cpp#20969&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;llama.cpp discussion&lt;/td&gt;
&lt;td&gt;Multiple community PRs in progress&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;TurboQuant is one piece of a larger shift happening in LLM deployment: making inference cheaper and more accessible without sacrificing quality. It pairs well with weight quantization (GPTQ, AWQ, GGUF formats), speculative decoding, and other serving optimizations. The combination of a 4-bit quantized model with a 4-bit TurboQuant KV cache means you can run meaningfully large models on consumer GPUs with long contexts -- something that was impractical a year ago.&lt;/p&gt;
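&lt;p&gt;The arithmetic behind that claim, with illustrative numbers (an 8B-parameter model and a 32-layer, 8-KV-head, 128-head-dim cache geometry; real checkpoints and overheads vary):&lt;/p&gt;

```python
GiB = 2 ** 30
params = 8e9

weights_fp16 = params * 2 / GiB            # ~14.9 GiB
weights_4bit = params * 0.5 / GiB          # ~3.7 GiB

def kv_bytes(seq_len):
    # Two tensors (K and V) per layer, one vector per KV head per token.
    return 2 * 32 * 8 * 128 * seq_len * 2  # FP16 bytes

kv_fp16 = kv_bytes(131_072) / GiB          # 16.0 GiB at 128K context
kv_4bit = kv_fp16 / 4                      # 4.0 GiB

fp16_total = weights_fp16 + kv_fp16        # ~31 GiB of VRAM
quant_total = weights_4bit + kv_4bit       # under 8 GiB of VRAM
```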

</description>
      <category>ai</category>
      <category>python</category>
      <category>google</category>
    </item>
    <item>
      <title>Xcode 26.4 - Here Is What Actually Matters for Devs</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Thu, 26 Mar 2026 19:25:27 +0000</pubDate>
      <link>https://forem.com/arshtechpro/xcode-264-here-is-what-actually-matters-for-devs-2hke</link>
      <guid>https://forem.com/arshtechpro/xcode-264-here-is-what-actually-matters-for-devs-2hke</guid>
      <description>&lt;p&gt;Apple released Xcode 26.4 (build 17E192) on March 24, 2026. It ships with Swift 6.3, updated SDKs for iOS/iPadOS/tvOS/macOS/visionOS 26.4, and the largest Instruments update in the entire Xcode 26 cycle.&lt;/p&gt;

&lt;p&gt;This post cuts through the release notes and organizes everything by what will actually affect your day-to-day work.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You Read Nothing Else: The Sanitizer Regression
&lt;/h2&gt;

&lt;p&gt;Address Sanitizer and Thread Sanitizer &lt;strong&gt;hang indefinitely&lt;/strong&gt; on any OS 26.4 target when you build with Xcode 26.3 or older. Not crash. Not fail. Hang. Your CI jobs will silently block without producing any error output.&lt;/p&gt;

&lt;p&gt;If your machines are running OS 26.4, upgrade Xcode to 26.4 immediately. There is no other workaround.&lt;/p&gt;

&lt;p&gt;This is the single most urgent item in the release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Instruments: Serious New Profiling Tools
&lt;/h2&gt;

&lt;p&gt;Instruments got three headline features that are worth learning before your next performance investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run Comparison.&lt;/strong&gt; You can now compare call trees across multiple profiling sessions. Open it from View &amp;gt; Detail Area &amp;gt; Compare With, select a previous run, and Instruments shows you which functions got faster or slower between sessions. Combined with Call Tree filtering (like "Charge to callers"), this makes before/after performance work much less manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top Functions.&lt;/strong&gt; A new top-level Call Tree mode that surfaces the most expensive functions across an entire trace, regardless of where they sit in the call hierarchy. If you have ever spent time drilling through nested call stacks to find the real bottleneck, this saves that effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power Profiler per-core breakdown.&lt;/strong&gt; On-device Power Profiler traces now show activity broken down by individual CPU core. Useful for understanding how your workload distributes across efficiency and performance cores on Apple Silicon.&lt;/p&gt;

&lt;p&gt;Other Instruments changes worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;xctrace import&lt;/code&gt; now supports &lt;code&gt;--append-run&lt;/code&gt; to combine multiple files into one trace document&lt;/li&gt;
&lt;li&gt;A new "Hide Inlined Functions" option charges inlined function samples to their parent, which cleans up noisy traces&lt;/li&gt;
&lt;li&gt;Flame graph fixes for node selection, context menus, resizing, and color contrast&lt;/li&gt;
&lt;li&gt;Applying a dSYM now resymbolicates all recorded runs, not just the current one&lt;/li&gt;
&lt;li&gt;The "Compress Run Data" setting is gone; trace files compress automatically now&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Swift 6.3
&lt;/h2&gt;

&lt;p&gt;Xcode 26.4 bundles Swift 6.3. The Xcode release notes do not cover Swift language changes in depth, but the compiler and runtime are updated; check the separate Swift 6.3 release notes for language-level details.&lt;/p&gt;




&lt;h2&gt;
  
  
  C++ Standard Library: Some Huge Performance Numbers
&lt;/h2&gt;

&lt;p&gt;If you maintain any C++ code in your project, the standard library improvements in this release are significant.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;std::ranges::copy&lt;/code&gt;, &lt;code&gt;copy_n&lt;/code&gt;, &lt;code&gt;copy_backward&lt;/code&gt;, &lt;code&gt;move&lt;/code&gt;, &lt;code&gt;move_backward&lt;/code&gt;, and &lt;code&gt;rotate&lt;/code&gt; algorithms are optimized for &lt;code&gt;std::vector::iterator&lt;/code&gt;. Apple reports improvements up to 2000x in applicable workloads. Even if your real-world gains are a fraction of that, it is worth knowing about if you do bulk vector operations.&lt;/p&gt;

&lt;p&gt;Other highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;std::ranges::equal&lt;/code&gt; improved up to 188x&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;std::ranges::swap_ranges&lt;/code&gt; improved up to 611x&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;std::stable_sort&lt;/code&gt; now uses radix sort for floating-point types (up to 10x faster)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bitset::to_string&lt;/code&gt; is up to 16x faster for dense bitsets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five new C++26 language features landed in Apple Clang:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured Bindings can introduce a Pack (P1061R10)&lt;/li&gt;
&lt;li&gt;Structured binding declaration as a condition (P0963R3)&lt;/li&gt;
&lt;li&gt;Variadic Friends (P2893R3)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;constexpr&lt;/code&gt; placement new (P2747R2)&lt;/li&gt;
&lt;li&gt;The Oxford variadic comma (P3176R1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The library also adds &lt;code&gt;std::flat_set&lt;/code&gt;, &lt;code&gt;views::join_with&lt;/code&gt;, &lt;code&gt;constexpr&lt;/code&gt; stable sorting, and Unicode 16.0.0 support in the formatting library, among 14 newly implemented C++ papers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Localization: String Catalog Gets Practical
&lt;/h2&gt;

&lt;p&gt;The String Catalog editor received seven improvements that remove friction from localization work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cut, copy, paste, duplicate&lt;/strong&gt; now work on strings, both within a catalog and across catalogs. When you paste, you choose between adding a new key with all its translations or applying translations to an existing key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One-click language removal&lt;/strong&gt; is available directly in the editor. You pick whether to remove the language from just that catalog or the entire project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-fill translations from an existing language&lt;/strong&gt; when adding a new supported language. Instead of starting from empty catalogs, you get a working baseline to edit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;BUILD_ONLY_KNOWN_LOCALIZATIONS&lt;/code&gt; build setting&lt;/strong&gt; limits compiled localized content to the languages in your Project Editor. Languages outside that list are visually de-emphasized in String Catalogs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strings are no longer extracted from code comments&lt;/strong&gt; by default. If you relied on this behavior, set &lt;code&gt;LOCALIZED_STRING_CODE_COMMENTS&lt;/code&gt; to &lt;code&gt;YES&lt;/code&gt; to restore it.&lt;/p&gt;

&lt;p&gt;Known issue: removing a language from a String Catalog inside a Swift Package can cause it to reappear.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing: Breaking Change and Key Fixes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;XCTest interoperability with Swift Testing is off by default now.&lt;/strong&gt; If your CI relies on it, you need to explicitly set the &lt;code&gt;SWIFT_TESTING_XCTEST_INTEROP_MODE&lt;/code&gt; environment variable to &lt;code&gt;limited&lt;/code&gt; in your test plan. This will break any pipeline that depended on the previous implicit behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Swift Testing now supports image attachments&lt;/strong&gt; directly for &lt;code&gt;CGImage&lt;/code&gt;, &lt;code&gt;NSImage&lt;/code&gt;, &lt;code&gt;UIImage&lt;/code&gt;, and &lt;code&gt;CIImage&lt;/code&gt;. You can also set severity levels when recording an Issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flaky test fix.&lt;/strong&gt; A bug where unwaited notification expectations could fire in unrelated tests has been resolved. If you have been chasing intermittent test failures in a project that creates expectations without always waiting for them, this might be the cause.&lt;/p&gt;

&lt;p&gt;Known limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;UIImage&lt;/code&gt; attachments do not work in Mac Catalyst test targets. Use &lt;code&gt;UIImage.cgImage&lt;/code&gt; as a workaround.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;continueAfterFailure&lt;/code&gt; is false, a failure in an async test method, setUp, or tearDown skips remaining retries. Workaround: enable "Relaunch Tests for Each Repetition" in the test plan.&lt;/li&gt;
&lt;li&gt;Swift Testing tests may crash on Rosetta run destinations on Apple Silicon. Use Xcode Universal to avoid this.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Build System and Package Manager
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Setting &lt;code&gt;SKIP_MERGEABLE_LIBRARY_BUNDLE_HOOK&lt;/code&gt; to &lt;code&gt;YES&lt;/code&gt; on mergeable libraries that do not access resources through standard Bundle APIs avoids extra launch-time overhead.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;swift test&lt;/code&gt; now correctly applies sanitizers when using &lt;code&gt;--sanitize&lt;/code&gt; together with &lt;code&gt;--filter&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Source editor tabs survive external tools (like &lt;code&gt;git&lt;/code&gt;) removing and re-adding files while Xcode is running.&lt;/li&gt;
&lt;li&gt;You can now enable package traits on dependencies directly from the Package Dependencies view.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Coding Intelligence Fixes
&lt;/h2&gt;

&lt;p&gt;Two fixes in this area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Externally configured MCP servers were being overwritten during Codex initialization. Fixed.&lt;/li&gt;
&lt;li&gt;Multiple "Allow Connection?" dialogs that appeared when external development tools connected to Xcode are resolved.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  System Requirements
&lt;/h2&gt;

&lt;p&gt;Xcode 26.4 requires &lt;strong&gt;macOS Tahoe 26.2 or later&lt;/strong&gt; on the host Mac.&lt;/p&gt;

&lt;p&gt;On-device debugging is supported for iOS 15+, tvOS 15+, watchOS 8+, and visionOS. The macOS requirement applies only to the machine running Xcode, not to your deployment targets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: Who Needs to Care About What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you...&lt;/th&gt;
&lt;th&gt;Pay attention to...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run CI on OS 26.4 devices&lt;/td&gt;
&lt;td&gt;Sanitizer hang fix -- upgrade Xcode immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do performance profiling&lt;/td&gt;
&lt;td&gt;Run Comparison, Top Functions, Power Profiler core breakdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintain C++ code&lt;/td&gt;
&lt;td&gt;Standard library algorithm performance improvements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ship localized apps&lt;/td&gt;
&lt;td&gt;String Catalog editor overhaul, &lt;code&gt;BUILD_ONLY_KNOWN_LOCALIZATIONS&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Swift Testing + XCTest&lt;/td&gt;
&lt;td&gt;Interoperability is off by default now -- explicit opt-in required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debug flaky tests&lt;/td&gt;
&lt;td&gt;Unwaited notification expectation fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use AI coding features&lt;/td&gt;
&lt;td&gt;MCP server overwrite fix, connection dialog fix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Where to Get It
&lt;/h2&gt;

&lt;p&gt;The full release notes are at &lt;a href="https://developer.apple.com/documentation/xcode-release-notes/xcode-26_4-release-notes" rel="noopener noreferrer"&gt;developer.apple.com/documentation/xcode-release-notes/xcode-26_4-release-notes&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ios</category>
      <category>mobile</category>
      <category>swift</category>
      <category>programming</category>
    </item>
    <item>
      <title>DeerFlow 2.0: What It Is, How It Works, and Why Developers Should Pay Attention</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Tue, 24 Mar 2026 09:24:15 +0000</pubDate>
      <link>https://forem.com/arshtechpro/deerflow-20-what-it-is-how-it-works-and-why-developers-should-pay-attention-3ip3</link>
      <guid>https://forem.com/arshtechpro/deerflow-20-what-it-is-how-it-works-and-why-developers-should-pay-attention-3ip3</guid>
      <description>&lt;p&gt;ByteDance open-sourced DeerFlow 2.0 on February 27, 2026, and within 24 hours it was sitting at the top of GitHub Trending. The repository has since accumulated around 25,000 stars and 3,000 forks. That kind of adoption is worth understanding, not just celebrating. This article breaks down what DeerFlow actually is, how its architecture fits together, and what it means for you as a developer who builds with or on top of AI systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Brief History: From Research Tool to Runtime
&lt;/h2&gt;

&lt;p&gt;DeerFlow started as an internal deep-research framework at ByteDance, essentially a tool for automating information gathering and summarization. Version 1 did that job well enough that developers outside its intended scope started bending it to do other things: building data pipelines, spinning up dashboards, automating content workflows.&lt;/p&gt;

&lt;p&gt;The ByteDance team noticed this and drew a conclusion: DeerFlow was not just a research tool. It was an execution harness waiting to be built properly.&lt;/p&gt;

&lt;p&gt;So they did a full rewrite. Version 2.0 shares no code with v1. If you need the original deep-research framework, it lives on the &lt;code&gt;1.x&lt;/code&gt; branch, which still receives contributions. But active development has moved entirely to 2.0, and the architecture is fundamentally different.&lt;/p&gt;




&lt;h2&gt;
  
  
  What DeerFlow 2.0 Actually Is
&lt;/h2&gt;

&lt;p&gt;DeerFlow calls itself a "SuperAgent harness." That label is doing real work, so it is worth unpacking.&lt;/p&gt;

&lt;p&gt;Most agent frameworks give you an agent that produces text. You ask it to research a topic, and it hands you back a string. You ask it to write code, and it gives you a code block. The execution of that code, the transformation of that report into a slide deck, the deployment of that scaffolded application -- all of that is your problem.&lt;/p&gt;

&lt;p&gt;DeerFlow closes that gap. It gives the agent an actual computer: an isolated Docker container with a full filesystem, a bash terminal, and the ability to read, write, and execute files. The agent does not suggest a bash command. It runs it. The agent does not sketch a web page. It builds and outputs one.&lt;/p&gt;

&lt;p&gt;That is the core shift. DeerFlow is an execution engine, not just a reasoning layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;DeerFlow is built on LangGraph and LangChain. Here is how the major pieces fit together.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Lead Agent and Task Decomposition
&lt;/h3&gt;

&lt;p&gt;When you give DeerFlow a complex prompt, such as "Research the top 10 AI startups in 2026 and build me a presentation," it does not try to handle this in a single linear pass.&lt;/p&gt;

&lt;p&gt;A lead agent acts as the orchestrator. It breaks the prompt into structured sub-tasks, decides which tasks can run in parallel, spawns sub-agents to handle them, and then synthesizes the results into a coherent output.&lt;/p&gt;

&lt;p&gt;Each sub-agent gets its own scoped context, its own tools, and its own termination conditions. Sub-agents run in parallel where possible. One might handle web scraping for funding data. Another might run competitor analysis. A third might generate charts. The lead agent pulls it all together at the end.&lt;/p&gt;

&lt;p&gt;This is how DeerFlow handles tasks that take minutes to hours without hitting context limits or losing coherence.&lt;/p&gt;
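&lt;p&gt;The decompose, fan out, synthesize pattern can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;. This is an illustration of the pattern, not DeerFlow's actual code; the sub-task names are invented:&lt;/p&gt;

```python
import asyncio

async def sub_agent(name, task):
    # Stand-in for a scoped sub-agent: in DeerFlow each one gets its own
    # context, tools, and termination conditions.
    await asyncio.sleep(0.01)   # simulate tool use and model calls
    return f"{name}: findings on {task}"

async def lead_agent(prompt):
    # 1. Decompose the prompt into sub-tasks (a real lead agent does this
    #    with an LLM call).
    subtasks = ["funding data", "competitor analysis", "chart generation"]
    # 2. Fan out: independent sub-tasks run concurrently.
    results = await asyncio.gather(
        *(sub_agent(f"agent-{i}", t) for i, t in enumerate(subtasks)))
    # 3. Synthesize the sub-agent outputs into one answer.
    return "\n".join(results)

report = asyncio.run(lead_agent("Research top AI startups and build a deck"))
```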

&lt;h3&gt;
  
  
  The Sandbox
&lt;/h3&gt;

&lt;p&gt;Each task runs inside an isolated Docker container. This container has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A persistent filesystem (skills, workspace, uploads, outputs)&lt;/li&gt;
&lt;li&gt;A bash terminal&lt;/li&gt;
&lt;li&gt;The ability to execute Python scripts and arbitrary shell commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a simulation. The agent can create files, modify them, run scripts against them, and produce outputs you can actually download and use. That is the distinction from most agent frameworks, which essentially role-play execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills
&lt;/h3&gt;

&lt;p&gt;DeerFlow's extensibility mechanism is called Skills. A Skill is a Markdown file that defines a workflow, describes best practices, and references supporting resources. The format is intentionally plain: if you can write Markdown, you can write a DeerFlow Skill.&lt;/p&gt;

&lt;p&gt;DeerFlow ships with built-in Skills for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep web research&lt;/li&gt;
&lt;li&gt;Report generation&lt;/li&gt;
&lt;li&gt;Slide deck creation&lt;/li&gt;
&lt;li&gt;Web page generation&lt;/li&gt;
&lt;li&gt;Image and video generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can add your own. A Skill tells the lead agent how to approach a category of task, what tools to use, and what the output should look like. The agent loads relevant Skills progressively, which keeps context consumption manageable.&lt;/p&gt;
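&lt;p&gt;To give a sense of the shape, a custom Skill might look like the following. The section names here are hypothetical; the source only specifies that a Skill is Markdown describing a workflow, best practices, and supporting resources:&lt;/p&gt;

```markdown
# Skill: Weekly Metrics Report

## When to use
The user asks for a recurring summary of product metrics.

## Workflow
1. Load the latest CSV from the `uploads/` directory in the sandbox.
2. Write a Python script to aggregate the metrics; execute it.
3. Render charts and save them to `outputs/`.
4. Assemble a Markdown report that links the charts.

## Best practices
- Prefer absolute dates over "last week" in the report body.
- Keep each chart to a single metric.
```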

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;p&gt;DeerFlow has a persistent memory system. It tracks user preferences, writing styles, project structures, and other context across sessions. Memory updates happen asynchronously through a debounced queue so they do not block the main conversation thread.&lt;/p&gt;

&lt;p&gt;The project recently added TIAMAT as a cloud memory backend, which suggests ByteDance is thinking beyond local development toward enterprise-scale persistence.&lt;/p&gt;

&lt;p&gt;It is worth being clear-eyed about this: persistent memory in agent systems is still an unsolved problem in practice. Confidence scoring on stored facts sounds good in theory and fails in interesting ways in production. The architecture here is thoughtful, but you should verify memory behavior in your own workloads before depending on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Agnosticism
&lt;/h3&gt;

&lt;p&gt;DeerFlow integrates with any OpenAI-compatible API. You can point it at GPT-4, Claude, Gemini, DeepSeek, or local models via Ollama without changing the agent logic.&lt;/p&gt;

&lt;p&gt;The repository recommends Doubao-Seed-2.0-Code (ByteDance's own model), DeepSeek v3.2, and Kimi 2.5 for best results. That recommendation matters: DeerFlow's lead agent needs strong instruction-following and structured output capabilities to decompose tasks properly. Smaller local models will likely struggle with the orchestration layer even if they handle individual sub-tasks acceptably. If you are running local models, start with Qwen 3.5 or DeepSeek before reaching for something smaller.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;DeerFlow requires Docker (for the sandbox), Node.js (for the frontend), and Python 3.11+ (for the backend). The general setup path looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/bytedance/deer-flow.git
&lt;span class="nb"&gt;cd &lt;/span&gt;deer-flow

&lt;span class="c"&gt;# Copy environment config&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env to add your API keys and model endpoint&lt;/span&gt;

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
npm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Start the service&lt;/span&gt;
docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The web interface will be available at &lt;code&gt;localhost:3000&lt;/code&gt;. You interact with DeerFlow through chat, and it handles the orchestration behind the scenes.&lt;/p&gt;

&lt;p&gt;For API integration, DeerFlow exposes a REST interface. You can send a task programmatically and poll for results or stream the output as the agent works through sub-tasks.&lt;/p&gt;
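&lt;p&gt;In outline, that integration is a submit-then-poll loop. The endpoint paths below are hypothetical placeholders (check the repository's API documentation for the real routes); only the pattern is the point:&lt;/p&gt;

```python
import json
import time
import urllib.request

BASE = "http://localhost:8000"   # assumed backend address

def submit(prompt):
    # Hypothetical submit route: POST the task, get back an id.
    req = urllib.request.Request(
        f"{BASE}/api/tasks",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["task_id"]

def poll(task_id, fetch=None, interval=2.0):
    # `fetch` is injectable so the loop can be exercised without a server.
    fetch = fetch or (lambda tid: json.load(
        urllib.request.urlopen(f"{BASE}/api/tasks/{tid}")))
    while True:
        status = fetch(task_id)
        if status["state"] in ("done", "failed"):
            return status
        time.sleep(interval)
```

&lt;p&gt;Streaming works the same way conceptually: instead of polling a status route, you hold a connection open and consume sub-task events as the agent emits them.&lt;/p&gt;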




&lt;h2&gt;
  
  
  What You Can Actually Build With It
&lt;/h2&gt;

&lt;p&gt;Here are some concrete use cases that community members have demonstrated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research and report generation.&lt;/strong&gt; Give DeerFlow a topic and it will search the web, gather sources, generate charts from the data it finds, and produce a formatted report with citations. Not a text summary: an actual document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data pipeline automation.&lt;/strong&gt; DeerFlow can receive a dataset, write Python scripts to clean and transform it, execute those scripts in the sandbox, and return the processed output. The sandbox means this runs isolated from your host environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slide deck creation.&lt;/strong&gt; Feed it a research brief and it will generate a slide deck, including sourced visuals and structured content. This is a built-in Skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full-stack web application scaffolding.&lt;/strong&gt; Developers have used DeerFlow to go from a prompt describing an application to a working codebase. The agent writes the code, runs tests, iterates on failures, and returns a working project directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive analysis.&lt;/strong&gt; Spawn multiple sub-agents to research different competitors in parallel, then consolidate the findings into a comparison document.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things to Think Carefully About
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Security and governance.&lt;/strong&gt; DeerFlow executes code in Docker containers, fetches external content, and writes to a filesystem. For local development and experimentation, the defaults are fine. For any production or enterprise deployment, you need to think about container hardening, network egress restrictions, and supply chain analysis. The code is MIT-licensed and auditable, which is a genuine advantage here, but auditing takes time and effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ByteDance provenance.&lt;/strong&gt; Depending on your organization or sector, the country-of-origin and ownership context may trigger an additional review process. This is not a technical concern about the code itself. It is an organizational one, and it varies by context. Be aware of it rather than dismissing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local model limitations.&lt;/strong&gt; If you plan to run DeerFlow with local models to avoid API costs, test the orchestration layer specifically. Task decomposition and sub-agent spawning require strong structured output capabilities. Many smaller models will not handle this reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory reliability.&lt;/strong&gt; The persistent memory system is architecturally thoughtful, but agent memory is still a difficult problem. Do not assume it will recall the right things at the right times without testing against your specific use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;DeerFlow occupies a different position than frameworks like LangGraph (which it uses internally), CrewAI, or Microsoft's AutoGen. Those tools give you building blocks. DeerFlow gives you a running system with opinions: a default execution model, built-in Skills, a sandbox, and a memory layer. You can extend it, but you are extending something that already works, not assembling from scratch.&lt;/p&gt;

&lt;p&gt;That is a tradeoff. You move faster with DeerFlow's defaults. You also take on its constraints and assumptions. For teams that want to ship an agentic workflow quickly without building all the infrastructure themselves, DeerFlow's opinionated architecture is an asset. For teams with highly specific orchestration requirements, it may feel like friction.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>NVIDIA: Training Billion-Parameter Models: A Developer's Guide to Megatron-LM</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Sat, 21 Mar 2026 19:22:08 +0000</pubDate>
      <link>https://forem.com/arshtechpro/training-billion-parameter-models-a-developers-guide-to-megatron-lm-4ali</link>
      <guid>https://forem.com/arshtechpro/training-billion-parameter-models-a-developers-guide-to-megatron-lm-4ali</guid>
      <description>&lt;p&gt;If you have ever tried to train a large language model on a single GPU and watched it crash with an out-of-memory error, you already know the problem. Models that matter today — the ones with tens or hundreds of billions of parameters — simply do not fit on one device. Megatron-LM is NVIDIA's answer to that problem, and it has been quietly powering some of the most serious LLM research and production training runs in the world.&lt;/p&gt;

&lt;p&gt;This article walks you through what Megatron-LM actually is, how its internals work, and how to go from zero to a real training run.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Megatron-LM, and Why Should You Care
&lt;/h2&gt;

&lt;p&gt;The repository at &lt;a href="https://github.com/NVIDIA/Megatron-LM" rel="noopener noreferrer"&gt;github.com/NVIDIA/Megatron-LM&lt;/a&gt; contains two things that are worth keeping distinct in your head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Megatron-LM&lt;/strong&gt; is the research-oriented layer — it bundles ready-to-run training scripts for GPT, BERT, T5, LLaMA, and multimodal models. If you want to get something running quickly without writing glue code from scratch, this is your starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Megatron Core&lt;/strong&gt; is the composable library underneath. It exposes the GPU-optimized building blocks (attention layers, parallelism strategies, optimizers, dataset loaders) as importable Python modules so you can assemble your own training framework on top of them. If you are a framework engineer or ML infrastructure developer, this is the layer you actually care about.&lt;/p&gt;

&lt;p&gt;The benchmark numbers give you a sense of the scale this is designed for. The codebase has been used to train models ranging from 2B to 462B parameters across thousands of H100 GPUs, achieving up to 47% Model FLOP Utilization (MFU). That number matters because MFU measures how efficiently you are using the hardware you paid for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Parallelism — This Is the Core Idea
&lt;/h2&gt;

&lt;p&gt;Before you touch a line of code, you need to understand how Megatron distributes a model that does not fit in memory. There are five strategies, and Megatron uses them in combination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tensor Parallelism (TP)&lt;/strong&gt; splits individual weight matrices across GPUs. A single attention layer's weight matrix gets sliced column-wise or row-wise across multiple devices. During the forward pass, each GPU handles a slice of the computation, and a small all-reduce communication syncs the results. The math is elegant: for a linear layer &lt;code&gt;Y = XA&lt;/code&gt;, you can split &lt;code&gt;A&lt;/code&gt; column-wise so GPU 0 computes &lt;code&gt;XA_0&lt;/code&gt; and GPU 1 computes &lt;code&gt;XA_1&lt;/code&gt;, then concatenate. This is why Megatron can fit a single transformer layer that is too large for one device.&lt;/p&gt;
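&lt;p&gt;You can verify the column-split identity in a few lines of NumPy. This is a single-process sketch of the math only, not how Megatron actually shards tensors across devices:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: batch x hidden
A = rng.standard_normal((8, 16))  # weight matrix of a linear layer

# Split A column-wise across two "GPUs"
A0, A1 = np.split(A, 2, axis=1)

# Each shard computes its slice of the output independently
Y0 = X @ A0   # what GPU 0 would compute
Y1 = X @ A1   # what GPU 1 would compute

# Concatenating the slices reproduces the full result
Y = np.concatenate([Y0, Y1], axis=1)
assert np.allclose(Y, X @ A)
```

&lt;p&gt;The row-wise split works the same way, except each shard produces a partial sum of the full output and the results are added together, which is the all-reduce mentioned above.&lt;/p&gt;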

&lt;p&gt;&lt;strong&gt;Pipeline Parallelism (PP)&lt;/strong&gt; assigns different layers to different GPUs. GPU 0 handles layers 0–7, GPU 1 handles layers 8–15, and so on. Data flows through the pipeline in micro-batches, and Megatron's interleaved scheduling keeps the GPUs from sitting idle while waiting for the previous stage to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Parallelism (DP)&lt;/strong&gt; is the familiar one: you run copies of the model on multiple GPUs with different data shards, then average the gradients. Megatron supports both standard DDP and a distributed optimizer that shards optimizer states to reduce memory further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Parallelism (CP)&lt;/strong&gt; is newer and specifically useful for long sequences. It distributes the sequence dimension across GPUs so you can train on sequences that would otherwise blow your memory budget. The recent Dynamic Context Parallelism feature pushes this further by adapting the parallel size per batch based on actual sequence lengths, yielding up to 1.48x speedup for variable-length training.&lt;/p&gt;
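&lt;p&gt;One way to build intuition for the dynamic sizing: pick the smallest power-of-two parallel size that brings each GPU's sequence shard under a token budget. The heuristic below is illustrative only, not DCP's actual algorithm:&lt;/p&gt;

```python
def pick_cp_size(seq_len, shard_budget, max_cp=8):
    """Smallest power-of-two CP size whose per-GPU shard fits the budget (toy heuristic)."""
    cp = 1
    while seq_len > shard_budget * cp and max_cp > cp:
        cp *= 2
    return cp

# Short batches keep CP small; long ones spread the sequence over more GPUs
assert pick_cp_size(4096, shard_budget=8192) == 1
assert pick_cp_size(65536, shard_budget=8192) == 8
```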

&lt;p&gt;&lt;strong&gt;Expert Parallelism (EP)&lt;/strong&gt; is relevant if you are training Mixture-of-Experts models like DeepSeek-V3 or Qwen3. Different experts in the MoE layer live on different GPUs.&lt;/p&gt;

&lt;p&gt;In practice you combine these. A typical 70B model training run might use TP=4, PP=4, DP=8, giving you 128 GPUs working together. Megatron handles the communication scheduling so you do not have to wire it up yourself.&lt;/p&gt;
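&lt;p&gt;The arithmetic behind that layout is just the product of the parallel sizes. A quick sanity check in plain Python (this helper is an illustration, not a Megatron utility):&lt;/p&gt;

```python
def total_gpus(tp, pp, dp, cp=1):
    """World size implied by a tensor/pipeline/data/context parallel layout."""
    return tp * pp * dp * cp

# The 70B example above: TP=4, PP=4, DP=8
assert total_gpus(tp=4, pp=4, dp=8) == 128
```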




&lt;h2&gt;
  
  
  Project Structure at a Glance
&lt;/h2&gt;

&lt;p&gt;The repository is well organized once you know where to look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Megatron-LM/
├── megatron/
│   ├── core/               # The library — import this in your own code
│   │   ├── models/         # GPT, BERT, T5, multimodal
│   │   ├── transformer/    # Attention, MLP, layer building blocks
│   │   ├── tensor_parallel/
│   │   ├── pipeline_parallel/
│   │   ├── distributed/    # FSDP, DDP
│   │   ├── optimizer/
│   │   ├── datasets/
│   │   └── inference/
│   ├── training/           # High-level training scripts
│   └── post_training/      # Quantization, distillation, pruning
├── examples/               # Shell scripts for GPT, LLaMA, Mixtral, etc.
├── tools/                  # Data preprocessing utilities
└── tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The split between &lt;code&gt;megatron/core/&lt;/code&gt; and &lt;code&gt;megatron/training/&lt;/code&gt; mirrors the Megatron Core vs Megatron-LM distinction described above. If you are building something custom, spend most of your time in &lt;code&gt;core/&lt;/code&gt;. If you are running experiments, the &lt;code&gt;examples/&lt;/code&gt; directory has working shell scripts you can adapt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting It Running
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One or more NVIDIA GPUs (Ampere or later recommended; H100 for FP8 training)&lt;/li&gt;
&lt;li&gt;CUDA 12.x&lt;/li&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;PyTorch 2.x&lt;/li&gt;
&lt;li&gt;NCCL (usually installed with CUDA)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The easiest path is NVIDIA's NGC container, which ships with all dependencies pre-installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull nvcr.io/nvidia/pytorch:24.01-py3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not using a container, you will also need Transformer Engine for FP8 support and Flash Attention for efficient attention computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Megatron Core with its language model dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt; megatron-core[mlm,dev]

&lt;span class="c"&gt;# Clone the repo for examples and training scripts&lt;/span&gt;
git clone https://github.com/NVIDIA/Megatron-LM.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Megatron-LM
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt; .[mlm,dev]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--no-build-isolation&lt;/code&gt; flag matters here. Megatron builds some CUDA extensions during install and needs access to your environment's CUDA headers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your First Training Run
&lt;/h3&gt;

&lt;p&gt;Once installed, the simplest way to verify everything works is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run a distributed training loop on 2 GPUs with mock data&lt;/span&gt;
torchrun &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 examples/run_simple_mcore_train_loop.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to try something closer to a real use case, the LLaMA-3 example uses 8 GPUs with FP8 precision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./examples/llama/train_llama3_8b_fp8.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These scripts handle the argument wiring for you. Once you understand what the flags mean, you can start modifying them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Preparation
&lt;/h2&gt;

&lt;p&gt;This is where most people get tripped up the first time. Megatron does not consume raw text files. It needs data preprocessed into a binary indexed format (&lt;code&gt;.bin&lt;/code&gt; + &lt;code&gt;.idx&lt;/code&gt; files) for memory-mapped, high-throughput loading during training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Format Your Data as JSONL
&lt;/h3&gt;

&lt;p&gt;Each line is a JSON object with a &lt;code&gt;text&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"text": "Your first training document goes here."}
{"text": "Each document is a separate JSON line."}
{"text": "The tokenizer will handle the rest."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
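&lt;p&gt;Producing that file from an in-memory list of documents takes only the standard library:&lt;/p&gt;

```python
import json

documents = [
    "Your first training document goes here.",
    "Each document is a separate JSON line.",
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        # ensure_ascii=False keeps non-ASCII text human-readable in the file
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
```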



&lt;h3&gt;
  
  
  Step 2: Run the Preprocessor
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python tools/preprocess_data.py &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--input&lt;/span&gt; data.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output-prefix&lt;/span&gt; /path/to/processed_data &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tokenizer-type&lt;/span&gt; HuggingFaceTokenizer &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tokenizer-model&lt;/span&gt; /path/to/your/tokenizer &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--workers&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--append-eod&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--output-prefix&lt;/code&gt;: The path prefix for the &lt;code&gt;.bin&lt;/code&gt; and &lt;code&gt;.idx&lt;/code&gt; output files. Pass this same prefix to &lt;code&gt;--data-path&lt;/code&gt; in your training script.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--tokenizer-type&lt;/code&gt;: Use &lt;code&gt;HuggingFaceTokenizer&lt;/code&gt; for any tokenizer that follows the Hugging Face interface. &lt;code&gt;GPT2BPETokenizer&lt;/code&gt; is available for GPT-2's original BPE tokenizer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--workers&lt;/code&gt;: Parallelizes tokenization. Set this to the number of CPU cores you can spare.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--append-eod&lt;/code&gt;: Adds an end-of-document token between documents. Almost always what you want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step can take a while for large datasets. The output is worth the wait — the indexed binary format lets training data loaders access tokens at random offsets in O(1) time, which is critical when you have trillions of tokens.&lt;/p&gt;
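&lt;p&gt;The mechanism behind that O(1) access can be pictured as two arrays: a flat buffer of all token IDs and an offset table marking where each document starts. The sketch below mimics the concept only; Megatron's real on-disk format carries more metadata:&lt;/p&gt;

```python
import numpy as np

# Tokenized documents of different lengths
docs = [np.array([5, 9, 2], dtype=np.uint16),
        np.array([7, 7], dtype=np.uint16),
        np.array([1, 3, 3, 8], dtype=np.uint16)]

# ".bin": every token concatenated; ".idx": start offset of each document
bin_arr = np.concatenate(docs)
idx = np.cumsum([0] + [len(d) for d in docs])  # [0, 3, 5, 9]

def get_doc(i):
    """Constant-time random access to document i via the offset table."""
    return bin_arr[idx[i]:idx[i + 1]]

assert list(get_doc(1)) == [7, 7]
```

&lt;p&gt;In the real format the &lt;code&gt;.bin&lt;/code&gt; file is memory-mapped, so a training run only touches the pages it actually reads.&lt;/p&gt;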




&lt;h2&gt;
  
  
  Writing a Custom Training Loop with Megatron Core
&lt;/h2&gt;

&lt;p&gt;If you want to plug Megatron's parallelism into your own training code rather than using the provided scripts, here is the minimal structure. This is what &lt;code&gt;examples/run_simple_mcore_train_loop.py&lt;/code&gt; does under the hood.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;megatron.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parallel_state&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;megatron.core.models.gpt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GPTModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;megatron.core.transformer.transformer_config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TransformerConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize distributed and parallelism groups
&lt;/span&gt;&lt;span class="n"&gt;parallel_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_model_parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tensor_model_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_model_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define your model configuration
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TransformerConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_attention_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_cpu_initialization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Build the model — Megatron handles sharding based on parallel_state
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;megatron.core.models.gpt.gpt_layer_specs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_gpt_layer_local_spec&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GPTModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transformer_layer_spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_gpt_layer_local_spec&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50257&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_sequence_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From here, training looks mostly like standard PyTorch
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key thing to notice is that after &lt;code&gt;initialize_model_parallel&lt;/code&gt;, the model construction is aware of which GPU it is running on. When you call &lt;code&gt;GPTModel(...)&lt;/code&gt;, Megatron automatically places the right layers on the right devices based on your TP and PP settings. You do not have to manually slice weight matrices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parallelism Configuration in Practice
&lt;/h2&gt;

&lt;p&gt;When you run the provided shell scripts, you will see flags like these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--tensor-model-parallel-size&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--pipeline-model-parallel-size&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-layers&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--hidden-size&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-attention-heads&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--micro-batch-size&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--global-batch-size&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--seq-length&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few rules of thumb:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tensor parallel size&lt;/strong&gt; should evenly divide &lt;code&gt;num-attention-heads&lt;/code&gt; and the MLP intermediate size. For a model with 32 attention heads, TP sizes of 1, 2, 4, 8, or 16 all work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline parallel size&lt;/strong&gt; should evenly divide &lt;code&gt;num-layers&lt;/code&gt;. 32 layers with PP=4 means 8 layers per pipeline stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global batch size&lt;/strong&gt; equals &lt;code&gt;micro-batch-size * gradient-accumulation-steps * data-parallel-size&lt;/code&gt;. Megatron enforces this math and will error if your settings are inconsistent.&lt;/p&gt;
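&lt;p&gt;These constraints are easy to check before you burn a cluster allocation on a misconfigured launch. A small validator in plain Python (not part of Megatron) that mirrors the rules above:&lt;/p&gt;

```python
def check_parallel_config(num_heads, num_layers, tp, pp,
                          micro_batch, global_batch, dp):
    """Raise if a parallelism/batch layout violates the constraints above."""
    if num_heads % tp:
        raise ValueError(f"TP={tp} must divide num_attention_heads={num_heads}")
    if num_layers % pp:
        raise ValueError(f"PP={pp} must divide num_layers={num_layers}")
    samples_per_step = micro_batch * dp
    if global_batch % samples_per_step:
        raise ValueError("global batch must be a multiple of micro_batch * dp")
    return global_batch // samples_per_step  # gradient accumulation steps

# The flags above: 32 heads and layers, TP=4, PP=4, micro=2, global=1024, DP=8
assert check_parallel_config(32, 32, 4, 4, 2, 1024, 8) == 64
```

&lt;p&gt;The return value is the gradient accumulation step count implied by the same numbers.&lt;/p&gt;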

&lt;p&gt;&lt;strong&gt;Communication overlap flags&lt;/strong&gt; are worth enabling once things are working. These flags let Megatron overlap gradient reduction with backward computation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--overlap-grad-reduce&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--overlap-param-gather&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--tp-comm-overlap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They typically improve throughput by 5–15% at large scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Checkpoint Conversion: Getting In and Out of the Megatron Format
&lt;/h2&gt;

&lt;p&gt;One practical concern is that Megatron's checkpoint format is not the same as Hugging Face's. If you want to start from a pretrained Hugging Face model or publish your trained weights back, you need to convert.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/NVIDIA-NeMo/Megatron-Bridge" rel="noopener noreferrer"&gt;Megatron Bridge&lt;/a&gt; project handles this bidirectionally. It supports popular models and is the recommended path for production checkpoint management.&lt;/p&gt;

&lt;p&gt;For LLaMA specifically, the Megatron-LM repository includes conversion scripts under &lt;code&gt;tools/&lt;/code&gt; that have been used for Llama, Mistral, and other Llama-derived architectures.&lt;/p&gt;




&lt;h2&gt;
  
  
  FP8 Training
&lt;/h2&gt;

&lt;p&gt;If you are running on H100 or later GPUs, FP8 training is worth trying. It reduces memory usage and can increase throughput significantly because Tensor Cores on Hopper are much faster in FP8 than BF16.&lt;/p&gt;

&lt;p&gt;Enabling it is mostly a matter of setting flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--fp8-format&lt;/span&gt; hybrid &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--fp8-amax-history-len&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--fp8-amax-compute-algo&lt;/span&gt; max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also need Transformer Engine installed, which handles the actual FP8 kernels. The &lt;code&gt;train_llama3_8b_fp8.sh&lt;/code&gt; example shows a working configuration if you want to see all the pieces together.&lt;/p&gt;




&lt;h2&gt;
  
  
  MoE Models
&lt;/h2&gt;

&lt;p&gt;The MoE (Mixture of Experts) support in Megatron Core has been one of the most active development areas recently. DeepSeek-V3 and Qwen3-style MoE architectures are explicitly supported, and there is an active roadmap for further optimizations through 2025.&lt;/p&gt;

&lt;p&gt;The key additional parallelism dimension for MoE is Expert Parallelism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--expert-model-parallel-size&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-experts&lt;/span&gt; 64 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--moe-router-topk&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 64 experts and EP=8, each GPU holds 8 experts. During the forward pass, tokens are routed to their top-2 experts, and the MoE communication layer handles moving tokens across GPU boundaries as needed.&lt;/p&gt;
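&lt;p&gt;The routing step itself is a small computation. A NumPy sketch of the token-to-expert-to-GPU mapping just described (illustrative only; Megatron's router also deals with load balancing and capacity limits):&lt;/p&gt;

```python
import numpy as np

num_experts, ep_size, top_k = 64, 8, 2
experts_per_gpu = num_experts // ep_size  # 8 experts live on each GPU

rng = np.random.default_rng(0)
router_logits = rng.standard_normal((5, num_experts))  # scores for 5 tokens

# Each token selects its top-2 experts by router score
top_experts = np.argsort(router_logits, axis=1)[:, -top_k:]

# Which GPU owns each chosen expert, assuming contiguous placement
owning_gpu = top_experts // experts_per_gpu

assert top_experts.shape == (5, top_k)
assert int(owning_gpu.max()) in range(ep_size)
```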




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The thing that makes Megatron-LM worth learning is not any single feature — it is that the entire system is designed around the constraint that you are always working at the edge of what the hardware can do.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nvidia</category>
      <category>programming</category>
    </item>
    <item>
      <title>NemoClaw: NVIDIA's Open Source Stack for Running AI Agents You Can Actually Trust</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Thu, 19 Mar 2026 11:30:55 +0000</pubDate>
      <link>https://forem.com/arshtechpro/nemoclaw-nvidias-open-source-stack-for-running-ai-agents-you-can-actually-trust-50gl</link>
      <guid>https://forem.com/arshtechpro/nemoclaw-nvidias-open-source-stack-for-running-ai-agents-you-can-actually-trust-50gl</guid>
      <description>&lt;p&gt;AI agents have crossed a threshold. They're no longer chatbots that answer questions and forget you exist. The new generation can remember context across sessions, spawn sub-agents, write their own code to learn new skills, and keep executing tasks long after you close your laptop. Tools like OpenClaw have made it possible for a single developer to spin up an autonomous assistant that works like a small team.&lt;/p&gt;

&lt;p&gt;That's exciting. It's also terrifying if you think about it for more than five seconds.&lt;/p&gt;

&lt;p&gt;A long-running agent with persistent shell access, live credentials, and the ability to rewrite its own tooling is a fundamentally different threat model than a stateless chatbot. Every prompt injection becomes a potential credential leak. Every third-party skill the agent installs is an unreviewed binary with filesystem access. Every sub-agent it spawns can inherit permissions it was never meant to have.&lt;/p&gt;

&lt;p&gt;The agents are ready. The infrastructure to trust them has been missing — until now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is NemoClaw?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;NemoClaw&lt;/strong&gt; is an open source stack from NVIDIA that wraps &lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; (the popular always-on AI assistant) with enterprise-grade privacy and security controls. It's built on top of two key NVIDIA projects:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenShell&lt;/strong&gt; — an open source runtime (part of the NVIDIA Agent Toolkit) that acts as a governance layer between your agent and your infrastructure. Think of it as a browser sandbox, but for AI agents. It controls what the agent can see, do, and where its inference requests go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nemotron&lt;/strong&gt; — NVIDIA's family of open models that can run locally on your own hardware for enhanced privacy and cost efficiency.&lt;/p&gt;

&lt;p&gt;The whole point: you get the productivity of autonomous agents without giving up control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;If you're building with or deploying AI agents, you've likely hit the "trust trilemma." You need three things simultaneously: &lt;strong&gt;safety&lt;/strong&gt;, &lt;strong&gt;capability&lt;/strong&gt;, and &lt;strong&gt;autonomy&lt;/strong&gt;. With existing approaches, you can only reliably get two at a time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safe + autonomous&lt;/strong&gt; but the agent can't access the tools and data it needs → it can't finish the job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capable + safe&lt;/strong&gt; but gated on constant approvals → you're babysitting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capable + autonomous&lt;/strong&gt; with full access → a long-running process policing itself, with guardrails living inside the same process they're supposed to guard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last scenario is the critical failure mode. Tools like Claude Code and Cursor ship with valuable internal guardrails, but those protections live &lt;em&gt;inside&lt;/em&gt; the agent. A compromised agent can potentially override them.&lt;/p&gt;

&lt;p&gt;NemoClaw solves this by moving the control point &lt;strong&gt;outside&lt;/strong&gt; the agent entirely. The agent literally cannot override the security policies because they're enforced at the infrastructure level, not the prompt level.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works Under the Hood
&lt;/h2&gt;

&lt;p&gt;NemoClaw's architecture has four main components:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Plugin (CLI)
&lt;/h3&gt;

&lt;p&gt;A TypeScript CLI that orchestrates everything. You use &lt;code&gt;nemoclaw&lt;/code&gt; commands on your host machine to launch, connect to, and manage sandboxed agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Blueprint
&lt;/h3&gt;

&lt;p&gt;A versioned Python artifact that handles sandbox creation, policy configuration, and inference setup. It follows a four-stage lifecycle: resolve the artifact → verify its digest → plan resources → apply through the OpenShell CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Sandbox
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. It's not generic container isolation — it's purpose-built for long-running, self-evolving agents. The sandbox provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Landlock + seccomp + network namespace isolation&lt;/strong&gt; — the agent runs in a locked-down environment at the OS level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filesystem restrictions&lt;/strong&gt; — the agent can only read/write inside &lt;code&gt;/sandbox&lt;/code&gt; and &lt;code&gt;/tmp&lt;/code&gt;. Everything else is off-limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network egress control&lt;/strong&gt; — unauthorized outbound connections are blocked. If the agent tries to reach an unlisted host, OpenShell blocks it and surfaces the request for your approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process protection&lt;/strong&gt; — privilege escalation and dangerous syscalls are blocked at sandbox creation time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live policy updates&lt;/strong&gt; — network and inference policies can be hot-reloaded at runtime as you approve new permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The Inference Layer
&lt;/h3&gt;

&lt;p&gt;Inference requests from the agent never leave the sandbox directly. OpenShell intercepts every call and routes it through a &lt;strong&gt;privacy router&lt;/strong&gt;. This router decides — based on &lt;em&gt;your&lt;/em&gt; policy, not the agent's preferences — whether a request goes to a local model (like Nemotron running on your GPU) or to a cloud-based frontier model.&lt;/p&gt;

&lt;p&gt;The default setup routes through NVIDIA's cloud API using &lt;code&gt;nvidia/nemotron-3-super-120b-a12b&lt;/code&gt;. Local inference via Ollama and vLLM is experimental but available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started (One Command, Seriously)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You need a Linux machine (Ubuntu 22.04+) with Docker, Node.js 20+, and at least 8 GB of RAM (16 GB recommended). macOS is supported via Colima or Docker Desktop on Apple Silicon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and Onboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://www.nvidia.com/nemoclaw.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The script installs Node.js if needed, then runs a guided onboarding wizard that creates a sandbox, configures inference, and applies security policies. It'll prompt you for an NVIDIA API key (grab one free from their website).&lt;/p&gt;

&lt;p&gt;When it finishes, you'll see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;──────────────────────────────────────────────────
Sandbox      my-assistant (Landlock + seccomp + netns)
Model        nvidia/nemotron-3-super-120b-a12b (NVIDIA Cloud API)
──────────────────────────────────────────────────
Run:         nemoclaw my-assistant connect
Status:      nemoclaw my-assistant status
Logs:        nemoclaw my-assistant logs --follow
──────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Connect and Chat
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Open a shell inside the sandbox&lt;/span&gt;
nemoclaw my-assistant connect

&lt;span class="c"&gt;# Launch the interactive TUI&lt;/span&gt;
sandbox@my-assistant:~&lt;span class="nv"&gt;$ &lt;/span&gt;openclaw tui

&lt;span class="c"&gt;# Or use the CLI for a single message&lt;/span&gt;
sandbox@my-assistant:~&lt;span class="nv"&gt;$ &lt;/span&gt;openclaw agent &lt;span class="nt"&gt;--agent&lt;/span&gt; main &lt;span class="nt"&gt;--local&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Commands Cheat Sheet
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nemoclaw onboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive setup wizard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nemoclaw &amp;lt;name&amp;gt; connect&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shell into a sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nemoclaw &amp;lt;name&amp;gt; status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check sandbox health&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nemoclaw &amp;lt;name&amp;gt; logs --follow&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stream logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;nemoclaw start&lt;/code&gt; / &lt;code&gt;stop&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Manage auxiliary services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openshell term&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Launch OpenShell TUI for monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Protection Model Explained
&lt;/h2&gt;

&lt;p&gt;Here's what's actually enforced — and when:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Guards&lt;/th&gt;
&lt;th&gt;When Applied&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocks unauthorized outbound connections&lt;/td&gt;
&lt;td&gt;Hot-reloadable at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Filesystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No reads/writes outside &lt;code&gt;/sandbox&lt;/code&gt; and &lt;code&gt;/tmp&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Locked at sandbox creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Process&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocks privilege escalation and dangerous syscalls&lt;/td&gt;
&lt;td&gt;Locked at sandbox creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reroutes model API calls to controlled backends&lt;/td&gt;
&lt;td&gt;Hot-reloadable at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The critical design choice: filesystem and process restrictions are &lt;strong&gt;locked at creation time&lt;/strong&gt;. The agent can't unlock them mid-session, even if compromised. Network and inference policies can be updated live — but only by &lt;em&gt;you&lt;/em&gt;, from outside the sandbox.&lt;/p&gt;

&lt;p&gt;When the agent hits a constraint, it can reason about why it's blocked and propose a policy update. You see the request in the OpenShell TUI and make the final call. Full audit trail of every allow and deny decision.&lt;/p&gt;
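&lt;p&gt;That split is easy to picture in code. Here is a toy policy object (hypothetical names, not NemoClaw's real API) where filesystem rules are frozen at construction, the network allowlist can only grow through an operator-side call, and every allow/deny decision lands in an audit log:&lt;/p&gt;

```python
class SandboxPolicy:
    """Toy model of creation-locked vs hot-reloadable policy.

    Hypothetical class and method names, for illustration only.
    """

    def __init__(self, allowed_paths, allowed_hosts):
        # Filesystem rules are frozen at sandbox creation time.
        self._allowed_paths = frozenset(allowed_paths)
        # Network rules stay mutable, but only via operator approval.
        self._allowed_hosts = set(allowed_hosts)
        self.audit_log = []

    def check_write(self, path):
        ok = any(path.startswith(p) for p in self._allowed_paths)
        self.audit_log.append(("write", path, ok))
        return ok

    def check_egress(self, host):
        ok = host in self._allowed_hosts
        self.audit_log.append(("egress", host, ok))
        return ok

    def approve_host(self, host):
        """Operator-only hot reload: extend the allowlist at runtime."""
        self._allowed_hosts.add(host)

policy = SandboxPolicy(["/sandbox", "/tmp"], ["api.nvidia.com"])
print(policy.check_write("/etc/passwd"))    # False: outside /sandbox and /tmp
print(policy.check_egress("evil.example"))  # False until the operator approves
policy.approve_host("evil.example")         # only host-side code can do this
print(policy.check_egress("evil.example"))  # True, and the decision is audited
```

Note there is deliberately no method to widen `_allowed_paths`: that mirrors the design choice that filesystem restrictions cannot be unlocked mid-session.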

&lt;h2&gt;
  
  
  What Makes This Different from Just Using Docker?
&lt;/h2&gt;

&lt;p&gt;Fair question. You &lt;em&gt;could&lt;/em&gt; run an agent in a Docker container with restrictive policies. But NemoClaw/OpenShell gives you several things Docker alone doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent-aware policy engine&lt;/strong&gt; — it evaluates actions at the binary, destination, method, and path level. An agent can install a verified skill but can't execute an unreviewed binary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy routing&lt;/strong&gt; — inference is intercepted and routed based on policy, keeping sensitive context on-device when needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live policy updates&lt;/strong&gt; — approve new network destinations or inference providers without restarting the sandbox.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill verification&lt;/strong&gt; — as the agent evolves and learns new capabilities, each new skill is subject to the same controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operator approval flow&lt;/strong&gt; — blocked actions surface in a TUI for human review, not just silent failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hardware Support
&lt;/h2&gt;

&lt;p&gt;NemoClaw runs on a range of NVIDIA hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GeForce RTX PCs/laptops&lt;/strong&gt; — your everyday dev machine with a GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTX PRO workstations&lt;/strong&gt; — for heavier local inference workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DGX Spark&lt;/strong&gt; — NVIDIA's compact AI workstation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DGX Station&lt;/strong&gt; — for enterprise-scale local deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Current Limitations (It's Alpha)
&lt;/h2&gt;

&lt;p&gt;The project is in &lt;strong&gt;alpha&lt;/strong&gt;, and the README says so clearly. A few things to be aware of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interfaces, APIs, and behavior may change without notice.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;openclaw nemoclaw&lt;/code&gt; plugin commands are under active development — use the &lt;code&gt;nemoclaw&lt;/code&gt; host CLI as the primary interface.&lt;/li&gt;
&lt;li&gt;Local inference (Ollama, vLLM) is experimental, especially on macOS.&lt;/li&gt;
&lt;li&gt;On machines with less than 8 GB RAM, the sandbox image (~2.4 GB compressed) can trigger the OOM killer during setup. Adding swap helps.&lt;/li&gt;
&lt;li&gt;Setup may require manual workarounds on some platforms.&lt;/li&gt;
&lt;/ul&gt;
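&lt;p&gt;If you do hit the OOM killer during setup on a low-RAM machine, the standard Linux swap-file recipe (run as root; adjust the size to taste) is usually enough to get through it:&lt;/p&gt;

```shell
# Create and enable a 4 GB swap file (one-time setup)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Confirm the swap is active
swapon --show
```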

&lt;h2&gt;
  
  
  Where to Go from Here
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/NVIDIA/NemoClaw" rel="noopener noreferrer"&gt;github.com/NVIDIA/NemoClaw&lt;/a&gt; — 9k+ stars, Apache 2.0 licensed&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agentskills</category>
      <category>agents</category>
    </item>
    <item>
      <title>MiroFish: The Open-Source AI Engine That Builds Digital Worlds to Predict the Future</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Sat, 14 Mar 2026 23:41:35 +0000</pubDate>
      <link>https://forem.com/arshtechpro/mirofish-the-open-source-ai-engine-that-builds-digital-worlds-to-predict-the-future-ki8</link>
      <guid>https://forem.com/arshtechpro/mirofish-the-open-source-ai-engine-that-builds-digital-worlds-to-predict-the-future-ki8</guid>
      <description>&lt;p&gt;MiroFish is an open-source AI prediction engine that takes real-world data (news, reports, even novels), spawns thousands of AI agents with unique personalities and memories, lets them interact in a simulated world, and produces a prediction report based on what emerges. Think of it as SimCity meets AI forecasting.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Problem Does MiroFish Solve?
&lt;/h2&gt;

&lt;p&gt;Traditional prediction models — whether statistical or ML-based — treat the world like a math equation. You feed in numbers, you get numbers out. But the real world doesn't work that way. People react to each other. Opinions shift. Coalitions form and break apart. A single tweet can change the trajectory of a news cycle.&lt;/p&gt;

&lt;p&gt;MiroFish takes a fundamentally different approach. Instead of crunching numbers, it &lt;strong&gt;simulates the messy, social dynamics of the real world&lt;/strong&gt; using thousands of AI agents that talk, argue, persuade, and evolve — just like people do.&lt;/p&gt;

&lt;p&gt;The result? You get a prediction that accounts for group behavior, social contagion, and emergent patterns that traditional models simply can't capture.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Actually Works (The 5-Step Pipeline)
&lt;/h2&gt;

&lt;p&gt;Here's the workflow, broken down without the buzzwords:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Knowledge Graph Construction
&lt;/h3&gt;

&lt;p&gt;You upload "seed material" — this could be a news article, a financial report, a policy document, or even the first 80 chapters of a novel (yes, they actually did this with &lt;em&gt;Dream of the Red Chamber&lt;/em&gt; to predict its lost ending).&lt;/p&gt;

&lt;p&gt;MiroFish uses &lt;strong&gt;GraphRAG&lt;/strong&gt; (Graph-based Retrieval Augmented Generation) to parse your input and extract entities and relationships. Instead of treating your document as a flat bag of text, it builds a structured knowledge graph — who are the key players, how are they connected, what pressures exist, what institutions are involved.&lt;/p&gt;

&lt;p&gt;This graph becomes the "reality" that the simulated world is built on.&lt;/p&gt;
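&lt;p&gt;To make that concrete, the output of this stage can be pictured as (subject, relation, object) triples indexed by entity, so agents can later look up who connects to what. A toy sketch in Python (illustrative shape only, not MiroFish's actual GraphRAG schema):&lt;/p&gt;

```python
# Toy knowledge-graph shape: triples extracted from seed material.
# Entities and relations here are invented for illustration.
triples = [
    ("Acme Corp", "announced", "layoffs"),
    ("Union Local 12", "opposes", "layoffs"),
    ("Acme Corp", "regulated_by", "Labor Board"),
]

# Index the graph by subject entity for fast lookups during simulation.
graph = {}
for subj, rel, obj in triples:
    graph.setdefault(subj, []).append((rel, obj))

print(graph["Acme Corp"])
# [('announced', 'layoffs'), ('regulated_by', 'Labor Board')]
```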

&lt;h3&gt;
  
  
  Step 2: Environment Setup &amp;amp; Agent Creation
&lt;/h3&gt;

&lt;p&gt;Based on the knowledge graph, MiroFish automatically generates &lt;strong&gt;agent personas&lt;/strong&gt;. Each agent gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unique personality and background&lt;/li&gt;
&lt;li&gt;A distinct stance or perspective on the topic&lt;/li&gt;
&lt;li&gt;Long-term memory (powered by Zep Cloud)&lt;/li&gt;
&lt;li&gt;Behavioral logic that governs how they interact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An "Environment Configuration Agent" then sets up the simulation parameters — essentially deciding the rules of the world these agents will live in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Dual-Platform Parallel Simulation
&lt;/h3&gt;

&lt;p&gt;This is where things get interesting. MiroFish runs simulations on &lt;strong&gt;two platforms simultaneously&lt;/strong&gt; (think Twitter-like and Reddit-like environments). Dozens or hundreds of agents start interacting — posting, commenting, debating, forming opinions, influencing each other.&lt;/p&gt;

&lt;p&gt;The simulation engine under the hood is &lt;strong&gt;OASIS&lt;/strong&gt; (Open Agent Social Interaction Simulations), built by the CAMEL-AI team. OASIS can scale up to one million agents and supports 23 different social actions (following, commenting, reposting, etc.).&lt;/p&gt;

&lt;p&gt;During the simulation, the system automatically tracks your prediction question and dynamically updates each agent's memory as events unfold.&lt;/p&gt;
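&lt;p&gt;The core dynamic here is influence: each agent's next opinion depends on what it just read. A stripped-down update loop (toy dynamics, not OASIS code) shows how convergence emerges without any single agent being programmed to converge:&lt;/p&gt;

```python
import random

random.seed(42)

# Each agent holds an opinion in [-1, 1]; every round it drifts part of
# the way toward the average of its feed. Toy dynamics for illustration.
opinions = [random.uniform(-1, 1) for _ in range(100)]

for _ in range(20):
    feed_mean = sum(opinions) / len(opinions)
    opinions = [o + 0.2 * (feed_mean - o) for o in opinions]

spread = max(opinions) - min(opinions)
print(round(spread, 4))  # the crowd has pulled far tighter than it started
```

Each round shrinks the spread by a constant factor, which is a crude stand-in for the herd behavior the OASIS paper observes in LLM agents.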

&lt;h3&gt;
  
  
  Step 4: Report Generation
&lt;/h3&gt;

&lt;p&gt;After the simulation ends, a dedicated &lt;strong&gt;ReportAgent&lt;/strong&gt; steps in. This agent has access to a rich toolkit and interacts with the post-simulation environment to synthesize everything that happened. It analyzes how agents' opinions shifted, what coalitions formed, and what patterns emerged — then produces a structured prediction report.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Deep Interaction
&lt;/h3&gt;

&lt;p&gt;The report isn't the final product. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat with any agent&lt;/strong&gt; in the simulated world to understand their reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talk to the ReportAgent&lt;/strong&gt; to ask follow-up questions or get alternative analyses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject new variables&lt;/strong&gt; and re-run scenarios ("What if we change X?")&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;p&gt;Here's what's under the hood:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;Python 3.11+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Vue.js&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simulation Engine&lt;/td&gt;
&lt;td&gt;OASIS (by CAMEL-AI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge Graphs&lt;/td&gt;
&lt;td&gt;GraphRAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent Memory&lt;/td&gt;
&lt;td&gt;Zep Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Support&lt;/td&gt;
&lt;td&gt;Any OpenAI SDK-compatible model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recommended LLM&lt;/td&gt;
&lt;td&gt;Qwen-plus (via Alibaba's Bailian platform)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package Manager&lt;/td&gt;
&lt;td&gt;uv (for Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Getting Started (Self-Hosted Setup)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; MiroFish was developed and tested on macOS. Windows compatibility is still being tested.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 18+&lt;/li&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;uv (Python package manager)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Clone and Configure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/666ghj/MiroFish.git
&lt;span class="nb"&gt;cd &lt;/span&gt;MiroFish

&lt;span class="c"&gt;# Copy the example env file&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit the &lt;code&gt;.env&lt;/code&gt; file with your API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# LLM Configuration (any OpenAI SDK-compatible LLM)
# Recommended: Qwen-plus on Alibaba's Bailian platform
LLM_API_KEY=your_api_key
LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
LLM_MODEL_NAME=qwen-plus

# Zep Cloud (for agent memory persistence)
# Free tier is enough for basic usage: https://app.getzep.com/
ZEP_API_KEY=your_zep_api_key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Install Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One command to install everything (root + frontend + backend)&lt;/span&gt;
npm run setup:all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or step by step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Node dependencies (root + frontend)&lt;/span&gt;
npm run setup

&lt;span class="c"&gt;# Python dependencies (auto-creates virtual environment)&lt;/span&gt;
npm run setup:backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Run It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start both frontend and backend&lt;/span&gt;
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your frontend will be at &lt;code&gt;http://localhost:3000&lt;/code&gt; and the API at &lt;code&gt;http://localhost:5001&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also start them separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm run backend   &lt;span class="c"&gt;# Backend only&lt;/span&gt;
npm run frontend  &lt;span class="c"&gt;# Frontend only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Can You Actually Predict With This?
&lt;/h2&gt;

&lt;p&gt;The team has demonstrated several use cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public Opinion Simulation&lt;/strong&gt; — Feed in a news event and simulate how public sentiment might evolve. The demo shows a prediction of how a university controversy might unfold across social media.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial Forecasting&lt;/strong&gt; — Inject market signals and watch how simulated traders, analysts, and retail investors react to each other's moves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy Impact Testing&lt;/strong&gt; — Upload a policy draft and see how different stakeholder groups might respond, form alliances, or push back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creative Exploration&lt;/strong&gt; — The team fed the first 80 chapters of a classic Chinese novel into MiroFish and had it predict the lost ending based on how the characters would behave. This is a fun one — it shows the engine isn't limited to "serious" forecasting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Important Caveats
&lt;/h2&gt;

&lt;p&gt;Let's be real about what this is and isn't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not a crystal ball.&lt;/strong&gt; The team hasn't published benchmarks comparing predictions against actual outcomes. The simulations illustrate &lt;em&gt;plausible&lt;/em&gt; scenarios based on emergent agent behavior — they're not probability estimates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM costs add up.&lt;/strong&gt; Running hundreds of agents through multiple simulation rounds means lots of LLM API calls. The README recommends starting with fewer than 40 rounds to manage costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent bias matters.&lt;/strong&gt; The OASIS research paper notes that LLM agents tend to be &lt;em&gt;more&lt;/em&gt; susceptible to herd behavior than real humans. Simulated crowds can polarize faster than real ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's early.&lt;/strong&gt; Version 0.1.0 was released in December 2025. This is a v0 product — powerful in concept, but still maturing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Backstory
&lt;/h2&gt;

&lt;p&gt;MiroFish was built by Guo Hangjiang, a senior undergraduate student in China. It topped GitHub's Global Trending list in March 2026 and has attracted investment from Shanda Group founder Chen Tianqiao. The project's predecessor, BettaFish (a multi-agent public opinion analysis tool), also hit #1 on GitHub Trending in late 2024.&lt;/p&gt;

&lt;p&gt;The core simulation engine comes from OASIS, an open-source project by the CAMEL-AI research community that supports up to one million agent interactions and has been published in peer-reviewed research.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Developers Should Care
&lt;/h2&gt;

&lt;p&gt;Even if you're not building a prediction engine, MiroFish is worth studying because it's a clean example of several patterns coming together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG for knowledge grounding&lt;/strong&gt; — how to give agents structured context, not just raw text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent agent memory&lt;/strong&gt; — using Zep to let agents remember across simulation rounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent orchestration at scale&lt;/strong&gt; — coordinating hundreds of autonomous agents in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emergent behavior as a feature&lt;/strong&gt; — designing systems where the output isn't programmed but &lt;em&gt;emerges&lt;/em&gt; from agent interactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are patterns you'll see increasingly in production AI systems, and MiroFish packages them in a way that's easy to study and experiment with.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/666ghj/MiroFish" rel="noopener noreferrer"&gt;github.com/666ghj/MiroFish&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Core ML vs Foundation Models: Which Should You Use?</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Wed, 25 Feb 2026 12:16:03 +0000</pubDate>
      <link>https://forem.com/arshtechpro/core-ml-vs-foundation-models-which-should-you-use-3jo0</link>
      <guid>https://forem.com/arshtechpro/core-ml-vs-foundation-models-which-should-you-use-3jo0</guid>
      <description>&lt;p&gt;With iOS 26.3 now in the wild, iOS developers have two powerful on-device AI frameworks to choose from: &lt;strong&gt;Core ML&lt;/strong&gt; — Apple's veteran ML inference engine — and &lt;strong&gt;Foundation Models&lt;/strong&gt; — the new framework that exposes Apple's ~3B parameter LLM from iOS 26 onwards.&lt;/p&gt;

&lt;p&gt;They sound like they do similar things. They don't.&lt;/p&gt;

&lt;p&gt;This article cuts through the confusion, explains what each framework is actually designed for, and gives you a clear decision framework so you can pick the right tool for the job — or know when to combine both.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Each Framework Actually Is
&lt;/h2&gt;

&lt;p&gt;Before comparing them, let's be precise about what they are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core ML&lt;/strong&gt; has been available since iOS 11. It's a general-purpose inference engine — you bring a trained model (in &lt;code&gt;.mlmodel&lt;/code&gt; or &lt;code&gt;.mlpackage&lt;/code&gt; format), and Core ML runs it on-device using the best available hardware: Neural Engine, GPU, or CPU depending on the task. Core ML itself doesn't contain any models. You either train your own (with Create ML), convert from TensorFlow or PyTorch, or download pre-trained ones. It supports image classification, object detection, NLP, audio analysis, tabular data prediction, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Foundation Models&lt;/strong&gt; was introduced in iOS 26. It's an API that gives you direct access to Apple's own pre-trained ~3B parameter large language model — the same one behind Apple Intelligence. You don't bring a model. The model is already on the device (downloaded as part of enabling Apple Intelligence). The framework specialises in natural language: text generation, summarisation, structured data extraction, and tool calling.&lt;/p&gt;

&lt;p&gt;One is a runtime that runs &lt;em&gt;your&lt;/em&gt; models. The other is an API for a specific pre-built Apple model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Device Availability: A Critical Difference
&lt;/h2&gt;

&lt;p&gt;This is where the two frameworks diverge most sharply, and it matters enormously for your architecture decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core ML&lt;/strong&gt; works on every device that runs iOS 11 and later. On iOS 26, that means any iPhone 11 or newer (A13 chip and up). It runs on essentially your entire user base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Foundation Models&lt;/strong&gt; requires Apple Intelligence, which means iPhone 15 Pro/Max or any iPhone 16 or 17 model. An iPhone 14, 15 (standard), or anything older simply cannot use it, regardless of which iOS version it runs. On top of the device requirement, Apple Intelligence must be enabled by the user, 7 GB of free storage is required, and the on-device model needs to finish downloading after Apple Intelligence is first enabled.&lt;/p&gt;

&lt;p&gt;If your feature needs to work for the majority of your users today, Core ML is the safer choice. If the feature is a progressive enhancement aimed at users on newer devices, Foundation Models is a compelling option.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Each Framework Is Good At
&lt;/h2&gt;

&lt;p&gt;Think of the two frameworks as operating in completely different problem spaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core ML: Structured ML Tasks
&lt;/h3&gt;

&lt;p&gt;Core ML excels when your task is well-defined, has a clear input/output structure, and can be solved with a trained model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image classification&lt;/strong&gt; — "What object is in this photo?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object detection&lt;/strong&gt; — "Where are the faces/products/items in this frame?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pose estimation&lt;/strong&gt; — "Where are this person's joints?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio classification&lt;/strong&gt; — "Is this the sound of a dog or a car?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text classification / sentiment analysis&lt;/strong&gt; — "Is this review positive or negative?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tabular prediction&lt;/strong&gt; — "Based on these health metrics, what category does this fall into?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time video analysis&lt;/strong&gt; — Frame-by-frame inference at high frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A key characteristic of Core ML tasks: you can define the exact output schema up front, and the model reliably produces a label, a bounding box, a confidence score, or a numeric prediction. Deterministic, fast, and predictable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Foundation Models: Language and Reasoning Tasks
&lt;/h3&gt;

&lt;p&gt;Foundation Models is designed for tasks that involve language understanding and generation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text summarisation&lt;/strong&gt; — turning long content into concise summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured extraction&lt;/strong&gt; — pulling structured data from unstructured text (e.g. extracting a name, date, and location from a messy user note)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content classification using natural language&lt;/strong&gt; — not just labels, but nuanced categorisation with explanation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual suggestions&lt;/strong&gt; — generating personalised recommendations based on user context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-augmented reasoning&lt;/strong&gt; — letting the model call into your app's data to answer user questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apple itself explicitly warns that the on-device 3B model is &lt;strong&gt;not&lt;/strong&gt; designed for world-knowledge Q&amp;amp;A, code generation, or complex maths. It's optimised for task-oriented, app-integrated intelligence.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Core ML
&lt;/h2&gt;

&lt;p&gt;Use Core ML when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need vision or audio inference.&lt;/strong&gt; Core ML is the only on-device option for camera-based features — real-time object detection, face analysis, pose estimation, scene classification. Foundation Models cannot process images at all (no image input support as of iOS 26.3).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need it to work on older devices.&lt;/strong&gt; If your feature must work on an iPhone 12, 13, or 14, Core ML is your only on-device option. It runs on any device with iOS 11+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have a specific, narrow ML task.&lt;/strong&gt; A model trained to classify 10 types of skin lesions, or to detect a specific product in a frame, will outperform a general 3B LLM on that narrow task — and at a fraction of the memory and compute cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need deterministic, repeatable outputs.&lt;/strong&gt; Core ML models return the same output for the same input, every time. Foundation Models, being a generative LLM, produces varied responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're doing real-time inference.&lt;/strong&gt; Core ML can process frames from a camera feed at 30+ fps. Foundation Models is not suited for frame-by-frame tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Foundation Models
&lt;/h2&gt;

&lt;p&gt;Use Foundation Models when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your task is fundamentally a language task.&lt;/strong&gt; Summarising a document, extracting key facts from a user's note, generating a personalised caption — these are natural fits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You want structured output from unstructured text.&lt;/strong&gt; The &lt;code&gt;@Generable&lt;/code&gt; macro lets you extract type-safe Swift structs directly from free-form input. No JSON parsing, no regex, no post-processing.&lt;/p&gt;
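&lt;p&gt;A sketch of what that looks like, based on the Foundation Models API Apple introduced with iOS 26 (the struct and guide descriptions here are illustrative, not taken from Apple's docs verbatim):&lt;/p&gt;

```swift
import FoundationModels

// Illustrative @Generable type: the model fills in these fields
// directly from free-form text, guided by the descriptions.
@Generable
struct EventDetails {
    @Guide(description: "The person or organization hosting the event")
    var host: String
    @Guide(description: "The date of the event, if mentioned")
    var date: String
    @Guide(description: "Where the event takes place")
    var location: String
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Pull the details out of: 'Dinner at Maria's place, Friday the 12th'",
    generating: EventDetails.self
)
// response.content is a typed EventDetails value; no JSON parsing needed
```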

&lt;p&gt;&lt;strong&gt;You need natural language reasoning with tool calling.&lt;/strong&gt; Foundation Models can decide when to call into your app's data, fetch it, and incorporate it into a response. Core ML models can't reason about when or whether to request more context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You want zero model maintenance.&lt;/strong&gt; With Core ML, you own the model — you retrain it, update it, and deal with drift over time. With Foundation Models, Apple maintains the base model. You get improvements with OS updates for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed of integration matters.&lt;/strong&gt; Building a Core ML-powered feature involves choosing a model architecture, training data, training, conversion, and integration. Foundation Models can be integrated in an afternoon.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Combination Pattern
&lt;/h2&gt;

&lt;p&gt;Here's something worth noting: the two frameworks aren't competitors. They're complementary layers, and the most powerful apps combine them.&lt;/p&gt;

&lt;p&gt;Apple highlighted a real example of this with SwingVision, a tennis/pickleball coaching app. It uses Core ML to analyse video frames and extract structured data about a user's movement and technique. It then feeds that structured output as context into a Foundation Models session to generate natural language coaching feedback.&lt;/p&gt;

&lt;p&gt;This is the pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Core ML handles the perception layer&lt;/strong&gt; — processing images, audio, or sensor data into structured signals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foundation Models handles the reasoning and language layer&lt;/strong&gt; — turning those signals into meaningful, natural language insights&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A fitness app might use Core ML for pose estimation during a workout, then pass rep counts and form data to Foundation Models to generate a personalised summary. A cooking app might use Core ML to identify ingredients from a photo, then Foundation Models to suggest a recipe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 1: Core ML classifies the image&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;classificationRequest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;VNCoreMLRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mlModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="k"&gt;as?&lt;/span&gt; &lt;span class="kt"&gt;VNClassificationObservation&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Step 2: Feed the structured output into Foundation Models&lt;/span&gt;
    &lt;span class="kt"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"The user photographed: &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identifier&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt; with &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;% confidence. Suggest what they could make with this."&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;Here's a straightforward way to pick:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is your task visual, audio-based, or real-time?&lt;/strong&gt; → Core ML&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need it to work on iPhone 14 or older?&lt;/strong&gt; → Core ML&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is your task text-only — generation, extraction, summarisation, or reasoning?&lt;/strong&gt; → Foundation Models&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need tight control over the model's behaviour for a narrow domain?&lt;/strong&gt; → Core ML with a custom-trained model&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you want to ship an AI feature today with minimal setup on iOS 26 devices?&lt;/strong&gt; → Foundation Models&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need both vision and natural language?&lt;/strong&gt; → Core ML for perception + Foundation Models for language&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Core ML and Foundation Models solve different problems. Core ML is a mature, flexible inference engine for deterministic ML tasks across all your users. Foundation Models is a purpose-built API for language tasks on Apple Intelligence-compatible devices, with almost zero setup cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core ML: iOS 11+ (on iOS 26, that means any iPhone 11 or later)&lt;/li&gt;
&lt;li&gt;Foundation Models: iOS 26 + Apple Intelligence-compatible device (iPhone 15 Pro or newer, all iPhone 16/17 models)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ios</category>
      <category>mobile</category>
      <category>swift</category>
      <category>ai</category>
    </item>
    <item>
      <title>Xcode 26.4 Beta: Smaller Changes, Real Developer Impact</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Mon, 23 Feb 2026 10:46:14 +0000</pubDate>
      <link>https://forem.com/arshtechpro/xcode-264-beta-smaller-changes-real-developer-impact-20ol</link>
      <guid>https://forem.com/arshtechpro/xcode-264-beta-smaller-changes-real-developer-impact-20ol</guid>
      <description>&lt;p&gt;Apple quietly released &lt;strong&gt;Xcode 26.4 beta&lt;/strong&gt;, bringing updated SDKs and a set of practical improvements — especially around &lt;strong&gt;testing and localization workflows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Unlike &lt;strong&gt;Xcode 26.3&lt;/strong&gt;, which introduced agentic coding, &lt;strong&gt;26.4 is a refinement release&lt;/strong&gt;. No big headline features — but a lot of small improvements that remove friction from day-to-day development.&lt;/p&gt;

&lt;p&gt;Here’s what developers should actually care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updated SDKs
&lt;/h2&gt;

&lt;p&gt;Xcode 26.4 ships updated SDKs for all Apple platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swift Testing Keeps Getting Better
&lt;/h2&gt;

&lt;p&gt;Swift Testing continues to evolve, and &lt;strong&gt;Xcode 26.4 makes it much more practical for real-world debugging.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Attach Images to Tests
&lt;/h3&gt;

&lt;p&gt;You can now attach images directly to Swift tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CGImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NSImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;UIImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CIImage&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snapshot testing&lt;/li&gt;
&lt;li&gt;UI verification&lt;/li&gt;
&lt;li&gt;Rendering tests&lt;/li&gt;
&lt;li&gt;Vision / image pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of logging text failures, you can now &lt;strong&gt;see exactly what went wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For UI-heavy apps, this is a big quality-of-life improvement.&lt;/p&gt;
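&lt;p&gt;A sketch of what this looks like, assuming Swift Testing's &lt;code&gt;Attachment.record(_:named:)&lt;/code&gt; API — &lt;code&gt;renderIcon()&lt;/code&gt; stands in for whatever rendering code you are testing:&lt;/p&gt;

```swift
import Testing
import UIKit

@Test func iconRendersAtExpectedSize() throws {
    let image = renderIcon()   // hypothetical rendering code under test
    // Attach the image so it appears alongside the result in the test report.
    Attachment.record(image, named: "rendered-icon")
    #expect(image.size.width == 120)
}
```

&lt;p&gt;When the expectation fails, the attached image shows you the actual rendering instead of just a number in a log.&lt;/p&gt;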




&lt;h3&gt;
  
  
  Severity Levels for Test Issues
&lt;/h3&gt;

&lt;p&gt;Swift Testing now lets you record issues with a &lt;strong&gt;specific severity level&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of everything being a hard failure, you can now distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Warnings&lt;/li&gt;
&lt;li&gt;Non-critical issues&lt;/li&gt;
&lt;li&gt;Actual failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes test reporting more realistic for large test suites.&lt;/p&gt;

&lt;p&gt;Not every issue needs to break CI.&lt;/p&gt;
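&lt;p&gt;As a hedged example — assuming the severity parameter on &lt;code&gt;Issue.record&lt;/code&gt; takes roughly this shape, with &lt;code&gt;runSyncAndMeasure()&lt;/code&gt; as a hypothetical helper:&lt;/p&gt;

```swift
import Testing

@Test func syncCompletesQuickly() async throws {
    let seconds = try await runSyncAndMeasure()   // hypothetical helper
    if seconds > 2.0 {
        // Recorded as a warning: visible in the report, but CI stays green.
        Issue.record("Sync took \(seconds)s, slower than the 2s target",
                     severity: .warning)
    }
    #expect(seconds.isFinite)
}
```

&lt;p&gt;The test still passes, but the slowdown is recorded where reviewers will see it.&lt;/p&gt;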




&lt;h3&gt;
  
  
  Better Error Reporting for Attachments
&lt;/h3&gt;

&lt;p&gt;If Xcode fails to save a test attachment, it now shows up as a &lt;strong&gt;runtime issue&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Previously this could fail silently, which made debugging test infrastructure harder.&lt;/p&gt;

&lt;p&gt;Now you'll know immediately if something went wrong.&lt;/p&gt;




&lt;h3&gt;
  
  
  UI Test Crash Reports Are Easier to See
&lt;/h3&gt;

&lt;p&gt;When an app crashes during UI testing via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XCUIApplication(bundleIdentifier:)
XCUIApplication(url:)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Xcode now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reports the crash as a warning&lt;/li&gt;
&lt;li&gt;Attaches the crash log automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This removes a lot of manual digging through DerivedData.&lt;/p&gt;

&lt;p&gt;Small change — big time saver.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mixing XCTest and Swift Testing Is Safer
&lt;/h3&gt;

&lt;p&gt;Many projects are gradually migrating from XCTest to Swift Testing.&lt;/p&gt;

&lt;p&gt;Xcode 26.4 improves this transition.&lt;/p&gt;

&lt;p&gt;If you call an assertion from the wrong framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XCTest inside Swift Testing&lt;/li&gt;
&lt;li&gt;Swift Testing inside XCTest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Xcode now reports a &lt;strong&gt;runtime warning&lt;/strong&gt; instead of failing silently.&lt;/p&gt;

&lt;p&gt;This makes mixed test suites much easier to maintain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Localization Improvements (Finally)
&lt;/h2&gt;

&lt;p&gt;If you've used String Catalogs for localization, you've probably noticed that editing them could feel limited.&lt;/p&gt;

&lt;p&gt;Xcode 26.4 fixes several long-standing pain points.&lt;/p&gt;




&lt;h3&gt;
  
  
  Removing Languages Is Now Easy
&lt;/h3&gt;

&lt;p&gt;You can now remove languages directly from the &lt;strong&gt;String Catalog editor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even better, you can choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove from just this catalog&lt;/li&gt;
&lt;li&gt;Remove from the entire project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Previously this required manual cleanup.&lt;/p&gt;

&lt;p&gt;Now it's a one-click operation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pre-Fill Translations When Adding Languages
&lt;/h3&gt;

&lt;p&gt;When adding a new supported language in Project Settings, Xcode can now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-fill translations using an existing language.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is surprisingly useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating regional variants&lt;/li&gt;
&lt;li&gt;Bootstrapping translations&lt;/li&gt;
&lt;li&gt;Working with external translators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of starting from empty catalogs, you get a usable baseline.&lt;/p&gt;




&lt;h3&gt;
  
  
  Copy &amp;amp; Paste Support for String Catalogs
&lt;/h3&gt;

&lt;p&gt;String Catalog editing finally behaves like a normal editor.&lt;/p&gt;

&lt;p&gt;You can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cut&lt;/li&gt;
&lt;li&gt;Copy&lt;/li&gt;
&lt;li&gt;Paste&lt;/li&gt;
&lt;li&gt;Duplicate strings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And this works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Within a catalog&lt;/li&gt;
&lt;li&gt;Between catalogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When pasting strings, you can choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add as a new key with translations&lt;/li&gt;
&lt;li&gt;Apply translations to an existing key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever reorganized large catalogs, this will save a lot of time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build and Compiler Improvements
&lt;/h2&gt;

&lt;p&gt;Like most minor Xcode updates, 26.4 includes compiler and build system improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better diagnostics&lt;/li&gt;
&lt;li&gt;Improved stability&lt;/li&gt;
&lt;li&gt;Fewer unexpected build issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These updates rarely make headlines but usually improve everyday development.&lt;/p&gt;

&lt;p&gt;Large projects tend to benefit the most.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Xcode 26.4 isn’t a flashy release — but it improves areas developers use every day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better Swift Testing&lt;/li&gt;
&lt;li&gt;Easier localization&lt;/li&gt;
&lt;li&gt;Updated SDKs&lt;/li&gt;
&lt;li&gt;More stable builds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;Testing and Localization improvements alone make this a worthwhile upgrade&lt;/strong&gt; for many teams.&lt;/p&gt;

&lt;p&gt;If Xcode 26.3 introduced new workflows, &lt;strong&gt;26.4 makes the existing ones smoother.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>mobile</category>
      <category>swift</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Fall Back Gracefully When Apple Intelligence Isn't Available</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Sat, 21 Feb 2026 00:24:36 +0000</pubDate>
      <link>https://forem.com/arshtechpro/how-to-fall-back-gracefully-when-apple-intelligence-isnt-available-48j</link>
      <guid>https://forem.com/arshtechpro/how-to-fall-back-gracefully-when-apple-intelligence-isnt-available-48j</guid>
      <description>&lt;p&gt;Apple Intelligence is one of the most exciting things to happen to iOS development in years. The Foundation Models framework gives you direct access to an on-device LLM with zero API costs, zero network calls, and full privacy. But here's the hard truth: &lt;strong&gt;a huge chunk of your users can't run it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you just drop a &lt;code&gt;LanguageModelSession()&lt;/code&gt; into your app without any checks, you'll ship broken experiences to a large portion of your user base. This article is about how to handle that properly — detecting unavailability, communicating it clearly to users, and falling back gracefully so your app stays useful for everyone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Can't Use Apple Intelligence?
&lt;/h2&gt;

&lt;p&gt;Let's be precise about this, because the numbers matter.&lt;/p&gt;

&lt;p&gt;iOS 26.3 (released February 11, 2026) runs on &lt;strong&gt;iPhone 11 and later&lt;/strong&gt; — that's any device with an A13 chip or newer. But Apple Intelligence requires an &lt;strong&gt;A17 Pro chip or newer&lt;/strong&gt;, which means only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;iPhone 15 Pro / iPhone 15 Pro Max&lt;/li&gt;
&lt;li&gt;iPhone 16 / 16 Plus / 16 Pro / 16 Pro Max&lt;/li&gt;
&lt;li&gt;iPhone 16e&lt;/li&gt;
&lt;li&gt;iPhone 17 / 17 Pro / 17 Pro Max / iPhone Air&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So an iPhone 15 (standard) running iOS 26.3 supports all the Liquid Glass UI changes, but gets &lt;strong&gt;zero&lt;/strong&gt; Foundation Models access. Same for anything older. On top of that, even eligible devices need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple Intelligence &lt;strong&gt;enabled in Settings&lt;/strong&gt; (it's opt-in)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 GB of free storage&lt;/strong&gt; on the device&lt;/li&gt;
&lt;li&gt;Device and Siri language set to a &lt;strong&gt;supported language&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The model &lt;strong&gt;fully downloaded&lt;/strong&gt; (it downloads in the background after enabling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bottom line: even among users on supported hardware, not everyone will have Apple Intelligence ready to go. You cannot assume availability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Unavailability Cases
&lt;/h2&gt;

&lt;p&gt;The Foundation Models framework gives you exactly three reasons why the model might not be available, surfaced through &lt;code&gt;SystemLanguageModel.default.availability&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;// Good to go&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deviceNotEligible&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// A13 or older chip — Foundation Models will never work here&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appleIntelligenceNotEnabled&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// Compatible device, but user hasn't turned on Apple Intelligence&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modelNotReady&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// Compatible + enabled, but model is still downloading&lt;/span&gt;
&lt;span class="kd"&gt;@unknown&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;// Future-proof: handle any new cases Apple might add&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each case needs a different response from your app. They are not all the same problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building a Proper Fallback Strategy
&lt;/h2&gt;

&lt;p&gt;Think of these three cases as three separate UX problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 1: Device Not Eligible
&lt;/h3&gt;

&lt;p&gt;This is permanent. The user's hardware will never support Apple Intelligence. Don't show them a spinner. Don't show a "check back later" message. Show them a functional experience that doesn't rely on the model at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deviceNotEligible&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// Serve a non-AI version of the feature&lt;/span&gt;
    &lt;span class="nf"&gt;showBasicTextSummarizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What "basic version" means depends on your feature. For a smart journaling app, you might skip auto-tagging and let the user tag manually. For a writing assistant, you might offer simpler preset templates instead of generated suggestions. The key is: the feature should still work, just without the AI enhancement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 2: Apple Intelligence Not Enabled
&lt;/h3&gt;

&lt;p&gt;This one is different — the hardware supports it, but the user hasn't opted in. You &lt;em&gt;can&lt;/em&gt; prompt the user to enable it. But be thoughtful about how you do this. Don't block the UI. Don't show it repeatedly. Show it once, explain the benefit clearly, and deep link directly to the setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appleIntelligenceNotEnabled&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;showEnablementBanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Enable Apple Intelligence in Settings to unlock AI-powered suggestions."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;settingsURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;string&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;UIApplication&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openSettingsURLString&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// Still show the basic version of the feature below the banner&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apple Intelligence is enabled at &lt;strong&gt;Settings → Apple Intelligence &amp;amp; Siri&lt;/strong&gt;. You can open Settings directly with &lt;code&gt;UIApplication.openSettingsURLString&lt;/code&gt;, but you can't deep link to that exact screen — the user has to navigate there themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 3: Model Not Ready
&lt;/h3&gt;

&lt;p&gt;This is temporary. The model is downloaded in the background after a user enables Apple Intelligence, and it can take a while. The right response here is to wait and retry — not to permanently fall back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modelNotReady&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;showLoadingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"AI features are warming up. This only takes a moment."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;scheduleAvailabilityCheck&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the retry logic, you can periodically re-check &lt;code&gt;SystemLanguageModel.default.availability&lt;/code&gt;. A simple approach is to use a &lt;code&gt;Timer&lt;/code&gt; or &lt;code&gt;Task&lt;/code&gt; with a delay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;scheduleAvailabilityCheck&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="kt"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkAndUpdateAvailability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't poll too aggressively — once every 10–30 seconds is fine while the user is actively in that screen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It Together: A Clean Architecture
&lt;/h2&gt;

&lt;p&gt;Here's a practical pattern that keeps your feature code clean and handles all three cases from a single place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="kt"&gt;AIAvailabilityState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;unsupportedDevice&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;notEnabled&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;modelLoading&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;@Observable&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;AIFeatureManager&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private(set)&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;AIAvailabilityState&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modelLoading&lt;/span&gt;

    &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;refreshAvailability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;refreshAvailability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deviceNotEligible&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unsupportedDevice&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appleIntelligenceNotEnabled&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notEnabled&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modelNotReady&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modelLoading&lt;/span&gt;
            &lt;span class="nf"&gt;scheduleRetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;@unknown&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unsupportedDevice&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;scheduleRetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="kt"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;refreshAvailability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in your view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;SmartFeatureView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@State&lt;/span&gt; &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;aiManager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;AIFeatureManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kd"&gt;some&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;aiManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="kt"&gt;AIEnhancedView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;unsupportedDevice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="kt"&gt;BasicFallbackView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;notEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="kt"&gt;EnablePromptView&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;aiManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refreshAvailability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;modelLoading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="kt"&gt;LoadingView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"AI features are getting ready..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps your views thin. Each state gets its own view. And when the model becomes available, &lt;code&gt;refreshAvailability()&lt;/code&gt; updates the state and SwiftUI re-renders automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Your Fallback UI Should Actually Do
&lt;/h2&gt;

&lt;p&gt;A fallback isn't just "hide the AI button." A good fallback means the feature still delivers value without the model. Here are patterns for common use cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart text summarization&lt;/strong&gt; → Fall back to a character-count preview or a "show more/less" toggle. Not as smart, but still useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-tagging / content classification&lt;/strong&gt; → Fall back to a curated list of tags the user picks from manually. Or skip tagging entirely and search by keyword.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-generated suggestions&lt;/strong&gt; → Fall back to a set of hand-written preset options. Less personalised, but still functional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual chat assistant&lt;/strong&gt; → Fall back to an FAQ-style interface or a link to your help docs.&lt;/p&gt;

&lt;p&gt;The goal is: a user on an iPhone 14 should open your app and find a working, useful feature — not a broken screen or a wall of text explaining why their device isn't good enough.&lt;/p&gt;
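
&lt;p&gt;As a sketch of the summarization pattern, both paths can sit behind one interface so call sites never care which one they got. The &lt;code&gt;Summarizer&lt;/code&gt; protocol and both struct names here are hypothetical, not part of Apple's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;// Hypothetical protocol: both paths expose the same interface
protocol Summarizer {
    func summarize(_ text: String) async throws -&amp;gt; String
}

// AI path, used only when the model is available
struct ModelSummarizer: Summarizer {
    func summarize(_ text: String) async throws -&amp;gt; String {
        let session = LanguageModelSession()
        return try await session.respond(to: "Summarize in one sentence: \(text)").content
    }
}

// Fallback: a plain character-count preview, available everywhere
struct PreviewSummarizer: Summarizer {
    func summarize(_ text: String) async throws -&amp;gt; String {
        String(text.prefix(120)) + (text.count &amp;gt; 120 ? "…" : "")
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
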




&lt;h2&gt;
  
  
  A Note on the &lt;code&gt;@unknown default&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Always include &lt;code&gt;@unknown default&lt;/code&gt; in your switch statement. Apple's API is still relatively new, and future OS versions may add new unavailability reasons. A plain &lt;code&gt;default&lt;/code&gt; would silently swallow those new cases, while &lt;code&gt;@unknown default&lt;/code&gt; still handles them at runtime and gives you a compile-time warning when you build against an SDK that adds one. Treat any unknown case the same as &lt;code&gt;deviceNotEligible&lt;/code&gt;: assume the model isn't coming, and serve the basic experience.&lt;/p&gt;
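
&lt;p&gt;A minimal sketch of that switch, mapping each availability case onto the UI states used earlier (assuming the same hypothetical &lt;code&gt;state&lt;/code&gt; property):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;switch SystemLanguageModel.default.availability {
case .available:
    state = .available
case .unavailable(.deviceNotEligible):
    state = .unsupportedDevice
case .unavailable(.appleIntelligenceNotEnabled):
    state = .notEnabled
case .unavailable(.modelNotReady):
    state = .modelLoading
@unknown default:
    // A reason added in a future OS: assume the model isn't coming
    state = .unsupportedDevice
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
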




&lt;h2&gt;
  
  
  Testing Without an Eligible Device
&lt;/h2&gt;

&lt;p&gt;Testing all three unavailability states in Simulator can be tricky. Here's what works in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.deviceNotEligible&lt;/code&gt;&lt;/strong&gt;: Use an iPhone simulator that's older than iPhone 15 Pro (e.g. iPhone 14).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.appleIntelligenceNotEnabled&lt;/code&gt;&lt;/strong&gt;: On a supported simulator, go to Settings → Apple Intelligence &amp;amp; Siri and toggle it off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.modelNotReady&lt;/code&gt;&lt;/strong&gt;: Harder to simulate reliably. You can mock this in your &lt;code&gt;AIFeatureManager&lt;/code&gt; for testing by injecting a fake availability value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For unit testing, make &lt;code&gt;SystemLanguageModel.default.availability&lt;/code&gt; mockable by abstracting it behind a protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;protocol&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelAvailabilityChecker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Availability&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;LiveChecker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelAvailabilityChecker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Availability&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;availability&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;MockChecker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelAvailabilityChecker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Availability&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inject &lt;code&gt;MockChecker&lt;/code&gt; in your tests, &lt;code&gt;LiveChecker&lt;/code&gt; in production. This lets you write clean unit tests for every availability state without needing a physical device.&lt;/p&gt;
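
&lt;p&gt;A test then reads naturally. This sketch assumes your &lt;code&gt;AIFeatureManager&lt;/code&gt; accepts the checker as an initializer parameter (a design choice on your side, not Apple's API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import XCTest

final class AIFeatureManagerTests: XCTestCase {
    func testIneligibleDeviceShowsFallback() {
        // Simulate an unsupported device without needing real hardware
        let checker = MockChecker(availability: .unavailable(.deviceNotEligible))
        let manager = AIFeatureManager(checker: checker)
        XCTAssertEqual(manager.state, .unsupportedDevice)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
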




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The developers who ship great apps right now are the ones who treat Foundation Models as a &lt;em&gt;progressive enhancement&lt;/em&gt; — something that makes the experience better for those who have it, without breaking anything for those who don't.&lt;/p&gt;

&lt;p&gt;Build the baseline first. Then layer the intelligence on top.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: iOS 26+ · Xcode 26+&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Apple Intelligence devices&lt;/strong&gt;: iPhone 15 Pro and Pro Max, all iPhone 16 and 17 models&lt;/p&gt;

</description>
      <category>ios</category>
      <category>mobile</category>
      <category>ai</category>
      <category>swift</category>
    </item>
    <item>
      <title>Apple's Foundation Models Framework: Run AI On-Device With Just a Few Lines of Swift</title>
      <dc:creator>ArshTechPro</dc:creator>
      <pubDate>Wed, 18 Feb 2026 15:43:44 +0000</pubDate>
      <link>https://forem.com/arshtechpro/apples-foundation-models-framework-run-ai-on-device-with-just-a-few-lines-of-swift-lbp</link>
      <guid>https://forem.com/arshtechpro/apples-foundation-models-framework-run-ai-on-device-with-just-a-few-lines-of-swift-lbp</guid>
      <description>&lt;p&gt;Apple has quietly shipped one of the most significant frameworks for iOS developers in years. With iOS 26 the &lt;strong&gt;Foundation Models framework&lt;/strong&gt; gives you direct access to Apple's on-device ~3B parameter large language model — the same one powering Apple Intelligence — right from your Swift code.&lt;/p&gt;

&lt;p&gt;No API keys. No cloud costs. No internet required. And it's &lt;strong&gt;completely free&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's break down what this means and how to start building with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is the Foundation Models Framework?
&lt;/h2&gt;

&lt;p&gt;The Foundation Models framework exposes Apple's on-device LLM to third-party developers. Unlike cloud-based models like ChatGPT or Claude that run on remote servers, Apple's model runs &lt;strong&gt;entirely on the user's device&lt;/strong&gt; using Apple silicon (CPU, GPU, and Neural Engine).&lt;/p&gt;

&lt;p&gt;This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy by default&lt;/strong&gt; — all data stays on-device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero latency from network calls&lt;/strong&gt; — inference happens locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline support&lt;/strong&gt; — works without internet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free inference&lt;/strong&gt; — no per-token costs, no API billing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Swift integration&lt;/strong&gt; — the API feels native, not bolted on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model specializes in language understanding, structured output generation, and tool calling. It's not designed as a general-knowledge chatbot, but rather as an engine for building intelligent features tailored to your app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started: Your First On-Device AI Feature
&lt;/h2&gt;

&lt;p&gt;Here's how simple it is to generate a response from Apple's on-device LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"What's a good name for a travel app?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Three lines of code and you're running AI inference on-device.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming Responses
&lt;/h3&gt;

&lt;p&gt;For a ChatGPT-like experience where text appears token by token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;streamResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Suggest 5 creative app names for a fitness tracker"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;terminator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes for a much smoother UX than waiting for the entire response to complete.&lt;/p&gt;
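
&lt;p&gt;To drive a SwiftUI view with the stream, assign each partial into state as it arrives. A minimal sketch, assuming each partial carries the response generated so far (in which case assigning, rather than appending, keeps the text consistent):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;struct StreamingNamesView: View {
    @State private var text = ""

    var body: some View {
        ScrollView { Text(text) }
            .task {
                do {
                    let session = LanguageModelSession()
                    let stream = session.streamResponse(to: "Suggest 5 creative app names for a fitness tracker")
                    for try await partial in stream {
                        // Replace the displayed text with the latest snapshot
                        text = partial.content
                    }
                } catch {
                    text = "Generation failed: \(error.localizedDescription)"
                }
            }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
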




&lt;h2&gt;
  
  
  Guided Generation: The Killer Feature
&lt;/h2&gt;

&lt;p&gt;Here's where Foundation Models really shines compared to typical LLM APIs. &lt;strong&gt;Guided Generation&lt;/strong&gt; lets you get structured, type-safe outputs directly as Swift types.&lt;/p&gt;

&lt;p&gt;Instead of parsing messy JSON strings, you define your output structure using the &lt;code&gt;@Generable&lt;/code&gt; macro:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;MovieRecommendation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"A brief one-sentence summary"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"PG"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PG-13"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"R"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"G"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then generate structured output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;MovieRecommendation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Recommend an action movie from the 2020s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;generating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;MovieRecommendation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// e.g., "Top Gun: Maverick"&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// "PG-13" — guaranteed to be one of the allowed values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@Guide&lt;/code&gt; macro lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add natural language descriptions to guide the model&lt;/li&gt;
&lt;li&gt;Constrain values to a specific set with &lt;code&gt;.anyOf()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Control array lengths with &lt;code&gt;count()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enforce string patterns with regex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;strong&gt;constrained decoding&lt;/strong&gt; built directly into the framework — the model is literally forced to produce valid output matching your Swift types at the token level. No more hoping the model returns valid JSON.&lt;/p&gt;
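
&lt;p&gt;For instance, a guide can pin down the length of a generated array. A hedged sketch: the exact guide spellings (such as &lt;code&gt;.count&lt;/code&gt;) are worth verifying against the current SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;@Generable
struct TripPlan {
    @Guide(description: "A short, punchy title for the trip")
    let title: String

    // Constrain the array to exactly three entries
    @Guide(.count(3))
    let stops: [String]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
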




&lt;h2&gt;
  
  
  Tool Calling: Give the Model Superpowers
&lt;/h2&gt;

&lt;p&gt;At ~3B parameters, the on-device model doesn't know everything. But with &lt;strong&gt;Tool Calling&lt;/strong&gt;, you can extend its capabilities by giving it access to your app's data and APIs.&lt;/p&gt;

&lt;p&gt;Here's an example — a health coach that reads HealthKit data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;BloodPressureTool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Tool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"getBloodPressure"&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Fetches the user's latest blood pressure reading"&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;EmptyInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Fetch from HealthKit&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;systolic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;diastolic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"Systolic: &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;systolic&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt; mmHg, Diastolic: &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;diastolic&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt; mmHg"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;BloodPressureTool&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"""
    You're a health coach. Help users manage their health 
    based on their blood pressure data.
    """&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"How's my blood pressure looking?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model will automatically decide when to call your tool, fetch the data, and incorporate it into a natural language response. This is incredibly powerful for building context-aware features.&lt;/p&gt;
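
&lt;p&gt;Tools can also take typed arguments: declare them as a nested &lt;code&gt;@Generable&lt;/code&gt; struct and the model fills them in when it decides to call the tool. A sketch with illustrative names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;struct WorkoutHistoryTool: Tool {
    let name = "getWorkouts"
    let description = "Fetches the user's workouts for a recent number of days"

    @Generable
    struct Arguments {
        @Guide(description: "How many days back to look")
        let days: Int
    }

    func call(arguments: Arguments) async throws -&amp;gt; String {
        // A real app would query HealthKit here
        "Found 4 workouts in the last \(arguments.days) days"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
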




&lt;h2&gt;
  
  
  Specialized Adapters: Content Tagging Out of the Box
&lt;/h2&gt;

&lt;p&gt;Beyond the general-purpose model, Apple provides specialized adapters for specific tasks. The &lt;strong&gt;content tagging adapter&lt;/strong&gt; is built-in and optimized for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topic tag generation&lt;/li&gt;
&lt;li&gt;Entity extraction&lt;/li&gt;
&lt;li&gt;Topic detection
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;taggingModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;useCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contentTagging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;taggingModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;@Generable&lt;/span&gt;
&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;Tags&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Apple announced new MacBook Pro with M5 chip at their spring event"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;generating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Tags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// ["Apple", "MacBook Pro", "M5", "Product Launch"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Custom Adapter Training: Teach the Model New Tricks
&lt;/h2&gt;

&lt;p&gt;For advanced use cases, Apple provides a &lt;strong&gt;Python-based adapter training toolkit&lt;/strong&gt; that lets you fine-tune the on-device model with your own data using LoRA (Low-Rank Adaptation).&lt;/p&gt;

&lt;p&gt;When should you consider training a custom adapter?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model needs to become a subject-matter expert for your domain&lt;/li&gt;
&lt;li&gt;You need a specific output style, format, or policy&lt;/li&gt;
&lt;li&gt;Prompt engineering isn't achieving the required accuracy&lt;/li&gt;
&lt;li&gt;You want lower latency by reducing prompt length&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key things to know about adapters:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each adapter is ~160MB in storage&lt;/li&gt;
&lt;li&gt;Adapters are compatible with a &lt;strong&gt;single specific model version&lt;/strong&gt; — you must retrain when Apple updates the base model&lt;/li&gt;
&lt;li&gt;Deploy via the Background Assets framework (don't bundle in your app)&lt;/li&gt;
&lt;li&gt;Requires Mac with Apple silicon and 32GB+ RAM, or Linux GPU machines&lt;/li&gt;
&lt;li&gt;You need the Foundation Models Framework Adapter Entitlement for production deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apple recommends exhausting prompt engineering and tool calling before jumping to adapter training. It's powerful but comes with ongoing maintenance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Availability and Requirements
&lt;/h2&gt;

&lt;p&gt;Before creating a session, always check availability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SystemLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;availability&lt;/span&gt;

&lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;// Ready to use&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deviceNotEligible&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// Device doesn't support Apple Intelligence&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appleIntelligenceNotEnabled&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// User needs to enable Apple Intelligence in Settings&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modelNotReady&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;// Model is still downloading&lt;/span&gt;
&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;iOS 26, iPadOS 26, macOS 26, or visionOS 26&lt;/li&gt;
&lt;li&gt;Apple Intelligence-compatible device (iPhone 15 Pro or later, M-series Macs/iPads)&lt;/li&gt;
&lt;li&gt;Apple Intelligence must be enabled in Settings&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Use Cases
&lt;/h2&gt;

&lt;p&gt;Here are some real-world features you can build today:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Journaling&lt;/strong&gt;: Auto-generate mood tags and summaries from journal entries using guided generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe Parsing&lt;/strong&gt;: Point the model at unstructured recipe text and extract structured ingredient lists and steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer Support&lt;/strong&gt;: Build an in-app assistant that uses tool calling to access order history and FAQs without any cloud dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content Moderation&lt;/strong&gt;: Use the content tagging adapter to automatically classify user-generated content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Learning&lt;/strong&gt;: Generate quiz questions based on study material, all processed locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workout Insights&lt;/strong&gt;: Combine tool calling with HealthKit to generate natural language summaries of fitness data.&lt;/p&gt;
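
&lt;p&gt;The personalized-learning case, for example, falls straight out of guided generation. A sketch with illustrative type names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;@Generable
struct QuizQuestion {
    let question: String

    @Guide(description: "Four answer options, exactly one correct")
    let choices: [String]

    @Guide(description: "Index into choices of the correct answer")
    let answerIndex: Int
}

let session = LanguageModelSession()
let quiz: QuizQuestion = try await session.respond(
    to: "Write one quiz question about the water cycle",
    generating: QuizQuestion.self
).content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
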




&lt;h2&gt;
  
  
  Limitations to Keep in Mind
&lt;/h2&gt;

&lt;p&gt;The Foundation Models framework is powerful, but it's not a silver bullet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window is limited&lt;/strong&gt; — the ~3B model can't handle massive prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not for general world knowledge&lt;/strong&gt; — it's optimized for tasks, not trivia&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple explicitly warns against using it for&lt;/strong&gt;: code generation, math calculations, or factual Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device-dependent&lt;/strong&gt; — older devices can't run it, so always have a fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapter retraining&lt;/strong&gt; — every OS update with a new model version means retraining your adapters&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What This Means for iOS Development
&lt;/h2&gt;

&lt;p&gt;The Foundation Models framework represents a fundamental shift. For the first time, iOS developers have access to a production-quality LLM with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero marginal cost&lt;/strong&gt; per inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Swift API&lt;/strong&gt; that feels like any other Apple framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type-safe outputs&lt;/strong&gt; through guided generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in privacy&lt;/strong&gt; without any extra work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't just another AI API wrapper. It's a deeply integrated, first-party framework that makes on-device intelligence a realistic feature for apps of all sizes — from indie side projects to enterprise applications.&lt;/p&gt;

&lt;p&gt;If you haven't started experimenting with it yet, now's the time. The barrier to adding AI features to your iOS app has never been lower.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: Xcode 26 + iOS 26 SDK + Apple Intelligence-enabled device&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://developer.apple.com/documentation/FoundationModels" rel="noopener noreferrer"&gt;Foundation Models | Apple Developer Documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>mobile</category>
      <category>ai</category>
      <category>swift</category>
    </item>
  </channel>
</rss>
