<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rost</title>
    <description>The latest articles on Forem by Rost (@rosgluk).</description>
    <link>https://forem.com/rosgluk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3544400%2F04dd81bf-749e-4055-971f-316c0134e76c.jpg</url>
      <title>Forem: Rost</title>
      <link>https://forem.com/rosgluk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rosgluk"/>
    <language>en</language>
    <item>
      <title>LLM Wiki - Compiled Knowledge That RAG Cannot Replace</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 18 May 2026 09:22:56 +0000</pubDate>
      <link>https://forem.com/rosgluk/llm-wiki-compiled-knowledge-that-rag-cannot-replace-8op</link>
      <guid>https://forem.com/rosgluk/llm-wiki-compiled-knowledge-that-rag-cannot-replace-8op</guid>
      <description>&lt;p&gt;The premise is simple: compiled knowledge is more reusable than retrieved fragments.&lt;br&gt;
RAG became the default answer to a straightforward question - how do I give an LLM access to external knowledge?&lt;/p&gt;



&lt;p&gt;And the usual architecture is by now familiar.&lt;br&gt;
Take documents, split them into chunks, embed the chunks, store them in a vector database, retrieve relevant pieces at query time, and pass them into the model. That pattern is useful, but it is also overused. RAG is very good at access and not automatically good at structure. It can find relevant fragments but does not create a stable understanding of a domain, it can retrieve context but does not decide what the canonical explanation is, and it can answer from documents but does not maintain a living knowledge base.&lt;/p&gt;

&lt;p&gt;LLM Wiki is not just another retrieval pattern but a different way to think about knowledge architecture entirely. Instead of asking the model to synthesize from raw chunks every time a question is asked, an LLM Wiki uses the model earlier in the pipeline, performing synthesis at ingest time and storing the result as structured, readable, linked knowledge.&lt;/p&gt;

&lt;p&gt;A good shorthand is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG retrieves knowledge at query time.&lt;/li&gt;
&lt;li&gt;LLM Wiki compiles knowledge at ingest time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction changes cost, latency, quality, maintenance, governance, and failure modes - and it is the central reason LLM Wiki deserves its own architecture category.&lt;/p&gt;
&lt;h2&gt;
  
  
  RAG optimizes retrieval, not representation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; is powerful because it lets a language model use information outside its training data, making it useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;company documentation&lt;/li&gt;
&lt;li&gt;product manuals&lt;/li&gt;
&lt;li&gt;technical support&lt;/li&gt;
&lt;li&gt;internal search&lt;/li&gt;
&lt;li&gt;research assistants&lt;/li&gt;
&lt;li&gt;policy lookup&lt;/li&gt;
&lt;li&gt;code documentation&lt;/li&gt;
&lt;li&gt;knowledge base chatbots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But RAG has a structural weakness: it often treats knowledge as a pile of retrievable fragments rather than a structured model of a domain.&lt;/p&gt;

&lt;p&gt;A typical RAG system works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collect documents.&lt;/li&gt;
&lt;li&gt;Split them into chunks.&lt;/li&gt;
&lt;li&gt;Create embeddings.&lt;/li&gt;
&lt;li&gt;Store the chunks in a vector database.&lt;/li&gt;
&lt;li&gt;Retrieve similar chunks for each query.&lt;/li&gt;
&lt;li&gt;Ask the LLM to answer using those chunks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works well for many questions, but it also creates repeated interpretation work for complex ones. Every time a user asks something conceptually rich, the system has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieve fragments&lt;/li&gt;
&lt;li&gt;decide which fragments matter&lt;/li&gt;
&lt;li&gt;infer relationships&lt;/li&gt;
&lt;li&gt;resolve contradictions&lt;/li&gt;
&lt;li&gt;build a temporary explanation&lt;/li&gt;
&lt;li&gt;produce an answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then that synthesis disappears and the next query starts from scratch. This is fine when questions are simple, but it becomes wasteful when the same concepts are repeatedly reconstructed from raw fragments.&lt;/p&gt;

&lt;p&gt;The most common RAG mistake is assuming that better retrieval equals better knowledge. Sometimes that is true, but often it is not, because &lt;a href="https://www.glukhov.org/knowledge-management/foundations/retrieval-vs-representation/" rel="noopener noreferrer"&gt;retrieval and representation solve different problems&lt;/a&gt;. Retrieval answers which pieces of text are relevant; representation answers how knowledge should be structured in the first place. A RAG system can retrieve five accurate chunks about a topic and still fail because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the chunks are outdated&lt;/li&gt;
&lt;li&gt;the documents contradict each other&lt;/li&gt;
&lt;li&gt;the important concept is spread across pages&lt;/li&gt;
&lt;li&gt;the source uses inconsistent terminology&lt;/li&gt;
&lt;li&gt;the answer requires synthesis, not lookup&lt;/li&gt;
&lt;li&gt;there is no canonical page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG is an access layer, not a knowledge model by itself, and an LLM Wiki exists precisely because some knowledge should be represented before it is retrieved.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is an LLM Wiki?
&lt;/h2&gt;

&lt;p&gt;An LLM Wiki is a knowledge system where a language model helps transform source material into structured wiki-like knowledge. Instead of storing only raw documents and retrieving chunks later, the system creates derived knowledge artifacts such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;topic pages&lt;/li&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;glossaries&lt;/li&gt;
&lt;li&gt;concept pages&lt;/li&gt;
&lt;li&gt;entity pages&lt;/li&gt;
&lt;li&gt;cross-links&lt;/li&gt;
&lt;li&gt;comparisons&lt;/li&gt;
&lt;li&gt;contradiction notes&lt;/li&gt;
&lt;li&gt;source references&lt;/li&gt;
&lt;li&gt;decision records&lt;/li&gt;
&lt;li&gt;explanations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is usually human-readable and, in many implementations, stored as plain Markdown, which matters because Markdown makes the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inspectable&lt;/li&gt;
&lt;li&gt;portable&lt;/li&gt;
&lt;li&gt;editable&lt;/li&gt;
&lt;li&gt;versionable&lt;/li&gt;
&lt;li&gt;easy to diff&lt;/li&gt;
&lt;li&gt;compatible with static sites and PKM tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is not that the LLM magically knows everything but that the LLM helps maintain a structured layer over the source material, acting as a structuring assistant rather than the final authority.&lt;/p&gt;
&lt;h2&gt;
  
  
  The core idea
&lt;/h2&gt;

&lt;p&gt;The core idea of LLM Wiki is ingest-time knowledge synthesis. In a RAG system, synthesis usually happens when a user asks a question; in an LLM Wiki, synthesis happens earlier, during ingestion, before any question has been asked.&lt;/p&gt;

&lt;p&gt;A simplified pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sources
  -&amp;gt; ingest
  -&amp;gt; summarize
  -&amp;gt; structure
  -&amp;gt; link
  -&amp;gt; maintain
  -&amp;gt; query or browse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system does not wait until query time to figure out what the knowledge means - it creates a reusable structure in advance, which makes LLM Wiki closer to a compiled knowledge base than a search pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical example
&lt;/h2&gt;

&lt;p&gt;Imagine you have 60 articles about local LLM hosting. A RAG system might split them into chunks and retrieve relevant sections when you ask about the differences between Ollama, vLLM, llama.cpp, and SGLang, then let the LLM assemble an answer from those retrieved fragments.&lt;/p&gt;

&lt;p&gt;An LLM Wiki system does something different. At ingest time, it creates structured pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ollama.md&lt;/li&gt;
&lt;li&gt;vllm.md&lt;/li&gt;
&lt;li&gt;llama-cpp.md&lt;/li&gt;
&lt;li&gt;sglang.md&lt;/li&gt;
&lt;li&gt;local-llm-hosting-overview.md&lt;/li&gt;
&lt;li&gt;inference-backends-comparison.md&lt;/li&gt;
&lt;li&gt;gpu-memory-and-context-length.md&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it links them. When you later ask a question, the system is not starting from raw fragments but from a structured knowledge layer that was already assembled before the question arrived - and for conceptual and comparative questions, that difference in quality is significant.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LLM Wiki works
&lt;/h2&gt;

&lt;p&gt;There is no single official implementation, but most LLM Wiki systems follow the same conceptual stages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source collection
&lt;/h3&gt;

&lt;p&gt;The system starts with source material - blog posts, PDFs, Markdown notes, technical documentation, transcripts, papers, meeting notes, bookmarks, code comments, and README files - which should be preserved as a separate layer, distinct from the generated wiki. This matters because generated wiki pages are derived knowledge, not original truth, and a serious LLM Wiki should always maintain links back to sources so that every generated page can answer the basic question: where did this claim come from?&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion and extraction
&lt;/h3&gt;

&lt;p&gt;During ingestion, the system reads source material and extracts useful knowledge. It may identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;main topics&lt;/li&gt;
&lt;li&gt;entities and tools&lt;/li&gt;
&lt;li&gt;definitions&lt;/li&gt;
&lt;li&gt;claims&lt;/li&gt;
&lt;li&gt;decisions&lt;/li&gt;
&lt;li&gt;examples&lt;/li&gt;
&lt;li&gt;contradictions between sources&lt;/li&gt;
&lt;li&gt;open questions&lt;/li&gt;
&lt;li&gt;recurring concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This stage is where LLM Wiki starts to differ from ordinary RAG: while RAG usually chunks documents for retrieval, LLM Wiki tries to understand and reshape the material conceptually rather than just making it searchable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summarization
&lt;/h3&gt;

&lt;p&gt;The system creates summaries, but useful summaries are not just shorter versions of text - they should preserve the structure of the argument. A weak summary says "this document discusses local LLM hosting tools." A useful summary says "this document compares local LLM hosting tools by deployment complexity, GPU usage, API compatibility, and production readiness, positioning Ollama as easy for local use, vLLM as stronger for server workloads, and llama.cpp as flexible for quantized models."&lt;/p&gt;

&lt;p&gt;For technical knowledge, a summary should capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what problem it solves&lt;/li&gt;
&lt;li&gt;what assumptions it makes&lt;/li&gt;
&lt;li&gt;what tradeoffs it contains&lt;/li&gt;
&lt;li&gt;what dependencies it has&lt;/li&gt;
&lt;li&gt;what is still uncertain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where LLMs are genuinely useful, because they are good at compressing messy prose into structured explanations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structuring
&lt;/h3&gt;

&lt;p&gt;Summaries alone are not enough - the system must also decide where knowledge belongs, which is the representation layer. Common structures include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;topic pages&lt;/li&gt;
&lt;li&gt;concept pages&lt;/li&gt;
&lt;li&gt;index pages&lt;/li&gt;
&lt;li&gt;comparison pages&lt;/li&gt;
&lt;li&gt;glossary entries&lt;/li&gt;
&lt;li&gt;how-to pages&lt;/li&gt;
&lt;li&gt;architecture notes&lt;/li&gt;
&lt;li&gt;decision records&lt;/li&gt;
&lt;li&gt;maps of related pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A pile of summaries is not a wiki; a wiki needs page boundaries, links, and recurring structure, and a good LLM Wiki is not measured by page count but by whether pages become genuinely reusable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linking
&lt;/h3&gt;

&lt;p&gt;Links define the shape of the knowledge system. In a normal document archive, relationships are often implicit; in an LLM Wiki, they should become explicit. Useful link types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;concept to concept&lt;/li&gt;
&lt;li&gt;article to summary&lt;/li&gt;
&lt;li&gt;tool to comparison&lt;/li&gt;
&lt;li&gt;problem to solution&lt;/li&gt;
&lt;li&gt;architecture to implementation&lt;/li&gt;
&lt;li&gt;source to derived page&lt;/li&gt;
&lt;li&gt;glossary term to detailed page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important differences between LLM Wiki and basic summarization: summaries reduce text, but links build a knowledge graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Review and correction
&lt;/h3&gt;

&lt;p&gt;This stage is optional only in toy systems; in serious systems, human review is essential. The review process should check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether summaries are faithful&lt;/li&gt;
&lt;li&gt;whether links are useful&lt;/li&gt;
&lt;li&gt;whether claims are sourced&lt;/li&gt;
&lt;li&gt;whether pages are duplicated&lt;/li&gt;
&lt;li&gt;whether concepts are misplaced&lt;/li&gt;
&lt;li&gt;whether outdated information is marked&lt;/li&gt;
&lt;li&gt;whether generated pages overstate certainty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM Wiki can reduce human effort, but it should never remove human responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Wiki vs RAG
&lt;/h2&gt;

&lt;p&gt;The cleanest distinction between LLM Wiki and RAG is timing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query-time synthesis
&lt;/h3&gt;

&lt;p&gt;In RAG, the system retrieves information when a user asks a question.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query
  -&amp;gt; retrieve chunks
  -&amp;gt; assemble context
  -&amp;gt; generate answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is flexible and works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the corpus is large&lt;/li&gt;
&lt;li&gt;information changes often&lt;/li&gt;
&lt;li&gt;questions are unpredictable&lt;/li&gt;
&lt;li&gt;you need broad coverage&lt;/li&gt;
&lt;li&gt;you cannot curate everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it may be less coherent for conceptual questions, because the model has to synthesize from fragments each time, which can produce inconsistent answers across similar queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingest-time synthesis
&lt;/h3&gt;

&lt;p&gt;In LLM Wiki, the system performs synthesis before the question arrives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sources
  -&amp;gt; summarize
  -&amp;gt; structure
  -&amp;gt; link
  -&amp;gt; query or browse later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is less flexible but more coherent, and it works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the corpus is manageable&lt;/li&gt;
&lt;li&gt;the domain is stable&lt;/li&gt;
&lt;li&gt;concepts repeat&lt;/li&gt;
&lt;li&gt;human readability matters&lt;/li&gt;
&lt;li&gt;you want reusable synthesis&lt;/li&gt;
&lt;li&gt;you want a maintained knowledge layer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The main differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;LLM Wiki&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main timing&lt;/td&gt;
&lt;td&gt;Query time&lt;/td&gt;
&lt;td&gt;Ingest time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main operation&lt;/td&gt;
&lt;td&gt;Retrieve chunks&lt;/td&gt;
&lt;td&gt;Compile knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best corpus&lt;/td&gt;
&lt;td&gt;Large and changing&lt;/td&gt;
&lt;td&gt;Curated and stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;Generated answer&lt;/td&gt;
&lt;td&gt;Structured knowledge pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Search index or vector DB&lt;/td&gt;
&lt;td&gt;Markdown or wiki structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strength&lt;/td&gt;
&lt;td&gt;Flexible access&lt;/td&gt;
&lt;td&gt;Reusable synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weakness&lt;/td&gt;
&lt;td&gt;Fragmented context&lt;/td&gt;
&lt;td&gt;Maintenance drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human readability&lt;/td&gt;
&lt;td&gt;Often indirect&lt;/td&gt;
&lt;td&gt;Usually direct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Complementary, not mutually exclusive
&lt;/h3&gt;

&lt;p&gt;The debate should not be framed as "LLM Wiki or RAG" - that is the wrong question. LLM Wiki does not replace RAG in most production systems; both have distinct and complementary roles. A well-designed system may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw documents
  -&amp;gt; source store
  -&amp;gt; LLM Wiki synthesis
  -&amp;gt; reviewed knowledge pages
  -&amp;gt; search index
  -&amp;gt; RAG over source and synthesis
  -&amp;gt; answer with citations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In that architecture, LLM Wiki improves the representation layer and RAG improves the access layer. Use RAG for retrieval over large and changing corpora, use LLM Wiki for compiled synthesis over stable and curated knowledge, and use both together when you need scale and coherence at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Wiki vs adjacent systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LLM Wiki vs summarization
&lt;/h3&gt;

&lt;p&gt;A weak LLM Wiki is just a folder of generated summaries, and that is not enough. Summarization compresses content; LLM Wiki structures it. A real LLM Wiki needs stable pages, links, concepts, indexes, source tracking, revision history, maintenance workflows, and conflict detection - the wiki part matters as much as the LLM part.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Wiki vs knowledge graph
&lt;/h3&gt;

&lt;p&gt;A knowledge graph represents entities and relationships explicitly, while an LLM Wiki creates a softer, document-oriented graph through Markdown pages and links. A mature system can use both: the wiki provides human-readable explanations and the knowledge graph provides precisely structured, machine-queryable relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Wiki vs agent memory
&lt;/h3&gt;

&lt;p&gt;LLM Wiki is also different from &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;AI memory&lt;/a&gt;. Memory stores context that affects future behavior, while an LLM Wiki stores structured knowledge that can be read, searched, reviewed, and linked by both humans and systems.&lt;/p&gt;

&lt;p&gt;Memory might remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the user prefers Go examples&lt;/li&gt;
&lt;li&gt;the project avoids ORMs&lt;/li&gt;
&lt;li&gt;the agent tried a command yesterday&lt;/li&gt;
&lt;li&gt;a bug investigation failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An LLM Wiki might store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what Go database access patterns exist&lt;/li&gt;
&lt;li&gt;how sqlc compares with GORM&lt;/li&gt;
&lt;li&gt;why outbox patterns matter&lt;/li&gt;
&lt;li&gt;how RAG differs from memory systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory is behavioral context; LLM Wiki is represented knowledge - and mixing the two leads to systems that are hard to inspect, audit, or maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  When LLM Wiki works well
&lt;/h2&gt;

&lt;p&gt;LLM Wiki works best for stable domains, personal research, curated corpora, technical documentation, and situations where repeated synthesis over the same material is wasteful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable domains
&lt;/h3&gt;

&lt;p&gt;LLM Wiki works best when the domain does not change every hour. Good examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;technical concepts&lt;/li&gt;
&lt;li&gt;research notes&lt;/li&gt;
&lt;li&gt;learning material&lt;/li&gt;
&lt;li&gt;architecture patterns&lt;/li&gt;
&lt;li&gt;book notes&lt;/li&gt;
&lt;li&gt;model comparison notes&lt;/li&gt;
&lt;li&gt;internal engineering principles&lt;/li&gt;
&lt;li&gt;curated documentation&lt;/li&gt;
&lt;li&gt;personal knowledge bases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If knowledge is stable enough to summarize without becoming stale within days, LLM Wiki can deliver lasting value that compounds as the wiki grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research synthesis
&lt;/h3&gt;

&lt;p&gt;Research synthesis is one of the strongest use cases, because researchers often read many sources and repeatedly ask the same meta-questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the main ideas?&lt;/li&gt;
&lt;li&gt;Which sources agree?&lt;/li&gt;
&lt;li&gt;Which sources conflict?&lt;/li&gt;
&lt;li&gt;What concepts repeat?&lt;/li&gt;
&lt;li&gt;What is the current state of the topic?&lt;/li&gt;
&lt;li&gt;What should I read next?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM Wiki helps turn that research material into reusable structure - topic pages, comparison pages, contradiction notes, and related links - so the researcher does not have to rebuild the same mental map every time they return to a domain. It is especially useful when working with papers, technical articles, transcripts, documentation, notes, and experiment logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personal knowledge systems
&lt;/h3&gt;

&lt;p&gt;LLM Wiki fits naturally with &lt;a href="https://www.glukhov.org/knowledge-management/foundations/pkm-vs-rag-vs-wiki-vs-memory-systems/" rel="noopener noreferrer"&gt;PKM and the broader knowledge systems spectrum&lt;/a&gt; and &lt;a href="https://www.glukhov.org/knowledge-management/foundations/second-brain/" rel="noopener noreferrer"&gt;second brain&lt;/a&gt; workflows because a personal knowledge system already contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;notes&lt;/li&gt;
&lt;li&gt;links&lt;/li&gt;
&lt;li&gt;unfinished ideas&lt;/li&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;references&lt;/li&gt;
&lt;li&gt;topic maps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An LLM can help maintain the structure by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarizing long notes&lt;/li&gt;
&lt;li&gt;proposing links&lt;/li&gt;
&lt;li&gt;creating topic pages&lt;/li&gt;
&lt;li&gt;detecting duplicate concepts&lt;/li&gt;
&lt;li&gt;extracting glossary terms&lt;/li&gt;
&lt;li&gt;generating index pages&lt;/li&gt;
&lt;li&gt;identifying gaps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The human remains the editor, which is the right relationship between human judgment and machine assistance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical blogging
&lt;/h3&gt;

&lt;p&gt;A technical blog can use LLM Wiki ideas internally even without building a full automated system. A well-structured site can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pillar pages&lt;/li&gt;
&lt;li&gt;cluster index pages&lt;/li&gt;
&lt;li&gt;topic summaries&lt;/li&gt;
&lt;li&gt;related article maps&lt;/li&gt;
&lt;li&gt;glossary pages&lt;/li&gt;
&lt;li&gt;comparison pages&lt;/li&gt;
&lt;li&gt;canonical explainers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not only SEO but knowledge representation: a well-structured technical blog becomes more valuable when articles are connected into a durable knowledge structure that both humans and AI systems can navigate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Small team knowledge bases
&lt;/h3&gt;

&lt;p&gt;LLM Wiki can work well for small teams with curated knowledge, including engineering decisions, product architecture, onboarding notes, support playbooks, internal standards, postmortems, and runbooks. The key condition is governance: someone must review and maintain the generated structure, because without clear ownership the wiki decays into noise regardless of how well it was initially generated.&lt;/p&gt;

&lt;h2&gt;
  
  
  When LLM Wiki is a poor fit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Highly dynamic data
&lt;/h3&gt;

&lt;p&gt;LLM Wiki is weaker when information changes constantly. Live inventory, pricing feeds, incident status, financial market data, rapidly changing support tickets, and real-time logs are all better served by retrieval or direct API access. Compiling fast-moving data into static summaries is counterproductive unless you have a strong refresh process that keeps the compiled layer in sync with reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large unmanaged corpora
&lt;/h3&gt;

&lt;p&gt;LLM Wiki does not automatically scale to millions of documents. At large scale, the difficult problems extend well beyond generation and include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;access control&lt;/li&gt;
&lt;li&gt;data lineage&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;li&gt;deduplication&lt;/li&gt;
&lt;li&gt;indexing&lt;/li&gt;
&lt;li&gt;freshness tracking&lt;/li&gt;
&lt;li&gt;evaluation&lt;/li&gt;
&lt;li&gt;governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple Markdown wiki is not equipped to address those needs, and at enterprise scale, LLM Wiki may become one layer inside a larger knowledge architecture rather than the whole system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Low-quality sources
&lt;/h3&gt;

&lt;p&gt;LLM Wiki cannot reliably fix bad sources. If the source material is contradictory, outdated, low quality, duplicated, incomplete, or badly scoped, generated pages may look polished but be wrong. This is dangerous precisely because a clean generated page creates false confidence - the formatting signals quality even when the underlying content does not justify it.&lt;/p&gt;

&lt;h3&gt;
  
  
  No review process
&lt;/h3&gt;

&lt;p&gt;LLM Wiki without review is risky because generated structure creates authority. A bad answer in RAG may affect one query, but a bad generated wiki page may affect many future queries, readers, and agents that retrieve from it. The model may overgeneralize, miss exceptions, invent structure, merge incompatible ideas, hide uncertainty, create misleading links, or summarize outdated material as though it were current - so for any knowledge that actually matters, human review is not optional.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and failure modes
&lt;/h2&gt;

&lt;p&gt;The main risks of building an LLM Wiki are stale summaries, hallucinated synthesis baked into the knowledge base, weak source tracking, maintenance cost, and false confidence in generated structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintenance drift
&lt;/h3&gt;

&lt;p&gt;Knowledge drift happens when generated pages stop matching the underlying sources. This can happen because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sources changed&lt;/li&gt;
&lt;li&gt;new sources were added&lt;/li&gt;
&lt;li&gt;old pages were not refreshed&lt;/li&gt;
&lt;li&gt;summaries were edited manually&lt;/li&gt;
&lt;li&gt;links became outdated&lt;/li&gt;
&lt;li&gt;model output changed over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drift is the central operational risk of LLM Wiki, and a good system needs explicit refresh and validation workflows to catch it before it propagates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucinated synthesis
&lt;/h3&gt;

&lt;p&gt;RAG can hallucinate at answer time, but LLM Wiki can hallucinate at ingest time, which is more subtle and more dangerous. If a generated wiki page contains a wrong synthesis, future users may treat that page as ground truth, and future AI systems may retrieve it and amplify the mistake further. Generated structure needs provenance, and every important claim should link back to its original sources so the hallucination can be caught during review rather than silently embedded in the knowledge base.&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-structuring
&lt;/h3&gt;

&lt;p&gt;Once you have an LLM that can create pages cheaply, it is tempting to create too many of them. You can end up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;empty taxonomy&lt;/li&gt;
&lt;li&gt;duplicate concepts&lt;/li&gt;
&lt;li&gt;shallow pages&lt;/li&gt;
&lt;li&gt;meaningless links&lt;/li&gt;
&lt;li&gt;generated clutter&lt;/li&gt;
&lt;li&gt;fake completeness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful wiki is not measured by page count but by whether pages are actually reused, linked, and updated over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unclear ownership
&lt;/h3&gt;

&lt;p&gt;The model cannot own the page. A serious system needs clear ownership rules covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who reviews pages&lt;/li&gt;
&lt;li&gt;who approves updates&lt;/li&gt;
&lt;li&gt;who deletes stale pages&lt;/li&gt;
&lt;li&gt;who resolves contradictions&lt;/li&gt;
&lt;li&gt;who decides canonical structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that clarity, LLM Wiki becomes another abandoned knowledge base - well-intentioned, well-generated, and quietly ignored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1. Personal LLM Wiki
&lt;/h3&gt;

&lt;p&gt;The personal pattern is the simplest and most practical version, best suited for individuals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;notes and sources
  -&amp;gt; LLM assisted summaries
  -&amp;gt; Markdown pages
  -&amp;gt; manual review
  -&amp;gt; [Obsidian](https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/ "Using Obsidian for Personal Knowledge Management") or static site
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works well for researchers, writers, engineers, technical bloggers, students, and consultants, where the value comes from reducing repeated synthesis and making personal knowledge easier to navigate without requiring any team coordination or governance infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2. Team LLM Wiki
&lt;/h3&gt;

&lt;p&gt;The team pattern is best for small groups and needs more governance than the personal version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;team docs
  -&amp;gt; ingest workflow
  -&amp;gt; generated draft pages
  -&amp;gt; review queue
  -&amp;gt; published wiki
  -&amp;gt; search or RAG layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The review queue is critical here, because generated knowledge should never be published directly into a team source of truth without a human checkpoint - even a lightweight review process catches the most dangerous hallucinations before they become institutional knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3. LLM Wiki plus RAG
&lt;/h3&gt;

&lt;p&gt;This is often the most balanced architecture, giving you both raw source access and compiled synthesis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw sources
  -&amp;gt; LLM Wiki pages
  -&amp;gt; reviewed knowledge base
  -&amp;gt; search index
  -&amp;gt; RAG over raw and compiled knowledge
  -&amp;gt; cited answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RAG system can retrieve from original documents, generated summaries, topic pages, comparison pages, and glossary entries, which makes retrieval quality significantly stronger than operating over raw documents alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4. LLM Wiki as site architecture
&lt;/h3&gt;

&lt;p&gt;For a technical website, LLM Wiki ideas can guide content structure even without automation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;articles
  -&amp;gt; pillar pages
  -&amp;gt; topic maps
  -&amp;gt; comparisons
  -&amp;gt; internal links
  -&amp;gt; search and AI access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This turns a blog into a knowledge system where articles are not just posts but nodes in a structured map - a significant difference for both reader experience and machine-readable discoverability.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Wiki design principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Keep raw sources separate
&lt;/h3&gt;

&lt;p&gt;Never lose the original source. Generated pages should not replace source documents but sit above them - the source layer provides evidence, the wiki layer provides interpretation, and losing the original means losing the ability to verify, challenge, or update the interpretation derived from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Markdown where possible
&lt;/h3&gt;

&lt;p&gt;Markdown is boring and excellent. It is portable, readable, diffable, versionable, easy to edit, friendly to static sites, and friendly to PKM tools. Boring formats survive longer than clever platforms, which means a Markdown-based LLM Wiki built today will still be usable long after whatever proprietary database you might have chosen has gone through multiple breaking migrations. For syntax reference, see the &lt;a href="https://www.glukhov.org/documentation-tools/markdown/markdown-cheatsheet/" rel="noopener noreferrer"&gt;Markdown Cheatsheet&lt;/a&gt; and the guide to &lt;a href="https://www.glukhov.org/documentation-tools/markdown/markdown-codeblocks/" rel="noopener noreferrer"&gt;Markdown Code Blocks&lt;/a&gt;, which are especially relevant when structuring wiki pages that include technical content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Track provenance
&lt;/h3&gt;

&lt;p&gt;Every generated page should answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What sources created this?&lt;/li&gt;
&lt;li&gt;When was it generated?&lt;/li&gt;
&lt;li&gt;When was it reviewed?&lt;/li&gt;
&lt;li&gt;What changed?&lt;/li&gt;
&lt;li&gt;Who approved it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without provenance, trust collapses over time as pages drift further from their origins. A practical page schema might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;title
summary
status
sources
last_reviewed
related_pages
concepts
open_questions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For technical content, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;applies_to
version
examples
tradeoffs
failure_modes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For research content, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claims
evidence
contradictions
confidence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prefer fewer better pages
&lt;/h3&gt;

&lt;p&gt;Do not generate a page for every minor idea. Prefer strong concept pages, useful comparison pages, topic indexes, canonical summaries, and glossary entries that earn their place. A small useful wiki with twenty well-maintained pages beats a large generated mess with two hundred pages nobody reads or updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make links meaningful
&lt;/h3&gt;

&lt;p&gt;Links should explain relationships rather than just connect pages at random. Useful link types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;related concept&lt;/li&gt;
&lt;li&gt;depends on&lt;/li&gt;
&lt;li&gt;contrasts with&lt;/li&gt;
&lt;li&gt;example of&lt;/li&gt;
&lt;li&gt;source for&lt;/li&gt;
&lt;li&gt;expands on&lt;/li&gt;
&lt;li&gt;implementation of&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Random links create noise and erode reader trust in the structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mark uncertainty
&lt;/h3&gt;

&lt;p&gt;LLM Wiki pages should not pretend all knowledge is equally certain. Useful status markers include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirmed&lt;/li&gt;
&lt;li&gt;likely&lt;/li&gt;
&lt;li&gt;disputed&lt;/li&gt;
&lt;li&gt;outdated&lt;/li&gt;
&lt;li&gt;needs review&lt;/li&gt;
&lt;li&gt;source conflict&lt;/li&gt;
&lt;li&gt;generated summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These markers protect readers from false confidence and give maintainers a clear signal about which pages need attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate an LLM Wiki
&lt;/h2&gt;

&lt;p&gt;Do not only ask whether the generated pages look impressive - ask whether they improve knowledge work. Useful evaluation questions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can users find concepts faster?&lt;/li&gt;
&lt;li&gt;Are repeated questions answered better?&lt;/li&gt;
&lt;li&gt;Are source links preserved?&lt;/li&gt;
&lt;li&gt;Are contradictions easier to see?&lt;/li&gt;
&lt;li&gt;Are pages reused?&lt;/li&gt;
&lt;li&gt;Are summaries accurate?&lt;/li&gt;
&lt;li&gt;Is stale content detected?&lt;/li&gt;
&lt;li&gt;Does the wiki reduce repeated synthesis?&lt;/li&gt;
&lt;li&gt;Does it help humans write or decide?&lt;/li&gt;
&lt;li&gt;Does it improve RAG answer quality?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is no to most of these, the wiki is decoration regardless of how many pages it contains.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Wiki and knowledge management
&lt;/h2&gt;

&lt;p&gt;LLM Wiki belongs in &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;knowledge management&lt;/a&gt; because it is fundamentally about representation, not primarily about model hosting, vector search, or agent execution. It answers a different question: how should knowledge be structured so that humans and AI systems can reuse it? That places it in the knowledge systems architecture layer, connecting naturally to PKM, wikis, RAG, agent memory, knowledge graphs, technical publishing, and research synthesis.&lt;/p&gt;

&lt;p&gt;A clean layer model looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human thinking - PKM, explore and develop ideas&lt;/li&gt;
&lt;li&gt;Shared knowledge - Wiki, maintain canonical pages&lt;/li&gt;
&lt;li&gt;Compiled knowledge - LLM Wiki, generate structured synthesis&lt;/li&gt;
&lt;li&gt;Machine access - RAG, retrieve context at query time&lt;/li&gt;
&lt;li&gt;Agent continuity - Memory, persist behavior and preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM Wiki occupies the compiled knowledge layer, and that position is what makes it useful - it is the layer that turns a pile of documents into something both humans and machines can navigate and reason over.&lt;/p&gt;

&lt;h2&gt;
  
  
  My opinionated take
&lt;/h2&gt;

&lt;p&gt;LLM Wiki is important, but the hype is slightly wrong - it is not a RAG killer, but a reminder that knowledge representation matters. The industry spent years optimizing retrieval pipelines, and that work was necessary, but many systems still retrieve from badly structured knowledge. Better embeddings and better rerankers help, but they cannot fully compensate for a weak knowledge layer.&lt;/p&gt;

&lt;p&gt;LLM Wiki pushes the conversation back toward structure by asking better questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the core concepts?&lt;/li&gt;
&lt;li&gt;What is canonical?&lt;/li&gt;
&lt;li&gt;How do ideas connect?&lt;/li&gt;
&lt;li&gt;What should be summarized once?&lt;/li&gt;
&lt;li&gt;What should be retrieved fresh?&lt;/li&gt;
&lt;li&gt;What should be reviewed by humans?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right conversation, and the future is not just better vector search but layered knowledge systems where representation, retrieval, and memory each play a distinct and well-understood role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;LLM Wiki is an architecture pattern for compiled knowledge that uses language models to help transform source material into structured, linked, reusable knowledge before questions are asked. Its core workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summarize
  -&amp;gt; structure
  -&amp;gt; link
  -&amp;gt; review
  -&amp;gt; reuse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared with RAG, the main difference is timing: RAG performs synthesis at query time, while LLM Wiki performs synthesis at ingest time, which makes it valuable for stable domains, research synthesis, personal knowledge bases, technical blogs, and curated team knowledge.&lt;/p&gt;

&lt;p&gt;But it has real limitations. It can drift when sources change, hallucinate when model output is wrong, create false confidence when review is absent, and collapse into noise when ownership is unclear. Used badly, it becomes another abandoned wiki. Used well, it becomes the representation layer between raw documents and AI systems - not a replacement for RAG, but the missing layer that makes retrieval worth using.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources and further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;AWS - What Is Retrieval Augmented Generation?&lt;/a&gt; - AWS foundational overview of how RAG pipelines are constructed and when they are appropriate.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.ibm.com/think/topics/retrieval-augmented-generation" rel="noopener noreferrer"&gt;IBM - Retrieval Augmented Generation&lt;/a&gt; - IBM overview of RAG architecture, covering grounding, hallucination reduction, and enterprise use cases.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/use-cases/retrieval-augmented-generation" rel="noopener noreferrer"&gt;Google Cloud - Retrieval Augmented Generation&lt;/a&gt; - Google Cloud perspective on RAG use cases, system design, and integration with vector search.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://atlan.com/know/llm-wiki-vs-rag-knowledge-base/" rel="noopener noreferrer"&gt;Atlan - LLM Wiki vs RAG Knowledge Base&lt;/a&gt; - Practical comparison of LLM Wiki and RAG approaches from a data catalog perspective.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ranjankumar.in/llm-wiki-synthesis-time-decision-rag-agentic-memory" rel="noopener noreferrer"&gt;Ranjan Kumar - LLM Wiki, Synthesis Time, RAG, and Agentic Memory&lt;/a&gt; - In-depth discussion of the timing distinction between synthesis approaches and how they fit into agentic architectures.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/vishalmysore/rag-vs-agent-memory-vs-llm-wiki-a-practical-comparison-1oo6"&gt;Dev.to - RAG vs Agent Memory vs LLM Wiki&lt;/a&gt; - Practical comparison of all three knowledge patterns with implementation notes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.starmorph.com/blog/karpathy-llm-wiki-knowledge-base-guide" rel="noopener noreferrer"&gt;Starmorph - Karpathy LLM Wiki Knowledge Base Guide&lt;/a&gt; - Guide inspired by Andrej Karpathy's framing of LLM Wiki as a compiled knowledge system.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mindstudio.ai/blog/llm-wiki-vs-rag-knowledge-base" rel="noopener noreferrer"&gt;MindStudio - LLM Wiki vs RAG Knowledge Base&lt;/a&gt; - MindStudio perspective on choosing between LLM Wiki and RAG for AI assistant knowledge.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>wiki</category>
      <category>knowledgemanagement</category>
      <category>rag</category>
      <category>aisystems</category>
    </item>
    <item>
      <title>Retrieval vs Representation in Knowledge Systems</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 18 May 2026 09:22:43 +0000</pubDate>
      <link>https://forem.com/rosgluk/retrieval-vs-representation-in-knowledge-systems-5e49</link>
      <guid>https://forem.com/rosgluk/retrieval-vs-representation-in-knowledge-systems-5e49</guid>
      <description>&lt;p&gt;Most modern knowledge systems optimize retrieval, and that is understandable.&lt;br&gt;
Search is visible, easy to demo, and feels magical when it works. Type a question, get an answer.&lt;/p&gt;



&lt;p&gt;But retrieval is only one half of the problem. The deeper question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What shape does the knowledge have before anything tries to retrieve it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is representation — the structure behind the knowledge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;notes&lt;/li&gt;
&lt;li&gt;pages&lt;/li&gt;
&lt;li&gt;schemas&lt;/li&gt;
&lt;li&gt;graphs&lt;/li&gt;
&lt;li&gt;entities&lt;/li&gt;
&lt;li&gt;relationships&lt;/li&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;taxonomies&lt;/li&gt;
&lt;li&gt;source boundaries&lt;/li&gt;
&lt;li&gt;canonical versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retrieval asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I find something relevant?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Representation asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is the knowledge organized in a way that makes sense?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These are not the same problem. A RAG system with poor representation becomes a fast interface to a messy archive. It can retrieve fragments, but it cannot fix broken structure. It can quote documents, but it cannot decide which one is canonical. It can assemble context, but it cannot guarantee that the underlying knowledge is coherent.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://www.glukhov.org/knowledge-management/knowledge-systems-architectures/compiled-knowledge/what-is-llm-wiki/" rel="noopener noreferrer"&gt;LLM Wiki&lt;/a&gt; style systems are interesting: they shift effort from query time to ingest time. Instead of only retrieving chunks when a user asks a question, they attempt to pre-structure knowledge into pages, concepts, summaries, and links. That does not make RAG obsolete — it means retrieval and representation are different layers, and good knowledge systems need both.&lt;/p&gt;
&lt;h2&gt;
  
  
  The core distinction
&lt;/h2&gt;

&lt;p&gt;Retrieval is about access; representation is about meaning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;How do I find the right information?&lt;/td&gt;
&lt;td&gt;search, embeddings, BM25, reranking, vector stores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Representation&lt;/td&gt;
&lt;td&gt;How is knowledge structured?&lt;/td&gt;
&lt;td&gt;notes, wikis, graphs, schemas, ontologies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;How do I use the knowledge?&lt;/td&gt;
&lt;td&gt;synthesis, comparison, inference, decision making&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A weak system often jumps straight to retrieval; a strong system first asks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are the core concepts?&lt;/li&gt;
&lt;li&gt;What is the canonical source?&lt;/li&gt;
&lt;li&gt;What relationships matter?&lt;/li&gt;
&lt;li&gt;What changes over time?&lt;/li&gt;
&lt;li&gt;What should be retrieved?&lt;/li&gt;
&lt;li&gt;What should already be represented?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the difference between search over documents and an actual knowledge system.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why retrieval became dominant
&lt;/h2&gt;

&lt;p&gt;Retrieval became dominant because it maps well to the modern AI stack. A typical &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; pipeline looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load documents&lt;/li&gt;
&lt;li&gt;Split them into chunks&lt;/li&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Store vectors&lt;/li&gt;
&lt;li&gt;Retrieve relevant chunks&lt;/li&gt;
&lt;li&gt;Optionally rerank them&lt;/li&gt;
&lt;li&gt;Put them into an LLM prompt&lt;/li&gt;
&lt;li&gt;Generate an answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pipeline is practical: it is relatively easy to build, works with messy documents, scales to large corpora, avoids retraining models, and gives LLMs access to current information. That is why RAG became the default pattern for "AI over documents."&lt;/p&gt;

&lt;p&gt;But there is a trap:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG improves access to knowledge. It does not automatically improve the knowledge.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your content is duplicated, outdated, contradictory, badly chunked, or poorly named, retrieval will surface those problems — often with confidence.&lt;/p&gt;
&lt;h2&gt;
  
  
  What representation means
&lt;/h2&gt;

&lt;p&gt;Representation is the way knowledge is shaped before retrieval happens. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this knowledge stored as documents, notes, entities, or facts?&lt;/li&gt;
&lt;li&gt;Are relationships explicit or implicit?&lt;/li&gt;
&lt;li&gt;Are there canonical pages?&lt;/li&gt;
&lt;li&gt;Are there summaries?&lt;/li&gt;
&lt;li&gt;Are concepts linked?&lt;/li&gt;
&lt;li&gt;Is the system organized by topic, workflow, time, or ownership?&lt;/li&gt;
&lt;li&gt;Can a human maintain it?&lt;/li&gt;
&lt;li&gt;Can a machine reason over it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Representation is not decoration — it determines what kind of operations are possible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Forms of representation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Documents
&lt;/h3&gt;

&lt;p&gt;Documents are the most common representation. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;articles&lt;/li&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;manuals&lt;/li&gt;
&lt;li&gt;reports&lt;/li&gt;
&lt;li&gt;README files&lt;/li&gt;
&lt;li&gt;support pages&lt;/li&gt;
&lt;li&gt;blog posts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Documents are easy for humans to write, but they are often hard for machines to use because they mix facts, narrative, context, examples, opinions, outdated sections, and repeated explanations into the same container. Documents are good containers, but they are not always good knowledge structures.&lt;/p&gt;
&lt;h3&gt;
  
  
  Notes
&lt;/h3&gt;

&lt;p&gt;Notes are more flexible than documents. They can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;atomic&lt;/li&gt;
&lt;li&gt;linked&lt;/li&gt;
&lt;li&gt;private&lt;/li&gt;
&lt;li&gt;unfinished&lt;/li&gt;
&lt;li&gt;concept focused&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A note system, such as a &lt;a href="https://www.glukhov.org/knowledge-management/foundations/pkm-vs-rag-vs-wiki-vs-memory-systems/" rel="noopener noreferrer"&gt;PKM&lt;/a&gt; or &lt;a href="https://www.glukhov.org/knowledge-management/foundations/second-brain/" rel="noopener noreferrer"&gt;second brain&lt;/a&gt;, can represent evolving knowledge better than a polished document repository. Good notes capture thinking in progress; bad notes become an unsearchable junk drawer.&lt;/p&gt;
&lt;h3&gt;
  
  
  Wikis
&lt;/h3&gt;

&lt;p&gt;Wikis represent knowledge as maintained pages. A good wiki has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable pages&lt;/li&gt;
&lt;li&gt;clear topics&lt;/li&gt;
&lt;li&gt;internal links&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;li&gt;canonical answers&lt;/li&gt;
&lt;li&gt;update patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A wiki is stronger than a loose document dump because it gives knowledge a home. "Deployment checklist" lives in one place. "Incident response" lives in one place. "RAG architecture" lives in one place. That matters because retrieval works better when knowledge has a stable structure.&lt;/p&gt;
&lt;h3&gt;
  
  
  Knowledge graphs
&lt;/h3&gt;

&lt;p&gt;Knowledge graphs represent knowledge as entities and relationships. Instead of storing only text, they model things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Person works on Project&lt;/li&gt;
&lt;li&gt;Model supports ContextLength&lt;/li&gt;
&lt;li&gt;Page depends on Concept&lt;/li&gt;
&lt;li&gt;Service connects to Database&lt;/li&gt;
&lt;li&gt;Tool implements Protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Graphs are powerful because relationships become explicit, which helps with traversal, dependency analysis, entity resolution, lineage, reasoning, and recommendations. But graphs are expensive to maintain and they are not magic — a bad graph is just structured confusion.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schemas and ontologies
&lt;/h3&gt;

&lt;p&gt;Schemas define expected structure; ontologies go further and define types, relations, and constraints. They answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What kinds of things exist?&lt;/li&gt;
&lt;li&gt;What properties do they have?&lt;/li&gt;
&lt;li&gt;How can they relate?&lt;/li&gt;
&lt;li&gt;What rules apply?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is useful when correctness matters, such as in medical knowledge, legal knowledge, enterprise data catalogs, product taxonomies, and compliance systems. The tradeoff is rigidity: the more formal the representation, the more expensive it is to evolve.&lt;/p&gt;
&lt;h3&gt;
  
  
  LLM-generated representations
&lt;/h3&gt;

&lt;p&gt;Modern systems increasingly use LLMs to create representations. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;extracted entities&lt;/li&gt;
&lt;li&gt;topic pages&lt;/li&gt;
&lt;li&gt;concept maps&lt;/li&gt;
&lt;li&gt;synthetic FAQs&lt;/li&gt;
&lt;li&gt;document outlines&lt;/li&gt;
&lt;li&gt;cross-links&lt;/li&gt;
&lt;li&gt;glossary entries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where LLM Wiki style systems sit. They use the model not only to answer queries but to pre-process and structure knowledge before the query happens. RAG says "retrieve relevant chunks at query time"; LLM Wiki says "compile useful knowledge structures at ingest time." Both patterns can coexist in the same architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What retrieval means
&lt;/h2&gt;

&lt;p&gt;Retrieval is the process of finding relevant information. Common retrieval methods include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keyword search&lt;/li&gt;
&lt;li&gt;full text search&lt;/li&gt;
&lt;li&gt;vector search&lt;/li&gt;
&lt;li&gt;hybrid search&lt;/li&gt;
&lt;li&gt;metadata filtering&lt;/li&gt;
&lt;li&gt;graph traversal&lt;/li&gt;
&lt;li&gt;reranking&lt;/li&gt;
&lt;li&gt;query rewriting&lt;/li&gt;
&lt;li&gt;agentic search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retrieval is not one thing — it is a layered stack of complementary methods.&lt;/p&gt;
&lt;h3&gt;
  
  
  Keyword search
&lt;/h3&gt;

&lt;p&gt;Keyword search matches terms and is still useful because it is predictable, debuggable, fast, and good for exact terms, IDs, error messages, names, and code. Its weakness is semantic mismatch: if the user searches "how to stop repeated answers" but the document says "presence penalty", keyword search may miss the best result.&lt;/p&gt;
&lt;h3&gt;
  
  
  Vector search
&lt;/h3&gt;

&lt;p&gt;Vector search retrieves by semantic similarity. It is useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wording differs&lt;/li&gt;
&lt;li&gt;concepts are fuzzy&lt;/li&gt;
&lt;li&gt;users ask natural language questions&lt;/li&gt;
&lt;li&gt;documents use inconsistent terminology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its weakness is precision — vector search can retrieve things that feel related but are not actually correct, which is especially risky in technical systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hybrid search
&lt;/h3&gt;

&lt;p&gt;Hybrid search combines keyword and vector retrieval, which is often better than either alone. Keyword search catches exact matches; vector search catches conceptual matches. For technical knowledge bases, hybrid retrieval is usually a strong default.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reranking
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/rag/reranking/reranking-with-embedding-models/" rel="noopener noreferrer"&gt;Reranking&lt;/a&gt; takes an initial set of retrieved results and reorders them using a stronger model. This improves quality because the first retrieval step is often broad. A typical pattern retrieves 50 chunks, reranks to the top 5 or 10, then passes only the best context to the LLM. Reranking is one of the most practical ways to improve RAG quality.&lt;/p&gt;
&lt;h3&gt;
  
  
  Agentic retrieval
&lt;/h3&gt;

&lt;p&gt;Agentic retrieval turns search into a process. Instead of one query, an agent may:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask an initial question&lt;/li&gt;
&lt;li&gt;Search&lt;/li&gt;
&lt;li&gt;Inspect results&lt;/li&gt;
&lt;li&gt;Reformulate the query&lt;/li&gt;
&lt;li&gt;Search again&lt;/li&gt;
&lt;li&gt;Compare sources&lt;/li&gt;
&lt;li&gt;Synthesize an answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is closer to research than search. It is useful for complex questions, but it is slower and harder to control.&lt;/p&gt;
&lt;h2&gt;
  
  
  Retrieval without representation is fragile
&lt;/h2&gt;

&lt;p&gt;A retrieval system can only retrieve what exists. It cannot reliably fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unclear concepts&lt;/li&gt;
&lt;li&gt;duplicate pages&lt;/li&gt;
&lt;li&gt;inconsistent terminology&lt;/li&gt;
&lt;li&gt;stale documentation&lt;/li&gt;
&lt;li&gt;missing source ownership&lt;/li&gt;
&lt;li&gt;contradictory statements&lt;/li&gt;
&lt;li&gt;weak internal linking&lt;/li&gt;
&lt;li&gt;bad document boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the most common mistake in RAG projects: teams build a vector database and expect it to become a knowledge system. A vector database is not a knowledge architecture — it is an access layer.&lt;/p&gt;
&lt;h2&gt;
  
  
  Representation without retrieval is isolated
&lt;/h2&gt;

&lt;p&gt;The opposite failure also exists. You can have a beautifully structured knowledge base that nobody can find. This happens with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;over-designed wikis&lt;/li&gt;
&lt;li&gt;deep folder trees&lt;/li&gt;
&lt;li&gt;rigid taxonomies&lt;/li&gt;
&lt;li&gt;poorly indexed documentation&lt;/li&gt;
&lt;li&gt;private note systems with no discovery&lt;/li&gt;
&lt;li&gt;graphs without usable interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Representation gives knowledge structure; retrieval gives knowledge reach. You need both.&lt;/p&gt;
&lt;h2&gt;
  
  
  The tradeoff map
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Speed vs coherence
&lt;/h3&gt;

&lt;p&gt;Retrieval is fast to build and representation takes longer. If you need a prototype, retrieval wins; if you need long-term trust, representation matters more.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Better starting point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast Q&amp;amp;A over many docs&lt;/td&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable technical knowledge&lt;/td&gt;
&lt;td&gt;Representation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exploratory research&lt;/td&gt;
&lt;td&gt;PKM plus retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise assistant&lt;/td&gt;
&lt;td&gt;Structured corpus plus RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent memory&lt;/td&gt;
&lt;td&gt;Representation plus selective retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A pure RAG prototype can be built quickly, but a reliable knowledge system takes curation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Flexibility vs consistency
&lt;/h3&gt;

&lt;p&gt;Loose documents are flexible; structured knowledge is consistent. Flexibility helps when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the domain changes quickly&lt;/li&gt;
&lt;li&gt;knowledge is incomplete&lt;/li&gt;
&lt;li&gt;users are exploring&lt;/li&gt;
&lt;li&gt;the system is personal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consistency helps when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple people rely on it&lt;/li&gt;
&lt;li&gt;answers must be trusted&lt;/li&gt;
&lt;li&gt;workflows depend on it&lt;/li&gt;
&lt;li&gt;AI systems consume it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more people or agents depend on knowledge, the more representation matters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Recall vs precision
&lt;/h3&gt;

&lt;p&gt;Retrieval systems often optimize recall first, which means finding anything that might be relevant. But good answers need precision, which means finding the best evidence rather than merely related evidence. Representation improves precision by making concepts and boundaries clearer — a well-structured page is easier to retrieve accurately than a random paragraph buried inside a long document.&lt;/p&gt;
&lt;h3&gt;
  
  
  Ingest-time cost vs query-time cost
&lt;/h3&gt;

&lt;p&gt;RAG usually pushes work to query time. At query time, the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rewrites the query&lt;/li&gt;
&lt;li&gt;retrieves chunks&lt;/li&gt;
&lt;li&gt;reranks results&lt;/li&gt;
&lt;li&gt;assembles context&lt;/li&gt;
&lt;li&gt;asks the model to reason over fragments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM Wiki style systems push more work to ingest time. At ingest time, the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads sources&lt;/li&gt;
&lt;li&gt;extracts concepts&lt;/li&gt;
&lt;li&gt;writes summaries&lt;/li&gt;
&lt;li&gt;creates pages&lt;/li&gt;
&lt;li&gt;links related ideas&lt;/li&gt;
&lt;li&gt;maintains structure&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Expensive step&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Query time&lt;/td&gt;
&lt;td&gt;Flexible retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Wiki&lt;/td&gt;
&lt;td&gt;Ingest time&lt;/td&gt;
&lt;td&gt;Pre-compiled structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge graph&lt;/td&gt;
&lt;td&gt;Modeling time&lt;/td&gt;
&lt;td&gt;Explicit relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wiki&lt;/td&gt;
&lt;td&gt;Maintenance time&lt;/td&gt;
&lt;td&gt;Canonical knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these is universally better — they optimize different costs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why LLM Wiki exists
&lt;/h2&gt;

&lt;p&gt;LLM Wiki exists because retrieval alone often repeats work. In a normal RAG system, every query may force the model to interpret raw fragments again:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve chunks about a topic&lt;/li&gt;
&lt;li&gt;Ask the LLM to infer the concept&lt;/li&gt;
&lt;li&gt;Generate an answer&lt;/li&gt;
&lt;li&gt;Forget the synthesis&lt;/li&gt;
&lt;li&gt;Repeat next time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LLM Wiki says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Stop re-deriving the same synthesis. Compile it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of only storing raw documents, it creates structured pages that summarize and connect knowledge, which can improve coherence, reuse, token efficiency, human readability, and long-term maintenance. But it has a cost: the system must maintain the wiki, and if the wiki is wrong, stale, or hallucinated, the structure becomes dangerous.&lt;/p&gt;
&lt;h2&gt;
  
  
  RAG hallucination vs bad representation
&lt;/h2&gt;

&lt;p&gt;People often blame the LLM when a RAG system gives a bad answer, and sometimes that is correct. But many failures are actually retrieval or representation failures.&lt;/p&gt;
&lt;h3&gt;
  
  
  Failure mode 1. Correct document, wrong chunk
&lt;/h3&gt;

&lt;p&gt;The answer exists, but &lt;a href="https://www.glukhov.org/rag/retrieval/chunking-strategies-in-rag/" rel="noopener noreferrer"&gt;chunking&lt;/a&gt; splits it badly. The model receives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;half of a paragraph&lt;/li&gt;
&lt;li&gt;missing context&lt;/li&gt;
&lt;li&gt;a table without explanation&lt;/li&gt;
&lt;li&gt;a definition without constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM fills those gaps, which looks like hallucination, but the deeper problem is broken representation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Failure mode 2. Related chunk, wrong answer
&lt;/h3&gt;

&lt;p&gt;Vector search retrieves something semantically similar but operationally wrong. The query asks about production deployment; the retrieved chunk discusses local development. The terms overlap but the meaning differs, so the model answers with local setup instructions for a production problem. This is retrieval imprecision.&lt;/p&gt;
&lt;h3&gt;
  
  
  Failure mode 3. Conflicting sources
&lt;/h3&gt;

&lt;p&gt;Two documents disagree — one old, one new. The retrieval system returns both, and the LLM merges them into a confident but invalid answer. This is not just a retrieval problem but a representation problem, because the knowledge base lacks canonical state.&lt;/p&gt;
&lt;h3&gt;
  
  
  Failure mode 4. No concept model
&lt;/h3&gt;

&lt;p&gt;The system has many documents but no model of the domain. It does not know that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"agent memory" differs from "RAG"&lt;/li&gt;
&lt;li&gt;"wiki" differs from "PKM"&lt;/li&gt;
&lt;li&gt;"embedding search" differs from "full text search"&lt;/li&gt;
&lt;li&gt;"deployment" differs from "hosting"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without conceptual representation, retrieval becomes fuzzy matching.&lt;/p&gt;
&lt;h3&gt;
  
  
  Failure mode 5. Generated structure becomes fake authority
&lt;/h3&gt;

&lt;p&gt;LLM Wiki systems have their own failure mode. If an LLM generates a clean page from bad sources, the result can look more authoritative than the original material. This is dangerous: a polished hallucination is worse than a messy source document. Any generated representation needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source links&lt;/li&gt;
&lt;li&gt;review&lt;/li&gt;
&lt;li&gt;update rules&lt;/li&gt;
&lt;li&gt;confidence markers&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Design implications
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Optimize retrieval when the corpus is large and dynamic
&lt;/h3&gt;

&lt;p&gt;Retrieval should be the priority when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the corpus is huge&lt;/li&gt;
&lt;li&gt;documents change frequently&lt;/li&gt;
&lt;li&gt;users ask many unpredictable questions&lt;/li&gt;
&lt;li&gt;you need broad coverage&lt;/li&gt;
&lt;li&gt;perfect structure is unrealistic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples: support knowledge bases, enterprise document search, research assistants, internal chat over many files, legal discovery, and customer service bots. In these cases, invest in strong retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hybrid search&lt;/li&gt;
&lt;li&gt;metadata filters&lt;/li&gt;
&lt;li&gt;reranking&lt;/li&gt;
&lt;li&gt;query rewriting&lt;/li&gt;
&lt;li&gt;source citation&lt;/li&gt;
&lt;li&gt;evaluation sets&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Optimize representation when coherence matters
&lt;/h3&gt;

&lt;p&gt;Representation should be the priority when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;knowledge must be trusted&lt;/li&gt;
&lt;li&gt;answers must be consistent&lt;/li&gt;
&lt;li&gt;concepts are reused often&lt;/li&gt;
&lt;li&gt;the domain has clear structure&lt;/li&gt;
&lt;li&gt;multiple systems depend on it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples: architecture knowledge, product documentation, compliance rules, API references, operational runbooks, curated research collections, and technical blog clusters. In these cases, invest in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;canonical pages&lt;/li&gt;
&lt;li&gt;glossary terms&lt;/li&gt;
&lt;li&gt;diagrams&lt;/li&gt;
&lt;li&gt;internal links&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;li&gt;versioning&lt;/li&gt;
&lt;li&gt;review cadence&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Optimize both when AI systems depend on knowledge
&lt;/h3&gt;

&lt;p&gt;If an AI agent depends on the knowledge, retrieval alone is usually not enough. Agents need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable context&lt;/li&gt;
&lt;li&gt;clear task rules&lt;/li&gt;
&lt;li&gt;durable memory&lt;/li&gt;
&lt;li&gt;structured references&lt;/li&gt;
&lt;li&gt;source boundaries&lt;/li&gt;
&lt;li&gt;update behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For agentic systems, representation becomes part of system design. A coding agent does not only need to retrieve "some docs" — it needs to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;project conventions&lt;/li&gt;
&lt;li&gt;architecture decisions&lt;/li&gt;
&lt;li&gt;command patterns&lt;/li&gt;
&lt;li&gt;forbidden dependencies&lt;/li&gt;
&lt;li&gt;testing workflow&lt;/li&gt;
&lt;li&gt;deployment rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of that belongs in RAG, some belongs in memory, and some belongs in structured project documentation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Practical decision framework
&lt;/h2&gt;
&lt;h3&gt;
  
  
  If the problem is finding information
&lt;/h3&gt;

&lt;p&gt;Optimize retrieval. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Find relevant pages."&lt;/li&gt;
&lt;li&gt;"Answer questions over documents."&lt;/li&gt;
&lt;li&gt;"Search across many PDFs."&lt;/li&gt;
&lt;li&gt;"Locate similar support tickets."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full text search&lt;/li&gt;
&lt;li&gt;vector search&lt;/li&gt;
&lt;li&gt;hybrid retrieval&lt;/li&gt;
&lt;li&gt;reranking&lt;/li&gt;
&lt;li&gt;metadata filtering&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  If the problem is making knowledge coherent
&lt;/h3&gt;

&lt;p&gt;Optimize representation. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Create a canonical explanation."&lt;/li&gt;
&lt;li&gt;"Resolve duplicate pages."&lt;/li&gt;
&lt;li&gt;"Define the domain model."&lt;/li&gt;
&lt;li&gt;"Build a stable knowledge base."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wiki pages&lt;/li&gt;
&lt;li&gt;concept maps&lt;/li&gt;
&lt;li&gt;taxonomies&lt;/li&gt;
&lt;li&gt;knowledge graphs&lt;/li&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;schemas&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  If the problem is repeated synthesis
&lt;/h3&gt;

&lt;p&gt;Use compiled representation. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"We answer the same conceptual questions repeatedly."&lt;/li&gt;
&lt;li&gt;"The system keeps re-summarizing the same sources."&lt;/li&gt;
&lt;li&gt;"We need a stable synthesis layer."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM Wiki&lt;/li&gt;
&lt;li&gt;curated summaries&lt;/li&gt;
&lt;li&gt;topic pages&lt;/li&gt;
&lt;li&gt;human-reviewed generated pages&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  If the problem is adaptive continuity
&lt;/h3&gt;

&lt;p&gt;Use memory. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The agent should remember user preferences."&lt;/li&gt;
&lt;li&gt;"The coding agent should remember project conventions."&lt;/li&gt;
&lt;li&gt;"The assistant should continue work across sessions."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agent memory&lt;/li&gt;
&lt;li&gt;preference stores&lt;/li&gt;
&lt;li&gt;episodic memory&lt;/li&gt;
&lt;li&gt;semantic memory&lt;/li&gt;
&lt;li&gt;project memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How this applies to a technical blog
&lt;/h2&gt;

&lt;p&gt;A technical blog can be more than a sequence of posts — it can become a represented knowledge system. Articles are documents, categories are weak taxonomy, internal links are graph edges, pillar pages are canonical summaries, series pages are curated pathways, and search is retrieval. If you only publish isolated posts, retrieval has to work harder. If you build strong representation, retrieval becomes easier.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear cluster boundaries&lt;/li&gt;
&lt;li&gt;stable slugs&lt;/li&gt;
&lt;li&gt;canonical pages&lt;/li&gt;
&lt;li&gt;comparison pages&lt;/li&gt;
&lt;li&gt;glossary-style explainers&lt;/li&gt;
&lt;li&gt;internal links&lt;/li&gt;
&lt;li&gt;structured metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why site architecture matters — not just for SEO, but because it is knowledge representation. The &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;Knowledge Management&lt;/a&gt; cluster on this site is itself an example of representation-first publishing.&lt;/p&gt;
&lt;h2&gt;
  
  
  How this applies to RAG
&lt;/h2&gt;

&lt;p&gt;RAG quality depends heavily on representation. A well-structured source corpus improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chunk quality&lt;/li&gt;
&lt;li&gt;retrieval accuracy&lt;/li&gt;
&lt;li&gt;citation quality&lt;/li&gt;
&lt;li&gt;answer consistency&lt;/li&gt;
&lt;li&gt;evaluation clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before building a complex RAG pipeline, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Are the source documents current?&lt;/li&gt;
&lt;li&gt;Are duplicates removed?&lt;/li&gt;
&lt;li&gt;Are important concepts clearly named?&lt;/li&gt;
&lt;li&gt;Are pages scoped correctly?&lt;/li&gt;
&lt;li&gt;Are tables and code blocks retrievable?&lt;/li&gt;
&lt;li&gt;Are canonical answers obvious?&lt;/li&gt;
&lt;li&gt;Are document boundaries meaningful?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer is no, better embeddings will only help so much.&lt;/p&gt;
&lt;h2&gt;
  
  
  How this applies to LLM Wiki
&lt;/h2&gt;

&lt;p&gt;LLM Wiki is a representation-first pattern. It is useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the corpus is small or medium sized&lt;/li&gt;
&lt;li&gt;knowledge is stable enough to summarize&lt;/li&gt;
&lt;li&gt;repeated synthesis is expensive&lt;/li&gt;
&lt;li&gt;humans benefit from readable pages&lt;/li&gt;
&lt;li&gt;you want structure before retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is less useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the corpus is massive&lt;/li&gt;
&lt;li&gt;content changes constantly&lt;/li&gt;
&lt;li&gt;freshness is more important than coherence&lt;/li&gt;
&lt;li&gt;governance is weak&lt;/li&gt;
&lt;li&gt;generated summaries cannot be reviewed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM Wiki is not a replacement for RAG but a different layer, and a strong system can use both:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM Wiki creates structured summaries.&lt;/li&gt;
&lt;li&gt;RAG retrieves from raw sources and wiki pages.&lt;/li&gt;
&lt;li&gt;Human review keeps the representation trustworthy.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Suggested architecture patterns
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Pattern 1. Retrieval first
&lt;/h3&gt;

&lt;p&gt;Use when speed matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;documents
  -&amp;gt; chunks
  -&amp;gt; embeddings
  -&amp;gt; retrieval
  -&amp;gt; LLM answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prototypes&lt;/li&gt;
&lt;li&gt;broad search&lt;/li&gt;
&lt;li&gt;large corpora&lt;/li&gt;
&lt;li&gt;early experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weakness: coherence depends on source quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2. Representation first
&lt;/h3&gt;

&lt;p&gt;Use when trust matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sources
  -&amp;gt; curated pages
  -&amp;gt; internal links
  -&amp;gt; maintained knowledge base
  -&amp;gt; search or RAG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;documentation&lt;/li&gt;
&lt;li&gt;technical knowledge&lt;/li&gt;
&lt;li&gt;long-term content&lt;/li&gt;
&lt;li&gt;team knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weakness: requires maintenance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3. Compiled knowledge
&lt;/h3&gt;

&lt;p&gt;Use when repeated synthesis matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw sources
  -&amp;gt; LLM extraction
  -&amp;gt; generated summaries
  -&amp;gt; topic pages
  -&amp;gt; reviewed knowledge base
  -&amp;gt; retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM Wiki systems&lt;/li&gt;
&lt;li&gt;research collections&lt;/li&gt;
&lt;li&gt;personal knowledge bases&lt;/li&gt;
&lt;li&gt;stable domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weakness: generated structure must be audited.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4. Hybrid knowledge architecture
&lt;/h3&gt;

&lt;p&gt;Use when building serious systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw documents
  -&amp;gt; structured knowledge layer
  -&amp;gt; search index
  -&amp;gt; retrieval and reranking
  -&amp;gt; AI answer
  -&amp;gt; feedback and maintenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;production RAG&lt;/li&gt;
&lt;li&gt;internal knowledge systems&lt;/li&gt;
&lt;li&gt;AI assistants&lt;/li&gt;
&lt;li&gt;technical publishing systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weakness: more moving parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation questions
&lt;/h2&gt;

&lt;p&gt;To evaluate retrieval, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the system find the right source?&lt;/li&gt;
&lt;li&gt;Did it rank the right source highly?&lt;/li&gt;
&lt;li&gt;Did it retrieve enough context?&lt;/li&gt;
&lt;li&gt;Did it avoid irrelevant context?&lt;/li&gt;
&lt;li&gt;Did the answer cite the correct source?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To evaluate representation, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the knowledge structured clearly?&lt;/li&gt;
&lt;li&gt;Is there a canonical page?&lt;/li&gt;
&lt;li&gt;Are concepts named consistently?&lt;/li&gt;
&lt;li&gt;Are relationships explicit?&lt;/li&gt;
&lt;li&gt;Is the content maintained?&lt;/li&gt;
&lt;li&gt;Can humans and machines both use it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not evaluate a knowledge system only by answer quality — a good answer can hide a bad structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The opinionated rule
&lt;/h2&gt;

&lt;p&gt;If your system fails occasionally, improve retrieval. If it fails repeatedly in the same conceptual area, improve representation.&lt;/p&gt;

&lt;p&gt;Bad retrieval misses the right information. Bad representation means the right information does not really exist in a usable shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Retrieval and representation solve different problems: retrieval gives access, representation gives structure. RAG is powerful because it makes external knowledge available to LLMs at query time, but RAG does not automatically make knowledge coherent, canonical, or maintained. That is why wikis, &lt;a href="https://www.glukhov.org/knowledge-management/foundations/pkm-vs-rag-vs-wiki-vs-memory-systems/" rel="noopener noreferrer"&gt;PKM systems&lt;/a&gt;, knowledge graphs, and LLM Wiki style systems still matter.&lt;/p&gt;

&lt;p&gt;The future is not retrieval vs representation but layered knowledge systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;representation for structure&lt;/li&gt;
&lt;li&gt;retrieval for access&lt;/li&gt;
&lt;li&gt;memory for continuity&lt;/li&gt;
&lt;li&gt;reasoning for synthesis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are building a serious knowledge system, do not start with the vector database. Start with the shape of the knowledge, then decide how it should be retrieved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources and further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/use-cases/retrieval-augmented-generation" rel="noopener noreferrer"&gt;Google Cloud — Retrieval Augmented Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;AWS — What Is Retrieval Augmented Generation?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/retrieval-augmented-generation" rel="noopener noreferrer"&gt;IBM — Retrieval Augmented Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/knowledge-management" rel="noopener noreferrer"&gt;IBM — Knowledge Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://atlan.com/know/llm-wiki-vs-rag-knowledge-base/" rel="noopener noreferrer"&gt;Atlan — LLM Wiki vs RAG Knowledge Base&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vishalmysore/rag-vs-agent-memory-vs-llm-wiki-a-practical-comparison-1oo6"&gt;Dev.to — RAG vs Agent Memory vs LLM Wiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.starmorph.com/blog/karpathy-llm-wiki-knowledge-base-guide" rel="noopener noreferrer"&gt;Starmorph — Karpathy LLM Wiki Knowledge Base Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>knowledgemanagement</category>
      <category>wiki</category>
      <category>architecture</category>
      <category>rag</category>
    </item>
    <item>
      <title>PKM vs RAG vs Wiki vs Memory Systems Explained Clearly</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 17 May 2026 08:50:06 +0000</pubDate>
      <link>https://forem.com/rosgluk/pkm-vs-rag-vs-wiki-vs-memory-systems-explained-clearly-4b1c</link>
      <guid>https://forem.com/rosgluk/pkm-vs-rag-vs-wiki-vs-memory-systems-explained-clearly-4b1c</guid>
      <description>&lt;p&gt;PKM, RAG, wikis, and AI memory systems are often discussed as if they solve the same problem.&lt;br&gt;
They do not.&lt;br&gt;
They all deal with knowledge, but they operate at different layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PKM helps humans think.&lt;/li&gt;
&lt;li&gt;Wikis help groups preserve shared knowledge.&lt;/li&gt;
&lt;li&gt;RAG helps machines retrieve external knowledge.&lt;/li&gt;
&lt;li&gt;Memory systems help AI agents persist context over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Confusing these systems leads to bad architecture.&lt;/p&gt;

&lt;p&gt;You get wikis full of personal scratch notes, RAG systems without a source of truth, memory layers pretending to be databases, and PKM tools overloaded with automation they were never designed to handle.&lt;/p&gt;

&lt;p&gt;A better model is to see them as different parts of a knowledge systems spectrum.&lt;/p&gt;

&lt;p&gt;This article compares PKM, RAG, wikis, and AI memory systems by structure, retrieval, ownership, evolution, and real-world use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Primary user&lt;/th&gt;
&lt;th&gt;Main purpose&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PKM&lt;/td&gt;
&lt;td&gt;Individual&lt;/td&gt;
&lt;td&gt;Develop personal knowledge&lt;/td&gt;
&lt;td&gt;Thinking, learning, synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wiki&lt;/td&gt;
&lt;td&gt;Team or public group&lt;/td&gt;
&lt;td&gt;Maintain shared knowledge&lt;/td&gt;
&lt;td&gt;Documentation, policies, reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Machine system&lt;/td&gt;
&lt;td&gt;Retrieve context for generation&lt;/td&gt;
&lt;td&gt;AI answers over external data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI memory&lt;/td&gt;
&lt;td&gt;AI agent&lt;/td&gt;
&lt;td&gt;Persist context over time&lt;/td&gt;
&lt;td&gt;Long-running agents and personalization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most important distinction is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PKM and wikis structure knowledge. RAG retrieves knowledge. Memory systems evolve agent context.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the core mental model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why these systems are confused
&lt;/h2&gt;

&lt;p&gt;They overlap in visible behavior.&lt;/p&gt;

&lt;p&gt;All of them can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store notes&lt;/li&gt;
&lt;li&gt;retrieve information&lt;/li&gt;
&lt;li&gt;answer questions&lt;/li&gt;
&lt;li&gt;organize references&lt;/li&gt;
&lt;li&gt;connect ideas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they differ in intent.&lt;/p&gt;

&lt;p&gt;A PKM system is not just a private wiki.&lt;br&gt;
A wiki is not just a RAG database.&lt;br&gt;
A RAG pipeline is not an AI memory.&lt;br&gt;
An AI memory system is not a replacement for structured documentation.&lt;/p&gt;

&lt;p&gt;The confusion comes from treating "knowledge" as one thing.&lt;/p&gt;

&lt;p&gt;In practice, knowledge has multiple layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture&lt;/li&gt;
&lt;li&gt;Structure&lt;/li&gt;
&lt;li&gt;Retrieval&lt;/li&gt;
&lt;li&gt;Interpretation&lt;/li&gt;
&lt;li&gt;Reuse&lt;/li&gt;
&lt;li&gt;Evolution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Different systems optimize different stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four paradigms
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. PKM
&lt;/h2&gt;

&lt;p&gt;PKM stands for &lt;a href="https://www.glukhov.org/knowledge-management/foundations/personal-knowledge-management/" rel="noopener noreferrer"&gt;personal knowledge management&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is the practice of capturing, organizing, connecting, and using knowledge for personal work.&lt;/p&gt;

&lt;p&gt;Typical PKM systems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Logseq&lt;/li&gt;
&lt;li&gt;Notion&lt;/li&gt;
&lt;li&gt;plain Markdown folders&lt;/li&gt;
&lt;li&gt;Zettelkasten systems&lt;/li&gt;
&lt;li&gt;second brain systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PKM is human driven.&lt;/p&gt;

&lt;p&gt;The goal is not just storage. The goal is better thinking.&lt;/p&gt;

&lt;h3&gt;
  
  
  What PKM is good at
&lt;/h3&gt;

&lt;p&gt;PKM works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning a new domain&lt;/li&gt;
&lt;li&gt;developing original ideas&lt;/li&gt;
&lt;li&gt;connecting notes over time&lt;/li&gt;
&lt;li&gt;writing articles or books&lt;/li&gt;
&lt;li&gt;tracking personal research&lt;/li&gt;
&lt;li&gt;building a second brain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good PKM system is messy in a useful way. It supports unfinished thoughts, partial ideas, private context, and evolving concepts.&lt;/p&gt;

&lt;p&gt;This is why PKM is not the same as documentation.&lt;/p&gt;

&lt;p&gt;Documentation wants clarity.&lt;br&gt;
PKM tolerates ambiguity.&lt;/p&gt;

&lt;h3&gt;
  
  
  PKM failure modes
&lt;/h3&gt;

&lt;p&gt;PKM often fails when it becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a dumping ground&lt;/li&gt;
&lt;li&gt;a folder taxonomy project&lt;/li&gt;
&lt;li&gt;a productivity aesthetic&lt;/li&gt;
&lt;li&gt;a tool optimization hobby&lt;/li&gt;
&lt;li&gt;a private archive nobody uses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main risk is collection without synthesis.&lt;/p&gt;

&lt;p&gt;If you only save information, you do not have a knowledge system. You have a personal landfill.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opinionated take
&lt;/h3&gt;

&lt;p&gt;PKM should optimize for reuse, not capture.&lt;/p&gt;

&lt;p&gt;Capturing everything feels productive, but it creates debt. The real value appears when notes become connected, rewritten, compressed, and used in output.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Wiki
&lt;/h2&gt;

&lt;p&gt;A wiki is a structured knowledge base designed for shared reference.&lt;/p&gt;

&lt;p&gt;Typical wiki systems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/knowledge-management/self-hosted-knowledge/dokuwiki-selfhosted-wiki-alternatives/" rel="noopener noreferrer"&gt;DokuWiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;MediaWiki&lt;/li&gt;
&lt;li&gt;Confluence&lt;/li&gt;
&lt;li&gt;BookStack&lt;/li&gt;
&lt;li&gt;Git based documentation sites&lt;/li&gt;
&lt;li&gt;internal company knowledge bases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A wiki is usually more formal than PKM.&lt;/p&gt;

&lt;p&gt;It should answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What do we know, and where is the current version?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What wikis are good at
&lt;/h3&gt;

&lt;p&gt;Wikis work well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;team documentation&lt;/li&gt;
&lt;li&gt;operational runbooks&lt;/li&gt;
&lt;li&gt;product knowledge&lt;/li&gt;
&lt;li&gt;policy documents&lt;/li&gt;
&lt;li&gt;technical reference&lt;/li&gt;
&lt;li&gt;onboarding material&lt;/li&gt;
&lt;li&gt;stable domain knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A wiki is a social contract.&lt;/p&gt;

&lt;p&gt;It says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This page is the place where this knowledge lives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That makes ownership and maintenance critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wiki failure modes
&lt;/h3&gt;

&lt;p&gt;Wikis often fail because they become stale.&lt;/p&gt;

&lt;p&gt;Common problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no page owners&lt;/li&gt;
&lt;li&gt;outdated screenshots&lt;/li&gt;
&lt;li&gt;duplicate pages&lt;/li&gt;
&lt;li&gt;unclear canonical versions&lt;/li&gt;
&lt;li&gt;too much hierarchy&lt;/li&gt;
&lt;li&gt;no maintenance rhythm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A wiki with old information is worse than no wiki, because it creates false confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opinionated take
&lt;/h3&gt;

&lt;p&gt;A wiki should be boring.&lt;/p&gt;

&lt;p&gt;That is a compliment.&lt;/p&gt;

&lt;p&gt;A good wiki is not where ideas are born. It is where stable knowledge is preserved after it becomes useful to others.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. RAG
&lt;/h2&gt;

&lt;p&gt;RAG stands for &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;retrieval augmented generation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is an AI architecture where a system retrieves relevant external information before asking a language model to generate an answer.&lt;/p&gt;

&lt;p&gt;A basic RAG pipeline usually has:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Documents&lt;/li&gt;
&lt;li&gt;Chunking&lt;/li&gt;
&lt;li&gt;Embeddings or search index&lt;/li&gt;
&lt;li&gt;Retrieval&lt;/li&gt;
&lt;li&gt;Optional reranking&lt;/li&gt;
&lt;li&gt;Prompt assembly&lt;/li&gt;
&lt;li&gt;LLM generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RAG is machine driven.&lt;/p&gt;

&lt;p&gt;The goal is not to create knowledge. The goal is to give a model relevant context at query time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What RAG is good at
&lt;/h3&gt;

&lt;p&gt;RAG works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;question answering over documents&lt;/li&gt;
&lt;li&gt;internal search assistants&lt;/li&gt;
&lt;li&gt;support bots&lt;/li&gt;
&lt;li&gt;technical documentation assistants&lt;/li&gt;
&lt;li&gt;compliance lookup&lt;/li&gt;
&lt;li&gt;research over large corpora&lt;/li&gt;
&lt;li&gt;connecting LLMs to updated information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG is especially useful when the model cannot or should not memorize the information.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG failure modes
&lt;/h3&gt;

&lt;p&gt;RAG often fails when teams treat it as magic search.&lt;/p&gt;

&lt;p&gt;Common problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bad chunking&lt;/li&gt;
&lt;li&gt;weak retrieval&lt;/li&gt;
&lt;li&gt;noisy context&lt;/li&gt;
&lt;li&gt;missing metadata&lt;/li&gt;
&lt;li&gt;no source of truth&lt;/li&gt;
&lt;li&gt;stale documents&lt;/li&gt;
&lt;li&gt;weak evaluation&lt;/li&gt;
&lt;li&gt;no human feedback loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG does not fix bad knowledge management.&lt;/p&gt;

&lt;p&gt;If the underlying content is fragmented, outdated, or contradictory, the RAG system will surface that mess with confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opinionated take
&lt;/h3&gt;

&lt;p&gt;RAG is not a knowledge strategy.&lt;/p&gt;

&lt;p&gt;RAG is an access strategy.&lt;/p&gt;

&lt;p&gt;It helps machines access knowledge, but it does not decide what knowledge is valid, maintained, canonical, or useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. AI memory systems
&lt;/h2&gt;

&lt;p&gt;AI memory systems give agents persistent context beyond a single prompt or conversation.&lt;/p&gt;

&lt;p&gt;They may store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user preferences&lt;/li&gt;
&lt;li&gt;past decisions&lt;/li&gt;
&lt;li&gt;long-term facts&lt;/li&gt;
&lt;li&gt;task history&lt;/li&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;reflections&lt;/li&gt;
&lt;li&gt;extracted entities&lt;/li&gt;
&lt;li&gt;episodic memories&lt;/li&gt;
&lt;li&gt;semantic memories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples and related ideas include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MemGPT style memory tiers&lt;/li&gt;
&lt;li&gt;long-term agent memory&lt;/li&gt;
&lt;li&gt;episodic memory&lt;/li&gt;
&lt;li&gt;semantic memory&lt;/li&gt;
&lt;li&gt;vector memory&lt;/li&gt;
&lt;li&gt;profile memory&lt;/li&gt;
&lt;li&gt;tool state memory&lt;/li&gt;
&lt;li&gt;reflective agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI memory is agent driven.&lt;/p&gt;

&lt;p&gt;The goal is continuity.&lt;/p&gt;

&lt;h3&gt;
  
  
  What AI memory is good at
&lt;/h3&gt;

&lt;p&gt;AI memory systems work well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;personal assistants&lt;/li&gt;
&lt;li&gt;long-running coding agents&lt;/li&gt;
&lt;li&gt;research agents&lt;/li&gt;
&lt;li&gt;customer support agents&lt;/li&gt;
&lt;li&gt;tutoring systems&lt;/li&gt;
&lt;li&gt;workflow automation&lt;/li&gt;
&lt;li&gt;persistent companions&lt;/li&gt;
&lt;li&gt;multi-session task execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory matters when the system must behave as if it remembers.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI memory failure modes
&lt;/h3&gt;

&lt;p&gt;Memory systems are dangerous when unmanaged.&lt;/p&gt;

&lt;p&gt;Common problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remembering wrong facts&lt;/li&gt;
&lt;li&gt;storing too much&lt;/li&gt;
&lt;li&gt;privacy risk&lt;/li&gt;
&lt;li&gt;stale preferences&lt;/li&gt;
&lt;li&gt;poor memory ranking&lt;/li&gt;
&lt;li&gt;memory poisoning&lt;/li&gt;
&lt;li&gt;no forgetting mechanism&lt;/li&gt;
&lt;li&gt;confusing memory with truth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A memory system needs governance.&lt;/p&gt;

&lt;p&gt;It should answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should be remembered?&lt;/li&gt;
&lt;li&gt;Who approved it?&lt;/li&gt;
&lt;li&gt;How long should it live?&lt;/li&gt;
&lt;li&gt;When should it be forgotten?&lt;/li&gt;
&lt;li&gt;How is it corrected?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Opinionated take
&lt;/h3&gt;

&lt;p&gt;AI memory is not just long context.&lt;/p&gt;

&lt;p&gt;Long context lets a model see more at once.&lt;br&gt;
Memory decides what survives across time.&lt;/p&gt;

&lt;p&gt;Those are different problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core differences table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;PKM&lt;/th&gt;
&lt;th&gt;Wiki&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;AI memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary user&lt;/td&gt;
&lt;td&gt;Individual&lt;/td&gt;
&lt;td&gt;Team or public group&lt;/td&gt;
&lt;td&gt;AI system&lt;/td&gt;
&lt;td&gt;AI agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main function&lt;/td&gt;
&lt;td&gt;Thinking&lt;/td&gt;
&lt;td&gt;Shared reference&lt;/td&gt;
&lt;td&gt;Query time retrieval&lt;/td&gt;
&lt;td&gt;Persistent context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge state&lt;/td&gt;
&lt;td&gt;Evolving&lt;/td&gt;
&lt;td&gt;Stabilized&lt;/td&gt;
&lt;td&gt;Retrieved&lt;/td&gt;
&lt;td&gt;Adaptive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure&lt;/td&gt;
&lt;td&gt;Flexible&lt;/td&gt;
&lt;td&gt;Explicit&lt;/td&gt;
&lt;td&gt;Index based&lt;/td&gt;
&lt;td&gt;Learned or extracted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval style&lt;/td&gt;
&lt;td&gt;Human search and linking&lt;/td&gt;
&lt;td&gt;Navigation and search&lt;/td&gt;
&lt;td&gt;Semantic or hybrid retrieval&lt;/td&gt;
&lt;td&gt;Relevance plus salience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ownership&lt;/td&gt;
&lt;td&gt;Personal&lt;/td&gt;
&lt;td&gt;Page or team owners&lt;/td&gt;
&lt;td&gt;System maintainers&lt;/td&gt;
&lt;td&gt;Agent or user controlled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time horizon&lt;/td&gt;
&lt;td&gt;Long term personal&lt;/td&gt;
&lt;td&gt;Long term shared&lt;/td&gt;
&lt;td&gt;Query time&lt;/td&gt;
&lt;td&gt;Multi-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best output&lt;/td&gt;
&lt;td&gt;Insight&lt;/td&gt;
&lt;td&gt;Reliable reference&lt;/td&gt;
&lt;td&gt;Grounded answer&lt;/td&gt;
&lt;td&gt;Continuity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main risk&lt;/td&gt;
&lt;td&gt;Hoarding&lt;/td&gt;
&lt;td&gt;Staleness&lt;/td&gt;
&lt;td&gt;Bad retrieval&lt;/td&gt;
&lt;td&gt;Bad memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good metric&lt;/td&gt;
&lt;td&gt;Reuse in thinking&lt;/td&gt;
&lt;td&gt;Trust and freshness&lt;/td&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;Helpful continuity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Structure vs retrieval vs evolution
&lt;/h2&gt;

&lt;p&gt;The simplest way to understand these systems is to compare what they optimize. The architectural implications of that distinction are explored in depth in &lt;a href="https://www.glukhov.org/knowledge-management/foundations/retrieval-vs-representation/" rel="noopener noreferrer"&gt;Retrieval vs Representation in Knowledge Systems&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  PKM optimizes personal evolution
&lt;/h3&gt;

&lt;p&gt;PKM is about how your understanding changes.&lt;/p&gt;

&lt;p&gt;You collect material, rewrite it, connect it, and turn it into something useful.&lt;/p&gt;

&lt;p&gt;The output is often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a better mental model&lt;/li&gt;
&lt;li&gt;a written article&lt;/li&gt;
&lt;li&gt;a decision&lt;/li&gt;
&lt;li&gt;a research direction&lt;/li&gt;
&lt;li&gt;a reusable insight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PKM is not primarily about fast lookup. It is about long-term sensemaking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wikis optimize shared structure
&lt;/h3&gt;

&lt;p&gt;Wikis are about stable knowledge.&lt;/p&gt;

&lt;p&gt;They ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the current answer?&lt;/li&gt;
&lt;li&gt;Who owns it?&lt;/li&gt;
&lt;li&gt;Where should people go?&lt;/li&gt;
&lt;li&gt;What should be updated?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A wiki works when people trust it.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG optimizes machine retrieval
&lt;/h3&gt;

&lt;p&gt;RAG is about retrieving the right context at the right time.&lt;/p&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What documents are relevant?&lt;/li&gt;
&lt;li&gt;Which chunks should be used?&lt;/li&gt;
&lt;li&gt;How much context fits?&lt;/li&gt;
&lt;li&gt;What should the model cite?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG works when retrieval quality is high and the source corpus is trustworthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI memory optimizes continuity
&lt;/h3&gt;

&lt;p&gt;Memory systems are about persistence across sessions.&lt;/p&gt;

&lt;p&gt;They ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should the agent remember?&lt;/li&gt;
&lt;li&gt;What should be forgotten?&lt;/li&gt;
&lt;li&gt;Which memory matters now?&lt;/li&gt;
&lt;li&gt;How should memory change behavior?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory works when it improves future behavior without polluting the agent with stale or incorrect context.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use PKM
&lt;/h2&gt;

&lt;p&gt;Use PKM when the knowledge is personal, unfinished, or exploratory.&lt;/p&gt;

&lt;p&gt;Good scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning distributed systems&lt;/li&gt;
&lt;li&gt;planning articles&lt;/li&gt;
&lt;li&gt;researching LLM architecture&lt;/li&gt;
&lt;li&gt;collecting book notes&lt;/li&gt;
&lt;li&gt;building a &lt;a href="https://www.glukhov.org/knowledge-management/foundations/second-brain/" rel="noopener noreferrer"&gt;second brain&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;tracking personal experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use PKM when you are still thinking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;You are learning about RAG evaluation.&lt;/p&gt;

&lt;p&gt;You collect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;articles&lt;/li&gt;
&lt;li&gt;benchmark notes&lt;/li&gt;
&lt;li&gt;diagrams&lt;/li&gt;
&lt;li&gt;implementation ideas&lt;/li&gt;
&lt;li&gt;failures from your own experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This belongs in PKM first.&lt;/p&gt;

&lt;p&gt;Later, once the knowledge stabilizes, you may publish an article or turn it into documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use a wiki
&lt;/h2&gt;

&lt;p&gt;Use a wiki when knowledge must be shared and maintained.&lt;/p&gt;

&lt;p&gt;Good scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;team onboarding&lt;/li&gt;
&lt;li&gt;API documentation&lt;/li&gt;
&lt;li&gt;operational runbooks&lt;/li&gt;
&lt;li&gt;architecture decision records&lt;/li&gt;
&lt;li&gt;product knowledge&lt;/li&gt;
&lt;li&gt;deployment instructions&lt;/li&gt;
&lt;li&gt;support procedures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use a wiki when others need a reliable answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Your team has one correct way to deploy a Hugo site to S3 and CloudFront.&lt;/p&gt;

&lt;p&gt;That does not belong only in someone's private notes.&lt;/p&gt;

&lt;p&gt;It belongs in a wiki or documentation system with clear ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use RAG
&lt;/h2&gt;

&lt;p&gt;Use RAG when an AI system needs access to external knowledge at query time.&lt;/p&gt;

&lt;p&gt;Good scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chatbot over documentation&lt;/li&gt;
&lt;li&gt;search assistant over internal docs&lt;/li&gt;
&lt;li&gt;support assistant over help articles&lt;/li&gt;
&lt;li&gt;legal or compliance assistant&lt;/li&gt;
&lt;li&gt;research over large document sets&lt;/li&gt;
&lt;li&gt;developer assistant over code docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use RAG when the problem is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model needs information that lives outside its weights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;You have hundreds of technical articles and want an assistant to answer questions using them.&lt;/p&gt;

&lt;p&gt;RAG is a good fit.&lt;/p&gt;

&lt;p&gt;But only if the documents are clean enough to retrieve from.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use AI memory
&lt;/h2&gt;

&lt;p&gt;Use AI memory when an agent needs continuity.&lt;/p&gt;

&lt;p&gt;Good scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coding agents that remember project conventions&lt;/li&gt;
&lt;li&gt;personal assistants that remember preferences&lt;/li&gt;
&lt;li&gt;research agents that continue long investigations&lt;/li&gt;
&lt;li&gt;tutoring agents that remember student progress&lt;/li&gt;
&lt;li&gt;support agents that remember prior interactions&lt;/li&gt;
&lt;li&gt;autonomous agents that track goals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use memory when the system must improve across time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;A coding agent should remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the project uses Go&lt;/li&gt;
&lt;li&gt;tests run with a specific command&lt;/li&gt;
&lt;li&gt;the user prefers minimal dependencies&lt;/li&gt;
&lt;li&gt;database migrations follow a convention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not just retrieval. It is persistent operating context.&lt;/p&gt;

&lt;h2&gt;
  
  
  How these systems combine
&lt;/h2&gt;

&lt;p&gt;The most useful systems are hybrids.&lt;/p&gt;

&lt;p&gt;A mature knowledge architecture might look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PKM for personal exploration&lt;/li&gt;
&lt;li&gt;Wiki for stable shared knowledge&lt;/li&gt;
&lt;li&gt;RAG for machine access&lt;/li&gt;
&lt;li&gt;AI memory for long-running agent continuity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer has a job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1. PKM to wiki
&lt;/h2&gt;

&lt;p&gt;This is the human knowledge pipeline.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture notes privately&lt;/li&gt;
&lt;li&gt;Connect ideas&lt;/li&gt;
&lt;li&gt;Distill insights&lt;/li&gt;
&lt;li&gt;Publish stable knowledge&lt;/li&gt;
&lt;li&gt;Maintain as shared reference&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is how personal research becomes organizational knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;You research self-hosted knowledge tools in Obsidian.&lt;/p&gt;

&lt;p&gt;After testing DokuWiki, Nextcloud, and static Markdown systems, you write a stable guide in your site or team wiki.&lt;/p&gt;

&lt;p&gt;PKM created the insight.&lt;br&gt;
The wiki preserves the result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2. Wiki to RAG
&lt;/h2&gt;

&lt;p&gt;This is the machine access pipeline.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Maintain canonical wiki pages&lt;/li&gt;
&lt;li&gt;Index them&lt;/li&gt;
&lt;li&gt;Retrieve relevant sections&lt;/li&gt;
&lt;li&gt;Generate grounded answers&lt;/li&gt;
&lt;li&gt;Link back to sources&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is one of the cleanest RAG patterns.&lt;/p&gt;

&lt;p&gt;The wiki remains the source of truth.&lt;br&gt;
RAG becomes the access layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;A support bot answers questions using a product wiki.&lt;/p&gt;

&lt;p&gt;The bot should not replace the wiki. It should cite and route users back to the canonical pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3. RAG plus memory
&lt;/h2&gt;

&lt;p&gt;This is the agent continuity pipeline.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RAG retrieves external facts&lt;/li&gt;
&lt;li&gt;Memory stores user or task context&lt;/li&gt;
&lt;li&gt;The agent combines both&lt;/li&gt;
&lt;li&gt;Future behavior improves&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RAG answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the knowledge base say?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Memory answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What matters about this user, project, or task?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;A coding agent uses RAG to retrieve framework docs.&lt;/p&gt;

&lt;p&gt;It uses memory to remember that your project avoids ORMs, prefers sqlc, and uses structured logging.&lt;/p&gt;

&lt;p&gt;Those are different knowledge types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 4. PKM plus AI assistant
&lt;/h2&gt;

&lt;p&gt;This is the hybrid thinking pipeline.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Human captures notes&lt;/li&gt;
&lt;li&gt;AI summarizes and suggests links&lt;/li&gt;
&lt;li&gt;Human edits and validates&lt;/li&gt;
&lt;li&gt;Knowledge becomes more structured&lt;/li&gt;
&lt;li&gt;Some pages graduate to wiki or publication&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI augments the PKM system, but it should not own the truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;An AI assistant can suggest connections between notes about RAG, memory systems, and LLM Wiki.&lt;/p&gt;

&lt;p&gt;But the human decides which connections are meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common architecture mistakes
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Mistake 1. Treating RAG as a wiki
&lt;/h2&gt;

&lt;p&gt;RAG is not a knowledge base.&lt;/p&gt;

&lt;p&gt;It does not automatically create a canonical structure. It retrieves from whatever exists.&lt;/p&gt;

&lt;p&gt;If the source documents are bad, RAG becomes a confident interface to bad knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 2. Treating memory as a database
&lt;/h2&gt;

&lt;p&gt;AI memory is selective context, not general storage.&lt;/p&gt;

&lt;p&gt;A database stores records.&lt;br&gt;
Memory changes behavior.&lt;/p&gt;

&lt;p&gt;If you need exact facts, use a database or knowledge base.&lt;br&gt;
If you need continuity, use memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 3. Treating PKM as documentation
&lt;/h2&gt;

&lt;p&gt;PKM can be messy.&lt;/p&gt;

&lt;p&gt;Documentation should not be.&lt;/p&gt;

&lt;p&gt;Private notes can contain half-formed ideas. Shared documentation should contain stable, maintained knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 4. Treating a wiki as a thinking tool
&lt;/h2&gt;

&lt;p&gt;A wiki can support thinking, but it is not ideal for early exploration.&lt;/p&gt;

&lt;p&gt;If every early thought must become a polished page, people stop writing.&lt;/p&gt;

&lt;p&gt;Use PKM for rough thinking. Use wikis for durable knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 5. Treating long context as memory
&lt;/h2&gt;

&lt;p&gt;Long context is not memory.&lt;/p&gt;

&lt;p&gt;It only helps while the context is present.&lt;/p&gt;

&lt;p&gt;Memory persists, selects, updates, and sometimes forgets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision guide
&lt;/h2&gt;

&lt;p&gt;Use this simple decision model.&lt;/p&gt;

&lt;h3&gt;
  
  
  If the knowledge is private and evolving
&lt;/h3&gt;

&lt;p&gt;Use PKM.&lt;/p&gt;

&lt;h3&gt;
  
  
  If the knowledge is shared and stable
&lt;/h3&gt;

&lt;p&gt;Use a wiki.&lt;/p&gt;

&lt;h3&gt;
  
  
  If an AI needs to answer from external documents
&lt;/h3&gt;

&lt;p&gt;Use RAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  If an agent needs continuity over time
&lt;/h3&gt;

&lt;p&gt;Use memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  If you need all four
&lt;/h3&gt;

&lt;p&gt;Build a layered system.&lt;/p&gt;

&lt;p&gt;Do not force one tool to do every job.&lt;/p&gt;

&lt;h2&gt;
  
  
  The knowledge systems spectrum
&lt;/h2&gt;

&lt;p&gt;These systems form a spectrum from human thinking to AI continuity.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human thought&lt;/td&gt;
&lt;td&gt;PKM&lt;/td&gt;
&lt;td&gt;Explore and synthesize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared structure&lt;/td&gt;
&lt;td&gt;Wiki&lt;/td&gt;
&lt;td&gt;Preserve and maintain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Machine access&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Retrieve and generate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent continuity&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Persist and adapt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The direction matters.&lt;/p&gt;

&lt;p&gt;Knowledge often starts as personal thought, becomes shared structure, is indexed for machine retrieval, and then becomes part of persistent agent behavior.&lt;/p&gt;

&lt;p&gt;That is the modern knowledge stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where LLM Wiki fits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/knowledge-management/knowledge-systems-architectures/compiled-knowledge/what-is-llm-wiki/" rel="noopener noreferrer"&gt;LLM Wiki&lt;/a&gt; style systems sit between wiki and AI architecture.&lt;/p&gt;

&lt;p&gt;They are not classic RAG.&lt;/p&gt;

&lt;p&gt;Instead of retrieving chunks only at query time, they attempt to pre-structure knowledge into pages, summaries, entities, and links.&lt;/p&gt;

&lt;p&gt;That makes them closer to compiled knowledge systems.&lt;/p&gt;

&lt;p&gt;A useful placement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Position&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wiki&lt;/td&gt;
&lt;td&gt;Human maintained structured knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Query time machine retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Wiki&lt;/td&gt;
&lt;td&gt;Ingest time machine structured knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Agent persistent context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why LLM Wiki belongs near knowledge systems architecture, not inside ordinary RAG.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical examples
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Example 1. Personal technical blog
&lt;/h2&gt;

&lt;p&gt;A technical blogger might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PKM for research notes&lt;/li&gt;
&lt;li&gt;Hugo site as published knowledge&lt;/li&gt;
&lt;li&gt;internal linking as wiki-like structure&lt;/li&gt;
&lt;li&gt;RAG later for site search&lt;/li&gt;
&lt;li&gt;AI memory for writing assistant preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a strong architecture.&lt;/p&gt;

&lt;p&gt;It keeps human judgment at the center while still allowing AI support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 2. Engineering team
&lt;/h2&gt;

&lt;p&gt;An engineering team might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PKM for individual learning&lt;/li&gt;
&lt;li&gt;wiki for standards and runbooks&lt;/li&gt;
&lt;li&gt;RAG assistant for internal docs&lt;/li&gt;
&lt;li&gt;memory for coding agents working inside repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wiki should remain canonical.&lt;/p&gt;

&lt;p&gt;The RAG assistant should not invent process.&lt;br&gt;
The memory layer should remember project preferences, not replace architecture decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 3. AI research workflow
&lt;/h2&gt;

&lt;p&gt;A researcher might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PKM for paper notes&lt;/li&gt;
&lt;li&gt;wiki for stable summaries&lt;/li&gt;
&lt;li&gt;RAG for literature search&lt;/li&gt;
&lt;li&gt;memory for long-running research agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works because each layer handles a different time scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and governance
&lt;/h2&gt;

&lt;p&gt;Knowledge systems become risky when they store sensitive or stale information.&lt;/p&gt;

&lt;h3&gt;
  
  
  PKM governance
&lt;/h3&gt;

&lt;p&gt;Questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should stay private?&lt;/li&gt;
&lt;li&gt;What should be published?&lt;/li&gt;
&lt;li&gt;What should be deleted?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wiki governance
&lt;/h3&gt;

&lt;p&gt;Questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who owns each page?&lt;/li&gt;
&lt;li&gt;When was it last reviewed?&lt;/li&gt;
&lt;li&gt;What is canonical?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  RAG governance
&lt;/h3&gt;

&lt;p&gt;Questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which sources are indexed?&lt;/li&gt;
&lt;li&gt;Are answers cited?&lt;/li&gt;
&lt;li&gt;How is retrieval evaluated?&lt;/li&gt;
&lt;li&gt;What content is excluded?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Memory governance
&lt;/h3&gt;

&lt;p&gt;Questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is remembered?&lt;/li&gt;
&lt;li&gt;Can users inspect memory?&lt;/li&gt;
&lt;li&gt;Can users delete memory?&lt;/li&gt;
&lt;li&gt;How are wrong memories corrected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory needs the strictest governance because it can silently influence future behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  SEO and content strategy note
&lt;/h2&gt;

&lt;p&gt;If you run a technical site, this distinction is not only architectural. It is also editorial.&lt;/p&gt;

&lt;p&gt;You can map content like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PKM pages explain human knowledge practices.&lt;/li&gt;
&lt;li&gt;Wiki pages explain structured knowledge systems.&lt;/li&gt;
&lt;li&gt;RAG pages explain retrieval engineering.&lt;/li&gt;
&lt;li&gt;Memory pages explain persistent AI behavior.&lt;/li&gt;
&lt;li&gt;Architecture pages compare and connect the paradigms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives your site a clean authority mesh instead of a pile of loosely related AI articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final conclusion
&lt;/h2&gt;

&lt;p&gt;PKM, RAG, wikis, and AI memory systems are not competitors.&lt;/p&gt;

&lt;p&gt;They are different answers to different questions.&lt;/p&gt;

&lt;p&gt;PKM asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I think better over time?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A wiki asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What do we know, and where is the trusted version?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;RAG asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What external context should the model use right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI memory asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What should this agent remember for the future?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you separate those questions, the architecture becomes obvious.&lt;/p&gt;

&lt;p&gt;Use PKM for thinking.&lt;br&gt;
Use wikis for shared truth.&lt;br&gt;
Use RAG for retrieval.&lt;br&gt;
Use memory for continuity.&lt;/p&gt;

&lt;p&gt;The future is not one knowledge system that replaces all others.&lt;/p&gt;

&lt;p&gt;The future is layered knowledge architecture. For tools, methods, and self-hosted platforms across the full &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;knowledge management&lt;/a&gt; spectrum, the cluster pillar maps the territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources and further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/use-cases/retrieval-augmented-generation" rel="noopener noreferrer"&gt;https://cloud.google.com/use-cases/retrieval-augmented-generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;https://aws.amazon.com/what-is/retrieval-augmented-generation/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/retrieval-augmented-generation" rel="noopener noreferrer"&gt;https://www.ibm.com/think/topics/retrieval-augmented-generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/knowledge-management" rel="noopener noreferrer"&gt;https://www.ibm.com/think/topics/knowledge-management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2310.08560" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2310.08560&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.memgpt.ai/" rel="noopener noreferrer"&gt;https://research.memgpt.ai/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zettelkasten.de/posts/building-a-second-brain-and-zettelkasten/" rel="noopener noreferrer"&gt;https://zettelkasten.de/posts/building-a-second-brain-and-zettelkasten/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>wiki</category>
      <category>ai</category>
      <category>knowledgemanagement</category>
    </item>
    <item>
      <title>Agentic LLM Inference Parameters Reference for Qwen and Gemma</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 17 May 2026 02:27:20 +0000</pubDate>
      <link>https://forem.com/rosgluk/agentic-llm-inference-parameters-reference-for-qwen-and-gemma-4nkh</link>
      <guid>https://forem.com/rosgluk/agentic-llm-inference-parameters-reference-for-qwen-and-gemma-4nkh</guid>
      <description>&lt;p&gt;This page is a &lt;strong&gt;practical reference for agentic LLM inference tuning&lt;/strong&gt; (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).&lt;/p&gt;

&lt;p&gt;It sits alongside the broader &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;LLM performance engineering hub&lt;/a&gt; and matches best with a clear &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM hosting and serving story&lt;/a&gt;—throughput and scheduling still dominate when the model is starved, but unstable sampling burns retries and output tokens before the GPU does.&lt;/p&gt;

&lt;p&gt;This page consolidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vendor recommended parameters&lt;/li&gt;
&lt;li&gt;embedded defaults from GGUF and APIs&lt;/li&gt;
&lt;li&gt;real-world community findings&lt;/li&gt;
&lt;li&gt;agentic workflow optimizations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now it is focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 3.6&lt;/strong&gt; (dense and MoE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; (dense and MoE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run terminal agents such as OpenCode, pair this reference with &lt;a href="https://www.glukhov.org/ai-devtools/opencode/llms-comparison/" rel="noopener noreferrer"&gt;local LLM behavior in OpenCode&lt;/a&gt; so workload-level results and sampler defaults stay aligned.&lt;/p&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Provide a single place to configure models for &lt;strong&gt;agent loops, coding, and multi-step reasoning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TLDR Reference Table - All models (agentic defaults)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;temp&lt;/th&gt;
&lt;th&gt;top_p&lt;/th&gt;
&lt;th&gt;top_k&lt;/th&gt;
&lt;th&gt;presence_penalty&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 27B&lt;/td&gt;
&lt;td&gt;thinking general&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 27B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 35B MoE&lt;/td&gt;
&lt;td&gt;thinking&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 35B MoE&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;general&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B MoE&lt;/td&gt;
&lt;td&gt;general&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B MoE&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What "Agentic Inference" Actually Means
&lt;/h2&gt;

&lt;p&gt;Most parameter guides assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chat&lt;/li&gt;
&lt;li&gt;single-shot completion&lt;/li&gt;
&lt;li&gt;human interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agentic systems are different.&lt;/p&gt;

&lt;p&gt;They require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-step reasoning&lt;/li&gt;
&lt;li&gt;tool calling&lt;/li&gt;
&lt;li&gt;consistent outputs&lt;/li&gt;
&lt;li&gt;low error propagation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This changes tuning priorities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core shift
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chat&lt;/td&gt;
&lt;td&gt;natural language quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative&lt;/td&gt;
&lt;td&gt;diversity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic&lt;/td&gt;
&lt;td&gt;consistency + reasoning stability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Qwen 3.6 Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dense vs MoE matters
&lt;/h3&gt;

&lt;p&gt;Qwen is one of the few families where:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MoE requires different penalties&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Dense (27B)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;stable&lt;/li&gt;
&lt;li&gt;predictable&lt;/li&gt;
&lt;li&gt;no routing complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;presence_penalty = 0.0&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  MoE (35B-A3B)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;expert routing per token&lt;/li&gt;
&lt;li&gt;risk of repetition loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;presence_penalty = 1.5 (general)&lt;/li&gt;
&lt;li&gt;0.0 for coding&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;MoE models can get stuck reusing the same experts.&lt;/p&gt;

&lt;p&gt;Presence penalty helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;diversify token paths&lt;/li&gt;
&lt;li&gt;improve reasoning exploration&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Qwen Agentic Coding Setup
&lt;/h2&gt;

&lt;p&gt;This is where most people get it wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correct setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;temperature = 0.6&lt;/li&gt;
&lt;li&gt;top_p = 0.95&lt;/li&gt;
&lt;li&gt;top_k = 20&lt;/li&gt;
&lt;li&gt;presence_penalty = 0.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why low temperature works
&lt;/h3&gt;

&lt;p&gt;Coding agents need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic outputs&lt;/li&gt;
&lt;li&gt;repeatable tool calls&lt;/li&gt;
&lt;li&gt;stable formatting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Higher temperature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;breaks JSON&lt;/li&gt;
&lt;li&gt;introduces hallucinated APIs&lt;/li&gt;
&lt;li&gt;increases retries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gemma 4 Tuning
&lt;/h2&gt;

&lt;p&gt;Gemma behaves differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  No official defaults
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;model cards are empty&lt;/li&gt;
&lt;li&gt;configs are implicit&lt;/li&gt;
&lt;li&gt;real tuning comes from:

&lt;ul&gt;
&lt;li&gt;Google AI Studio&lt;/li&gt;
&lt;li&gt;GGUF defaults&lt;/li&gt;
&lt;li&gt;community benchmarks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Counter-Intuitive Finding
&lt;/h2&gt;

&lt;p&gt;Gemma 4 performs better with &lt;strong&gt;higher temperature&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observed behavior
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Temp&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;poor reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;stable baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.2 to 1.5&lt;/td&gt;
&lt;td&gt;best coding performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This contradicts standard advice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why high temperature works here
&lt;/h2&gt;

&lt;p&gt;Hypothesis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;training distribution favors exploration&lt;/li&gt;
&lt;li&gt;reasoning mode depends on diversity&lt;/li&gt;
&lt;li&gt;model compensates for lack of explicit chain-of-thought control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;higher temperature improves solution search space&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gemma Agentic Coding Setup
&lt;/h2&gt;

&lt;p&gt;Recommended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;temperature = 1.2&lt;/li&gt;
&lt;li&gt;top_p = 0.95&lt;/li&gt;
&lt;li&gt;top_k = 65&lt;/li&gt;
&lt;li&gt;penalties = 0.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important
&lt;/h3&gt;

&lt;p&gt;Do not apply traditional "low temp for code" rule blindly.&lt;/p&gt;

&lt;p&gt;Gemma is an exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  Thinking Mode and Agent Systems
&lt;/h2&gt;

&lt;p&gt;Both Qwen and Gemma support reasoning modes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it matters
&lt;/h3&gt;

&lt;p&gt;Agent loops require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intermediate reasoning&lt;/li&gt;
&lt;li&gt;error recovery&lt;/li&gt;
&lt;li&gt;multi-step planning&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical rule
&lt;/h3&gt;

&lt;p&gt;Always enable thinking mode for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coding agents&lt;/li&gt;
&lt;li&gt;tool use&lt;/li&gt;
&lt;li&gt;multi-step tasks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Parameter Strategy by Use Case
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Coding agents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;prioritize determinism&lt;/li&gt;
&lt;li&gt;minimize penalties&lt;/li&gt;
&lt;li&gt;stable sampling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reasoning agents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;moderate temperature&lt;/li&gt;
&lt;li&gt;allow exploration&lt;/li&gt;
&lt;li&gt;preserve structure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tool calling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;strict formatting&lt;/li&gt;
&lt;li&gt;low randomness&lt;/li&gt;
&lt;li&gt;consistent token patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Schema and JSON tooling are orthogonal to logits; combine these sampling rules with &lt;a href="https://www.glukhov.org/llm-performance/ollama/llm-structured-output-with-ollama-in-python-and-go/" rel="noopener noreferrer"&gt;structured output patterns for Ollama and Qwen3&lt;/a&gt; so validators see fewer retries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vendor Defaults vs Reality
&lt;/h2&gt;

&lt;p&gt;Vendor defaults are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;safe&lt;/li&gt;
&lt;li&gt;generic&lt;/li&gt;
&lt;li&gt;not optimized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Community findings often show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better performance&lt;/li&gt;
&lt;li&gt;task-specific tuning&lt;/li&gt;
&lt;li&gt;architecture-aware adjustments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Gemma:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;official: no guidance&lt;/li&gt;
&lt;li&gt;community: high temperature improves coding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Qwen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;official: inconsistent sections&lt;/li&gt;
&lt;li&gt;community: standardized values converge&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Deployment Notes
&lt;/h2&gt;

&lt;p&gt;Under concurrency, queueing and memory splits interact with retries as much as sampling does—read &lt;a href="https://www.glukhov.org/llm-performance/ollama/how-ollama-handles-parallel-requests/" rel="noopener noreferrer"&gt;how Ollama handles parallel requests&lt;/a&gt; alongside the presets above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollama
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;works well for both families&lt;/li&gt;
&lt;li&gt;verify GPU compatibility&lt;/li&gt;
&lt;li&gt;defaults may differ from reference&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  vLLM
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;supports advanced sampling&lt;/li&gt;
&lt;li&gt;stable for production&lt;/li&gt;
&lt;li&gt;use explicit parameters&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  llama.cpp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;requires sampler ordering&lt;/li&gt;
&lt;li&gt;always enable jinja for modern models&lt;/li&gt;
&lt;li&gt;incorrect sampler chain reduces output quality&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;there is no universal parameter set&lt;/li&gt;
&lt;li&gt;architecture matters more than model size&lt;/li&gt;
&lt;li&gt;agentic systems require different tuning than chat&lt;/li&gt;
&lt;li&gt;community benchmarks are often ahead of vendors&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Opinion
&lt;/h2&gt;

&lt;p&gt;Most parameter guides are outdated.&lt;/p&gt;

&lt;p&gt;They assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chat use&lt;/li&gt;
&lt;li&gt;low temperature for code&lt;/li&gt;
&lt;li&gt;static configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern models break those assumptions.&lt;/p&gt;

&lt;p&gt;If you are building agentic systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;treat inference tuning as a first-class system design problem&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not a config file.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Direction
&lt;/h2&gt;

&lt;p&gt;This reference will evolve into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;per-model deep dives&lt;/li&gt;
&lt;li&gt;agent-specific configs&lt;/li&gt;
&lt;li&gt;benchmarking-backed tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;inference is where model capability becomes system performance&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>hermes</category>
      <category>openclaw</category>
      <category>opencode</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>LLM Structured Output Validation in Python That Holds Up</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 15 May 2026 01:26:29 +0000</pubDate>
      <link>https://forem.com/rosgluk/llm-structured-output-validation-in-python-that-holds-up-3inc</link>
      <guid>https://forem.com/rosgluk/llm-structured-output-validation-in-python-that-holds-up-3inc</guid>
      <description>&lt;p&gt;Most LLM "structured output" tutorials are unserious.&lt;br&gt;
They teach you to ask for JSON politely and then hope the model behaves.&lt;br&gt;
That is not validation.&lt;br&gt;
That is optimism with braces.&lt;/p&gt;



&lt;p&gt;OpenAI's own docs make the distinction explicit. JSON mode gives you valid JSON, while Structured Outputs enforces schema adherence, and OpenAI recommends using Structured Outputs instead of JSON mode when possible.&lt;/p&gt;

&lt;p&gt;That still does not make the payload trustworthy. JSON Schema defines structure and allowed values, Pydantic gives you typed validation in Python, and OpenAI explicitly notes that a schema-valid response can still contain incorrect values. On top of that, refusals and incomplete outputs can bypass the shape you expected. In production, structured output validation is a pipeline, not a toggle. The same boundary also has to live inside the wider story of throughput, retries, and scheduler limits on the &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;LLM performance engineering hub&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Structured output validation is a contract
&lt;/h2&gt;

&lt;p&gt;Structured output validation for LLMs means you define the shape of the answer up front, constrain the model to produce that shape when possible, and then validate the result again before your application trusts it. In practical terms, that means checking required fields, types, enums, closed object shapes, and domain rules before the payload touches your database, UI, queue, or downstream service. JSON Schema exists for exactly this kind of structural validation, Pydantic is built to validate untrusted data against Python type hints, and Python's &lt;code&gt;jsonschema&lt;/code&gt; library gives you a direct way to validate an instance against a schema.&lt;/p&gt;

&lt;p&gt;There is also a clean split between two common use cases. If the model is supposed to answer the user in a structured format, use a structured response format. If the model is supposed to call your application's tools or functions, use function calling. OpenAI's docs spell out that distinction, and for function calling they recommend enabling &lt;code&gt;strict: true&lt;/code&gt; so the arguments reliably adhere to the function schema.&lt;/p&gt;

&lt;p&gt;My strong opinion is simple. Treat every structured LLM response as an API boundary. Once you start thinking in terms of contracts instead of prompts, the architecture gets cleaner, the bugs get cheaper, and the whole "why did the model invent a new field in production" problem mostly disappears. That is the real answer to "what is structured output validation for LLMs" and it is a much better answer than "ask the model nicely for JSON."&lt;/p&gt;
&lt;h2&gt;
  
  
  JSON mode is not validation
&lt;/h2&gt;

&lt;p&gt;If you remember only one thing from this article, make it this. JSON mode is not schema validation. OpenAI's Help Center says JSON mode will not guarantee the output matches any specific schema, only that it is valid JSON and parses without errors. The Structured Outputs guide says the same thing in a cleaner way. Both JSON mode and Structured Outputs can produce valid JSON, but only Structured Outputs enforces schema adherence.&lt;/p&gt;

&lt;p&gt;That difference matters more than people admit. In its Structured Outputs launch post, OpenAI reported that &lt;code&gt;gpt-4o-2024-08-06&lt;/code&gt; with Structured Outputs scored 100 percent on its complex JSON schema evals, while &lt;code&gt;gpt-4-0613&lt;/code&gt; scored under 40 percent. You do not need to treat those numbers as universal truth to see the broader point. Schema enforcement changes the failure surface from "anything could happen" to "the contract is much tighter."&lt;/p&gt;

&lt;p&gt;There are still edge cases, and pretending otherwise is how toy demos become pager duty. OpenAI documents that the model can refuse an unsafe request, and those refusals are surfaced outside your normal schema path. It also documents incomplete responses, including cases such as hitting &lt;code&gt;max_output_tokens&lt;/code&gt; or a content filter interruption. So the FAQ "is JSON mode enough for reliable LLM output" has a short answer and a longer one. The short answer is no. The longer answer is that even strict structured output still needs explicit failure handling.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where structured output still breaks
&lt;/h2&gt;

&lt;p&gt;Schema enforcement shrinks the problem. It does not delete it. In real traffic you still see broken or surprising payloads for reasons that have little to do with your prompt wording.&lt;/p&gt;
&lt;h3&gt;
  
  
  Failure shapes worth designing for
&lt;/h3&gt;

&lt;p&gt;Models and clients disagree about details. You can get extra prose before or after the JSON, Markdown fenced blocks around the payload, or a tool call whose name is valid but whose arguments are JSON that does not match your Pydantic model. Streaming makes it worse because you might validate a half-finished buffer. Defensive code should assume "string in, maybe JSON inside" rather than "bytes on the wire already match my model."&lt;/p&gt;
&lt;h3&gt;
  
  
  Provider and API differences
&lt;/h3&gt;

&lt;p&gt;Not every host exposes the same structured-output surface. One stack might give you a first-class schema-bound completion, another might only guarantee JSON syntax, and local runtimes might lag behind hosted APIs. That is one reason the FAQ "how do you validate LLM JSON in Python" starts with provider enforcement when it exists and still ends with Python-side validation. For a wider view of how vendors compare, see the &lt;a href="https://www.glukhov.org/llm-performance/benchmarks/structured-output-comparison-popular-llm-providers/" rel="noopener noreferrer"&gt;structured output comparison across popular LLM providers&lt;/a&gt;. If you run models locally, the same validation pipeline applies after you normalize the wire format, for example after extraction with Ollama as in &lt;a href="https://www.glukhov.org/llm-performance/ollama/llm-structured-output-with-ollama-in-python-and-go/" rel="noopener noreferrer"&gt;structured LLM output with Ollama in Python and Go&lt;/a&gt;. When a runtime still wraps JSON with odd prefixes or reasoning traces, expect the same class of parser failures described in &lt;a href="https://www.glukhov.org/llm-performance/ollama/ollama-gpt-oss-structured-output-issues/" rel="noopener noreferrer"&gt;Ollama GPT-OSS structured output issues&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Python stack that actually works
&lt;/h2&gt;

&lt;p&gt;My recommendation is boring on purpose. First, let the model provider enforce the structural contract when it can. Second, validate the returned payload in Python with Pydantic. Third, use explicit business-rule validation for facts that a schema alone cannot prove. Fourth, test the contract with fixtures and adversarial examples instead of waving at a playground screenshot and calling it done. OpenAI's Structured Outputs docs, Pydantic's validator model, Python's &lt;code&gt;jsonschema&lt;/code&gt; tooling, and OpenAI's own structured-output eval examples all point in that direction.&lt;/p&gt;

&lt;p&gt;Pydantic is the right center of gravity for Python. It lets you model the output as normal Python types, generate JSON Schema with &lt;code&gt;model_json_schema()&lt;/code&gt;, and validate raw JSON with &lt;code&gt;model_validate_json()&lt;/code&gt;. Pydantic's docs also note that &lt;code&gt;model_validate_json()&lt;/code&gt; is generally the better path than doing &lt;code&gt;json.loads(...)&lt;/code&gt; first and then validating, because that two-step route adds extra parsing work in Python.&lt;/p&gt;

&lt;p&gt;If you keep standalone schema files in your repo, or you want CI to validate fixture payloads independently of model code, Python's &lt;code&gt;jsonschema&lt;/code&gt; package gives you the simplest possible contract check with &lt;code&gt;jsonschema.validate(...)&lt;/code&gt;. If you want that in pre-commit, &lt;code&gt;check-jsonschema&lt;/code&gt; exists specifically as a CLI and pre-commit hook built on &lt;code&gt;jsonschema&lt;/code&gt;. That is a very good fit for teams that want schema changes reviewed like code changes.&lt;/p&gt;

&lt;p&gt;Frameworks can reduce plumbing, but they do not remove the need for actual validation. LangChain now auto-selects provider-native structured output when the provider supports it and falls back to a tool strategy otherwise. Instructor layers Pydantic response models, validation, retries, and multi-provider support on top of model calls. Guardrails focuses on validators and input-output guard layers. Useful tools, all of them. But the schema and the business rules still belong to you. If you are choosing between higher-level libraries, the &lt;a href="https://www.glukhov.org/llm-performance/benchmarks/baml-vs-instruct-for-structured-output-llm-in-python/" rel="noopener noreferrer"&gt;BAML vs Instructor comparison for Python&lt;/a&gt; is a useful companion to this article.&lt;/p&gt;
&lt;h2&gt;
  
  
  A minimal OpenAI and Pydantic example
&lt;/h2&gt;

&lt;p&gt;The smallest production-worthy example has a few non-negotiables. Use a closed set of enum-like values where possible. Forbid extra keys. Add field descriptions so the schema is understandable to humans and more legible to the model. Keep the root object explicit and boring. OpenAI recommends clear names plus titles and descriptions for important keys, JSON Schema uses &lt;code&gt;enum&lt;/code&gt; to restrict values, and Pydantic can close the object shape with &lt;code&gt;extra="forbid"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConfigDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfigDict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forbid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how_to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abuse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Support ticket category.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Operational urgency.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;needs_human&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Whether a human should review the case.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A one sentence summary of the issue.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-2024-08-06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify support tickets. Return only the structured result.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer reports duplicate charges after refreshing checkout.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;text_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_parsed&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details in that example are easy to miss and absolutely worth caring about. &lt;code&gt;extra="forbid"&lt;/code&gt; on the Pydantic side mirrors the JSON Schema idea of &lt;code&gt;additionalProperties: false&lt;/code&gt;, which is also a requirement for strict tool schemas in OpenAI's function-calling docs. And enums are not cosmetic. They are one of the simplest ways to stop the model from inventing a value your code does not understand.&lt;/p&gt;

&lt;p&gt;The OpenAI Python SDK supports &lt;code&gt;client.responses.parse(...)&lt;/code&gt; with a Pydantic model supplied as &lt;code&gt;text_format&lt;/code&gt;, and the parsed object is returned on &lt;code&gt;response.output_parsed&lt;/code&gt;. The same SDK also supports &lt;code&gt;client.chat.completions.parse(...)&lt;/code&gt;, where the parsed object lives on &lt;code&gt;message.parsed&lt;/code&gt;. If you want direct structured data extraction with minimal glue, those helpers are the cleanest starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parse, normalize, then validate
&lt;/h2&gt;

&lt;p&gt;Structured Outputs and &lt;code&gt;model_validate_json&lt;/code&gt; remove a lot of parsing pain when the stack is aligned end to end. The moment you support a provider that returns plain chat text, a model that wraps JSON in fences, or a logging path that stores the raw completion string, you want one choke point that turns text into a dict before Pydantic runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_json_from_llm_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Common "Sure, here is the JSON:" prefix before the object.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ticket_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_json_from_llm_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_completion_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That helper is intentionally boring. It handles fenced "&lt;br&gt;
&lt;br&gt;
&lt;code&gt;json ...&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
" blocks and a leading natural-language preamble when the payload is still a single top-level object. It is not a full JSON extractor. If the model nests braces inside string values, naive slicing can break, and the right fix is usually stricter prompting, schema-bound completions, or a dedicated parser library.&lt;/p&gt;
&lt;h3&gt;
  
  
  Streaming completions
&lt;/h3&gt;

&lt;p&gt;If you stream chat tokens, do not run &lt;code&gt;json.loads&lt;/code&gt; or &lt;code&gt;model_validate_json&lt;/code&gt; on every delta. Buffer until the API reports a finished message (check your client for the stream termination or &lt;code&gt;finish_reason&lt;/code&gt;), concatenate the text, then parse once. The same rule applies when tool-call arguments arrive in chunks. You only validate after the arguments string is complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;completion_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_completion_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_completion_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can still pass &lt;code&gt;raw_completion_text&lt;/code&gt; through &lt;code&gt;parse_json_from_llm_text&lt;/code&gt; first when you expect fences or chatter around the JSON.&lt;/p&gt;

&lt;p&gt;Once you own plain-string parsing, the next constraint is often not Python but the provider's JSON Schema dialect and what the remote API actually accepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provider schema limits (before you get clever in Python)
&lt;/h2&gt;

&lt;p&gt;Do not blindly dump any schema generator output into an API and assume every JSON Schema feature is supported. OpenAI supports a subset of JSON Schema, requires all fields to be required for Structured Outputs, requires the root to be an object rather than a top-level &lt;code&gt;anyOf&lt;/code&gt;, and documents limits on nesting depth and total property count. Keep the provider-facing schema simple. That is not a compromise. That is good engineering.&lt;/p&gt;

&lt;p&gt;If you need a provider-agnostic validation path, or you want to validate stored fixtures and mocks, Pydantic plus &lt;code&gt;jsonschema&lt;/code&gt; is still a great combination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jsonschema&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;validate_json&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkout duplicates charges after refresh.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pattern is especially handy in tests, contract fixtures, and integrations where the model provider does not offer native structured output enforcement. Just remember that a locally generated schema may be broader than a given provider's supported subset, so "valid locally" does not automatically mean "accepted by every LLM API." Also note that some providers preprocess and cache schema artifacts, so the first request for a new schema can be slower than warm requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool calls are a second contract
&lt;/h2&gt;

&lt;p&gt;Function or tool calling is the other major structured-output shape. The model chooses a name and passes arguments that should match a JSON Schema you control. OpenAI recommends &lt;code&gt;strict: true&lt;/code&gt; on tool definitions so arguments stay aligned with that schema. In agent-heavy stacks, bad sampling turns into invalid tool JSON fast; keep sampler settings aligned with multi-step work using the &lt;a href="https://www.glukhov.org/llm-performance/benchmarks/agentic-inference-parameters-reference/" rel="noopener noreferrer"&gt;agentic inference parameters reference for Qwen and Gemma&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The snippets below assume you already mapped the provider's tool-call object into a &lt;code&gt;name&lt;/code&gt; string and an &lt;code&gt;arguments&lt;/code&gt; dict, for example by parsing &lt;code&gt;tool_calls[].function&lt;/code&gt; on chat completions (JSON string arguments become &lt;code&gt;json.loads&lt;/code&gt; first). &lt;code&gt;dispatch_tool&lt;/code&gt; is the step after that normalization.&lt;/p&gt;

&lt;p&gt;Two practical rules help in Python. First, validate the tool name against an explicit allowlist before you route execution. Second, validate the arguments dict with the same Pydantic model you use in tests, not with ad hoc key access. The failure mode you are avoiding is "valid JSON arguments, wrong shape for the tool that fired," which slips past string checks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="n"&gt;ToolHandler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dispatch_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ToolHandler&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsupported tool &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;validated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ToolHandler&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify_ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued as &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pattern keeps routing and validation in one place. Your real handlers will be richer, but the split should stay the same: allowed names, typed arguments, then side effects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema validation still needs business rules
&lt;/h2&gt;

&lt;p&gt;A valid object is not the same thing as a correct object. OpenAI says this directly. Structured Outputs does not prevent mistakes inside the values of the JSON object. That is why the FAQ "why do schema validation and business-rule validation both matter" has a blunt answer. Because a response can match the schema perfectly and still be wrong in a way that hurts the business.&lt;/p&gt;

&lt;p&gt;Here is a realistic example. The structure can be valid, but the pricing logic can still be nonsense.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing_extensions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Self&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConfigDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_validator&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Offer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfigDict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forbid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EUR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GBP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;original_amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;discounted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;

    &lt;span class="nd"&gt;@model_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;after&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_discount_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discounted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_amount&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original_amount is required when discounted is true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original_amount must be greater than amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That validator does something schemas alone often do poorly in real systems. It checks cross-field semantics after the whole model has been parsed. Pydantic's &lt;code&gt;model_validator&lt;/code&gt; exists exactly for this kind of whole-object validation. Notice the &lt;code&gt;Decimal | None&lt;/code&gt; field without a default. That keeps the field present while still allowing &lt;code&gt;null&lt;/code&gt;, which matches OpenAI's documented pattern for optional-like values under strict Structured Outputs.&lt;/p&gt;

&lt;p&gt;If you want validation failures to feed back into the model automatically, Instructor is a practical layer on top of Pydantic. Its docs describe a retry loop where validation errors are captured, formatted as feedback, and used to ask the model to try again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;instructor&lt;/span&gt;

&lt;span class="n"&gt;retrying_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;instructor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;offer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retrying_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Offer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract the offer from this text. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Was 49.00 USD, now 19.00 USD.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of the few conveniences I will happily recommend. Automatic retries tied to real validation errors are useful. Silent coercion is not. Instructor's model layer, retry docs, and validation docs all lean into that same idea, and they are right to do so.&lt;/p&gt;

&lt;p&gt;You can implement the same idea without a framework. The loop is small. Ask the model, validate with Pydantic, and if validation fails, send the error details back in a follow-up user message and ask for corrected JSON only. Cap attempts, log the final failure, and surface a controlled error to callers. When you already rely on &lt;code&gt;responses.parse&lt;/code&gt; or other schema-bound helpers, you may rarely exercise this path. It still matters for JSON mode, older chat endpoints, or any gateway that hands you a raw string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return only JSON that matches the ticket schema.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer reports duplicate charges after refreshing checkout.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TicketClassification&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-2024-08-06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;raw_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validation failed with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Return corrected JSON only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exhausted structured output retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In real services you would attach tracing IDs, redact customer text in logs, and distinguish recoverable validation errors from refusals or incomplete responses. The important part is that the retry is driven by real validator output, not by a generic "try again" message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test, retry, and fail closed
&lt;/h2&gt;

&lt;p&gt;What should happen when LLM validation fails? Not a shrug. Reject the payload, log the failure, retry with bounded attempts if the task is worth retrying, and fail closed instead of normalizing garbage into something that only looks acceptable. This is also where many teams forget to handle refusals and incomplete outputs explicitly, even though the provider docs tell them those paths exist.&lt;/p&gt;

&lt;p&gt;For OpenAI's Responses API, failure handling should be first-class code, not an afterthought. The variable is &lt;code&gt;response&lt;/code&gt; from &lt;code&gt;client.responses.create&lt;/code&gt; or &lt;code&gt;parse&lt;/code&gt;, not &lt;code&gt;completion&lt;/code&gt; from chat streaming elsewhere in this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incomplete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incomplete_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refusal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refusal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not defensive over-engineering. It is directly aligned with the documented failure modes. If the model refuses, you are not holding a schema-valid payload. If the response is incomplete, you are not holding a schema-valid payload. Treat both as explicit branches in your control flow.&lt;/p&gt;

&lt;p&gt;You should also test the contract outside the model call itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jsonschema&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;validate_json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_ticket_fixture_matches_schema&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkout duplicates charges after refresh.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_discount_logic_rejects_broken_offer&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;Offer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;19.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;discounted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_ticket_rejects_unknown_category_string&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer wants a refund.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_ticket_rejects_extra_keys&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;TicketClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Broken flow.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the right shape of test strategy for LLM output validation in Python. Validate golden fixtures with &lt;code&gt;jsonschema&lt;/code&gt; so every field in the contract is exercised. Validate semantics with Pydantic, then add adversarial cases such as illegal enum strings, forbidden extra keys, and cross-field contradictions you care about. If you snapshot real model outputs, scrub PII and treat them as regression fixtures.&lt;/p&gt;

&lt;p&gt;If your team lives in the OpenAI stack, the Evals API also includes structured-output evaluation recipes specifically for testing and iterating on tasks that depend on machine-readable formats. And if you keep raw schema files in the repo, wire &lt;code&gt;check-jsonschema&lt;/code&gt; into CI or pre-commit. Ship contracts, not vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production checks that save you later
&lt;/h2&gt;

&lt;p&gt;When validation fails, the FAQ answer is blunt. Reject the payload, log why, retry with targeted feedback when the task is worth another attempt, and fail closed instead of coercing bad data into a queue.&lt;/p&gt;

&lt;p&gt;A short operations checklist helps teams avoid repeat incidents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log schema version or a hash of the JSON Schema you sent to the provider so you can replay failures accurately.&lt;/li&gt;
&lt;li&gt;Redact model inputs and outputs in logs. Structured logs are useless if they leak customer text.&lt;/li&gt;
&lt;li&gt;Emit counters or metrics for refusal rate, incomplete response rate, validation failure rate, and repair success rate. Spikes there beat guessing when a model or prompt change shipped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Broader &lt;a href="https://www.glukhov.org/observability/observability-for-llm-systems/" rel="noopener noreferrer"&gt;observability for LLM systems&lt;/a&gt; guidance helps wire those signals into dashboards, traces, and SLO reviews once the counters exist.&lt;/p&gt;

&lt;p&gt;The best practice is not complicated. Use provider-side Structured Outputs or strict tool schemas when you can. Normalize raw text when you must. Mirror the contract in Python with Pydantic. Add business-rule validation for what the schema cannot prove. Handle refusals and incomplete responses as normal branches. Test the contract until it stops being a demo and starts being software. Anything less is just prompt engineering cosplay.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>llm</category>
      <category>ai</category>
      <category>aicoding</category>
    </item>
    <item>
      <title>Second Brain Explained for Engineers and Knowledge Workers</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Thu, 14 May 2026 08:51:13 +0000</pubDate>
      <link>https://forem.com/rosgluk/second-brain-explained-for-engineers-and-knowledge-workers-bpg</link>
      <guid>https://forem.com/rosgluk/second-brain-explained-for-engineers-and-knowledge-workers-bpg</guid>
      <description>&lt;p&gt;Information overload is less about sheer volume than about unresolved inputs. Modern knowledge work leaves a trail of tabs, chat threads, docs, highlights, snippets, transcripts, screenshots, and half-written notes.&lt;/p&gt;

&lt;p&gt;Most of that material is only potentially useful, because almost none of it surfaces at the moment it would actually help. That gap between capture and reuse is where the idea of a second brain becomes interesting.&lt;/p&gt;

&lt;p&gt;In contemporary personal knowledge management, Tiago Forte popularized the term &lt;em&gt;second brain&lt;/em&gt; for an external digital repository of ideas, insights, and resources. The phrase can sound inflated, yet the useful core is practical. A second brain externalizes thinking so your biological brain spends less energy on storage and more on interpretation, connection, and output.&lt;/p&gt;

&lt;p&gt;The site’s &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;Knowledge Management in 2026&lt;/a&gt; hub gathers adjacent guides—tools, self-hosted wikis, and PKM methods—when you want surrounding context beyond this article.&lt;/p&gt;

&lt;p&gt;Philosophically, the idea is less exotic than the branding implies. External media have always extended cognition—a notebook, a diagram, a link map, or a markdown vault can sit inside the thinking loop. A second brain is that familiar pattern updated for search, backlinks, linked notes, and AI-assisted retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Second Brain
&lt;/h2&gt;

&lt;p&gt;A second brain is an external knowledge system, but that label alone is too weak. Plenty of systems store information; a genuine second brain also helps you retrieve, compare, compress, and reuse ideas.&lt;/p&gt;

&lt;p&gt;That is why a second brain is not merely a note-taking app. Apps hold text; a second brain sustains a loop between capture and expression. When someone asks what a second brain &lt;em&gt;is&lt;/em&gt;, the shortest honest answer is that it is a personal system for turning scattered inputs into reusable thinking.&lt;/p&gt;

&lt;p&gt;The contrast between notes and a knowledge system matters because notes are inert artifacts. A knowledge system gives those artifacts retrieval paths, relationships, and context. A folder full of markdown files is no more a second brain than a pile of source files is a finished product—structure and flow are the missing layers.&lt;/p&gt;

&lt;p&gt;The strongest setups therefore resist obsession with storage. Storage is cheap, retrieval is expensive, and synthesis is where value compounds. If the system cannot help turn yesterday’s reading into tomorrow’s writing, design, research, or decision-making, it behaves less like a brain and more like a basement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Principles of a Second Brain
&lt;/h2&gt;

&lt;p&gt;The most useful modern framing is CODE—Capture, Organize, Distill, Express. The acronym sounds simple because it &lt;em&gt;is&lt;/em&gt; simple, which is part of its power.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capture
&lt;/h3&gt;

&lt;p&gt;Capture does not mean saving everything; that path leads quickly to digital hoarding. Good capture means saving ideas with future energy. Useful notes tend to be surprising, reusable, unresolved, emotional, or clearly tied to active work.&lt;/p&gt;

&lt;p&gt;Accordingly, the capture question is rarely “Should this be saved forever?” The sharper question is “Will this be useful again in a different context?” A second brain improves when it collects sparks rather than exhaust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Organize
&lt;/h3&gt;

&lt;p&gt;Organization is not about perfect taxonomy. It is about retrieval with low friction—making information easier to find while work is already in motion.&lt;/p&gt;

&lt;p&gt;Here PARA often enters the conversation. Projects, Areas, Resources, and Archives offer a lightweight way to organize by actionability rather than abstract topic. Strict category trees often decay into maintenance work, whereas action-oriented buckets keep the system tethered to reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distill
&lt;/h3&gt;

&lt;p&gt;Distillation is where raw notes stop cluttering the vault and start becoming knowledge. A long highlight dump is not yet useful; a distilled note surfaces what is worth keeping, which claims deserve testing, and which ideas can be reused.&lt;/p&gt;

&lt;p&gt;Many people skip this step, yet it is what makes the whole method work. Distillation turns large volumes of text into a smaller set of ideas you can recognize later without rereading everything from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Express
&lt;/h3&gt;

&lt;p&gt;Expression is the phase most note-taking systems quietly avoid, but without output the loop never closes. A second brain earns its keep when notes become articles, designs, code comments, decision memos, architecture docs, or working theories.&lt;/p&gt;

&lt;p&gt;Without output there is no pressure test, and without a pressure test there is no learning loop—so a second brain that never expresses anything is only a well-organized backlog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Brain vs PKM
&lt;/h2&gt;

&lt;p&gt;Personal knowledge management (PKM) names the wider field—the habits, skills, and systems people use to gather, evaluate, organize, retrieve, and apply what they learn. In academic literature PKM stretches beyond note-taking and software into cognitive, informational, social, and learning competencies. For a fuller tour of that field than this narrower framing allows, see &lt;a href="https://www.glukhov.org/knowledge-management/foundations/personal-knowledge-management/" rel="noopener noreferrer"&gt;Personal Knowledge Management — goals, methods, and tools&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A second brain sits inside that umbrella as one philosophy of PKM, especially the digital workflow built around capture, organization, distillation, and expression. In Tiago Forte’s framing, &lt;em&gt;Building a Second Brain&lt;/em&gt; describes the larger creative process, while PARA is one implementation layer within it.&lt;/p&gt;

&lt;p&gt;The terms are related but not interchangeable. PKM is the category; a second brain is an opinionated implementation—and many online debates about second-brain systems are really debates about the broader PKM problem wearing a narrower label.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Brain vs Wiki vs RAG
&lt;/h2&gt;

&lt;p&gt;Technical readers usually arrive next at a pair of questions—how a second brain differs from a wiki, and how it differs from RAG—and the answer begins with intent.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Primary job&lt;/th&gt;
&lt;th&gt;Best at&lt;/th&gt;
&lt;th&gt;Weak point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Second brain&lt;/td&gt;
&lt;td&gt;Personal evolving context&lt;/td&gt;
&lt;td&gt;Idea development and synthesis&lt;/td&gt;
&lt;td&gt;Can become messy and highly personal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wiki&lt;/td&gt;
&lt;td&gt;Shared structured knowledge&lt;/td&gt;
&lt;td&gt;Documentation and stable reference&lt;/td&gt;
&lt;td&gt;Weaker for unfinished thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Query time retrieval for AI&lt;/td&gt;
&lt;td&gt;Grounded responses over external sources&lt;/td&gt;
&lt;td&gt;Does not preserve human interpretation by itself&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Wikis stabilize knowledge. They favor explicit structure, shared naming, and pages that converge toward a source of truth, which makes them excellent for documentation yet awkward for half-formed concepts, private context, and exploratory thinking. Self-hosted setups such as &lt;a href="https://www.glukhov.org/knowledge-management/self-hosted-knowledge/dokuwiki-selfhosted-wiki-alternatives/" rel="noopener noreferrer"&gt;DokuWiki and its alternatives&lt;/a&gt; illustrate how teams turn that impulse into durable reference sites.&lt;/p&gt;

&lt;p&gt;A second brain usually begins from the opposite posture—it is personal, evolving, and tolerant of ambiguity, existing before consensus settles. In that sense a wiki is where knowledge goes when it stops changing quickly, whereas a second brain is where it still changes shape.&lt;/p&gt;

&lt;p&gt;RAG addresses yet another problem. Retrieval-augmented generation connects an AI model to external knowledge so responses can draw on fresher or more domain-specific context at query time. That capability is valuable, yet it is not the same as building a personal knowledge system—RAG retrieves at inference time, while a second brain remembers what mattered, why it mattered, and how your interpretation shifted.&lt;/p&gt;

&lt;p&gt;The interesting technical point is complementarity. A second brain can feed a wiki; a wiki can supply a clean source for RAG; RAG can make a second brain easier to search. None of those roles makes the abstractions interchangeable. The production-oriented &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG tutorial&lt;/a&gt; spells out the machine-side retrieval stack; read alongside a personal vault, it clarifies what human-curated notes preserve that query-time retrieval alone does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools for a Second Brain
&lt;/h2&gt;

&lt;p&gt;People gravitate to tool wars because tools are visible and structure is not, yet the tool is usually the least informative part of the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Obsidian
&lt;/h3&gt;

&lt;p&gt;Obsidian appeals because it pairs local markdown files with internal links, backlinks, properties, and graph-style navigation—it feels like a knowledge base first and a text editor second. For technical users who care about file ownership and link-driven structure, that combination is hard to ignore. Vault-oriented setup detail lives in &lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/" rel="noopener noreferrer"&gt;Using Obsidian for personal knowledge management&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logseq
&lt;/h3&gt;

&lt;p&gt;Logseq speaks to a different instinct. It is local-first, privacy-oriented, and built around an outline model where daily journals, bullets, references, and nonlinear linking make the tool feel less like drafting documents and more like accumulating thought fragments that later connect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notion
&lt;/h3&gt;

&lt;p&gt;Notion sits closer to docs, lightweight databases, and team wiki workflows, while still supporting links, backlinks, and increasingly AI-driven search and summarization across connected workspaces. For anyone who wants one surface for docs, projects, and knowledge hubs, the appeal is obvious.&lt;/p&gt;

&lt;p&gt;Underneath those differences, all three can support a second brain—and all three can fail at it. Tool choice shifts ergonomics more than philosophy; a weak workflow inside a powerful tool stays weak, while a clear workflow inside a simpler tool still compounds. When Obsidian and Logseq are both on the table, &lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-vs-logseq-comparison/" rel="noopener noreferrer"&gt;Obsidian vs Logseq&lt;/a&gt; is the feature-level split readers usually want next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Second Brain Mistakes
&lt;/h2&gt;

&lt;p&gt;The first trap is collecting too much. Capture feels productive because it is frictionless, yet when everything seems worth saving, nothing stays salient. The usual outcome is a bloated archive with thin signal density.&lt;/p&gt;

&lt;p&gt;The second trap is over-structuring, often driven by anxiety. Extra folders, tags, naming rules, and dashboards feel safer, but systems that demand constant grooming stop serving thinking and begin consuming it.&lt;/p&gt;

&lt;p&gt;The third trap—both the most common and the most costly—is failing to express. Notes that never become output do not compound; they only accumulate. The promise of a second brain hinges on turning private fragments into public or practical artifacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a Second Brain Evolves
&lt;/h2&gt;

&lt;p&gt;Early on the system can look underwhelming—a handful of notes, a few saved links, perhaps a project page and some book highlights—and then the connections start.&lt;/p&gt;

&lt;p&gt;A meeting note links to a design decision; a blog draft links to a half-finished idea from six months earlier; a research note links to a bug report, which links to a product discussion, which loops back to a concept that once seemed unrelated. That is when static notes begin behaving like a dynamic system.&lt;/p&gt;

&lt;p&gt;Over time a second brain starts acting like a personal knowledge graph, which does not require a literal graph view. Value shifts from individual notes to relationships among them—the archive stops feeling like a cabinet of documents and starts feeling like a map of evolving context.&lt;/p&gt;

&lt;p&gt;That shift drives compounding. Notes become connections, connections become reusable patterns, and reusable patterns cultivate judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI and the Second Brain
&lt;/h2&gt;

&lt;p&gt;AI is the newest animating layer in this conversation, though not for the reason hype suggests. The payoff is not that AI replaces your second brain; it is that AI can make a human-centered second brain more capable. Readers routing notes toward assistants will find adjacent infrastructure context in &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI systems&lt;/a&gt;—orchestration, retrieval, and memory beyond a single chat prompt.&lt;/p&gt;

&lt;p&gt;In practice AI can fill three roles—summarizing large notes, transcripts, and documents; surfacing related ideas across a workspace faster than manual search; and augmenting expression through outlines, alternative framings, rough rewrites, or extracted action items.&lt;/p&gt;

&lt;p&gt;Those abilities edge toward magic until they don’t. AI does not decide what deserves to matter inside your system; it predicts relevance from patterns. Meaning still flows from human priorities, context, and taste—which is why “Can AI improve a second brain without replacing human judgment?” lands on a clear yes only because the judgment layer stays human.&lt;/p&gt;

&lt;p&gt;The strongest systems will probably braid both strands—human-curated notes supplying durable context, AI supplying acceleration through summarization, search, and transformation—so the model operates quickly over the archive without owning it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take Away
&lt;/h2&gt;

&lt;p&gt;“Second brain” is slightly misleading branding. The aim is not to manufacture another brain; it is to stop treating your first one like cold storage.&lt;/p&gt;

&lt;p&gt;A second brain is neither a single tool nor “just notes” nor a prettier folder tree. It is a system for capturing ideas, organizing them for retrieval, distilling them into reusable insight, and expressing them as work.&lt;/p&gt;

&lt;p&gt;That is why the concept survives tool churn. Apps change, interfaces change, and AI changes faster than both, yet the underlying failure mode persists—knowledge work breaks when useful ideas vanish between the moment of capture and the moment of need. A second brain is one of the few frameworks that treats that gap as a design problem rather than a character flaw.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;p&gt;To deepen your grasp of CODE and PARA, the philosophical idea of extended cognition, and the gap between human-centered notes and retrieval-first RAG, these readings are a practical next step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://fortelabs.com/blog/basboverview/" rel="noopener noreferrer"&gt;Building a Second Brain overview&lt;/a&gt; — Tiago Forte’s canonical introduction—the naming of the idea, the CODE workflow (Capture, Organize, Distill, Express), and the case for externalized cognition beyond sheer storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://fortelabs.com/blog/para/" rel="noopener noreferrer"&gt;PARA method&lt;/a&gt; — Practical organization by actionability rather than textbook taxonomy; especially helpful for thinking about retrieval friction versus folder perfectionism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://consc.net/papers/extended.html" rel="noopener noreferrer"&gt;The extended mind&lt;/a&gt; — Andy Clark and David Chalmers’ paper on cognitive extension—why notebooks, diagrams, and digital notes can count as part of the thinking process, not just accessories to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;Retrieval-augmented generation for knowledge-intensive NLP tasks&lt;/a&gt; — Lewis et al.’s foundational RAG paper; useful background for why RAG is built around query-time retrieval and differs in purpose from a curated personal vault.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.ibm.com/think/topics/retrieval-augmented-generation" rel="noopener noreferrer"&gt;What is retrieval-augmented generation?&lt;/a&gt; — A clear, implementation-minded explanation of RAG architecture and limits—good companion reading for the wiki versus second brain versus RAG comparison.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Bonus.&lt;/strong&gt; &lt;a href="https://fortelabs.com/blog/supersizing-the-mind-the-science-of-cognitive-extension/" rel="noopener noreferrer"&gt;Supersizing the mind — the science of cognitive extension&lt;/a&gt; — Forte connects extended-mind ideas to everyday knowledge work; a strong bridge between theory and practice.&lt;/p&gt;

</description>
      <category>obsidian</category>
      <category>wiki</category>
      <category>rag</category>
    </item>
    <item>
      <title>Idempotency in Distributed Systems That Actually Works</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 11 May 2026 11:37:09 +0000</pubDate>
      <link>https://forem.com/rosgluk/idempotency-in-distributed-systems-that-actually-works-5dl6</link>
      <guid>https://forem.com/rosgluk/idempotency-in-distributed-systems-that-actually-works-5dl6</guid>
      <description>&lt;p&gt;Idempotency in distributed systems is the property that saves you after the network lies, the queue retries, the client panics, and the operator hits replay. In production systems, duplicate delivery is normal. Duplicate side effects are the bug.&lt;/p&gt;

&lt;p&gt;HTTP defines an idempotent method as one where multiple identical requests have the same intended effect on the server as one request. That is why PUT, DELETE, and safe methods are idempotent in protocol semantics and can be retried automatically after a communication failure.&lt;/p&gt;

&lt;p&gt;That definition is useful, but it is not enough. In real architectures, idempotency is not an HTTP trivia answer. It is a business guarantee. If a customer hits "pay" once, you do not get to charge twice because a timeout happened between commit and response. If a worker updates inventory and crashes before acking the message, you do not get to decrement stock twice because the broker redelivered. That is the bar.&lt;/p&gt;

&lt;p&gt;The mistake I see over and over is treating idempotency as a transport feature instead of a system property. Queue deduplication, HTTP verbs, and client retries help, but none of them rescue a design that lets the same business intent create a second side effect. If you want the broader framing for how these integration decisions fit service boundaries and persistence trade-offs, start with &lt;a href="https://www.glukhov.org/app-architecture/" rel="noopener noreferrer"&gt;App Architecture in Production: Integration Patterns, Code Design, and Data Access&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where duplicates come from in production
&lt;/h2&gt;

&lt;p&gt;Duplicates do not appear because teams are careless. They appear because distributed systems retry, reorder, and replay.&lt;/p&gt;

&lt;p&gt;A client can send a create request, the server can commit it, and the response can still disappear on the wire. That is exactly why HTTP distinguishes idempotent methods and why payment APIs such as Stripe and PayPal expose explicit idempotency mechanisms for unsafe methods like POST.&lt;/p&gt;

&lt;p&gt;Message brokers make the problem even more obvious. At-least-once delivery means a consumer can be invoked repeatedly for the same message, and a handler can update the database successfully but fail before acknowledgment, causing the broker to deliver the same message again.&lt;/p&gt;

&lt;p&gt;Webhooks are no different. GitHub says webhook deliveries can arrive out of order, failed deliveries are not automatically redelivered, and each delivery carries a unique &lt;code&gt;X-GitHub-Delivery&lt;/code&gt; GUID that you should use when protecting against replay. For a practical architecture view of chat endpoints as interaction boundaries, see &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/chat-platforms-as-system-interfaces/" rel="noopener noreferrer"&gt;Chat Platforms as System Interfaces in Modern Systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even systems that advertise stronger guarantees still leave you work to do. Kafka can prevent duplicate entries in Kafka logs with idempotent producers and can provide exactly-once delivery for read-process-write flows that stay inside Kafka with transactions and &lt;code&gt;read_committed&lt;/code&gt; consumers. But Kafka's own design docs are clear that external systems still require coordination with offsets and outputs. Google Cloud Pub/Sub exactly-once delivery is limited to pull subscriptions, within a cloud region, and still requires clients to track processing progress until acknowledgment succeeds.&lt;/p&gt;

&lt;p&gt;My opinionated summary is simple. Assume the transport will retry. Assume operators will replay. Assume webhooks will arrive late. Design the write path so a repeated intent cannot create a second business effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API contract I actually trust
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do idempotency keys prevent duplicate API requests
&lt;/h3&gt;

&lt;p&gt;The only API contract I trust for mutating operations is caller-supplied intent plus server-side persistence.&lt;/p&gt;

&lt;p&gt;AWS recommends a caller-provided request identifier and warns that the service must atomically record the idempotency token together with the mutating work. Stripe stores the first status code and response body for a key, compares later parameters with the original request, and returns the same result for retries. PayPal uses &lt;code&gt;PayPal-Request-Id&lt;/code&gt; on supported POST APIs and returns the latest status for the previous request with that same header.&lt;/p&gt;

&lt;p&gt;That leads to a practical contract:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client generates an idempotency key for a business operation.&lt;/li&gt;
&lt;li&gt;The server scopes that key by tenant and operation name.&lt;/li&gt;
&lt;li&gt;The server stores a request hash so the same key cannot be reused for a different payload.&lt;/li&gt;
&lt;li&gt;The server records state such as &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;, or &lt;code&gt;failed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Retries with the same key either return the stored outcome or a stable pointer to it.&lt;/li&gt;
&lt;li&gt;Retries with the same key and a different payload fail loudly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is an IETF &lt;code&gt;Idempotency-Key&lt;/code&gt; header draft, but as of 2026-05-09 it is still listed in the IETF Datatracker as an expired Internet-Draft rather than a published RFC. In practice, the header name is still widely useful as a de facto convention, but you should document the contract in your own API instead of pretending the standard is finished.&lt;/p&gt;

&lt;p&gt;What should the key represent? Intent. Not an HTTP attempt. Not a TCP connection. Not a retry counter. If the user means "create order 123 once", every retry for that same command must reuse the same key. If the user means "place a second order", that must use a different key.&lt;/p&gt;

&lt;p&gt;A request ID is for tracing. An idempotency key is for correctness. If you mix those up, your dashboards look tidy while your money moves twice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why PUT is not enough
&lt;/h3&gt;

&lt;p&gt;No, HTTP PUT is not enough to make an operation idempotent.&lt;/p&gt;

&lt;p&gt;Yes, RFC 9110 gives PUT idempotent semantics. But if your PUT handler emits a new downstream event, sends an email on every retry, or charges an external provider again, then your implementation has violated the business contract even if your route name looks respectable.&lt;/p&gt;

&lt;p&gt;Verb choice helps clients understand intent. It does not implement intent for you.&lt;/p&gt;

&lt;p&gt;Use PUT when the resource model genuinely fits a full replacement or upsert style operation. Use POST when you are creating commands or actions. But for any mutation that might be retried across network boundaries, document an explicit idempotency contract. If your mutating actions are triggered from chat workflows, the same contract applies in &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/slack/" rel="noopener noreferrer"&gt;Slack Integration Patterns for Alerts and Workflows&lt;/a&gt; and &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/discord/" rel="noopener noreferrer"&gt;Discord Integration Pattern for Alerts and Control Loops&lt;/a&gt;. Hidden side effects are where architecture goes to die.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long should an idempotency key be stored
&lt;/h3&gt;

&lt;p&gt;Longer than your transport team wants.&lt;/p&gt;

&lt;p&gt;Stripe says keys can be pruned after at least 24 hours. PayPal says retention is API specific and gives examples that can last up to 45 days. Amazon SQS FIFO deduplicates only within a 5-minute window. GitHub keeps recent deliveries for 3 days for manual redelivery. Those numbers are wildly different because the right retention period is a business decision, not a protocol default.&lt;/p&gt;

&lt;p&gt;If you only keep keys for five minutes because your queue does, you are not designing idempotency. You are copying a transport limitation into your business layer.&lt;/p&gt;

&lt;p&gt;Keep idempotency records for at least the maximum of these windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;client retry horizon&lt;/li&gt;
&lt;li&gt;queue redrive horizon&lt;/li&gt;
&lt;li&gt;webhook replay horizon&lt;/li&gt;
&lt;li&gt;operator replay horizon&lt;/li&gt;
&lt;li&gt;settlement or compensation horizon for money-moving operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For payments, bookings, and provisioning, that often means hours or days, not minutes.&lt;/p&gt;

&lt;p&gt;AWS also calls out two anti-patterns I fully agree with. Do not use timestamps as the key, because clock skew and collisions make them unreliable. Do not blindly store entire request payloads as the dedup record for every request, because that harms performance and scalability. Store a normalized request hash plus the minimum response state you need to replay safely. If you must reproduce the first response byte for byte, store the canonical response body the way Stripe does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The database patterns that make idempotency real
&lt;/h2&gt;

&lt;p&gt;Idempotency becomes real when the persistence layer can win a race exactly once.&lt;/p&gt;

&lt;p&gt;PostgreSQL gives you two critical primitives here. Unique constraints enforce uniqueness on one or more columns, and &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt; lets you define an alternative action instead of failing on a uniqueness violation. PostgreSQL also documents that &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; guarantees an atomic insert-or-update outcome under concurrency.&lt;/p&gt;

&lt;p&gt;That means your idempotency layer should usually start with a table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;api_idempotency&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;request_hash&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the handling flow should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin transaction

try insert (tenant_id, operation, idempotency_key, request_hash, state='pending')
on conflict do nothing

load row for (tenant_id, operation, idempotency_key) for update

if row.request_hash != incoming_request_hash
    fail with conflict or validation error

if row.state = 'completed'
    return stored response

if row.state = 'pending' and row was created by another live request
    either wait briefly, or fail fast with a retryable response

perform local business mutation

store stable result in idempotency row
set state = 'completed'

commit
return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is not the syntax. The important part is the atomicity. Recording the key and performing the mutation must succeed or fail together. AWS says this explicitly for API idempotency, and the same rule applies in SQL-backed services.&lt;/p&gt;

&lt;p&gt;Do not do a naive check-then-act sequence like "select key; if missing then insert order". Under concurrency, two requests can pass the check and both create the side effect. A unique constraint is not optional. It is the mechanism that turns your architecture from optimistic folklore into something you can prove under load.&lt;/p&gt;

&lt;p&gt;Here is the rule I use in reviews. If the dedup decision is not protected by the same transactional boundary as the mutation, you do not have idempotency. You have hope.&lt;/p&gt;

&lt;h2&gt;
  
  
  Messages, events, and webhooks need their own boundary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do consumers handle duplicate events and messages
&lt;/h3&gt;

&lt;p&gt;For message consumers, the classic pattern is still the right one. Record processed message IDs in the same database transaction as the business update. Chris Richardson describes the &lt;code&gt;PROCESSED_MESSAGES&lt;/code&gt; table approach directly, using a primary key on subscriber and message ID so duplicates fail cleanly and can be ignored.&lt;/p&gt;

&lt;p&gt;Many teams call that explicit &lt;code&gt;processed_messages&lt;/code&gt; store an inbox table. The label matters less than the rule. The receiver must persist proof that it already handled the message before a retry can safely do nothing.&lt;/p&gt;

&lt;p&gt;A minimal form looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;processed_messages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;subscriber_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;processed_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriber_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the consumer flow is just as strict as the HTTP flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin transaction

insert into processed_messages (subscriber_id, message_id)
values (?, ?)
on conflict do nothing

if no row inserted
    rollback
    ack and ignore duplicate

apply business mutation

commit
ack message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pattern is boring. Good. Idempotency should be boring.&lt;/p&gt;

&lt;p&gt;It is also usually better than trying to lean on broker marketing terms. Kafka's exactly-once support is excellent when you stay inside Kafka's own transactional model, but Kafka's docs still warn that external destinations need cooperation. SQS FIFO reduces duplicate sends only within its 5-minute dedup window. Pub/Sub exactly-once still expects the subscriber to track progress and avoid duplicate work when acknowledgments fail.&lt;/p&gt;

&lt;p&gt;Exactly-once is usually a local optimization. Idempotent side effects are the system guarantee.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pair dedup with the outbox pattern
&lt;/h3&gt;

&lt;p&gt;If your service updates local state and also publishes an event, idempotent consumption alone is not enough. You also need a safe way to get the event out after the local transaction commits.&lt;/p&gt;

&lt;p&gt;That is why the transactional outbox pattern matters. Chris Richardson describes the basic idea as writing the event to an outbox table in the same transaction as the business update, and then publishing it asynchronously. Debezium says the outbox pattern avoids inconsistencies between a service's internal state and the events consumed by other services. NServiceBus goes further and shows how outbox processing deduplicates incoming messages and avoids zombie records and ghost messages.&lt;/p&gt;

&lt;p&gt;This is the architecture I recommend for services that own data and publish integration events:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate and persist the command under an idempotency key.&lt;/li&gt;
&lt;li&gt;Write business state and outbox event in one local transaction.&lt;/li&gt;
&lt;li&gt;Let CDC or an outbox dispatcher publish the event.&lt;/li&gt;
&lt;li&gt;Make downstream consumers idempotent too.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Outbox does not remove the need for idempotent consumers. It removes the need to pretend that a database commit and a broker publish can be one magical distributed transaction when they usually cannot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Webhooks are just messages with better branding
&lt;/h3&gt;

&lt;p&gt;Treat inbound webhooks exactly like messages from an untrusted network edge.&lt;/p&gt;

&lt;p&gt;GitHub documents that deliveries can arrive out of order, recommends using &lt;code&gt;X-Hub-Signature-256&lt;/code&gt; to verify authenticity, and provides &lt;code&gt;X-GitHub-Delivery&lt;/code&gt; as the unique delivery identifier. It also notes that redeliveries reuse the same delivery ID.&lt;/p&gt;

&lt;p&gt;So the architecture is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verify the signature first&lt;/li&gt;
&lt;li&gt;use the delivery GUID as the dedup key&lt;/li&gt;
&lt;li&gt;persist receipt before side effects&lt;/li&gt;
&lt;li&gt;make handlers order-aware rather than assuming arrival order&lt;/li&gt;
&lt;li&gt;enqueue the heavy work and return fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your webhook handler writes directly to business tables before it records receipt, it is not production-ready. It is just faster at making duplicate mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sagas and workflow engines still need idempotency
&lt;/h2&gt;

&lt;p&gt;Sagas and durable workflow engines do not delete the problem. They make it visible.&lt;/p&gt;

&lt;p&gt;Temporal recommends writing Activities to be idempotent because Activities can be retried after failures or timeouts. Its docs even call out the edge case where a worker completes an external side effect successfully but crashes before reporting completion, which causes the Activity to run again. Temporal also suggests using a combination of Workflow Run ID and Activity ID as a stable idempotency key when calling downstream services. If you are applying this in service orchestration, &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/go-microservices-for-ai-ml-orchestration-patterns/" rel="noopener noreferrer"&gt;Go Microservices for AI/ML Orchestration&lt;/a&gt; covers the broader workflow trade-offs.&lt;/p&gt;

&lt;p&gt;That is exactly the right mental model. A workflow engine can preserve execution history and coordinate retries. It cannot retroactively uncharge a card or unsend an email unless your application gives it idempotent steps and idempotent compensations.&lt;/p&gt;

&lt;p&gt;The same applies to sagas. Temporal's own saga guidance describes compensating actions that run when a step fails. Those compensations must be idempotent too. If "refund payment" runs twice, you may have solved the original bug by creating a new one.&lt;/p&gt;

&lt;p&gt;My rule here is brutal and simple. Every Activity, every command handler, and every compensation that touches the outside world should either be naturally idempotent or carry a real idempotency key to the downstream system.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to test idempotency before production
&lt;/h2&gt;

&lt;p&gt;Most teams test happy paths and then act surprised when retries happen. That is not enough.&lt;/p&gt;

&lt;p&gt;You should have automated tests for at least these cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the server commits the mutation but the response never reaches the client&lt;/li&gt;
&lt;li&gt;two identical requests race with the same idempotency key&lt;/li&gt;
&lt;li&gt;the same key is reused with a different payload&lt;/li&gt;
&lt;li&gt;a consumer commits its database work and crashes before ack&lt;/li&gt;
&lt;li&gt;a webhook is replayed with the same delivery ID&lt;/li&gt;
&lt;li&gt;an outbox dispatcher publishes the same event more than once&lt;/li&gt;
&lt;li&gt;a workflow Activity completes the external call and crashes before completion is reported&lt;/li&gt;
&lt;li&gt;an idempotency record expires and a genuine late retry arrives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS explicitly recommends comprehensive test suites that include successful requests, failed requests, and duplicate requests. That advice is pedestrian and absolutely correct.&lt;/p&gt;

&lt;p&gt;I would add one more failure drill. Verify that the replayed response is semantically equivalent to the first result. AWS discusses late-arriving retries and argues for responses that preserve the original meaning even after underlying state has changed. That is the difference between "no extra side effect happened" and "the caller still has a consistent contract."&lt;/p&gt;

&lt;h2&gt;
  
  
  Opinionated rules that save real systems
&lt;/h2&gt;

&lt;p&gt;Here are the rules I would enforce in an architecture review.&lt;/p&gt;

&lt;p&gt;First, idempotency keys belong to business intent, not transport attempts.&lt;/p&gt;

&lt;p&gt;Second, scope every key by tenant and operation. Global key spaces are how unrelated requests collide.&lt;/p&gt;

&lt;p&gt;Third, persist the dedup decision atomically with the mutation. If that is not true, the design is wrong.&lt;/p&gt;

&lt;p&gt;Fourth, reject same-key different-payload retries. Stripe and AWS both do this for good reason.&lt;/p&gt;

&lt;p&gt;Fifth, keep keys for the full replay horizon of the business process, not for the shortest queue window.&lt;/p&gt;

&lt;p&gt;Sixth, pair producers with an outbox and consumers with message ID tracking. One side without the other is half a design.&lt;/p&gt;

&lt;p&gt;Seventh, propagate the same operation identity downstream when the business action is the same. AWS explicitly recommends passing the idempotency token along the processing chain.&lt;/p&gt;

&lt;p&gt;Eighth, never assume exactly-once marketing removes the need for idempotent side effects.&lt;/p&gt;

&lt;p&gt;If that sounds strict, good. Idempotency is where optimistic architecture meets production reality. You do not need complexity everywhere. But wherever duplicate side effects would hurt money, state, or trust, idempotency should be a first-class part of the contract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.rfc-editor.org/rfc/rfc9110.html" rel="noopener noreferrer"&gt;RFC 9110, HTTP Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/pubsub/docs/exactly-once-delivery" rel="noopener noreferrer"&gt;Google Cloud Pub/Sub, Exactly-once delivery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/webhooks/testing-and-troubleshooting-webhooks/redelivering-webhooks" rel="noopener noreferrer"&gt;GitHub Docs, Redelivering webhooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.temporal.io/evaluate/use-cases-design-patterns" rel="noopener noreferrer"&gt;Temporal Documentation, Use cases and design patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datatracker.ietf.org/doc/draft-ietf-httpapi-idempotency-key-header/" rel="noopener noreferrer"&gt;IETF Datatracker, The Idempotency-Key HTTP Header Field&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dev</category>
      <category>microservices</category>
      <category>api</category>
    </item>
    <item>
      <title>Hermes Voice Control from Your Phone</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 10 May 2026 11:12:56 +0000</pubDate>
      <link>https://forem.com/rosgluk/hermes-voice-control-from-your-phone-3fm6</link>
      <guid>https://forem.com/rosgluk/hermes-voice-control-from-your-phone-3fm6</guid>
      <description>&lt;p&gt;You already chat to Hermes Agent from your phone with text.&lt;br&gt;
Now you want to talk to it directly and get spoken replies back.&lt;br&gt;
That is usually the right move, especially if you already use &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes as a persistent self-hosted assistant&lt;/a&gt;.&lt;br&gt;
Typing long prompts on a small screen is slow and error-prone&lt;/p&gt;



&lt;p&gt;Voice mode makes Hermes practical in the moments where it matters most, while walking, commuting, or doing admin work away from your desk.&lt;/p&gt;

&lt;p&gt;The good news is that voice mode can run with zero paid APIs. A local faster-whisper model handles transcription, and Edge TTS handles spoken output for free. This guide covers setup, provider choices, platform differences, practical command patterns, and the failure modes that usually block first-time users.&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Pipeline Works
&lt;/h2&gt;

&lt;p&gt;Three stages, no magic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transcription STT&lt;/strong&gt; — Your voice message becomes text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt; — Hermes processes that text exactly like a typed request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis TTS&lt;/strong&gt; — The response text is converted back to audio.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important distinction from consumer assistants is execution depth. Hermes is not just answering trivia. It can call tools, inspect files, run code paths, and continue multi-step work from memory. In practice, that means voice can trigger real workflows such as incident triage, draft generation, and targeted debugging. If you want the broader architecture context, the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems pillar&lt;/a&gt; explains how this voice layer fits into local agent infrastructure.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Voice Control Is Great For
&lt;/h2&gt;

&lt;p&gt;Use voice mode when keyboard precision is not required yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational checks&lt;/strong&gt; while away from your laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idea capture&lt;/strong&gt; for drafts, outlines, and rough specs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast triage&lt;/strong&gt; of alerts and errors before deeper desktop follow-up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hands-busy workflows&lt;/strong&gt; where speaking is the only realistic input channel.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Voice Input: Pick an STT Provider
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;API Key&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local faster-whisper&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;On-device, ~150 MB model, 90+ languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groq Whisper&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROQ_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast cloud inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;&lt;code&gt;VOICE_TOOLS_OPENAI_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Highest accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Voxtral&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MISTRAL_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Alternative cloud option&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;
  &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;base&lt;/span&gt;  &lt;span class="c1"&gt;# tiny, base, small, medium, large-v3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with &lt;code&gt;local&lt;/code&gt;. It works immediately, handles multilingual speech, and adds no recurring cost. Move to Groq or OpenAI only if your local setup cannot meet your latency or accuracy requirements. For command-level setup and diagnostics while testing providers, keep the &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-cli-cheatsheet/" rel="noopener noreferrer"&gt;Hermes CLI cheat sheet&lt;/a&gt; nearby.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faster Whisper Model Selection
&lt;/h3&gt;

&lt;p&gt;Use a simple progression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;tiny&lt;/strong&gt; for very low-power devices where speed matters most.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;base&lt;/strong&gt; as the default balance for laptops and small servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;small&lt;/strong&gt; when accents, noisy environments, or domain terms reduce accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;medium or large-v3&lt;/strong&gt; when quality is critical and hardware budget is higher.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your transcripts are consistently wrong, increase model size first before adding more prompt complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Voice Output: TTS Providers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Edge TTS (default)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Quick start, 322 voices, 74 languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;Premium quality, voice cloning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI TTS&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;Natural voices, 6 options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax TTS&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;Fine-grained speed/volume/pitch control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NeuTTS&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free (local)&lt;/td&gt;
&lt;td&gt;Fully offline, voice cloning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edge"&lt;/span&gt;
  &lt;span class="na"&gt;speed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;

  &lt;span class="na"&gt;edge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;voice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US-AriaNeural"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One critical detail is output format. Telegram voice bubbles are most reliable when audio is encoded as OGG with Opus. Hermes relies on ffmpeg for these conversions in common setups. If ffmpeg is missing, replies often show up as file attachments instead of inline voice bubbles.&lt;/p&gt;

&lt;p&gt;Install ffmpeg early:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg  &lt;span class="c"&gt;# Ubuntu/Debian&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg       &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Platform Workflows and Practical Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Telegram
&lt;/h3&gt;

&lt;p&gt;Telegram is the easiest place to start. Voice messages are first-class on mobile, and the interaction loop is simple hold, speak, release, receive.&lt;/p&gt;

&lt;p&gt;Setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create a bot via @BotFather, get your token&lt;/span&gt;
&lt;span class="c"&gt;# 2. Add to ~/.hermes/.env:&lt;/span&gt;
&lt;span class="nv"&gt;TELEGRAM_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;***&lt;/span&gt;
&lt;span class="nv"&gt;TELEGRAM_ALLOWED_USERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_user_id

&lt;span class="c"&gt;# 3. Start the gateway&lt;/span&gt;
hermes gateway start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open the Hermes chat, tap the microphone, and speak. If STT and TTS are enabled, Hermes transcribes your request, executes it, and sends a voice reply.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discord
&lt;/h3&gt;

&lt;p&gt;Discord supports two useful modes. Voice messages in DMs or channels are close to Telegram behavior.&lt;/p&gt;

&lt;p&gt;The more advanced option is live voice channels. In that flow, Hermes can participate continuously, transcribing speech and responding without explicit message bubbles.&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message Content Intent enabled in your bot settings&lt;/li&gt;
&lt;li&gt;Server Members Intent enabled&lt;/li&gt;
&lt;li&gt;Bot permissions: Connect and Speak&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Signal
&lt;/h3&gt;

&lt;p&gt;Signal works through the &lt;code&gt;signal-cli&lt;/code&gt; daemon. Voice messages still use the same Hermes STT and TTS pipeline.&lt;/p&gt;

&lt;p&gt;A useful pattern is running &lt;code&gt;signal-cli&lt;/code&gt; as a linked device and using Signal Note to Self. You can leave yourself a voice note and get Hermes output in the same thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  WhatsApp
&lt;/h3&gt;

&lt;p&gt;WhatsApp follows the same gateway model. Audio messages transcribe automatically once the connector is configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mobile App Permissions
&lt;/h2&gt;

&lt;p&gt;Both iOS and Android need microphone access for the messaging app you're using.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iOS:&lt;/strong&gt; Settings → Telegram (or Discord) → Permissions → Microphone → Allow. Enable Background App Refresh for instant responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Android:&lt;/strong&gt; Settings → Apps → Telegram → Permissions → Microphone → Allow. For Discord voice channels, enable overlay permission.&lt;/p&gt;

&lt;p&gt;Pinning the Hermes bot chat to your home screen helps — one tap to start speaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaking Patterns That Work Reliably
&lt;/h2&gt;

&lt;p&gt;Voice interaction has different ergonomics than typing. You cannot easily paste logs or quote long stack traces, so structure matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Be explicit.&lt;/strong&gt; Say the action, scope, and output format in one sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep one objective per message.&lt;/strong&gt; Split multi-step jobs into short follow-ups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constrain output.&lt;/strong&gt; Ask for numbered actions or a 3-point summary when mobile readability matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay short.&lt;/strong&gt; Around 10 to 30 seconds per message usually transcribes better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use iterative turns.&lt;/strong&gt; Correct and refine in the next voice message instead of overloading the first.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Prompts You Can Speak
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Check deployment logs for the last one hour and report only critical errors."&lt;/li&gt;
&lt;li&gt;"Create a draft outline for a post about OpenTelemetry migration with five sections."&lt;/li&gt;
&lt;li&gt;"Summarize this bug in three bullets and propose the most likely root cause."&lt;/li&gt;
&lt;li&gt;"Review the config and tell me what to change for lower transcription latency."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Use Cases with Concrete Outcomes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt; — "Check production health and list failed services."
Outcome is a focused status update you can act on immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing&lt;/strong&gt; — "Turn these rough points into a publishable intro paragraph."
Outcome is polished text from spoken notes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debug triage&lt;/strong&gt; — "Investigate this TypeError and suggest the first fix to test."
Outcome is a concrete next step before opening the IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt; — "Find three recent sources on topic X and summarize differences."
Outcome is a compressed briefing for later deep work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; — "Run the home routine and confirm device states."
Outcome is direct action plus confirmation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Voice messages not transcribing:&lt;/strong&gt; Confirm &lt;code&gt;stt.enabled: true&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;. Verify local dependencies are installed. Then restart with &lt;code&gt;hermes gateway restart&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTS not responding:&lt;/strong&gt; Confirm &lt;code&gt;tts.provider&lt;/code&gt; is set. If using a paid provider, verify the API key in &lt;code&gt;.env&lt;/code&gt;. Validate current voice settings from the Hermes CLI status commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor transcription quality:&lt;/strong&gt; Increase &lt;code&gt;stt.local.model&lt;/code&gt; from &lt;code&gt;base&lt;/code&gt; to &lt;code&gt;small&lt;/code&gt; or &lt;code&gt;medium&lt;/code&gt;. Reduce noise and speak in shorter segments. If needed, switch to cloud STT for better accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice bubbles showing as files on Telegram:&lt;/strong&gt; Install ffmpeg and restart the gateway. This is the most common issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Free Stack
&lt;/h2&gt;

&lt;p&gt;For cost-conscious setups, this baseline is strong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; Local faster-whisper with no API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS:&lt;/strong&gt; Edge TTS with wide language coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total cost:&lt;/strong&gt; $0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a meaningful advantage over many closed assistants where voice quality and automation quickly become paid-only features.&lt;/p&gt;

&lt;p&gt;If quality requirements increase, upgrade one layer at a time. Usually STT upgrades produce the biggest immediate gain, then TTS quality can be improved later if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ Topics in Practice
&lt;/h2&gt;

&lt;p&gt;The four most common user questions are predictable. They also overlap with memory and profile design concerns covered in &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt; and &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes production setup patterns&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether voice commands get the same tool access as text.&lt;/li&gt;
&lt;li&gt;Whether a free stack is viable for daily use.&lt;/li&gt;
&lt;li&gt;Why Telegram sometimes shows attachments instead of voice bubbles.&lt;/li&gt;
&lt;li&gt;Which local Whisper model should be used first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide addresses each of these directly in setup, tuning, and troubleshooting sections so you can move from first run to stable daily usage quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start Recap
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install voice extras&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"hermes-agent[all]"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Set up Telegram gateway&lt;/span&gt;
hermes gateway setup

&lt;span class="c"&gt;# 3. Install ffmpeg (required for Telegram voice bubbles)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg

&lt;span class="c"&gt;# 4. Send a voice message from your phone&lt;/span&gt;
&lt;span class="c"&gt;# Hermes transcribes, processes, and responds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, iterate based on your real bottleneck. If latency is the issue, tune model size or cloud STT. If audio quality is the issue, tune TTS provider and voice preset. Start free, measure, then upgrade only where it actually improves your workflow.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>Kanban in Hermes Agent for Self Hosted LLM Workflows</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 08 May 2026 09:48:56 +0000</pubDate>
      <link>https://forem.com/rosgluk/kanban-in-hermes-agent-for-self-hosted-llm-workflows-1ekf</link>
      <guid>https://forem.com/rosgluk/kanban-in-hermes-agent-for-self-hosted-llm-workflows-1ekf</guid>
      <description>&lt;p&gt;Hermes Agent ships with a Kanban-style board and the Hermes Gateway that can saturate your self-hosted LLM if too many tasks are dispatched at once.&lt;/p&gt;

&lt;p&gt;I can say you can easily ddos your own LLM this way.&lt;/p&gt;

&lt;p&gt;Hermes Kanban is a durable multi-profile board backed by &lt;code&gt;~/.hermes/kanban.db&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
Each lane represents a phase of work, and each card is a task that can be claimed by a specific Hermes profile.&lt;br&gt;&lt;br&gt;
Out of the box, the dispatcher can promote many &lt;code&gt;ready&lt;/code&gt; tasks in one pass. That is fine for elastic cloud APIs, but it can overload a small self-hosted GPU cluster.&lt;/p&gt;

&lt;p&gt;If you are new to this stack, start with the broader &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes setup and operations guide&lt;/a&gt; and the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems pillar&lt;/a&gt; for surrounding architecture.&lt;/p&gt;

&lt;p&gt;This post shows how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand&lt;/strong&gt; how Hermes Kanban dispatch interacts with your LLM gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; parallelism safely for heavy tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch&lt;/strong&gt; promotions with cron so background jobs do not collide with interactive use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt; and tune the system so GPUs stay busy without overload.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How Hermes Kanban and the dispatcher work
&lt;/h2&gt;

&lt;p&gt;At a high level, the system has three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Board&lt;/strong&gt; - durable SQLite state for tasks, columns, relations, and history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workers&lt;/strong&gt; - Hermes profiles started in isolated workspaces to process a task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatcher&lt;/strong&gt; - a long-lived process that scans for dispatchable cards and starts runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tasks created from CLI or dashboard usually start in &lt;code&gt;backlog&lt;/code&gt; or &lt;code&gt;ready&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
The dispatcher scans for eligible cards, claims one atomically, and starts the assigned profile with its tools and memory.&lt;br&gt;&lt;br&gt;
Each worker then calls your LLM gateway or local runtime (for example, OpenAI-compatible endpoints backed by Ollama, vLLM, or llama.cpp). For deployment choices across these runtimes, use the &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM Hosting in 2026 Local Self-Hosted and Cloud Infrastructure Compared&lt;/a&gt;. If you are tuning request fan-out on Ollama itself, this pairs well with &lt;a href="https://www.glukhov.org/llm-performance/ollama/how-ollama-handles-parallel-requests/" rel="noopener noreferrer"&gt;How Ollama Handles Parallel Requests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you add many heavy tasks and do not cap promotions, your gateway can get flooded with concurrent requests.&lt;br&gt;&lt;br&gt;
On a single-GPU or CPU-bound host, that often means queueing, thrashing, and timeouts instead of better throughput.&lt;/p&gt;
&lt;h2&gt;
  
  
  The practical limitation today
&lt;/h2&gt;

&lt;p&gt;In current Hermes builds many teams run, dispatcher config exposes only two Kanban dispatch keys and does not apply a global active-task cap from config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kanban&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dispatch_in_gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;dispatch_interval_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For active-task control, rely on explicit dispatch cadence (&lt;code&gt;hermes kanban dispatch --max ...&lt;/code&gt;) plus dependency modeling.&lt;/p&gt;

&lt;p&gt;Known gotchas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not run gateway-embedded dispatch and &lt;code&gt;hermes kanban daemon --force&lt;/code&gt; against the same board, or you can get claim races.&lt;/li&gt;
&lt;li&gt;If the gateway is down, &lt;code&gt;ready&lt;/code&gt; tasks do not dispatch and can burst later when service returns.&lt;/li&gt;
&lt;li&gt;Longer dispatch intervals feel uneven because claiming happens in ticks.&lt;/li&gt;
&lt;li&gt;Behavior can vary across versions because run-state and reclaim edge cases were patched over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quick verification when behavior looks wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1) confirm exactly one dispatcher path is active&lt;/span&gt;
pgrep &lt;span class="nt"&gt;-af&lt;/span&gt; &lt;span class="s2"&gt;"hermes gateway start|hermes kanban daemon"&lt;/span&gt;

&lt;span class="c"&gt;# 2) check the wired Kanban dispatcher keys&lt;/span&gt;
rg &lt;span class="s2"&gt;"dispatch_in_gateway|dispatch_interval_seconds"&lt;/span&gt; ~/.hermes/config.yaml

&lt;span class="c"&gt;# 3) inspect queue shape&lt;/span&gt;
hermes kanban list &lt;span class="nt"&gt;--status&lt;/span&gt; ready
hermes kanban list &lt;span class="nt"&gt;--status&lt;/span&gt; running
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dispatcher config wires &lt;code&gt;dispatch_in_gateway&lt;/code&gt; and &lt;code&gt;dispatch_interval_seconds&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dispatch --max&lt;/code&gt; limits new spawns in that pass, not total running tasks.&lt;/li&gt;
&lt;li&gt;For small self-hosted clusters, start conservative and increase only after latency stays stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When first deploying Hermes near your LLM gateway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep only supported Kanban dispatcher keys in config.&lt;/li&gt;
&lt;li&gt;Observe GPU and CPU utilization under real queue pressure.&lt;/li&gt;
&lt;li&gt;Use Strategy 1 or Strategy 2 for deterministic pacing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Investigation findings and root cause
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;hermes kanban dispatch&lt;/code&gt; does not read &lt;code&gt;config.yaml&lt;/code&gt; for &lt;code&gt;max_active_tasks&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;hermes_cli/kanban.py&lt;/code&gt;, the dispatch command exposes &lt;code&gt;--max&lt;/code&gt; as a CLI cap (default &lt;code&gt;None&lt;/code&gt;) and passes only &lt;code&gt;args.max&lt;/code&gt; into &lt;code&gt;kb.dispatch_once(...)&lt;/code&gt;. There is no &lt;code&gt;max_active_tasks&lt;/code&gt; config lookup in this path. See &lt;a href="https://github.com/NousResearch/hermes-agent/raw/refs/heads/main/hermes_cli/kanban.py" rel="noopener noreferrer"&gt;hermes_cli/kanban.py raw&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then in &lt;code&gt;kanban_db.dispatch_once&lt;/code&gt;, the only cap is &lt;code&gt;max_spawn&lt;/code&gt;, with logic equivalent to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_spawn&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;spawned&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_spawn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no check of already running tasks and no &lt;code&gt;max_active_tasks&lt;/code&gt; reference in that dispatch path. See &lt;a href="https://github.com/NousResearch/hermes-agent/raw/refs/heads/main/hermes_cli/kanban_db.py" rel="noopener noreferrer"&gt;hermes_cli/kanban_db.py raw&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Effective behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes kanban dispatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;unbounded for that pass (limited by ready queue size).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes kanban dispatch &lt;span class="nt"&gt;--max&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;caps only new spawns in that pass, not total running tasks.&lt;/p&gt;

&lt;p&gt;The wired config knobs around gateway dispatch are &lt;code&gt;kanban.dispatch_in_gateway&lt;/code&gt; and &lt;code&gt;kanban.dispatch_interval_seconds&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
So &lt;code&gt;max_active_tasks&lt;/code&gt; is ignored in this dispatch path because it is not implemented there.&lt;/p&gt;
&lt;h2&gt;
  
  
  Strategy 1 - Encode dependencies for strictly sequential flows
&lt;/h2&gt;

&lt;p&gt;Some workflows should run strictly one after another — for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi step data pipelines with shared intermediate artefacts&lt;/li&gt;
&lt;li&gt;migrations or infrastructure changes&lt;/li&gt;
&lt;li&gt;batch jobs that write to the same object store or database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hermes Kanban supports parent child dependencies between tasks so that a child card becomes dispatchable only when its parent is done.&lt;/p&gt;

&lt;p&gt;You can model this with a small helper script around the Hermes CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;parent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hermes kanban add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s1"&gt;'Ingest customer logs for April'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s1"&gt;'etl-worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--column&lt;/span&gt; backlog&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

hermes kanban add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s1"&gt;'Generate April anomaly report'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s1"&gt;'analytics-worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--column&lt;/span&gt; backlog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parent&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;parent_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

hermes kanban add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s1"&gt;'Publish April summary to dashboard'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s1"&gt;'reporting-worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--column&lt;/span&gt; backlog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parent&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;parent_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With an appropriate board policy and low dispatcher limits only the parent task runs first.&lt;br&gt;&lt;br&gt;
Once it finishes the child tasks gradually become ready, and the dispatcher pulls them one by one without ever exceeding your concurrency caps.&lt;/p&gt;
&lt;h2&gt;
  
  
  Strategy 2 - Use Linux cron with a running-aware dispatch cap
&lt;/h2&gt;

&lt;p&gt;If you want deterministic pacing, use host cron plus a small wrapper script.&lt;br&gt;&lt;br&gt;
Instead of always calling &lt;code&gt;dispatch --max 2&lt;/code&gt;, first count currently running tasks, then dispatch only the remaining slots.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;hermes-kanban-dispatch-capped.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;BOARD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BOARD&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BOARD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="nt"&gt;--board&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BOARD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# or where your hermes is installed&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/home/abc/.local/bin:&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;running_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hermes kanban &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; list &lt;span class="nt"&gt;--status&lt;/span&gt; running&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$running_out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"(no matching tasks)"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;running_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nv"&gt;running_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\n'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$running_out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;slots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; MAX_PARALLEL &lt;span class="o"&gt;-&lt;/span&gt; running_count &lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; slots &amp;lt;&lt;span class="o"&gt;=&lt;/span&gt; 0 &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Already at limit running=&lt;/span&gt;&lt;span class="nv"&gt;$running_count&lt;/span&gt;&lt;span class="s2"&gt; max=&lt;/span&gt;&lt;span class="nv"&gt;$MAX_PARALLEL&lt;/span&gt;&lt;span class="s2"&gt; dispatch skipped"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"running=&lt;/span&gt;&lt;span class="nv"&gt;$running_count&lt;/span&gt;&lt;span class="s2"&gt; max=&lt;/span&gt;&lt;span class="nv"&gt;$MAX_PARALLEL&lt;/span&gt;&lt;span class="s2"&gt; slots=&lt;/span&gt;&lt;span class="nv"&gt;$slots&lt;/span&gt;&lt;span class="s2"&gt; dispatching up to &lt;/span&gt;&lt;span class="nv"&gt;$slots&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

hermes kanban &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; dispatch &lt;span class="nt"&gt;--max&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$slots&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./hermes-kanban-dispatch-capped.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 ./hermes-kanban-dispatch-capped.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a specific board:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;BOARD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-board &lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 ./hermes-kanban-dispatch-capped.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schedule it once per minute with cron:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /opt/hermes/scripts/hermes-kanban-dispatch-capped.sh &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/hermes/kanban-cron.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Operational notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron often has a minimal &lt;code&gt;PATH&lt;/code&gt;, so if &lt;code&gt;hermes&lt;/code&gt; is not found, use its full path inside the script (for example &lt;code&gt;/usr/local/bin/hermes&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;If you log to &lt;code&gt;/var/log/hermes/...&lt;/code&gt;, create that directory first and ensure the cron user has write access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /var/log/hermes
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;:&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; /var/log/hermes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create or edit cron entries with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then verify with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-minute cadence with one cron entry
&lt;/h3&gt;

&lt;p&gt;Cron ticks once per minute, but you can still dispatch more frequently by running a short loop inside the script.&lt;/p&gt;

&lt;p&gt;Example &lt;code&gt;hermes-kanban-dispatch-subminute.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;LOCK_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/hermes-kanban-dispatch.lock"&lt;/span&gt;
&lt;span class="nv"&gt;RUNS_PER_MINUTE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RUNS_PER_MINUTE&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;    &lt;span class="c"&gt;# 4 runs =&amp;gt; every 15 seconds&lt;/span&gt;
&lt;span class="nv"&gt;CAP_SCRIPT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CAP_SCRIPT&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;/opt/hermes/scripts/hermes-kanban-dispatch-capped.sh&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;exec &lt;/span&gt;9&amp;gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
flock &lt;span class="nt"&gt;-n&lt;/span&gt; 9 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;0

&lt;span class="nv"&gt;sleep_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; RUNS_PER_MINUTE &lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; i&amp;lt;&lt;span class="o"&gt;=&lt;/span&gt;RUNS_PER_MINUTE&lt;span class="p"&gt;;&lt;/span&gt; i++&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CAP_SCRIPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; i &amp;lt; RUNS_PER_MINUTE &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;sleep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$sleep_seconds&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./hermes-kanban-dispatch-subminute.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schedule it once per minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /opt/hermes/scripts/hermes-kanban-dispatch-subminute.sh &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/hermes/kanban-subminute.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives an effective sub-minute cadence while &lt;code&gt;flock&lt;/code&gt; prevents overlapping runs.&lt;/p&gt;

&lt;p&gt;Why this works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;list --status running&lt;/code&gt; gives current running load.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dispatch --max N&lt;/code&gt; caps only new spawns for that pass.&lt;/li&gt;
&lt;li&gt;Computing &lt;code&gt;N&lt;/code&gt; as remaining slots keeps total running tasks near your target limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important caveat: this cap works only for dispatches made through this script.&lt;br&gt;&lt;br&gt;
Disable gateway embedded dispatch, otherwise it can still promote tasks independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kanban&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dispatch_in_gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official docs describe both command capabilities and note gateway dispatch defaults in the Kanban feature guide: &lt;a href="https://github.com/NousResearch/hermes-agent/blob/main/website/docs/user-guide/features/kanban.md" rel="noopener noreferrer"&gt;Hermes Kanban docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal Hermes Cron
&lt;/h2&gt;

&lt;p&gt;Do not use it.&lt;br&gt;
Do you really want your llm to process regular prompts like &lt;code&gt;Execute in terminal the command /path/hermes-kanban-dispatch-capped.sh&lt;/code&gt;, especially when it's busy doing some useful work?&lt;/p&gt;

&lt;h2&gt;
  
  
  Hermes Kanban Monitoring and Tuning
&lt;/h2&gt;

&lt;p&gt;Whichever strategy you choose you should monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM gateway metrics — request rate, latency, error rate, token throughput.&lt;/li&gt;
&lt;li&gt;Node health — GPU utilisation, VRAM usage, CPU load and RAM.&lt;/li&gt;
&lt;li&gt;Hermes metrics — how many tasks are in backlog, ready, active and done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production metric baselines and dashboards, see &lt;a href="https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/" rel="noopener noreferrer"&gt;Monitor LLM Inference in Production with Prometheus and Grafana&lt;/a&gt; and the broader &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;LLM Performance hub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Start with low concurrency, then gradually raise limits while watching for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rising latency at constant throughput&lt;/li&gt;
&lt;li&gt;increasing timeout or rate limit errors&lt;/li&gt;
&lt;li&gt;long tails where some tasks stay active for a very long time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As soon as you see these symptoms roll back to the previous stable configuration and keep that as your default.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Kanban is the right tool
&lt;/h2&gt;

&lt;p&gt;Hermes Kanban shines when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;long lived research or engineering backlogs&lt;/li&gt;
&lt;li&gt;multi agent collaboration with named profiles&lt;/li&gt;
&lt;li&gt;workflows that must survive restarts and host reboots&lt;/li&gt;
&lt;li&gt;humans who want a dashboard to triage work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only need a single run to create a few temporary helpers, the built in delegate task tools are usually simpler.&lt;br&gt;&lt;br&gt;
Once you need history, dashboards and strict control over how your agents hit self hosted LLMs the Kanban board plus dispatcher is the right foundation.&lt;/p&gt;

&lt;p&gt;With a few configuration tweaks and optional cron based batching you can keep Hermes Kanban responsive while protecting your gateway and hardware.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Hermes Agent Skill Authoring — SKILL.md Structure and Best Practices</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Wed, 06 May 2026 08:10:04 +0000</pubDate>
      <link>https://forem.com/rosgluk/hermes-agent-skill-authoring-skillmd-structure-and-best-practices-44n9</link>
      <guid>https://forem.com/rosgluk/hermes-agent-skill-authoring-skillmd-structure-and-best-practices-44n9</guid>
      <description>&lt;p&gt;Hermes Agent treats &lt;strong&gt;skills&lt;/strong&gt; as the default way to teach repeatable workflows. Official documentation describes them as on-demand knowledge documents aligned with the open &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; shape, loaded through &lt;strong&gt;progressive disclosure&lt;/strong&gt; so the model sees a small index first and only pulls full instructions when a task actually needs them.&lt;/p&gt;

&lt;p&gt;Authoring is less about clever wording than about &lt;strong&gt;packaging&lt;/strong&gt;—you are telling the runtime when to load a procedure, what order of steps counts as “done,” and how to tell success from a silent failure. This article stays focused on &lt;code&gt;SKILL.md&lt;/code&gt; structure, supporting folders, visibility rules, and the split between secrets and non-secret settings—the details that decide whether a skill shows up in &lt;code&gt;/slash&lt;/code&gt; commands, survives a hub install, or quietly disappears on CI.&lt;/p&gt;

&lt;p&gt;Hermes sits inside the broader &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems: Self-Hosted Assistants, RAG, and Local Infrastructure&lt;/a&gt;&lt;/strong&gt; cluster, where assistants are treated as systems built from inference, retrieval, memory, and tooling rather than as a single chat surface. Install paths, provider wiring, gateway behavior, and the layout of &lt;code&gt;~/.hermes&lt;/code&gt; are all spelled out in the &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes AI Assistant - Install, Setup, Workflow, and Troubleshooting&lt;/a&gt;&lt;/strong&gt; guide; day-to-day shell ergonomics—&lt;code&gt;hermes skills&lt;/code&gt;, profiles, gateway, memory—are easier to scan in the &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-cli-cheatsheet/" rel="noopener noreferrer"&gt;Hermes Agent CLI cheat sheet — commands, flags, and slash shortcuts&lt;/a&gt;&lt;/strong&gt;. In real deployments, skills inherit isolation from &lt;strong&gt;profiles&lt;/strong&gt; (separate config, secrets, memories, and skill trees). &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes AI Assistant Skills for Real Production Setups&lt;/a&gt;&lt;/strong&gt; argues for treating those profiles—not individual markdown files—as the unit of ownership; keep that in mind when you name skills and decide what belongs in shared &lt;code&gt;external_dirs&lt;/code&gt; versus a single profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill or tool?
&lt;/h2&gt;

&lt;p&gt;Official guidance is blunt. &lt;strong&gt;Use a skill&lt;/strong&gt; when the capability is mostly prose instructions plus shell commands and tools Hermes already exposes—wrapping a CLI, driving &lt;code&gt;git&lt;/code&gt;, calling &lt;code&gt;curl&lt;/code&gt;, or using &lt;code&gt;web_extract&lt;/code&gt; for structured fetches. &lt;strong&gt;Use a tool&lt;/strong&gt; when you need tight integration for API keys and auth flows, deterministic binary handling, streaming, or Python that must execute the same way every time.&lt;/p&gt;

&lt;p&gt;That boundary matters in practice because skills ship without changing agent code, while tools carry review and release overhead. Most teams benefit from starting with a skill, then promoting only the brittle core to a tool once the failure modes are obvious (auth refresh loops, binary parsers, strict idempotency).&lt;/p&gt;

&lt;h3&gt;
  
  
  Procedures versus curated memory
&lt;/h3&gt;

&lt;p&gt;Skills answer &lt;strong&gt;how&lt;/strong&gt; to run a workflow; Hermes’ bounded core memory answers &lt;strong&gt;what has already been agreed&lt;/strong&gt; about the user and the project. A skill loads when the task matches its description; MEMORY.md and USER.md stay in the prompt as a small, curated fact layer. The two mechanisms stack rather than compete, and the full picture of snapshots, limits, and external providers is laid out in &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System: How Persistent AI Memory Actually Works&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of a skill directory
&lt;/h2&gt;

&lt;p&gt;On disk, every skill is a folder under &lt;code&gt;~/.hermes/skills/&lt;/code&gt;, often nested under a category such as &lt;code&gt;devops/&lt;/code&gt; or &lt;code&gt;research/&lt;/code&gt;. Hermes expects &lt;strong&gt;&lt;code&gt;SKILL.md&lt;/code&gt; at the leaf&lt;/strong&gt;; everything else is optional structure you add when the instructions would otherwise sprawl. The usual pattern is &lt;code&gt;references/&lt;/code&gt; for long tables or vendor docs, &lt;code&gt;templates/&lt;/code&gt; for output skeletons, &lt;code&gt;scripts/&lt;/code&gt; for deterministic helpers, and &lt;code&gt;assets/&lt;/code&gt; for static files the agent should not re-fetch.&lt;/p&gt;

&lt;p&gt;That layout mirrors how progressive disclosure works in practice: the agent can stay at the main file until it truly needs a deep appendix. Keeping “happy path” prose in &lt;code&gt;SKILL.md&lt;/code&gt; and pushing rarely used detail into &lt;code&gt;references/&lt;/code&gt; is one of the cheapest ways to protect token budgets.&lt;/p&gt;

&lt;p&gt;Hermes can also merge in &lt;strong&gt;external skill directories&lt;/strong&gt; via &lt;code&gt;skills.external_dirs&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;. Those paths are scanned for discovery, but the agent still writes through &lt;code&gt;skill_manage&lt;/code&gt; into the primary &lt;code&gt;~/.hermes/skills/&lt;/code&gt; tree. &lt;strong&gt;Local names shadow external ones&lt;/strong&gt;, so if you “fix” a shared skill in your home directory, teammates pulling the same external repo will not see your edit until they remove or rename the local copy—a common source of “it works on my machine” confusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  SKILL.md frontmatter that survives review
&lt;/h2&gt;

&lt;p&gt;The body of &lt;code&gt;SKILL.md&lt;/code&gt; is Markdown; the opening block must be valid YAML between &lt;code&gt;---&lt;/code&gt; delimiters. Real skills accumulate long fenced examples, so the small habits from &lt;strong&gt;&lt;a href="https://www.glukhov.org/documentation-tools/markdown/markdown-codeblocks/" rel="noopener noreferrer"&gt;Markdown Code Blocks: Complete Guide with Syntax, Languages &amp;amp; Examples&lt;/a&gt;&lt;/strong&gt;—consistent language tags, readable excerpts, tight fences—keep large files maintainable for humans and slightly easier for the model to scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required fields&lt;/strong&gt; are &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;. The &lt;code&gt;name&lt;/code&gt; becomes the slash route and index key; it stays lowercase with hyphens and must respect the documented length cap. The &lt;code&gt;description&lt;/code&gt; is the only prose many sessions ever pay for at &lt;strong&gt;level zero&lt;/strong&gt;, so it should read like a search result or router string (“when backups look stale, verify latest archive and checksum”), not the first paragraph of a blog post.&lt;/p&gt;

&lt;p&gt;Optional top-level keys such as &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, and &lt;code&gt;license&lt;/code&gt; help hub packaging and audits. The &lt;code&gt;platforms&lt;/code&gt; list (&lt;code&gt;macos&lt;/code&gt;, &lt;code&gt;linux&lt;/code&gt;, &lt;code&gt;windows&lt;/code&gt;) is sharper than it looks—when set, Hermes omits the skill entirely on non-matching hosts, which is why a skill that “works on my Mac” can vanish in Linux CI with no error message beyond a shorter skill list.&lt;/p&gt;

&lt;p&gt;Hermes-specific knobs live under &lt;code&gt;metadata.hermes&lt;/code&gt;: &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;related_skills&lt;/code&gt;, and the conditional visibility fields in the next section. &lt;strong&gt;&lt;code&gt;required_environment_variables&lt;/code&gt;&lt;/strong&gt; declares secrets that should land in &lt;code&gt;.env&lt;/code&gt; and pass into sandboxes; &lt;strong&gt;&lt;code&gt;required_credential_files&lt;/code&gt;&lt;/strong&gt; covers OAuth token files and other on-disk credentials that must mount into Docker or Modal; &lt;strong&gt;&lt;code&gt;metadata.hermes.config&lt;/code&gt;&lt;/strong&gt; declares non-secret preferences stored under &lt;code&gt;skills.config&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Official docs stress &lt;strong&gt;size discipline&lt;/strong&gt; for a reason. Trim the &lt;code&gt;description&lt;/code&gt; to its budget, front-load the procedure, and push historical notes or giant option matrices into &lt;code&gt;references/&lt;/code&gt; so a partial &lt;code&gt;skill_view&lt;/code&gt; still gives the agent something actionable.&lt;/p&gt;

&lt;p&gt;Below is a &lt;strong&gt;minimal&lt;/strong&gt; &lt;code&gt;SKILL.md&lt;/code&gt; you can drop into &lt;code&gt;~/.hermes/skills/devops/backup-check/SKILL.md&lt;/code&gt; (or any category folder) and iterate from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup-check&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify nightly backup archives exist, are non-empty, and pass a quick checksum spot-check on the latest file.&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hermes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;devops&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;backups&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;requires_toolsets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;terminal&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup_check.archive_dir&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Absolute path to the directory that holds backup archives&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/backups"&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Backup archive directory (absolute path)&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Backup archive spot-check&lt;/span&gt;

&lt;span class="gu"&gt;## When to use&lt;/span&gt;

Use when the user asks to confirm backups ran, to audit the latest archive on disk, or to catch empty or stale backup files before a restore drill.

&lt;span class="gu"&gt;## Quick reference&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Latest archive directory is configured under &lt;span class="sb"&gt;`skills.config.backup_check.archive_dir`&lt;/span&gt; (set via &lt;span class="sb"&gt;`hermes config migrate`&lt;/span&gt; if declared in metadata).
&lt;span class="p"&gt;-&lt;/span&gt; Default check uses &lt;span class="sb"&gt;`ls`&lt;/span&gt; by mtime and &lt;span class="sb"&gt;`test -s`&lt;/span&gt; for non-empty files.

&lt;span class="gu"&gt;## Procedure&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Resolve the archive directory from skill config or ask the user once if unset.
&lt;span class="p"&gt;2.&lt;/span&gt; List the most recently modified file matching the expected pattern (for example &lt;span class="sb"&gt;`*.tar.zst`&lt;/span&gt;).
&lt;span class="p"&gt;3.&lt;/span&gt; Confirm the file exists, is non-empty, and record its path and size for the reply.
&lt;span class="p"&gt;4.&lt;/span&gt; If a checksum file exists beside the archive, verify it with the documented tool (for example &lt;span class="sb"&gt;`sha256sum -c`&lt;/span&gt;).

&lt;span class="gu"&gt;## Pitfalls&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Empty files can still have a recent mtime if a failed job touched the path; always check size.
&lt;span class="p"&gt;-&lt;/span&gt; Relative paths break when the terminal cwd is not the backup host; use absolute paths in config.

&lt;span class="gu"&gt;## Verification&lt;/span&gt;

The user should see the latest archive path, byte size, and either a checksum OK line or an explicit note that no &lt;span class="sb"&gt;`.sha256`&lt;/span&gt; sidecar was found.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Progressive disclosure in practice
&lt;/h2&gt;

&lt;p&gt;Progressive disclosure is the difference between a skill library that feels snappy and one that burns thousands of tokens before the first user message. Hermes walks three conceptual steps: a compact catalog (names and short descriptions), the full &lt;code&gt;SKILL.md&lt;/code&gt; when the task matches, and—only if needed—a slice of a reference file via &lt;code&gt;skill_view&lt;/code&gt; paths. &lt;strong&gt;Assume level zero is all the model will read&lt;/strong&gt; until it explicitly commits; every sentence in the &lt;code&gt;description&lt;/code&gt; and the first screen of body text should help routing, not storytelling.&lt;/p&gt;

&lt;p&gt;A practical outline that survives partial loads is &lt;strong&gt;When to use&lt;/strong&gt; (triggers in plain language), &lt;strong&gt;Quick reference&lt;/strong&gt; (commands, env vars, file paths), &lt;strong&gt;Procedure&lt;/strong&gt; (ordered steps the agent should not improvise away), &lt;strong&gt;Pitfalls&lt;/strong&gt; (known failure modes), and &lt;strong&gt;Verification&lt;/strong&gt; (what “green” looks like). Narrative history, vendor changelog dumps, and twenty-row option tables belong in &lt;code&gt;references/&lt;/code&gt; with stable headings so the agent can pull a single section.&lt;/p&gt;

&lt;p&gt;When a skill activates, Hermes can rewrite &lt;strong&gt;&lt;code&gt;${HERMES_SKILL_DIR}&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;${HERMES_SESSION_ID}&lt;/code&gt;&lt;/strong&gt; in the body so shell lines point at the installed folder without hand-built paths. Optional &lt;strong&gt;inline shell&lt;/strong&gt; snippets (&lt;code&gt;!&lt;/code&gt;cmd`&lt;code&gt;) can inject fresh context (current branch, disk free space), but they execute on the host and stay disabled unless&lt;/code&gt;skills.inline_shell` is on—treat that flag as a trust boundary for the whole skill source, not a convenience toggle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conditional activation and prompt hygiene
&lt;/h2&gt;

&lt;p&gt;Skills can &lt;strong&gt;show or hide&lt;/strong&gt; based on which toolsets or tools exist in the current session. &lt;code&gt;requires_toolsets&lt;/code&gt; / &lt;code&gt;requires_tools&lt;/code&gt; gate a skill behind capabilities that must be present; &lt;code&gt;fallback_for_toolsets&lt;/code&gt; / &lt;code&gt;fallback_for_tools&lt;/code&gt; surface a cheaper or local path when a premium integration is absent—the DuckDuckGo fallback when a paid web search API is not configured is the canonical example.&lt;/p&gt;

&lt;p&gt;These predicates directly shape &lt;strong&gt;prompt noise&lt;/strong&gt;. An overly strict &lt;code&gt;requires_*&lt;/code&gt; rule hides a skill from newcomers who have not finished &lt;code&gt;hermes tools&lt;/code&gt; setup yet; an overly loose &lt;code&gt;fallback_for_*&lt;/code&gt; rule duplicates half your library whenever someone omits an API key. The useful middle ground is to name real prerequisites, test with &lt;code&gt;hermes chat --toolsets skills&lt;/code&gt;, and toggle keys or toolsets on purpose while watching whether the skill list breathes the way you expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secrets, config, and credential files
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Secrets&lt;/strong&gt; should be declared in &lt;code&gt;required_environment_variables&lt;/code&gt;. Hermes can prompt when a skill loads in the local CLI, persist values in &lt;code&gt;.env&lt;/code&gt;, and pass them into &lt;code&gt;terminal&lt;/code&gt; and &lt;code&gt;execute_code&lt;/code&gt; sandboxes &lt;strong&gt;without&lt;/strong&gt; streaming the raw secret back into the model transcript. Remote chat surfaces refuse to collect keys inline and instead point people at &lt;code&gt;hermes setup&lt;/code&gt; or manual &lt;code&gt;.env&lt;/code&gt; edits—author your skill text so it matches that behavior (tell users &lt;em&gt;that&lt;/em&gt; a key is required, not *to paste it into Telegram).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-secret preferences&lt;/strong&gt;—default paths, org names, feature toggles—belong in &lt;code&gt;metadata.hermes.config&lt;/code&gt;. Values resolve into &lt;code&gt;skills.config&lt;/code&gt; inside &lt;code&gt;config.yaml&lt;/code&gt;, show up in &lt;code&gt;hermes config show&lt;/code&gt;, and arrive in the skill message as resolved facts so the model does not need to open your config file mid-task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File-shaped credentials&lt;/strong&gt; (OAuth token JSON, service account keys) map to &lt;code&gt;required_credential_files&lt;/code&gt;. When those files exist, Hermes can bind-mount them into Docker or sync them into Modal jobs; declaring them upfront avoids the classic “script works locally, dies in sandbox” gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supporting scripts and dependencies
&lt;/h2&gt;

&lt;p&gt;The upstream guide pushes authors toward &lt;strong&gt;boring dependencies&lt;/strong&gt;: stdlib Python, &lt;code&gt;curl&lt;/code&gt;, and Hermes’ own tools (&lt;code&gt;web_extract&lt;/code&gt;, &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;terminal&lt;/code&gt;). That is less about purity than about reproducibility—every extra &lt;code&gt;pip install&lt;/code&gt; is another silent failure when the agent runs in a clean container.&lt;/p&gt;

&lt;p&gt;When JSON or XML parsing is fiddly, a short script under &lt;code&gt;scripts/&lt;/code&gt; plus a &lt;code&gt;${HERMES_SKILL_DIR}&lt;/code&gt; path beats asking the model to re-derive parsers each run. If you truly need a package, state the install command in &lt;strong&gt;Procedure&lt;/strong&gt;, repeat the failure symptom in &lt;strong&gt;Pitfalls&lt;/strong&gt;, and give a &lt;strong&gt;Verification&lt;/strong&gt; command that fails loudly when the dependency is missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Publishing, hub installs, and trust
&lt;/h2&gt;

&lt;p&gt;Community skills move through the Skills Hub and the other discovery paths the user guide lists—official optional skills, GitHub slugs, &lt;code&gt;skills.sh&lt;/code&gt; entries, &lt;code&gt;.well-known&lt;/code&gt; indexes, and raw &lt;code&gt;SKILL.md&lt;/code&gt; URLs. Installs are scanned for obvious exfiltration, injection, and destructive patterns; trust tiers run from &lt;strong&gt;builtin&lt;/strong&gt; through &lt;strong&gt;community&lt;/strong&gt;, and some findings only clear with &lt;code&gt;--force&lt;/code&gt; while the worst cases stay blocked entirely.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;SKILL.md&lt;/code&gt; file shape is not Hermes-specific&lt;/strong&gt;; IDE-centric assistants use the same progressive-loading idea with different discovery and triggers. &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-devtools/claude-code/claude-skills-for-developers/" rel="noopener noreferrer"&gt;Claude Skills and SKILL.md for Developers: VS Code, JetBrains, Cursor&lt;/a&gt;&lt;/strong&gt; is a useful contrast read—frontmatter discipline and “load only when relevant” carry over, even when the installer and slash-command wiring differ.&lt;/p&gt;

&lt;p&gt;Org-wide rollouts usually pair a &lt;strong&gt;private tap or shared Git repo&lt;/strong&gt; with &lt;code&gt;external_dirs&lt;/code&gt; for read-only sharing, while keeping the agent-writable copy under each profile when &lt;code&gt;skill_manage&lt;/code&gt; is allowed to mutate skills in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting and optimization
&lt;/h2&gt;

&lt;p&gt;When a skill misbehaves, walk this checklist before rewriting prose.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; — Confirm &lt;code&gt;platforms&lt;/code&gt;, &lt;code&gt;requires_*&lt;/code&gt;, and &lt;code&gt;fallback_for_*&lt;/code&gt; predicates. A skill that “works on my Mac” but not in Linux CI is often a platform guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name collisions&lt;/strong&gt; — Duplicate names across local and external directories follow &lt;strong&gt;local precedence&lt;/strong&gt;. Rename or namespace aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery layout&lt;/strong&gt; — A misplaced &lt;code&gt;SKILL.md&lt;/code&gt; or wrong category folder can drop the skill from indexing entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token load&lt;/strong&gt; — If sessions feel slow, shorten level-zero descriptions, move depth into &lt;code&gt;references/&lt;/code&gt;, and deduplicate giant tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent edits&lt;/strong&gt; — Hermes can create, patch, or delete skills via &lt;code&gt;skill_manage&lt;/code&gt;. Treat valuable skills like code: review diffs, export snapshots, and reset bundled skills deliberately when upgrades drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A tight regression loop beats rereading the whole file: &lt;code&gt;hermes chat --toolsets skills -q "Use the &amp;lt;skill&amp;gt; workflow to &amp;lt;concrete task&amp;gt;"&lt;/code&gt; should show the agent pulling the right disclosure level before it freestyles. If it never invokes &lt;code&gt;skill_view&lt;/code&gt;, your &lt;strong&gt;When to use&lt;/strong&gt; text or &lt;code&gt;description&lt;/code&gt; probably does not match how people phrase requests.&lt;/p&gt;

&lt;p&gt;Official references stay authoritative for behavior changes—the &lt;strong&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/skills/" rel="noopener noreferrer"&gt;Skills System&lt;/a&gt;&lt;/strong&gt; user guide for runtime semantics, &lt;strong&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/developer-guide/creating-skills" rel="noopener noreferrer"&gt;Creating Skills&lt;/a&gt;&lt;/strong&gt; for author-facing rules, the &lt;strong&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/reference/skills-catalog" rel="noopener noreferrer"&gt;Bundled Skills Catalog&lt;/a&gt;&lt;/strong&gt; for copy-paste examples, and the &lt;strong&gt;&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt;&lt;/strong&gt; for the shared file format Hermes aligns with.&lt;/p&gt;

</description>
      <category>selfhosting</category>
      <category>hermes</category>
      <category>aiagents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Hermes Agent CLI cheat sheet — commands, flags, and slash shortcuts</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 04 May 2026 10:57:09 +0000</pubDate>
      <link>https://forem.com/rosgluk/hermes-agent-cli-cheat-sheet-commands-flags-and-slash-shortcuts-3pcb</link>
      <guid>https://forem.com/rosgluk/hermes-agent-cli-cheat-sheet-commands-flags-and-slash-shortcuts-3pcb</guid>
      <description>&lt;p&gt;Hermes Agent from Nous Research is a model-agnostic, tool-using assistant you run locally or on a VPS.&lt;/p&gt;

&lt;p&gt;Hermes does not lock you into one surface. You can use&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the classic &lt;strong&gt;&lt;code&gt;hermes&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;hermes chat&lt;/code&gt;&lt;/strong&gt; CLI, &lt;/li&gt;
&lt;li&gt;the full-screen &lt;strong&gt;&lt;code&gt;hermes --tui&lt;/code&gt;&lt;/strong&gt; session, &lt;/li&gt;
&lt;li&gt;a long-running &lt;strong&gt;&lt;code&gt;hermes gateway&lt;/code&gt;&lt;/strong&gt; for Telegram, Discord, Slack, and other messaging platforms,&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;hermes dashboard&lt;/code&gt;&lt;/strong&gt; for a local browser UI when the web extra is installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those paths share the same config and data under &lt;strong&gt;&lt;code&gt;~/.hermes&lt;/code&gt;&lt;/strong&gt;; this page lists &lt;strong&gt;shell commands&lt;/strong&gt; that matter across those modes.&lt;/p&gt;

&lt;p&gt;Below is a &lt;strong&gt;dense command reference&lt;/strong&gt; grouped by task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Hermes Agent and first-run CLI commands
&lt;/h2&gt;

&lt;p&gt;For install and troubleshooting, start with &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes AI Assistant — Install, Setup, Workflow, and Troubleshooting&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The installer pulls the repo, sets up a Python environment, and wires the &lt;code&gt;hermes&lt;/code&gt; executable. After &lt;code&gt;source ~/.bashrc&lt;/code&gt; or &lt;code&gt;~/.zshrc&lt;/code&gt;, your &lt;strong&gt;default entry point&lt;/strong&gt; for interactive chat is simply &lt;strong&gt;&lt;code&gt;hermes&lt;/code&gt;&lt;/strong&gt; (same family as &lt;strong&gt;&lt;code&gt;hermes chat&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;`curl -fsSL &lt;a href="https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh&lt;/a&gt; \&lt;/td&gt;
&lt;td&gt;bash`&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes&lt;/code&gt; / &lt;code&gt;hermes chat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Start interactive chat after install (default daily entry).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes --version&lt;/code&gt; / &lt;code&gt;hermes version&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Print version information.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes completion bash&lt;/code&gt; \&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;zsh&lt;/code&gt; \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes update [--check] [--backup] [--restart-gateway]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pull latest code, reinstall deps, optional pre-update home snapshot or gateway restart.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes uninstall [--full] [--yes]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Remove Hermes; optional full data deletion.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Native Windows is not supported; use &lt;strong&gt;WSL2&lt;/strong&gt;. Android installs via Termux follow a dedicated path in the upstream docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global flags for every &lt;code&gt;hermes&lt;/code&gt; invocation
&lt;/h2&gt;

&lt;p&gt;These flags apply before subcommands and change &lt;strong&gt;which profile&lt;/strong&gt;, &lt;strong&gt;which session&lt;/strong&gt;, or &lt;strong&gt;how much personal config&lt;/strong&gt; loads.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--profile&lt;/code&gt;, &lt;code&gt;-p&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Select Hermes profile for this run (overrides sticky default from &lt;code&gt;hermes profile use&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--resume&lt;/code&gt;, &lt;code&gt;-r&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Resume a session by ID or title.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--continue [name]&lt;/code&gt;, &lt;code&gt;-c&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Continue the latest session, or latest matching a title.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--worktree&lt;/code&gt;, &lt;code&gt;-w&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Start in an isolated Git worktree for parallel agents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--yolo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bypass dangerous-command approval prompts (use with care).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--pass-session-id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include session ID in the system prompt.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--ignore-user-config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt; (defaults only); &lt;code&gt;.env&lt;/code&gt; still loads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--ignore-rules&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip auto-injection of AGENTS.md, SOUL.md, &lt;code&gt;.cursorrules&lt;/code&gt;, memory, preloaded skills.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--tui&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Launch the TUI (&lt;code&gt;HERMES_TUI=1&lt;/code&gt; equivalent).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--dev&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;With &lt;code&gt;--tui&lt;/code&gt;, run TS sources via &lt;code&gt;tsx&lt;/code&gt; for TUI development.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Isolated automation often pairs &lt;strong&gt;&lt;code&gt;hermes chat --ignore-user-config --ignore-rules&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;hermes -z&lt;/code&gt;&lt;/strong&gt; for reproducible one-shots.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;hermes chat&lt;/code&gt;, one-shot prompts, and &lt;code&gt;hermes -z&lt;/code&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command / pattern&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes chat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive or scripted chat; main surface for &lt;code&gt;-q&lt;/code&gt;, &lt;code&gt;-m&lt;/code&gt;, &lt;code&gt;--provider&lt;/code&gt;, toolsets, resume, worktree, checkpoints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes chat -q "..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One-shot prompt (non-interactive); keeps richer output than &lt;code&gt;-z&lt;/code&gt; when tools run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes -z "..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Scripted one-shot&lt;/strong&gt; — final answer only on stdout, no banner or session noise. Same agent and tools; best for pipes and scripts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes chat --quiet&lt;/code&gt;, &lt;code&gt;-Q&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Quieter programmatic mode (banner and tool previews suppressed).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-m&lt;/code&gt; / &lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--provider&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Per-run model and provider overrides; env &lt;code&gt;HERMES_INFERENCE_MODEL&lt;/code&gt; / &lt;code&gt;HERMES_INFERENCE_PROVIDER&lt;/code&gt; mirror behavior.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-t&lt;/code&gt; / &lt;code&gt;--toolsets&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Enable comma-separated toolsets for the run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-s&lt;/code&gt; / &lt;code&gt;--skills&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Preload skills (repeat or comma-separated).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--image path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attach a local image to a single query.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--checkpoints&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enable filesystem checkpoints before destructive edits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--max-turns N&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cap tool-calling iterations per turn (default from config).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--source&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Session source tag (&lt;code&gt;cli&lt;/code&gt; vs &lt;code&gt;tool&lt;/code&gt; for integrations).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Hermes model outside the session vs &lt;code&gt;/model&lt;/code&gt; inside it&lt;/strong&gt; — Running &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt; from the shell is where you &lt;strong&gt;add providers&lt;/strong&gt;, keys, and OAuth. Slash &lt;strong&gt;&lt;code&gt;/model&lt;/code&gt;&lt;/strong&gt; only switches among &lt;strong&gt;already configured&lt;/strong&gt; providers. If you only see OpenRouter in &lt;code&gt;/model&lt;/code&gt;, exit the session and complete &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model picker, credential pools, and fallback providers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive provider and model picker; keys, OAuth, custom endpoints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes auth&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Credential pools — &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;, &lt;code&gt;reset&lt;/code&gt; for rotation-friendly keys and OAuth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;`hermes fallback [list \&lt;/td&gt;
&lt;td&gt;add \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;{% raw %}`hermes setup [model \&lt;/td&gt;
&lt;td&gt;tts \&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deprecated &lt;strong&gt;{% raw %}&lt;code&gt;hermes login&lt;/code&gt; / &lt;code&gt;hermes logout&lt;/code&gt;&lt;/strong&gt; — use &lt;strong&gt;&lt;code&gt;hermes auth&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt; instead.&lt;/p&gt;

&lt;p&gt;Picking local OpenAI-compatible endpoints versus hosted APIs for &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt; sits on the same trade-offs as general &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM hosting&lt;/a&gt; (latency, cost, ops).&lt;/p&gt;

&lt;h2&gt;
  
  
  Config files and &lt;code&gt;hermes config&lt;/code&gt; commands
&lt;/h2&gt;

&lt;p&gt;Configuration resolves as &lt;strong&gt;CLI overrides → &lt;code&gt;config.yaml&lt;/code&gt; → &lt;code&gt;.env&lt;/code&gt; → defaults&lt;/strong&gt;. API keys belong in &lt;strong&gt;&lt;code&gt;.env&lt;/code&gt;&lt;/strong&gt;; structured settings in &lt;strong&gt;&lt;code&gt;config.yaml&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config show&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Display effective configuration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config edit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open &lt;code&gt;config.yaml&lt;/code&gt; in &lt;code&gt;$EDITOR&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config set key value&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Set values (secrets routed to &lt;code&gt;.env&lt;/code&gt;, non-secrets to YAML).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes config path&lt;/code&gt; / &lt;code&gt;hermes config env-path&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Print paths to config and env files.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detect missing or stale settings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config migrate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apply newly introduced options interactively.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Where files live&lt;/strong&gt; — Everything sits under &lt;strong&gt;&lt;code&gt;HERMES_HOME&lt;/code&gt;&lt;/strong&gt; (default &lt;strong&gt;&lt;code&gt;~/.hermes&lt;/code&gt;&lt;/strong&gt;) for config, secrets, memories, skills, sessions, gateway state, and logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session management and &lt;code&gt;hermes profile&lt;/code&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes sessions list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List recent sessions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes sessions browse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive picker with search and resume.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes sessions export&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Export sessions (e.g. JSONL).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes sessions delete&lt;/code&gt;, &lt;code&gt;prune&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;, &lt;code&gt;stats&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Delete one session, prune old ones, rename titles, show store stats.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes profile list&lt;/code&gt; \&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;use&lt;/code&gt; \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes profile export&lt;/code&gt; / &lt;code&gt;import&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Archive or restore a profile tarball.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes profile alias&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Short wrapper scripts for fast profile switching.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use &lt;strong&gt;&lt;code&gt;hermes -p work chat -q "..."&lt;/code&gt;&lt;/strong&gt; for ad hoc runs without changing the sticky default profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills hub, toolsets, shell hooks, and plugins
&lt;/h2&gt;

&lt;p&gt;For profile-first configuration and skills tuned to real production workflows by role, see &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes AI Assistant Skills for Real Production Setups&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes tools&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive per-platform tool enablement; &lt;code&gt;--summary&lt;/code&gt; prints current choices.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes skills browse&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;inspect&lt;/code&gt;, &lt;code&gt;install&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;check&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;audit&lt;/code&gt;, &lt;code&gt;uninstall&lt;/code&gt;, &lt;code&gt;publish&lt;/code&gt;, &lt;code&gt;snapshot&lt;/code&gt;, &lt;code&gt;tap&lt;/code&gt;, &lt;code&gt;config&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Skills hub workflows including registries and URL installs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes curator status&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;pause&lt;/code&gt;, &lt;code&gt;pin&lt;/code&gt;, &lt;code&gt;rollback&lt;/code&gt;, …&lt;/td&gt;
&lt;td&gt;Background skill maintenance and safe rollback.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes hooks list&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;revoke&lt;/code&gt;, &lt;code&gt;doctor&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Declared shell hooks and allowlists in config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes plugins&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Composite UI or subcommands to install, enable, disable, remove plugins.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Built-in memory and &lt;code&gt;hermes memory&lt;/code&gt; providers
&lt;/h2&gt;

&lt;p&gt;Built-in &lt;strong&gt;MEMORY.md&lt;/strong&gt; / &lt;strong&gt;USER.md&lt;/strong&gt; stay active; external providers add optional recall layers. For how that architecture behaves in practice, read &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System — How Persistent AI Memory Actually Works&lt;/a&gt;. To compare external backends and activation trade-offs, see &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;Agent Memory Providers Compared — Honcho, Mem0, Hindsight, and Five More&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes memory setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive external memory provider configuration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes memory status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show active provider settings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes memory off&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable external provider; built-in files remain.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When a provider is active it may register extra provider-specific top-level subcommands — run &lt;strong&gt;&lt;code&gt;hermes --help&lt;/code&gt;&lt;/strong&gt; to see what is wired today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Messaging gateway, DM pairing, and platforms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes gateway setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive messaging platform setup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes gateway run&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Foreground gateway (recommended on &lt;strong&gt;WSL&lt;/strong&gt;, Docker, Termux).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes gateway start&lt;/code&gt; \&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;stop&lt;/code&gt; \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes gateway install&lt;/code&gt; \&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uninstall&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes pairing list&lt;/code&gt; \&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;approve&lt;/code&gt; \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes whatsapp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WhatsApp bridge pairing flow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes slack manifest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate Slack app manifest with gateway slash parity.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On &lt;strong&gt;WSL&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;hermes gateway run&lt;/code&gt;&lt;/strong&gt; inside &lt;strong&gt;tmux&lt;/strong&gt; is the resilient pattern when &lt;strong&gt;&lt;code&gt;gateway start&lt;/code&gt;&lt;/strong&gt; misbehaves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cron scheduler, webhooks, and Kanban
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes cron …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create, edit, pause, resume, run, remove scheduled prompts (&lt;code&gt;tick&lt;/code&gt; for manual scheduler pass).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes webhook subscribe&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Dynamic webhook routes for event-driven runs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes kanban …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-profile task board backed by SQLite; &lt;code&gt;dispatch&lt;/code&gt; drives workers.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;hermes doctor&lt;/code&gt;, logs, backup, and usage insights
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes doctor [--fix]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive diagnostics and optional auto-repair.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes status [--all] [--deep]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concise status; deeper checks when needed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes dump [--show-keys]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Paste-friendly setup summary for Discord or GitHub issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes debug share&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upload redacted debug bundle to a paste service (or &lt;code&gt;--local&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;`hermes logs [agent \&lt;/td&gt;
&lt;td&gt;errors \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;{% raw %}&lt;code&gt;hermes backup&lt;/code&gt;, &lt;code&gt;hermes import&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Zip snapshots of home data and restore paths.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes insights [--days N] [--source …]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Token, cost, and activity analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When something breaks after an upgrade, &lt;strong&gt;&lt;code&gt;hermes doctor&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;hermes status&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;hermes logs errors -f&lt;/code&gt;&lt;/strong&gt; form the fastest triage loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP, ACP, web dashboard, and OpenClaw migration
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes mcp serve&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run Hermes as an MCP server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes mcp add&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;configure&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Manage MCP client connections from Hermes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes acp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent Client Protocol stdio server for editors (extra install may apply).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes dashboard [--port …] [--host …]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local web dashboard (&lt;code&gt;pip install hermes-agent[web]&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes claw migrate …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Migrate OpenClaw-style configs into Hermes (&lt;code&gt;--dry-run&lt;/code&gt;, presets, optional secrets).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw migration&lt;/strong&gt; — &lt;code&gt;hermes claw migrate&lt;/code&gt; reads legacy OpenClaw home directories; for what that stack looked like before moving, see the &lt;a href="https://www.glukhov.org/ai-systems/openclaw/" rel="noopener noreferrer"&gt;OpenClaw case study&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slash commands in the Hermes CLI session
&lt;/h2&gt;

&lt;p&gt;Type &lt;strong&gt;&lt;code&gt;/&lt;/code&gt;&lt;/strong&gt; for autocomplete. Commands are &lt;strong&gt;case-insensitive&lt;/strong&gt;; skills register extra &lt;strong&gt;&lt;code&gt;/skill-name&lt;/code&gt;&lt;/strong&gt; routes. The tables below are a curated subset; for the full registry see &lt;strong&gt;Official Hermes Agent documentation&lt;/strong&gt; at the end of this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session flow, background tasks, and goals
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/new&lt;/code&gt;, &lt;code&gt;/reset&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;New session ID and history.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/resume [name]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resume a named session.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/compress [focus]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Manual context compression with optional focus topic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/retry&lt;/code&gt;, &lt;code&gt;/undo&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Retry last turn or drop last exchange.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/title …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Name the session for later &lt;code&gt;/resume&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/background …&lt;/code&gt;, &lt;code&gt;/queue …&lt;/code&gt;, &lt;code&gt;/steer …&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Parallel background run, queued next prompt, mid-loop nudge after next tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/goal …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Persistent multi-turn objective with judge loop (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;pause&lt;/code&gt;, &lt;code&gt;resume&lt;/code&gt;, &lt;code&gt;clear&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/branch&lt;/code&gt;, &lt;code&gt;/fork&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Branch the conversation for alternate exploration.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Models, tool toggles, skills, and reload
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/model … [--global]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switch models among configured providers; &lt;code&gt;--global&lt;/code&gt; persists default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/tools …&lt;/code&gt;, &lt;code&gt;/toolsets&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Session tool toggles and toolset listing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/skills …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search, install, and manage skills from chat.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/cron …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scheduled tasks UI from the CLI session.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/reload-mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reload MCP servers from config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/reload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reload &lt;code&gt;.env&lt;/code&gt; into the running session without restart.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Usage, help, and quitting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/usage&lt;/code&gt;, &lt;code&gt;/insights&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Token and cost visibility; analytics snapshot.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/help&lt;/code&gt;, &lt;code&gt;/quit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Help or exit the CLI.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Messaging apps (Telegram, Discord, Slack, and others) expose an overlapping slash set plus &lt;strong&gt;&lt;code&gt;/approve&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;/restart&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;/commands&lt;/code&gt;&lt;/strong&gt;, and related gateway-only helpers — platform differences are documented in the slash command reference under &lt;strong&gt;Official Hermes Agent documentation&lt;/strong&gt; below.&lt;/p&gt;

&lt;h2&gt;
  
  
  More useful reading
&lt;/h2&gt;

&lt;p&gt;Related pages on this site (broader context for Hermes and terminal agents):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems — Self-Hosted Assistants, RAG, and Local Infrastructure&lt;/a&gt; — cluster overview and how assistants fit the stack&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory&lt;/a&gt; — memory hub and adjacent guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-devtools/" rel="noopener noreferrer"&gt;AI Developer Tools&lt;/a&gt; — terminal and IDE tooling landscape&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-devtools/opencode/" rel="noopener noreferrer"&gt;OpenCode Quickstart&lt;/a&gt; — another terminal-first agent for ergonomic comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Official Hermes Agent documentation
&lt;/h2&gt;

&lt;p&gt;Upstream documentation on &lt;em&gt;hermes-agent.nousresearch.com&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/reference/cli-commands" rel="noopener noreferrer"&gt;CLI commands reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/reference/slash-commands" rel="noopener noreferrer"&gt;Slash commands reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tip.&lt;/strong&gt; Keep &lt;strong&gt;&lt;code&gt;hermes dump&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;hermes doctor --fix&lt;/code&gt;&lt;/strong&gt; in muscle memory — they turn vague "something broke" reports into actionable diffs against a known-good setup.&lt;/p&gt;

</description>
      <category>cheatsheet</category>
      <category>hermes</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>MinIO CE in 2026: Retired Upstream, Source-Only, and What to Use</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 04 May 2026 10:56:52 +0000</pubDate>
      <link>https://forem.com/rosgluk/minio-ce-in-2026-retired-upstream-source-only-and-what-to-use-1k02</link>
      <guid>https://forem.com/rosgluk/minio-ce-in-2026-retired-upstream-source-only-and-what-to-use-1k02</guid>
      <description>&lt;p&gt;MinIO Community Edition is no longer a safe default for new production systems.  &lt;/p&gt;

&lt;p&gt;As of 2026, the public project status and distribution model changed enough that many teams now treat MinIO CE as end of life for serious workloads.&lt;/p&gt;

&lt;p&gt;If you are deciding whether to keep MinIO CE, fork it, or migrate, this guide gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a factual timeline of what changed&lt;/li&gt;
&lt;li&gt;the practical risk for operators&lt;/li&gt;
&lt;li&gt;a technical comparison of SeaweedFS, Garage, RustFS, and Ceph RGW&lt;/li&gt;
&lt;li&gt;a migration plan you can execute in phases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For broader context around storage, databases, and search in production AI stacks, see the &lt;a href="https://www.glukhov.org/data-infrastructure/" rel="noopener noreferrer"&gt;Data Infrastructure for AI Systems&lt;/a&gt; pillar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened to MinIO CE
&lt;/h2&gt;

&lt;p&gt;The community concern is not one single event. It is the sequence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;May 2025&lt;/td&gt;
&lt;td&gt;Key management features moved out of CE path&lt;/td&gt;
&lt;td&gt;Reduced CE parity for auth and admin workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oct 2025&lt;/td&gt;
&lt;td&gt;Community Docker images and public binaries stopped&lt;/td&gt;
&lt;td&gt;Operators must build and verify from source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dec 2025&lt;/td&gt;
&lt;td&gt;Public maintenance mode messaging became explicit&lt;/td&gt;
&lt;td&gt;Fewer expectations for active OSS iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feb 2026&lt;/td&gt;
&lt;td&gt;Repository archived for the first time&lt;/td&gt;
&lt;td&gt;Read only state blocks normal OSS collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;Repository archived again and stayed locked&lt;/td&gt;
&lt;td&gt;Confirms long term frozen upstream posture&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core operational impact is simple&lt;br&gt;&lt;br&gt;
you inherit more supply chain, patching, and maintenance responsibility than most teams expect from a mainstream S3 compatible store.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is MinIO still open source in 2026
&lt;/h2&gt;

&lt;p&gt;A common question is whether MinIO is still open source at all.&lt;/p&gt;

&lt;p&gt;The server code in the public repository is still under AGPLv3.&lt;br&gt;&lt;br&gt;
However, the practical community path changed from normal binary-first consumption to source-first self build.&lt;br&gt;&lt;br&gt;
For many teams, that feels less like a living OSS ecosystem and more like unsupported source availability.&lt;/p&gt;

&lt;p&gt;So the accurate answer is nuanced&lt;br&gt;&lt;br&gt;
license status remains open source, but operationally the community experience is no longer what most platform teams need for low risk production adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is MinIO CE safe for new production deployments
&lt;/h2&gt;

&lt;p&gt;For greenfield deployments, usually no, especially when compared with documented options in this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/garage-vs-minio-vs-s3/" rel="noopener noreferrer"&gt;MinIO vs Garage vs AWS S3 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the risk profile changed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Patch cadence risk&lt;/strong&gt;
no stable, trusted community binary channel means every CVE cycle becomes your build and release cycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification burden&lt;/strong&gt;
your team must own provenance, repeatability, and rollback strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem drift risk&lt;/strong&gt;
tooling that assumed public images may lag or break&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;People risk&lt;/strong&gt;
senior SRE and security time is consumed by platform plumbing instead of product work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you already run MinIO CE internally, this does not mean panic shutdown.&lt;br&gt;&lt;br&gt;
It means treat the platform as controlled technical debt and put a migration runway on your roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community verdict and market response
&lt;/h2&gt;

&lt;p&gt;Across operator communities in 2025 to 2026, the pattern is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer teams choose MinIO CE for net new deployments&lt;/li&gt;
&lt;li&gt;more teams evaluate Garage and SeaweedFS first&lt;/li&gt;
&lt;li&gt;enterprise teams with strict S3 semantics often move toward Ceph RGW&lt;/li&gt;
&lt;li&gt;RustFS gets attention as a direct successor style option, but with alpha caution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This trend matters because platform safety is partly social&lt;br&gt;&lt;br&gt;
healthy ecosystems reduce integration risk, improve troubleshooting velocity, and widen hiring pools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best alternatives to MinIO CE
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SeaweedFS
&lt;/h2&gt;

&lt;p&gt;SeaweedFS is a strong option when you care about huge object counts, small file behavior, and practical efficiency in commodity environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose SeaweedFS when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;you need high small-object density&lt;/li&gt;
&lt;li&gt;you prefer Apache 2.0 governance and licensing clarity&lt;/li&gt;
&lt;li&gt;you want production readiness without the heavy footprint of Ceph&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Garage
&lt;/h2&gt;

&lt;p&gt;Garage is attractive for lightweight self-hosted clusters, edge nodes, and geo distributed deployments on modest hardware.&lt;/p&gt;

&lt;p&gt;If you want a concrete setup path, use this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/garage-quickstart/" rel="noopener noreferrer"&gt;Garage S3 quickstart&lt;/a&gt; to validate replication and operations before migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Garage when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;resource efficiency matters more than full S3 feature parity&lt;/li&gt;
&lt;li&gt;you run mixed ARM or small node environments&lt;/li&gt;
&lt;li&gt;you want simple operations over maximal feature surface&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  RustFS
&lt;/h2&gt;

&lt;p&gt;RustFS is frequently discussed as the closest successor narrative to MinIO style deployment and UX.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose RustFS when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;you accept alpha-stage software risk&lt;/li&gt;
&lt;li&gt;you can test deeply before production&lt;/li&gt;
&lt;li&gt;you want to track a fast moving project with potential upside&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For regulated or high uptime systems, keep RustFS in pilot until maturity is proven in your own reliability tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ceph RGW
&lt;/h2&gt;

&lt;p&gt;Ceph RGW remains the enterprise heavyweight with broad capability and scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Ceph RGW when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;you need mature enterprise S3 behavior&lt;/li&gt;
&lt;li&gt;your team already has Ceph operational expertise&lt;/li&gt;
&lt;li&gt;you can support higher infrastructure and on-call complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Which object store is best for your use case
&lt;/h2&gt;

&lt;p&gt;Use this pragmatic filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small team and low ops budget&lt;/strong&gt;
start with Garage or SeaweedFS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large enterprise and strict compatibility needs&lt;/strong&gt;
prefer Ceph RGW&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimental migration from MinIO style workflows&lt;/strong&gt;
pilot RustFS, but keep rollback options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No option is universally best.&lt;br&gt;&lt;br&gt;
The correct target depends on required S3 features, RPO and RTO goals, team maturity, and how much platform ownership you want.&lt;/p&gt;

&lt;p&gt;If your team still needs legacy MinIO background before deciding, this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/minio-vs-aws-s3/" rel="noopener noreferrer"&gt;MinIO vs AWS S3 overview&lt;/a&gt; and this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/minio-cheatsheet/" rel="noopener noreferrer"&gt;MinIO command cheatsheet&lt;/a&gt; help with current-state audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration plan from MinIO CE
&lt;/h2&gt;

&lt;p&gt;If you are currently on MinIO CE, this phased approach avoids risky big-bang moves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 inventory and risk scoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;list buckets, object counts, and growth rates&lt;/li&gt;
&lt;li&gt;classify workloads by criticality and recovery objectives&lt;/li&gt;
&lt;li&gt;identify hard S3 dependencies such as versioning, object lock, or policy behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2 proof of compatibility
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;stand up one or two candidate platforms&lt;/li&gt;
&lt;li&gt;replay representative read and write workloads&lt;/li&gt;
&lt;li&gt;verify auth, lifecycle rules, retention behavior, and SDK edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plan to instrument your pilot from day one with metrics and alerts from the &lt;a href="https://www.glukhov.org/observability/" rel="noopener noreferrer"&gt;Observability pillar&lt;/a&gt; so migration regressions are measurable rather than anecdotal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3 pilot cutover
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;migrate one low blast radius workload first&lt;/li&gt;
&lt;li&gt;run dual read validation where possible&lt;/li&gt;
&lt;li&gt;measure latency, error rates, and operational overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 4 production migration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;migrate high priority internet facing workloads first&lt;/li&gt;
&lt;li&gt;keep rollback artifacts and retention windows&lt;/li&gt;
&lt;li&gt;document final runbooks before decommissioning MinIO CE paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;MinIO CE may still run, but it is no longer the low-friction default for new production object storage.&lt;br&gt;&lt;br&gt;
Treat current clusters as transition infrastructure, not a long horizon foundation.&lt;/p&gt;

&lt;p&gt;For most teams in 2026, safer direction is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SeaweedFS or Garage for pragmatic self hosted deployments&lt;/li&gt;
&lt;li&gt;Ceph RGW for enterprise scale and mature S3 requirements&lt;/li&gt;
&lt;li&gt;RustFS for monitored pilot environments only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make the migration decision early while you can still choose your timeline instead of reacting to the next forced change.&lt;/p&gt;

</description>
      <category>minio</category>
      <category>garage</category>
      <category>s3</category>
      <category>selfhosting</category>
    </item>
  </channel>
</rss>
