Forem: Rost

LLM Wiki - Compiled Knowledge That RAG Cannot Replace

Rost — Mon, 18 May 2026 09:22:56 +0000

The premise is simple: compiled knowledge is more reusable than retrieved fragments.
RAG became the default answer to a straightforward question - how do I give an LLM access to external knowledge?

And the usual architecture is by now familiar.
Take documents, split them into chunks, embed the chunks, store them in a vector database, retrieve relevant pieces at query time, and pass them into the model. That pattern is useful, but it is also overused. RAG is very good at access and not automatically good at structure. It can find relevant fragments but does not create a stable understanding of a domain, it can retrieve context but does not decide what the canonical explanation is, and it can answer from documents but does not maintain a living knowledge base.

LLM Wiki is not just another retrieval pattern but a different way to think about knowledge architecture entirely. Instead of asking the model to synthesize from raw chunks every time a question is asked, an LLM Wiki uses the model earlier in the pipeline, performing synthesis at ingest time and storing the result as structured, readable, linked knowledge.

A good shorthand is this:

RAG retrieves knowledge at query time.
LLM Wiki compiles knowledge at ingest time.

That distinction changes cost, latency, quality, maintenance, governance, and failure modes - and it is the central reason LLM Wiki deserves its own architecture category.

RAG optimizes retrieval, not representation

RAG is powerful because it lets a language model use information outside its training data, making it useful for:

company documentation
product manuals
technical support
internal search
research assistants
policy lookup
code documentation
knowledge base chatbots

But RAG has a structural weakness: it often treats knowledge as a pile of retrievable fragments rather than a structured model of a domain.

A typical RAG system works like this:

Collect documents.
Split them into chunks.
Create embeddings.
Store the chunks in a vector database.
Retrieve similar chunks for each query.
Ask the LLM to answer using those chunks.
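
To make the steps above concrete, here is a minimal sketch of that loop in plain Python. It is a toy under stated assumptions: the "embedding" is just a bag of lowercase words and the "vector database" is a list, where a real system would call an embedding model and a proper store. The function names (`chunk`, `embed`, `retrieve`) and the sample documents are illustrative, not taken from any particular library.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 12) -> list[str]:
    """Naive fixed-size chunking: at most `size` words per chunk."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical sample corpus.
documents = [
    "Ollama runs models locally with a simple command line interface.",
    "vLLM is an inference server designed for high throughput serving.",
]

# The list below plays the role of the vector database.
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

top = retrieve("which server is built for high throughput", index, k=1)

# The final step would send this prompt to an LLM; that call is omitted here.
prompt = "Answer using only this context:\n" + "\n".join(top)
```

Note that nothing in this loop is stored for the next query: the ranking and the prompt are rebuilt every time.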

This works well for many questions, but it also creates repeated interpretation work for complex ones. Every time a user asks something conceptually rich, the system has to:

retrieve fragments
decide which fragments matter
infer relationships
resolve contradictions
build a temporary explanation
produce an answer

Then that synthesis disappears and the next query starts from scratch. This is fine when questions are simple, but it becomes wasteful when the same concepts are repeatedly reconstructed from raw fragments.

The most common RAG mistake is assuming that better retrieval equals better knowledge. Sometimes that is true, but often it is not, because retrieval and representation solve different problems. Retrieval answers which pieces of text are relevant; representation answers how knowledge should be structured in the first place. A RAG system can retrieve five accurate chunks about a topic and still fail because:

the chunks are outdated
the documents contradict each other
the important concept is spread across pages
the source uses inconsistent terminology
the answer requires synthesis, not lookup
there is no canonical page

RAG is an access layer, not a knowledge model by itself, and an LLM Wiki exists precisely because some knowledge should be represented before it is retrieved.

What is an LLM Wiki?

An LLM Wiki is a knowledge system where a language model helps transform source material into structured wiki-like knowledge. Instead of storing only raw documents and retrieving chunks later, the system creates derived knowledge artifacts such as:

topic pages
summaries
glossaries
concept pages
entity pages
cross-links
comparisons
contradiction notes
source references
decision records
explanations

The output is usually human-readable and, in many implementations, stored as plain Markdown, which matters because Markdown makes the system:

inspectable
portable
editable
versionable
easy to diff
compatible with static sites and PKM tools

The idea is not that the LLM magically knows everything but that the LLM helps maintain a structured layer over the source material, acting as a structuring assistant rather than the final authority.

The core idea

The core idea of LLM Wiki is ingest-time knowledge synthesis. In a RAG system, synthesis usually happens when a user asks a question; in an LLM Wiki, synthesis happens earlier, during ingestion, before any question has been asked.

A simplified pipeline looks like this:

sources
  -> ingest
  -> summarize
  -> structure
  -> link
  -> maintain
  -> query or browse

The system does not wait until query time to figure out what the knowledge means - it creates a reusable structure in advance, which makes LLM Wiki closer to a compiled knowledge base than a search pipeline.
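
As a rough illustration of what "compiling in advance" can mean, the sketch below groups sources by topic and writes one page per topic before any query exists. Everything here is an assumption made for the example: the `Source` and `Page` types, the pre-assigned `topic` field, and especially `summarize`, which is a placeholder that keeps the first sentence of each source where a real system would call an LLM.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Source:
    path: str
    topic: str  # hypothetical: assumes a topic was already assigned during ingest
    text: str

@dataclass
class Page:
    title: str
    summary: str
    sources: list[str] = field(default_factory=list)

def summarize(texts: list[str]) -> str:
    """Placeholder for an LLM call: keeps the first sentence of each source."""
    return " ".join(t.split(".")[0].strip() + "." for t in texts)

def compile_wiki(sources: list[Source]) -> dict[str, Page]:
    """Build one page per topic, keeping the paths of the sources it was derived from."""
    by_topic: dict[str, list[Source]] = defaultdict(list)
    for s in sources:
        by_topic[s.topic].append(s)
    return {
        topic: Page(topic, summarize([s.text for s in group]), [s.path for s in group])
        for topic, group in by_topic.items()
    }

# Hypothetical sources.
sources = [
    Source("notes/a.md", "vllm", "vLLM targets server workloads. More detail follows."),
    Source("notes/b.md", "vllm", "It exposes an HTTP API. See the docs."),
    Source("notes/c.md", "ollama", "Ollama is easy to run locally. Details later."),
]

wiki = compile_wiki(sources)
```

The point of the sketch is the shape, not the summarizer: the synthesis is stored once, with its source paths, and later queries read `wiki` instead of redoing the work.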

A practical example

Imagine you have 60 articles about local LLM hosting. A RAG system might split them into chunks and retrieve relevant sections when you ask about the differences between Ollama, vLLM, llama.cpp, and SGLang, then let the LLM assemble an answer from those retrieved fragments.

An LLM Wiki system does something different. At ingest time, it creates structured pages:

ollama.md
vllm.md
llama-cpp.md
sglang.md
local-llm-hosting-overview.md
inference-backends-comparison.md
gpu-memory-and-context-length.md

Then it links them. When you later ask a question, the system is not starting from raw fragments but from a structured knowledge layer that was already assembled before the question arrived. For conceptual and comparative questions, that means the answer can draw on a comparison that has already been written and reviewed, instead of one improvised from fragments on every query.
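
Stable page names are part of what makes those pages linkable. The helper below is one possible naming rule, inferred from the file names in the list above rather than from any standard: lowercase the title, collapse anything that is not a letter or digit into a hyphen, and append `.md`. The function name `page_filename` is hypothetical.

```python
import re

def page_filename(title: str) -> str:
    """Turn a page title into a stable, lowercase, hyphenated Markdown file name."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}.md"

names = [page_filename(t) for t in ["Ollama", "llama.cpp", "GPU memory and context length"]]
```

A deterministic rule like this means the same concept always lands on the same page, which keeps regenerated pages from silently forking into duplicates.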

How LLM Wiki works

There is no single official implementation, but most LLM Wiki systems follow the same conceptual stages.

Source collection

The system starts with source material - blog posts, PDFs, Markdown notes, technical documentation, transcripts, papers, meeting notes, bookmarks, code comments, and README files - which should be preserved as a separate layer, distinct from the generated wiki. This matters because generated wiki pages are derived knowledge, not original truth, and a serious LLM Wiki should always maintain links back to sources so that every generated page can answer the basic question: where did this claim come from?

Ingestion and extraction

During ingestion, the system reads source material and extracts useful knowledge. It may identify:

main topics
entities and tools
definitions
claims
decisions
examples
contradictions between sources
open questions
recurring concepts

This stage is where LLM Wiki starts to differ from ordinary RAG: while RAG usually chunks documents for retrieval, LLM Wiki tries to understand and reshape the material conceptually rather than just making it searchable.

Summarization

The system creates summaries, but useful summaries are not just shorter versions of text - they should preserve the structure of the argument. A weak summary says "this document discusses local LLM hosting tools." A useful summary says "this document compares local LLM hosting tools by deployment complexity, GPU usage, API compatibility, and production readiness, positioning Ollama as easy for local use, vLLM as stronger for server workloads, and llama.cpp as flexible for quantized models."

For technical knowledge, a summary should capture:

what problem it solves
what assumptions it makes
what tradeoffs it contains
what dependencies it has
what is still uncertain

This is where LLMs are genuinely useful, because they are good at compressing messy prose into structured explanations.

Structuring

Summaries alone are not enough - the system must also decide where knowledge belongs, which is the representation layer. Common structures include:

topic pages
concept pages
index pages
comparison pages
glossary entries
how-to pages
architecture notes
decision records
maps of related pages

A pile of summaries is not a wiki; a wiki needs page boundaries, links, and recurring structure, and a good LLM Wiki is not measured by page count but by whether pages become genuinely reusable.

Linking

Links define the shape of the knowledge system. In a normal document archive, relationships are often implicit; in an LLM Wiki, they should become explicit. Useful link types include:

concept to concept
article to summary
tool to comparison
problem to solution
architecture to implementation
source to derived page
glossary term to detailed page

This is one of the most important differences between LLM Wiki and basic summarization: summaries reduce text, but links build a knowledge graph.
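
If pages are Markdown, the link graph can be recovered from the pages themselves. The sketch below assumes pages link to each other with ordinary inline Markdown links that point at `.md` files; that convention, the regular expression, and the sample pages are assumptions for illustration. It computes backlinks (which pages point at a given page) and lists pages nothing links to.

```python
import re
from collections import defaultdict

# Matches inline Markdown links whose target is a .md file, e.g. [vLLM](vllm.md).
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")

def outgoing(body: str) -> set[str]:
    """Return the set of .md targets linked from a page body."""
    return set(LINK.findall(body))

def backlinks(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each linked page to the set of pages that link to it."""
    incoming: dict[str, set[str]] = defaultdict(set)
    for name, body in pages.items():
        for target in outgoing(body):
            incoming[target].add(name)
    return dict(incoming)

# Hypothetical page bodies keyed by file name.
pages = {
    "ollama.md": "Compared in [backends](inference-backends-comparison.md).",
    "vllm.md": "See [backends](inference-backends-comparison.md) and [Ollama](ollama.md).",
    "inference-backends-comparison.md": "Covers [Ollama](ollama.md) and [vLLM](vllm.md).",
}

graph = backlinks(pages)

# Pages that exist but are never linked from anywhere else.
orphans = sorted(set(pages) - set(graph))
```

Orphan pages and pages with many backlinks are both useful signals: the first may be clutter, the second are probably the canonical concepts.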

Review and correction

This stage is optional only in toy systems; in serious systems, human review is essential. The review process should check:

whether summaries are faithful
whether links are useful
whether claims are sourced
whether pages are duplicated
whether concepts are misplaced
whether outdated information is marked
whether generated pages overstate certainty

LLM Wiki can reduce human effort, but it should never remove human responsibility.
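
Some of the checks in the list above can be pre-screened mechanically so that reviewers spend their time on judgment rather than bookkeeping. The sketch below is one hedged way to do that: it flags unsourced claims, titles that differ only by case, and pages that were never reviewed. The `Draft` structure and the flag wording are assumptions for the example; faithfulness and usefulness still need a human.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Draft:
    title: str
    claims: list[tuple[str, list[str]]]  # (claim text, source paths backing it)
    last_reviewed: Optional[str] = None  # ISO date string, or None if never reviewed

def review_flags(drafts: list[Draft]) -> dict[str, list[str]]:
    """Return, per draft title, the mechanical problems a reviewer should look at."""
    flags: dict[str, list[str]] = {}
    seen: dict[str, str] = {}  # normalized title -> first title seen
    for d in drafts:
        problems = []
        key = d.title.strip().lower()
        if key in seen:
            problems.append(f"possible duplicate of '{seen[key]}'")
        else:
            seen[key] = d.title
        for text, srcs in d.claims:
            if not srcs:
                problems.append(f"unsourced claim: {text}")
        if d.last_reviewed is None:
            problems.append("never reviewed")
        if problems:
            flags[d.title] = problems
    return flags

# Hypothetical drafts.
drafts = [
    Draft("vLLM", [("Serves models over HTTP", ["notes/a.md"])], "2026-05-01"),
    Draft("vllm", [("Fastest backend", [])]),
]

flags = review_flags(drafts)
```

A check like this only narrows the queue. It cannot tell whether a sourced claim is actually supported by its source.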

LLM Wiki vs RAG

The cleanest distinction between LLM Wiki and RAG is timing.

Query-time synthesis

In RAG, the system retrieves information when a user asks a question.

query
  -> retrieve chunks
  -> assemble context
  -> generate answer

This is flexible and works well when:

the corpus is large
information changes often
questions are unpredictable
you need broad coverage
you cannot curate everything

But it may be less coherent for conceptual questions, because the model has to synthesize from fragments each time, which can produce inconsistent answers across similar queries.

Ingest-time synthesis

In LLM Wiki, the system performs synthesis before the question arrives.

sources
  -> summarize
  -> structure
  -> link
  -> query or browse later

This is less flexible but more coherent, and it works well when:

the corpus is manageable
the domain is stable
concepts repeat
human readability matters
you want reusable synthesis
you want a maintained knowledge layer

The main differences

Dimension	RAG	LLM Wiki
Main timing	Query time	Ingest time
Main operation	Retrieve chunks	Compile knowledge
Best corpus	Large and changing	Curated and stable
Output	Generated answer	Structured knowledge pages
Infrastructure	Search index or vector DB	Markdown or wiki structure
Strength	Flexible access	Reusable synthesis
Weakness	Fragmented context	Maintenance drift
Human readability	Often indirect	Usually direct

Complementary, not mutually exclusive

The debate should not be framed as "LLM Wiki or RAG" - that is the wrong question. LLM Wiki does not replace RAG in most production systems; both have distinct and complementary roles. A well-designed system may look like this:

raw documents
  -> source store
  -> LLM Wiki synthesis
  -> reviewed knowledge pages
  -> search index
  -> RAG over source and synthesis
  -> answer with citations

In that architecture, LLM Wiki improves the representation layer and RAG improves the access layer. Use RAG for retrieval over large and changing corpora, use LLM Wiki for compiled synthesis over stable and curated knowledge, and use both together when you need scale and coherence at the same time.

LLM Wiki vs adjacent systems

LLM Wiki vs summarization

A weak LLM Wiki is just a folder of generated summaries, and that is not enough. Summarization compresses content; LLM Wiki structures it. A real LLM Wiki needs stable pages, links, concepts, indexes, source tracking, revision history, maintenance workflows, and conflict detection - the wiki part matters as much as the LLM part.

LLM Wiki vs knowledge graph

A knowledge graph represents entities and relationships explicitly, while an LLM Wiki creates a softer, document-oriented graph through Markdown pages and links. A mature system can use both: the wiki provides human-readable explanations and the knowledge graph provides precisely structured, machine-queryable relationships.

LLM Wiki vs agent memory

LLM Wiki is also different from AI memory. Memory stores context that affects future behavior, while an LLM Wiki stores structured knowledge that can be read, searched, reviewed, and linked by both humans and systems.

Memory might remember:

the user prefers Go examples
the project avoids ORMs
the agent tried a command yesterday
a bug investigation failed

An LLM Wiki might store:

what Go database access patterns exist
how sqlc compares with GORM
why outbox patterns matter
how RAG differs from memory systems

Memory is behavioral context; LLM Wiki is represented knowledge - and mixing the two leads to systems that are hard to inspect, audit, or maintain.

When LLM Wiki works well

LLM Wiki works best for stable domains, personal research, curated corpora, technical documentation, and situations where repeated synthesis over the same material is wasteful.

Stable domains

LLM Wiki works best when the domain does not change every hour. Good examples include:

technical concepts
research notes
learning material
architecture patterns
book notes
model comparison notes
internal engineering principles
curated documentation
personal knowledge bases

If knowledge is stable enough to summarize without becoming stale within days, LLM Wiki can deliver lasting value that compounds as the wiki grows.

Research synthesis

Research synthesis is one of the strongest use cases, because researchers often read many sources and repeatedly ask the same meta-questions:

What are the main ideas?
Which sources agree?
Which sources conflict?
What concepts repeat?
What is the current state of the topic?
What should I read next?

LLM Wiki helps turn that research material into reusable structure - topic pages, comparison pages, contradiction notes, and related links - so the researcher does not have to rebuild the same mental map every time they return to a domain. It is especially useful when working with papers, technical articles, transcripts, documentation, notes, and experiment logs.

Personal knowledge systems

LLM Wiki fits naturally with PKM, second brain workflows, and the broader knowledge systems spectrum, because a personal knowledge system already contains:

notes
links
unfinished ideas
summaries
references
topic maps

An LLM can help maintain the structure by:

summarizing long notes
proposing links
creating topic pages
detecting duplicate concepts
extracting glossary terms
generating index pages
identifying gaps

The human remains the editor, which is the right relationship between human judgment and machine assistance.

Technical blogging

A technical blog can use LLM Wiki ideas internally even without building a full automated system. A well-structured site can include:

pillar pages
cluster index pages
topic summaries
related article maps
glossary pages
comparison pages
canonical explainers

This is not only SEO but knowledge representation: a well-structured technical blog becomes more valuable when articles are connected into a durable knowledge structure that both humans and AI systems can navigate.

Small team knowledge bases

LLM Wiki can work well for small teams with curated knowledge, including engineering decisions, product architecture, onboarding notes, support playbooks, internal standards, postmortems, and runbooks. The key condition is governance: someone must review and maintain the generated structure, because without clear ownership the wiki decays into noise regardless of how well it was initially generated.

When LLM Wiki is a poor fit

Highly dynamic data

LLM Wiki is weaker when information changes constantly. Live inventory, pricing feeds, incident status, financial market data, rapidly changing support tickets, and real-time logs are all better served by retrieval or direct API access. Compiling fast-moving data into static summaries is counterproductive unless you have a strong refresh process that keeps the compiled layer in sync with reality.

Large unmanaged corpora

LLM Wiki does not automatically scale to millions of documents. At large scale, the difficult problems extend well beyond generation and include:

access control
data lineage
ownership
deduplication
indexing
freshness tracking
evaluation
governance

A simple Markdown wiki is not equipped to address those needs, and at enterprise scale, LLM Wiki may become one layer inside a larger knowledge architecture rather than the whole system.

Low-quality sources

LLM Wiki cannot reliably fix bad sources. If the source material is contradictory, outdated, low quality, duplicated, incomplete, or badly scoped, generated pages may look polished but be wrong. This is dangerous precisely because a clean generated page creates false confidence - the formatting signals quality even when the underlying content does not justify it.

No review process

LLM Wiki without review is risky because generated structure creates authority. A bad answer in RAG may affect one query, but a bad generated wiki page may affect many future queries, readers, and agents that retrieve from it. The model may overgeneralize, miss exceptions, invent structure, merge incompatible ideas, hide uncertainty, create misleading links, or summarize outdated material as though it were current - so for any knowledge that actually matters, human review is not optional.

Limitations and failure modes

The main risks of building an LLM Wiki are stale summaries, hallucinated synthesis baked into the knowledge base, weak source tracking, maintenance cost, and false confidence in generated structure.

Maintenance drift

Knowledge drift happens when generated pages stop matching the underlying sources. This can happen because:

sources changed
new sources were added
old pages were not refreshed
summaries were edited manually
links became outdated
model output changed over time

Drift is the central operational risk of LLM Wiki, and a good system needs explicit refresh and validation workflows to catch it before it propagates.
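
One simple refresh check, sketched below, is to record a content hash of every source when a page is generated and compare it with the current source later. The data layout (a mapping from page to recorded source fingerprints) and the function names are assumptions for the example. This catches changed or deleted sources only; new sources and manual edits to pages need separate checks.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash of a source document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_pages(
    page_sources: dict[str, dict[str, str]],
    current: dict[str, str],
) -> dict[str, list[str]]:
    """Return pages whose recorded source fingerprints no longer match.

    page_sources: page name -> {source path: fingerprint recorded at generation time}
    current:      source path -> current source text
    """
    stale: dict[str, list[str]] = {}
    for page, recorded in page_sources.items():
        changed = [
            path
            for path, digest in sorted(recorded.items())
            if path not in current or fingerprint(current[path]) != digest
        ]
        if changed:
            stale[page] = changed
    return stale

# Hypothetical sources at generation time.
old = {"notes/a.md": "vLLM supports feature X.", "notes/b.md": "Ollama basics."}
recorded = {
    "vllm.md": {"notes/a.md": fingerprint(old["notes/a.md"])},
    "ollama.md": {"notes/b.md": fingerprint(old["notes/b.md"])},
}

# Later, one source is edited.
new = dict(old)
new["notes/a.md"] = "vLLM supports feature X and Y."

needs_refresh = stale_pages(recorded, new)
```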

Hallucinated synthesis

RAG can hallucinate at answer time, but LLM Wiki can hallucinate at ingest time, which is more subtle and more dangerous. If a generated wiki page contains a wrong synthesis, future users may treat that page as ground truth, and future AI systems may retrieve it and amplify the mistake further. Generated structure needs provenance, and every important claim should link back to its original sources so the hallucination can be caught during review rather than silently embedded in the knowledge base.

Over-structuring

Once you have an LLM that can create pages cheaply, it is tempting to create too many of them. You can end up with:

empty taxonomy
duplicate concepts
shallow pages
meaningless links
generated clutter
fake completeness

A useful wiki is not measured by page count but by whether pages are actually reused, linked, and updated over time.

Unclear ownership

The model cannot own the page. A serious system needs clear ownership rules covering:

who reviews pages
who approves updates
who deletes stale pages
who resolves contradictions
who decides canonical structure

Without that clarity, LLM Wiki becomes another abandoned knowledge base - well-intentioned, well-generated, and quietly ignored.

Architecture patterns

Pattern 1. Personal LLM Wiki

The personal pattern is the simplest and most practical version, best suited for individuals.

notes and sources
  -> LLM assisted summaries
  -> Markdown pages
  -> manual review
  -> [Obsidian](https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/ "Using Obsidian for Personal Knowledge Management") or static site

It works well for researchers, writers, engineers, technical bloggers, students, and consultants, where the value comes from reducing repeated synthesis and making personal knowledge easier to navigate without requiring any team coordination or governance infrastructure.

Pattern 2. Team LLM Wiki

The team pattern is best for small groups and needs more governance than the personal version.

team docs
  -> ingest workflow
  -> generated draft pages
  -> review queue
  -> published wiki
  -> search or RAG layer

The review queue is critical here, because generated knowledge should never be published directly into a team source of truth without a human checkpoint - even a lightweight review process catches the most dangerous hallucinations before they become institutional knowledge.
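
The human checkpoint can be enforced in code rather than left to convention. The sketch below models page status as a small state machine in which a draft cannot jump straight to published and publishing requires a named reviewer. The status names and the allowed transitions are assumptions chosen for the example, not a prescribed workflow.

```python
from enum import Enum
from typing import Optional

class Status(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    PUBLISHED = "published"
    REJECTED = "rejected"

# Allowed moves between states; anything else is refused.
ALLOWED = {
    Status.DRAFT: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.PUBLISHED, Status.REJECTED},
    Status.REJECTED: {Status.DRAFT},
    Status.PUBLISHED: {Status.IN_REVIEW},  # a refresh sends the page back to review
}

def transition(current: Status, target: Status, reviewer: Optional[str] = None) -> Status:
    """Move a page to a new status, refusing shortcuts around review."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    if target is Status.PUBLISHED and not reviewer:
        raise ValueError("publishing requires a named human reviewer")
    return target

status = transition(Status.DRAFT, Status.IN_REVIEW)
status = transition(status, Status.PUBLISHED, reviewer="alice")
```

Calling `transition(Status.DRAFT, Status.PUBLISHED, reviewer="alice")` raises `ValueError`, which is the behavior the review queue is meant to guarantee.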

Pattern 3. LLM Wiki plus RAG

This is often the most balanced architecture, giving you both raw source access and compiled synthesis.

raw sources
  -> LLM Wiki pages
  -> reviewed knowledge base
  -> search index
  -> RAG over raw and compiled knowledge
  -> cited answer

The RAG system can retrieve from original documents, generated summaries, topic pages, comparison pages, and glossary entries, which can make retrieval stronger than operating over raw documents alone, because a question about a comparison can match a page that already contains the comparison.
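
A minimal sketch of retrieval over both layers might look like this. It assumes a single index where each entry is tagged as a raw source chunk or a reviewed wiki page, scores by plain word overlap, and gives pages a modest boost. The `Entry` type, the `page_boost` value, and the scoring are all illustrative assumptions; a real system would use its own ranking and tune any boost against evaluation data.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    ref: str
    kind: str  # "source" for raw chunks, "page" for reviewed wiki pages
    text: str

def tokens(text: str) -> set[str]:
    """Lowercase word set used for toy overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str, entries: list[Entry], k: int = 2, page_boost: float = 1.5) -> list[str]:
    """Rank entries by word overlap, boosting compiled pages over raw chunks."""
    q = tokens(query)
    scored = []
    for e in entries:
        overlap = len(q & tokens(e.text))
        if overlap:
            weight = page_boost if e.kind == "page" else 1.0
            scored.append((overlap * weight, e.ref))
    # Highest score first; ties broken by reference for a stable order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [ref for _, ref in scored[:k]]

# Hypothetical mixed index.
entries = [
    Entry("notes/a.md#2", "source", "vLLM batching improves throughput on GPUs"),
    Entry("inference-backends-comparison.md", "page",
          "Comparison of Ollama and vLLM: throughput, setup, GPU use"),
    Entry("notes/c.md#1", "source", "Ollama setup on a laptop"),
]

results = search("vLLM throughput vs Ollama", entries)
```

Returning the reference alongside the text is what makes the final "cited answer" step possible: the answer can point at both the compiled page and the raw chunk behind it.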

Pattern 4. LLM Wiki as site architecture

For a technical website, LLM Wiki ideas can guide content structure even without automation.

articles
  -> pillar pages
  -> topic maps
  -> comparisons
  -> internal links
  -> search and AI access

This turns a blog into a knowledge system where articles are not just posts but nodes in a structured map - a significant difference for both reader experience and machine-readable discoverability.

LLM Wiki design principles

Keep raw sources separate

Never lose the original source. Generated pages should not replace source documents but sit above them - the source layer provides evidence, the wiki layer provides interpretation, and losing the original means losing the ability to verify, challenge, or update the interpretation derived from it.

Use Markdown where possible

Markdown is boring and excellent. It is portable, readable, diffable, versionable, easy to edit, friendly to static sites, and friendly to PKM tools. Boring formats survive longer than clever platforms, which means a Markdown-based LLM Wiki built today will still be usable long after whatever proprietary database you might have chosen has gone through multiple breaking migrations. For syntax reference, see the Markdown Cheatsheet and the guide to Markdown Code Blocks, which are especially relevant when structuring wiki pages that include technical content.

Track provenance

Every generated page should answer:

What sources created this?
When was it generated?
When was it reviewed?
What changed?
Who approved it?

Without provenance, trust collapses over time as pages drift further from their origins. A practical page schema might look like this:

title
summary
status
sources
last_reviewed
related_pages
concepts
open_questions

For technical content, add:

applies_to
version
examples
tradeoffs
failure_modes

For research content, add:

claims
evidence
contradictions
confidence
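
The base schema above maps naturally onto page metadata. The sketch below renders it as a front matter block at the top of a Markdown page. Using front matter at all is an assumption (the schema only lists field names), and the values are written as JSON, which YAML 1.2 parsers are designed to accept, so that strings containing colons or quotes do not need special handling. `PageMeta` and `front_matter` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class PageMeta:
    title: str
    summary: str
    status: str
    sources: list[str] = field(default_factory=list)
    last_reviewed: str = ""  # ISO date string; empty means not yet reviewed
    related_pages: list[str] = field(default_factory=list)
    concepts: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

def front_matter(meta: PageMeta) -> str:
    """Render page metadata as a front matter block with JSON-encoded values."""
    lines = ["---"]
    for key, value in asdict(meta).items():
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines)

# Hypothetical page.
meta = PageMeta(
    title="vLLM",
    summary="Inference server aimed at server workloads.",
    status="needs review",
    sources=["notes/a.md", "notes/b.md"],
    related_pages=["ollama.md"],
)

text = front_matter(meta)
```

Keeping `sources` and `last_reviewed` in the page itself means provenance travels with the file through version control and diffs.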

Prefer fewer better pages

Do not generate a page for every minor idea. Prefer strong concept pages, useful comparison pages, topic indexes, canonical summaries, and glossary entries that earn their place. A small useful wiki with twenty well-maintained pages beats a large generated mess with two hundred pages nobody reads or updates.

Make links meaningful

Links should explain relationships rather than just connect pages at random. Useful link types include:

related concept
depends on
contrasts with
example of
source for
expands on
implementation of

Random links create noise and erode reader trust in the structure.
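
One way to keep links from becoming random is to make the relationship part of the link. The sketch below stores each link with one of the types listed above and lets you query by type. The `LinkType` enum mirrors that list; the `Link` structure, the `targets` helper, and the sample links are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class LinkType(Enum):
    RELATED_CONCEPT = "related concept"
    DEPENDS_ON = "depends on"
    CONTRASTS_WITH = "contrasts with"
    EXAMPLE_OF = "example of"
    SOURCE_FOR = "source for"
    EXPANDS_ON = "expands on"
    IMPLEMENTATION_OF = "implementation of"

@dataclass(frozen=True)
class Link:
    source: str
    kind: LinkType
    target: str

    def render(self) -> str:
        """Render the link as a labeled Markdown link."""
        return f"{self.kind.value}: [{self.target}]({self.target})"

def targets(links: list[Link], page: str, kind: LinkType) -> list[str]:
    """Return the pages that `page` links to with the given relationship."""
    return [l.target for l in links if l.source == page and l.kind is kind]

# Hypothetical links between pages.
links = [
    Link("vllm.md", LinkType.CONTRASTS_WITH, "ollama.md"),
    Link("vllm.md", LinkType.DEPENDS_ON, "gpu-memory-and-context-length.md"),
    Link("ollama.md", LinkType.EXAMPLE_OF, "local-llm-hosting-overview.md"),
]

deps = targets(links, "vllm.md", LinkType.DEPENDS_ON)
label = links[0].render()
```

A link that cannot be given one of these types is a reasonable candidate for deletion.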

Mark uncertainty

LLM Wiki pages should not pretend all knowledge is equally certain. Useful status markers include:

confirmed
likely
disputed
outdated
needs review
source conflict
generated summary

These markers protect readers from false confidence and give maintainers a clear signal about which pages need attention.

How to evaluate an LLM Wiki

Do not only ask whether the generated pages look impressive - ask whether they improve knowledge work. Useful evaluation questions include:

Can users find concepts faster?
Are repeated questions answered better?
Are source links preserved?
Are contradictions easier to see?
Are pages reused?
Are summaries accurate?
Is stale content detected?
Does the wiki reduce repeated synthesis?
Does it help humans write or decide?
Does it improve RAG answer quality?

If the answer is no to most of these, the wiki is decoration regardless of how many pages it contains.

LLM Wiki and knowledge management

LLM Wiki belongs in knowledge management because it is fundamentally about representation, not primarily about model hosting, vector search, or agent execution. It answers a different question: how should knowledge be structured so that humans and AI systems can reuse it? That places it in the knowledge systems architecture layer, connecting naturally to PKM, wikis, RAG, agent memory, knowledge graphs, technical publishing, and research synthesis.

A clean layer model looks like this:

Human thinking - PKM, explore and develop ideas
Shared knowledge - Wiki, maintain canonical pages
Compiled knowledge - LLM Wiki, generate structured synthesis
Machine access - RAG, retrieve context at query time
Agent continuity - Memory, persist behavior and preferences

LLM Wiki occupies the compiled knowledge layer, and that position is what makes it useful - it is the layer that turns a pile of documents into something both humans and machines can navigate and reason over.

My opinionated take

LLM Wiki is important, but the hype is slightly wrong - it is not a RAG killer, but a reminder that knowledge representation matters. The industry spent years optimizing retrieval pipelines, and that work was necessary, but many systems still retrieve from badly structured knowledge. Better embeddings and better rerankers help, but they cannot fully compensate for a weak knowledge layer.

LLM Wiki pushes the conversation back toward structure by asking better questions:

What are the core concepts?
What is canonical?
How do ideas connect?
What should be summarized once?
What should be retrieved fresh?
What should be reviewed by humans?

That is the right conversation, and the future is not just better vector search but layered knowledge systems where representation, retrieval, and memory each play a distinct and well-understood role.

Conclusion

LLM Wiki is an architecture pattern for compiled knowledge that uses language models to help transform source material into structured, linked, reusable knowledge before questions are asked. Its core workflow is:

summarize
  -> structure
  -> link
  -> review
  -> reuse

Compared with RAG, the main difference is timing: RAG performs synthesis at query time, while LLM Wiki performs synthesis at ingest time, which makes it valuable for stable domains, research synthesis, personal knowledge bases, technical blogs, and curated team knowledge.

But it has real limitations. It can drift when sources change, bake hallucinated synthesis into pages at ingest time, create false confidence when review is absent, and collapse into noise when ownership is unclear. Used badly, it becomes another abandoned wiki. Used well, it becomes the representation layer between raw documents and AI systems - not a replacement for RAG, but the missing layer that makes retrieval worth using.

Retrieval vs Representation in Knowledge Systems

Rost — Mon, 18 May 2026 09:22:43 +0000

Most modern knowledge systems optimize retrieval, and that is understandable.
Search is visible, easy to demo, and feels magical when it works. Type a question, get an answer.

But retrieval is only one half of the problem. The deeper question is:

What shape does the knowledge have before anything tries to retrieve it?

That is representation — the structure behind the knowledge:

notes
pages
schemas
graphs
entities
relationships
summaries
taxonomies
source boundaries
canonical versions

Retrieval asks:

Can I find something relevant?

Representation asks:

Is the knowledge organized in a way that makes sense?

These are not the same problem. A RAG system with poor representation becomes a fast interface to a messy archive. It can retrieve fragments, but it cannot fix broken structure. It can quote documents, but it cannot decide which one is canonical. It can assemble context, but it cannot guarantee that the underlying knowledge is coherent.

This is why LLM Wiki style systems are interesting: they shift effort from query time to ingest time. Instead of only retrieving chunks when a user asks a question, they attempt to pre-structure knowledge into pages, concepts, summaries, and links. That does not make RAG obsolete — it means retrieval and representation are different layers, and good knowledge systems need both.

The core distinction

Retrieval is about access; representation is about meaning.

Layer	Question	Examples
Retrieval	How do I find the right information?	search, embeddings, BM25, reranking, vector stores
Representation	How is knowledge structured?	notes, wikis, graphs, schemas, ontologies
Reasoning	How do I use the knowledge?	synthesis, comparison, inference, decision making

A weak system often jumps straight to retrieval; a strong system first asks:

What are the core concepts?
What is the canonical source?
What relationships matter?
What changes over time?
What should be retrieved?
What should already be represented?

This is the difference between search over documents and an actual knowledge system.

Why retrieval became dominant

Retrieval became dominant because it maps well to the modern AI stack. A typical RAG pipeline looks like this:

Load documents
Split them into chunks
Generate embeddings
Store vectors
Retrieve relevant chunks
Optionally rerank them
Put them into an LLM prompt
Generate an answer

This pipeline is practical: it is relatively easy to build, works with messy documents, scales to large corpora, avoids retraining models, and gives LLMs access to current information. That is why RAG became the default pattern for "AI over documents."

But there is a trap:

RAG improves access to knowledge. It does not automatically improve the knowledge.

If your content is duplicated, outdated, contradictory, badly chunked, or poorly named, retrieval will surface those problems — often with confidence.

What representation means

Representation is the way knowledge is shaped before retrieval happens. It answers questions like:

Is this knowledge stored as documents, notes, entities, or facts?
Are relationships explicit or implicit?
Are there canonical pages?
Are there summaries?
Are concepts linked?
Is the system organized by topic, workflow, time, or ownership?
Can a human maintain it?
Can a machine reason over it?

Representation is not decoration — it determines what kind of operations are possible.

Forms of representation

Documents

Documents are the most common representation. Examples include:

articles
PDFs
manuals
reports
README files
support pages
blog posts

Documents are easy for humans to write, but they are often hard for machines to use because they mix facts, narrative, context, examples, opinions, outdated sections, and repeated explanations into the same container. Documents are good containers, but they are not always good knowledge structures.

Notes

Notes are more flexible than documents. They can be:

atomic
linked
private
unfinished
concept-focused

A note system, such as a PKM or second brain, can represent evolving knowledge better than a polished document repository. Good notes capture thinking in progress; bad notes become an unsearchable junk drawer.

Wikis

Wikis represent knowledge as maintained pages. A good wiki has:

stable pages
clear topics
internal links
ownership
canonical answers
update patterns

A wiki is stronger than a loose document dump because it gives knowledge a home. "Deployment checklist" lives in one place. "Incident response" lives in one place. "RAG architecture" lives in one place. That matters because retrieval works better when knowledge has a stable structure.

Knowledge graphs

Knowledge graphs represent knowledge as entities and relationships. Instead of storing only text, they model things like:

Person works on Project
Model supports ContextLength
Page depends on Concept
Service connects to Database
Tool implements Protocol

Graphs are powerful because relationships become explicit, which helps with traversal, dependency analysis, entity resolution, lineage, reasoning, and recommendations. But graphs are expensive to maintain and they are not magic — a bad graph is just structured confusion.
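
To make "relationships become explicit" concrete, here is a minimal sketch of a triple store with one traversal helper. The class name, relation names, and service names are hypothetical, and a real graph database would add indexing, typing, and persistence.

```python
from collections import defaultdict


class TripleStore:
    """Stores (subject, relation, object) triples in memory."""

    def __init__(self):
        self._out = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._out[subject].append((relation, obj))

    def objects(self, subject: str, relation: str) -> list[str]:
        return [o for r, o in self._out.get(subject, []) if r == relation]

    def reachable(self, start: str, relation: str) -> set[str]:
        """All nodes reachable from start by following one relation type."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in self.objects(node, relation):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


graph = TripleStore()
graph.add("api-gateway", "connects_to", "checkout-service")
graph.add("checkout-service", "connects_to", "orders-db")
graph.add("checkout-service", "owned_by", "payments-team")

# Dependency analysis: everything downstream of the gateway.
downstream = graph.reachable("api-gateway", "connects_to")
```

A question like "what breaks if orders-db goes down?" becomes a traversal instead of a text search, which is the practical payoff of explicit edges.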

Schemas and ontologies

Schemas define expected structure; ontologies go further and define types, relations, and constraints. They answer:

What kinds of things exist?
What properties do they have?
How can they relate?
What rules apply?

This is useful when correctness matters, such as in medical knowledge, legal knowledge, enterprise data catalogs, product taxonomies, and compliance systems. The tradeoff is rigidity: the more formal the representation, the more expensive it is to evolve.
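
The four questions above can be sketched as a tiny validation layer. The entity types and allowed relations below are a hypothetical mini-ontology invented for illustration; the point is only that a formal representation can reject statements that a free-text document would silently accept.

```python
from dataclasses import dataclass

# Hypothetical mini-ontology: which kinds of things exist and how they may relate.
ENTITY_TYPES = {"Service", "Database", "Team"}
ALLOWED_RELATIONS = {
    ("Service", "connects_to", "Database"),
    ("Team", "owns", "Service"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


def validate_relation(subject: Entity, relation: str, obj: Entity) -> list[str]:
    """Return a list of rule violations; an empty list means the triple is valid."""
    errors = []
    for entity in (subject, obj):
        if entity.type not in ENTITY_TYPES:
            errors.append(f"unknown type: {entity.type}")
    if not errors and (subject.type, relation, obj.type) not in ALLOWED_RELATIONS:
        errors.append(f"{subject.type} -{relation}-> {obj.type} is not allowed")
    return errors


ok = validate_relation(Entity("checkout", "Service"), "connects_to", Entity("orders", "Database"))
bad = validate_relation(Entity("orders", "Database"), "owns", Entity("platform", "Team"))
```

The rigidity tradeoff is visible here too: every new relation type means editing `ALLOWED_RELATIONS` before anyone can record it.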

LLM-generated representations

Modern systems increasingly use LLMs to create representations. Examples include:

summaries
extracted entities
topic pages
concept maps
synthetic FAQs
document outlines
cross-links
glossary entries

This is where LLM Wiki style systems sit. They use the model not only to answer queries but to pre-process and structure knowledge before the query happens. RAG says "retrieve relevant chunks at query time"; LLM Wiki says "compile useful knowledge structures at ingest time." Both patterns can coexist in the same architecture.

What retrieval means

Retrieval is the process of finding relevant information. Common retrieval methods include:

keyword search
full text search
vector search
hybrid search
metadata filtering
graph traversal
reranking
query rewriting
agentic search

Retrieval is not one thing — it is a layered stack of complementary methods.

Keyword search

Keyword search matches terms and is still useful because it is predictable, debuggable, fast, and good for exact terms, IDs, error messages, names, and code. Its weakness is semantic mismatch: if the user searches "how to stop repeated answers" but the document says "presence penalty", keyword search may miss the best result.

Vector search

Vector search retrieves by semantic similarity. It is useful when:

wording differs
concepts are fuzzy
users ask natural language questions
documents use inconsistent terminology

Its weakness is precision — vector search can retrieve things that feel related but are not actually correct, which is especially risky in technical systems.

Hybrid search

Hybrid search combines keyword and vector retrieval, which is often better than either alone. Keyword search catches exact matches; vector search catches conceptual matches. For technical knowledge bases, hybrid retrieval is usually a strong default.
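
The article does not prescribe how to merge keyword and vector results. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the recommended method. The constant `k=60` is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked well in multiple lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc-presence-penalty", "doc-sampling"]    # exact-term matches
vector_hits = ["doc-repetition", "doc-presence-penalty"]   # semantic matches
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF only uses rank positions, so it avoids having to normalize keyword scores and vector similarities onto the same scale.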

Reranking

Reranking takes an initial set of retrieved results and reorders them using a stronger model. This improves quality because the first retrieval step is often broad. A typical pattern retrieves 50 chunks, reranks to the top 5 or 10, then passes only the best context to the LLM. Reranking is one of the most practical ways to improve RAG quality.
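
The retrieve-broadly-then-rerank pattern can be sketched as a two-stage sort. Both scoring functions here are deliberately crude stand-ins: `word_overlap` plays the cheap first-stage retriever and `overlap_density` plays the "stronger model", which in practice would usually be a learned reranker. The candidate texts are invented.

```python
from typing import Callable


def word_overlap(query: str, text: str) -> int:
    """Cheap first-stage score: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def overlap_density(query: str, text: str) -> float:
    """Stand-in for a stronger scorer: shared words relative to text length."""
    words = text.lower().split()
    return word_overlap(query, text) / len(words) if words else 0.0


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    cheap_score: Callable[[str, str], float],
    strong_score: Callable[[str, str], float],
    broad_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    # Stage 1: broad, cheap retrieval.
    broad = sorted(candidates, key=lambda c: cheap_score(query, c), reverse=True)[:broad_k]
    # Stage 2: expensive scoring on the short list only.
    return sorted(broad, key=lambda c: strong_score(query, c), reverse=True)[:final_k]


candidates = [
    "local development setup notes that mention production deploy once in passing",
    "production deploy checklist",
    "team lunch schedule",
]
best = retrieve_then_rerank(
    "production deploy", candidates, word_overlap, overlap_density, broad_k=2, final_k=1
)
```

The first stage cannot tell the two deployment documents apart; the second stage can, and it only has to score two candidates to do so.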

Agentic retrieval

Agentic retrieval turns search into a process. Instead of one query, an agent may:

Ask an initial question
Search
Inspect results
Reformulate the query
Search again
Compare sources
Synthesize an answer

This is closer to research than search. It is useful for complex questions, but it is slower and harder to control.
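
A minimal sketch of that loop, assuming three injected callables: `search`, `reformulate`, and `is_sufficient`. In a real agent the last two would typically be LLM calls; here they are hard-coded stubs, and the compare and synthesize steps are omitted. The `max_rounds` cap is one simple way to keep the process controllable.

```python
def agentic_search(question, search, reformulate, is_sufficient, max_rounds=3):
    """Search, inspect, reformulate, and search again until evidence suffices."""
    query, evidence, rounds = question, [], 0
    while rounds < max_rounds:
        rounds += 1
        for result in search(query):
            if result not in evidence:
                evidence.append(result)
        if is_sufficient(evidence):
            break
        query = reformulate(question, evidence)
    return {"rounds": rounds, "final_query": query, "evidence": evidence}


# Hypothetical one-document corpus keyed by the term the document actually uses.
corpus = {"presence penalty": "Presence penalty discourages tokens that already appeared."}


def search(query):
    return [text for term, text in corpus.items() if term in query.lower()]


def reformulate(question, evidence):
    # Stub: a real agent would ask a model to rewrite the query.
    return question + " presence penalty"


def is_sufficient(evidence):
    return len(evidence) >= 1


outcome = agentic_search("how to stop repeated answers", search, reformulate, is_sufficient)
```

The first round finds nothing because the user's wording does not match the document's term; the reformulated second round does.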

Retrieval without representation is fragile

A retrieval system can only retrieve what exists. It cannot reliably fix:

unclear concepts
duplicate pages
inconsistent terminology
stale documentation
missing source ownership
contradictory statements
weak internal linking
bad document boundaries

This is the most common mistake in RAG projects: teams build a vector database and expect it to become a knowledge system. A vector database is not a knowledge architecture — it is an access layer.

Representation without retrieval is isolated

The opposite failure also exists. You can have a beautifully structured knowledge base that nobody can find. This happens with:

over-designed wikis
deep folder trees
rigid taxonomies
poorly indexed documentation
private note systems with no discovery
graphs without usable interfaces

Representation gives knowledge structure; retrieval gives knowledge reach. You need both.

The tradeoff map

Speed vs coherence

Retrieval is fast to build and representation takes longer. If you need a prototype, retrieval wins; if you need long-term trust, representation matters more.

Priority	Better starting point
Fast Q&A over many docs	Retrieval
Stable technical knowledge	Representation
Exploratory research	PKM plus retrieval
Enterprise assistant	Structured corpus plus RAG
Agent memory	Representation plus selective retrieval

A pure RAG prototype can be built quickly, but a reliable knowledge system takes curation.

Flexibility vs consistency

Loose documents are flexible; structured knowledge is consistent. Flexibility helps when:

the domain changes quickly
knowledge is incomplete
users are exploring
the system is personal

Consistency helps when:

multiple people rely on it
answers must be trusted
workflows depend on it
AI systems consume it

The more people or agents depend on knowledge, the more representation matters.

Recall vs precision

Retrieval systems often optimize recall first, which means finding anything that might be relevant. But good answers need precision, which means finding the best evidence rather than merely related evidence. Representation improves precision by making concepts and boundaries clearer — a well-structured page is easier to retrieve accurately than a random paragraph buried inside a long document.

Ingest-time cost vs query-time cost

RAG usually pushes work to query time. At query time, the system:

rewrites the query
retrieves chunks
reranks results
assembles context
asks the model to reason over fragments

LLM Wiki style systems push more work to ingest time. At ingest time, the system:

reads sources
extracts concepts
writes summaries
creates pages
links related ideas
maintains structure

Architecture	Expensive step	Benefit
RAG	Query time	Flexible retrieval
LLM Wiki	Ingest time	Pre-compiled structure
Knowledge graph	Modeling time	Explicit relationships
Wiki	Maintenance time	Canonical knowledge

None of these is universally better — they optimize different costs.

Why LLM Wiki exists

LLM Wiki exists because retrieval alone often repeats work. In a normal RAG system, every query may force the model to interpret raw fragments again:

Retrieve chunks about a topic
Ask the LLM to infer the concept
Generate an answer
Forget the synthesis
Repeat next time

LLM Wiki says:

Stop re-deriving the same synthesis. Compile it.

Instead of only storing raw documents, it creates structured pages that summarize and connect knowledge, which can improve coherence, reuse, token efficiency, human readability, and long-term maintenance. But it has a cost: the system must maintain the wiki, and if the wiki is wrong, stale, or hallucinated, the structure becomes dangerous.
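
The "compile it" idea can be sketched as a small class that runs synthesis when a source is ingested and serves the stored page at query time. The `synthesize` callable is a stand-in for an LLM call, and `first_sentences` is a trivial stub used only so the example runs; class and field names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    topic: str
    summary: str
    sources: list[str] = field(default_factory=list)


class CompiledWiki:
    def __init__(self, synthesize):
        self.synthesize = synthesize                  # stand-in for an LLM call
        self.raw: dict[str, dict[str, str]] = {}      # topic -> {source_id: text}
        self.pages: dict[str, WikiPage] = {}
        self.synthesis_calls = 0

    def ingest(self, topic: str, source_id: str, text: str) -> None:
        """Pay the synthesis cost once, when a source arrives."""
        self.raw.setdefault(topic, {})[source_id] = text
        self.synthesis_calls += 1
        summary = self.synthesize(list(self.raw[topic].values()))
        self.pages[topic] = WikiPage(topic, summary, sorted(self.raw[topic]))

    def lookup(self, topic: str):
        """Query time: read the compiled page, no synthesis."""
        return self.pages.get(topic)


def first_sentences(texts: list[str]) -> str:
    # Stub synthesizer: keeps the first sentence of each source.
    return " ".join(t.split(". ")[0].rstrip(".") + "." for t in texts)


wiki = CompiledWiki(first_sentences)
wiki.ingest("reranking", "doc-1", "Reranking reorders results. It uses a stronger model.")
page = wiki.lookup("reranking")
page_again = wiki.lookup("reranking")
```

Two lookups cost one synthesis call. The page also keeps its source IDs, which matters for the audit concerns discussed later.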

RAG hallucination vs bad representation

People often blame the LLM when a RAG system gives a bad answer, and sometimes that is correct. But many failures are actually retrieval or representation failures.

Failure mode 1. Correct document, wrong chunk

The answer exists, but chunking splits it badly. The model receives:

half of a paragraph
missing context
a table without explanation
a definition without constraints

The LLM fills those gaps, which looks like hallucination, but the deeper problem is broken representation.
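
A small demonstration of this failure, using an invented two-paragraph document. Fixed-size chunking separates a definition from its constraint, while splitting on paragraph boundaries keeps them together. Paragraph splitting is only one possible fix and assumes the source has meaningful paragraphs.

```python
def fixed_size_chunks(text: str, words_per_chunk: int) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def paragraph_chunks(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]


doc = (
    "Rate limit: 100 requests per minute. This limit applies only to the free tier.\n\n"
    "Paid tiers are limited per contract."
)

fixed = fixed_size_chunks(doc, 6)
by_paragraph = paragraph_chunks(doc)
```

Here `fixed[0]` is "Rate limit: 100 requests per minute." with no mention of the free tier. A model given only that chunk will state the limit as universal.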

Failure mode 2. Related chunk, wrong answer

Vector search retrieves something semantically similar but operationally wrong. The query asks about production deployment; the retrieved chunk discusses local development. The terms overlap but the meaning differs, so the model answers with local setup instructions for a production problem. This is retrieval imprecision.

Failure mode 3. Conflicting sources

Two documents disagree — one old, one new. The retrieval system returns both, and the LLM merges them into a confident but invalid answer. This is not just a retrieval problem but a representation problem, because the knowledge base lacks canonical state.
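
One way to give a corpus canonical state is to record supersession explicitly and filter on it before context assembly. The sketch below assumes each source carries an `updated` date and an optional `superseded_by` pointer; both fields and the document IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Source:
    doc_id: str
    text: str
    updated: date
    superseded_by: Optional[str] = None  # set when a newer page replaces this one


def canonical_only(hits: list[Source]) -> list[Source]:
    """Drop superseded documents and return the rest, newest first."""
    live = [s for s in hits if s.superseded_by is None]
    return sorted(live, key=lambda s: s.updated, reverse=True)


hits = [
    Source("deploy-v1", "Deploy with the legacy script.", date(2024, 3, 1), superseded_by="deploy-v2"),
    Source("deploy-v2", "Deploy with the release pipeline.", date(2026, 2, 10)),
]
context = canonical_only(hits)
```

The filter is trivial; the hard part is the representation work of deciding, and recording, which page replaces which.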

Failure mode 4. No concept model

The system has many documents but no model of the domain. It does not know that:

"agent memory" differs from "RAG"
"wiki" differs from "PKM"
"embedding search" differs from "full text search"
"deployment" differs from "hosting"

Without conceptual representation, retrieval becomes fuzzy matching.

Failure mode 5. Generated structure becomes fake authority

LLM Wiki systems have their own failure mode. If an LLM generates a clean page from bad sources, the result can look more authoritative than the original material. This is dangerous: a polished hallucination is worse than a messy source document. Any generated representation needs:

source links
review
update rules
confidence markers
ownership
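
Those five requirements can be turned into a publishing gate. The sketch below is one possible shape, with hypothetical field names and a made-up URL; it refuses to treat a generated page as publishable until every item on the list is present.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class GeneratedPage:
    title: str
    body: str
    source_links: list[str]
    owner: Optional[str]
    reviewed_by: Optional[str]
    confidence: Optional[str]      # expected: "high", "medium", or "low"
    review_after: Optional[date]   # update rule: when the page must be re-checked


def publish_blockers(page: GeneratedPage, today: date) -> list[str]:
    """List every reason this generated page should not be published yet."""
    blockers = []
    if not page.source_links:
        blockers.append("no source links")
    if page.reviewed_by is None:
        blockers.append("not reviewed")
    if page.review_after is None or page.review_after <= today:
        blockers.append("review date missing or passed")
    if page.confidence not in {"high", "medium", "low"}:
        blockers.append("no confidence marker")
    if page.owner is None:
        blockers.append("no owner")
    return blockers


draft = GeneratedPage("Reranking", "Generated summary text.", [], None, None, None, None)
ready = GeneratedPage(
    "Reranking", "Generated summary text.", ["https://example.com/reranking-notes"],
    owner="search-team", reviewed_by="alex", confidence="medium",
    review_after=date(2026, 12, 1),
)
draft_blockers = publish_blockers(draft, date(2026, 5, 18))
ready_blockers = publish_blockers(ready, date(2026, 5, 18))
```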

Design implications

Optimize retrieval when the corpus is large and dynamic

Retrieval should be the priority when:

the corpus is huge
documents change frequently
users ask many unpredictable questions
you need broad coverage
perfect structure is unrealistic

Examples: support knowledge bases, enterprise document search, research assistants, internal chat over many files, legal discovery, and customer service bots. In these cases, invest in strong retrieval:

hybrid search
metadata filters
reranking
query rewriting
source citation
evaluation sets

Optimize representation when coherence matters

Representation should be the priority when:

knowledge must be trusted
answers must be consistent
concepts are reused often
the domain has clear structure
multiple systems depend on it

Examples: architecture knowledge, product documentation, compliance rules, API references, operational runbooks, curated research collections, and technical blog clusters. In these cases, invest in:

canonical pages
glossary terms
diagrams
internal links
ownership
versioning
review cadence

Optimize both when AI systems depend on knowledge

If an AI agent depends on the knowledge, retrieval alone is usually not enough. Agents need:

stable context
clear task rules
durable memory
structured references
source boundaries
update behavior

For agentic systems, representation becomes part of system design. A coding agent does not only need to retrieve "some docs" — it needs to know:

project conventions
architecture decisions
command patterns
forbidden dependencies
testing workflow
deployment rules

Some of that belongs in RAG, some belongs in memory, and some belongs in structured project documentation.

Practical decision framework

If the problem is finding information

Optimize retrieval. Examples:

"Find relevant pages."
"Answer questions over documents."
"Search across many PDFs."
"Locate similar support tickets."

Use:

full text search
vector search
hybrid retrieval
reranking
metadata filtering

If the problem is making knowledge coherent

Optimize representation. Examples:

"Create a canonical explanation."
"Resolve duplicate pages."
"Define the domain model."
"Build a stable knowledge base."

Use:

wiki pages
concept maps
taxonomies
knowledge graphs
summaries
schemas

If the problem is repeated synthesis

Use compiled representation. Examples:

"We answer the same conceptual questions repeatedly."
"The system keeps re-summarizing the same sources."
"We need a stable synthesis layer."

Use:

LLM Wiki
curated summaries
topic pages
human-reviewed generated pages

If the problem is adaptive continuity

Use memory. Examples:

"The agent should remember user preferences."
"The coding agent should remember project conventions."
"The assistant should continue work across sessions."

Use:

agent memory
preference stores
episodic memory
semantic memory
project memory

How this applies to a technical blog

A technical blog can be more than a sequence of posts: it can become a represented knowledge system. Articles are documents, categories are a weak taxonomy, internal links are graph edges, pillar pages are canonical summaries, series pages are curated pathways, and search is retrieval. If you only publish isolated posts, retrieval has to work harder. If you build strong representation, retrieval becomes easier.

That means:

clear cluster boundaries
stable slugs
canonical pages
comparison pages
glossary-style explainers
internal links
structured metadata

This is why site architecture matters — not just for SEO, but because it is knowledge representation. The Knowledge Management cluster on this site is itself an example of representation-first publishing.

How this applies to RAG

RAG quality depends heavily on representation. A well-structured source corpus improves:

chunk quality
retrieval accuracy
citation quality
answer consistency
evaluation clarity

Before building a complex RAG pipeline, ask:

Are the source documents current?
Are duplicates removed?
Are important concepts clearly named?
Are pages scoped correctly?
Are tables and code blocks retrievable?
Are canonical answers obvious?
Are document boundaries meaningful?

If the answer to any of these is no, better embeddings will only help so much.
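
Two of these checks, duplicates and currency, are easy to automate in a first pass. The sketch below assumes each document has a last-updated date and treats anything older than `max_age_days` as stale; that threshold is an arbitrary assumption. The hash comparison only catches duplicates that are identical after case and whitespace normalization, not paraphrases.

```python
import hashlib
from datetime import date


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def audit_corpus(docs: dict[str, tuple[str, date]], today: date, max_age_days: int = 365) -> dict:
    seen: dict[str, str] = {}   # fingerprint -> first doc_id with that content
    duplicates, stale = [], []
    for doc_id, (text, updated) in docs.items():
        fp = fingerprint(text)
        if fp in seen:
            duplicates.append((doc_id, seen[fp]))
        else:
            seen[fp] = doc_id
        if (today - updated).days > max_age_days:
            stale.append(doc_id)
    return {"duplicates": duplicates, "stale": stale}


docs = {
    "deploy-guide": ("Deploy with make release.", date(2026, 1, 1)),
    "deploy-guide-copy": ("deploy  with MAKE release.", date(2024, 1, 1)),
}
report = audit_corpus(docs, today=date(2026, 5, 1))
```

The remaining questions on the list, such as scoping and naming, need human judgment and do not reduce to a script this simple.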

How this applies to LLM Wiki

LLM Wiki is a representation-first pattern. It is useful when:

the corpus is small or medium sized
knowledge is stable enough to summarize
repeated synthesis is expensive
humans benefit from readable pages
you want structure before retrieval

It is less useful when:

the corpus is massive
content changes constantly
freshness is more important than coherence
governance is weak
generated summaries cannot be reviewed

LLM Wiki is not a replacement for RAG but a different layer, and a strong system can use both:

LLM Wiki creates structured summaries.
RAG retrieves from raw sources and wiki pages.
Human review keeps the representation trustworthy.

Suggested architecture patterns

Pattern 1. Retrieval first

Use when speed matters.

documents
  -> chunks
  -> embeddings
  -> retrieval
  -> LLM answer

Good for:

prototypes
broad search
large corpora
early experiments

Weakness: coherence depends on source quality.

Pattern 2. Representation first

Use when trust matters.

sources
  -> curated pages
  -> internal links
  -> maintained knowledge base
  -> search or RAG

Good for:

documentation
technical knowledge
long-term content
team knowledge

Weakness: requires maintenance.

Pattern 3. Compiled knowledge

Use when repeated synthesis matters.

raw sources
  -> LLM extraction
  -> generated summaries
  -> topic pages
  -> reviewed knowledge base
  -> retrieval

Good for:

LLM Wiki systems
research collections
personal knowledge bases
stable domains

Weakness: generated structure must be audited.

Pattern 4. Hybrid knowledge architecture

Use when building serious systems.

raw documents
  -> structured knowledge layer
  -> search index
  -> retrieval and reranking
  -> AI answer
  -> feedback and maintenance

Good for:

production RAG
internal knowledge systems
AI assistants
technical publishing systems

Weakness: more moving parts.

Evaluation questions

To evaluate retrieval, ask:

Did the system find the right source?
Did it rank the right source highly?
Did it retrieve enough context?
Did it avoid irrelevant context?
Did the answer cite the correct source?
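
The first two questions map onto standard ranking metrics. A minimal sketch of recall at k and reciprocal rank follows, assuming you have a labeled evaluation set that says which document IDs are relevant for each query; the IDs here are placeholders.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["doc-a", "doc-b", "doc-c"]
relevant = {"doc-b", "doc-d"}

found = recall_at_k(retrieved, relevant, k=2)   # did it find the right source?
ranked = reciprocal_rank(retrieved, relevant)   # did it rank the right source highly?
```

Averaging `reciprocal_rank` over many queries gives mean reciprocal rank (MRR).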

To evaluate representation, ask:

Is the knowledge structured clearly?
Is there a canonical page?
Are concepts named consistently?
Are relationships explicit?
Is the content maintained?
Can humans and machines both use it?

Do not evaluate a knowledge system only by answer quality — a good answer can hide a bad structure.

The opinionated rule

If your system fails occasionally, improve retrieval. If it fails repeatedly in the same conceptual area, improve representation.

Bad retrieval misses the right information. Bad representation means the right information does not really exist in a usable shape.
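
If failed answers are logged with a concept label, the rule can be applied mechanically. The sketch below assumes such labels exist and uses a repeat threshold of three, which is an arbitrary choice for illustration, not a number from this article.

```python
from collections import Counter


def diagnose(failure_areas: list[str], repeat_threshold: int = 3) -> dict[str, str]:
    """Map each concept area to a suggested fix based on how often it fails."""
    counts = Counter(failure_areas)
    return {
        area: "improve representation" if n >= repeat_threshold else "improve retrieval"
        for area, n in counts.items()
    }


failures = ["deployment", "deployment", "billing", "deployment"]
plan = diagnose(failures)
```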

Conclusion

Retrieval and representation solve different problems: retrieval gives access, representation gives structure. RAG is powerful because it makes external knowledge available to LLMs at query time, but RAG does not automatically make knowledge coherent, canonical, or maintained. That is why wikis, PKM systems, knowledge graphs, and LLM Wiki style systems still matter.

The future is not retrieval vs representation but layered knowledge systems:

representation for structure
retrieval for access
memory for continuity
reasoning for synthesis

If you are building a serious knowledge system, do not start with the vector database. Start with the shape of the knowledge, then decide how it should be retrieved.

PKM vs RAG vs Wiki vs Memory Systems Explained Clearly

Rost — Sun, 17 May 2026 08:50:06 +0000

PKM, RAG, wikis, and AI memory systems are often discussed as if they solve the same problem.
They do not.
They all deal with knowledge, but they operate at different layers:

PKM helps humans think.
Wikis help groups preserve shared knowledge.
RAG helps machines retrieve external knowledge.
Memory systems help AI agents persist context over time.

Confusing these systems leads to bad architecture.

You get wikis full of personal scratch notes, RAG systems without a source of truth, memory layers pretending to be databases, and PKM tools overloaded with automation they were never designed to handle.

A better model is to see them as different parts of a knowledge systems spectrum.

This article compares PKM, RAG, wikis, and AI memory systems by structure, retrieval, ownership, evolution, and real-world use cases.

The short version

System	Primary user	Main purpose	Best for
PKM	Individual	Develop personal knowledge	Thinking, learning, synthesis
Wiki	Team or public group	Maintain shared knowledge	Documentation, policies, reference
RAG	Machine system	Retrieve context for generation	AI answers over external data
AI memory	AI agent	Persist context over time	Long-running agents and personalization

The most important distinction is this:

PKM and wikis structure knowledge. RAG retrieves knowledge. Memory systems evolve agent context.

That is the core mental model.

Why these systems are confused

They overlap in visible behavior.

All of them can:

store notes
retrieve information
answer questions
organize references
connect ideas

But they differ in intent.

A PKM system is not just a private wiki.
A wiki is not just a RAG database.
A RAG pipeline is not an AI memory.
An AI memory system is not a replacement for structured documentation.

The confusion comes from treating "knowledge" as one thing.

In practice, knowledge has multiple layers:

Capture
Structure
Retrieval
Interpretation
Reuse
Evolution

Different systems optimize different stages.

The four paradigms

1. PKM

PKM stands for personal knowledge management.

It is the practice of capturing, organizing, connecting, and using knowledge for personal work.

Typical PKM systems include:

Obsidian
Logseq
Notion
plain Markdown folders
Zettelkasten systems
second brain systems

PKM is human driven.

The goal is not just storage. The goal is better thinking.

What PKM is good at

PKM works well for:

learning a new domain
developing original ideas
connecting notes over time
writing articles or books
tracking personal research
building a second brain

A good PKM system is messy in a useful way. It supports unfinished thoughts, partial ideas, private context, and evolving concepts.

This is why PKM is not the same as documentation.

Documentation wants clarity.
PKM tolerates ambiguity.

PKM failure modes

PKM often fails when it becomes:

a dumping ground
a folder taxonomy project
a productivity aesthetic
a tool optimization hobby
a private archive nobody uses

The main risk is collection without synthesis.

If you only save information, you do not have a knowledge system. You have a personal landfill.

Opinionated take

PKM should optimize for reuse, not capture.

Capturing everything feels productive, but it creates debt. The real value appears when notes become connected, rewritten, compressed, and used in output.

2. Wiki

A wiki is a structured knowledge base designed for shared reference.

Typical wiki systems include:

DokuWiki
MediaWiki
Confluence
BookStack
Git-based documentation sites
internal company knowledge bases

A wiki is usually more formal than PKM.

It should answer:

What do we know, and where is the current version?

What wikis are good at

Wikis work well for:

team documentation
operational runbooks
product knowledge
policy documents
technical reference
onboarding material
stable domain knowledge

A wiki is a social contract.

It says:

This page is the place where this knowledge lives.

That makes ownership and maintenance critical.

Wiki failure modes

Wikis often fail because they become stale.

Common problems:

no page owners
outdated screenshots
duplicate pages
unclear canonical versions
too much hierarchy
no maintenance rhythm

A wiki with old information is worse than no wiki, because it creates false confidence.

Opinionated take

A wiki should be boring.

That is a compliment.

A good wiki is not where ideas are born. It is where stable knowledge is preserved after it becomes useful to others.

3. RAG

RAG stands for retrieval-augmented generation.

It is an AI architecture where a system retrieves relevant external information before asking a language model to generate an answer.

A basic RAG pipeline usually has:

Documents
Chunking
Embeddings or search index
Retrieval
Optional reranking
Prompt assembly
LLM generation

RAG is machine driven.

The goal is not to create knowledge. The goal is to give a model relevant context at query time.

What RAG is good at

RAG works well for:

question answering over documents
internal search assistants
support bots
technical documentation assistants
compliance lookup
research over large corpora
connecting LLMs to up-to-date information

RAG is especially useful when the model cannot or should not memorize the information.

RAG failure modes

RAG often fails when teams treat it as magic search.

Common problems:

bad chunking
weak retrieval
noisy context
missing metadata
no source of truth
stale documents
weak evaluation
no human feedback loop

RAG does not fix bad knowledge management.

If the underlying content is fragmented, outdated, or contradictory, the RAG system will surface that mess with confidence.

Opinionated take

RAG is not a knowledge strategy.

RAG is an access strategy.

It helps machines access knowledge, but it does not decide what knowledge is valid, maintained, canonical, or useful.

4. AI memory systems

AI memory systems give agents persistent context beyond a single prompt or conversation.

They may store:

user preferences
past decisions
long-term facts
task history
summaries
reflections
extracted entities
episodic memories
semantic memories

Examples and related ideas include:

MemGPT-style memory tiers
long-term agent memory
episodic memory
semantic memory
vector memory
profile memory
tool state memory
reflective agents

AI memory is agent driven.

The goal is continuity.

What AI memory is good at

AI memory systems work well for:

personal assistants
long-running coding agents
research agents
customer support agents
tutoring systems
workflow automation
persistent companions
multi-session task execution

Memory matters when the system must behave as if it remembers.

AI memory failure modes

Memory systems are dangerous when unmanaged.

Common problems:

remembering wrong facts
storing too much
privacy risk
stale preferences
poor memory ranking
memory poisoning
no forgetting mechanism
confusing memory with truth

A memory system needs governance.

It should answer:

What should be remembered?
Who approved it?
How long should it live?
When should it be forgotten?
How is it corrected?
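
Those five questions can be sketched as a small store where every write needs an approver, every item may carry an expiry, and correction is an explicit overwrite. The class and field names are hypothetical, and a real system would also need audit logs and access control.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MemoryItem:
    value: str
    approved_by: str
    expires: Optional[date] = None  # None means no expiry


class GovernedMemory:
    def __init__(self):
        self._items: dict[str, MemoryItem] = {}

    def remember(self, key: str, value: str, approved_by: str,
                 expires: Optional[date] = None) -> bool:
        """Store or correct an item; refuse writes without an approver."""
        if not approved_by:
            return False
        self._items[key] = MemoryItem(value, approved_by, expires)
        return True

    def recall(self, key: str, today: date) -> Optional[str]:
        item = self._items.get(key)
        if item is None:
            return None
        if item.expires is not None and item.expires <= today:
            del self._items[key]  # forget expired memories on access
            return None
        return item.value

    def forget(self, key: str) -> None:
        self._items.pop(key, None)


memory = GovernedMemory()
memory.remember("language", "Go", approved_by="user")
memory.remember("sprint_goal", "ship v2", approved_by="user", expires=date(2026, 6, 1))
rejected = memory.remember("editor", "vim", approved_by="")
```

Calling `remember` again with the same key is the correction path: the old value is replaced rather than kept alongside the new one.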

Opinionated take

AI memory is not just long context.

Long context lets a model see more at once.
Memory decides what survives across time.

Those are different problems.

Core differences table

Dimension	PKM	Wiki	RAG	AI memory
Primary user	Individual	Team or public group	AI system	AI agent
Main function	Thinking	Shared reference	Query-time retrieval	Persistent context
Knowledge state	Evolving	Stabilized	Retrieved	Adaptive
Structure	Flexible	Explicit	Index-based	Learned or extracted
Retrieval style	Human search and linking	Navigation and search	Semantic or hybrid retrieval	Relevance plus salience
Ownership	Personal	Page or team owners	System maintainers	Agent or user controlled
Time horizon	Long term personal	Long term shared	Query time	Multi-session
Best output	Insight	Reliable reference	Grounded answer	Continuity
Main risk	Hoarding	Staleness	Bad retrieval	Bad memory
Good metric	Reuse in thinking	Trust and freshness	Answer quality	Helpful continuity

Structure vs retrieval vs evolution

The simplest way to understand these systems is to compare what they optimize. The architectural implications of that distinction are explored in depth in Retrieval vs Representation in Knowledge Systems.

PKM optimizes personal evolution

PKM is about how your understanding changes.

You collect material, rewrite it, connect it, and turn it into something useful.

The output is often:

a better mental model
a written article
a decision
a research direction
a reusable insight

PKM is not primarily about fast lookup. It is about long-term sensemaking.

Wikis optimize shared structure

Wikis are about stable knowledge.

They ask:

What is the current answer?
Who owns it?
Where should people go?
What should be updated?

A wiki works when people trust it.

RAG optimizes machine retrieval

RAG is about retrieving the right context at the right time.

It asks:

What documents are relevant?
Which chunks should be used?
How much context fits?
What should the model cite?

RAG works when retrieval quality is high and the source corpus is trustworthy.

AI memory optimizes continuity

Memory systems are about persistence across sessions.

They ask:

What should the agent remember?
What should be forgotten?
Which memory matters now?
How should memory change behavior?

Memory works when it improves future behavior without polluting the agent with stale or incorrect context.

When to use PKM

Use PKM when the knowledge is personal, unfinished, or exploratory.

Good scenarios:

learning distributed systems
planning articles
researching LLM architecture
collecting book notes
building a second brain
tracking personal experiments

Use PKM when you are still thinking.

Example

You are learning about RAG evaluation.

You collect:

articles
benchmark notes
diagrams
implementation ideas
failures from your own experiments

This belongs in PKM first.

Later, once the knowledge stabilizes, you may publish an article or turn it into documentation.

When to use a wiki

Use a wiki when knowledge must be shared and maintained.

Good scenarios:

team onboarding
API documentation
operational runbooks
architecture decision records
product knowledge
deployment instructions
support procedures

Use a wiki when others need a reliable answer.

Example

Your team has one correct way to deploy a Hugo site to S3 and CloudFront.

That does not belong only in someone's private notes.

It belongs in a wiki or documentation system with clear ownership.

When to use RAG

Use RAG when an AI system needs access to external knowledge at query time.

Good scenarios:

chatbot over documentation
search assistant over internal docs
support assistant over help articles
legal or compliance assistant
research over large document sets
developer assistant over code docs

Use RAG when the problem is:

The model needs information that lives outside its weights.

Example

You have hundreds of technical articles and want an assistant to answer questions using them.

RAG is a good fit.

But only if the documents are clean enough to retrieve from.

When to use AI memory

Use AI memory when an agent needs continuity.

Good scenarios:

coding agents that remember project conventions
personal assistants that remember preferences
research agents that continue long investigations
tutoring agents that remember student progress
support agents that remember prior interactions
autonomous agents that track goals

Use memory when the system must improve over time.

Example

A coding agent should remember:

the project uses Go
tests run with a specific command
the user prefers minimal dependencies
database migrations follow a convention

That is not just retrieval. It is persistent operating context.

How these systems combine

The most useful systems are hybrids.

A mature knowledge architecture might look like this:

PKM for personal exploration
Wiki for stable shared knowledge
RAG for machine access
AI memory for long-running agent continuity

Each layer has a job.

Pattern 1. PKM to wiki

This is the human knowledge pipeline.

Flow:

Capture notes privately
Connect ideas
Distill insights
Publish stable knowledge
Maintain as shared reference

This is how personal research becomes organizational knowledge.

Example

You research self-hosted knowledge tools in Obsidian.

After testing DokuWiki, Nextcloud, and static Markdown systems, you write a stable guide on your site or in your team wiki.

PKM created the insight.
The wiki preserves the result.

Pattern 2. Wiki to RAG

This is the machine access pipeline.

Flow:

Maintain canonical wiki pages
Index them
Retrieve relevant sections
Generate grounded answers
Link back to sources

This is one of the cleanest RAG patterns.

The wiki remains the source of truth.
RAG becomes the access layer.

Example

A support bot answers questions using a product wiki.

The bot should not replace the wiki. It should cite and route users back to the canonical pages.

Pattern 3. RAG plus memory

This is the agent continuity pipeline.

Flow:

RAG retrieves external facts
Memory stores user or task context
The agent combines both
Future behavior improves

RAG answers:

What does the knowledge base say?

Memory answers:

What matters about this user, project, or task?

Example

A coding agent uses RAG to retrieve framework docs.

It uses memory to remember that your project avoids ORMs, prefers sqlc, and uses structured logging.

Those are different knowledge types.
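
A minimal sketch of how the two knowledge types might be kept separate when the agent's context is assembled. The function name and section labels are illustrative assumptions; the memory entries mirror the example above.

```python
def assemble_context(task: str, retrieved_docs: list[str], memory: dict[str, str]) -> str:
    """Build one prompt with memory and retrieval in clearly separated sections."""
    parts = ["Project context (from memory):"]
    parts += [f"- {key}: {value}" for key, value in memory.items()]
    parts.append("Reference material (from retrieval):")
    parts += [f"- {doc}" for doc in retrieved_docs]
    parts.append(f"Task: {task}")
    return "\n".join(parts)


project_memory = {
    "data access": "avoid ORMs, prefer sqlc",
    "logging": "structured logging",
}
framework_docs = ["Framework routing guide, section on middleware."]

context = assemble_context("add a request logging middleware", framework_docs, project_memory)
```

Labeling the sections lets the model, and anyone debugging the prompt, see which statements are durable project rules and which are retrieved reference text.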

Pattern 4. PKM plus AI assistant

This is the hybrid thinking pipeline.

Flow:

Human captures notes
AI summarizes and suggests links
Human edits and validates
Knowledge becomes more structured
Some pages graduate to wiki or publication

The AI augments the PKM system, but it should not own the truth.

Example

An AI assistant can suggest connections between notes about RAG, memory systems, and LLM Wiki.

But the human decides which connections are meaningful.

Common architecture mistakes

Mistake 1. Treating RAG as a wiki

RAG is not a knowledge base.

It does not automatically create a canonical structure. It retrieves from whatever exists.

If the source documents are bad, RAG becomes a confident interface to bad knowledge.

Mistake 2. Treating memory as a database

AI memory is selective context, not general storage.

A database stores records.
Memory changes behavior.

If you need exact facts, use a database or knowledge base.
If you need continuity, use memory.

Mistake 3. Treating PKM as documentation

PKM can be messy.

Documentation should not be.

Private notes can contain half-formed ideas. Shared documentation should contain stable, maintained knowledge.

Mistake 4. Treating a wiki as a thinking tool

A wiki can support thinking, but it is not ideal for early exploration.

If every early thought must become a polished page, people stop writing.

Use PKM for rough thinking. Use wikis for durable knowledge.

Mistake 5. Treating long context as memory

Long context is not memory.

It only helps while the context is present.

Memory persists, selects, updates, and sometimes forgets.
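
Those four verbs can be made concrete with a small sketch. Everything here is an assumption for illustration: the AgentMemory class, the TTL-based forgetting rule, and the word-overlap selection are stand-ins for whatever policy a real memory layer uses.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryItem:
    text: str
    updated_at: float


class AgentMemory:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._items: dict[str, MemoryItem] = {}

    def remember(self, key: str, text: str, now: Optional[float] = None) -> None:
        """Persist or update: writing an existing key replaces the old value."""
        self._items[key] = MemoryItem(text, time.time() if now is None else now)

    def forget_stale(self, now: Optional[float] = None) -> None:
        """Forget: drop items that have not been updated within the TTL."""
        current = time.time() if now is None else now
        self._items = {
            key: item for key, item in self._items.items()
            if current - item.updated_at <= self.ttl_seconds
        }

    def select(self, query: str) -> list[str]:
        """Select: return only items that share a word with the query."""
        words = set(query.lower().split())
        return [
            item.text for item in self._items.values()
            if words & set(item.text.lower().split())
        ]


memory = AgentMemory(ttl_seconds=60)
memory.remember("db", "project avoids ORMs", now=0)
memory.remember("db", "project prefers sqlc over ORMs", now=10)  # update in place
memory.remember("tmp", "debugging flaky test today", now=0)
memory.forget_stale(now=65)  # "tmp" is past the TTL, "db" is not
```

A long context window does none of this. It holds whatever was pasted in and loses it when the session ends.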

Decision guide

Use this simple decision model.

If the knowledge is private and evolving

Use PKM.

If the knowledge is shared and stable

Use a wiki.

If an AI needs to answer from external documents

Use RAG.

If an agent needs continuity over time

Use memory.

If you need all four

Build a layered system.

Do not force one tool to do every job.
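
The guide is simple enough to write down as a function. This is only a sketch of the mapping above, with hypothetical parameter names. It returns every layer whose condition holds, which is how the "layered system" case falls out naturally.

```python
def choose_systems(
    *,
    private_and_evolving: bool = False,
    shared_and_stable: bool = False,
    ai_answers_from_documents: bool = False,
    agent_needs_continuity: bool = False,
) -> list[str]:
    """Map the decision guide's conditions to the layers it recommends."""
    rules = [
        (private_and_evolving, "PKM"),
        (shared_and_stable, "wiki"),
        (ai_answers_from_documents, "RAG"),
        (agent_needs_continuity, "memory"),
    ]
    return [system for needed, system in rules if needed]


# A support team with a canonical wiki and a bot that answers from it.
layers = choose_systems(shared_and_stable=True, ai_answers_from_documents=True)
```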

The knowledge systems spectrum

These systems form a spectrum from human thinking to AI continuity.

Layer	System	Role
Human thought	PKM	Explore and synthesize
Shared structure	Wiki	Preserve and maintain
Machine access	RAG	Retrieve and generate
Agent continuity	Memory	Persist and adapt

The direction matters.

Knowledge often starts as personal thought, becomes shared structure, is indexed for machine retrieval, and then becomes part of persistent agent behavior.

That is the modern knowledge stack.

Where LLM Wiki fits

LLM Wiki-style systems sit between wikis and AI architecture.

They are not classic RAG.

Instead of retrieving chunks only at query time, they attempt to pre-structure knowledge into pages, summaries, entities, and links.

That makes them closer to compiled knowledge systems.

A useful placement:

System	Position
Wiki	Human-maintained structured knowledge
RAG	Query-time machine retrieval
LLM Wiki	Ingest-time machine-structured knowledge
Memory	Persistent agent context

This is why LLM Wiki belongs near knowledge systems architecture, not inside ordinary RAG.
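
To make "ingest time" concrete, here is a rough sketch of a compile step. It is an assumption-heavy toy: the CompiledPage type is hypothetical, entity detection is a naive substring match, and fake_summarize stands in for the model call that would do the real synthesis.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompiledPage:
    entity: str
    summary: str
    links: list[str]    # other entities mentioned in the same sources
    sources: list[str]  # ids of the raw documents the page was built from


def compile_wiki(
    documents: dict[str, str],
    entities: list[str],
    summarize: Callable[[str, list[str]], str],
) -> dict[str, CompiledPage]:
    """Build one page per entity at ingest time, before any question is asked."""
    mentions: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in documents.items():
        for entity in entities:
            if entity.lower() in text.lower():  # naive match, illustration only
                mentions[entity].append(doc_id)

    pages: dict[str, CompiledPage] = {}
    for entity, doc_ids in mentions.items():
        texts = [documents[doc_id] for doc_id in doc_ids]
        linked = sorted(
            other for other in mentions
            if other != entity and set(mentions[other]) & set(doc_ids)
        )
        pages[entity] = CompiledPage(entity, summarize(entity, texts), linked, doc_ids)
    return pages


def fake_summarize(entity: str, texts: list[str]) -> str:
    # Placeholder for a model call; it only reports how much was synthesized.
    return f"{entity}: compiled from {len(texts)} source(s)"


docs = {
    "a.md": "RAG retrieves chunks at query time.",
    "b.md": "LLM Wiki compiles pages at ingest time, unlike RAG.",
}
wiki = compile_wiki(docs, ["RAG", "LLM Wiki"], fake_summarize)
```

The output is a set of pages with summaries, links, and source ids that exist before the first query arrives, which is the difference from chunk retrieval.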

Practical examples

Example 1. Personal technical blog

A technical blogger might use:

PKM for research notes
Hugo site as published knowledge
internal linking as wiki-like structure
RAG later for site search
AI memory for writing assistant preferences

This is a strong architecture.

It keeps human judgment at the center while still allowing AI support.

Example 2. Engineering team

An engineering team might use:

PKM for individual learning
wiki for standards and runbooks
RAG assistant for internal docs
memory for coding agents working inside repositories

The wiki should remain canonical.

The RAG assistant should not invent process.
The memory layer should remember project preferences, not replace architecture decisions.

Example 3. AI research workflow

A researcher might use:

PKM for paper notes
wiki for stable summaries
RAG for literature search
memory for long-running research agents

This works because each layer handles a different time scale.

Security and governance

Knowledge systems become risky when they store sensitive or stale information.

PKM governance

Questions:

What should stay private?
What should be published?
What should be deleted?

Wiki governance

Questions:

Who owns each page?
When was it last reviewed?
What is canonical?

RAG governance

Questions:

Which sources are indexed?
Are answers cited?
How is retrieval evaluated?
What content is excluded?
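
"How is retrieval evaluated?" deserves a concrete answer. One common starting point is recall at k over a small labeled set. The sketch below assumes you have, for each test question, the ranked document ids your retriever returned and the set of ids a human marked as relevant. The ids are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over a labeled evaluation set."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


eval_set = [
    (["refunds", "sso", "billing"], {"refunds"}),             # relevant doc at rank 1
    (["sso", "billing", "refunds"], {"refunds", "pricing"}),  # one of two found by rank 3
]
score_at_3 = mean_recall_at_k(eval_set, k=3)
score_at_1 = mean_recall_at_k(eval_set, k=1)
```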

Memory governance

Questions:

What is remembered?
Can users inspect memory?
Can users delete memory?
How are wrong memories corrected?

Memory needs the strictest governance because it can silently influence future behavior.

SEO and content strategy note

If you run a technical site, this distinction is not only architectural. It is also editorial.

You can map content like this:

PKM pages explain human knowledge practices.
Wiki pages explain structured knowledge systems.
RAG pages explain retrieval engineering.
Memory pages explain persistent AI behavior.
Architecture pages compare and connect the paradigms.

This gives your site a clean authority mesh instead of a pile of loosely related AI articles.

Final conclusion

PKM, RAG, wikis, and AI memory systems are not competitors.

They are different answers to different questions.

PKM asks:

How do I think better over time?

A wiki asks:

What do we know, and where is the trusted version?

RAG asks:

What external context should the model use right now?

AI memory asks:

What should this agent remember for the future?

Once you separate those questions, the architecture becomes obvious.

Use PKM for thinking.
Use wikis for shared truth.
Use RAG for retrieval.
Use memory for continuity.

The future is not one knowledge system that replaces all others.

The future is layered knowledge architecture. For tools, methods, and self-hosted platforms across the full knowledge management spectrum, the cluster pillar maps the territory.

Agentic LLM Inference Parameters Reference for Qwen and Gemma

Rost — Sun, 17 May 2026 02:27:20 +0000

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).

It sits alongside the broader LLM performance engineering hub and pairs best with a clear LLM hosting and serving story. Throughput and scheduling still dominate when the model is starved, but unstable sampling burns retries and output tokens before the GPU does.

This page consolidates:

vendor recommended parameters
embedded defaults from GGUF and APIs
real-world community findings
agentic workflow optimizations

Right now it is focused on:

Qwen 3.6 (dense and MoE)
Gemma 4 (dense and MoE)

If you run terminal agents such as OpenCode, pair this reference with local LLM behavior in OpenCode so workload-level results and sampler defaults stay aligned.

The goal is simple:

Provide a single place to configure models for agent loops, coding, and multi-step reasoning.

TLDR Reference Table - All models (agentic defaults)

Model	Mode	temp	top_p	top_k	presence_penalty
Qwen 3.6 27B	thinking general	1.0	0.95	20	0.0
Qwen 3.6 27B	coding	0.6	0.95	20	0.0
Qwen 3.6 35B MoE	thinking	1.0	0.95	20	1.5
Qwen 3.6 35B MoE	coding	0.6	0.95	20	0.0
Gemma 4 31B	general	1.0	0.95	64	0.0
Gemma 4 31B	coding	1.2	0.95	65	0.0
Gemma 4 26B MoE	general	1.0	0.95	64	0.0
Gemma 4 26B MoE	coding	1.2	0.95	65	0.0
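
If you want the table in code, a plain lookup is enough. The values below are copied from the table. The model keys and the get_preset helper are hypothetical names for this sketch, and "general" covers the table's thinking and general rows.

```python
from typing import TypedDict


class SamplerPreset(TypedDict):
    temperature: float
    top_p: float
    top_k: int
    presence_penalty: float


def _preset(temperature: float, top_k: int, presence_penalty: float = 0.0) -> SamplerPreset:
    # top_p is 0.95 in every row of the table.
    return {"temperature": temperature, "top_p": 0.95, "top_k": top_k,
            "presence_penalty": presence_penalty}


# Keys are (model, mode); the model identifiers are illustrative, not official names.
PRESETS: dict[tuple[str, str], SamplerPreset] = {
    ("qwen-27b", "general"): _preset(1.0, 20),
    ("qwen-27b", "coding"): _preset(0.6, 20),
    ("qwen-35b-moe", "general"): _preset(1.0, 20, presence_penalty=1.5),
    ("qwen-35b-moe", "coding"): _preset(0.6, 20),
    ("gemma-4-31b", "general"): _preset(1.0, 64),
    ("gemma-4-31b", "coding"): _preset(1.2, 65),
    ("gemma-4-26b-moe", "general"): _preset(1.0, 64),
    ("gemma-4-26b-moe", "coding"): _preset(1.2, 65),
}


def get_preset(model: str, mode: str) -> SamplerPreset:
    """Return a copy so callers can tweak values without mutating the table."""
    try:
        return PRESETS[(model, mode)].copy()
    except KeyError:
        raise KeyError(f"no preset for model={model!r}, mode={mode!r}") from None


coding = get_preset("gemma-4-31b", "coding")
```

Failing loudly on an unknown pair is deliberate. A silent fallback to some default is how agent loops end up running on chat settings.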

What "Agentic Inference" Actually Means

Most parameter guides assume:

chat
single-shot completion
human interaction

Agentic systems are different.

They require:

multi-step reasoning
tool calling
consistent outputs
low error propagation

This changes tuning priorities.

Core shift

Use case	Priority
Chat	natural language quality
Creative	diversity
Agentic	consistency + reasoning stability

Qwen 3.6 Tuning

Dense vs MoE matters

Qwen is one of the few families where:

MoE requires different penalties

Dense (27B)

stable
predictable
no routing complexity

Recommended:

presence_penalty = 0.0

MoE (35B-A3B)

expert routing per token
risk of repetition loops

Recommended:

presence_penalty = 1.5 (general)
0.0 for coding

Why this matters

MoE models can get stuck reusing the same experts.

Presence penalty helps:

diversify token paths
improve reasoning exploration

Qwen Agentic Coding Setup

This is where most people get it wrong.

Correct setup

temperature = 0.6
top_p = 0.95
top_k = 20
presence_penalty = 0.0

Why low temperature works

Coding agents need:

deterministic outputs
repeatable tool calls
stable formatting

Higher temperature:

breaks JSON
introduces hallucinated APIs
increases retries

Gemma 4 Tuning

Gemma behaves differently.

No official defaults

model cards are empty
configs are implicit
real tuning comes from:
- Google AI Studio
- GGUF defaults
- community benchmarks

The Counter-Intuitive Finding

Gemma 4 performs better with higher temperature.

Observed behavior

Temp	Result
0.5	poor reasoning
1.0	stable baseline
1.2 to 1.5	best coding performance

This contradicts standard advice.

Why high temperature works here

Hypothesis:

training distribution favors exploration
reasoning mode depends on diversity
model compensates for lack of explicit chain-of-thought control

Result:

a higher temperature widens the solution search space

Gemma Agentic Coding Setup

Recommended:

temperature = 1.2
top_p = 0.95
top_k = 65
penalties = 0.0

Important

Do not apply the traditional "low temperature for code" rule blindly.

Gemma is an exception.

Thinking Mode and Agent Systems

Both Qwen and Gemma support reasoning modes.

Why it matters

Agent loops require:

intermediate reasoning
error recovery
multi-step planning

Practical rule

Always enable thinking mode for:

coding agents
tool use
multi-step tasks

Parameter Strategy by Use Case

Coding agents

prioritize determinism
minimize penalties
stable sampling

Reasoning agents

moderate temperature
allow exploration
preserve structure

Tool calling

strict formatting
low randomness
consistent token patterns

Schema and JSON tooling are orthogonal to logits; combine these sampling rules with structured output patterns for Ollama and Qwen3 so validators see fewer retries.

Vendor Defaults vs Reality

Vendor defaults are:

safe
generic
not optimized

Community findings often show:

better performance
task-specific tuning
architecture-aware adjustments

Example

Gemma:

official: no guidance
community: high temperature improves coding

Qwen:

official: inconsistent sections
community: standardized values converge

Practical Deployment Notes

Under concurrency, queueing and memory splits interact with retries as much as sampling does. Read how Ollama handles parallel requests alongside the presets above.

Ollama

works well for both families
verify GPU compatibility
defaults may differ from reference

vLLM

supports advanced sampling
stable for production
use explicit parameters
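
"Use explicit parameters" can look like the sketch below when you talk to an OpenAI-compatible endpoint through the openai Python client. The helper name and the served model name are hypothetical. The one detail worth knowing is that top_k is not part of the OpenAI request schema, so it travels in extra_body, which vLLM's OpenAI-compatible server reads.

```python
from typing import Any


def build_chat_kwargs(
    model: str,
    messages: list[dict[str, str]],
    *,
    temperature: float,
    top_p: float,
    top_k: int,
    presence_penalty: float,
) -> dict[str, Any]:
    """Return keyword arguments for an OpenAI-style chat completion call."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "presence_penalty": presence_penalty,
        # Not an OpenAI field, so it is sent as an extra body parameter.
        "extra_body": {"top_k": top_k},
    }


kwargs = build_chat_kwargs(
    "my-served-model",  # hypothetical served model name
    [{"role": "user", "content": "List the files in the repo."}],
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    presence_penalty=0.0,
)
# With the openai package installed and a client pointed at your server:
# client.chat.completions.create(**kwargs)
```

Setting every sampler value on every request means a change in server defaults cannot silently alter agent behavior.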

llama.cpp

requires sampler ordering
always enable jinja for modern models
incorrect sampler chain reduces output quality

Key Takeaways

there is no universal parameter set
architecture matters more than model size
agentic systems require different tuning than chat
community benchmarks are often ahead of vendors

Final Opinion

Most parameter guides are outdated.

They assume:

chat use
low temperature for code
static configurations

Modern models break those assumptions.

If you are building agentic systems:

treat inference tuning as a first-class system design problem

Not a config file.

Future Direction

This reference will evolve into:

per-model deep dives
agent-specific configs
benchmarking-backed tuning

Because:

inference is where model capability becomes system performance

LLM Structured Output Validation in Python That Holds Up

Rost — Fri, 15 May 2026 01:26:29 +0000

Most LLM "structured output" tutorials are unserious.
They teach you to ask for JSON politely and then hope the model behaves.
That is not validation.
That is optimism with braces.

OpenAI's own docs make the distinction explicit. JSON mode gives you valid JSON, while Structured Outputs enforces schema adherence, and OpenAI recommends using Structured Outputs instead of JSON mode when possible.

That still does not make the payload trustworthy. JSON Schema defines structure and allowed values, Pydantic gives you typed validation in Python, and OpenAI explicitly notes that a schema-valid response can still contain incorrect values. On top of that, refusals and incomplete outputs can bypass the shape you expected. In production, structured output validation is a pipeline, not a toggle. The same boundary also has to live inside the wider story of throughput, retries, and scheduler limits on the LLM performance engineering hub.

Structured output validation is a contract

Structured output validation for LLMs means you define the shape of the answer up front, constrain the model to produce that shape when possible, and then validate the result again before your application trusts it. In practical terms, that means checking required fields, types, enums, closed object shapes, and domain rules before the payload touches your database, UI, queue, or downstream service. JSON Schema exists for exactly this kind of structural validation, Pydantic is built to validate untrusted data against Python type hints, and Python's jsonschema library gives you a direct way to validate an instance against a schema.

There is also a clean split between two common use cases. If the model is supposed to answer the user in a structured format, use a structured response format. If the model is supposed to call your application's tools or functions, use function calling. OpenAI's docs spell out that distinction, and for function calling they recommend enabling strict: true so the arguments reliably adhere to the function schema.

My strong opinion is simple. Treat every structured LLM response as an API boundary. Once you start thinking in terms of contracts instead of prompts, the architecture gets cleaner, the bugs get cheaper, and the whole "why did the model invent a new field in production" problem mostly disappears. That is the real answer to "what is structured output validation for LLMs" and it is a much better answer than "ask the model nicely for JSON."

JSON mode is not validation

If you remember only one thing from this article, make it this. JSON mode is not schema validation. OpenAI's Help Center says JSON mode will not guarantee the output matches any specific schema, only that it is valid JSON and parses without errors. The Structured Outputs guide says the same thing in a cleaner way. Both JSON mode and Structured Outputs can produce valid JSON, but only Structured Outputs enforces schema adherence.

That difference matters more than people admit. In its Structured Outputs launch post, OpenAI reported that gpt-4o-2024-08-06 with Structured Outputs scored 100 percent on its complex JSON schema evals, while gpt-4-0613 scored under 40 percent. You do not need to treat those numbers as universal truth to see the broader point. Schema enforcement changes the failure surface from "anything could happen" to "the contract is much tighter."

There are still edge cases, and pretending otherwise is how toy demos become pager duty. OpenAI documents that the model can refuse an unsafe request, and those refusals are surfaced outside your normal schema path. It also documents incomplete responses, including cases such as hitting max_output_tokens or a content filter interruption. So the FAQ "is JSON mode enough for reliable LLM output" has a short answer and a longer one. The short answer is no. The longer answer is that even strict structured output still needs explicit failure handling.

Where structured output still breaks

Schema enforcement shrinks the problem. It does not delete it. In real traffic you still see broken or surprising payloads for reasons that have little to do with your prompt wording.

Failure shapes worth designing for

Models and clients disagree about details. You can get extra prose before or after the JSON, Markdown fenced blocks around the payload, or a tool call whose name is valid but whose arguments are JSON that does not match your Pydantic model. Streaming makes it worse because you might validate a half-finished buffer. Defensive code should assume "string in, maybe JSON inside" rather than "bytes on the wire already match my model."
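
A sketch of the "string in, maybe JSON inside" stance, using only the standard library. It leans on json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows. The function name is hypothetical, and this only covers the single-object case. It is not a general extractor.

```python
import json
from typing import Any


def extract_first_json_object(text: str) -> dict[str, Any]:
    """Return the first decodable top-level JSON object found in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # this brace did not open valid JSON, try the next one
        else:
            if isinstance(value, dict):
                return value
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


noisy = (
    "Sure, here is the JSON:\n"
    '{"summary": "uses } and { inside text", "priority": "high"}\n'
    "Let me know if you need anything else!"
)
payload = extract_first_json_object(noisy)
```

Because a real parser walks the string, braces inside string values do not confuse it the way index slicing can. The result is still an untrusted dict and still has to go through your model validation.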

Provider and API differences

Not every host exposes the same structured-output surface. One stack might give you a first-class schema-bound completion, another might only guarantee JSON syntax, and local runtimes might lag behind hosted APIs. That is one reason the FAQ "how do you validate LLM JSON in Python" starts with provider enforcement when it exists and still ends with Python-side validation. For a wider view of how vendors compare, see the structured output comparison across popular LLM providers. If you run models locally, the same validation pipeline applies after you normalize the wire format, for example after extraction with Ollama as in structured LLM output with Ollama in Python and Go. When a runtime still wraps JSON with odd prefixes or reasoning traces, expect the same class of parser failures described in Ollama GPT-OSS structured output issues.

The Python stack that actually works

My recommendation is boring on purpose. First, let the model provider enforce the structural contract when it can. Second, validate the returned payload in Python with Pydantic. Third, use explicit business-rule validation for facts that a schema alone cannot prove. Fourth, test the contract with fixtures and adversarial examples instead of waving at a playground screenshot and calling it done. OpenAI's Structured Outputs docs, Pydantic's validator model, Python's jsonschema tooling, and OpenAI's own structured-output eval examples all point in that direction.

Pydantic is the right center of gravity for Python. It lets you model the output as normal Python types, generate JSON Schema with model_json_schema(), and validate raw JSON with model_validate_json(). Pydantic's docs also note that model_validate_json() is generally the better path than doing json.loads(...) first and then validating, because that two-step route adds extra parsing work in Python.

If you keep standalone schema files in your repo, or you want CI to validate fixture payloads independently of model code, Python's jsonschema package gives you the simplest possible contract check with jsonschema.validate(...). If you want that in pre-commit, check-jsonschema exists specifically as a CLI and pre-commit hook built on jsonschema. That is a very good fit for teams that want schema changes reviewed like code changes.

Frameworks can reduce plumbing, but they do not remove the need for actual validation. LangChain now auto-selects provider-native structured output when the provider supports it and falls back to a tool strategy otherwise. Instructor layers Pydantic response models, validation, retries, and multi-provider support on top of model calls. Guardrails focuses on validators and input-output guard layers. Useful tools, all of them. But the schema and the business rules still belong to you. If you are choosing between higher-level libraries, the BAML vs Instructor comparison for Python is a useful companion to this article.

A minimal OpenAI and Pydantic example

The smallest production-worthy example has a few non-negotiables. Use a closed set of enum-like values where possible. Forbid extra keys. Add field descriptions so the schema is understandable to humans and more legible to the model. Keep the root object explicit and boring. OpenAI recommends clear names plus titles and descriptions for important keys, JSON Schema uses enum to restrict values, and Pydantic can close the object shape with extra="forbid".

from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, ConfigDict, Field

class TicketClassification(BaseModel):
    model_config = ConfigDict(extra="forbid")

    category: Literal["billing", "bug", "how_to", "abuse"] = Field(
        description="Support ticket category."
    )
    priority: Literal["low", "medium", "high"] = Field(
        description="Operational urgency."
    )
    needs_human: bool = Field(
        description="Whether a human should review the case."
    )
    summary: str = Field(
        description="A one sentence summary of the issue."
    )

client = OpenAI()

response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {
            "role": "system",
            "content": "Classify support tickets. Return only the structured result.",
        },
        {
            "role": "user",
            "content": "Customer reports duplicate charges after refreshing checkout.",
        },
    ],
    text_format=TicketClassification,
)

result = response.output_parsed
print(result.model_dump())

Two details in that example are easy to miss and absolutely worth caring about. extra="forbid" on the Pydantic side mirrors the JSON Schema idea of additionalProperties: false, which is also a requirement for strict tool schemas in OpenAI's function-calling docs. And enums are not cosmetic. They are one of the simplest ways to stop the model from inventing a value your code does not understand.

The OpenAI Python SDK supports client.responses.parse(...) with a Pydantic model supplied as text_format, and the parsed object is returned on response.output_parsed. The same SDK also supports client.chat.completions.parse(...), where the parsed object lives on message.parsed. If you want direct structured data extraction with minimal glue, those helpers are the cleanest starting point.

Parse, normalize, then validate

Structured Outputs and model_validate_json remove a lot of parsing pain when the stack is aligned end to end. The moment you support a provider that returns plain chat text, a model that wraps JSON in fences, or a logging path that stores the raw completion string, you want one choke point that turns text into a dict before Pydantic runs.

import json

def parse_json_from_llm_text(text: str) -> dict:
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0].strip()

    # Common "Sure, here is the JSON:" prefix before the object.
    if not cleaned.startswith("{") and "{" in cleaned and "}" in cleaned:
        start = cleaned.find("{")
        end = cleaned.rfind("}")
        if end > start:
            cleaned = cleaned[start : end + 1]

    return json.loads(cleaned)

ticket_dict = parse_json_from_llm_text(raw_completion_text)
ticket = TicketClassification.model_validate(ticket_dict)

That helper is intentionally boring. It handles fenced ```json ... ``` blocks and a leading natural-language preamble when the payload is still a single top-level object. It is not a full JSON extractor. If the model nests braces inside string values, naive slicing can break, and the right fix is usually stricter prompting, schema-bound completions, or a dedicated parser library.

Streaming completions

If you stream chat tokens, do not run json.loads or model_validate_json on every delta. Buffer until the API reports a finished message (check your client for the stream termination or finish_reason), concatenate the text, then parse once. The same rule applies when tool-call arguments arrive in chunks. You only validate after the arguments string is complete.

chunks: list[str] = []
for chunk in completion_stream:
    delta = chunk.choices[0].delta.content or ""
    chunks.append(delta)
raw_completion_text = "".join(chunks)
ticket = TicketClassification.model_validate_json(raw_completion_text)

You can still pass raw_completion_text through parse_json_from_llm_text first when you expect fences or chatter around the JSON.
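
The tool-call case follows the same buffering rule. The sketch below is a simplified assumption: the deltas are plain dicts loosely modeled on chunked tool-call deltas, not real SDK objects, and collect_tool_calls is a hypothetical helper. Fragments are joined per call index and parsed exactly once.

```python
import json
from typing import Any, Iterable


def collect_tool_calls(
    deltas: Iterable[dict[str, Any]],
) -> list[tuple[str, dict[str, Any]]]:
    """Join streamed argument fragments per tool-call index, then parse once."""
    names: dict[int, str] = {}
    fragments: dict[int, list[str]] = {}
    for delta in deltas:
        index = delta["index"]
        function = delta.get("function", {})
        if function.get("name"):
            names[index] = function["name"]
        fragments.setdefault(index, []).append(function.get("arguments") or "")

    calls = []
    for index in sorted(fragments):
        raw_arguments = "".join(fragments[index])
        # Parsing happens only here, after every fragment has arrived.
        calls.append((names[index], json.loads(raw_arguments)))
    return calls


# Simplified stand-ins for streamed tool-call deltas.
stream = [
    {"index": 0, "function": {"name": "classify_ticket", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"category": "bi'}},
    {"index": 0, "function": {"arguments": 'lling", "priority": "high"}'}},
]
calls = collect_tool_calls(stream)
```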

Once you own plain-string parsing, the next constraint is often not Python but the provider's JSON Schema dialect and what the remote API actually accepts.

Provider schema limits (before you get clever in Python)

Do not blindly dump any schema generator output into an API and assume every JSON Schema feature is supported. OpenAI supports a subset of JSON Schema, requires all fields to be required for Structured Outputs, requires the root to be an object rather than a top-level anyOf, and documents limits on nesting depth and total property count. Keep the provider-facing schema simple. That is not a compromise. That is good engineering.
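
A cheap pre-flight check catches most of these before the API does. This is a sketch under stated assumptions: check_strict_schema is a hypothetical helper, the max_depth default is a placeholder you should set from the provider's current documentation, and it does not resolve $ref or $defs, which Pydantic emits for nested models.

```python
from typing import Any


def check_strict_schema(schema: dict[str, Any], max_depth: int = 5) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    if schema.get("type") != "object":
        problems.append("root: must be an object")

    def walk(node: dict[str, Any], path: str, depth: int) -> None:
        if depth > max_depth:
            problems.append(f"{path}: nesting deeper than {max_depth}")
            return
        if node.get("type") == "object":
            props = node.get("properties", {})
            missing = sorted(set(props) - set(node.get("required", [])))
            if missing:
                problems.append(f"{path}: not required: {', '.join(missing)}")
            if node.get("additionalProperties") is not False:
                problems.append(f"{path}: additionalProperties must be false")
            for name, child in props.items():
                walk(child, f"{path}.{name}", depth + 1)
        elif node.get("type") == "array" and isinstance(node.get("items"), dict):
            walk(node["items"], f"{path}[]", depth + 1)

    walk(schema, "root", 1)
    return problems


loose_schema = {
    "type": "object",
    "properties": {
        "a": {"type": "string"},
        "b": {"type": "object", "properties": {"c": {"type": "integer"}}},
    },
    "required": ["a"],
}
problems = check_strict_schema(loose_schema)
```

Running this in CI against every provider-facing schema turns "the API rejected my schema" into a failed unit test.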

If you need a provider-agnostic validation path, or you want to validate stored fixtures and mocks, Pydantic plus jsonschema is still a great combination.

from jsonschema import validate as validate_json

schema = TicketClassification.model_json_schema()

payload = {
    "category": "bug",
    "priority": "high",
    "needs_human": True,
    "summary": "Checkout duplicates charges after refresh.",
}

validate_json(instance=payload, schema=schema)
ticket = TicketClassification.model_validate(payload)
print(ticket)

That pattern is especially handy in tests, contract fixtures, and integrations where the model provider does not offer native structured output enforcement. Just remember that a locally generated schema may be broader than a given provider's supported subset, so "valid locally" does not automatically mean "accepted by every LLM API." Also note that some providers preprocess and cache schema artifacts, so the first request for a new schema can be slower than warm requests.

Tool calls are a second contract

Function or tool calling is the other major structured-output shape. The model chooses a name and passes arguments that should match a JSON Schema you control. OpenAI recommends strict: true on tool definitions so arguments stay aligned with that schema. In agent-heavy stacks, bad sampling turns into invalid tool JSON fast; keep sampler settings aligned with multi-step work using the agentic inference parameters reference for Qwen and Gemma.

The snippets below assume you already mapped the provider's tool-call object into a name string and an arguments dict, for example by parsing tool_calls[].function on chat completions (JSON string arguments become json.loads first). dispatch_tool is the step after that normalization.
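
That normalization step is small but worth writing once. The sketch below assumes the tool call has already been converted to a plain dict with a function entry holding name and arguments (for example by dumping the SDK object). normalize_tool_call is a hypothetical helper name.

```python
import json
from typing import Any


def normalize_tool_call(tool_call: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Turn a chat-completions style tool call into (name, arguments dict)."""
    function = tool_call["function"]
    name = function["name"]
    raw = function.get("arguments") or "{}"
    try:
        # Arguments usually arrive as a JSON string; pass dicts through unchanged.
        arguments = json.loads(raw) if isinstance(raw, str) else raw
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool {name!r} returned invalid JSON arguments") from exc
    if not isinstance(arguments, dict):
        raise ValueError(f"tool {name!r} arguments must be a JSON object")
    return name, arguments


example_call = {
    "function": {"name": "classify_ticket", "arguments": '{"category": "bug"}'}
}
name, arguments = normalize_tool_call(example_call)
```

The result is still untrusted. It only guarantees a string name and a dict, which is what the dispatch step expects as input.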

Two practical rules help in Python. First, validate the tool name against an explicit allowlist before you route execution. Second, validate the arguments dict with the same Pydantic model you use in tests, not with ad hoc key access. The failure mode you are avoiding is "valid JSON arguments, wrong shape for the tool that fired," which slips past string checks.

from typing import Any, Callable

from pydantic import BaseModel

ToolHandler = Callable[[dict[str, Any]], str]

def dispatch_tool(
    *,
    name: str,
    arguments: dict[str, Any],
    handlers: dict[str, tuple[type[BaseModel], ToolHandler]],
) -> str:
    if name not in handlers:
        raise ValueError(f"unsupported tool {name}")
    model_cls, handler = handlers[name]
    validated = model_cls.model_validate(arguments)
    return handler(validated.model_dump())

handlers: dict[str, tuple[type[BaseModel], ToolHandler]] = {
    "classify_ticket": (
        TicketClassification,
        lambda data: f"queued as {data['category']}",
    ),
}

That pattern keeps routing and validation in one place. Your real handlers will be richer, but the split should stay the same: allowed names, typed arguments, then side effects.

Schema validation still needs business rules

A valid object is not the same thing as a correct object. OpenAI says this directly. Structured Outputs does not prevent mistakes inside the values of the JSON object. That is why the FAQ "why do schema validation and business-rule validation both matter" has a blunt answer. Because a response can match the schema perfectly and still be wrong in a way that hurts the business.

Here is a realistic example. The structure can be valid, but the pricing logic can still be nonsense.

from decimal import Decimal
from typing import Literal
from typing_extensions import Self

from pydantic import BaseModel, ConfigDict, Field, model_validator

class Offer(BaseModel):
    model_config = ConfigDict(extra="forbid")

    currency: Literal["USD", "EUR", "GBP"]
    amount: Decimal = Field(gt=0)
    original_amount: Decimal | None
    discounted: bool

    @model_validator(mode="after")
    def check_discount_logic(self) -> Self:
        if self.discounted:
            if self.original_amount is None:
                raise ValueError(
                    "original_amount is required when discounted is true"
                )
            if self.original_amount <= self.amount:
                raise ValueError(
                    "original_amount must be greater than amount"
                )
        return self

That validator does something schemas alone often do poorly in real systems. It checks cross-field semantics after the whole model has been parsed. Pydantic's model_validator exists exactly for this kind of whole-object validation. Notice the Decimal | None field without a default. That keeps the field present while still allowing null, which matches OpenAI's documented pattern for optional-like values under strict Structured Outputs.

If you want validation failures to feed back into the model automatically, Instructor is a practical layer on top of Pydantic. Its docs describe a retry loop where validation errors are captured, formatted as feedback, and used to ask the model to try again.

import instructor

retrying_client = instructor.from_provider("openai/gpt-4o")

offer = retrying_client.create(
    response_model=Offer,
    max_retries=2,  # retry budget for responses that fail validation
    messages=[
        {
            "role": "user",
            "content": (
                "Extract the offer from this text. "
                "Was 49.00 USD, now 19.00 USD."
            ),
        }
    ],
)

This is one of the few conveniences I will happily recommend. Automatic retries tied to real validation errors are useful. Silent coercion is not. Instructor's model layer, retry docs, and validation docs all lean into that same idea, and they are right to do so.

You can implement the same idea without a framework. The loop is small. Ask the model, validate with Pydantic, and if validation fails, send the error details back in a follow-up user message and ask for corrected JSON only. Cap attempts, log the final failure, and surface a controlled error to callers. When you already rely on responses.parse or other schema-bound helpers, you may rarely exercise this path. It still matters for JSON mode, older chat endpoints, or any gateway that hands you a raw string.

from openai import OpenAI
from pydantic import ValidationError

client = OpenAI()

messages = [
    {"role": "system", "content": "Return only JSON that matches the ticket schema."},
    {"role": "user", "content": "Customer reports duplicate charges after refreshing checkout."},
]

ticket: TicketClassification | None = None
for attempt in range(2):
    completion = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=messages,
        response_format={"type": "json_object"},
    )
    raw_text = completion.choices[0].message.content or ""
    try:
        ticket = TicketClassification.model_validate_json(raw_text)
        break
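    except ValidationError as exc:
        # Keep the rejected output in the transcript so the model sees what it is correcting.
        messages.append({"role": "assistant", "content": raw_text})
        messages.append(
            {
                "role": "user",
                "content": f"Validation failed with {exc.errors()}. Return corrected JSON only.",
            }
        )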
else:
    raise RuntimeError("exhausted structured output retries")

assert ticket is not None

In real services you would attach tracing IDs, redact customer text in logs, and distinguish recoverable validation errors from refusals or incomplete responses. The important part is that the retry is driven by real validator output, not by a generic "try again" message.

Test, retry, and fail closed

What should happen when LLM validation fails? Not a shrug. Reject the payload, log the failure, retry with bounded attempts if the task is worth retrying, and fail closed instead of normalizing garbage into something that only looks acceptable. This is also where many teams forget to handle refusals and incomplete outputs explicitly, even though the provider docs tell them those paths exist.

For OpenAI's Responses API, failure handling should be first-class code, not an afterthought. In the snippet below, response is the object returned by client.responses.create or client.responses.parse, not the completion object from the chat examples elsewhere in this article.

if response.status == "incomplete":
    raise RuntimeError(response.incomplete_details.reason)

content = response.output[0].content[0]

if content.type == "refusal":
    raise RuntimeError(content.refusal)

That is not defensive over-engineering. It is directly aligned with the documented failure modes. If the model refuses, you are not holding a schema-valid payload. If the response is incomplete, you are not holding a schema-valid payload. Treat both as explicit branches in your control flow.

You should also test the contract outside the model call itself.

import pytest
from jsonschema import validate as validate_json
from pydantic import ValidationError

def test_ticket_fixture_matches_schema():
    payload = {
        "category": "bug",
        "priority": "high",
        "needs_human": True,
        "summary": "Checkout duplicates charges after refresh.",
    }
    validate_json(instance=payload, schema=TicketClassification.model_json_schema())

def test_discount_logic_rejects_broken_offer():
    with pytest.raises(ValidationError):
        Offer.model_validate(
            {
                "currency": "USD",
                "amount": "19.00",
                "original_amount": "10.00",
                "discounted": True,
            }
        )

def test_ticket_rejects_unknown_category_string():
    with pytest.raises(ValidationError):
        TicketClassification.model_validate(
            {
                "category": "refund",
                "priority": "high",
                "needs_human": True,
                "summary": "Customer wants a refund.",
            }
        )

def test_ticket_rejects_extra_keys():
    with pytest.raises(ValidationError):
        TicketClassification.model_validate(
            {
                "category": "bug",
                "priority": "high",
                "needs_human": True,
                "summary": "Broken flow.",
                "severity": "critical",
            }
        )

This is the right shape of test strategy for LLM output validation in Python. Validate golden fixtures with jsonschema so every field in the contract is exercised. Validate semantics with Pydantic, then add adversarial cases such as illegal enum strings, forbidden extra keys, and cross-field contradictions you care about. If you snapshot real model outputs, scrub PII and treat them as regression fixtures.

If your team lives in the OpenAI stack, the Evals API also includes structured-output evaluation recipes specifically for testing and iterating on tasks that depend on machine-readable formats. And if you keep raw schema files in the repo, wire check-jsonschema into CI or pre-commit. Ship contracts, not vibes.

Production checks that save you later

When validation fails, the FAQ answer is blunt. Reject the payload, log why, retry with targeted feedback when the task is worth another attempt, and fail closed instead of coercing bad data into a queue.

A short operations checklist helps teams avoid repeat incidents.

Log schema version or a hash of the JSON Schema you sent to the provider so you can replay failures accurately.
Redact model inputs and outputs in logs. Structured logs are useless if they leak customer text.
Emit counters or metrics for refusal rate, incomplete response rate, validation failure rate, and repair success rate. Spikes there beat guessing when a model or prompt change shipped.
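
As a minimal sketch of the first checklist item, a schema hash can be derived from a canonical JSON serialization of the schema. The function name schema_fingerprint and the example schema are assumptions for illustration; in this article's setup the input would more likely be the output of TicketClassification.model_json_schema().

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON Schema dict."""
    # sort_keys and fixed separators make the serialization independent of
    # key order and whitespace, so equal schemas hash to the same value.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical schema used only for illustration.
example_schema = {
    "type": "object",
    "properties": {"category": {"type": "string"}},
    "required": ["category"],
}

fingerprint = schema_fingerprint(example_schema)
```

Logging that digest next to each request gives you a compact value to group failures by, without writing the full schema into every log line.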

The broader guidance on observability for LLM systems helps wire those signals into dashboards, traces, and SLO reviews once the counters exist.

The best practice is not complicated. Use provider-side Structured Outputs or strict tool schemas when you can. Normalize raw text when you must. Mirror the contract in Python with Pydantic. Add business-rule validation for what the schema cannot prove. Handle refusals and incomplete responses as normal branches. Test the contract until it stops being a demo and starts being software. Anything less is just prompt engineering cosplay.

Second Brain Explained for Engineers and Knowledge Workers

Rost — Thu, 14 May 2026 08:51:13 +0000

Information overload is less about sheer volume than about unresolved inputs. Modern knowledge work leaves a trail of tabs, chat threads, docs, highlights, snippets, transcripts, screenshots, and half-written notes.

Most of that material is only potentially useful, because almost none of it surfaces at the moment it would actually help. That gap between capture and reuse is where the idea of a second brain becomes interesting.

In contemporary personal knowledge management, Tiago Forte popularized the term second brain for an external digital repository of ideas, insights, and resources. The phrase can sound inflated, yet the useful core is practical. A second brain externalizes thinking so your biological brain spends less energy on storage and more on interpretation, connection, and output.

The site’s Knowledge Management in 2026 hub gathers adjacent guides—tools, self-hosted wikis, and PKM methods—when you want surrounding context beyond this article.

Philosophically, the idea is less exotic than the branding implies. External media have always extended cognition—a notebook, a diagram, a link map, or a markdown vault can sit inside the thinking loop. A second brain is that familiar pattern updated for search, backlinks, linked notes, and AI-assisted retrieval.

What Is a Second Brain

A second brain is an external knowledge system, but that label alone is too weak. Plenty of systems store information; a genuine second brain also helps you retrieve, compare, compress, and reuse ideas.

That is why a second brain is not merely a note-taking app. Apps hold text; a second brain sustains a loop between capture and expression. When someone asks what a second brain is, the shortest honest answer is that it is a personal system for turning scattered inputs into reusable thinking.

The contrast between notes and a knowledge system matters because notes are inert artifacts. A knowledge system gives those artifacts retrieval paths, relationships, and context. A folder full of markdown files is no more a second brain than a pile of source files is a finished product—structure and flow are the missing layers.

The strongest setups therefore resist obsession with storage. Storage is cheap, retrieval is expensive, and synthesis is where value compounds. If the system cannot help turn yesterday’s reading into tomorrow’s writing, design, research, or decision-making, it behaves less like a brain and more like a basement.

Core Principles of a Second Brain

The most useful modern framing is CODE—Capture, Organize, Distill, Express. The acronym sounds simple because it is simple, which is part of its power.

Capture

Capture does not mean saving everything; that path leads quickly to digital hoarding. Good capture means saving ideas with future energy. Useful notes tend to be surprising, reusable, unresolved, emotional, or clearly tied to active work.

Accordingly, the capture question is rarely “Should this be saved forever?” The sharper question is “Will this be useful again in a different context?” A second brain improves when it collects sparks rather than exhaust.

Organize

Organization is not about perfect taxonomy. It is about retrieval with low friction—making information easier to find while work is already in motion.

Here PARA often enters the conversation. Projects, Areas, Resources, and Archives offer a lightweight way to organize by actionability rather than abstract topic. Strict category trees often decay into maintenance work, whereas action-oriented buckets keep the system tethered to reality.

Distill

Distillation is where raw notes stop cluttering the vault and start becoming knowledge. A long highlight dump is not yet useful; a distilled note surfaces what is worth keeping, which claims deserve testing, and which ideas can be reused.

Many people skip this step, yet it is what makes the whole method work. Distillation turns large volumes of text into a smaller set of ideas you can recognize later without rereading everything from scratch.

Express

Expression is the phase most note-taking systems quietly avoid, but without output the loop never closes. A second brain earns its keep when notes become articles, designs, code comments, decision memos, architecture docs, or working theories.

Without output there is no pressure test, and without a pressure test there is no learning loop—so a second brain that never expresses anything is only a well-organized backlog.

Second Brain vs PKM

Personal knowledge management (PKM) names the wider field—the habits, skills, and systems people use to gather, evaluate, organize, retrieve, and apply what they learn. In academic literature PKM stretches beyond note-taking and software into cognitive, informational, social, and learning competencies. For a fuller tour of that field than this narrower framing allows, see Personal Knowledge Management — goals, methods, and tools.

A second brain sits inside that umbrella as one philosophy of PKM, especially the digital workflow built around capture, organization, distillation, and expression. In Tiago Forte’s framing, Building a Second Brain describes the larger creative process, while PARA is one implementation layer within it.

The terms are related but not interchangeable. PKM is the category; a second brain is an opinionated implementation—and many online debates about second-brain systems are really debates about the broader PKM problem wearing a narrower label.

Second Brain vs Wiki vs RAG

Technical readers usually arrive next at a pair of questions—how a second brain differs from a wiki, and how it differs from RAG—and the answer begins with intent.

System	Primary job	Best at	Weak point
Second brain	Personal evolving context	Idea development and synthesis	Can become messy and highly personal
Wiki	Shared structured knowledge	Documentation and stable reference	Weaker for unfinished thinking
RAG	Query time retrieval for AI	Grounded responses over external sources	Does not preserve human interpretation by itself

Wikis stabilize knowledge. They favor explicit structure, shared naming, and pages that converge toward a source of truth, which makes them excellent for documentation yet awkward for half-formed concepts, private context, and exploratory thinking. Self-hosted setups such as DokuWiki and its alternatives illustrate how teams turn that impulse into durable reference sites.

A second brain usually begins from the opposite posture—it is personal, evolving, and tolerant of ambiguity, existing before consensus settles. In that sense a wiki is where knowledge goes when it stops changing quickly, whereas a second brain is where it still changes shape.

RAG addresses yet another problem. Retrieval-augmented generation connects an AI model to external knowledge so responses can draw on fresher or more domain-specific context at query time. That capability is valuable, yet it is not the same as building a personal knowledge system—RAG retrieves at inference time, while a second brain remembers what mattered, why it mattered, and how your interpretation shifted.

The interesting technical point is complementarity. A second brain can feed a wiki; a wiki can supply a clean source for RAG; RAG can make a second brain easier to search. None of those roles makes the abstractions interchangeable. The production-oriented RAG tutorial spells out the machine-side retrieval stack; read alongside a personal vault, it clarifies what human-curated notes preserve that query-time retrieval alone does not.

Tools for a Second Brain

People gravitate to tool wars because tools are visible and structure is not, yet the tool is usually the least informative part of the system.

Obsidian

Obsidian appeals because it pairs local markdown files with internal links, backlinks, properties, and graph-style navigation—it feels like a knowledge base first and a text editor second. For technical users who care about file ownership and link-driven structure, that combination is hard to ignore. Vault-oriented setup detail lives in Using Obsidian for personal knowledge management.

Logseq

Logseq speaks to a different instinct. It is local-first, privacy-oriented, and built around an outline model where daily journals, bullets, references, and nonlinear linking make the tool feel less like drafting documents and more like accumulating thought fragments that later connect.

Notion

Notion sits closer to docs, lightweight databases, and team wiki workflows, while still supporting links, backlinks, and increasingly AI-driven search and summarization across connected workspaces. For anyone who wants one surface for docs, projects, and knowledge hubs, the appeal is obvious.

Underneath those differences, all three can support a second brain—and all three can fail at it. Tool choice shifts ergonomics more than philosophy; a weak workflow inside a powerful tool stays weak, while a clear workflow inside a simpler tool still compounds. When Obsidian and Logseq are both on the table, Obsidian vs Logseq is the feature-level split readers usually want next.

Common Second Brain Mistakes

The first trap is collecting too much. Capture feels productive because it is frictionless, yet when everything seems worth saving, nothing stays salient. The usual outcome is a bloated archive with thin signal density.

The second trap is over-structuring, often driven by anxiety. Extra folders, tags, naming rules, and dashboards feel safer, but systems that demand constant grooming stop serving thinking and begin consuming it.

The third trap—both the most common and the most costly—is failing to express. Notes that never become output do not compound; they only accumulate. The promise of a second brain hinges on turning private fragments into public or practical artifacts.

How a Second Brain Evolves

Early on the system can look underwhelming—a handful of notes, a few saved links, perhaps a project page and some book highlights—and then the connections start.

A meeting note links to a design decision; a blog draft links to a half-finished idea from six months earlier; a research note links to a bug report, which links to a product discussion, which loops back to a concept that once seemed unrelated. That is when static notes begin behaving like a dynamic system.

Over time a second brain starts acting like a personal knowledge graph, which does not require a literal graph view. Value shifts from individual notes to relationships among them—the archive stops feeling like a cabinet of documents and starts feeling like a map of evolving context.

That shift drives compounding. Notes become connections, connections become reusable patterns, and reusable patterns cultivate judgment.

AI and the Second Brain

AI is the newest animating layer in this conversation, though not for the reason hype suggests. The payoff is not that AI replaces your second brain; it is that AI can make a human-centered second brain more capable. Readers routing notes toward assistants will find adjacent infrastructure context in AI systems—orchestration, retrieval, and memory beyond a single chat prompt.

In practice AI can fill three roles—summarizing large notes, transcripts, and documents; surfacing related ideas across a workspace faster than manual search; and augmenting expression through outlines, alternative framings, rough rewrites, or extracted action items.

Those abilities edge toward magic until they don’t. AI does not decide what deserves to matter inside your system; it predicts relevance from patterns. Meaning still flows from human priorities, context, and taste—which is why “Can AI improve a second brain without replacing human judgment?” lands on a clear yes only because the judgment layer stays human.

The strongest systems will probably braid both strands—human-curated notes supplying durable context, AI supplying acceleration through summarization, search, and transformation—so the model operates quickly over the archive without owning it.

Take Away

“Second brain” is slightly misleading branding. The aim is not to manufacture another brain; it is to stop treating your first one like cold storage.

A second brain is neither a single tool nor “just notes” nor a prettier folder tree. It is a system for capturing ideas, organizing them for retrieval, distilling them into reusable insight, and expressing them as work.

That is why the concept survives tool churn. Apps change, interfaces change, and AI changes faster than both, yet the underlying failure mode persists—knowledge work breaks when useful ideas vanish between the moment of capture and the moment of need. A second brain is one of the few frameworks that treats that gap as a design problem rather than a character flaw.

Useful links

To deepen your grasp of CODE and PARA, the philosophical idea of extended cognition, and the gap between human-centered notes and retrieval-first RAG, these readings are a practical next step:

Building a Second Brain overview — Tiago Forte’s canonical introduction—the naming of the idea, the CODE workflow (Capture, Organize, Distill, Express), and the case for externalized cognition beyond sheer storage.
PARA method — Practical organization by actionability rather than textbook taxonomy; especially helpful for thinking about retrieval friction versus folder perfectionism.
The extended mind — Andy Clark and David Chalmers’ paper on cognitive extension—why notebooks, diagrams, and digital notes can count as part of the thinking process, not just accessories to it.
Retrieval-augmented generation for knowledge-intensive NLP tasks — Lewis et al.’s foundational RAG paper; useful background for why RAG is built around query-time retrieval and differs in purpose from a curated personal vault.
What is retrieval-augmented generation? — A clear, implementation-minded explanation of RAG architecture and limits—good companion reading for the wiki versus second brain versus RAG comparison.

Bonus. Supersizing the mind — the science of cognitive extension — Forte connects extended-mind ideas to everyday knowledge work; a strong bridge between theory and practice.

Idempotency in Distributed Systems That Actually Works

Rost — Mon, 11 May 2026 11:37:09 +0000

Idempotency in distributed systems is the property that saves you after the network lies, the queue retries, the client panics, and the operator hits replay. In production systems, duplicate delivery is normal. Duplicate side effects are the bug.

HTTP defines an idempotent method as one where multiple identical requests have the same intended effect on the server as one request. That is why PUT, DELETE, and safe methods are idempotent in protocol semantics and can be retried automatically after a communication failure.

That definition is useful, but it is not enough. In real architectures, idempotency is not an HTTP trivia answer. It is a business guarantee. If a customer hits "pay" once, you do not get to charge twice because a timeout happened between commit and response. If a worker updates inventory and crashes before acking the message, you do not get to decrement stock twice because the broker redelivered. That is the bar.

The mistake I see over and over is treating idempotency as a transport feature instead of a system property. Queue deduplication, HTTP verbs, and client retries help, but none of them rescue a design that lets the same business intent create a second side effect. If you want the broader framing for how these integration decisions fit service boundaries and persistence trade-offs, start with App Architecture in Production: Integration Patterns, Code Design, and Data Access.

Where duplicates come from in production

Duplicates do not appear because teams are careless. They appear because distributed systems retry, reorder, and replay.

A client can send a create request, the server can commit it, and the response can still disappear on the wire. That is exactly why HTTP distinguishes idempotent methods and why payment APIs such as Stripe and PayPal expose explicit idempotency mechanisms for unsafe methods like POST.

Message brokers make the problem even more obvious. At-least-once delivery means a consumer can be invoked repeatedly for the same message, and a handler can update the database successfully but fail before acknowledgment, causing the broker to deliver the same message again.

Webhooks are no different. GitHub says webhook deliveries can arrive out of order, failed deliveries are not automatically redelivered, and each delivery carries a unique X-GitHub-Delivery GUID that you should use when protecting against replay. For a practical architecture view of chat endpoints as interaction boundaries, see Chat Platforms as System Interfaces in Modern Systems.

Even systems that advertise stronger guarantees still leave you work to do. Kafka can prevent duplicate entries in Kafka logs with idempotent producers and can provide exactly-once delivery for read-process-write flows that stay inside Kafka with transactions and read_committed consumers. But Kafka's own design docs are clear that external systems still require coordination with offsets and outputs. Google Cloud Pub/Sub exactly-once delivery is limited to pull subscriptions, within a cloud region, and still requires clients to track processing progress until acknowledgment succeeds.

My opinionated summary is simple. Assume the transport will retry. Assume operators will replay. Assume webhooks will arrive late. Design the write path so a repeated intent cannot create a second business effect.

The API contract I actually trust

How do idempotency keys prevent duplicate API requests

The only API contract I trust for mutating operations is caller-supplied intent plus server-side persistence.

AWS recommends a caller-provided request identifier and warns that the service must atomically record the idempotency token together with the mutating work. Stripe stores the first status code and response body for a key, compares later parameters with the original request, and returns the same result for retries. PayPal uses PayPal-Request-Id on supported POST APIs and returns the latest status for the previous request with that same header.

That leads to a practical contract:

The client generates an idempotency key for a business operation.
The server scopes that key by tenant and operation name.
The server stores a request hash so the same key cannot be reused for a different payload.
The server records state such as pending, completed, or failed.
Retries with the same key either return the stored outcome or a stable pointer to it.
Retries with the same key and a different payload fail loudly.
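
The contract above can be sketched in a few lines of Python. This is an in-memory, single-process illustration only, not a production store: the names IdempotencyStore, IdempotencyConflict, and request_hash are hypothetical, and the choice to let a retry re-run the handler after a recorded failure is an assumption you may not want for every operation.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable


class IdempotencyConflict(Exception):
    """The same key was reused with a different payload."""


@dataclass
class Record:
    request_hash: str
    state: str  # "pending", "completed", or "failed"
    response: Any = None


def request_hash(payload: dict) -> str:
    # Canonical JSON so logically equal payloads produce the same hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    def __init__(self) -> None:
        # Keys are scoped by tenant and operation name.
        self._records: dict[tuple[str, str, str], Record] = {}

    def execute(self, tenant: str, operation: str, key: str,
                payload: dict, handler: Callable[[dict], Any]) -> Any:
        scoped_key = (tenant, operation, key)
        digest = request_hash(payload)
        record = self._records.get(scoped_key)
        if record is not None:
            if record.request_hash != digest:
                raise IdempotencyConflict(f"key {key!r} reused with a different payload")
            if record.state == "completed":
                return record.response  # replay the stored outcome
            if record.state == "pending":
                raise RuntimeError("a request with this key is still in progress")
        # First attempt, or a retry after a recorded failure.
        record = Record(request_hash=digest, state="pending")
        self._records[scoped_key] = record
        try:
            record.response = handler(payload)
        except Exception:
            record.state = "failed"
            raise
        record.state = "completed"
        return record.response


charges: list[int] = []


def charge(payload: dict) -> dict:
    charges.append(payload["amount"])  # the side effect we must not repeat
    return {"charge_id": len(charges), "amount": payload["amount"]}


store = IdempotencyStore()
first = store.execute("tenant-a", "create_charge", "key-1", {"amount": 1900}, charge)
retry = store.execute("tenant-a", "create_charge", "key-1", {"amount": 1900}, charge)
```

The retry returns the stored result without calling charge again, while reusing key-1 with a different amount raises IdempotencyConflict instead of silently doing new work.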

There is an IETF Idempotency-Key header draft, but as of 2026-05-09 it is still listed in the IETF Datatracker as an expired Internet-Draft rather than a published RFC. In practice, the header name is still widely useful as a de facto convention, but you should document the contract in your own API instead of pretending the standard is finished.

What should the key represent? Intent. Not an HTTP attempt. Not a TCP connection. Not a retry counter. If the user means "create order 123 once", every retry for that same command must reuse the same key. If the user means "place a second order", that must use a different key.

A request ID is for tracing. An idempotency key is for correctness. If you mix those up, your dashboards look tidy while your money moves twice.

Why PUT is not enough

No, HTTP PUT is not enough to make an operation idempotent.

Yes, RFC 9110 gives PUT idempotent semantics. But if your PUT handler emits a new downstream event, sends an email on every retry, or charges an external provider again, then your implementation has violated the business contract even if your route name looks respectable.

Verb choice helps clients understand intent. It does not implement intent for you.

Use PUT when the resource model genuinely fits a full replacement or upsert style operation. Use POST when you are creating commands or actions. But for any mutation that might be retried across network boundaries, document an explicit idempotency contract. If your mutating actions are triggered from chat workflows, the same contract applies in Slack Integration Patterns for Alerts and Workflows and Discord Integration Pattern for Alerts and Control Loops. Hidden side effects are where architecture goes to die.

How long should an idempotency key be stored

Longer than your transport team wants.

Stripe says keys can be pruned after at least 24 hours. PayPal says retention is API specific and gives examples that can last up to 45 days. Amazon SQS FIFO deduplicates only within a 5-minute window. GitHub keeps recent deliveries for 3 days for manual redelivery. Those numbers are wildly different because the right retention period is a business decision, not a protocol default.

If you only keep keys for five minutes because your queue does, you are not designing idempotency. You are copying a transport limitation into your business layer.

Keep idempotency records for at least the maximum of these windows:

client retry horizon
queue redrive horizon
webhook replay horizon
operator replay horizon
settlement or compensation horizon for money-moving operations

For payments, bookings, and provisioning, that often means hours or days, not minutes.
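
The "maximum of these windows" rule is simple enough to encode directly in configuration. The horizon values below are hypothetical placeholders rather than recommendations; the point of the sketch is only that retention is derived from the longest replay path, not from the queue default.

```python
from datetime import timedelta

# Hypothetical horizons; replace them with figures from your own systems.
horizons = {
    "client_retry": timedelta(hours=1),
    "queue_redrive": timedelta(days=4),
    "webhook_replay": timedelta(days=3),
    "operator_replay": timedelta(days=7),
    "settlement": timedelta(days=2),
}


def retention_period(windows: dict[str, timedelta]) -> timedelta:
    """Keep idempotency records at least as long as the longest window."""
    return max(windows.values())


retention = retention_period(horizons)
```

With these example numbers the operator replay horizon dominates, so records would be kept for seven days.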

AWS also calls out two anti-patterns I fully agree with. Do not use timestamps as the key, because clock skew and collisions make them unreliable. Do not blindly store entire request payloads as the dedup record for every request, because that harms performance and scalability. Store a normalized request hash plus the minimum response state you need to replay safely. If you must reproduce the first response byte for byte, store the canonical response body the way Stripe does.

The database patterns that make idempotency real

Idempotency becomes real when the persistence layer can win a race exactly once.

PostgreSQL gives you two critical primitives here. Unique constraints enforce uniqueness on one or more columns, and INSERT ... ON CONFLICT lets you define an alternative action instead of failing on a uniqueness violation. PostgreSQL also documents that ON CONFLICT DO UPDATE guarantees an atomic insert-or-update outcome under concurrency.

That means your idempotency layer should usually start with a table like this:

create table api_idempotency (
    tenant_id text not null,
    operation text not null,
    idempotency_key text not null,
    request_hash text not null,
    state text not null,
    status_code integer,
    response_body jsonb,
    resource_type text,
    resource_id text,
    created_at timestamptz not null default now(),
    expires_at timestamptz not null,
    primary key (tenant_id, operation, idempotency_key)
);

And the handling flow should look like this:

begin transaction

try insert (tenant_id, operation, idempotency_key, request_hash, state='pending')
on conflict do nothing

load row for (tenant_id, operation, idempotency_key) for update

if row.request_hash != incoming_request_hash
    fail with conflict or validation error

if row.state = 'completed'
    return stored response

if row.state = 'pending' and row was created by another live request
    either wait briefly, or fail fast with a retryable response

perform local business mutation

store stable result in idempotency row
set state = 'completed'

commit
return result

The important part is not the syntax. The important part is the atomicity. Recording the key and performing the mutation must succeed or fail together. AWS says this explicitly for API idempotency, and the same rule applies in SQL-backed services.

Do not do a naive check-then-act sequence like "select key; if missing then insert order". Under concurrency, two requests can pass the check and both create the side effect. A unique constraint is not optional. It is the mechanism that turns your architecture from optimistic folklore into something you can prove under load.

Here is the rule I use in reviews. If the dedup decision is not protected by the same transactional boundary as the mutation, you do not have idempotency. You have hope.
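
A runnable sketch of that flow, under stated assumptions: SQLite from the Python standard library stands in for PostgreSQL, BEGIN IMMEDIATE stands in for the row lock taken by SELECT ... FOR UPDATE, the table is trimmed to the columns the example needs, and create_order, hash_request, and PayloadMismatch are hypothetical names. Because the key row and the order are written in one transaction, the pending state is never visible to other transactions in this simplified version.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.executescript(
    """
    create table api_idempotency (
        tenant_id text not null,
        operation text not null,
        idempotency_key text not null,
        request_hash text not null,
        state text not null,
        response_body text,
        primary key (tenant_id, operation, idempotency_key)
    );
    create table orders (id integer primary key autoincrement, sku text not null);
    """
)


class PayloadMismatch(Exception):
    """The same key arrived with a different request body."""


def hash_request(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def create_order(tenant: str, key: str, payload: dict) -> dict:
    scope = (tenant, "create_order", key)
    digest = hash_request(payload)
    conn.execute("begin immediate")  # take the write lock before deciding
    try:
        conn.execute(
            "insert into api_idempotency "
            "(tenant_id, operation, idempotency_key, request_hash, state) "
            "values (?, ?, ?, ?, 'pending') on conflict do nothing",
            (*scope, digest),
        )
        stored_hash, state, body = conn.execute(
            "select request_hash, state, response_body from api_idempotency "
            "where tenant_id = ? and operation = ? and idempotency_key = ?",
            scope,
        ).fetchone()
        if stored_hash != digest:
            raise PayloadMismatch(key)
        if state == "completed":
            response = json.loads(body)
            conn.execute("commit")
            return response  # replay, no second order row
        cursor = conn.execute("insert into orders (sku) values (?)", (payload["sku"],))
        result = {"order_id": cursor.lastrowid, "sku": payload["sku"]}
        conn.execute(
            "update api_idempotency set state = 'completed', response_body = ? "
            "where tenant_id = ? and operation = ? and idempotency_key = ?",
            (json.dumps(result), *scope),
        )
        conn.execute("commit")  # key record and order commit together
        return result
    except Exception:
        conn.execute("rollback")  # neither the key nor the order survives
        raise


first = create_order("tenant-a", "key-1", {"sku": "widget"})
again = create_order("tenant-a", "key-1", {"sku": "widget"})
```

The second call returns the stored response and leaves the orders table with a single row. On PostgreSQL the same shape would use the schema shown earlier and a real row lock.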

Messages, events, and webhooks need their own boundary

How do consumers handle duplicate events and messages

For message consumers, the classic pattern is still the right one. Record processed message IDs in the same database transaction as the business update. Chris Richardson describes the PROCESSED_MESSAGES table approach directly, using a primary key on subscriber and message ID so duplicates fail cleanly and can be ignored.

Many teams call that explicit processed_messages store an inbox table. The label matters less than the rule. The receiver must persist proof that it already handled the message before a retry can safely do nothing.

A minimal form looks like this:

create table processed_messages (
    subscriber_id text not null,
    message_id text not null,
    processed_at timestamptz not null default now(),
    primary key (subscriber_id, message_id)
);

And the consumer flow is just as strict as the HTTP flow:

begin transaction

insert into processed_messages (subscriber_id, message_id)
values (?, ?)
on conflict do nothing

if no row inserted
    rollback
    ack and ignore duplicate

apply business mutation

commit
ack message

That pattern is boring. Good. Idempotency should be boring.

It is also usually better than trying to lean on broker marketing terms. Kafka's exactly-once support is excellent when you stay inside Kafka's own transactional model, but Kafka's docs still warn that external destinations need cooperation. SQS FIFO reduces duplicate sends only within its 5-minute dedup window. Pub/Sub exactly-once still expects the subscriber to track progress and avoid duplicate work when acknowledgments fail.

Exactly-once is usually a local optimization. Idempotent side effects are the system guarantee.
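
The consumer flow can be sketched the same way. This example again assumes SQLite in place of a production database, uses a hypothetical inventory table as the business state, and leaves the broker acknowledgment to the caller; handle_message is an illustrative name, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.executescript(
    """
    create table processed_messages (
        subscriber_id text not null,
        message_id text not null,
        primary key (subscriber_id, message_id)
    );
    create table inventory (sku text primary key, quantity integer not null);
    insert into inventory (sku, quantity) values ('widget', 10);
    """
)


def handle_message(subscriber_id: str, message_id: str, sku: str, delta: int) -> bool:
    """Apply the stock change once; return False when the message is a duplicate."""
    conn.execute("begin immediate")
    try:
        cursor = conn.execute(
            "insert into processed_messages (subscriber_id, message_id) "
            "values (?, ?) on conflict do nothing",
            (subscriber_id, message_id),
        )
        if cursor.rowcount == 0:
            # Already recorded: skip the mutation, the caller can still ack.
            conn.execute("rollback")
            return False
        conn.execute(
            "update inventory set quantity = quantity + ? where sku = ?",
            (delta, sku),
        )
        conn.execute("commit")  # receipt and mutation commit together
        return True
    except Exception:
        conn.execute("rollback")
        raise


applied = handle_message("stock-service", "msg-1", "widget", -3)
redelivered = handle_message("stock-service", "msg-1", "widget", -3)
```

After both calls the quantity is 7, not 4, because the redelivery finds the existing receipt and does nothing.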

Pair dedup with the outbox pattern

If your service updates local state and also publishes an event, idempotent consumption alone is not enough. You also need a safe way to get the event out after the local transaction commits.

That is why the transactional outbox pattern matters. Chris Richardson describes the basic idea as writing the event to an outbox table in the same transaction as the business update, and then publishing it asynchronously. Debezium says the outbox pattern avoids inconsistencies between a service's internal state and the events consumed by other services. NServiceBus goes further and shows how outbox processing deduplicates incoming messages and avoids zombie records and ghost messages.

This is the architecture I recommend for services that own data and publish integration events:

Validate and persist the command under an idempotency key.
Write business state and outbox event in one local transaction.
Let CDC or an outbox dispatcher publish the event.
Make downstream consumers idempotent too.

Outbox does not remove the need for idempotent consumers. It removes the need to pretend that a database commit and a broker publish can be one magical distributed transaction when they usually cannot.
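
A minimal sketch of steps two and three, assuming SQLite as the local database and a polling dispatcher rather than CDC. The table layout and the names place_order and dispatch_outbox are hypothetical, and publish is whatever callable hands the event to your broker.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.executescript(
    """
    create table orders (id text primary key, status text not null);
    create table outbox (
        event_id integer primary key autoincrement,
        event_type text not null,
        payload text not null,
        published_at text
    );
    """
)


def place_order(order_id: str) -> None:
    """Write business state and the integration event in one local transaction."""
    conn.execute("begin immediate")
    try:
        conn.execute("insert into orders (id, status) values (?, 'placed')", (order_id,))
        conn.execute(
            "insert into outbox (event_type, payload) values (?, ?)",
            ("order_placed", json.dumps({"order_id": order_id})),
        )
        conn.execute("commit")
    except Exception:
        conn.execute("rollback")  # no order means no event, and vice versa
        raise


def dispatch_outbox(publish) -> int:
    """Publish unpublished events in insertion order and mark each as sent."""
    rows = conn.execute(
        "select event_id, event_type, payload from outbox "
        "where published_at is null order by event_id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # A crash between publish and the update re-sends the event on the
        # next run, which is why downstream consumers must be idempotent.
        publish(event_id, event_type, json.loads(payload))
        conn.execute(
            "update outbox set published_at = datetime('now') where event_id = ?",
            (event_id,),
        )
    return len(rows)


published: list[tuple] = []
place_order("order-1")
sent = dispatch_outbox(lambda *event: published.append(event))
```

The dispatcher gives at-least-once publication by design. It trades a possible duplicate event for the guarantee that a committed order never goes unannounced.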

Webhooks are just messages with better branding

Treat inbound webhooks exactly like messages from an untrusted network edge.

GitHub documents that deliveries can arrive out of order, recommends using X-Hub-Signature-256 to verify authenticity, and provides X-GitHub-Delivery as the unique delivery identifier. It also notes that redeliveries reuse the same delivery ID.

So the architecture is straightforward:

verify the signature first
use the delivery GUID as the dedup key
persist receipt before side effects
make handlers order-aware rather than assuming arrival order
enqueue the heavy work and return fast

If your webhook handler writes directly to business tables before it records receipt, it is not production-ready. It is just faster at making duplicate mistakes.
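A minimal sketch of the first three steps, assuming a SQLite receipts table and a hypothetical `enqueue` callback. The signature format (the string `sha256=` followed by the hex HMAC-SHA256 of the raw body) follows GitHub's documented scheme; everything else here is illustrative.

```python
import hashlib
import hmac
import sqlite3


def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), header.encode())


def receive(conn, secret, headers, body, enqueue):
    """Verify, record receipt, then hand the heavy work to a queue."""
    if not verify_signature(secret, body, headers.get("X-Hub-Signature-256", "")):
        return "rejected"
    delivery_id = headers.get("X-GitHub-Delivery")
    if not delivery_id:
        return "rejected"
    try:
        with conn:  # the primary key makes the dedup decision atomic
            conn.execute(
                "INSERT INTO webhook_receipts (delivery_id) VALUES (?)",
                (delivery_id,),
            )
    except sqlite3.IntegrityError:
        return "duplicate"  # redelivery with the same delivery ID
    enqueue(delivery_id, body)
    return "accepted"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webhook_receipts (delivery_id TEXT PRIMARY KEY)")
```

In this sketch the enqueue happens after the receipt commits, so a crash in between would drop the work. Writing the queue entry in the same transaction as the receipt, outbox style, closes that gap.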

Sagas and workflow engines still need idempotency

Sagas and durable workflow engines do not delete the problem. They make it visible.

Temporal recommends writing Activities to be idempotent because Activities can be retried after failures or timeouts. Its docs even call out the edge case where a worker completes an external side effect successfully but crashes before reporting completion, which causes the Activity to run again. Temporal also suggests using a combination of Workflow Run ID and Activity ID as a stable idempotency key when calling downstream services. If you are applying this in service orchestration, Go Microservices for AI/ML Orchestration covers the broader workflow trade-offs.

That is exactly the right mental model. A workflow engine can preserve execution history and coordinate retries. It cannot retroactively uncharge a card or unsend an email unless your application gives it idempotent steps and idempotent compensations.

The same applies to sagas. Temporal's own saga guidance describes compensating actions that run when a step fails. Those compensations must be idempotent too. If "refund payment" runs twice, you may have solved the original bug by creating a new one.

My rule here is brutal and simple. Every Activity, every command handler, and every compensation that touches the outside world should either be naturally idempotent or carry a real idempotency key to the downstream system.
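To make the key idea concrete, here is a small sketch that does not use the Temporal SDK at all. `activity_idempotency_key` and `FakePaymentGateway` are hypothetical stand-ins: the first builds a stable key from the Workflow Run ID and Activity ID, the second models a downstream service that honors that key.

```python
def activity_idempotency_key(workflow_run_id: str, activity_id: str) -> str:
    """Same run and same activity always yield the same key, across retries."""
    return f"{workflow_run_id}:{activity_id}"


class FakePaymentGateway:
    """Stand-in for a downstream API that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        # A retried Activity gets the original result, not a second charge.
        return self.charges[idempotency_key]


gateway = FakePaymentGateway()
key = activity_idempotency_key("run-42", "charge-card")
first = gateway.charge(key, 1999)
retry = gateway.charge(key, 1999)  # worker crashed before reporting completion
```

The same construction applies to compensations: a refund step would carry its own activity ID and therefore its own stable key.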

How to test idempotency before production

Most teams test happy paths and then act surprised when retries happen. That is not enough.

You should have automated tests for at least these cases:

the server commits the mutation but the response never reaches the client
two identical requests race with the same idempotency key
the same key is reused with a different payload
a consumer commits its database work and crashes before ack
a webhook is replayed with the same delivery ID
an outbox dispatcher publishes the same event more than once
a workflow Activity completes the external call and crashes before completion is reported
an idempotency record expires and a genuine late retry arrives

AWS explicitly recommends comprehensive test suites that include successful requests, failed requests, and duplicate requests. That advice is pedestrian and absolutely correct.

I would add one more failure drill. Verify that the replayed response is semantically equivalent to the first result. AWS discusses late-arriving retries and argues for responses that preserve the original meaning even after underlying state has changed. That is the difference between "no extra side effect happened" and "the caller still has a consistent contract."
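Two of the cases above, a replay with the same key and a key reused with a different payload, can be exercised against a tiny in-memory model. `IdempotentHandler` and `IdempotencyConflict` are hypothetical names, and the dict stands in for a store that in production would have to be written atomically with the mutation.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different payload."""


class IdempotentHandler:
    def __init__(self, handler):
        self._handler = handler
        # (tenant, operation, key) -> (payload fingerprint, stored response)
        self._records = {}

    def handle(self, tenant, operation, key, payload):
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        scoped = (tenant, operation, key)
        if scoped in self._records:
            stored_fingerprint, response = self._records[scoped]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(f"key {key!r} reused with new payload")
            return response  # replay the original result, no new side effect
        response = self._handler(payload)
        self._records[scoped] = (fingerprint, response)
        return response
```

A test can then assert that the wrapped handler ran once, that the replayed response equals the first one, and that a changed payload raises instead of silently succeeding.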

Opinionated rules that save real systems

Here are the rules I would enforce in an architecture review.

First, idempotency keys belong to business intent, not transport attempts.

Second, scope every key by tenant and operation. Global key spaces are how unrelated requests collide.

Third, persist the dedup decision atomically with the mutation. If that is not true, the design is wrong.

Fourth, reject same-key different-payload retries. Stripe and AWS both do this for good reason.

Fifth, keep keys for the full replay horizon of the business process, not for the shortest queue window.

Sixth, pair producers with an outbox and consumers with message ID tracking. One side without the other is half a design.

Seventh, propagate the same operation identity downstream when the business action is the same. AWS explicitly recommends passing the idempotency token along the processing chain.

Eighth, never assume exactly-once marketing removes the need for idempotent side effects.

If that sounds strict, good. Idempotency is where optimistic architecture meets production reality. You do not need complexity everywhere. But wherever duplicate side effects would hurt money, state, or trust, idempotency should be a first-class part of the contract.

Hermes Voice Control from Your Phone

Rost — Sun, 10 May 2026 11:12:56 +0000

You already chat with Hermes Agent from your phone by text.
Now you want to talk to it directly and get spoken replies back.
That is usually the right move, especially if you already use Hermes as a persistent self-hosted assistant.
Typing long prompts on a small screen is slow and error-prone.

Voice mode makes Hermes practical in the moments where it matters most, while walking, commuting, or doing admin work away from your desk.

The good news is that voice mode can run with zero paid APIs. A local faster-whisper model handles transcription, and Edge TTS handles spoken output for free. This guide covers setup, provider choices, platform differences, practical command patterns, and the failure modes that usually block first-time users.

How the Pipeline Works

Three stages, no magic:

Transcription (STT): Your voice message becomes text.
Reasoning: Hermes processes that text exactly like a typed request.
Synthesis (TTS): The response text is converted back to audio.

The important distinction from consumer assistants is execution depth. Hermes is not just answering trivia. It can call tools, inspect files, run code paths, and continue multi-step work from memory. In practice, that means voice can trigger real workflows such as incident triage, draft generation, and targeted debugging. If you want the broader architecture context, the AI Systems pillar explains how this voice layer fits into local agent infrastructure.

What Voice Control Is Great For

Use voice mode when keyboard precision is not required yet:

Operational checks while away from your laptop.
Idea capture for drafts, outlines, and rough specs.
Fast triage of alerts and errors before deeper desktop follow-up.
Hands-busy workflows where speaking is the only realistic input channel.

Voice Input: Pick an STT Provider

Provider	Cost	API Key	Notes
Local faster-whisper	Free	None	On-device, ~150 MB model, 90+ languages
Groq Whisper	Free tier	`GROQ_API_KEY`	Fast cloud inference
OpenAI Whisper	Paid	`VOICE_TOOLS_OPENAI_KEY`	Highest accuracy
Mistral Voxtral	Paid	`MISTRAL_API_KEY`	Alternative cloud option

Configuration in ~/.hermes/config.yaml:

stt:
  enabled: true
  provider: local
  local:
    model: base  # tiny, base, small, medium, large-v3

Start with local. It works immediately, handles multilingual speech, and adds no recurring cost. Move to Groq or OpenAI only if your local setup cannot meet your latency or accuracy requirements. For command-level setup and diagnostics while testing providers, keep the Hermes CLI cheat sheet nearby.

Faster Whisper Model Selection

Use a simple progression:

tiny for very low-power devices where speed matters most.
base as the default balance for laptops and small servers.
small when accents, noisy environments, or domain terms reduce accuracy.
medium or large-v3 when quality is critical and hardware budget is higher.

If your transcripts are consistently wrong, increase model size first before adding more prompt complexity.

Voice Output: TTS Providers

Provider	Quality	Cost	Best For
Edge TTS (default)	Good	Free	Quick start, 322 voices, 74 languages
ElevenLabs	Excellent	Paid	Premium quality, voice cloning
OpenAI TTS	Good	Paid	Natural voices, 6 options
MiniMax TTS	Excellent	Paid	Fine-grained speed/volume/pitch control
NeuTTS	Good	Free (local)	Fully offline, voice cloning

Configuration:

tts:
  provider: "edge"
  speed: 1.0

  edge:
    voice: "en-US-AriaNeural"

One critical detail is output format. Telegram voice bubbles are most reliable when audio is encoded as OGG with Opus. Hermes relies on ffmpeg for these conversions in common setups. If ffmpeg is missing, replies often show up as file attachments instead of inline voice bubbles.

Install ffmpeg early:

sudo apt install ffmpeg  # Ubuntu/Debian
brew install ffmpeg       # macOS
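If you want to check the conversion outside Hermes, the following sketch builds a plain ffmpeg command for OGG with Opus. It is not Hermes's internal code; `ogg_opus_command` and `convert_to_voice_note` are hypothetical helpers, and only standard ffmpeg options are used (`-y` to overwrite, `-i` for input, `-c:a libopus` for the audio codec).

```python
import shutil
import subprocess


def ogg_opus_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg invocation that re-encodes audio as Opus in an OGG file."""
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", dst]


def convert_to_voice_note(src: str, dst: str) -> None:
    """Run the conversion, failing loudly if ffmpeg is not installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ogg_opus_command(src, dst), check=True)
```

Calling `convert_to_voice_note("reply.mp3", "reply.ogg")` on a sample file is a quick way to confirm that the ffmpeg build on your host includes the libopus encoder.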

Platform Workflows and Practical Differences

Telegram

Telegram is the easiest place to start. Voice messages are first-class on mobile, and the interaction loop is simple: hold, speak, release, receive.

Setup:

# 1. Create a bot via @BotFather, get your token
# 2. Add to ~/.hermes/.env:
TELEGRAM_BOT_TOKEN=***
TELEGRAM_ALLOWED_USERS=your_user_id

# 3. Start the gateway
hermes gateway start

Then open the Hermes chat, tap the microphone, and speak. If STT and TTS are enabled, Hermes transcribes your request, executes it, and sends a voice reply.

Discord

Discord supports two useful modes. Voice messages in DMs or channels are close to Telegram behavior.

The more advanced option is live voice channels. In that flow, Hermes can participate continuously, transcribing speech and responding without explicit message bubbles.

Requirements:

Message Content Intent enabled in your bot settings
Server Members Intent enabled
Bot permissions: Connect and Speak

Signal

Signal works through the signal-cli daemon. Voice messages still use the same Hermes STT and TTS pipeline.

A useful pattern is running signal-cli as a linked device and using Signal Note to Self. You can leave yourself a voice note and get Hermes output in the same thread.

WhatsApp

WhatsApp follows the same gateway model. Audio messages transcribe automatically once the connector is configured.

Mobile App Permissions

Both iOS and Android need microphone access for the messaging app you're using.

iOS: Settings → Telegram (or Discord) → Permissions → Microphone → Allow. Enable Background App Refresh for instant responses.

Android: Settings → Apps → Telegram → Permissions → Microphone → Allow. For Discord voice channels, enable overlay permission.

Pinning the Hermes bot chat to your home screen helps — one tap to start speaking.

Speaking Patterns That Work Reliably

Voice interaction has different ergonomics than typing. You cannot easily paste logs or quote long stack traces, so structure matters:

Be explicit. Say the action, scope, and output format in one sentence.
Keep one objective per message. Split multi-step jobs into short follow-ups.
Constrain output. Ask for numbered actions or a 3-point summary when mobile readability matters.
Stay short. Around 10 to 30 seconds per message usually transcribes better.
Use iterative turns. Correct and refine in the next voice message instead of overloading the first.

Example Prompts You Can Speak

"Check deployment logs for the last one hour and report only critical errors."
"Create a draft outline for a post about OpenTelemetry migration with five sections."
"Summarize this bug in three bullets and propose the most likely root cause."
"Review the config and tell me what to change for lower transcription latency."

Common Use Cases with Concrete Outcomes

Operations — "Check production health and list failed services." Outcome is a focused status update you can act on immediately.
Writing — "Turn these rough points into a publishable intro paragraph." Outcome is polished text from spoken notes.
Debug triage — "Investigate this TypeError and suggest the first fix to test." Outcome is a concrete next step before opening the IDE.
Research — "Find three recent sources on topic X and summarize differences." Outcome is a compressed briefing for later deep work.
Automation — "Run the home routine and confirm device states." Outcome is direct action plus confirmation.

Troubleshooting

Voice messages not transcribing: Confirm stt.enabled: true in config.yaml. Verify local dependencies are installed. Then restart with hermes gateway restart.

TTS not responding: Confirm tts.provider is set. If using a paid provider, verify the API key in .env. Validate current voice settings from the Hermes CLI status commands.

Poor transcription quality: Increase stt.local.model from base to small or medium. Reduce noise and speak in shorter segments. If needed, switch to cloud STT for better accuracy.

Voice bubbles showing as files on Telegram: Install ffmpeg and restart the gateway. This is the most common issue.

The Free Stack

For cost-conscious setups, this baseline is strong:

STT: Local faster-whisper with no API key
TTS: Edge TTS with wide language coverage
Total cost: $0

This is a meaningful advantage over many closed assistants where voice quality and automation quickly become paid-only features.

If quality requirements increase, upgrade one layer at a time. Usually STT upgrades produce the biggest immediate gain, then TTS quality can be improved later if needed.

FAQ Topics in Practice

The four most common user questions are predictable. They also overlap with memory and profile design concerns covered in Hermes Agent Memory System and Hermes production setup patterns.

Whether voice commands get the same tool access as text.
Whether a free stack is viable for daily use.
Why Telegram sometimes shows attachments instead of voice bubbles.
Which local Whisper model should be used first.

This guide addresses each of these directly in setup, tuning, and troubleshooting sections so you can move from first run to stable daily usage quickly.

Quick Start Recap

# 1. Install voice extras
pip install "hermes-agent[all]"

# 2. Set up Telegram gateway
hermes gateway setup

# 3. Install ffmpeg (required for Telegram voice bubbles)
sudo apt install ffmpeg

# 4. Send a voice message from your phone
# Hermes transcribes, processes, and responds

From there, iterate based on your real bottleneck. If latency is the issue, tune model size or cloud STT. If audio quality is the issue, tune TTS provider and voice preset. Start free, measure, then upgrade only where it actually improves your workflow.

Kanban in Hermes Agent for Self Hosted LLM Workflows

Rost — Fri, 08 May 2026 09:48:56 +0000

Hermes Agent ships with a Kanban-style board, and together with the Hermes Gateway it can saturate your self-hosted LLM if too many tasks are dispatched at once.

Put bluntly, it is easy to DDoS your own LLM this way.

Hermes Kanban is a durable multi-profile board backed by ~/.hermes/kanban.db.

Each lane represents a phase of work, and each card is a task that can be claimed by a specific Hermes profile.

Out of the box, the dispatcher can promote many ready tasks in one pass. That is fine for elastic cloud APIs, but it can overload a small self-hosted GPU cluster.

If you are new to this stack, start with the broader Hermes setup and operations guide and the AI Systems pillar for surrounding architecture.

This post shows how to:

Understand how Hermes Kanban dispatch interacts with your LLM gateway.
Control parallelism safely for heavy tasks.
Batch promotions with cron so background jobs do not collide with interactive use.
Monitor and tune the system so GPUs stay busy without overload.

How Hermes Kanban and the dispatcher work

At a high level, the system has three layers:

Board - durable SQLite state for tasks, columns, relations, and history.
Workers - Hermes profiles started in isolated workspaces to process a task.
Dispatcher - a long-lived process that scans for dispatchable cards and starts runs.

Tasks created from CLI or dashboard usually start in backlog or ready.

The dispatcher scans for eligible cards, claims one atomically, and starts the assigned profile with its tools and memory.

Each worker then calls your LLM gateway or local runtime (for example, OpenAI-compatible endpoints backed by Ollama, vLLM, or llama.cpp). For deployment choices across these runtimes, use the LLM Hosting in 2026 Local Self-Hosted and Cloud Infrastructure Compared. If you are tuning request fan-out on Ollama itself, this pairs well with How Ollama Handles Parallel Requests.

If you add many heavy tasks and do not cap promotions, your gateway can get flooded with concurrent requests.

On a single-GPU or CPU-bound host, that often means queueing, thrashing, and timeouts instead of better throughput.

The practical limitation today

In the Hermes builds many teams currently run, the dispatcher config exposes only two Kanban dispatch keys and does not apply a global active-task cap from config:

kanban:
  dispatch_in_gateway: false
  dispatch_interval_seconds: 10

For active-task control, rely on explicit dispatch cadence (hermes kanban dispatch --max ...) plus dependency modeling.

Known gotchas:

Do not run gateway-embedded dispatch and hermes kanban daemon --force against the same board, or you can get claim races.
If the gateway is down, ready tasks do not dispatch and can burst later when service returns.
Longer dispatch intervals feel uneven because claiming happens in ticks.
Behavior can vary across versions because run-state and reclaim edge cases were patched over time.

Quick verification when behavior looks wrong:

# 1) confirm exactly one dispatcher path is active
pgrep -af "hermes gateway start|hermes kanban daemon"

# 2) check the wired Kanban dispatcher keys
rg "dispatch_in_gateway|dispatch_interval_seconds" ~/.hermes/config.yaml

# 3) inspect queue shape
hermes kanban list --status ready
hermes kanban list --status running

Key ideas:

Dispatcher config wires dispatch_in_gateway and dispatch_interval_seconds.
dispatch --max limits new spawns in that pass, not total running tasks.
For small self-hosted clusters, start conservative and increase only after latency stays stable.

When first deploying Hermes near your LLM gateway:

Keep only supported Kanban dispatcher keys in config.
Observe GPU and CPU utilization under real queue pressure.
Use Strategy 1 or Strategy 2 for deterministic pacing.

Investigation findings and root cause

hermes kanban dispatch does not read config.yaml for max_active_tasks.

In hermes_cli/kanban.py, the dispatch command exposes --max as a CLI cap (default None) and passes only args.max into kb.dispatch_once(...). There is no max_active_tasks config lookup in this path. See hermes_cli/kanban.py raw.

Then in kanban_db.dispatch_once, the only cap is max_spawn, with logic equivalent to:

if max_spawn is not None and spawned >= max_spawn:
    break

There is no check of already running tasks and no max_active_tasks reference in that dispatch path. See hermes_cli/kanban_db.py raw.

Effective behavior:

hermes kanban dispatch: unbounded for that pass (limited by ready queue size).

hermes kanban dispatch --max 2: caps only new spawns in that pass, not total running tasks.

The wired config knobs around gateway dispatch are kanban.dispatch_in_gateway and kanban.dispatch_interval_seconds.

So max_active_tasks is ignored in this dispatch path because it is not implemented there.
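The consequence is easy to see in a toy model. The function below is not the Hermes implementation; it is a hypothetical simulation that reproduces only the per-pass cap described above, with plain lists standing in for the ready and running columns.

```python
def dispatch_once(ready, running, max_spawn=None):
    """Spawn ready tasks until max_spawn is reached for this pass.

    Like the logic described above, this never looks at how many tasks
    are already running.
    """
    spawned = 0
    while ready:
        if max_spawn is not None and spawned >= max_spawn:
            break
        running.append(ready.pop(0))
        spawned += 1
    return spawned


ready = [f"task-{i}" for i in range(6)]
running = []

# Three dispatcher passes while no task has finished yet.
for _ in range(3):
    dispatch_once(ready, running, max_spawn=2)

print(len(running))  # 6
```

Each pass respects `max_spawn=2`, yet after three passes all six tasks are running at once, which is exactly the overload a small GPU host cannot absorb.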

Strategy 1 - Encode dependencies for strictly sequential flows

Some workflows should run strictly one after another — for example:

multi-step data pipelines with shared intermediate artefacts
migrations or infrastructure changes
batch jobs that write to the same object store or database

Hermes Kanban supports parent-child dependencies between tasks so that a child card becomes dispatchable only when its parent is done.

You can model this with a small helper script around the Hermes CLI:

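#!/usr/bin/env bash

set -euo pipefail

# Create the parent task and capture the ID printed by the CLI.
parent_id="$(hermes kanban add \
  --title 'Ingest customer logs for April' \
  --profile 'etl-worker' \
  --column backlog)"

# First child: dispatchable only after the parent is done.
hermes kanban add \
  --title 'Generate April anomaly report' \
  --profile 'analytics-worker' \
  --column backlog \
  --parent "${parent_id}"

# Second child of the same parent. Both children become ready together,
# so their relative order depends on the dispatcher limits, not on the board.
# For a strict chain, capture the report task's ID and use it as --parent here.
hermes kanban add \
  --title 'Publish April summary to dashboard' \
  --profile 'reporting-worker' \
  --column backlog \
  --parent "${parent_id}"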

With an appropriate board policy and low dispatcher limits, only the parent task runs first.

Once it finishes, the child tasks become ready, and the dispatcher pulls them one by one without ever exceeding your concurrency caps.

Strategy 2 - Use Linux cron with a running-aware dispatch cap

If you want deterministic pacing, use host cron plus a small wrapper script.

Instead of always calling dispatch --max 2, first count currently running tasks, then dispatch only the remaining slots.

Create hermes-kanban-dispatch-capped.sh:

#!/usr/bin/env bash
set -euo pipefail

MAX_PARALLEL="${MAX_PARALLEL:-2}"
BOARD="${BOARD:-}"

board_args=()
if [[ -n "$BOARD" ]]; then
  board_args=(--board "$BOARD")
fi

# Cron runs with a minimal PATH; adjust this to wherever hermes is installed.
export PATH="/home/abc/.local/bin:$PATH"

running_out="$(hermes kanban "${board_args[@]}" list --status running)"

if [[ "$running_out" == *"(no matching tasks)"* ]]; then
  running_count=0
else
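  # Count non-blank lines only, so empty output yields 0 rather than 1.
  # Assumes one task per line and no header rows in the list output.
  running_count="$(printf '%s\n' "$running_out" | awk 'NF { n++ } END { print n + 0 }')"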
fi

slots=$(( MAX_PARALLEL - running_count ))

if (( slots <= 0 )); then
  echo "Already at limit running=$running_count max=$MAX_PARALLEL dispatch skipped"
  exit 0
fi

echo "running=$running_count max=$MAX_PARALLEL slots=$slots dispatching up to $slots"

hermes kanban "${board_args[@]}" dispatch --max "$slots"

Make it executable:

chmod +x ./hermes-kanban-dispatch-capped.sh

Run it with:

MAX_PARALLEL=2 ./hermes-kanban-dispatch-capped.sh

For a specific board:

BOARD=my-board MAX_PARALLEL=2 ./hermes-kanban-dispatch-capped.sh

Schedule it once per minute with cron:

* * * * * /opt/hermes/scripts/hermes-kanban-dispatch-capped.sh >> /var/log/hermes/kanban-cron.log 2>&1

Operational notes:

Cron often has a minimal PATH, so if hermes is not found, use its full path inside the script (for example /usr/local/bin/hermes).
If you log to /var/log/hermes/..., create that directory first and ensure the cron user has write access.

Example:

sudo mkdir -p /var/log/hermes
sudo chown "$USER":"$USER" /var/log/hermes

Create or edit cron entries with:

crontab -e

Then verify with:

crontab -l

Sub-minute cadence with one cron entry

Cron ticks once per minute, but you can still dispatch more frequently by running a short loop inside the script.

Example hermes-kanban-dispatch-subminute.sh:

#!/usr/bin/env bash
set -euo pipefail

LOCK_FILE="/tmp/hermes-kanban-dispatch.lock"
RUNS_PER_MINUTE="${RUNS_PER_MINUTE:-4}"    # 4 runs => every 15 seconds
CAP_SCRIPT="${CAP_SCRIPT:-/opt/hermes/scripts/hermes-kanban-dispatch-capped.sh}"

exec 9>"$LOCK_FILE"
flock -n 9 || exit 0

sleep_seconds=$(( 60 / RUNS_PER_MINUTE ))

for ((i=1; i<=RUNS_PER_MINUTE; i++)); do
  "$CAP_SCRIPT"

  if (( i < RUNS_PER_MINUTE )); then
    sleep "$sleep_seconds"
  fi
done

Make it executable:

chmod +x ./hermes-kanban-dispatch-subminute.sh

Schedule it once per minute:

* * * * * /opt/hermes/scripts/hermes-kanban-dispatch-subminute.sh >> /var/log/hermes/kanban-subminute.log 2>&1

This gives an effective sub-minute cadence while flock prevents overlapping runs.
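The same non-blocking lock can be taken from Python through `fcntl.flock`, which may help if you later move the loop out of shell. This is a Unix-only sketch, and `try_lock` is a hypothetical helper name.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    Mirrors `flock -n`: the call never waits. The lock is released when the
    returned file object is closed or the process exits.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

A dispatcher loop would call `try_lock` once at startup and exit immediately on `None`, which is the behavior of `flock -n 9 || exit 0` in the script above.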

Why this works:

list --status running gives current running load.
dispatch --max N caps only new spawns for that pass.
Computing N as remaining slots keeps total running tasks near your target limit.
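The arithmetic behind the wrapper is small enough to check in isolation. The sketch below mirrors the shell logic under the same assumptions: the list command prints one task per non-blank line and prints "(no matching tasks)" when nothing is running. `count_running` and `free_slots` are hypothetical names.

```python
def count_running(list_output: str) -> int:
    """Count running tasks from the text printed by the list command."""
    if "(no matching tasks)" in list_output:
        return 0
    return sum(1 for line in list_output.splitlines() if line.strip())


def free_slots(max_parallel: int, running_count: int) -> int:
    """Slots left for this pass; never negative, so 0 means skip dispatch."""
    return max(0, max_parallel - running_count)


# One task already running with a target of 2 leaves room for one spawn.
slots = free_slots(2, count_running("t_101  running  etl-worker\n"))
```

Passing `slots` as the `--max` value is what turns a per-pass spawn cap into an approximate cap on total running tasks.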

Important caveat: this cap works only for dispatches made through this script.

Disable gateway embedded dispatch, otherwise it can still promote tasks independently:

kanban:
  dispatch_in_gateway: false

The official docs describe both command capabilities and note gateway dispatch defaults in the Kanban feature guide: Hermes Kanban docs.

Internal Hermes Cron

Do not use it.
Do you really want your LLM to process recurring prompts like "Execute in terminal the command /path/hermes-kanban-dispatch-capped.sh", especially when it is busy doing useful work?

Hermes Kanban Monitoring and Tuning

Whichever strategy you choose, you should monitor:

LLM gateway metrics — request rate, latency, error rate, token throughput.
Node health — GPU utilisation, VRAM usage, CPU load and RAM.
Hermes metrics — how many tasks are in backlog, ready, active and done.

For production metric baselines and dashboards, see Monitor LLM Inference in Production with Prometheus and Grafana and the broader LLM Performance hub.

Start with low concurrency, then gradually raise limits while watching for:

rising latency at constant throughput
increasing timeout or rate limit errors
long tails where some tasks stay active for a very long time

As soon as you see these symptoms, roll back to the previous stable configuration and keep that as your default.

When Kanban is the right tool

Hermes Kanban shines when you have:

long-lived research or engineering backlogs
multi-agent collaboration with named profiles
workflows that must survive restarts and host reboots
humans who want a dashboard to triage work

If you only need a single run to create a few temporary helpers, the built-in delegate task tools are usually simpler.

Once you need history, dashboards, and strict control over how your agents hit self-hosted LLMs, the Kanban board plus dispatcher is the right foundation.

With a few configuration tweaks and optional cron-based batching, you can keep Hermes Kanban responsive while protecting your gateway and hardware.

Hermes Agent Skill Authoring — SKILL.md Structure and Best Practices

Rost — Wed, 06 May 2026 08:10:04 +0000

Hermes Agent treats skills as the default way to teach repeatable workflows. Official documentation describes them as on-demand knowledge documents aligned with the open agentskills.io shape, loaded through progressive disclosure so the model sees a small index first and only pulls full instructions when a task actually needs them.

Authoring is less about clever wording than about packaging—you are telling the runtime when to load a procedure, what order of steps counts as “done,” and how to tell success from a silent failure. This article stays focused on SKILL.md structure, supporting folders, visibility rules, and the split between secrets and non-secret settings—the details that decide whether a skill shows up in /slash commands, survives a hub install, or quietly disappears on CI.

Hermes sits inside the broader AI Systems: Self-Hosted Assistants, RAG, and Local Infrastructure cluster, where assistants are treated as systems built from inference, retrieval, memory, and tooling rather than as a single chat surface. Install paths, provider wiring, gateway behavior, and the layout of ~/.hermes are all spelled out in the Hermes AI Assistant - Install, Setup, Workflow, and Troubleshooting guide; day-to-day shell ergonomics—hermes skills, profiles, gateway, memory—are easier to scan in the Hermes Agent CLI cheat sheet — commands, flags, and slash shortcuts. In real deployments, skills inherit isolation from profiles (separate config, secrets, memories, and skill trees). Hermes AI Assistant Skills for Real Production Setups argues for treating those profiles—not individual markdown files—as the unit of ownership; keep that in mind when you name skills and decide what belongs in shared external_dirs versus a single profile.

Skill or tool?

Official guidance is blunt. Use a skill when the capability is mostly prose instructions plus shell commands and tools Hermes already exposes—wrapping a CLI, driving git, calling curl, or using web_extract for structured fetches. Use a tool when you need tight integration for API keys and auth flows, deterministic binary handling, streaming, or Python that must execute the same way every time.

That boundary matters in practice because skills ship without changing agent code, while tools carry review and release overhead. Most teams benefit from starting with a skill, then promoting only the brittle core to a tool once the failure modes are obvious (auth refresh loops, binary parsers, strict idempotency).

Procedures versus curated memory

Skills answer how to run a workflow; Hermes’ bounded core memory answers what has already been agreed about the user and the project. A skill loads when the task matches its description; MEMORY.md and USER.md stay in the prompt as a small, curated fact layer. The two mechanisms stack rather than compete, and the full picture of snapshots, limits, and external providers is laid out in Hermes Agent Memory System: How Persistent AI Memory Actually Works.

Anatomy of a skill directory

On disk, every skill is a folder under ~/.hermes/skills/, often nested under a category such as devops/ or research/. Hermes expects SKILL.md at the leaf; everything else is optional structure you add when the instructions would otherwise sprawl. The usual pattern is references/ for long tables or vendor docs, templates/ for output skeletons, scripts/ for deterministic helpers, and assets/ for static files the agent should not re-fetch.

That layout mirrors how progressive disclosure works in practice: the agent can stay at the main file until it truly needs a deep appendix. Keeping “happy path” prose in SKILL.md and pushing rarely used detail into references/ is one of the cheapest ways to protect token budgets.

Hermes can also merge in external skill directories via skills.external_dirs in config.yaml. Those paths are scanned for discovery, but the agent still writes through skill_manage into the primary ~/.hermes/skills/ tree. Local names shadow external ones, so if you “fix” a shared skill in your home directory, teammates pulling the same external repo will not see your edit until they remove or rename the local copy—a common source of “it works on my machine” confusion.
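The shadowing rule can be pictured as a first-hit-wins lookup. The sketch below is an assumption about the general shape, not Hermes's discovery code: it treats the folder that contains SKILL.md as the skill name and scans the local tree before any external one. `discover_skills` is a hypothetical helper.

```python
from pathlib import Path


def discover_skills(local_root: Path, external_roots: list[Path]) -> dict[str, Path]:
    """Map skill name to its SKILL.md, with local skills shadowing external ones."""
    resolved: dict[str, Path] = {}
    for root in [local_root, *external_roots]:
        for skill_md in sorted(root.rglob("SKILL.md")):
            name = skill_md.parent.name  # assumption: folder name is the skill name
            # setdefault keeps the first hit, and local_root is scanned first.
            resolved.setdefault(name, skill_md)
    return resolved
```

Printing the resolved path for a skill name is a quick way to find out whether a stale local copy is hiding the shared version.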

SKILL.md frontmatter that survives review

The body of SKILL.md is Markdown; the opening block must be valid YAML between --- delimiters. Real skills accumulate long fenced examples, so the small habits from Markdown Code Blocks: Complete Guide with Syntax, Languages & Examples—consistent language tags, readable excerpts, tight fences—keep large files maintainable for humans and slightly easier for the model to scan.

Required fields are name and description. The name becomes the slash route and index key; it stays lowercase with hyphens and must respect the documented length cap. The description is the only prose many sessions ever pay for at level zero, so it should read like a search result or router string (“when backups look stale, verify latest archive and checksum”), not the first paragraph of a blog post.
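A pre-commit check for those two fields might look like the sketch below. The length limits are placeholder defaults, not values taken from this article, so confirm them against the official documentation before relying on them; `check_frontmatter` is a hypothetical helper.

```python
import re

# Lowercase letters or digits, separated by single hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_frontmatter(name, description, max_name_len=64, max_description_len=1024):
    """Return a list of problems; an empty list means both fields look fine.

    max_name_len and max_description_len are assumed placeholder limits.
    """
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must be lowercase letters or digits joined by hyphens")
    if len(name) > max_name_len:
        problems.append(f"name is longer than {max_name_len} characters")
    if not description.strip():
        problems.append("description is required")
    elif len(description) > max_description_len:
        problems.append(f"description is longer than {max_description_len} characters")
    return problems
```

Running it over every SKILL.md in a repository catches names that would never route before anyone notices a missing slash command.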

Optional top-level keys such as version, author, and license help hub packaging and audits. The platforms list (macos, linux, windows) is sharper than it looks—when set, Hermes omits the skill entirely on non-matching hosts, which is why a skill that “works on my Mac” can vanish in Linux CI with no error message beyond a shorter skill list.
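The platform filter amounts to a membership test. The following sketch shows the rule as described above rather than the actual Hermes code; `current_platform` and `is_visible` are hypothetical names, and the mapping from `sys.platform` values to the three labels is an assumption.

```python
import sys


def current_platform(sys_platform: str = sys.platform) -> str:
    """Translate Python's sys.platform into the labels used in frontmatter."""
    if sys_platform.startswith("darwin"):
        return "macos"
    if sys_platform.startswith("win"):
        return "windows"
    if sys_platform.startswith("linux"):
        return "linux"
    return sys_platform  # unknown hosts keep their raw value


def is_visible(platforms, host: str) -> bool:
    """A skill with no platforms list is visible everywhere."""
    return not platforms or host in platforms
```

With `platforms: [macos]`, `is_visible(["macos"], "linux")` is false, which matches the silent disappearance on Linux CI described above.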

Hermes-specific knobs live under metadata.hermes: tags, related_skills, and the conditional visibility fields in the next section. required_environment_variables declares secrets that should land in .env and pass into sandboxes; required_credential_files covers OAuth token files and other on-disk credentials that must mount into Docker or Modal; metadata.hermes.config declares non-secret preferences stored under skills.config in config.yaml.

Official docs stress size discipline for a reason. Trim the description to its budget, front-load the procedure, and push historical notes or giant option matrices into references/ so a partial skill_view still gives the agent something actionable.

Below is a minimal SKILL.md you can drop into ~/.hermes/skills/devops/backup-check/SKILL.md (or any category folder) and iterate from there.

---
name: backup-check
description: Verify nightly backup archives exist, are non-empty, and pass a quick checksum spot-check on the latest file.
version: 1.0.0
metadata:
  hermes:
    tags: [devops, backups, shell]
    requires_toolsets: [terminal]
    config:
      - key: backup_check.archive_dir
        description: Absolute path to the directory that holds backup archives
        default: "/var/backups"
        prompt: Backup archive directory (absolute path)
---

# Backup archive spot-check

## When to use

Use when the user asks to confirm backups ran, to audit the latest archive on disk, or to catch empty or stale backup files before a restore drill.

## Quick reference

- Latest archive directory is configured under `skills.config.backup_check.archive_dir` (set via `hermes config migrate` if declared in metadata).
- Default check uses `ls` by mtime and `test -s` for non-empty files.

## Procedure

1. Resolve the archive directory from skill config or ask the user once if unset.
2. List the most recently modified file matching the expected pattern (for example `*.tar.zst`).
3. Confirm the file exists, is non-empty, and record its path and size for the reply.
4. If a checksum file exists beside the archive, verify it with the documented tool (for example `sha256sum -c`).

## Pitfalls

- Empty files can still have a recent mtime if a failed job touched the path; always check size.
- Relative paths break when the terminal cwd is not the backup host; use absolute paths in config.

## Verification

The user should see the latest archive path, byte size, and either a checksum OK line or an explicit note that no `.sha256` sidecar was found.

Progressive disclosure in practice

Progressive disclosure is the difference between a skill library that feels snappy and one that burns thousands of tokens before the first user message. Hermes walks three conceptual steps: a compact catalog (names and short descriptions), the full SKILL.md when the task matches, and—only if needed—a slice of a reference file via skill_view paths. Assume level zero is all the model will read until it explicitly commits; every sentence in the description and the first screen of body text should help routing, not storytelling.

A practical outline that survives partial loads is When to use (triggers in plain language), Quick reference (commands, env vars, file paths), Procedure (ordered steps the agent should not improvise away), Pitfalls (known failure modes), and Verification (what “green” looks like). Narrative history, vendor changelog dumps, and twenty-row option tables belong in references/ with stable headings so the agent can pull a single section.
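One way to keep that outline honest is a small lint step. The sketch below is a hypothetical helper, not part of Hermes; it assumes the five sections appear as second-level markdown headings, as in the backup-check example.

```python
import re

# Section names taken from the outline above.
RECOMMENDED = ["When to use", "Quick reference", "Procedure", "Pitfalls", "Verification"]


def missing_sections(skill_md: str) -> list[str]:
    """Return the recommended sections that have no matching '## ' heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##\s+(.+?)\s*$", skill_md, re.MULTILINE)
    }
    return [name for name in RECOMMENDED if name.lower() not in headings]


draft = "# Backup archive spot-check\n\n## When to use\n...\n\n## Procedure\n1. ...\n"
print(missing_sections(draft))
```

The check is deliberately shallow: it only confirms the headings exist, not that the content under them is useful.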

When a skill activates, Hermes can rewrite ${HERMES_SKILL_DIR} and ${HERMES_SESSION_ID} in the body so shell lines point at the installed folder without hand-built paths. Optional inline shell snippets (!`cmd`) can inject fresh context (current branch, disk free space), but they execute on the host and stay disabled unless skills.inline_shell is on. Treat that flag as a trust boundary for the whole skill source, not a convenience toggle.
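To see what the placeholder rewrite amounts to, the sketch below mimics it with Python's string.Template. This only illustrates the effect; Hermes' own substitution logic may differ, and the function name and sample paths are hypothetical.

```python
from string import Template


def render_skill_body(body: str, skill_dir: str, session_id: str) -> str:
    """Fill the two placeholders and leave any other ${...} text untouched."""
    # safe_substitute keeps unknown placeholders instead of raising KeyError.
    return Template(body).safe_substitute(
        HERMES_SKILL_DIR=skill_dir,
        HERMES_SESSION_ID=session_id,
    )


line = "python3 ${HERMES_SKILL_DIR}/scripts/parse.py --out /tmp/${HERMES_SESSION_ID}.json"
print(render_skill_body(line, "/home/u/.hermes/skills/devops/backup-check", "abc123"))
```

The practical point for authors is that a script path written this way stays valid wherever the skill folder ends up installed.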

Conditional activation and prompt hygiene

Skills can show or hide based on which toolsets or tools exist in the current session. requires_toolsets / requires_tools gate a skill behind capabilities that must be present; fallback_for_toolsets / fallback_for_tools surface a cheaper or local path when a premium integration is absent—the DuckDuckGo fallback when a paid web search API is not configured is the canonical example.

These predicates directly shape prompt noise. An overly strict requires_* rule hides a skill from newcomers who have not finished hermes tools setup yet; an overly loose fallback_for_* rule duplicates half your library whenever someone omits an API key. The useful middle ground is to name real prerequisites, test with hermes chat --toolsets skills, and toggle keys or toolsets on purpose while watching whether the skill list breathes the way you expect.
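The interplay of the two predicate families is easier to reason about as code. This is a simplified sketch under stated assumptions: it covers only the toolset variants, treats a fallback skill as hidden when any toolset it covers is active, and uses a hypothetical function name. Check the official docs for the exact semantics.

```python
def skill_is_listed(meta: dict, active_toolsets: set[str]) -> bool:
    """Decide whether a skill shows up, given the toolsets active in a session."""
    requires = set(meta.get("requires_toolsets", []))
    fallback_for = set(meta.get("fallback_for_toolsets", []))

    # Every required toolset must be present.
    if not requires <= active_toolsets:
        return False
    # Assumption: a fallback skill hides once any toolset it stands in for is active.
    if fallback_for & active_toolsets:
        return False
    return True


needs_terminal = {"requires_toolsets": ["terminal"]}
web_fallback = {"fallback_for_toolsets": ["web"]}
print(skill_is_listed(needs_terminal, {"web"}))       # required toolset missing
print(skill_is_listed(web_fallback, {"terminal"}))    # premium toolset absent
```

Feeding the function the toolset combinations your users actually have is a quick way to spot a skill that never appears or a fallback that always does.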

Secrets, config, and credential files

Secrets should be declared in required_environment_variables. Hermes can prompt when a skill loads in the local CLI, persist values in .env, and pass them into terminal and execute_code sandboxes without streaming the raw secret back into the model transcript. Remote chat surfaces refuse to collect keys inline and instead point people at hermes setup or manual .env edits, so author your skill text to match that behavior (tell users that a key is required, not to paste it into Telegram).

Non-secret preferences—default paths, org names, feature toggles—belong in metadata.hermes.config. Values resolve into skills.config inside config.yaml, show up in hermes config show, and arrive in the skill message as resolved facts so the model does not need to open your config file mid-task.

File-shaped credentials (OAuth token JSON, service account keys) map to required_credential_files. When those files exist, Hermes can bind-mount them into Docker or sync them into Modal jobs; declaring them upfront avoids the classic “script works locally, dies in sandbox” gap.

Supporting scripts and dependencies

The upstream guide pushes authors toward boring dependencies: stdlib Python, curl, and Hermes’ own tools (web_extract, read_file, terminal). That is less about purity than about reproducibility—every extra pip install is another silent failure when the agent runs in a clean container.

When JSON or XML parsing is fiddly, a short script under scripts/ plus a ${HERMES_SKILL_DIR} path beats asking the model to re-derive parsers each run. If you truly need a package, state the install command in Procedure, repeat the failure symptom in Pitfalls, and give a Verification command that fails loudly when the dependency is missing.

Publishing, hub installs, and trust

Community skills move through the Skills Hub and the other discovery paths the user guide lists—official optional skills, GitHub slugs, skills.sh entries, .well-known indexes, and raw SKILL.md URLs. Installs are scanned for obvious exfiltration, injection, and destructive patterns; trust tiers run from builtin through community, and some findings only clear with --force while the worst cases stay blocked entirely.

The SKILL.md file shape is not Hermes-specific; IDE-centric assistants use the same progressive-loading idea with different discovery and triggers. Claude Skills and SKILL.md for Developers: VS Code, JetBrains, Cursor is a useful contrast read—frontmatter discipline and “load only when relevant” carry over, even when the installer and slash-command wiring differ.

Org-wide rollouts usually pair a private tap or shared Git repo with external_dirs for read-only sharing, while keeping the agent-writable copy under each profile when skill_manage is allowed to mutate skills in place.

Troubleshooting and optimization

When a skill misbehaves, walk this checklist before rewriting prose.

Visibility — Confirm platforms, requires_*, and fallback_for_* predicates. A skill that “works on my Mac” but not in Linux CI is often a platform guard.
Name collisions — Duplicate names across local and external directories follow local precedence. Rename or namespace aggressively.
Discovery layout — A misplaced SKILL.md or wrong category folder can drop the skill from indexing entirely.
Token load — If sessions feel slow, shorten level-zero descriptions, move depth into references/, and deduplicate giant tables.
Agent edits — Hermes can create, patch, or delete skills via skill_manage. Treat valuable skills like code: review diffs, export snapshots, and reset bundled skills deliberately when upgrades drift.

A tight regression loop beats rereading the whole file: hermes chat --toolsets skills -q "Use the <skill> workflow to <concrete task>" should show the agent pulling the right disclosure level before it freestyles. If it never invokes skill_view, your When to use text or description probably does not match how people phrase requests.

Official references stay authoritative for behavior changes—the Skills System user guide for runtime semantics, Creating Skills for author-facing rules, the Bundled Skills Catalog for copy-paste examples, and the agentskills.io specification for the shared file format Hermes aligns with.

Hermes Agent CLI cheat sheet — commands, flags, and slash shortcuts

Rost — Mon, 04 May 2026 10:57:09 +0000

Hermes Agent from Nous Research is a model-agnostic, tool-using assistant you run locally or on a VPS.

Hermes does not lock you into one surface. You can use

the classic hermes / hermes chat CLI,
the full-screen hermes --tui session,
a long-running hermes gateway for Telegram, Discord, Slack, and other messaging platforms,
hermes dashboard for a local browser UI when the web extra is installed.

Those paths share the same config and data under ~/.hermes; this page lists shell commands that matter across those modes.

Below is a dense command reference grouped by task.

Install Hermes Agent and first-run CLI commands

For install and troubleshooting, start with Hermes AI Assistant — Install, Setup, Workflow, and Troubleshooting.

The installer pulls the repo, sets up a Python environment, and wires the hermes executable. After source ~/.bashrc or ~/.zshrc, your default entry point for interactive chat is simply hermes (same family as hermes chat).

Command	Description
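`curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash`	Run the upstream installer (pulls the repo, sets up a Python environment, wires the `hermes` executable).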
`hermes` / `hermes chat`	Start interactive chat after install (default daily entry).
`hermes --version` / `hermes version`	Print version information.
`hermes completion bash` | `zsh` | …
`hermes update [--check] [--backup] [--restart-gateway]`	Pull latest code, reinstall deps, optional pre-update home snapshot or gateway restart.
`hermes uninstall [--full] [--yes]`	Remove Hermes; optional full data deletion.

Native Windows is not supported; use WSL2. Android installs via Termux follow a dedicated path in the upstream docs.

Global flags for every `hermes` invocation

These flags apply before subcommands and change which profile, which session, or how much personal config loads.

Flag	Description
`--profile`, `-p`	Select Hermes profile for this run (overrides sticky default from `hermes profile use`).
`--resume`, `-r`	Resume a session by ID or title.
`--continue [name]`, `-c`	Continue the latest session, or latest matching a title.
`--worktree`, `-w`	Start in an isolated Git worktree for parallel agents.
`--yolo`	Bypass dangerous-command approval prompts (use with care).
`--pass-session-id`	Include session ID in the system prompt.
`--ignore-user-config`	Skip `~/.hermes/config.yaml` (defaults only); `.env` still loads.
`--ignore-rules`	Skip auto-injection of AGENTS.md, SOUL.md, `.cursorrules`, memory, preloaded skills.
`--tui`	Launch the TUI (`HERMES_TUI=1` equivalent).
`--dev`	With `--tui`, run TS sources via `tsx` for TUI development.

Isolated automation often pairs hermes chat --ignore-user-config --ignore-rules with hermes -z for reproducible one-shots.

`hermes chat`, one-shot prompts, and `hermes -z`

Command / pattern	Description
`hermes chat`	Interactive or scripted chat; main surface for `-q`, `-m`, `--provider`, toolsets, resume, worktree, checkpoints.
`hermes chat -q "..."`	One-shot prompt (non-interactive); keeps richer output than `-z` when tools run.
`hermes -z "..."`	Scripted one-shot — final answer only on stdout, no banner or session noise. Same agent and tools; best for pipes and scripts.
`hermes chat --quiet`, `-Q`	Quieter programmatic mode (banner and tool previews suppressed).
`-m` / `--model`, `--provider`	Per-run model and provider overrides; env `HERMES_INFERENCE_MODEL` / `HERMES_INFERENCE_PROVIDER` mirror behavior.
`-t` / `--toolsets`	Enable comma-separated toolsets for the run.
`-s` / `--skills`	Preload skills (repeat or comma-separated).
`--image path`	Attach a local image to a single query.
`--checkpoints`	Enable filesystem checkpoints before destructive edits.
`--max-turns N`	Cap tool-calling iterations per turn (default from config).
`--source`	Session source tag (`cli` vs `tool` for integrations).

hermes model outside the session vs /model inside it: running hermes model from the shell is where you add providers, keys, and OAuth. Slash /model only switches among already configured providers. If you only see OpenRouter in /model, exit the session and complete hermes model.

Model picker, credential pools, and fallback providers

Command	Description
`hermes model`	Interactive provider and model picker; keys, OAuth, custom endpoints.
`hermes auth`	Credential pools — `add`, `list`, `remove`, `reset` for rotation-friendly keys and OAuth.
`hermes fallback [list | add | …
`hermes setup [model | tts | …

Deprecated: hermes login / hermes logout. Use hermes auth and hermes model instead.

Picking local OpenAI-compatible endpoints versus hosted APIs for hermes model sits on the same trade-offs as general LLM hosting (latency, cost, ops).

Config files and `hermes config` commands

Configuration resolves as CLI overrides → config.yaml → .env → defaults. API keys belong in .env; structured settings in config.yaml.
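That precedence chain maps naturally onto a layered lookup. The sketch below models it with collections.ChainMap. It is an illustration only: the keys and values are hypothetical, and real Hermes settings do not share one flat namespace across config.yaml and .env.

```python
from collections import ChainMap


def effective_config(cli: dict, yaml_cfg: dict, env: dict, defaults: dict) -> ChainMap:
    """Layer the sources so the first mapping that defines a key wins."""
    return ChainMap(cli, yaml_cfg, env, defaults)


cfg = effective_config(
    cli={"model": "model-from-flag"},
    yaml_cfg={"model": "model-from-yaml", "max_turns": 30},
    env={"max_turns": 10, "provider": "provider-from-env"},
    defaults={"provider": "default-provider", "toolsets": "core"},
)
for key in ("model", "max_turns", "provider", "toolsets"):
    print(key, "=", cfg[key])
```

When a setting seems to be ignored, walking the layers in this order usually shows which source is shadowing it.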

Command	Description
`hermes config show`	Display effective configuration.
`hermes config edit`	Open `config.yaml` in `$EDITOR`.
`hermes config set key value`	Set values (secrets routed to `.env`, non-secrets to YAML).
`hermes config path` / `hermes config env-path`	Print paths to config and env files.
`hermes config check`	Detect missing or stale settings.
`hermes config migrate`	Apply newly introduced options interactively.

Where files live — Everything sits under HERMES_HOME (default ~/.hermes) for config, secrets, memories, skills, sessions, gateway state, and logs.

Session management and `hermes profile`

Command	Description
`hermes sessions list`	List recent sessions.
`hermes sessions browse`	Interactive picker with search and resume.
`hermes sessions export`	Export sessions (e.g. JSONL).
`hermes sessions delete`, `prune`, `rename`, `stats`	Delete one session, prune old ones, rename titles, show store stats.
`hermes profile list` | `use` | …
`hermes profile export` / `import`	Archive or restore a profile tarball.
`hermes profile alias`	Short wrapper scripts for fast profile switching.

Use hermes -p work chat -q "..." for ad hoc runs without changing the sticky default profile.

Skills hub, toolsets, shell hooks, and plugins

For profile-first configuration and skills tuned to real production workflows by role, see Hermes AI Assistant Skills for Real Production Setups.

Command	Description
`hermes tools`	Interactive per-platform tool enablement; `--summary` prints current choices.
`hermes skills browse`, `search`, `inspect`, `install`, `list`, `check`, `update`, `audit`, `uninstall`, `publish`, `snapshot`, `tap`, `config`	Skills hub workflows including registries and URL installs.
`hermes curator status`, `run`, `pause`, `pin`, `rollback`, …	Background skill maintenance and safe rollback.
`hermes hooks list`, `test`, `revoke`, `doctor`	Declared shell hooks and allowlists in config.
`hermes plugins`	Composite UI or subcommands to install, enable, disable, remove plugins.

Built-in memory and `hermes memory` providers

Built-in MEMORY.md / USER.md stay active; external providers add optional recall layers. For how that architecture behaves in practice, read Hermes Agent Memory System — How Persistent AI Memory Actually Works. To compare external backends and activation trade-offs, see Agent Memory Providers Compared — Honcho, Mem0, Hindsight, and Five More.

Command	Description
`hermes memory setup`	Interactive external memory provider configuration.
`hermes memory status`	Show active provider settings.
`hermes memory off`	Disable external provider; built-in files remain.

When a provider is active it may register extra provider-specific top-level subcommands — run hermes --help to see what is wired today.

Messaging gateway, DM pairing, and platforms

Command	Description
`hermes gateway setup`	Interactive messaging platform setup.
`hermes gateway run`	Foreground gateway (recommended on WSL, Docker, Termux).
`hermes gateway start` | `stop` | …
`hermes gateway install` | `uninstall`
`hermes pairing list` | `approve` | …
`hermes whatsapp`	WhatsApp bridge pairing flow.
`hermes slack manifest`	Generate Slack app manifest with gateway slash parity.

On WSL, hermes gateway run inside tmux is the resilient pattern when gateway start misbehaves.

Cron scheduler, webhooks, and Kanban

Command	Description
`hermes cron …`	Create, edit, pause, resume, run, remove scheduled prompts (`tick` for manual scheduler pass).
`hermes webhook subscribe`, `list`, `remove`, `test`	Dynamic webhook routes for event-driven runs.
`hermes kanban …`	Multi-profile task board backed by SQLite; `dispatch` drives workers.

`hermes doctor`, logs, backup, and usage insights

Command	Description
`hermes doctor [--fix]`	Interactive diagnostics and optional auto-repair.
`hermes status [--all] [--deep]`	Concise status; deeper checks when needed.
`hermes dump [--show-keys]`	Paste-friendly setup summary for Discord or GitHub issues.
`hermes debug share`	Upload redacted debug bundle to a paste service (or `--local`).
`hermes logs [agent | errors | …
`hermes backup`, `hermes import`	Zip snapshots of home data and restore paths.
`hermes insights [--days N] [--source …]`	Token, cost, and activity analytics.

When something breaks after an upgrade, hermes doctor, hermes status, and hermes logs errors -f form the fastest triage loop.

MCP, ACP, web dashboard, and OpenClaw migration

Command	Description
`hermes mcp serve`	Run Hermes as an MCP server.
`hermes mcp add`, `remove`, `list`, `test`, `configure`	Manage MCP client connections from Hermes.
`hermes acp`	Agent Client Protocol stdio server for editors (extra install may apply).
`hermes dashboard [--port …] [--host …]`	Local web dashboard (`pip install hermes-agent[web]`).
`hermes claw migrate …`	Migrate OpenClaw-style configs into Hermes (`--dry-run`, presets, optional secrets).

OpenClaw migration — hermes claw migrate reads legacy OpenClaw home directories; for what that stack looked like before moving, see the OpenClaw case study.

Slash commands in the Hermes CLI session

Type / for autocomplete. Commands are case-insensitive; skills register extra /skill-name routes. The tables below are a curated subset; for the full registry see Official Hermes Agent documentation at the end of this article.

Session flow, background tasks, and goals

Command	Description
`/new`, `/reset`	New session ID and history.
`/resume [name]`	Resume a named session.
`/compress [focus]`	Manual context compression with optional focus topic.
`/retry`, `/undo`	Retry last turn or drop last exchange.
`/title …`	Name the session for later `/resume`.
`/background …`, `/queue …`, `/steer …`	Parallel background run, queued next prompt, mid-loop nudge after next tool.
`/goal …`	Persistent multi-turn objective with judge loop (`status`, `pause`, `resume`, `clear`).
`/branch`, `/fork`	Branch the conversation for alternate exploration.

Models, tool toggles, skills, and reload

Command	Description
`/model … [--global]`	Switch models among configured providers; `--global` persists default.
`/tools …`, `/toolsets`	Session tool toggles and toolset listing.
`/skills …`	Search, install, and manage skills from chat.
`/cron …`	Scheduled tasks UI from the CLI session.
`/reload-mcp`	Reload MCP servers from config.
`/reload`	Reload `.env` into the running session without restart.

Usage, help, and quitting

Command	Description
`/usage`, `/insights`	Token and cost visibility; analytics snapshot.
`/help`, `/quit`	Help or exit the CLI.

Messaging apps (Telegram, Discord, Slack, and others) expose an overlapping slash set plus /approve, /restart, /commands, and related gateway-only helpers — platform differences are documented in the slash command reference under Official Hermes Agent documentation below.

Official Hermes Agent documentation

Upstream documentation lives on hermes-agent.nousresearch.com.

Tip. Keep hermes dump and hermes doctor --fix in muscle memory — they turn vague "something broke" reports into actionable diffs against a known-good setup.

MinIO CE in 2026: Retired Upstream, Source-Only, and What to Use

Rost — Mon, 04 May 2026 10:56:52 +0000

MinIO Community Edition is no longer a safe default for new production systems.

As of 2026, the public project status and distribution model changed enough that many teams now treat MinIO CE as end of life for serious workloads.

If you are deciding whether to keep MinIO CE, fork it, or migrate, this guide gives you:

a factual timeline of what changed
the practical risk for operators
a technical comparison of SeaweedFS, Garage, RustFS, and Ceph RGW
a migration plan you can execute in phases

For broader context around storage, databases, and search in production AI stacks, see the Data Infrastructure for AI Systems pillar.

What happened to MinIO CE

The community concern is not one single event. It is the sequence.

Date	What changed	Why it matters
May 2025	Key management features moved out of CE path	Reduced CE parity for auth and admin workflows
Oct 2025	Community Docker images and public binaries stopped	Operators must build and verify from source
Dec 2025	Public maintenance mode messaging became explicit	Fewer expectations for active OSS iteration
Feb 2026	Repository archived for the first time	Read only state blocks normal OSS collaboration
Apr 2026	Repository archived again and stayed locked	Confirms long term frozen upstream posture

The core operational impact is simple: you inherit more supply chain, patching, and maintenance responsibility than most teams expect from a mainstream S3-compatible store.

Is MinIO still open source in 2026

A common question is whether MinIO is still open source at all.

The server code in the public repository is still under AGPLv3.

However, the practical community path changed from normal binary-first consumption to source-first self build.

For many teams, that feels less like a living OSS ecosystem and more like unsupported source availability.

So the accurate answer is nuanced: the license status remains open source, but operationally the community experience is no longer what most platform teams need for low-risk production adoption.

Is MinIO CE safe for new production deployments

For greenfield deployments, usually no, especially when compared with documented options in this MinIO vs Garage vs AWS S3 comparison.

Why the risk profile changed

Patch cadence risk: no stable, trusted community binary channel means every CVE cycle becomes your build and release cycle
Verification burden: your team must own provenance, repeatability, and rollback strategy
Ecosystem drift risk: tooling that assumed public images may lag or break
People risk: senior SRE and security time is consumed by platform plumbing instead of product work

If you already run MinIO CE internally, this does not mean panic shutdown.

It means treat the platform as controlled technical debt and put a migration runway on your roadmap.

Community verdict and market response

Across operator communities in 2025 to 2026, the pattern is consistent:

fewer teams choose MinIO CE for net new deployments
more teams evaluate Garage and SeaweedFS first
enterprise teams with strict S3 semantics often move toward Ceph RGW
RustFS gets attention as a direct successor style option, but with alpha caution

This trend matters because platform safety is partly social: healthy ecosystems reduce integration risk, improve troubleshooting velocity, and widen hiring pools.

Best alternatives to MinIO CE

SeaweedFS

SeaweedFS is a strong option when you care about huge object counts, small file behavior, and practical efficiency in commodity environments.

Choose SeaweedFS when

you need high small-object density
you prefer Apache 2.0 governance and licensing clarity
you want production readiness without the heavy footprint of Ceph

Garage

Garage is attractive for lightweight self-hosted clusters, edge nodes, and geo distributed deployments on modest hardware.

If you want a concrete setup path, use this Garage S3 quickstart to validate replication and operations before migration.

Choose Garage when

resource efficiency matters more than full S3 feature parity
you run mixed ARM or small node environments
you want simple operations over maximal feature surface

RustFS

RustFS is frequently discussed as the closest successor narrative to MinIO style deployment and UX.

Choose RustFS when

you accept alpha-stage software risk
you can test deeply before production
you want to track a fast moving project with potential upside

For regulated or high uptime systems, keep RustFS in pilot until maturity is proven in your own reliability tests.

Ceph RGW

Ceph RGW remains the enterprise heavyweight with broad capability and scale.

Choose Ceph RGW when

you need mature enterprise S3 behavior
your team already has Ceph operational expertise
you can support higher infrastructure and on-call complexity

Which object store is best for your use case

Use this pragmatic filter:

Small team and low ops budget: start with Garage or SeaweedFS
Large enterprise and strict compatibility needs: prefer Ceph RGW
Experimental migration from MinIO-style workflows: pilot RustFS, but keep rollback options

No option is universally best.

The correct target depends on required S3 features, RPO and RTO goals, team maturity, and how much platform ownership you want.

If your team still needs legacy MinIO background before deciding, this MinIO vs AWS S3 overview and this MinIO command cheatsheet help with current-state audits.

Migration plan from MinIO CE

If you are currently on MinIO CE, this phased approach avoids risky big-bang moves.

Phase 1: inventory and risk scoring

list buckets, object counts, and growth rates
classify workloads by criticality and recovery objectives
identify hard S3 dependencies such as versioning, object lock, or policy behavior
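The inventory step can start as a spreadsheet, but a few lines of code make the scoring repeatable. The sketch below is one hypothetical way to do it: the field names, the feature list, and the weights are assumptions to tune for your environment, not a standard formula.

```python
from dataclasses import dataclass, field

# Features that tend to be hard to replicate on another S3-compatible store.
HARD_FEATURES = {"versioning", "object_lock", "bucket_policy"}


@dataclass
class BucketProfile:
    name: str
    object_count: int
    monthly_growth_pct: float
    criticality: int  # 1 (low) to 3 (high), assigned by the owning team
    s3_features: set[str] = field(default_factory=set)


def migration_risk(bucket: BucketProfile) -> int:
    """Higher score means migrate later and test harder (weights are arbitrary)."""
    score = bucket.criticality * 2
    score += len(bucket.s3_features & HARD_FEATURES)
    if bucket.monthly_growth_pct > 10:
        score += 1
    return score


def migration_order(buckets: list[BucketProfile]) -> list[BucketProfile]:
    """Lowest-risk buckets first, which matches a low blast radius pilot."""
    return sorted(buckets, key=migration_risk)


sample = [
    BucketProfile("ledger", 50, 15.0, 3, {"versioning", "object_lock"}),
    BucketProfile("logs", 1000, 2.0, 1),
]
print([b.name for b in migration_order(sample)])
```

The exact numbers matter less than agreeing on them once, so the ordering stops being a matter of opinion in every planning meeting.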

Phase 2: proof of compatibility

stand up one or two candidate platforms
replay representative read and write workloads
verify auth, lifecycle rules, retention behavior, and SDK edge cases

Plan to instrument your pilot from day one with metrics and alerts from the Observability pillar so migration regressions are measurable rather than anecdotal.

Phase 3: pilot cutover

migrate one low blast radius workload first
run dual read validation where possible
measure latency, error rates, and operational overhead
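Dual read validation can be as simple as fetching the same keys from both stores and comparing digests. The sketch below keeps the storage layer abstract: the reader callables are hypothetical stand-ins that you would back with whatever S3 client you use, and plain dicts play that role here.

```python
import hashlib
from typing import Callable, Iterable, Optional

# A reader returns the object's bytes, or None when the key does not exist.
Reader = Callable[[str], Optional[bytes]]


def dual_read_mismatches(
    keys: Iterable[str], read_old: Reader, read_new: Reader
) -> list[tuple[str, str]]:
    """Compare objects from both stores and report (key, reason) for each difference."""
    mismatches = []
    for key in keys:
        old, new = read_old(key), read_new(key)
        if old is None or new is None:
            if old is not new:  # present on exactly one side
                mismatches.append((key, "missing"))
            continue
        if hashlib.sha256(old).digest() != hashlib.sha256(new).digest():
            mismatches.append((key, "content"))
    return mismatches


old_store = {"a": b"1", "b": b"2", "c": b"3"}
new_store = {"a": b"1", "b": b"X"}
print(dual_read_mismatches(["a", "b", "c", "d"], old_store.get, new_store.get))
```

For large objects you would compare stored checksums or stream the hash instead of loading full bodies, but the reporting shape stays the same.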

Phase 4: production migration

migrate high priority internet facing workloads first
keep rollback artifacts and retention windows
document final runbooks before decommissioning MinIO CE paths

Bottom line

MinIO CE may still run, but it is no longer the low-friction default for new production object storage.

Treat current clusters as transition infrastructure, not a long horizon foundation.

For most teams in 2026, the safer direction is:

SeaweedFS or Garage for pragmatic self hosted deployments
Ceph RGW for enterprise scale and mature S3 requirements
RustFS for monitored pilot environments only

Make the migration decision early while you can still choose your timeline instead of reacting to the next forced change.