<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: dengkui yang</title>
    <description>The latest articles on Forem by dengkui yang (@dengkui_yang_fcb5dbe2da32).</description>
    <link>https://forem.com/dengkui_yang_fcb5dbe2da32</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891878%2F09738b27-6499-46ca-a364-4d3336583d7d.png</url>
      <title>Forem: dengkui yang</title>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dengkui_yang_fcb5dbe2da32"/>
    <language>en</language>
    <item>
      <title>A Research Workflow That Starts With Sources, Not Prompts</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Thu, 30 Apr 2026 11:21:14 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/a-research-workflow-that-starts-with-sources-not-prompts-1f79</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/a-research-workflow-that-starts-with-sources-not-prompts-1f79</guid>
      <description>&lt;p&gt;&lt;em&gt;How private AI notebooks turn scattered files, links, notes, and local models into a reusable thinking loop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Based on public materials from&lt;/em&gt; &lt;code&gt;opennotebook.shop&lt;/code&gt; &lt;em&gt;and the open-source&lt;/em&gt; &lt;code&gt;open-notebook&lt;/code&gt; &lt;em&gt;repository reviewed on April 30, 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many AI note-taking tools begin with the same interface: a blank prompt box.&lt;/p&gt;

&lt;p&gt;That is convenient, but it quietly puts the wrong thing at the center. Real research does not start with a prompt. It starts with a pile of material: papers, links, meeting notes, transcripts, PDFs, half-formed thoughts, and questions that become clearer only after you spend time with the sources.&lt;/p&gt;

&lt;p&gt;This makes Open Notebook useful to examine as a workflow idea. The &lt;a href="https://www.opennotebook.shop/" rel="noopener noreferrer"&gt;opennotebook.shop page&lt;/a&gt; presents a simple flow: add files, links, and notes; ask questions; save cited answers; then turn the notebook into audio-style briefings with local or cloud models. The open-source project adds the deeper architecture: self-hosting, multiple model providers, full-text and vector search, context-aware chat, AI-assisted notes, podcasts, REST API access, and local model options such as Ollama.&lt;/p&gt;

&lt;p&gt;The useful question is not whether an AI notebook can answer a question.&lt;/p&gt;

&lt;p&gt;The useful question is whether it can help a person keep thinking after the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scenario: Turning Raw Material Into a Briefing
&lt;/h2&gt;

&lt;p&gt;Imagine a small team preparing for a product strategy review.&lt;/p&gt;

&lt;p&gt;They have customer interview notes, a few internal memos, a competitor page, a product analytics export, and a recording transcript from last week's meeting. None of these is enough on its own. Together, they contain a direction, but only if someone can collect them, ask better questions, preserve evidence, and turn the result into something reusable.&lt;/p&gt;

&lt;p&gt;The common AI shortcut is to paste everything into a chatbot and ask for a summary.&lt;/p&gt;

&lt;p&gt;That works once. Then the problems begin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which source supported the summary?&lt;/li&gt;
&lt;li&gt;Which parts came from private notes versus public links?&lt;/li&gt;
&lt;li&gt;What should be sent to a cloud model, and what should stay local?&lt;/li&gt;
&lt;li&gt;Where does the useful answer go after the chat ends?&lt;/li&gt;
&lt;li&gt;Can the team turn the result into a note, a briefing, or a follow-up research plan?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the notebook metaphor becomes more than UI. A notebook is not just where answers appear. It is where the research state accumulates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start by Protecting the Difference Between Sources and Notes
&lt;/h2&gt;

&lt;p&gt;A good research workflow begins by refusing to collapse everything into "content."&lt;/p&gt;

&lt;p&gt;Sources and notes are not the same thing.&lt;/p&gt;

&lt;p&gt;Sources are evidence. They are the imported material: files, links, transcripts, videos, audio, pasted text. They should remain stable and referenceable because they are the ground from which later claims are made.&lt;/p&gt;

&lt;p&gt;Notes are thinking. They are summaries, extracted insights, saved answers, manual observations, and decisions made after interacting with sources. Notes should be editable because understanding changes.&lt;/p&gt;

&lt;p&gt;Open Notebook's own mental model follows this split: notebooks contain sources and notes. Sources are processed, indexed, and searchable. Notes are the evolving layer of insight.&lt;/p&gt;

&lt;p&gt;This distinction matters more than it first appears. If a system lets generated summaries blur into source material, the notebook becomes harder to trust. If a system preserves source identity, then every later output can be inspected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This answer came from these sources.&lt;/li&gt;
&lt;li&gt;This note was created from that interaction.&lt;/li&gt;
&lt;li&gt;This briefing reused these materials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ontology terms, a source and a note exist differently inside the workspace because they support different interactions. A source supports verification. A note supports adaptation. Confusing the two weakens both.&lt;/p&gt;
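&lt;p&gt;A minimal data-model sketch can make that split concrete. This is purely illustrative (the class and field names are hypothetical, not Open Notebook's actual schema): sources are frozen so they stay referenceable as evidence, while notes are editable and cite the sources they depend on.&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Source:
    """Imported evidence: stable and referenceable."""
    id: str
    kind: str      # e.g. "pdf", "link", "transcript"
    content: str   # kept unchanged so later claims can be verified

@dataclass
class Note:
    """Evolving thinking: editable, with a trail back to evidence."""
    id: str
    text: str
    cites: list = field(default_factory=list)  # ids of supporting Sources

interview = Source(id="src-1", kind="transcript", content="...interview text...")
note = Note(id="note-1", text="Churn risk concentrated in onboarding.",
            cites=[interview.id])

# Understanding changes, so the note may be revised; the source cannot be,
# which is what keeps the citation meaningful.
note.text += " Revised after second read."
```

&lt;p&gt;The asymmetry is the point: mutating a &lt;code&gt;Source&lt;/code&gt; raises an error, while a &lt;code&gt;Note&lt;/code&gt; invites it.&lt;/p&gt;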

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9i26haepwmor2xm94z4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9i26haepwmor2xm94z4.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure: a private notebook is a transformation loop. Sources remain evidence; notes, answers, and audio become reusable layers of understanding.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Real Control Is Context
&lt;/h2&gt;

&lt;p&gt;Once sources are collected, the next decision is not "which prompt should I write?"&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;What should the model be allowed to see?&lt;/p&gt;

&lt;p&gt;This is the most underrated part of AI notebook design. Context is not just a technical limit. It is a privacy, cost, and reasoning boundary.&lt;/p&gt;

&lt;p&gt;Open Notebook's docs describe context levels such as full content, summary only, or not in context. That seems like a small control, but it changes the whole workflow.&lt;/p&gt;

&lt;p&gt;For the strategy review scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public competitor pages can go into full context.&lt;/li&gt;
&lt;li&gt;Customer interviews might be summarized before model use.&lt;/li&gt;
&lt;li&gt;Sensitive internal notes might stay out of cloud context entirely.&lt;/li&gt;
&lt;li&gt;A local model can be used for a first pass on private material.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where a notebook becomes a cognitive tool rather than a chatbot. It gives the researcher a way to decide what participates in the current act of reasoning.&lt;/p&gt;

&lt;p&gt;The practical ontology idea is simple: boundaries are part of the object. A source shared in full is not operationally the same as a source represented only by a summary. A private note excluded from context is not participating in the same interaction network as a public article. Good AI tooling should let users express those differences.&lt;/p&gt;
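&lt;p&gt;The context-level idea can be sketched in a few lines. This is a hypothetical illustration of the "full content / summary only / not in context" levels, not Open Notebook's actual implementation; the key property is that exclusion is the default.&lt;/p&gt;

```python
# Per-source context levels: what each source is allowed to contribute.
FULL, SUMMARY, EXCLUDED = "full", "summary", "excluded"

def build_context(sources, levels):
    """Assemble only the material each source is permitted to expose."""
    parts = []
    for src in sources:
        level = levels.get(src["id"], EXCLUDED)  # unlisted sources stay out
        if level == FULL:
            parts.append(src["content"])
        elif level == SUMMARY:
            parts.append(src["summary"])
        # EXCLUDED sources never reach the model at all
    return "\n\n".join(parts)

sources = [
    {"id": "competitor-page", "content": "Full public text...",
     "summary": "Public summary"},
    {"id": "interviews", "content": "Raw quotes...",
     "summary": "Themes: onboarding friction"},
    {"id": "internal-memo", "content": "Sensitive details...",
     "summary": "Sensitive summary"},
]
levels = {"competitor-page": FULL, "interviews": SUMMARY}  # memo: excluded

context = build_context(sources, levels)
```

&lt;p&gt;The competitor page arrives in full, the interviews arrive only as themes, and the internal memo never participates in the reasoning at all.&lt;/p&gt;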




&lt;h2&gt;
  
  
  Chat Is for Exploration, Ask Is for Discovery, Transformations Are for Reuse
&lt;/h2&gt;

&lt;p&gt;One reason prompt-first tools feel shallow is that every task becomes the same interaction.&lt;/p&gt;

&lt;p&gt;Research has more shapes than that.&lt;/p&gt;

&lt;p&gt;Sometimes the team wants a conversation:&lt;/p&gt;

&lt;p&gt;"Compare these two customer interviews. What tension do you see?"&lt;/p&gt;

&lt;p&gt;That is Chat. The user chooses the context, asks follow-up questions, and steers the reasoning.&lt;/p&gt;

&lt;p&gt;Sometimes the team wants discovery:&lt;/p&gt;

&lt;p&gt;"Across all sources, what are the strongest arguments for delaying the launch?"&lt;/p&gt;

&lt;p&gt;That is Ask. Retrieval matters because the user may not know where the relevant evidence lives.&lt;/p&gt;

&lt;p&gt;Sometimes the team wants repeatability:&lt;/p&gt;

&lt;p&gt;"For each interview, extract pain points, buying triggers, objections, and quoted phrases."&lt;/p&gt;

&lt;p&gt;That is a transformation. The goal is not a conversation but a consistent note structure that can be compared later.&lt;/p&gt;

&lt;p&gt;Open Notebook separates these modes, and that separation is healthy. Chat, Ask, and Transformations are not three labels for the same thing. They are three ways of working with knowledge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chat keeps the thinking fluid.&lt;/li&gt;
&lt;li&gt;Ask finds relevant material across the notebook.&lt;/li&gt;
&lt;li&gt;Transformations turn raw sources into structured notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good workflow uses all three. The team might transform every interview into a consistent format, use Ask to find cross-source patterns, and then use Chat to reason through tradeoffs before saving the final answer as a note.&lt;/p&gt;
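&lt;p&gt;The transformation mode is the easiest to sketch, because its defining property is repeatability: the same extraction applied to every interview, yielding notes with an identical shape. The function and field names below are hypothetical, and the model call is a stub.&lt;/p&gt;

```python
# A transformation is a fixed mapping from source text to a structured note,
# so results from different interviews stay directly comparable.
EXTRACTION_FIELDS = ["pain_points", "buying_triggers", "objections", "quotes"]

def fake_model(prompt):
    # Stand-in for a real LLM call: returns one entry per requested field.
    return {f: f"extracted {f}" for f in EXTRACTION_FIELDS}

def transform_interview(interview_text, model=fake_model):
    prompt = ("For this interview, extract "
              + ", ".join(EXTRACTION_FIELDS) + ":\n" + interview_text)
    result = model(prompt)
    # Every note comes back in the same shape, which is what makes
    # later cross-interview comparison possible.
    return {f: result[f] for f in EXTRACTION_FIELDS}

notes = [transform_interview(text) for text in ["interview A ...", "interview B ..."]]
```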

&lt;p&gt;That is much closer to how thinking actually happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Notebook Should Remember the Work
&lt;/h2&gt;

&lt;p&gt;The moment an AI answer becomes useful, it should not disappear into chat history.&lt;/p&gt;

&lt;p&gt;It should become part of the notebook.&lt;/p&gt;

&lt;p&gt;This is where saved answers and AI-assisted notes matter. A research workspace needs a way to turn a transient interaction into durable knowledge. Otherwise, the team repeats the same questions and loses the path from evidence to decision.&lt;/p&gt;

&lt;p&gt;In a good notebook workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raw sources stay available for verification&lt;/li&gt;
&lt;li&gt;generated answers can be saved as notes&lt;/li&gt;
&lt;li&gt;manual notes can correct or extend AI output&lt;/li&gt;
&lt;li&gt;notes can become searchable material for later work&lt;/li&gt;
&lt;li&gt;citations keep processed claims connected to evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not merely organization. It is internal adjustment.&lt;/p&gt;

&lt;p&gt;In existence-oriented language, a system survives and develops by acting outward and adjusting inward. For a research notebook, outward action means importing sources, asking questions, generating answers, and producing briefings. Inward adjustment means saving notes, changing context, revising interpretation, and keeping the workspace ready for the next question.&lt;/p&gt;

&lt;p&gt;That is the difference between using AI once and building understanding over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Audio Briefings Are Not a Gimmick if the Source Trail Survives
&lt;/h2&gt;

&lt;p&gt;Podcast-style output can look like a flashy feature, but in a research workflow it solves a real problem.&lt;/p&gt;

&lt;p&gt;Not every stakeholder will read the full notebook. Not every teammate has time to inspect every source. Sometimes the useful output is a short audio-style briefing that turns a messy pile of material into a listenable synthesis.&lt;/p&gt;

&lt;p&gt;Open Notebook's open-source materials describe podcast generation as a higher-level transformation: sources and notes become an outline, dialogue, text-to-speech output, and finally an audio file. The important part is not just the audio. It is the path from evidence to briefing.&lt;/p&gt;

&lt;p&gt;If the briefing is detached from the notebook, it becomes just another generated artifact. If it remains connected to sources and notes, it becomes a new consumption layer for the same research state.&lt;/p&gt;

&lt;p&gt;That matters because knowledge work is not only about producing text. It is about changing form without losing traceability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source to answer&lt;/li&gt;
&lt;li&gt;answer to note&lt;/li&gt;
&lt;li&gt;note to briefing&lt;/li&gt;
&lt;li&gt;briefing back to follow-up questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Better-designed AI notebook systems should not only generate outputs. They should preserve the continuity between outputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local and Self-Hosted Options Change the Workflow
&lt;/h2&gt;

&lt;p&gt;Model choice changes behavior.&lt;/p&gt;

&lt;p&gt;If a team has to send everything to a single cloud model, it may over-share or avoid using AI for sensitive work. If the same notebook can use local models for privacy-sensitive passes and cloud models for less sensitive synthesis, the workflow becomes more flexible.&lt;/p&gt;

&lt;p&gt;Open Notebook's support for multiple providers and local options such as Ollama is valuable for this reason. It lets model selection become part of the work, not a hidden infrastructure detail.&lt;/p&gt;

&lt;p&gt;For the strategy review example, the team might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a local model to summarize sensitive notes&lt;/li&gt;
&lt;li&gt;a cloud model to polish a non-sensitive stakeholder briefing&lt;/li&gt;
&lt;li&gt;embeddings and search to find relevant source chunks&lt;/li&gt;
&lt;li&gt;a self-hosted deployment to keep the notebook near private data&lt;/li&gt;
&lt;/ul&gt;
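&lt;p&gt;That routing decision can be expressed as a tiny policy function. Provider and model names here are illustrative assumptions, not Open Notebook configuration; the point is that sensitivity, not convenience, selects the model.&lt;/p&gt;

```python
# Pick a model by data sensitivity: private material stays on-machine.
MODELS = {
    "local": {"provider": "ollama", "name": "llama3"},
    "cloud": {"provider": "openai", "name": "gpt-4o-mini"},
}

def pick_model(task):
    # Sensitive material never leaves the machine.
    return MODELS["local"] if task["sensitive"] else MODELS["cloud"]

tasks = [
    {"job": "summarize customer interviews", "sensitive": True},
    {"job": "polish stakeholder briefing",   "sensitive": False},
]
routes = {t["job"]: pick_model(t)["provider"] for t in tasks}
```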

&lt;p&gt;Self-hosting is not free. It brings setup, updates, credentials, backups, and security responsibility. But it also gives a team more control over where research lives and how models interact with it.&lt;/p&gt;

&lt;p&gt;The point is not that every team must self-host.&lt;/p&gt;

&lt;p&gt;The point is that serious knowledge workflows need visible tradeoffs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Style of Notebook Is Really For
&lt;/h2&gt;

&lt;p&gt;Open Notebook is best understood as a tool for people who do not only want answers.&lt;/p&gt;

&lt;p&gt;They want a controlled path from source material to reusable understanding.&lt;/p&gt;

&lt;p&gt;That makes it relevant for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;researchers collecting papers, transcripts, and notes&lt;/li&gt;
&lt;li&gt;product teams preparing decisions from mixed evidence&lt;/li&gt;
&lt;li&gt;students building long-term understanding instead of one-off summaries&lt;/li&gt;
&lt;li&gt;consultants turning interviews and documents into client-ready briefings&lt;/li&gt;
&lt;li&gt;teams that need local or self-hosted workflows for sensitive context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is broader than any one product:&lt;/p&gt;

&lt;p&gt;Start with sources. Decide context. Ask and explore. Save notes. Transform knowledge into the format the next person can use.&lt;/p&gt;

&lt;p&gt;That is the workflow an AI notebook should support.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;A useful AI notebook is not simply the one that produces the smoothest answer to the first question.&lt;/p&gt;

&lt;p&gt;It is the one that helps a person keep control of the research process after that answer appears.&lt;/p&gt;

&lt;p&gt;Open Notebook points toward that direction: sources remain evidence, notes become evolving understanding, context control defines what the model can touch, and outputs can become briefings without losing their relationship to the notebook.&lt;/p&gt;

&lt;p&gt;That is why the product is more interesting as a cognitive workflow than as a chat interface.&lt;/p&gt;

&lt;p&gt;It starts with sources, not prompts.&lt;/p&gt;

&lt;p&gt;And when it works, it helps research become something you can return to, revise, and reuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;opennotebook.shop page: &lt;a href="https://www.opennotebook.shop/" rel="noopener noreferrer"&gt;https://www.opennotebook.shop/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Open Notebook open-source repository: &lt;a href="https://github.com/lfnovo/open-notebook" rel="noopener noreferrer"&gt;https://github.com/lfnovo/open-notebook&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Originally published on Medium: &lt;a href="https://medium.com/@li3169086779/a-research-workflow-that-starts-with-sources-not-prompts-134f86e53e5a" rel="noopener noreferrer"&gt;https://medium.com/@li3169086779/a-research-workflow-that-starts-with-sources-not-prompts-134f86e53e5a&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Why File-to-Markdown Conversion Is Becoming an AI Input Layer</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Thu, 30 Apr 2026 05:18:37 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/why-file-to-markdown-conversion-is-becoming-an-ai-input-layer-4hn0</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/why-file-to-markdown-conversion-is-becoming-an-ai-input-layer-4hn0</guid>
      <description>&lt;p&gt;&lt;em&gt;Based on public materials from&lt;/em&gt; &lt;code&gt;markitdown.store&lt;/code&gt; &lt;em&gt;and Microsoft's open-source&lt;/em&gt; &lt;code&gt;markitdown&lt;/code&gt; &lt;em&gt;repository reviewed on April 29, 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most teams first meet document conversion as a utility problem.&lt;/p&gt;

&lt;p&gt;Take a PDF, a Word file, a spreadsheet, maybe a webpage, and turn it into text so an LLM can read it.&lt;/p&gt;

&lt;p&gt;That framing is understandable, but it is too small.&lt;/p&gt;

&lt;p&gt;Once you build retrieval, agents, or any serious document workflow, conversion stops being a side utility and starts becoming part of your system architecture.&lt;/p&gt;

&lt;p&gt;That is why &lt;a href="https://github.com/microsoft/markitdown" rel="noopener noreferrer"&gt;MarkItDown&lt;/a&gt; is interesting, and why the browser experience at &lt;a href="https://www.markitdown.store/" rel="noopener noreferrer"&gt;markitdown.store&lt;/a&gt; is worth paying attention to.&lt;/p&gt;

&lt;p&gt;The upstream open-source project from Microsoft is a lightweight Python utility for converting many file types into Markdown for LLM and text-analysis pipelines. The website makes that idea visible in a more inspectable way: the homepage presents upload, text, and URL inputs, shows a reviewable Markdown output panel, and explicitly frames the result as something you should inspect before using in retrieval or agent workflows.&lt;/p&gt;
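&lt;p&gt;The upstream library's basic Python usage is short; the sketch below follows the public README (it requires the package to be installed, e.g. via &lt;code&gt;pip install 'markitdown[all]'&lt;/code&gt;, and &lt;code&gt;report.pdf&lt;/code&gt; is a placeholder file; exact parameters and result attributes may differ across versions).&lt;/p&gt;

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)  # plugins stay opt-in
result = md.convert("report.pdf")      # any supported input format
print(result.text_content)             # Markdown string for LLM pipelines
```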

&lt;p&gt;That combination points to a bigger engineering idea:&lt;/p&gt;

&lt;p&gt;Markdown is not just an output format here.&lt;/p&gt;

&lt;p&gt;It is an input layer for AI systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;MarkItDown matters because it treats messy source files as something that should be normalized into a stable, reviewable, token-efficient working surface before deeper AI processing begins. The technical lesson is not "convert everything to plain text." The lesson is to preserve enough structure for downstream reasoning, while keeping clear trust boundaries around how files are fetched, parsed, and routed.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Real Job Is Normalization, Not Conversion
&lt;/h2&gt;

&lt;p&gt;If you only describe the task as "document conversion," you miss the real systems problem.&lt;/p&gt;

&lt;p&gt;The real problem is this:&lt;/p&gt;

&lt;p&gt;How do you turn heterogeneous files into something an LLM, a retriever, and a human reviewer can all reason about without each downstream component reinventing its own parser?&lt;/p&gt;

&lt;p&gt;That is a normalization problem.&lt;/p&gt;

&lt;p&gt;In a practical ontology sense, a document is not just a named file. It reveals itself through the interactions it supports. A spreadsheet invites table reasoning. A webpage carries links and hierarchy. A PDF may contain layout clues, embedded images, or scanned pages. If you flatten everything too aggressively, you lose the very evidence downstream tools need.&lt;/p&gt;

&lt;p&gt;What makes MarkItDown useful is that it does not aim at perfect visual reproduction. It aims at a stable intermediate representation that still carries enough structure to matter.&lt;/p&gt;


&lt;p&gt;&lt;em&gt;Figure: the important move is not just extraction, but converging mixed sources onto one working surface that humans and LLM systems can both inspect.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where the site demo is helpful. It does not present conversion as a magical black box. It presents source choices, a visible Markdown result, and workflow toggles such as table output, source note, and local preview. That is exactly how an input layer should behave: not only transforming data, but exposing enough of the transformation for humans to verify it.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why Markdown Is a Strong Intermediate Format
&lt;/h2&gt;

&lt;p&gt;The MarkItDown README gives the clearest argument for the format choice.&lt;/p&gt;

&lt;p&gt;Its core claim is simple: Markdown stays close to plain text, but still preserves document structure such as headings, lists, tables, and links. The README also notes that mainstream LLMs are very comfortable with Markdown and that the format is relatively token-efficient.&lt;/p&gt;

&lt;p&gt;That is a stronger point than it first appears.&lt;/p&gt;

&lt;p&gt;A good intermediate format for AI needs at least four properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It should be legible to humans during review.&lt;/li&gt;
&lt;li&gt;It should preserve enough structure for retrieval and reasoning.&lt;/li&gt;
&lt;li&gt;It should avoid carrying unnecessary visual noise.&lt;/li&gt;
&lt;li&gt;It should move cheaply through prompts, indexes, and tool chains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Markdown hits a practical balance.&lt;/p&gt;

&lt;p&gt;It is not a truth format. It does not preserve every layout detail. It is not the right choice for pixel-faithful publishing. But for review, chunking, citations, agent context, and post-processing, it is often a much better surface than raw OCR text or opaque binary formats.&lt;/p&gt;

&lt;p&gt;This is where the existence-oriented lens becomes useful without becoming abstract. Naming is not reality. A file called &lt;code&gt;report.pdf&lt;/code&gt; is not useful because we named it that way. It becomes useful when a system can interact with its actual content and recover a structure that supports later decisions.&lt;/p&gt;

&lt;p&gt;Markdown is valuable because it turns that recovered structure into something operational.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Coverage Matters, but Routing Matters More
&lt;/h2&gt;

&lt;p&gt;One reason MarkItDown has become popular is simple format breadth.&lt;/p&gt;

&lt;p&gt;According to the public repository, it currently supports conversion from PDF, PowerPoint, Word, Excel, images, audio, HTML, text-based formats such as CSV, JSON, and XML, ZIP archives, YouTube URLs, EPubs, and more. It also exposes optional dependency groups instead of forcing every installation to carry every parser.&lt;/p&gt;

&lt;p&gt;That design choice matters.&lt;/p&gt;

&lt;p&gt;In production, format support should not be monolithic. Different environments have different cost, security, and dependency constraints. A local notebook, a browser-assisted workflow, a server-side API, and an internal batch pipeline do not all want the exact same surface area.&lt;/p&gt;

&lt;p&gt;The README's plugin model reinforces this idea. Plugins are disabled by default, can be listed explicitly, and can extend conversion behavior such as OCR. That is a healthy signal. It treats conversion not as one magic parser, but as a policy surface that teams can widen carefully.&lt;/p&gt;

&lt;p&gt;The deeper lesson is this:&lt;/p&gt;

&lt;p&gt;format coverage is useful, but routing discipline is what makes coverage trustworthy.&lt;/p&gt;

&lt;p&gt;If every input takes the same path, you often end up with a system that is either too permissive or too brittle. Stronger systems separate lightweight paths from heavier ones, and trusted inputs from untrusted ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Trust Boundaries Are Part of the Design
&lt;/h2&gt;

&lt;p&gt;This is the part I find most important.&lt;/p&gt;

&lt;p&gt;MarkItDown's public README includes a direct security warning: it performs I/O with the privileges of the current process, so inputs should be sanitized and callers should prefer the narrowest &lt;code&gt;convert_*&lt;/code&gt; method that fits the job, such as &lt;code&gt;convert_stream()&lt;/code&gt; or &lt;code&gt;convert_local()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That warning should not be treated as boilerplate.&lt;/p&gt;

&lt;p&gt;It is a statement about architecture.&lt;/p&gt;

&lt;p&gt;A file conversion layer is not neutral. The moment it can open files, fetch URIs, or load parser dependencies, it becomes part of your trust boundary.&lt;/p&gt;
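&lt;p&gt;As a sketch of that advice, conversion can be fed an in-memory stream that the caller's own I/O layer has already fetched and validated, instead of letting the converter open paths or URIs itself. This assumes the package is installed, the file is a placeholder, and the exact &lt;code&gt;convert_stream&lt;/code&gt; signature (for example, whether a format hint is required) depends on the installed version.&lt;/p&gt;

```python
import io
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)

# The caller controls the I/O; the converter only sees bytes.
with open("untrusted-upload.docx", "rb") as f:
    data = f.read()

result = md.convert_stream(io.BytesIO(data))  # narrowest method for the job
print(result.text_content)
```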

&lt;p&gt;The homepage of markitdown.store makes a similar idea visible at the product level. The demo distinguishes between lightweight text and URL paths on one side, and heavier formats such as PDF, Office files, images, audio, ZIP, and EPub on the other. It also notes that those heavier formats are routed to a hosted worker manifest, while the output panel reminds users to review the result before using it in production retrieval or agent workflows.&lt;/p&gt;

&lt;p&gt;That is a good design instinct.&lt;/p&gt;

&lt;p&gt;In ontology terms, boundaries are part of what a thing is. A local text input is not operationally the same object as an untrusted remote document. A CSV pasted into a textbox is not the same risk surface as a complex attachment that may trigger multiple external dependencies. If you erase those differences, the system becomes harder to reason about and easier to misuse.&lt;/p&gt;


&lt;p&gt;&lt;em&gt;Figure: trustworthy conversion layers separate upload, isolation, routing, and downstream AI use instead of collapsing everything into one permissive parser path.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is also why review matters. A conversion pipeline should not act like every generated Markdown file is automatically fit for retrieval, summarization, or action. Reviewable output is not a cosmetic UI feature. It is part of the safety model.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. CLI, Python, Docker, and MCP Are Architecture Choices
&lt;/h2&gt;

&lt;p&gt;The project is also notable for how many entry points it exposes.&lt;/p&gt;

&lt;p&gt;The public materials show a command-line tool, a Python API, Docker usage, optional Azure Document Intelligence support, plugin hooks, and now an MCP server for integration with LLM applications.&lt;/p&gt;

&lt;p&gt;It is tempting to treat that as a feature checklist.&lt;/p&gt;

&lt;p&gt;I think the more useful way to read it is architectural:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLI fits batch conversion and shell workflows.&lt;/li&gt;
&lt;li&gt;Python fits ingestion services and custom pipelines.&lt;/li&gt;
&lt;li&gt;Docker fits repeatable execution boundaries.&lt;/li&gt;
&lt;li&gt;MCP fits agent ecosystems that want document conversion as a tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes MarkItDown more than a parser. It becomes a reusable normalization layer that can sit behind a browser UI, a backend worker, a local script, or an agent runtime without changing the core idea of the output.&lt;/p&gt;
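&lt;p&gt;For illustration, two of those entry points look roughly like this, paraphrased from the public repository (image tag and file paths are placeholders):&lt;/p&gt;

```shell
# CLI: batch conversion in shell workflows
markitdown path-to-file.pdf -o document.md

# Docker: a repeatable execution boundary around the same conversion
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < path-to-file.pdf > document.md
```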

&lt;p&gt;For teams building document-aware AI systems, that consistency matters. You do not want four different conversion philosophies just because you have four different application surfaces.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Practical Checklist
&lt;/h2&gt;

&lt;p&gt;If you are designing a similar system, these are the questions I would ask first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are we preserving structure, or only extracting raw text?&lt;/li&gt;
&lt;li&gt;Can a human inspect the Markdown before it enters retrieval or agent workflows?&lt;/li&gt;
&lt;li&gt;Do low-risk and high-risk inputs take the same execution path?&lt;/li&gt;
&lt;li&gt;Are we using the narrowest conversion API that matches the actual trust boundary?&lt;/li&gt;
&lt;li&gt;Do we preserve enough provenance, notes, or source hints to debug downstream errors?&lt;/li&gt;
&lt;li&gt;Are plugins and optional dependencies treated as deliberate policy choices instead of default sprawl?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those questions are answered well, document conversion starts behaving like infrastructure instead of a hidden source of errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;MarkItDown is not interesting because it converts files. Many tools can do that.&lt;/p&gt;

&lt;p&gt;It is interesting because it treats Markdown as a stable intermediate surface between messy source documents and downstream AI systems. The open-source project gives that idea a practical engine. The browser experience at markitdown.store makes the workflow easier to inspect. Together, they point toward a useful engineering pattern:&lt;/p&gt;

&lt;p&gt;normalize early, preserve meaningful structure, separate trust boundaries, and make the output reviewable before automation builds on top of it.&lt;/p&gt;

&lt;p&gt;That is a much stronger design than "just get me some text."&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MarkItDown website: &lt;a href="https://www.markitdown.store/" rel="noopener noreferrer"&gt;https://www.markitdown.store/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Microsoft MarkItDown repository: &lt;a href="https://github.com/microsoft/markitdown" rel="noopener noreferrer"&gt;https://github.com/microsoft/markitdown&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MarkItDown MCP registry page: &lt;a href="https://github.com/mcp/microsoft/markitdown" rel="noopener noreferrer"&gt;https://github.com/mcp/microsoft/markitdown&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>markitdown</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Chat-with-Docs Breaks in Real Companies: An Engineering Look at Onyx</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Thu, 30 Apr 2026 04:21:38 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/why-chat-with-docs-breaks-in-real-companies-an-engineering-look-at-onyx-2861</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/why-chat-with-docs-breaks-in-real-companies-an-engineering-look-at-onyx-2861</guid>
      <description>&lt;p&gt;&lt;em&gt;Based on the onyx.guru page and the Onyx open-source repository reviewed on April 29, 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most internal AI projects begin with a reasonable demo: connect a folder of documents, add retrieval, ask a question, get an answer with citations.&lt;/p&gt;

&lt;p&gt;Then the system meets a real company.&lt;/p&gt;

&lt;p&gt;The docs are scattered across Google Drive, Notion, GitHub, Slack, support tickets, policy pages, and user uploads. Some pages are stale. Some are private. Some are deleted upstream but still cached somewhere. Some are visible to one team but not another. Some answers require a fresh web lookup or a tool call, not just a paragraph from an old document.&lt;/p&gt;

&lt;p&gt;This is where "chat with your docs" starts to break.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://onyx.guru/" rel="noopener noreferrer"&gt;onyx.guru page&lt;/a&gt; is interesting because it frames Onyx less like a chatbot and more like a permission-aware knowledge layer. Its public materials emphasize connectors, source permissions, freshness, citations, search, agents, actions, and cloud or self-hosted deployment. That makes it a useful case study for a broader engineering question:&lt;/p&gt;

&lt;p&gt;What does it actually take to build a private AI knowledge system that can be trusted in production?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Failure Mode: Retrieval Without Reality
&lt;/h2&gt;

&lt;p&gt;The simplest RAG architecture is easy to describe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load documents into a vector database.&lt;/li&gt;
&lt;li&gt;Retrieve similar chunks for a user query.&lt;/li&gt;
&lt;li&gt;Put those chunks into an LLM prompt.&lt;/li&gt;
&lt;li&gt;Ask the model to answer with citations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That can work for a static knowledge base. It is not enough for an enterprise workspace.&lt;/p&gt;
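&lt;p&gt;The four steps above can be sketched as one naive loop. The interfaces below (&lt;code&gt;embed&lt;/code&gt;, &lt;code&gt;vectorStore&lt;/code&gt;, &lt;code&gt;llm&lt;/code&gt;) are illustrative placeholders, not a specific vendor SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface Chunk { id: string; text: string; source: string; }

// Naive RAG: embed the query, retrieve similar chunks,
// stuff them into a prompt, and ask for a cited answer.
// Note what is missing: permissions, freshness, deletion handling.
async function naiveRagAnswer(
  query: string,
  embed: (text: string) =&gt; Promise&lt;number[]&gt;,
  vectorStore: { search(v: number[], k: number): Promise&lt;Chunk[]&gt; },
  llm: { complete(prompt: string): Promise&lt;string&gt; },
): Promise&lt;string&gt; {
  const chunks = await vectorStore.search(await embed(query), 5);
  const context = chunks.map((c) =&gt; `[${c.source}] ${c.text}`).join("\n");
  return llm.complete(`Answer with citations, using only:\n${context}\n\nQ: ${query}`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything the rest of this article discusses is about what this sketch leaves out.&lt;/p&gt;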

&lt;p&gt;In real environments, trust fails in more mundane ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user sees an answer based on a document they should not have access to.&lt;/li&gt;
&lt;li&gt;A policy was updated last week, but the old version still ranks first.&lt;/li&gt;
&lt;li&gt;A deleted document remains embedded and continues to influence answers.&lt;/li&gt;
&lt;li&gt;A support ticket, a code comment, and a runbook all describe the same incident differently.&lt;/li&gt;
&lt;li&gt;A user asks for an operational next step, but the assistant can only summarize.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these failures are solved by simply choosing a larger model. They are systems problems. The model is only the final voice of a chain that includes source ingestion, permission mapping, indexing, retrieval, ranking, citation, tool use, and deployment boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4butlwuzppa03y8ec58w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4butlwuzppa03y8ec58w.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure: trustworthy private knowledge is a continuous loop, not a one-time model call.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering Requirement 1: Connect the Sources Before Optimizing the Prompt
&lt;/h2&gt;

&lt;p&gt;A private AI system cannot answer from knowledge it never truly ingested.&lt;/p&gt;

&lt;p&gt;That sounds obvious, but it is where many systems become fragile. Enterprise knowledge does not live in one database. It lives in documents, tickets, repositories, chat threads, policies, dashboards, and files uploaded by users. A useful system needs connectors that preserve more than raw text.&lt;/p&gt;

&lt;p&gt;It needs to preserve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document identity&lt;/li&gt;
&lt;li&gt;source type&lt;/li&gt;
&lt;li&gt;update time&lt;/li&gt;
&lt;li&gt;authorship&lt;/li&gt;
&lt;li&gt;metadata&lt;/li&gt;
&lt;li&gt;deletion state&lt;/li&gt;
&lt;li&gt;access rules where available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Onyx puts connectors near the center of the product. Its public materials describe more than 50 indexing-based connectors, plus MCP-based extensibility. This matters because connectors are not just import tools. They are the system's contact surface with reality.&lt;/p&gt;

&lt;p&gt;If the connector layer is weak, the model may still produce polished prose, but the answer will be grounded in an incomplete or outdated world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering Requirement 2: Permissions Must Travel With the Knowledge
&lt;/h2&gt;

&lt;p&gt;The most dangerous enterprise AI bug is not a bad summary. It is a correct answer shown to the wrong person.&lt;/p&gt;

&lt;p&gt;That is why permission-aware retrieval is not a compliance add-on. It is part of the knowledge model itself. A private finance memo and a public engineering guide are not merely two text blocks with different labels. They have different organizational meaning because they participate in different visibility networks.&lt;/p&gt;

&lt;p&gt;From an ontology perspective, boundaries are part of what a thing is. In engineering terms: access control must be attached before retrieval, not patched after generation.&lt;/p&gt;

&lt;p&gt;Onyx's public positioning highlights permission-aware search and keeping permissions attached to the source. That is the right architectural direction. The retrieval system should know what the current user is allowed to see before the model ever receives context.&lt;/p&gt;

&lt;p&gt;A useful test is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can two users ask the same question and receive different valid results based on their permissions?&lt;/li&gt;
&lt;li&gt;Can the system explain which sources were used?&lt;/li&gt;
&lt;li&gt;Can revoked access stop influencing future answers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is no, the AI system is not ready for private knowledge work.&lt;/p&gt;
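&lt;p&gt;The first test translates into a very small piece of code: filter by visibility before retrieval, never after generation. The document shape and group model here are illustrative, not Onyx's actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface Doc { id: string; text: string; allowedGroups: string[]; }

// Permission-aware retrieval: drop invisible documents before
// ranking or prompting, so the model never sees them at all.
function visibleDocs(docs: Doc[], userGroups: string[]): Doc[] {
  return docs.filter((d) =&gt;
    d.allowedGroups.some((g) =&gt; userGroups.includes(g)),
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two users asking the same question now retrieve from different visible sets, which is exactly the behavior the test above demands.&lt;/p&gt;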

&lt;h2&gt;
  
  
  Engineering Requirement 3: Freshness Is a Data Pipeline Problem
&lt;/h2&gt;

&lt;p&gt;Freshness is often presented as a UI feature: "This answer cites recent sources."&lt;/p&gt;

&lt;p&gt;In practice, freshness is a pipeline property.&lt;/p&gt;

&lt;p&gt;The system has to detect source changes, schedule sync jobs, update chunks, remove deleted content, refresh embeddings or indexes, and preserve enough metadata for ranking and filtering. This is not glamorous, but it is the difference between a useful knowledge layer and a historical archive with a chat interface.&lt;/p&gt;
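&lt;p&gt;One pass of that pipeline can be sketched as a pure planning step. The shapes below are illustrative; a real system would also handle chunking, embedding, and permission sync:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface SourceDoc { id: string; updatedAt: number; deleted: boolean; }

// Decide what one sync pass must do: re-index new or changed
// documents, and prune anything deleted or gone upstream.
function planSync(
  source: SourceDoc[],
  indexedAt: Map&lt;string, number&gt;, // doc id -&gt; last indexed time
): { reindex: string[]; remove: string[] } {
  const reindex: string[] = [];
  const remove: string[] = [];
  const live = new Set&lt;string&gt;();
  for (const doc of source) {
    if (doc.deleted) { remove.push(doc.id); continue; }
    live.add(doc.id);
    const last = indexedAt.get(doc.id);
    if (last === undefined || doc.updatedAt &gt; last) reindex.push(doc.id);
  }
  for (const id of indexedAt.keys()) {
    if (!live.has(id) &amp;&amp; !remove.includes(id)) remove.push(id);
  }
  return { reindex, remove };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The important property is the second loop: documents that vanished upstream must stop influencing answers, not merely stop being refreshed.&lt;/p&gt;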

&lt;p&gt;Onyx Standard mode is interesting here because the public materials describe the heavier machinery behind production retrieval: vector and keyword indexing, background workers for sync jobs, model inference services used both at indexing time and at query time, Redis, MinIO, Postgres, and Vespa. The stack is a reminder that trustworthy AI is not one model call. It is a stateful system that has to keep adjusting.&lt;/p&gt;

&lt;p&gt;This is where the ontology lens becomes practical. A system continues to exist by doing two things: acting outward and adjusting inward. For enterprise AI, "inward adjustment" means re-syncing, pruning, re-ranking, re-checking permissions, and correcting its own representation of the organization as the organization changes.&lt;/p&gt;

&lt;p&gt;Without that internal adjustment, citations eventually become decoration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering Requirement 4: Search Should Be Inspectable, Not Hidden Inside Chat
&lt;/h2&gt;

&lt;p&gt;Chat is a convenient interface, but it should not be the only interface.&lt;/p&gt;

&lt;p&gt;When a user is doing serious work, they often need to inspect the source set before trusting the synthesis. They may want to filter by author, time range, source type, tag, or document family. They may want to compare sources instead of accepting a single blended answer.&lt;/p&gt;

&lt;p&gt;Onyx's public materials describe a dedicated search experience with query classification and filters. That is more important than it might look. It separates retrieval from generation.&lt;/p&gt;

&lt;p&gt;This separation gives teams a way to debug and trust the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the system retrieve?&lt;/li&gt;
&lt;li&gt;Why did this source rank higher than that one?&lt;/li&gt;
&lt;li&gt;Was the answer built from the right category of documents?&lt;/li&gt;
&lt;li&gt;Did the model summarize the evidence correctly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, observability is not only for servers. Knowledge retrieval needs observability too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering Requirement 5: Some Answers Need Actions, Not Just Text
&lt;/h2&gt;

&lt;p&gt;Internal AI becomes more useful when it can move from "tell me" to "help me do the work."&lt;/p&gt;

&lt;p&gt;Some questions require internal search. Some require fresh web context. Some require code execution, calculations, API calls, or interaction with an operational system. If the assistant cannot use tools, it remains a commentator on the workflow rather than a participant in it.&lt;/p&gt;

&lt;p&gt;Onyx supports built-in actions such as internal search, web search, code execution, and image generation, and it supports custom actions through OpenAPI and MCP. The important part is not just that actions exist. It is that actions need governance.&lt;/p&gt;

&lt;p&gt;For enterprise use, tool access should answer the same questions as document access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which user is allowed to call this action?&lt;/li&gt;
&lt;li&gt;Does the action use shared authentication or user-level authentication?&lt;/li&gt;
&lt;li&gt;What data leaves the deployment boundary?&lt;/li&gt;
&lt;li&gt;Can the result be traced back to the tool and source?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many AI assistants become operationally risky. The moment an assistant can act, permissions, auditability, and data boundaries matter even more.&lt;/p&gt;
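&lt;p&gt;Those questions translate into a gate in front of every tool call. The action shape and audit format below are assumptions for illustration, not Onyx's API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface ToolAction { name: string; allowedRoles: string[]; }

// Gate tool use the same way as document access: check the caller
// first, and record every attempt so results stay traceable.
function authorizeAction(
  action: ToolAction,
  userRoles: string[],
  audit: (entry: string) =&gt; void,
): boolean {
  const allowed = action.allowedRoles.some((r) =&gt; userRoles.includes(r));
  audit(`${action.name}: ${allowed ? "allowed" : "denied"}`);
  return allowed;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;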

&lt;h2&gt;
  
  
  Engineering Requirement 6: Deployment Boundaries Are Product Decisions
&lt;/h2&gt;

&lt;p&gt;Private knowledge systems cannot treat deployment as an afterthought.&lt;/p&gt;

&lt;p&gt;Some teams are comfortable with cloud hosting. Others need self-hosting because of data sensitivity, compliance requirements, network topology, or internal security review. The architecture has to make those tradeoffs explicit.&lt;/p&gt;

&lt;p&gt;Onyx describes both Lite and Standard deployment modes. Lite is lighter and chat-oriented. Standard adds the heavier retrieval and synchronization infrastructure needed for stronger production knowledge workflows. Its public materials also describe a self-hosted architecture where the core system runs inside the deployment boundary, while external services such as LLM APIs, embedding providers, web search, or image generation are explicitly configured by the admin.&lt;/p&gt;

&lt;p&gt;That distinction matters. A private AI system should make it clear where data is stored, when data leaves the boundary, and which external services participate in the answer.&lt;/p&gt;

&lt;p&gt;Good security architecture is not just about preventing incidents. It also makes trust explainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Evaluation Checklist
&lt;/h2&gt;

&lt;p&gt;If you are evaluating a private AI knowledge system, the most useful questions are not about demo magic. They are about failure modes.&lt;/p&gt;

&lt;p&gt;Ask these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What source systems can it connect to?&lt;/li&gt;
&lt;li&gt;Does it preserve metadata and deletion state?&lt;/li&gt;
&lt;li&gt;Are permissions enforced before retrieval?&lt;/li&gt;
&lt;li&gt;Can different users receive different valid answers?&lt;/li&gt;
&lt;li&gt;How does the system handle stale, conflicting, or removed content?&lt;/li&gt;
&lt;li&gt;Can users inspect retrieved sources before trusting the answer?&lt;/li&gt;
&lt;li&gt;Does it support both search and chat?&lt;/li&gt;
&lt;li&gt;Can it use tools or actions safely?&lt;/li&gt;
&lt;li&gt;What leaves the deployment boundary?&lt;/li&gt;
&lt;li&gt;Can the architecture scale from a small pilot to a production knowledge layer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This checklist is deliberately practical. If a system cannot answer these questions, the risk is not that the model sounds bad. The risk is that it sounds good while being wrong, stale, or unsafe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Onyx Fits
&lt;/h2&gt;

&lt;p&gt;Onyx is not interesting because it promises a prettier chatbot. It is interesting because its public architecture acknowledges the parts of enterprise AI that are easy to underestimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source connectivity&lt;/li&gt;
&lt;li&gt;permission-aware retrieval&lt;/li&gt;
&lt;li&gt;citations and freshness&lt;/li&gt;
&lt;li&gt;dedicated search&lt;/li&gt;
&lt;li&gt;agents and governed actions&lt;/li&gt;
&lt;li&gt;cloud and self-hosted deployment options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phrase "Search private knowledge before you trust the answer" is a good summary of the engineering posture. Trust should be earned by the source path, not assumed from model fluency.&lt;/p&gt;

&lt;p&gt;That is also where the ontology idea fits naturally. A knowledge system has to keep existing correctly in relation to its environment. It must absorb changes from the outside, adjust its internal representation, respect boundaries, and act only through governed channels. Otherwise, it is not a trusted layer. It is a static snapshot with a fluent interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;The next wave of enterprise AI will not be defined by "chat with docs" alone. It will be defined by systems that can connect private sources, preserve permissions, stay fresh, expose evidence, and act safely.&lt;/p&gt;

&lt;p&gt;Onyx is a useful case study because it treats those concerns as core architecture rather than as optional polish.&lt;/p&gt;

&lt;p&gt;For teams exploring this category, the best next step is small and concrete: choose one knowledge domain with real permissions, frequent updates, and answers that require citations. Test whether the system can handle the full chain from source connection to permission-aware retrieval to cited answer to governed action. If that chain works, the pilot can grow. If it breaks, the problem is probably not the prompt. It is the knowledge system.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;onyx.guru page: &lt;a href="https://onyx.guru/" rel="noopener noreferrer"&gt;https://onyx.guru/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Onyx open-source repository: &lt;a href="https://github.com/onyx-dot-app/onyx" rel="noopener noreferrer"&gt;https://github.com/onyx-dot-app/onyx&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rag</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Why File Type Detection Is More Than a Metadata Problem</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Wed, 29 Apr 2026 08:20:32 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/why-file-type-detection-is-more-than-a-metadata-problem-32h4</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/why-file-type-detection-is-more-than-a-metadata-problem-32h4</guid>
      <description>&lt;h3&gt;
  
  
  What Magika teaches us about names, evidence, boundaries, and trustworthy file intelligence
&lt;/h3&gt;

&lt;p&gt;Author note: This article is written for engineers building upload flows, storage systems, CI pipelines, security tooling, and AI products that need to reason about real files instead of just trusting filenames.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;When a system accepts a file, one of the first questions sounds almost trivial:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is this thing?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But many production systems still answer that question with a weak proxy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the filename extension&lt;/li&gt;
&lt;li&gt;the browser-provided MIME type&lt;/li&gt;
&lt;li&gt;a user claim&lt;/li&gt;
&lt;li&gt;a storage metadata field&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That works until it does not.&lt;/p&gt;

&lt;p&gt;A file called &lt;code&gt;invoice.pdf&lt;/code&gt; may actually be a ZIP container, a JavaScript payload, a damaged document, or a binary blob that should never reach the parser you are about to invoke.&lt;/p&gt;

&lt;p&gt;This is why Google's open-source &lt;a href="https://github.com/google/magika" rel="noopener noreferrer"&gt;Magika&lt;/a&gt; project is interesting.&lt;/p&gt;

&lt;p&gt;Magika is not just another convenience wrapper around file metadata. It is a content-based file type detector that tries to ground classification in the file's actual bytes.&lt;/p&gt;

&lt;p&gt;For readers who want to inspect that idea without installing a command-line tool, &lt;a href="https://www.magika.uk" rel="noopener noreferrer"&gt;magika.uk&lt;/a&gt; provides a web version of the same practical workflow: upload a file, and the result exposes detected type, MIME type, file group, confidence, and an extension mismatch signal.&lt;/p&gt;

&lt;p&gt;That design choice matters technically. It also gives us a useful way to think about file identity.&lt;/p&gt;

&lt;p&gt;If we borrow the word "ontology" in a practical engineering sense, it simply means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the model a system uses to decide what kind of thing it is interacting with, where the boundary of that thing is, and what actions are valid once that classification is made.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From that perspective, file type detection is not only a naming problem.&lt;/p&gt;

&lt;p&gt;It is a boundary and evidence problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Extension Mistake
&lt;/h2&gt;

&lt;p&gt;Let me start with a question.&lt;/p&gt;

&lt;p&gt;Suppose your upload service receives these three files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;headshot.png&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;report.docx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;archive.txt&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which one should go to the image thumbnailer?&lt;br&gt;
Which one is safe to send to a document parser?&lt;br&gt;
Which one deserves secondary inspection before entering the rest of your pipeline?&lt;/p&gt;

&lt;p&gt;If your answer is mostly based on the suffix after the last dot, your system is not classifying files. It is trusting labels.&lt;/p&gt;

&lt;p&gt;That is a very human habit.&lt;/p&gt;

&lt;p&gt;Humans like names. Names are cheap. Names are convenient. Names are socially useful.&lt;/p&gt;

&lt;p&gt;But files do not become PNGs because we call them &lt;code&gt;.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Operationally, a file becomes a "PNG" because its internal structure, magic bytes, and content patterns support a set of downstream interactions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;image decoders can parse it&lt;/li&gt;
&lt;li&gt;rendering pipelines can transform it&lt;/li&gt;
&lt;li&gt;security scanners can apply the right rules&lt;/li&gt;
&lt;li&gt;storage systems can make the right policy decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file's practical identity is tied to how systems interact with it, not to what a human named it.&lt;/p&gt;

&lt;p&gt;This is where a deeper model of identity becomes useful.&lt;/p&gt;

&lt;p&gt;A useful principle is that we should stop treating human definitions as if they were identical to reality. Things reveal themselves through interaction. In file pipelines, that means the "real type" of a file is closer to its interaction surface than to its filename.&lt;/p&gt;

&lt;p&gt;An extension is a claim.&lt;/p&gt;

&lt;p&gt;Content is evidence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbejpq2dxv4epa83w3gn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbejpq2dxv4epa83w3gn.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Caption: A filename is a useful claim, but the file's bytes provide stronger evidence for downstream decisions.&lt;/p&gt;
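&lt;p&gt;The claim-versus-evidence gap is easy to demonstrate directly. The magic-byte signatures below are real (PNG, PDF, ZIP), but this tiny table is only an illustration; a learned detector like Magika covers far more types and far more ambiguous cases:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// A few well-known magic-byte signatures.
const MAGIC: Record&lt;string, number[]&gt; = {
  png: [0x89, 0x50, 0x4e, 0x47], // \x89PNG
  pdf: [0x25, 0x50, 0x44, 0x46], // %PDF
  zip: [0x50, 0x4b, 0x03, 0x04], // PK.. (also docx, xlsx, jar)
};

// Evidence: what the leading bytes actually look like.
function sniff(bytes: Uint8Array): string | null {
  for (const [type, sig] of Object.entries(MAGIC)) {
    if (sig.every((b, i) =&gt; bytes[i] === b)) return type;
  }
  return null;
}

// Claim vs evidence: flag files whose extension disagrees
// with the observed signature.
function extensionMismatch(filename: string, bytes: Uint8Array): boolean {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  const observed = sniff(bytes);
  return observed !== null &amp;&amp; observed !== ext;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A file named &lt;code&gt;invoice.pdf&lt;/code&gt; that begins with the ZIP signature is a container wearing a PDF name, and that disagreement, not the name alone, is what should drive routing.&lt;/p&gt;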


&lt;h2&gt;
  
  
  2. What Magika Actually Adds
&lt;/h2&gt;

&lt;p&gt;Magika matters because it operationalizes that distinction.&lt;/p&gt;

&lt;p&gt;According to the official project materials, Magika uses a compact deep learning model, only a few megabytes in size, trained and evaluated on roughly 100 million samples across more than 200 content types. After the one-time model load, inference is on the order of milliseconds per file on a single CPU. It also avoids reading entire large files into memory, typically inspecting only a limited subset of content, usually a few hundred bytes and up to around 2 KB depending on the model.&lt;/p&gt;

&lt;p&gt;That combination leads to an engineering result that is more important than it first appears:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;content-based classification&lt;/li&gt;
&lt;li&gt;near-constant inference time&lt;/li&gt;
&lt;li&gt;enough accuracy to be useful in real routing decisions&lt;/li&gt;
&lt;li&gt;enough speed to sit early in a pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why Magika is not just a developer toy.&lt;/p&gt;

&lt;p&gt;It is a pre-routing layer.&lt;/p&gt;

&lt;p&gt;It answers a system question that sits before many more expensive questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before I parse, transform, render, index, execute, or scan this object deeply, what kind of object am I probably dealing with?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That early answer changes architecture.&lt;/p&gt;

&lt;p&gt;Instead of letting every downstream component discover the file type in its own fragile way, you can establish a first-pass classification layer and make later steps conditional on that result.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Weak pipeline&lt;/th&gt;
&lt;th&gt;Stronger pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Route by extension&lt;/td&gt;
&lt;td&gt;Route by detected content label&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trust client MIME type&lt;/td&gt;
&lt;td&gt;Compare claimed type with observed type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parse first, reject later&lt;/td&gt;
&lt;td&gt;Identify first, then choose parser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Demand exact guesses&lt;/td&gt;
&lt;td&gt;Allow generic fallback when confidence is low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row is especially important.&lt;/p&gt;

&lt;p&gt;Because one of Magika's best ideas is not only that it predicts types.&lt;/p&gt;

&lt;p&gt;It is that it does not always pretend to know.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. The Most Interesting Part: It Separates Belief from Decision
&lt;/h2&gt;

&lt;p&gt;This is, to me, the most underrated design choice in Magika.&lt;/p&gt;

&lt;p&gt;The official output model distinguishes between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the raw deep-learning prediction&lt;/li&gt;
&lt;li&gt;the final tool output used for operational decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the system separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the model believes&lt;/li&gt;
&lt;li&gt;what the product is willing to say&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a powerful distinction.&lt;/p&gt;

&lt;p&gt;If the model predicts a type with low confidence, Magika does not have to force a precise answer. It can return a more generic label such as a broad text or unknown binary category, depending on the case. The documentation also describes per-content-type thresholds and multiple prediction modes, including &lt;code&gt;high-confidence&lt;/code&gt;, &lt;code&gt;medium-confidence&lt;/code&gt;, and &lt;code&gt;best-guess&lt;/code&gt;.&lt;/p&gt;
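&lt;p&gt;That separation can be sketched in a few lines. The threshold value and the generic fallback labels here are illustrative, not Magika's actual per-content-type configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface RawPrediction { label: string; score: number; isText: boolean; }

// Separate belief (the raw score) from decision (the emitted label):
// below the threshold, fall back to an honest generic category.
function finalLabel(p: RawPrediction, threshold = 0.9): string {
  if (p.score &gt;= threshold) return p.label;
  return p.isText ? "txt" : "unknown";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;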

&lt;p&gt;This is not just a tuning convenience.&lt;/p&gt;

&lt;p&gt;It is an epistemic boundary.&lt;/p&gt;

&lt;p&gt;A careless classifier says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I always owe you a specific answer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A disciplined classifier says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I owe you the strongest answer that the evidence can justify.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That difference is the heart of trustworthy file intelligence.&lt;/p&gt;

&lt;p&gt;From a system design perspective, this is a very healthy move. A system should not confuse naming with knowing. If it cannot identify an object precisely enough, it should still place that object honestly within a safer boundary.&lt;/p&gt;

&lt;p&gt;That is why Magika's generic labels are not a weakness.&lt;/p&gt;

&lt;p&gt;They are a form of boundary recognition.&lt;/p&gt;

&lt;p&gt;And boundary recognition is one of the hardest things to get right in production systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohpb7aaisp420aujawhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohpb7aaisp420aujawhl.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Caption: The valuable step is not only prediction, but converting confidence into an honest system output.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. A Practical Model of File Identity
&lt;/h2&gt;

&lt;p&gt;If "ontology" sounds too abstract, here is the same idea in narrower engineering terms.&lt;/p&gt;

&lt;p&gt;For a production system, a file identity model answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What entities do I believe exist here?&lt;/li&gt;
&lt;li&gt;How do I distinguish one entity from another?&lt;/li&gt;
&lt;li&gt;What evidence is strong enough to justify that distinction?&lt;/li&gt;
&lt;li&gt;What actions become valid after classification?&lt;/li&gt;
&lt;li&gt;What should I do when the boundary is unclear?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now apply those questions to files.&lt;/p&gt;

&lt;p&gt;A simplistic model says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Entity = filename extension
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better model says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Entity = content-bearing object with a detectable internal structure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An even better operational model says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Entity = content-bearing object whose probable downstream interactions
can be estimated from observed bytes, confidence thresholds, and routing policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That third version is much closer to how resilient systems should think.&lt;/p&gt;

&lt;p&gt;Two ideas are especially useful here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do not remain trapped in human-centered naming&lt;/li&gt;
&lt;li&gt;understand things through external interaction and internal adjustment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Magika maps neatly onto both.&lt;/p&gt;

&lt;p&gt;First, it moves classification away from human-centered naming. The extension may still be useful, but it is no longer treated as the essence of the object.&lt;/p&gt;

&lt;p&gt;Second, it helps a larger system connect external interaction with internal adjustment.&lt;/p&gt;

&lt;p&gt;The file produces an external signal through its bytes.&lt;/p&gt;

&lt;p&gt;The system then performs internal adjustment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allow&lt;/li&gt;
&lt;li&gt;block&lt;/li&gt;
&lt;li&gt;quarantine&lt;/li&gt;
&lt;li&gt;route to a safer parser&lt;/li&gt;
&lt;li&gt;request secondary scanning&lt;/li&gt;
&lt;li&gt;log an extension mismatch&lt;/li&gt;
&lt;li&gt;downgrade trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I would describe Magika not merely as a classifier, but as a boundary-aware adjustment trigger for file pipelines.&lt;/p&gt;
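&lt;p&gt;As a sketch, that adjustment step is just a policy function from detection output to pipeline behavior. The field names and routes are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface Detection { group: string; confidence: number; extMismatch: boolean; }

// Boundary-aware routing: identity conflicts and low confidence
// both push the file toward slower, safer handling.
function routeFile(d: Detection): "parse" | "secondary-scan" | "quarantine" {
  if (d.extMismatch) return "quarantine";
  if (d.confidence &lt; 0.8) return "secondary-scan";
  return "parse";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;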




&lt;h2&gt;
  
  
  5. Why the Web Version Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.magika.uk" rel="noopener noreferrer"&gt;magika.uk&lt;/a&gt; is useful not because every file intelligence idea needs a website, but because a web version makes the classification process easier to inspect.&lt;/p&gt;

&lt;p&gt;The interface does not present file detection as a mystical black box. It surfaces a set of operationally relevant fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detected type&lt;/li&gt;
&lt;li&gt;MIME type&lt;/li&gt;
&lt;li&gt;group&lt;/li&gt;
&lt;li&gt;confidence&lt;/li&gt;
&lt;li&gt;extension mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also frames the runtime explicitly; the upload demo shows &lt;code&gt;magika-js/browser&lt;/code&gt;, which is a useful reminder that the same classification idea can run close to the user, not only deep in backend infrastructure.&lt;/p&gt;

&lt;p&gt;That matters for product architecture.&lt;/p&gt;

&lt;p&gt;If a browser-side or edge-side layer can classify content early, then some decisions can happen before the file reaches more privileged systems. Even when you still need server-side verification, early detection can improve UX, reduce bad uploads, and make downstream policy more explainable.&lt;/p&gt;

&lt;p&gt;Notice what is absent from this kind of interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hype about "understanding all files"&lt;/li&gt;
&lt;li&gt;vague security theater&lt;/li&gt;
&lt;li&gt;a single overconfident badge that hides uncertainty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it exposes the kind of metadata a builder actually needs to reason with.&lt;/p&gt;

&lt;p&gt;That is a good product instinct.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. A Better Mental Model: File Identity as Interaction Potential
&lt;/h2&gt;

&lt;p&gt;One reason file classification is often implemented poorly is that teams think of type as static metadata.&lt;/p&gt;

&lt;p&gt;But in real systems, type is better understood as interaction potential.&lt;/p&gt;

&lt;p&gt;A file type is a compressed summary of likely behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what parser chain it can enter&lt;/li&gt;
&lt;li&gt;what rendering path it can trigger&lt;/li&gt;
&lt;li&gt;what policy rules should apply&lt;/li&gt;
&lt;li&gt;what scanners become relevant&lt;/li&gt;
&lt;li&gt;what storage or preview behavior is safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From this viewpoint, a file's "real type" is not just descriptive.&lt;/p&gt;

&lt;p&gt;It is predictive.&lt;/p&gt;

&lt;p&gt;That also connects nicely to another useful idea: simulation.&lt;/p&gt;

&lt;p&gt;Before a system acts on a file, it benefits from a lightweight simulation of what kind of world this object belongs to. Magika effectively provides that first simulation layer. It does not fully validate the object, and it does not tell you whether the file is malicious in every sense. But it does offer an informed prior about what downstream interactions are likely to make sense.&lt;/p&gt;

&lt;p&gt;That is enough to improve many workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upload moderation&lt;/li&gt;
&lt;li&gt;malware triage&lt;/li&gt;
&lt;li&gt;CI artifact inspection&lt;/li&gt;
&lt;li&gt;ETL pipelines&lt;/li&gt;
&lt;li&gt;object storage intake&lt;/li&gt;
&lt;li&gt;AI systems that ingest user-provided documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also where the "extension mismatch" signal becomes more interesting than it looks.&lt;/p&gt;

&lt;p&gt;A mismatch is not just a UX warning.&lt;/p&gt;

&lt;p&gt;It is a conflict between claimed identity and observed structure.&lt;/p&gt;

&lt;p&gt;And conflicts of identity are exactly where good systems should slow down.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. How I Would Use Magika in a Real Pipeline
&lt;/h2&gt;

&lt;p&gt;Here is a minimal example in JavaScript using the official package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Magika&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;magika&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;magika&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Magika&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;magika&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;identifyBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;holdForReview&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;group&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;sendToCodeScanning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;group&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;document&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;sendToDocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;sendToGenericProcessing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That example is intentionally simple, but the architectural pattern is the point.&lt;/p&gt;

&lt;p&gt;I would not use Magika as the final judge of safety.&lt;/p&gt;

&lt;p&gt;I would use it as the first trustworthy classifier that helps the rest of the system choose the right next interaction.&lt;/p&gt;

&lt;p&gt;A stronger production version might do something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture the claimed extension and client MIME type.&lt;/li&gt;
&lt;li&gt;Run Magika on content bytes.&lt;/li&gt;
&lt;li&gt;Compare claim vs observed label.&lt;/li&gt;
&lt;li&gt;Apply a risk-dependent prediction mode.&lt;/li&gt;
&lt;li&gt;Route to different scanners or parsers.&lt;/li&gt;
&lt;li&gt;Log mismatches and low-confidence outcomes for monitoring.&lt;/li&gt;
&lt;li&gt;Refuse dangerous transitions, such as "claimed image, detected executable/script-like content."&lt;/li&gt;
&lt;/ol&gt;
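&lt;p&gt;Steps 3, 6, and 7 above can be sketched in a few lines. The extension-to-group table and the confidence threshold below are illustrative assumptions, not Magika defaults.&lt;/p&gt;

```typescript
// Sketch of steps 3, 6, and 7: compare claimed identity against observed
// content. The extension table and the 0.9 threshold are assumptions.
interface Observation {
  label: string; // content-derived label (e.g. from Magika)
  group: string; // e.g. "image", "document", "code", "executable"
  score: number; // model confidence in [0, 1]
}

const claimedGroupByExtension: Record<string, string> = {
  png: "image",
  jpg: "image",
  pdf: "document",
  js: "code",
};

function routeUpload(filename: string, obs: Observation): string {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  const claimedGroup = claimedGroupByExtension[ext] ?? "unknown";

  // Step 7: refuse "claimed image, detected executable/script-like content".
  if (claimedGroup === "image" && (obs.group === "executable" || obs.group === "code")) {
    return "reject";
  }
  // Step 6: mismatches and low-confidence results go to review, not through.
  if (claimedGroup !== obs.group || obs.score < 0.9) {
    return "hold-for-review";
  }
  return "accept";
}
```

&lt;p&gt;The key design choice is that a mismatch never silently passes: it either hits an explicit refusal rule or lands in a review queue.&lt;/p&gt;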

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rcq5hozkbk51mi16bfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rcq5hozkbk51mi16bfz.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Caption: Magika is most useful as an early identity layer that helps the rest of the pipeline choose the right next interaction.&lt;/p&gt;

&lt;p&gt;This is where the identity model becomes practical.&lt;/p&gt;

&lt;p&gt;You are not only classifying the object.&lt;/p&gt;

&lt;p&gt;You are defining what kinds of interactions your system is willing to have with that object.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Where Magika Should Not Be Overstated
&lt;/h2&gt;

&lt;p&gt;A good technical article should also name the limits.&lt;/p&gt;

&lt;p&gt;Magika does not eliminate the need for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malware scanning&lt;/li&gt;
&lt;li&gt;parser hardening&lt;/li&gt;
&lt;li&gt;archive recursion policies&lt;/li&gt;
&lt;li&gt;schema validation&lt;/li&gt;
&lt;li&gt;business-level content checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is also not the same as full semantic understanding. Knowing that a file is likely a PDF is not the same as knowing whether the PDF is safe, well-formed, policy-compliant, or useful for your application.&lt;/p&gt;

&lt;p&gt;The official documentation also makes clear that some edge cases are handled outside the model itself, such as empty files, non-regular files like directories or symlinks, and very small inputs where only coarse heuristics make sense.&lt;/p&gt;
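&lt;p&gt;Those edge cases suggest a thin pre-check layer in front of the model. A hedged sketch, with an assumed size threshold rather than Magika's actual internal values:&lt;/p&gt;

```typescript
// Illustrative pre-checks of the kind the docs describe as living
// outside the model: empty and very small inputs get heuristic answers
// before any inference runs. The 8-byte threshold is an assumption.
function preClassify(bytes: Uint8Array): string | null {
  if (bytes.length === 0) return "empty";
  if (bytes.length < 8) {
    // Too little structure for a model; fall back to a coarse check.
    const printable = Array.from(bytes).every(
      (b) => b === 9 || b === 10 || b === 13 || (b >= 32 && b <= 126)
    );
    return printable ? "txt" : "unknown";
  }
  return null; // large enough: hand off to the model
}
```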

&lt;p&gt;That is normal.&lt;/p&gt;

&lt;p&gt;In fact, it is another sign of maturity.&lt;/p&gt;

&lt;p&gt;Reliable systems are often hybrids. They combine learned models, thresholds, heuristics, and policy logic. Pretending that one model should do everything is usually a symptom of bad architecture.&lt;/p&gt;

&lt;p&gt;So the right question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can Magika solve file security by itself?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where in my pipeline do I need a fast, content-grounded identity layer so later decisions become safer and more explainable?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a much more realistic framing.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Four Questions I Would Ask Before Integrating It
&lt;/h2&gt;

&lt;p&gt;If you are evaluating Magika or any similar system, I think these questions matter more than benchmark screenshots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What decisions in your pipeline are currently driven by extension or client-provided MIME alone?&lt;/li&gt;
&lt;li&gt;Which of those decisions are high-risk enough to require &lt;code&gt;high-confidence&lt;/code&gt; behavior instead of &lt;code&gt;best-guess&lt;/code&gt; behavior?&lt;/li&gt;
&lt;li&gt;What should your system do when it cannot classify precisely but can still classify safely as "generic text" or "unknown binary"?&lt;/li&gt;
&lt;li&gt;Do you treat extension mismatch as an actionable policy event, or only as debug information?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those questions expose whether your problem is merely "file detection" or whether it is actually "boundary-aware system design."&lt;/p&gt;

&lt;p&gt;Most of the time, it is the second.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Closing Thought
&lt;/h2&gt;

&lt;p&gt;What I like about Magika is not the vague idea that "AI can classify files better."&lt;/p&gt;

&lt;p&gt;What I like is the discipline behind it.&lt;/p&gt;

&lt;p&gt;It pushes a system to stop asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What did the user call this object?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and to start asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Based on the evidence available in the object itself, what kind of thing is this, how sure am I, and what interactions are justified next?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a better technical question.&lt;/p&gt;

&lt;p&gt;It is also a better question about identity and boundaries.&lt;/p&gt;

&lt;p&gt;And I suspect the same lesson applies far beyond file uploads:&lt;/p&gt;

&lt;p&gt;reliable systems improve when they ground identity in interaction, preserve uncertainty honestly, and let classification drive internal adjustment instead of blind execution.&lt;/p&gt;

&lt;p&gt;If you are building anything that accepts untrusted files, that shift is worth thinking about.&lt;/p&gt;

&lt;p&gt;Not because it sounds philosophical.&lt;/p&gt;

&lt;p&gt;Because it is operationally useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Questions
&lt;/h2&gt;

&lt;p&gt;I would be curious how other builders think about this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are you still routing uploads mainly by extension or client MIME?&lt;/li&gt;
&lt;li&gt;Where in your pipeline would a generic "unknown" answer actually be safer than an overconfident specific label?&lt;/li&gt;
&lt;li&gt;Do you treat file identity as metadata, or as a prediction about downstream interaction?&lt;/li&gt;
&lt;li&gt;If you have tried Magika or the magika.uk web version, what did it change in your routing or security design?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Web version: &lt;a href="https://www.magika.uk" rel="noopener noreferrer"&gt;https://www.magika.uk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Upstream open-source project: &lt;a href="https://github.com/google/magika" rel="noopener noreferrer"&gt;https://github.com/google/magika&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Official documentation: &lt;a href="https://securityresearch.google/magika" rel="noopener noreferrer"&gt;https://securityresearch.google/magika&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cybersecurity</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>tooling</category>
    </item>
    <item>
      <title>We Built Multica to Make Multi-Agent AI Useful for Real Workflows</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Tue, 28 Apr 2026 16:15:59 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/we-built-multica-to-make-multi-agent-ai-useful-for-real-workflows-41h1</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/we-built-multica-to-make-multi-agent-ai-useful-for-real-workflows-41h1</guid>
      <description>&lt;p&gt;AI is getting better fast, but the way most people use it still feels fragmented.&lt;/p&gt;

&lt;p&gt;You open one chat, try one prompt, get one answer, and then start over somewhere else. That works for simple tasks, but it quickly becomes limiting when real work requires comparison, coordination, and iteration across multiple models.&lt;/p&gt;

&lt;p&gt;That gap is what led us to build Multica.&lt;/p&gt;

&lt;p&gt;Multica is a multi-agent collaboration platform designed for real workflows. Instead of treating AI like a series of isolated conversations, Multica gives you one place to run, compare, and coordinate multiple AI agents in parallel. You can review outputs side by side, test different reasoning styles, and turn parallel AI execution into a more structured and repeatable process.&lt;/p&gt;

&lt;p&gt;We built it around a simple belief: the future of AI is not just better models. It is better workflow.&lt;/p&gt;

&lt;p&gt;In practice, real work is rarely solved by a single prompt or a single model. One model may be better at reasoning. Another may be better at speed. A third may produce a stronger draft, better structure, or a more useful perspective for the task. But switching across tools and trying to manage that process manually is slow, messy, and difficult to scale.&lt;/p&gt;

&lt;p&gt;Multica was created to solve that problem.&lt;/p&gt;

&lt;p&gt;With Multica, teams can work across models like GPT, Claude, Gemini, and others in one workflow instead of scattering tasks across separate tabs and disconnected chats. The goal is not just to generate more output. It is to create clearer decisions, faster iteration loops, and a more reliable way to use AI in production work.&lt;/p&gt;

&lt;p&gt;That matters because multi-agent collaboration only becomes valuable when it is usable. It needs structure. It needs visibility. It needs a workflow that helps people compare outputs, make decisions, and move forward without losing context.&lt;/p&gt;

&lt;p&gt;This is the direction we believe AI tools need to go.&lt;/p&gt;

&lt;p&gt;Not just chat interfaces. Not just one-off prompts. Not just isolated answers.&lt;/p&gt;

&lt;p&gt;But systems that help people coordinate intelligence in a way that fits real work.&lt;/p&gt;

&lt;p&gt;That is what we are building with Multica.&lt;/p&gt;

&lt;p&gt;If you are exploring how to make multi-agent AI more practical, organized, and useful for real workflows, you can learn more here: &lt;a href="https://www.multica.uk/" rel="noopener noreferrer"&gt;https://www.multica.uk/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What If You Could Predict Decisions Before Making Them?</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:36:42 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/what-if-you-could-predict-decisions-before-making-them-f94</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/what-if-you-could-predict-decisions-before-making-them-f94</guid>
      <description>&lt;p&gt;What if you could test a decision before actually making it?&lt;/p&gt;

&lt;p&gt;Not with spreadsheets. Not with gut feeling. But by simulating how people, markets, and narratives might react.&lt;/p&gt;

&lt;p&gt;That’s the idea I’ve been exploring.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Most decisions today are still based on guesswork.&lt;/p&gt;

&lt;p&gt;We run A/B tests. We analyze past data. We try to predict outcomes.&lt;/p&gt;

&lt;p&gt;But real-world systems don’t behave like clean models.&lt;/p&gt;

&lt;p&gt;People react. Narratives spread. Unexpected things happen.&lt;/p&gt;

&lt;p&gt;And once a decision is made, it’s often too late to go back.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;So I started thinking:&lt;/p&gt;

&lt;p&gt;What if we could simulate decisions before committing to them?&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;I built a tool called MiroFish.&lt;/p&gt;

&lt;p&gt;It lets you ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“What happens if we raise prices next quarter?”&lt;/li&gt;
&lt;li&gt;“How might public opinion shift after a policy change?”&lt;/li&gt;
&lt;li&gt;“What happens to a brand after a PR crisis?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind the scenes, it runs multi-agent simulations to model how things might unfold — and returns structured predictions.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;The experience is simple.&lt;/p&gt;

&lt;p&gt;You ask a question — just like you would in ChatGPT.&lt;/p&gt;

&lt;p&gt;But instead of generating a single answer, the system simulates interactions between agents, narratives, and possible outcomes.&lt;/p&gt;

&lt;p&gt;It’s less about finding the answer, and more about exploring what could happen.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Some of the use cases I’m exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pricing strategy decisions&lt;/li&gt;
&lt;li&gt;Market sentiment forecasting&lt;/li&gt;
&lt;li&gt;Narrative and public opinion shifts&lt;/li&gt;
&lt;li&gt;Policy impact simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, any situation where outcomes depend on complex interactions.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;This is still an early project.&lt;/p&gt;

&lt;p&gt;But I believe the idea of “simulating decisions” will become increasingly important as systems get more complex.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;If you’re curious, you can try it here: &lt;a href="https://www.mirofish.work/" rel="noopener noreferrer"&gt;https://www.mirofish.work/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love to hear what you think — or how you’d use something like this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Prompts Are Not Enough for Long-Running AI Agents</title>
      <dc:creator>dengkui yang</dc:creator>
      <pubDate>Wed, 22 Apr 2026 07:16:05 +0000</pubDate>
      <link>https://forem.com/dengkui_yang_fcb5dbe2da32/why-prompts-are-not-enough-for-long-running-ai-agents-2bn5</link>
      <guid>https://forem.com/dengkui_yang_fcb5dbe2da32/why-prompts-are-not-enough-for-long-running-ai-agents-2bn5</guid>
      <description>&lt;p&gt;&lt;em&gt;A small ontology-inspired model for understanding why AI agents fail after the first obstacle&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Author note: This article is written for AI builders, prompt engineers, automation teams, and founders experimenting with long-running AI agents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Most AI agent failures are not caused by a lack of instructions.&lt;/p&gt;

&lt;p&gt;They happen after instructions meet resistance.&lt;/p&gt;

&lt;p&gt;The agent starts well. It understands the goal. It calls a tool. It writes a plan. It takes the first step. Then reality pushes back: a missing field, an unclear constraint, a failed API call, a contradictory user request, an impossible subtask, a weak assumption.&lt;/p&gt;

&lt;p&gt;At that moment, many agents do not adjust themselves.&lt;/p&gt;

&lt;p&gt;They repeat. They rephrase. They overthink. They add more steps. They call the same tool again. They produce a more confident version of the same mistake.&lt;/p&gt;

&lt;p&gt;That is why prompts are not enough for long-running AI agents.&lt;/p&gt;

&lt;p&gt;A prompt tells an agent what to do. A survival framework tells it how to continue when the task pushes back.&lt;/p&gt;

&lt;p&gt;This article introduces a small ontology-inspired model for AI agent behavior:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A stable agent needs two loops: external action and internal adjustment.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. The Prompt Patch Problem
&lt;/h2&gt;

&lt;p&gt;When an AI agent fails, the usual response is to patch the prompt.&lt;/p&gt;

&lt;p&gt;We add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more rules&lt;/li&gt;
&lt;li&gt;more constraints&lt;/li&gt;
&lt;li&gt;more examples&lt;/li&gt;
&lt;li&gt;more warnings&lt;/li&gt;
&lt;li&gt;more formatting requirements&lt;/li&gt;
&lt;li&gt;more tool-use instructions&lt;/li&gt;
&lt;li&gt;more "do not hallucinate" clauses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes this works.&lt;/p&gt;

&lt;p&gt;But prompt patching has a limit. Past a certain point, the prompt becomes a pile of defensive instructions. The agent is not becoming more stable. It is simply carrying more fragile rules.&lt;/p&gt;

&lt;p&gt;The problem is deeper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Many prompts describe the desired behavior, but they do not define how the agent should transform itself after failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That missing transformation is the core issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram: Prompt Patch vs Adjustment Loop
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsezs9dfi01y53v4ytft4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsezs9dfi01y53v4ytft4.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt patching says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Here is another rule. Try not to fail again."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Internal adjustment says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When you fail, identify what changed inside your model of the task, then act again."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Failure Pattern
&lt;/h2&gt;

&lt;p&gt;Here is a common long-running agent failure pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User:
Find 20 relevant communities where I can discuss AI agent reliability,
then draft a short post for each one.

Agent:
Understood. I will search for communities and draft posts.

Step 1:
The agent searches.

Problem:
The search result is noisy. Some communities ban self-promotion.
Some are inactive. Some are not about AI agents.

Bad agent behavior:
The agent still drafts 20 posts anyway.

Worse agent behavior:
When corrected, it says "You're right" and drafts another 20 posts,
but with slightly different wording.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure is not that the agent misunderstood the original instruction.&lt;/p&gt;

&lt;p&gt;The failure is that it did not adjust after discovering new reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;community rules matter&lt;/li&gt;
&lt;li&gt;activity level matters&lt;/li&gt;
&lt;li&gt;relevance is not binary&lt;/li&gt;
&lt;li&gt;self-promotion risk must be modeled&lt;/li&gt;
&lt;li&gt;a search result is not yet a valid target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent performed external action.&lt;/p&gt;

&lt;p&gt;It did not perform internal adjustment.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. A Small Ontology for AI Agents
&lt;/h2&gt;

&lt;p&gt;I use "ontology" here in a practical sense.&lt;/p&gt;

&lt;p&gt;Not as a grand metaphysical claim.&lt;/p&gt;

&lt;p&gt;For AI agent design, ontology means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what entities the agent recognizes&lt;/li&gt;
&lt;li&gt;what boundaries it assigns&lt;/li&gt;
&lt;li&gt;what actions it can take&lt;/li&gt;
&lt;li&gt;what feedback it treats as meaningful&lt;/li&gt;
&lt;li&gt;how it updates itself after interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this model, any agent trying to persist through a task needs two loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loop 1: External Action
&lt;/h3&gt;

&lt;p&gt;External action is how the agent affects the world.&lt;/p&gt;

&lt;p&gt;It can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writing text&lt;/li&gt;
&lt;li&gt;calling tools&lt;/li&gt;
&lt;li&gt;searching&lt;/li&gt;
&lt;li&gt;editing files&lt;/li&gt;
&lt;li&gt;sending messages&lt;/li&gt;
&lt;li&gt;making plans&lt;/li&gt;
&lt;li&gt;asking questions&lt;/li&gt;
&lt;li&gt;changing a workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Loop 2: Internal Adjustment
&lt;/h3&gt;

&lt;p&gt;Internal adjustment is how the agent changes itself after the world pushes back.&lt;/p&gt;

&lt;p&gt;It can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;revising assumptions&lt;/li&gt;
&lt;li&gt;narrowing scope&lt;/li&gt;
&lt;li&gt;identifying missing data&lt;/li&gt;
&lt;li&gt;recognizing a boundary&lt;/li&gt;
&lt;li&gt;changing strategy&lt;/li&gt;
&lt;li&gt;asking for help&lt;/li&gt;
&lt;li&gt;stopping a risky path&lt;/li&gt;
&lt;li&gt;updating the task model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Diagram: The Two-Loop Agent
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7alemsvks4ud1vgifdqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7alemsvks4ud1vgifdqn.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A long-running agent does not need only a stronger instruction.&lt;/p&gt;

&lt;p&gt;It needs a way to process feedback into self-change.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Why Longer Prompts Can Make Agents Less Stable
&lt;/h2&gt;

&lt;p&gt;Longer prompts often try to solve every possible future failure in advance.&lt;/p&gt;

&lt;p&gt;But the real world is interactive. The agent will encounter states that the prompt did not predict.&lt;/p&gt;

&lt;p&gt;When this happens, long prompts can create three problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rule collision&lt;/td&gt;
&lt;td&gt;Multiple instructions apply at once&lt;/td&gt;
&lt;td&gt;The agent chooses one arbitrarily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False confidence&lt;/td&gt;
&lt;td&gt;The prompt sounds complete&lt;/td&gt;
&lt;td&gt;The agent stops checking reality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No recovery layer&lt;/td&gt;
&lt;td&gt;The prompt says what to do, not how to recover&lt;/td&gt;
&lt;td&gt;The agent repeats failure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The issue is not prompt length itself.&lt;/p&gt;

&lt;p&gt;The issue is using prompt length as a substitute for adjustment architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  A prompt can say:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If something goes wrong, fix it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But a stronger agent needs to know:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What kind of wrong is this?
Did my assumption fail?
Did my boundary fail?
Did my tool fail?
Did my goal conflict with the environment?
Should I continue, ask, narrow, stop, or replan?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not just instruction following.&lt;/p&gt;

&lt;p&gt;That is self-diagnosis.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Four Failure Types
&lt;/h2&gt;

&lt;p&gt;When I look at long-running agent failures, I usually see four categories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram: Agent Failure Map
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfj1cdnffp2pd8uizxlq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfj1cdnffp2pd8uizxlq.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Assumption Failure
&lt;/h3&gt;

&lt;p&gt;The agent assumes something that is not true.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It assumes a community allows promotional posts because similar communities do.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5.2 Boundary Failure
&lt;/h3&gt;

&lt;p&gt;The agent does not recognize what it should not do.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It drafts outreach messages that violate platform rules or user trust.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5.3 Validation Failure
&lt;/h3&gt;

&lt;p&gt;The agent does not define how success will be checked.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It produces a list of targets without checking whether they are active.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5.4 Adjustment Failure
&lt;/h3&gt;

&lt;p&gt;The agent receives feedback but does not change its internal model.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It says "You're right" and repeats the same flawed strategy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This fourth type is the most important.&lt;/p&gt;

&lt;p&gt;Because if the agent has no adjustment loop, the other failures keep returning.&lt;/p&gt;
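&lt;p&gt;The four failure types map naturally onto four different recovery moves. A minimal, purely illustrative sketch:&lt;/p&gt;

```typescript
// Purely illustrative: each failure type from the map above gets its
// own recovery move. None of these names come from a real framework.
type FailureType = "assumption" | "boundary" | "validation" | "adjustment";

function nextAction(failure: FailureType): string {
  switch (failure) {
    case "assumption": return "re-verify the premise against observed data";
    case "boundary": return "stop and ask before acting outside allowed limits";
    case "validation": return "define a success check before producing output";
    case "adjustment": return "update the internal task model, then act again";
  }
}
```

&lt;p&gt;The value is not in the strings themselves but in forcing the agent to name which kind of failure it just had before it acts again.&lt;/p&gt;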




&lt;h2&gt;
  
  
  6. A Teacher AI -&amp;gt; Student AI Training Example
&lt;/h2&gt;

&lt;p&gt;Here is a simplified example of how an ontology-based teacher AI might train a student AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Student AI: Before training
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User:
Find forums where I can post about AI agent reliability.

Student AI:
I found 20 forums. Here are draft posts for each one.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer looks productive.&lt;/p&gt;

&lt;p&gt;But it skipped key checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are these forums active?&lt;/li&gt;
&lt;li&gt;Do they allow self-promotion?&lt;/li&gt;
&lt;li&gt;Are they about AI agents or general AI?&lt;/li&gt;
&lt;li&gt;Should the first action be posting, commenting, or observing?&lt;/li&gt;
&lt;li&gt;Is there a reputational risk?&lt;/li&gt;
&lt;/ul&gt;
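&lt;p&gt;The skipped checks can be made explicit as a pre-flight list. This is a hedged sketch; the field names are hypothetical stand-ins for whatever metadata the agent actually gathers about each forum.&lt;/p&gt;

```python
# Sketch: the key checks the untrained student AI skipped, expressed as
# an explicit pre-flight list. The field names are hypothetical.

def missing_checks(forum: dict) -> list:
    """Return which of the key checks have not been answered yet."""
    required = [
        "is_active",             # is the forum active?
        "allows_self_promotion", # does it allow promotional posts?
        "topic_is_ai_agents",    # about AI agents, or general AI?
        "first_action",          # post, comment, or observe first?
        "reputational_risk",     # is there a risk in posting here?
    ]
    return [key for key in required if key not in forum]
```

&lt;p&gt;A trained agent would refuse to draft posts while this list is non-empty.&lt;/p&gt;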

&lt;h3&gt;
  
  
  Teacher AI: Diagnosis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Teacher AI:
You performed external action, but you did not perform internal adjustment.

Before drafting posts, identify:
1. The target entity: which community?
2. The boundary: what does the community allow?
3. The validation point: what proves the community is relevant?
4. The risk: what behavior would be considered spam?
5. The next safe action: post, comment, observe, or skip?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Student AI: After training
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Student AI:
I should not draft posts yet.

First, I will classify each community by:
- relevance to AI agents
- activity level
- self-promotion rules
- preferred contribution style
- risk level

For high-risk communities, I will not post links.
I will first contribute comments and only share the longer article if someone asks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a small change.&lt;/p&gt;

&lt;p&gt;But it is the difference between a task executor and an agent that can adjust itself.&lt;/p&gt;
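&lt;p&gt;The after-training behavior amounts to a small decision policy: classify first, then pick the least risky useful action. Here is one way to sketch it, assuming hypothetical &lt;code&gt;Forum&lt;/code&gt; fields and thresholds; a real agent would fill these from actual community data.&lt;/p&gt;

```python
# Sketch of the trained student's policy: classify the community,
# then choose post, comment, observe, or skip. All fields and
# thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Forum:
    name: str
    active: bool
    allows_self_promotion: bool
    topic: str              # e.g. "ai-agents" or "general-ai"
    reputational_risk: str  # "low", "medium", "high"

def first_action(forum: Forum) -> str:
    """Decide the first safe action instead of drafting posts immediately."""
    if not forum.active:
        return "skip"
    if forum.topic != "ai-agents":
        return "observe"
    if forum.reputational_risk == "high" or not forum.allows_self_promotion:
        return "comment"  # contribute first; share links only if asked
    return "post"
```

&lt;p&gt;The point is not the specific rules but their order: relevance and risk are resolved before any content is drafted.&lt;/p&gt;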




&lt;h2&gt;
  
  
  7. From Prompt Template to Training Protocol
&lt;/h2&gt;

&lt;p&gt;Here is the practical shift:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt template mindset&lt;/th&gt;
&lt;th&gt;Training protocol mindset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tell the agent what to do&lt;/td&gt;
&lt;td&gt;Teach the agent how to recover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add more rules&lt;/td&gt;
&lt;td&gt;Diagnose failure modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimize first answer&lt;/td&gt;
&lt;td&gt;Improve multi-step behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prevent mistakes in advance&lt;/td&gt;
&lt;td&gt;Convert mistakes into adjustment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Focus on output&lt;/td&gt;
&lt;td&gt;Focus on action loop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why I think the future of AI agent reliability will not rest on prompt engineering alone.&lt;/p&gt;

&lt;p&gt;It will also involve agent training protocols, and not necessarily in the heavy machine-learning sense.&lt;/p&gt;

&lt;p&gt;Even structured conversations can train behavior if they repeatedly force the agent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;name the target&lt;/li&gt;
&lt;li&gt;define the boundary&lt;/li&gt;
&lt;li&gt;simulate failure&lt;/li&gt;
&lt;li&gt;validate action&lt;/li&gt;
&lt;li&gt;review feedback&lt;/li&gt;
&lt;li&gt;update strategy&lt;/li&gt;
&lt;/ul&gt;
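&lt;p&gt;The six steps above can be sketched as a structured conversation harness. This is an assumption-laden sketch: &lt;code&gt;agent&lt;/code&gt; here is any callable that takes a prompt and returns text, standing in for a real model call.&lt;/p&gt;

```python
# Sketch: forcing an agent through the six-step loop as a fixed
# sequence of prompts. `agent` is a stand-in for a real model call.

PROTOCOL = [
    "Name the target: what entity are you acting on?",
    "Define the boundary: what are you not allowed to do?",
    "Simulate failure: what is most likely to go wrong?",
    "Validate action: what evidence would confirm success?",
    "Review feedback: what did the last attempt actually produce?",
    "Update strategy: what will you change before the next attempt?",
]

def run_protocol(agent, task: str) -> list:
    """Ask every protocol question about the task, in order."""
    answers = []
    for step in PROTOCOL:
        answers.append(agent(f"Task: {task}\n{step}"))
    return answers

# Usage with a trivial stand-in agent that echoes the question:
echo_agent = lambda prompt: prompt.splitlines()[-1]
replies = run_protocol(echo_agent, "post about agent reliability")
```

&lt;p&gt;The harness does nothing clever. Its value is that no step can be skipped, which is exactly what the untrained agent kept doing.&lt;/p&gt;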

&lt;h3&gt;
  
  
  Diagram: A Minimal Training Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv4tk4dqyt80ce5in0q7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv4tk4dqyt80ce5in0q7.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What This Changes in Agent Design
&lt;/h2&gt;

&lt;p&gt;If this model is useful, then an AI agent prompt should not only contain task instructions.&lt;/p&gt;

&lt;p&gt;It should contain recovery questions.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before acting:
- What entity am I acting on?
- What boundary limits my action?
- What assumption am I relying on?
- What would prove that I am wrong?

After failure:
- Did the target change?
- Did the boundary change?
- Did my assumption fail?
- Do I need to ask, stop, narrow, or replan?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
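&lt;p&gt;Wired into an agent loop, those recovery questions look roughly like this. The sketch assumes two hypothetical callables: &lt;code&gt;act&lt;/code&gt; performs the external action and reports success, and &lt;code&gt;answer&lt;/code&gt; gets the agent's answer to one question.&lt;/p&gt;

```python
# Sketch: embedding the before/after recovery questions in one step of
# an agent loop. `act` and `answer` are hypothetical stand-ins.

BEFORE_ACTING = [
    "What entity am I acting on?",
    "What boundary limits my action?",
    "What assumption am I relying on?",
    "What would prove that I am wrong?",
]

AFTER_FAILURE = [
    "Did the target change?",
    "Did the boundary change?",
    "Did my assumption fail?",
    "Do I need to ask, stop, narrow, or replan?",
]

def guarded_step(act, answer):
    """Ask the pre-action questions, act, and diagnose on failure."""
    plan = {q: answer(q) for q in BEFORE_ACTING}
    succeeded = act(plan)
    if succeeded:
        return plan, None
    diagnosis = {q: answer(q) for q in AFTER_FAILURE}
    return plan, diagnosis
```

&lt;p&gt;On success the diagnosis is skipped; on failure the agent is forced to answer the after-failure questions before it is allowed to retry.&lt;/p&gt;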



&lt;p&gt;This is not a magic solution.&lt;/p&gt;

&lt;p&gt;It will not eliminate hallucination.&lt;/p&gt;

&lt;p&gt;It will not guarantee business outcomes.&lt;/p&gt;

&lt;p&gt;But it gives the agent a better structure for converting failure into adjustment.&lt;/p&gt;

&lt;p&gt;And that is one of the missing layers in long-running agent design.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The Checklist
&lt;/h2&gt;

&lt;p&gt;When diagnosing an AI agent, I would start with these 10 questions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Good sign&lt;/th&gt;
&lt;th&gt;Bad sign&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Does it define the target entity?&lt;/td&gt;
&lt;td&gt;It names what it acts on&lt;/td&gt;
&lt;td&gt;It acts on vague context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it define boundaries?&lt;/td&gt;
&lt;td&gt;It knows what not to do&lt;/td&gt;
&lt;td&gt;It overreaches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it define success checks?&lt;/td&gt;
&lt;td&gt;It validates progress&lt;/td&gt;
&lt;td&gt;It assumes completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it simulate failure?&lt;/td&gt;
&lt;td&gt;It predicts resistance&lt;/td&gt;
&lt;td&gt;It acts blindly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it notice missing data?&lt;/td&gt;
&lt;td&gt;It asks or narrows&lt;/td&gt;
&lt;td&gt;It invents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it classify feedback?&lt;/td&gt;
&lt;td&gt;It diagnoses failure type&lt;/td&gt;
&lt;td&gt;It says "sorry" and repeats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it update strategy?&lt;/td&gt;
&lt;td&gt;It changes its approach&lt;/td&gt;
&lt;td&gt;It rephrases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it know when to stop?&lt;/td&gt;
&lt;td&gt;It uses stop-loss&lt;/td&gt;
&lt;td&gt;It loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it escalate uncertainty?&lt;/td&gt;
&lt;td&gt;It asks for help&lt;/td&gt;
&lt;td&gt;It hides uncertainty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it record the adjustment?&lt;/td&gt;
&lt;td&gt;It learns within the session&lt;/td&gt;
&lt;td&gt;It forgets the correction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If an agent fails most of these, it probably does not need a longer prompt first.&lt;/p&gt;

&lt;p&gt;It needs an internal adjustment loop.&lt;/p&gt;
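&lt;p&gt;If you want to apply the 10 questions mechanically, they reduce to a simple audit score. The observation flags here are hypothetical; in practice they would come from reviewing agent transcripts by hand.&lt;/p&gt;

```python
# Sketch: the 10 diagnostic questions as a pass/fail audit.
# Observation flags are hypothetical, gathered from transcript review.

CHECKLIST = [
    "defines_target", "defines_boundaries", "defines_success_checks",
    "simulates_failure", "notices_missing_data", "classifies_feedback",
    "updates_strategy", "knows_when_to_stop", "escalates_uncertainty",
    "records_adjustment",
]

def audit(observations: dict) -> tuple:
    """Score the agent; failing most questions suggests it needs an
    adjustment loop before it needs a longer prompt."""
    score = sum(1 for q in CHECKLIST if observations.get(q, False))
    needs_adjustment_loop = score in range(0, len(CHECKLIST) // 2)
    return score, needs_adjustment_loop
```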




&lt;h2&gt;
  
  
  10. Open Question
&lt;/h2&gt;

&lt;p&gt;I am still testing this framework, so I am more interested in criticism than agreement.&lt;/p&gt;

&lt;p&gt;My current claim is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Long-running AI agents fail when they can perform external action but cannot convert feedback into internal adjustment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am curious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you see the same pattern in your own AI agents?&lt;/li&gt;
&lt;li&gt;Are there failure types this model misses?&lt;/li&gt;
&lt;li&gt;Have you found a better way to train recovery behavior?&lt;/li&gt;
&lt;li&gt;Is "ontology" the wrong word for this, even if the model is useful?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>promptengineering</category>
      <category>ai</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
