Forem: Artem KK

How to Build a Multi-Agent Pipeline That Doesn't Lose the Plot

Artem KK — Thu, 16 Apr 2026 06:04:43 +0000

The biggest problem with using LLMs for long-form content generation isn't the quality of the prose—it's the loss of coherence. You start with a brilliant premise, but by chapter five, your protagonist has forgotten their motivation, and the magic system has completely broken.

When you treat an LLM as a single, monolithic writer, you are asking it to perform three cognitively heavy tasks simultaneously: structural planning, character consistency, and atmospheric prose generation. Even with a massive context window, the "attention" drifts.

To solve this, I implemented a hierarchical, three-layer agentic architecture in NovelGenerator. Instead of one prompt, we use a pipeline of specialized agents where each layer's output becomes the "source of truth" for the next.

Here is how you can build a multi-agent pipeline that maintains narrative integrity.

The Architecture: The Hierarchy of Intent

The core principle is delegation. We move from high-level abstraction (the "what") to low-level implementation (the "how").

Structure Agent (The Architect): Defines the skeleton.
Character Agent (The Soul): Populates the skeleton with identity and memory.
Scene Agent (The Painter): Renders the final, atmospheric text.

Step 1: Establishing the Narrative Skeleton (Structure Agent)

The first agent's job is purely structural. It doesn't care about adjectives or dialogue. Its only goal is to ensure the plot follows a logical progression (e.g., Hero's Journey or Three-Act Structure).

The output of this agent must be highly structured—preferably JSON—so that subsequent agents can parse it without ambiguity.

// The blueprint that the next agents will consume
interface PlotOutline {
  title: string;
  acts: {
    actNumber: number;
    summary: string;
    chapters: Chapter[];
  }[];
}

interface Chapter {
  chapterNumber: number;
  setting: string;
  keyEvents: string[];
  requiredCharacters: string[];
}

class StructureAgent {
  // Any client exposing call(prompt) works here; the shape is kept inline for brevity
  constructor(private llm: { call(prompt: string): Promise<string> }) {}

  async generateOutline(premise: string): Promise<PlotOutline> {
    const prompt = `Analyze this premise: "${premise}".
      Create a structured 3-act plot outline in JSON format.
      Focus on logical causality and pacing.`;

    // Implementation calls LLM with JSON mode enabled
    const response = await this.llm.call(prompt);
    // JSON.parse only checks syntax; validate the shape before trusting it downstream
    return JSON.parse(response) as PlotOutline;
  }
}

By forcing the StructureAgent to work with keyEvents and requiredCharacters as discrete arrays, we prevent the "drifting plot" syndrome. The next agent isn't guessing what happens; it is following a checklist.

Step 2: Injecting Personality and Memory (Character Agent)

Once we have the chapters, we need to ensure that the characters behave consistently. If a character is established as "stoic and traumatized" in Chapter 1, they cannot suddenly become a "jovial comedian" in Chapter 3.

The CharacterAgent takes the requiredCharacters from the StructureAgent and expands them into deep profiles. It also manages the "memory" of these characters.

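interface CharacterProfile {
  name: string;
  traits: string[];
  backstory: string;
  internalConflict: string;
}

class CharacterAgent {
  constructor(private llm: { call(prompt: string): Promise<string> }) {}

  async enrichCharacters(chapters: Chapter[], characters: string[]): Promise<CharacterProfile[]> {
    const profiles: CharacterProfile[] = [];
    // Serialize the chapters once so every character prompt actually contains the outline
    const outlineJson = JSON.stringify(chapters, null, 2);

    for (const charName of characters) {
      const prompt = `Based on the plot outline, develop a deep profile for ${charName}.
        Define their traits, backstory, and how their internal conflict
        will drive the events in the provided chapters.
        Return JSON with the keys: name, traits, backstory, internalConflict.

        PLOT OUTLINE:
        ${outlineJson}`;

      const profile = await this.llm.call(prompt);
      profiles.push(JSON.parse(profile) as CharacterProfile);
    }

    return profiles;
  }
}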

The magic happens here: the CharacterAgent acts as a bridge. It takes the "skeleton" and adds "muscle." When the pipeline moves to the final stage, the prompt will include not just the chapter summary, but the specific psychological profile of every character present in that scene.

Step 3: Atmospheric Rendering (Scene Agent)

The final layer, the SceneAgent, is the most computationally expensive. This agent is responsible for the actual prose. Because the structural and character constraints are already baked into its context, it can focus entirely on sensory details, dialogue, and pacing.

The prompt for the SceneAgent is a synthesis of all previous layers:
[Structure Context] + [Character Context] + [Atmospheric Instructions] = Final Prose

class SceneAgent {
  constructor(private llm: { call(prompt: string): Promise<string> }) {}

  async generateScene(
    chapter: Chapter,
    characters: CharacterProfile[],
    settingContext: string
  ): Promise<string> {
    // Keep only the profiles of characters who appear in this chapter
    const characterContext = characters
      .filter(c => chapter.requiredCharacters.includes(c.name))
      .map(c => `${c.name}: ${c.traits.join(', ')}. Conflict: ${c.internalConflict}`)
      .join('\n');

    const prompt = `
      Write a detailed prose scene for Chapter ${chapter.chapterNumber}.

      CONTEXT:
      Setting: ${chapter.setting}
      Characters Present:
      ${characterContext}

      PLOT EVENTS TO COVER:
      ${chapter.keyEvents.map(e => `- ${e}`).join('\n')}

      STYLE GUIDE:
      Use sensory details (smell, sound, texture).
      Maintain a ${settingContext} atmosphere.
      Focus on the internal monologue of the characters.
    `;

    return await this.llm.call(prompt);
  }
}

By the time the SceneAgent receives the request, the "hard work" of logic and consistency is already done. It is simply "painting" the scene within the boundaries defined by the previous agents.

The Pipeline Orchestration

The real complexity lies in the orchestration. You need a controller that manages the state and ensures that the output of Agent A is correctly formatted for Agent B.

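class NovelGeneratorPipeline {
  constructor(
    private structureAgent: StructureAgent,
    private characterAgent: CharacterAgent,
    private sceneAgent: SceneAgent
  ) {}

  async run(premise: string): Promise<string> {
    // 1. Architect Phase
    const outline = await this.structureAgent.generateOutline(premise);
    // acts is an array, so flatten the chapters of every act into one list
    const chapters = outline.acts.flatMap(act => act.chapters);

    // 2. Soul Phase
    const allCharacters = this.extractCharactersFromOutline(outline);
    const characterProfiles = await this.characterAgent.enrichCharacters(chapters, allCharacters);

    // 3. Painter Phase
    const manuscript: string[] = [];
    for (const chapter of chapters) {
      const sceneProse = await this.sceneAgent.generateScene(
        chapter,
        characterProfiles,
        "dark and cinematic"
      );
      manuscript.push(sceneProse);
    }

    return manuscript.join('\n\n');
  }

  // Collect each required character once across all chapters
  private extractCharactersFromOutline(outline: PlotOutline): string[] {
    const names = outline.acts.flatMap(act =>
      act.chapters.flatMap(chapter => chapter.requiredCharacters)
    );
    return [...new Set(names)];
  }
}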

Lessons Learned

JSON is your best friend: Never let an agent return raw text if another agent needs to read it. Use structured outputs (JSON mode) to ensure the pipeline doesn't break.
Context Dilution: Don't pass the entire book to the SceneAgent. Only pass the characters and plot points relevant to the specific chapter being written. This keeps the focus sharp and saves tokens.
The "Identity" Problem: The CharacterAgent is the most critical for long-term consistency. If you skip this step, your characters will become generic archetypes within three chapters.

Building multi-agent systems is not about making one agent "smarter"; it is about creating a specialized assembly line where each worker has a narrow, well-defined task.

If you want to see a full implementation of this architecture, including how I handle long-context memory and EPUB generation, check out the source code here:

https://github.com/KazKozDev/NovelGenerator

#AI_Agents #TypeScript #SoftwareArchitecture #LLM

Why Book Translation Needs a Second Pass

Artem KK — Thu, 16 Apr 2026 04:41:43 +0000

Most LLM translation demos stop after a single generation pass. That is enough to preserve rough meaning, but not enough to preserve rhythm, tone, and narrative continuity across long chapters.

Book Translator uses a two-step workflow:

Draft translation for semantic fidelity.
Self-reflection pass for style, flow, and readability.

That extra pass matters because long-form translation quality breaks down in subtle ways. Literal phrasing accumulates. Transitional sentences become stiff. Paragraph rhythm starts sounding machine-generated even when each sentence is technically correct.

The project treats translation less like one-shot prompting and more like an editorial pipeline. It runs locally with Ollama, which keeps sensitive manuscripts off third-party APIs while still giving you a repeatable CLI workflow.

Key design choices:

chunking for long documents
local-first inference via Ollama
explicit self-reflection stage for refinement
CLI-first workflow for repeatable runs
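
The chunking and two-pass ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Book Translator's actual code: `chunk_paragraphs`, `translate_two_pass`, the prompt wording, and the `generate` callable (any function that sends a prompt to a model and returns its text) are all hypothetical names.

```python
from typing import Callable, List


def chunk_paragraphs(text: str, max_chars: int = 2000) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars characters."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Start a new chunk when adding this paragraph would overflow.
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def translate_two_pass(
    text: str,
    target_lang: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, model text out
    max_chars: int = 2000,
) -> str:
    """Draft each chunk for meaning, then revise the draft for style."""
    refined_chunks: List[str] = []
    for chunk in chunk_paragraphs(text, max_chars):
        # Pass 1: semantic fidelity.
        draft = generate(
            f"Translate into {target_lang}, staying faithful to the meaning:\n\n{chunk}"
        )
        # Pass 2: self-reflection on style, flow, and readability.
        refined = generate(
            f"Revise this {target_lang} translation for style, flow, and readability. "
            f"Do not change the meaning.\n\nSOURCE:\n{chunk}\n\nDRAFT:\n{draft}"
        )
        refined_chunks.append(refined)
    return "\n\n".join(refined_chunks)
```

Note that a single paragraph longer than `max_chars` is kept whole in this sketch; a real chunker would likely split it at sentence boundaries.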

If you are building long-form AI writing systems, the main lesson is simple: generation quality is often a workflow problem, not just a model problem.

Repo: https://github.com/KazKozDev/book-translator

I Studied How GitHub READMEs Are Actually Evaluated — Here Are the 5 Things That Matter

Artem KK — Sat, 04 Apr 2026 05:01:44 +0000

I spent weeks reading hiring threads, portfolio guides, recruiter-facing articles, Reddit discussions, and academic papers to answer one question: what do people actually look at when they evaluate a GitHub profile?

I expected to find a clear standard. I didn't.

What I found was more useful: most README "best practices" aren't rules — they're signals. And there's a formal framework for understanding why some signals matter and others don't.

I wrote up the full deep-dive with all sources and references. Here's the short version of what I verified.

1. Your README Is a Screening Surface, Not Documentation

People don't start with a deep code review. The first pass is shallow: they're scanning for signs of seriousness. Eye-tracking research on resume screening puts a recruiter's initial pass at about 7 seconds. Your README's first job isn't to explain everything. It's to justify continued attention.

2. Tests and CI Are Signals, Not Checkboxes

Tests, CI, .env.example, meaningful commits — these details keep showing up in advice because they compress information. They help a reviewer infer how you work. But here's the caveat: a CI pipeline on a three-file todo app is a weak signal. These things only matter when attached to substantive work.

3. The Clone Problem Is Real, but Misunderstood

The internet loves to say "remove your Netflix clone." The real issue isn't that familiar project shapes are bad — it's that simple clones signal tutorial-following more than independent judgment. The real divide is replication as an endpoint vs. replication as a starting point for something original.

4. Proof Beats Description — Every Time

A live demo, a screenshot, a short GIF, a deployment link, or even a small number of real users. What matters is whether the reader has to imagine the project works, or can see that it does. After I added a deployment link and a 10-second GIF to one of my projects, people stopped asking "what does it do?" and started asking implementation questions.

5. Writing Helps Only When It Extends Real Work

Blog posts don't replace projects. But "I built X, here's what broke, and here's what I learned" is powerful — because it shows how you think. A post-mortem of a real project is hard to fake. Generic "Top 10 tips" pieces are not.

What I'd Actually Do Now

If I were cleaning up a GitHub repo today, I'd focus on five things:

Explain what the project is in one clear sentence
Show proof that it works: demo, screenshot, deployment
Include signals of engineering discipline: tests, CI, setup clarity
Explain why the project exists, not just what it does
Add a section showing reflection: trade-offs, challenges, what you'd change

The question that ties it all together:

What uncertainty is this README removing for the person reading it?

The full article goes deeper into the academic research behind these ideas — including signaling theory, eye-tracking studies, and peer-reviewed work on how GitHub profiles are evaluated in hiring.

👉 Read the full deep-dive on Medium

📂 All sources and references are also available on GitHub: github.com/KazKozDev/github-rabbit-hole

What do you think is the stronger signal in a portfolio: a polished clone, or a rough project with real users? I'd love to hear your take in the comments.

Building a Perplexity Clone for Local LLMs in 50 Lines of Python

Artem KK — Fri, 20 Mar 2026 05:42:39 +0000

Your local LLM is smart but blind — it can't see the internet. Here's how to give it eyes, a filter, and a citation engine.

This is a hands-on tutorial. We'll install a library, run a real query, break down every stage of what happens inside, and look at the actual output your LLM receives.

By the end, you'll have a working pipeline that turns any local model (Ollama, LM Studio, anything with a text input) into something that searches the web, reads pages, ranks the results, and generates a structured prompt with inline citations — like a self-hosted Perplexity.

Background: If you want to understand the architecture this is based on, I wrote a deep dive into how Perplexity actually works — the five-stage RAG pipeline, hybrid retrieval on Vespa.ai, Cerebras-accelerated inference, the citation integrity problems. This tutorial is the practical counterpart.

Repo: github.com/KazKozDev/production_rag_pipeline

What We're Building

A pipeline that does this:

Your question
    ↓
Search (Bing + DuckDuckGo, parallel)
    ↓
Semantic pre-filter (drop irrelevant results before fetching)
    ↓
Fetch pages (only the ones that passed filtering)
    ↓
Extract content (strip boilerplate, ads, navigation)
    ↓
Chunk + Rerank (BM25 + semantic + answer-span + MMR)
    ↓
LLM-ready prompt with numbered citations

The pipeline does NOT include the LLM itself — it builds the prompt. You plug in whatever model you want.

Step 1: Installation

git clone https://github.com/KazKozDev/production_rag_pipeline.git
cd production_rag_pipeline

Pick your install level:

# Quote the extras: zsh (the macOS default shell) treats bare [brackets] as a glob pattern

# Minimal: BM25 ranking, BeautifulSoup extraction. No ML models.
pip install .

# Better extraction with trafilatura
pip install ".[extraction]"

# Semantic ranking with sentence-transformers (recommended)
pip install ".[semantic]"

# Everything
pip install ".[full]"

For this tutorial, use .[full]. First run will download embedding models (~100–500MB depending on language) — this only happens once.

No API keys needed. Bing and DuckDuckGo are queried without authentication.

Step 2: Your First Query — 3 Lines

from production_rag_pipeline import build_llm_prompt

prompt = build_llm_prompt("latest AI news", lang="en")
print(prompt)

That's the entire interface. build_llm_prompt runs the full pipeline — search, filter, fetch, extract, rerank — and returns a formatted string ready to paste into any LLM.

CLI alternative

production-rag-pipeline "latest AI news"

Or with options:

# Search-only mode (no page fetching)
production-rag-pipeline "Bitcoin price" --mode search

# Russian query
production-rag-pipeline "новости ИИ" --mode read --lang ru

macOS users

./run_llm_query.command

This bootstraps a virtual environment automatically on first run.

Step 3: What Just Happened — Stage by Stage

Let's trace what the pipeline actually does with "latest AI news". Enable debug mode to see it:

from production_rag_pipeline.pipeline import search_extract_rerank

chunks, results, fetched_urls = search_extract_rerank(
    query="latest AI news",
    num_fetch=8,
    lang="en",
    debug=True,
)

Stage 1: Dual-Engine Search

Bing and DuckDuckGo are searched in parallel. Results are merged with position-based scoring — first result from each engine scores highest, and results that appear in both engines get a boost.
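
To make the merge step concrete, here is a minimal sketch of position-based scoring with an overlap boost. The exact formula the library uses is not given above, so the reciprocal-position score, the `overlap_boost` value, and the `merge_results` name are assumptions for illustration only.

```python
from typing import Dict, List


def merge_results(
    bing: List[dict], ddg: List[dict], overlap_boost: float = 0.5
) -> List[dict]:
    """Merge two ranked result lists by URL (hypothetical scoring scheme)."""
    merged: Dict[str, dict] = {}
    for engine_results in (bing, ddg):
        for position, result in enumerate(engine_results):
            url = result["url"].rstrip("/")  # treat trailing-slash variants as one URL
            entry = merged.setdefault(
                url, {**result, "url": url, "score": 0.0, "engines": 0}
            )
            # First result from each engine scores 1.0, second 0.5, and so on.
            entry["score"] += 1.0 / (position + 1)
            entry["engines"] += 1
    for entry in merged.values():
        if entry["engines"] > 1:
            # Results returned by both engines get an extra boost.
            entry["score"] += overlap_boost
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)
```

With this scheme, a URL ranked second on one engine and first on the other outranks a URL that only one engine returned in first place.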

The pipeline detects keywords like "news", "latest", "breaking" and switches DDG to its News index — returning actual articles instead of generic homepages.

Stage 2: Semantic Pre-Filtering

This is the key optimization. Before fetching any page, the pipeline computes cosine similarity between the query embedding and each result's title+snippet embedding.

Results below threshold get dropped:

English: threshold 0.30
Russian: threshold 0.25

In practice, ~11 out of 20 results get filtered — saving about 6 seconds of HTTP fetches.

Example from a real run with "LLM agents news":

✗ flutrackers.com     sim=0.12  → filtered (irrelevant)
✓ llm-stats.com       sim=0.68  → fetched
✗ reddit.com/r/gaming  sim=0.15  → filtered
✓ arxiv.org/abs/2503   sim=0.71  → fetched

No hardcoded domain lists. Pure semantic relevance.
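
The pre-filter described above boils down to a cosine similarity check against a per-language threshold. The sketch below shows that logic with a pluggable `embed` function so it runs without any model download; `prefilter`, `cosine`, and the result dictionary keys (`title`, `snippet`) are illustrative assumptions, not the library's internals. Only the two threshold values come from the text above.

```python
import math
from typing import Callable, List, Sequence

THRESHOLDS = {"en": 0.30, "ru": 0.25}  # values quoted in the text above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefilter(
    query: str,
    results: List[dict],
    embed: Callable[[str], Sequence[float]],  # hypothetical: text in, vector out
    lang: str = "en",
) -> List[dict]:
    """Keep only results whose title+snippet is similar enough to the query."""
    threshold = THRESHOLDS.get(lang, 0.30)
    query_vec = embed(query)
    kept = []
    for result in results:
        text = f"{result['title']} {result['snippet']}"
        sim = cosine(query_vec, embed(text))
        if sim >= threshold:
            kept.append({**result, "sim": sim})
    return kept
```

With sentence-transformers installed, `embed` could be the `encode` method of a `SentenceTransformer("all-MiniLM-L6-v2")` instance; any function that maps text to a vector will do for experimenting.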

Stage 3: Parallel Fetch + Content Extraction

Surviving results (typically 5–9 URLs) are fetched in parallel. Content extraction runs a two-stage quality check:

Structural check: Does >30% of lines look like numbers/prices/tables?

Semantic check: If flagged, is the table relevant to the query?

This is how exchange rate tables from cbr.ru pass for a currency query (similarity 0.75) but CS:GO price lists get rejected (similarity 0.05).
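
A rough sketch of that two-stage check follows. The 30% line fraction comes from the description above; everything else (the character class that defines a "numeric" line, the 0.6 per-line ratio, the `min_similarity` cutoff, and the function names) is an assumption chosen to keep the example small.

```python
import re

# Characters that typically dominate price lists and tables (an assumption).
_NUMERIC_CHARS = re.compile(r"[\d.,:%$€₽+\-/|]")


def is_table_like_line(line: str, min_ratio: float = 0.6) -> bool:
    """True if most non-whitespace characters are digits or separators."""
    compact = re.sub(r"\s+", "", line)
    if not compact:
        return False
    numeric = len(_NUMERIC_CHARS.findall(compact))
    return numeric / len(compact) >= min_ratio


def looks_like_table(text: str, max_fraction: float = 0.30) -> bool:
    """Structural check: do more than 30% of non-empty lines look tabular?"""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    flagged = sum(is_table_like_line(ln) for ln in lines)
    return flagged / len(lines) > max_fraction


def keep_page(text: str, query_similarity: float, min_similarity: float = 0.30) -> bool:
    """Semantic check runs only for pages the structural check flagged."""
    if not looks_like_table(text):
        return True
    return query_similarity >= min_similarity
```

Under these assumptions a flagged page with similarity 0.75 is kept and one with similarity 0.05 is dropped, matching the two examples above.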

After extraction, boilerplate is stripped — navigation, ads, newsletter signup patterns, cookie banners.

Stage 4: Chunking + Multi-Signal Reranking

Extracted content is chunked, then reranked by four signals:

BM25 — classic lexical term-frequency matching
Semantic similarity — cosine between query and chunk embeddings
Answer-span detection — does this chunk directly answer the question?
MMR diversity — prevents top results from all being paraphrases of the same paragraph

Optional: a cross-encoder runs on the final shortlist for maximum accuracy (slower but better).
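
Two of these signals are easy to show in isolation. The sketch below implements plain Okapi BM25 and a greedy MMR selection; it is not the library's reranker. To stay dependency-free it uses Jaccard word overlap as a stand-in for embedding similarity inside MMR, and the parameter values (`k1`, `b`, `lam`) are common defaults rather than values taken from the project.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, standing in for embedding cosine similarity."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def mmr_select(docs: List[str], relevance: List[float], k: int = 3, lam: float = 0.7) -> List[int]:
    """Greedy MMR: trade relevance off against similarity to already-picked docs."""
    top = max(relevance, default=0.0) or 1.0
    rel = [r / top for r in relevance]  # scale to [0, 1] to match Jaccard's range
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr(i: int) -> float:
            redundancy = max((jaccard(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The point of MMR shows up when two chunks are near-duplicates: the second one may have a high BM25 score, but its redundancy penalty lets a less similar chunk take the slot instead.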

For news queries, freshness penalties apply:

Content >7 days old: −1 confidence
Content >30 days old: −2 confidence
Outdated sources flagged in the prompt with exact age
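
The two freshness tiers above translate directly into a small function. This is a sketch: the function names, the handling of missing dates, and the wording of the age note are assumptions; only the 7-day and 30-day cutoffs and their penalties come from the list above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_penalty(published: Optional[datetime], now: datetime) -> int:
    """Return the confidence adjustment for a news source of a given age."""
    if published is None:
        return 0  # assumption: sources with unknown dates are not penalized
    age = now - published
    if age > timedelta(days=30):
        return -2
    if age > timedelta(days=7):
        return -1
    return 0


def age_note(published: datetime, now: datetime) -> str:
    """Human-readable age that could be attached to an outdated source."""
    days = (now - published).days
    return f"published {days} days ago"


now = datetime(2026, 3, 20, tzinfo=timezone.utc)
week_old = now - timedelta(days=8)
print(freshness_penalty(week_old, now), age_note(week_old, now))
```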

Stage 5: Prompt Assembly with Citation Binding

The pipeline builds a structured prompt:

from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

context, source_mapping, grouped_sources = build_llm_context(
    chunks,
    results,
    fetched_urls=fetched_urls,
    renumber_sources=True,  # ← fixes phantom citation numbers
)

Citation numbers are renumbered after every filtering step. If three sources survive, they're numbered [1], [2], [3] — never [1], [3], [7] with phantom gaps.
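
The renumbering idea can be illustrated independently of the library. In this sketch the chunk shape (`source_id`, `text`) and both function names are hypothetical; it shows one way to map surviving source ids onto a gap-free 1..N sequence and to rewrite inline markers accordingly.

```python
import re
from typing import Dict, List, Tuple


def renumber_citations(chunks: List[dict]) -> Tuple[List[dict], Dict[int, int]]:
    """Map surviving source ids to 1..N in order of first appearance."""
    mapping: Dict[int, int] = {}
    renumbered = []
    for chunk in chunks:
        old_id = chunk["source_id"]
        new_id = mapping.setdefault(old_id, len(mapping) + 1)
        renumbered.append({**chunk, "source_id": new_id})
    return renumbered, mapping


def rewrite_markers(text: str, mapping: Dict[int, int]) -> str:
    """Rewrite inline [n] markers; markers for dropped sources are removed."""
    def swap(match: "re.Match[str]") -> str:
        old = int(match.group(1))
        return f"[{mapping[old]}]" if old in mapping else ""
    # A single regex pass avoids rewriting an already-renumbered marker twice.
    return re.sub(r"\[(\d+)\]", swap, text)
```

If sources 1, 3, and 7 survive filtering, the mapping becomes {1: 1, 3: 2, 7: 3}, so the prompt only ever shows [1], [2], [3].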

Current date and time are injected into the prompt so the LLM can reason about source freshness.

Step 4: What the Output Looks Like

The final prompt looks roughly like this (abbreviated):

Current date: 2026-03-20

Answer the user's question using ONLY the provided sources.
Cite sources using [1], [2], etc. Do not make claims without a citation.

=== SOURCES ===

[1] OpenAI announces GPT-5 Turbo with 1M context window
Source: techcrunch.com | Published: 2026-03-19
OpenAI today released GPT-5 Turbo, featuring a 1 million token
context window and improved reasoning capabilities...

[2] Google DeepMind publishes Gemini 2.5 technical report
Source: blog.google | Published: 2026-03-18
The technical report details architectural changes including
mixture-of-experts scaling to 3.2 trillion parameters...

[3] Anthropic raises $5B Series E at $90B valuation
Source: reuters.com | Published: 2026-03-17
Anthropic closed a $5 billion funding round, bringing its
total raised to over $15 billion...

=== QUESTION ===

latest AI news

Drop this into Ollama, LM Studio, or any API. The model sees curated, relevant, cited content — not raw web pages.
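
For Ollama specifically, sending the prompt can be done with the standard library alone. The sketch below assumes Ollama is running on its default local port and that a model has already been pulled; the model name `llama3.1` is only a placeholder, and `build_ollama_request` and `ask_ollama` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_ollama_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, model: str = "llama3.1", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_ollama_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With stream=False, Ollama returns one JSON object with the text in "response".
    return body["response"]
```

Calling `ask_ollama(prompt)` with the string returned by `build_llm_prompt` would then yield the model's cited answer, assuming the chosen model fits the prompt in its context window.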

Step 5: Configuration

Dataclass

from production_rag_pipeline import RAGConfig, build_llm_prompt

config = RAGConfig(
    num_per_engine=12,       # results per search engine
    top_n_fetch=8,           # max pages to fetch
    fetch_timeout=10,        # seconds per page
    total_context_chunks=12, # chunks in final prompt
)

prompt = build_llm_prompt("latest AI news", config=config)

YAML

production-rag-pipeline "latest AI news" --config config.example.yaml

Environment variables

export RAG_TOP_N_FETCH=8
export RAG_FETCH_TIMEOUT=10
production-rag-pipeline "latest AI news"
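
For readers curious how environment overrides like these can be layered on top of a dataclass, here is a generic sketch. It does not reproduce the library's loader: `SketchConfig` and `config_from_env` are hypothetical, and the default values simply reuse the numbers from the RAGConfig example above, which may differ from the library's real defaults.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass
class SketchConfig:
    # Field names mirror the RAGConfig example; defaults are illustrative only.
    num_per_engine: int = 12
    top_n_fetch: int = 8
    fetch_timeout: int = 10
    total_context_chunks: int = 12


def config_from_env(
    environ: Optional[Mapping[str, str]] = None, prefix: str = "RAG_"
) -> SketchConfig:
    """Override defaults with variables such as RAG_TOP_N_FETCH."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(SketchConfig):
        raw = environ.get(prefix + field.name.upper())
        if raw is not None:
            overrides[field.name] = int(raw)  # every field in this sketch is an int
    return SketchConfig(**overrides)
```

Passing a plain dictionary instead of `os.environ` makes the override logic easy to test without touching the real environment.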

Step 6: The 50-Line Version

Here's the entire pipeline, from query to LLM-ready prompt, using the module-level API:

from production_rag_pipeline.search import search
from production_rag_pipeline.fetch import fetch_pages_parallel
from production_rag_pipeline.extract import extract_content, chunk_text
from production_rag_pipeline.rerank import rerank_chunks
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

# 1. Search
query = "latest AI news"
results = search(query, num_per_engine=10, lang="en")

# 2. Fetch
urls = [r["url"] for r in results[:8]]
pages = fetch_pages_parallel(urls, timeout=10)

# 3. Extract + Chunk
all_chunks = []
for url, html in pages.items():
    text = extract_content(html, url=url)
    if text:
        chunks = chunk_text(text, url=url)
        all_chunks.extend(chunks)

# 4. Rerank
ranked = rerank_chunks(query, all_chunks, lang="en")

# 5. Build prompt
context, mapping, sources = build_llm_context(
    ranked, results, renumber_sources=True
)
prompt = build_llm_prompt(query, context=context, sources=sources)

print(prompt)

This is what build_llm_prompt("latest AI news") does internally, broken into visible steps.

Graceful Degradation

The pipeline works at every install level:

Install	Ranking	Extraction	Trade-off
`pip install .`	BM25 only	BeautifulSoup	Fastest, least accurate
`pip install ".[extraction]"`	BM25 only	Trafilatura	Better content quality
`pip install ".[semantic]"`	BM25 + semantic + MMR	BeautifulSoup	Much better ranking
`pip install ".[full]"`	BM25 + semantic + cross-encoder + MMR	Trafilatura	Best quality

No GPU required. Semantic models run on CPU — slower, but functional.

How It Compares to Perplexity

	Perplexity	production-rag-pipeline
Index	200B+ pre-indexed URLs	Real-time Bing + DDG
Latency	358ms median	8–15s on a MacBook
Models	20+ with dynamic routing	You choose (Ollama, LM Studio, etc.)
Inference	Cerebras CS-3, 1,200 tok/s	Your hardware
Cost	$20/mo Pro	Free
Privacy	Cloud	Local
Code	Closed	Open source, MIT

The gap is real — especially on latency and index size. But for a tool that runs on your laptop, feeds any local model, and costs nothing, the tradeoff is worth it.

Multilingual Support

The pipeline auto-detects language by Cyrillic character ratio (10% threshold):

English → all-MiniLM-L6-v2 (fast, English-optimized)
Russian → paraphrase-multilingual-MiniLM-L12-v2 (multilingual, 50+ languages)
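
The detection rule is simple enough to sketch. The 10% threshold and the two model names come from the text above; whether the ratio is computed over letters only or over all characters, and whether the comparison is inclusive, are assumptions here, as is the `detect_lang` name.

```python
import re

_CYRILLIC = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic Unicode block

EMBEDDING_MODELS = {
    "en": "all-MiniLM-L6-v2",
    "ru": "paraphrase-multilingual-MiniLM-L12-v2",
}


def detect_lang(text: str, threshold: float = 0.10) -> str:
    """Return 'ru' if at least 10% of the letters are Cyrillic, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"  # assumption: default to English when there are no letters
    cyrillic = sum(1 for ch in letters if _CYRILLIC.match(ch))
    return "ru" if cyrillic / len(letters) >= threshold else "en"


print(detect_lang("latest AI news"), detect_lang("новости ИИ"))
```

A low threshold like this means mixed queries such as "GPT и LLM" are still routed to the multilingual model.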

Cross-encoder reranking also switches models per language. No manual configuration needed.

production-rag-pipeline "новости ИИ" --lang ru

What's Next

This is Part 2 of a series:

Part 1 — How Perplexity Actually Searches the Internet (architecture teardown)
Part 2 — You're reading it (build the local equivalent)

Star the repo if this is useful: github.com/KazKozDev/production_rag_pipeline

Issues and contributions welcome.