<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Artem KK</title>
    <description>The latest articles on Forem by Artem KK (@kazkozdev).</description>
    <link>https://forem.com/kazkozdev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3834673%2Fe455ed15-4828-4ced-b5c4-97bf43cb3143.jpg</url>
      <title>Forem: Artem KK</title>
      <link>https://forem.com/kazkozdev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kazkozdev"/>
    <language>en</language>
    <item>
      <title>How to Build a Multi-Agent Pipeline That Doesn't Lose the Plot</title>
      <dc:creator>Artem KK</dc:creator>
      <pubDate>Thu, 16 Apr 2026 06:04:43 +0000</pubDate>
      <link>https://forem.com/kazkozdev/how-to-build-a-multi-agent-pipeline-that-doesnt-lose-the-plot-4ngb</link>
      <guid>https://forem.com/kazkozdev/how-to-build-a-multi-agent-pipeline-that-doesnt-lose-the-plot-4ngb</guid>
      <description>&lt;h1&gt;
  
  
  How to Build a Multi-Agent Pipeline That Doesn't Lose the Plot
&lt;/h1&gt;

&lt;p&gt;The biggest problem with using LLMs for long-form content generation isn't the quality of the prose—it's the loss of coherence. You start with a brilliant premise, but by chapter five, your protagonist has forgotten their motivation, and the magic system has completely broken.&lt;/p&gt;

&lt;p&gt;When you treat an LLM as a single, monolithic writer, you are asking it to perform three cognitively heavy tasks simultaneously: structural planning, character consistency, and atmospheric prose generation. Even with a massive context window, the "attention" drifts.&lt;/p&gt;

&lt;p&gt;To solve this, I implemented a hierarchical, three-layer agentic architecture in &lt;strong&gt;NovelGenerator&lt;/strong&gt;. Instead of one prompt, we use a pipeline of specialized agents where each layer's output becomes the "source of truth" for the next.&lt;/p&gt;

&lt;p&gt;Here is how you can build a multi-agent pipeline that maintains narrative integrity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: The Hierarchy of Intent
&lt;/h2&gt;

&lt;p&gt;The core principle is &lt;strong&gt;delegation&lt;/strong&gt;. We move from high-level abstraction (the "what") to low-level implementation (the "how").&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Structure Agent (The Architect):&lt;/strong&gt; Defines the skeleton.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character Agent (The Soul):&lt;/strong&gt; Populates the skeleton with identity and memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scene Agent (The Painter):&lt;/strong&gt; Renders the final, atmospheric text.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step 1: Establishing the Narrative Skeleton (Structure Agent)
&lt;/h2&gt;

&lt;p&gt;The first agent's job is purely structural. It doesn't care about adjectives or dialogue. Its only goal is to ensure the plot follows a logical progression (e.g., Hero's Journey or Three-Act Structure).&lt;/p&gt;

&lt;p&gt;The output of this agent must be highly structured—preferably JSON—so that subsequent agents can parse it without ambiguity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The blueprint that the next agents will consume&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PlotOutline&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nl"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="na"&gt;actNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;chapters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Chapter&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;Chapter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nl"&gt;chapterNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;keyEvents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="nl"&gt;requiredCharacters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StructureAgent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;generateOutline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;premise&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;PlotOutline&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Analyze this premise: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;premise&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;".
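Create a structured 3-act plot outline in JSON format with the fields
title and acts (actNumber, summary, chapters); give every chapter a
chapterNumber, setting, keyEvents array, and requiredCharacters array.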
Focus on logical causality and pacing.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Implementation calls LLM with JSON mode enabled&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By forcing the &lt;code&gt;StructureAgent&lt;/code&gt; to express &lt;code&gt;keyEvents&lt;/code&gt; and &lt;code&gt;requiredCharacters&lt;/code&gt; as discrete arrays, we guard against the "drifting plot" syndrome. The next agent isn't guessing what happens; it is following a checklist.&lt;/p&gt;
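&lt;p&gt;That checklist is only trustworthy if the JSON actually has the promised shape: &lt;code&gt;generateOutline&lt;/code&gt; hands the model's reply straight to &lt;code&gt;JSON.parse&lt;/code&gt; and returns whatever comes out. Below is a minimal sketch of a hand-written runtime check, assuming only the interfaces shown above. &lt;code&gt;parsePlotOutline&lt;/code&gt; and the &lt;code&gt;is*&lt;/code&gt; helpers are hypothetical names, and a schema-validation library would do the same job with less code.&lt;/p&gt;

```typescript
// Shapes mirror the PlotOutline and Chapter interfaces from the article.
interface Chapter {
  chapterNumber: number;
  setting: string;
  keyEvents: string[];
  requiredCharacters: string[];
}

interface PlotOutline {
  title: string;
  acts: { actNumber: number; summary: string; chapters: Chapter[] }[];
}

type Fields = { [key: string]: unknown };

function isObject(value: unknown): value is Fields {
  return typeof value === "object" ? value !== null : false;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) ? value.every((item) => typeof item === "string") : false;
}

function isChapter(value: unknown): value is Chapter {
  if (!isObject(value)) return false;
  return [
    typeof value.chapterNumber === "number",
    typeof value.setting === "string",
    isStringArray(value.keyEvents),
    isStringArray(value.requiredCharacters),
  ].every((ok) => ok);
}

function isPlotOutline(value: unknown): value is PlotOutline {
  if (!isObject(value)) return false;
  const acts = value.acts;
  if (typeof value.title !== "string" || !Array.isArray(acts)) return false;
  return acts.every((act: unknown) => {
    if (!isObject(act)) return false;
    const chapters = act.chapters;
    return [
      typeof act.actNumber === "number",
      typeof act.summary === "string",
      Array.isArray(chapters) ? chapters.every(isChapter) : false,
    ].every((ok) => ok);
  });
}

// Parse the StructureAgent's raw reply and fail loudly on the wrong shape,
// so a malformed outline never reaches the CharacterAgent.
function parsePlotOutline(raw: string): PlotOutline {
  const data: unknown = JSON.parse(raw);
  if (!isPlotOutline(data)) {
    throw new Error("Model output does not match the PlotOutline shape");
  }
  return data;
}
```

&lt;p&gt;Inside &lt;code&gt;generateOutline&lt;/code&gt;, returning &lt;code&gt;parsePlotOutline(response)&lt;/code&gt; instead of &lt;code&gt;JSON.parse(response)&lt;/code&gt; turns a silently wrong outline into an immediate, retryable error.&lt;/p&gt;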




&lt;h2&gt;
  
  
  Step 2: Injecting Personality and Memory (Character Agent)
&lt;/h2&gt;

&lt;p&gt;Once we have the chapters, we need to ensure that the characters behave consistently. If a character is established as "stoic and traumatized" in Chapter 1, they cannot suddenly become a "jovial comedian" in Chapter 3.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CharacterAgent&lt;/code&gt; takes the &lt;code&gt;requiredCharacters&lt;/code&gt; from the &lt;code&gt;StructureAgent&lt;/code&gt; and expands them into deep profiles. It also manages the "memory" of these characters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CharacterProfile&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;traits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="nl"&gt;backstory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nl"&gt;internalConflict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CharacterAgent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;enrichCharacters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chapters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Chapter&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CharacterProfile&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CharacterProfile&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;charName&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Based on the plot outline, develop a deep profile for &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;charName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.
Define their traits, backstory, and how their internal conflict
will drive the events in the provided chapters.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;profiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;profiles&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic happens here: the &lt;code&gt;CharacterAgent&lt;/code&gt; acts as a bridge. It takes the "skeleton" and adds "muscle." When the pipeline moves to the final stage, the prompt will include not just the chapter's setting and key events, but also the traits and internal conflict of every character required in that chapter.&lt;/p&gt;
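&lt;p&gt;The &lt;code&gt;CharacterAgent&lt;/code&gt; code above builds each profile once, but the step description also gives it the job of managing the characters' "memory." One minimal way to do that is a store the pipeline writes to after each chapter and reads from when it builds the next scene prompt. This sketch assumes "memory" means a running log of the facts each chapter establishes. The &lt;code&gt;CharacterMemory&lt;/code&gt; class and its methods are hypothetical and are not taken from NovelGenerator.&lt;/p&gt;

```typescript
// Hypothetical per-character memory (not NovelGenerator's implementation): a log of
// facts established in each chapter, recalled when the next scene prompt is built.
interface MemoryEntry {
  chapterNumber: number;
  fact: string;
}

class CharacterMemory {
  // Keyed by character name; each list stays in the order facts were recorded.
  // Object.create(null) avoids collisions with prototype keys such as "constructor".
  private entries: { [name: string]: MemoryEntry[] } = Object.create(null);

  remember(name: string, chapterNumber: number, fact: string): void {
    const list = this.entries[name] ?? [];
    list.push({ chapterNumber, fact });
    this.entries[name] = list;
  }

  // Up to `limit` facts recorded before `beforeChapter`, most recent last,
  // formatted as lines that can be appended to a character's prompt context.
  recall(name: string, beforeChapter: number, limit: number): string[] {
    if (!(limit > 0)) return [];
    const earlier = (this.entries[name] ?? []).filter(
      (e) => beforeChapter > e.chapterNumber,
    );
    return earlier.slice(-limit).map((e) => `Ch. ${e.chapterNumber}: ${e.fact}`);
  }
}
```

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; cap applies the same idea as passing only chapter-relevant context. Recent facts are usually enough to keep behavior consistent, and replaying a character's whole history would crowd out the scene itself.&lt;/p&gt;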




&lt;h2&gt;
  
  
  Step 3: Atmospheric Rendering (Scene Agent)
&lt;/h2&gt;

&lt;p&gt;The final layer, the &lt;code&gt;SceneAgent&lt;/code&gt;, is the most computationally expensive. This agent is responsible for the actual prose. Because the structural and character constraints are already baked into its context, it can focus entirely on sensory details, dialogue, and pacing.&lt;/p&gt;

&lt;p&gt;The prompt for the &lt;code&gt;SceneAgent&lt;/code&gt; is a synthesis of all previous layers:&lt;br&gt;
&lt;code&gt;[Structure Context] + [Character Context] + [Atmospheric Instructions] = Final Prose&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SceneAgent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;generateScene&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nx"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Chapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nx"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CharacterProfile&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="nx"&gt;settingContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;characterContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;characters&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiredCharacters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;. Conflict: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;internalConflict&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
Write a detailed prose scene for Chapter &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chapterNumber&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.

CONTEXT:
Setting: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
Characters Present:
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;characterContext&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;

PLOT EVENTS TO COVER:
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;keyEvents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`- &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

STYLE GUIDE:
Use sensory details (smell, sound, texture).
Maintain a &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;settingContext&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; atmosphere.
Focus on the internal monologue of the characters.
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
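&lt;p&gt;The &lt;code&gt;filter&lt;/code&gt; in &lt;code&gt;generateScene&lt;/code&gt; relies on &lt;code&gt;requiredCharacters.includes(c.name)&lt;/code&gt;, which is an exact string match. If the &lt;code&gt;CharacterAgent&lt;/code&gt; returns "mara voss" where the outline says "Mara Voss", that profile silently drops out of the scene prompt. The sketch below matches names more loosely and reports any required name that has no profile. It assumes case and spacing are the only variations worth tolerating, and &lt;code&gt;selectProfiles&lt;/code&gt; and &lt;code&gt;normalizeName&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```typescript
interface CharacterProfile {
  name: string;
  traits: string[];
  backstory: string;
  internalConflict: string;
}

// Case- and whitespace-insensitive key, so "Mara  Voss" and "mara voss" match.
function normalizeName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hypothetical helper: the profiles for one chapter, plus the required names that
// have no profile at all, instead of dropping them silently.
function selectProfiles(
  requiredCharacters: string[],
  profiles: CharacterProfile[],
): { present: CharacterProfile[]; missing: string[] } {
  // Prototype-free lookup table, so a name like "constructor" cannot collide.
  const byName: { [key: string]: CharacterProfile } = Object.create(null);
  for (const profile of profiles) {
    byName[normalizeName(profile.name)] = profile;
  }

  const present: CharacterProfile[] = [];
  const missing: string[] = [];
  for (const name of requiredCharacters) {
    const key = normalizeName(name);
    if (key in byName) {
      present.push(byName[key]);
    } else {
      missing.push(name);
    }
  }
  return { present, missing };
}
```

&lt;p&gt;A non-empty &lt;code&gt;missing&lt;/code&gt; list is a good point to stop and regenerate profiles. Otherwise the &lt;code&gt;SceneAgent&lt;/code&gt; has to invent the character from the name alone.&lt;/p&gt;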



&lt;p&gt;By the time the &lt;code&gt;SceneAgent&lt;/code&gt; receives the request, the "hard work" of logic and consistency is already done. It is simply "painting" the scene within the boundaries defined by the previous agents.&lt;/p&gt;
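&lt;p&gt;Those boundaries only hold if the model respects them, so it can help to check each chapter against the checklist the &lt;code&gt;StructureAgent&lt;/code&gt; produced. This deliberately crude sketch flags required characters whose names never appear in the generated prose. &lt;code&gt;findMissingCharacters&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```typescript
// Hypothetical post-check on SceneAgent output: which required characters are never
// named in the prose? It is a plain substring test, so it is deliberately crude.
function findMissingCharacters(prose: string, requiredCharacters: string[]): string[] {
  const text = prose.toLowerCase();
  return requiredCharacters.filter((name) => {
    // Any part of the name counts, e.g. "Mara" or "Voss" for "Mara Voss".
    const parts = name
      .toLowerCase()
      .split(/\s+/)
      .filter((part) => part.length > 0);
    return !parts.some((part) => text.includes(part));
  });
}
```

&lt;p&gt;Because this is a substring test, a short name can match inside a longer word. A character referred to only by pronoun or title will also be flagged. Treat a hit as a reason to review or regenerate the chapter, not as proof that it failed.&lt;/p&gt;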




&lt;h2&gt;
  
  
  The Pipeline Orchestration
&lt;/h2&gt;

&lt;p&gt;The real complexity lies in the orchestration. You need a controller that manages the state and ensures that the output of &lt;code&gt;Agent A&lt;/code&gt; is correctly formatted for &lt;code&gt;Agent B&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NovelGeneratorPipeline&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;premise&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Architect Phase&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;structureAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateOutline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;premise&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Soul Phase&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allCharacters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractCharactersFromOutline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;characterProfiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;characterAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enrichCharacters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chapters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;allCharacters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Painter Phase&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;manuscript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chapter&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;outline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chapters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sceneProse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sceneAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateScene&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nx"&gt;chapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nx"&gt;characterProfiles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dark and cinematic&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;manuscript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sceneProse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;manuscript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
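&lt;p&gt;As written, &lt;code&gt;run()&lt;/code&gt; keeps everything in local variables. An error in a late chapter therefore throws away the outline, the profiles, and every chapter already written. Below is a minimal sketch of the state such a controller could own, assuming a run is resumed from a JSON snapshot. &lt;code&gt;PipelineState&lt;/code&gt; and the helper functions are hypothetical names.&lt;/p&gt;

```typescript
// Hypothetical checkpoint for a long run (not NovelGenerator's format). It holds only
// strings and plain objects, so JSON.stringify can save it after every chapter and
// JSON.parse can restore it when a crashed run is restarted.
interface PipelineState {
  premise: string;
  outlineJson: string | null; // accepted StructureAgent output, once available
  profilesJson: string | null; // accepted CharacterAgent output, once available
  chapters: { [chapterNumber: string]: string }; // finished prose per chapter
}

function createState(premise: string): PipelineState {
  return { premise, outlineJson: null, profilesJson: null, chapters: {} };
}

// Chapters that still need a SceneAgent call, in outline order.
function pendingChapters(state: PipelineState, chapterNumbers: number[]): number[] {
  return chapterNumbers.filter((n) => !(String(n) in state.chapters));
}

function recordChapter(state: PipelineState, chapterNumber: number, prose: string): void {
  state.chapters[String(chapterNumber)] = prose;
}

// Finished chapters in outline order, joined the same way run() joins its manuscript.
function assembleManuscript(state: PipelineState, chapterNumbers: number[]): string {
  return chapterNumbers
    .filter((n) => String(n) in state.chapters)
    .map((n) => state.chapters[String(n)])
    .join("\n\n");
}
```

&lt;p&gt;The controller would save the snapshot after each &lt;code&gt;recordChapter&lt;/code&gt; call. On restart, it would call the &lt;code&gt;SceneAgent&lt;/code&gt; only for &lt;code&gt;pendingChapters&lt;/code&gt;, so a failure costs one chapter instead of the whole book.&lt;/p&gt;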



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
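&lt;strong&gt;JSON is your best friend:&lt;/strong&gt; Never let an agent return raw text if another agent needs to read it. Use structured outputs (JSON mode) so the pipeline doesn't break, and still validate what comes back. JSON mode on its own generally guarantees parseable JSON, not the exact fields the next agent expects.&lt;/li&gt;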
&lt;li&gt;
&lt;strong&gt;Context Dilution:&lt;/strong&gt; Don't pass the &lt;em&gt;entire&lt;/em&gt; book to the &lt;code&gt;SceneAgent&lt;/code&gt;. Only pass the characters and plot points relevant to the &lt;em&gt;specific&lt;/em&gt; chapter being written. This keeps the focus sharp and saves tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Identity" Problem:&lt;/strong&gt; The &lt;code&gt;CharacterAgent&lt;/code&gt; is the most critical for long-term consistency. If you skip this step, your characters will become generic archetypes within three chapters.&lt;/li&gt;
&lt;/ol&gt;
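&lt;p&gt;Not every model or server offers a JSON mode. Models without one often wrap the object in a Markdown fence or a line of chatty preamble. The sketch below is a defensive helper for that case, assuming the payload is a single JSON object. &lt;code&gt;extractJsonObject&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```typescript
// Hypothetical helper: keep only the outermost {...} span of a model reply and parse it.
// This tolerates Markdown fences and preamble, but it still throws on malformed JSON,
// and it assumes the reply contains one top-level object (not an array).
function extractJsonObject(raw: string): unknown {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || start > end) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(raw.slice(start, end + 1));
}
```

&lt;p&gt;This only strips formatting noise. It does not check field names or types, so it belongs in front of a shape check, with a re-prompt when parsing still fails.&lt;/p&gt;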

&lt;p&gt;Building multi-agent systems is not about making one agent "smarter"; it is about creating a specialized assembly line where each worker has a narrow, well-defined task.&lt;/p&gt;

&lt;p&gt;If you want to see a full implementation of this architecture, including how I handle long-context memory and EPUB generation, check out the source code here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/KazKozDev/NovelGenerator" rel="noopener noreferrer"&gt;https://github.com/KazKozDev/NovelGenerator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#AI_Agents #TypeScript #SoftwareArchitecture #LLM&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why Book Translation Needs a Second Pass</title>
      <dc:creator>Artem KK</dc:creator>
      <pubDate>Thu, 16 Apr 2026 04:41:43 +0000</pubDate>
      <link>https://forem.com/kazkozdev/why-book-translation-needs-a-second-pass-5e75</link>
      <guid>https://forem.com/kazkozdev/why-book-translation-needs-a-second-pass-5e75</guid>
      <description>&lt;h1&gt;
  
  
  Why Book Translation Needs a Second Pass
&lt;/h1&gt;

&lt;p&gt;Most LLM translation demos stop after a single generation pass. That is enough to preserve rough meaning, but not enough to preserve rhythm, tone, and narrative continuity across long chapters.&lt;/p&gt;

&lt;p&gt;Book Translator uses a two-step workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Draft translation for semantic fidelity.&lt;/li&gt;
&lt;li&gt;Self-reflection pass for style, flow, and readability.&lt;/li&gt;
&lt;/ol&gt;
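&lt;p&gt;In code, the two passes differ mainly in what each prompt asks for and what it can see. The sketch below is an illustration under assumptions: the prompt wording and the function names are invented here and are not Book Translator's actual prompts. The key design choice is that the second pass receives both the source and the draft. That lets it smooth the phrasing without drifting from the original meaning.&lt;/p&gt;

```typescript
// Sketch of the two prompts in a draft-then-reflect workflow. The wording and the
// function names are assumptions for illustration, not Book Translator's prompts.

// Pass 1: meaning first. Ask for a faithful translation and nothing else.
function draftPrompt(chunk: string, sourceLang: string, targetLang: string): string {
  return [
    `Translate the following ${sourceLang} text into ${targetLang}.`,
    "Preserve the meaning of every sentence; do not add or omit content.",
    "Return only the translation.",
    "",
    chunk,
  ].join("\n");
}

// Pass 2: style. The model sees the source AND the draft, so it can fix stiff
// phrasing and rhythm while checking each change against the original.
function reflectionPrompt(
  chunk: string,
  draft: string,
  sourceLang: string,
  targetLang: string,
): string {
  return [
    `Below is a ${sourceLang} passage and a draft ${targetLang} translation of it.`,
    `Revise the draft so it reads like natural ${targetLang} prose:`,
    "fix literal phrasing, stiff transitions, and uneven paragraph rhythm,",
    "but keep the meaning of the source exactly. Return only the revised translation.",
    "",
    "SOURCE:",
    chunk,
    "",
    "DRAFT:",
    draft,
  ].join("\n");
}

// Wiring, with `generate` standing in for any async call to a local model:
//   const draft = await generate(draftPrompt(chunk, "German", "English"));
//   const final = await generate(reflectionPrompt(chunk, draft, "German", "English"));
```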

&lt;p&gt;That extra pass matters because long-form translation quality breaks down in subtle ways. Literal phrasing accumulates. Transitional sentences become stiff. Paragraph rhythm starts sounding machine-generated even when each sentence is technically correct.&lt;/p&gt;

&lt;p&gt;The project treats translation less like one-shot prompting and more like an editorial pipeline. It runs locally with Ollama, which keeps sensitive manuscripts off third-party APIs while still giving you a repeatable CLI workflow.&lt;/p&gt;
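&lt;p&gt;For reference, a single non-streaming request to a local Ollama server can be as small as the sketch below. It assumes Ollama's default address (&lt;code&gt;http://localhost:11434&lt;/code&gt;), its &lt;code&gt;/api/generate&lt;/code&gt; endpoint, and a model you have already pulled. &lt;code&gt;ollamaGenerate&lt;/code&gt; and &lt;code&gt;buildGenerateRequest&lt;/code&gt; are hypothetical names and are not part of Book Translator.&lt;/p&gt;

```typescript
// Minimal sketch of one non-streaming call to a local Ollama server. The request
// fields (model, prompt, stream) and the `response` field of the reply follow
// Ollama's /api/generate endpoint.
interface OllamaGenerateResponse {
  response: string;
}

function buildGenerateRequest(model: string, prompt: string) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // stream: false returns one JSON object instead of a stream of partial chunks.
    body: JSON.stringify({ model, prompt, stream: false }),
  };
}

async function ollamaGenerate(model: string, prompt: string, baseUrl = "http://localhost:11434") {
  const res = await fetch(`${baseUrl}/api/generate`, buildGenerateRequest(model, prompt));
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as OllamaGenerateResponse;
  return data.response;
}
```

&lt;p&gt;A function like this could back both the draft and the reflection pass. Setting &lt;code&gt;stream&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; gives up token-by-token progress in exchange for a single, simpler response.&lt;/p&gt;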

&lt;p&gt;Key design choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chunking for long documents&lt;/li&gt;
&lt;li&gt;local-first inference via Ollama&lt;/li&gt;
&lt;li&gt;explicit self-reflection stage for refinement&lt;/li&gt;
&lt;li&gt;CLI-first workflow for repeatable runs&lt;/li&gt;
&lt;/ul&gt;
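&lt;p&gt;Chunking is what keeps a whole book inside a local model's context window. The sketch below packs whole paragraphs into each chunk. It assumes paragraphs are separated by blank lines and treats a character budget as a good-enough stand-in for tokens. &lt;code&gt;chunkByParagraph&lt;/code&gt; is a hypothetical name, not the project's implementation.&lt;/p&gt;

```typescript
// Hypothetical paragraph-aware chunker: packs whole paragraphs into chunks of up to
// `maxChars` characters. Characters are a rough proxy for tokens; size the budget
// to your model's context window with room left for the prompt and the output.
function chunkByParagraph(text: string, maxChars: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    const candidate = current === "" ? para : `${current}\n\n${para}`;
    if (candidate.length > maxChars) {
      // Close the current chunk before it overflows. A single paragraph longer than
      // the budget becomes its own oversized chunk rather than being cut mid-sentence.
      if (current !== "") chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

&lt;p&gt;Keeping paragraphs whole means both passes see complete units of rhythm rather than fragments that end mid-thought.&lt;/p&gt;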

&lt;p&gt;If you are building long-form AI writing systems, the main lesson is simple: generation quality is often a workflow problem, not just a model problem.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/KazKozDev/book-translator" rel="noopener noreferrer"&gt;https://github.com/KazKozDev/book-translator&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
    <item>
      <title>I Studied How GitHub READMEs Are Actually Evaluated — Here Are the 5 Things That Matter</title>
      <dc:creator>Artem KK</dc:creator>
      <pubDate>Sat, 04 Apr 2026 05:01:44 +0000</pubDate>
      <link>https://forem.com/kazkozdev/i-studied-how-github-readmes-are-actually-evaluated-here-are-the-5-things-that-matter-2epd</link>
      <guid>https://forem.com/kazkozdev/i-studied-how-github-readmes-are-actually-evaluated-here-are-the-5-things-that-matter-2epd</guid>
      <description>&lt;p&gt;I spent weeks reading hiring threads, portfolio guides, recruiter-facing articles, Reddit discussions, and academic papers to answer one question: &lt;strong&gt;what do people actually look at when they evaluate a GitHub profile?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I expected to find a clear standard. I didn't.&lt;/p&gt;

&lt;p&gt;What I found was more useful: most README "best practices" aren't rules — they're &lt;strong&gt;signals&lt;/strong&gt;. And there's a formal framework for understanding why some signals matter and others don't.&lt;/p&gt;

&lt;p&gt;I wrote up the full deep-dive with all sources and references. Here's the short version of what I verified.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Your README Is a Screening Surface, Not Documentation
&lt;/h2&gt;

&lt;p&gt;People don't start with a deep code review. The first pass is shallow: they're scanning for signs of seriousness. Eye-tracking research on résumé screening puts a recruiter's initial look at about 7 seconds. Your README's first job isn't to explain everything. It's to &lt;strong&gt;justify continued attention&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tests and CI Are Signals, Not Checkboxes
&lt;/h2&gt;

&lt;p&gt;Tests, CI, &lt;code&gt;.env.example&lt;/code&gt;, meaningful commits — these details keep showing up in advice because they compress information. They help a reviewer infer &lt;em&gt;how you work&lt;/em&gt;. But here's the caveat: a CI pipeline on a three-file todo app is a weak signal. These things only matter when attached to substantive work.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The Clone Problem Is Real, but Misunderstood
&lt;/h2&gt;

&lt;p&gt;The internet loves to say "remove your Netflix clone." The real issue isn't that familiar project shapes are bad — it's that simple clones signal tutorial-following more than independent judgment. The real divide is &lt;strong&gt;replication as an endpoint&lt;/strong&gt; vs. &lt;strong&gt;replication as a starting point for something original&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Proof Beats Description — Every Time
&lt;/h2&gt;

&lt;p&gt;A live demo, a screenshot, a short GIF, a deployment link, or even a small number of real users. What matters is whether the reader has to &lt;em&gt;imagine&lt;/em&gt; the project works, or can &lt;em&gt;see&lt;/em&gt; that it does. After I added a deployment link and a 10-second GIF to one of my projects, people stopped asking "what does it do?" and started asking implementation questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Writing Helps Only When It Extends Real Work
&lt;/h2&gt;

&lt;p&gt;Blog posts don't replace projects. But "I built X, here's what broke, and here's what I learned" is powerful — because it shows how you think. A post-mortem of a real project is hard to fake. Generic "Top 10 tips" pieces are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Actually Do Now
&lt;/h2&gt;

&lt;p&gt;If I were cleaning up a GitHub repo today, I'd focus on five things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Explain what the project is in &lt;strong&gt;one clear sentence&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Show &lt;strong&gt;proof&lt;/strong&gt; that it works: demo, screenshot, deployment&lt;/li&gt;
&lt;li&gt;Include signals of engineering discipline: tests, CI, setup clarity&lt;/li&gt;
&lt;li&gt;Explain &lt;strong&gt;why&lt;/strong&gt; the project exists, not just what it does&lt;/li&gt;
&lt;li&gt;Add a section showing &lt;strong&gt;reflection&lt;/strong&gt;: trade-offs, challenges, what you'd change&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The question that ties it all together:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What uncertainty is this README removing for the person reading it?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;The full article goes deeper into the academic research behind these ideas — including signaling theory, eye-tracking studies, and peer-reviewed work on how GitHub profiles are evaluated in hiring.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://medium.com/local-llm-lab/the-great-readme-hunt-what-readme-best-practices-actually-signal-d9df9782b512" rel="noopener noreferrer"&gt;Read the full deep-dive on Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📂 All sources and references are also available on GitHub: &lt;a href="https://github.com/KazKozDev/github-rabbit-hole" rel="noopener noreferrer"&gt;github.com/KazKozDev/github-rabbit-hole&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What do you think is the stronger signal in a portfolio: a polished clone, or a rough project with real users?&lt;/strong&gt; I'd love to hear your take in the comments.&lt;/p&gt;

</description>
      <category>github</category>
      <category>softwareengineering</category>
      <category>career</category>
      <category>portfolio</category>
    </item>
    <item>
      <title>Building a Perplexity Clone for Local LLMs in 50 Lines of Python</title>
      <dc:creator>Artem KK</dc:creator>
      <pubDate>Fri, 20 Mar 2026 05:42:39 +0000</pubDate>
      <link>https://forem.com/kazkozdev/building-a-perplexity-clone-for-local-llms-in-50-lines-of-python-2p79</link>
      <guid>https://forem.com/kazkozdev/building-a-perplexity-clone-for-local-llms-in-50-lines-of-python-2p79</guid>
      <description>&lt;p&gt;Your local LLM is smart but blind — it can't see the internet. Here's how to give it eyes, a filter, and a citation engine.&lt;/p&gt;




&lt;p&gt;This is a hands-on tutorial. We'll install a library, run a real query, break down every stage of what happens inside, and look at the actual output your LLM receives.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a working pipeline that turns any local model (served through Ollama, LM Studio, or anything else that accepts a text prompt) into something that searches the web, reads pages, ranks the results, and generates a structured prompt with inline citations, much like a self-hosted Perplexity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; If you want to understand the architecture this is based on, I wrote a &lt;a href="https://medium.com/@kazkozdev/how-perplexity-actually-searches-the-internet-ae4b50dd9837" rel="noopener noreferrer"&gt;deep dive into how Perplexity actually works&lt;/a&gt; — the five-stage RAG pipeline, hybrid retrieval on Vespa.ai, Cerebras-accelerated inference, the citation integrity problems. This tutorial is the practical counterpart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/KazKozDev/production_rag_pipeline" rel="noopener noreferrer"&gt;github.com/KazKozDev/production_rag_pipeline&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;A pipeline that does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your question
    ↓
Search (Bing + DuckDuckGo, parallel)
    ↓
Semantic pre-filter (drop irrelevant results before fetching)
    ↓
Fetch pages (only the ones that passed filtering)
    ↓
Extract content (strip boilerplate, ads, navigation)
    ↓
Chunk + Rerank (BM25 + semantic + answer-span + MMR)
    ↓
LLM-ready prompt with numbered citations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline does NOT include the LLM itself — it builds the prompt. You plug in whatever model you want.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/KazKozDev/production_rag_pipeline.git
&lt;span class="nb"&gt;cd &lt;/span&gt;production_rag_pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pick your install level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Minimal — BM25 ranking, BeautifulSoup extraction. No ML models.&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Better extraction with trafilatura&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; .[extraction]

&lt;span class="c"&gt;# Semantic ranking with sentence-transformers (recommended)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; .[semantic]

&lt;span class="c"&gt;# Everything&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; .[full]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this tutorial, use &lt;code&gt;.[full]&lt;/code&gt;. The first run downloads embedding models (~100–500MB, depending on language); this happens only once.&lt;/p&gt;

&lt;p&gt;No API keys needed. Bing and DuckDuckGo are queried without authentication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Your First Query — 3 Lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_llm_prompt&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_llm_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest AI news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire interface. &lt;code&gt;build_llm_prompt&lt;/code&gt; runs the full pipeline — search, filter, fetch, extract, rerank — and returns a formatted string ready to paste into any LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI alternative
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;production-rag-pipeline &lt;span class="s2"&gt;"latest AI news"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search-only mode (no page fetching)&lt;/span&gt;
production-rag-pipeline &lt;span class="s2"&gt;"Bitcoin price"&lt;/span&gt; &lt;span class="nt"&gt;--mode&lt;/span&gt; search

&lt;span class="c"&gt;# Russian query&lt;/span&gt;
production-rag-pipeline &lt;span class="s2"&gt;"новости ИИ"&lt;/span&gt; &lt;span class="nt"&gt;--mode&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;--lang&lt;/span&gt; ru
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  macOS users
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./run_llm_query.command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This bootstraps a virtual environment automatically on first run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: What Just Happened — Stage by Stage
&lt;/h2&gt;

&lt;p&gt;Let's trace what the pipeline actually does with &lt;code&gt;"latest AI news"&lt;/code&gt;. Enable debug mode to see it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;search_extract_rerank&lt;/span&gt;

&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetched_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_extract_rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest AI news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_fetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 1: Dual-Engine Search
&lt;/h3&gt;

&lt;p&gt;Bing and DuckDuckGo are searched &lt;strong&gt;in parallel&lt;/strong&gt;. Results are merged with position-based scoring — first result from each engine scores highest, and results that appear in both engines get a boost.&lt;/p&gt;
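&lt;p&gt;The exact scoring formula isn't spelled out, so here is a minimal sketch that assumes a reciprocal-position score: a result at position &lt;em&gt;p&lt;/em&gt; contributes 1/(p+1), and a URL returned by both engines collects both contributions. &lt;code&gt;merge_results&lt;/code&gt; and the result-dict shape are hypothetical, not the library's API.&lt;/p&gt;

```python
from collections import defaultdict


def merge_results(engine_results):
    """Merge ranked result lists from several engines into one list.

    Hypothetical scoring: the result at 0-based position p contributes
    1 / (p + 1), so each engine's first hit scores highest and a URL
    returned by both engines accumulates two contributions (the boost).
    """
    scores = defaultdict(float)
    first_seen = {}  # keep the first copy of each result's metadata
    for results in engine_results.values():
        for position, item in enumerate(results):
            url = item["url"]
            scores[url] += 1.0 / (position + 1)
            first_seen.setdefault(url, item)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [dict(first_seen[url], score=round(scores[url], 3)) for url in ranked]


bing = [{"url": "https://a.example"}, {"url": "https://b.example"}]
ddg = [{"url": "https://b.example"}, {"url": "https://c.example"}]
for row in merge_results({"bing": bing, "ddg": ddg}):
    print(row["url"], row["score"])  # b.example comes first: it is in both lists
```

&lt;p&gt;Summing per-engine contributions means agreement between engines is rewarded without any special-case code.&lt;/p&gt;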

&lt;p&gt;The pipeline detects keywords like "news", "latest", "breaking" and switches DDG to its &lt;strong&gt;News index&lt;/strong&gt; — returning actual articles instead of generic homepages.&lt;/p&gt;
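&lt;p&gt;The keyword switch can be sketched as a whole-word match against the three trigger words above. The function names and the &lt;code&gt;"news"&lt;/code&gt;/&lt;code&gt;"web"&lt;/code&gt; labels are illustrative assumptions, not DuckDuckGo parameters, and the pipeline's real keyword list may be longer (this sketch only covers English).&lt;/p&gt;

```python
import re

# Trigger words quoted in the article; the real list may be longer.
NEWS_KEYWORDS = {"news", "latest", "breaking"}


def is_news_query(query):
    """True if any trigger word appears as a whole word (case-insensitive)."""
    words = re.findall(r"\w+", query.lower())
    return any(word in NEWS_KEYWORDS for word in words)


def pick_ddg_index(query):
    # "news" / "web" are illustrative labels, not DuckDuckGo parameters.
    return "news" if is_news_query(query) else "web"


print(pick_ddg_index("latest AI news"))    # news
print(pick_ddg_index("python dataclass"))  # web
```

&lt;p&gt;Matching whole words keeps a query like "newsletter templates" on the regular index.&lt;/p&gt;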

&lt;h3&gt;
  
  
  Stage 2: Semantic Pre-Filtering
&lt;/h3&gt;

&lt;p&gt;This is the key optimization. Before fetching any page, the pipeline computes &lt;strong&gt;cosine similarity&lt;/strong&gt; between the query embedding and each result's title+snippet embedding.&lt;/p&gt;

&lt;p&gt;Results below threshold get dropped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;English: threshold 0.30&lt;/li&gt;
&lt;li&gt;Russian: threshold 0.25&lt;/li&gt;
&lt;/ul&gt;
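&lt;p&gt;Here's a dependency-free sketch of the pre-filter, assuming each result already carries an embedding of its title and snippet. In the real pipeline those vectors come from a sentence-transformers model; the three-dimensional vectors below are made up purely to exercise the threshold logic.&lt;/p&gt;

```python
import math

# Per-language thresholds quoted in the article.
THRESHOLDS = {"en": 0.30, "ru": 0.25}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefilter(query_vec, results, lang="en"):
    """Split results into (kept, dropped) lists of (url, similarity) pairs."""
    threshold = THRESHOLDS.get(lang, THRESHOLDS["en"])
    kept, dropped = [], []
    for result in results:
        sim = cosine(query_vec, result["embedding"])
        (kept if sim >= threshold else dropped).append((result["url"], round(sim, 2)))
    return kept, dropped


# Toy embeddings: only the first result points in the query's direction.
query_vec = [1.0, 0.0, 0.0]
results = [
    {"url": "https://relevant.example", "embedding": [0.9, 0.1, 0.0]},
    {"url": "https://offtopic.example", "embedding": [0.1, 1.0, 0.0]},
]
kept, dropped = prefilter(query_vec, results, lang="en")
print("fetch:", kept)
print("skip:", dropped)
```

&lt;p&gt;Only the &lt;code&gt;kept&lt;/code&gt; URLs would move on to the HTTP fetch stage, which is where the time savings come from.&lt;/p&gt;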

&lt;p&gt;In practice, &lt;strong&gt;~11 out of 20 results get filtered&lt;/strong&gt; — saving about 6 seconds of HTTP fetches.&lt;/p&gt;

&lt;p&gt;Example from a real run with &lt;code&gt;"LLM agents news"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✗ flutrackers.com     sim=0.12  → filtered (irrelevant)
✓ llm-stats.com       sim=0.68  → fetched
✗ reddit.com/r/gaming  sim=0.15  → filtered
✓ arxiv.org/abs/2503   sim=0.71  → fetched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No hardcoded domain lists. Pure semantic relevance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Parallel Fetch + Content Extraction
&lt;/h3&gt;

&lt;p&gt;Surviving results (typically 5–9 URLs) are fetched in parallel. Content extraction runs a two-stage quality check:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural check:&lt;/strong&gt; Do more than 30% of the lines look like numbers, prices, or table data?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic check:&lt;/strong&gt; If flagged, is the table relevant to the query?&lt;/p&gt;

&lt;p&gt;This is how exchange rate tables from &lt;code&gt;cbr.ru&lt;/code&gt; pass for a currency query (similarity 0.75) but CS:GO price lists get rejected (similarity 0.05).&lt;/p&gt;
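&lt;p&gt;The two-stage check can be sketched as follows. The per-line rule (at least half of the non-space characters are digits or separators) and the &lt;code&gt;min_relevance=0.5&lt;/code&gt; cut-off are assumptions; the article only gives the 30% line ratio and two example similarities. &lt;code&gt;table_relevance&lt;/code&gt; is a hypothetical callable standing in for whatever embedding similarity the pipeline computes.&lt;/p&gt;

```python
NUMERIC_CHARS = set("0123456789.,%$|+-/:")


def looks_numeric(line):
    """Assumed rule: a line is 'tabular' if half its visible chars are numeric-ish."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    numeric = sum(1 for c in chars if c in NUMERIC_CHARS)
    return numeric / len(chars) >= 0.5


def needs_semantic_check(text, ratio=0.30):
    """Structural check: flag the page if more than `ratio` of lines look numeric."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    flagged = sum(1 for line in lines if looks_numeric(line))
    return flagged / len(lines) > ratio


def keep_page(text, table_relevance, min_relevance=0.5):
    """Only pages flagged by the cheap structural check pay for the semantic check."""
    if not needs_semantic_check(text):
        return True
    return table_relevance(text) >= min_relevance


rates = "Rates\n92.50 | 93.10\n99.20 | 99.80\n1.05 | 1.07"
print(keep_page(rates, lambda text: 0.75))  # relevant table survives
print(keep_page(rates, lambda text: 0.05))  # off-topic price list is rejected
```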

&lt;p&gt;After extraction, boilerplate is stripped — navigation, ads, newsletter signup patterns, cookie banners.&lt;/p&gt;
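&lt;p&gt;A line-level filter for the boilerplate categories listed above might look like this. The regexes are illustrative assumptions, not the pipeline's rules; extractors such as trafilatura do considerably more than match a handful of patterns.&lt;/p&gt;

```python
import re

# Illustrative patterns only: newsletter prompts, cookie banners, nav labels, ads.
BOILERPLATE = [
    re.compile(r"subscribe to (our|the) newsletter", re.I),
    re.compile(r"we use cookies", re.I),
    re.compile(r"accept (all )?cookies", re.I),
    re.compile(r"^(home|menu|sign in|log in)$", re.I),
    re.compile(r"^advertisement$", re.I),
]


def strip_boilerplate(text):
    """Drop empty lines and lines matching any boilerplate pattern."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(pattern.search(stripped) for pattern in BOILERPLATE):
            continue
        kept.append(stripped)
    return "\n".join(kept)


page = "Home\nOpenAI released a model.\nWe use cookies to improve your experience.\nAdvertisement"
print(strip_boilerplate(page))
```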

&lt;h3&gt;
  
  
  Stage 4: Chunking + Multi-Signal Reranking
&lt;/h3&gt;

&lt;p&gt;Extracted content is chunked, then reranked by four signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25&lt;/strong&gt; — classic lexical term-frequency matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; — cosine between query and chunk embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer-span detection&lt;/strong&gt; — does this chunk directly answer the question?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMR diversity&lt;/strong&gt; — prevents top results from all being paraphrases of the same paragraph&lt;/li&gt;
&lt;/ul&gt;
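&lt;p&gt;Of the four signals, BM25 is the easiest to show in isolation. This sketch uses the standard Okapi BM25 formula with the non-negative IDF variant used by Lucene, common defaults (&lt;code&gt;k1=1.5&lt;/code&gt;, &lt;code&gt;b=0.75&lt;/code&gt;), and a naive tokenizer; how the pipeline weights BM25 against the other three signals isn't covered here.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document (chunk) for the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n or 1.0  # guard all-empty docs
    df = Counter(term for t in tokenized for term in set(t))  # document frequency
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            freq = tf[term]
            # Length normalization: longer-than-average chunks are damped.
            denom = freq + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * freq * (k1 + 1) / denom
        scores.append(score)
    return scores


chunks = [
    "OpenAI released a new model today",
    "The weather is sunny",
    "New AI model benchmarks released",
]
print([round(s, 3) for s in bm25_scores("new AI model", chunks)])
```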

&lt;p&gt;Optionally, a &lt;strong&gt;cross-encoder&lt;/strong&gt; rescores the final shortlist: slower, but more accurate.&lt;/p&gt;
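&lt;p&gt;Returning to the MMR diversity signal from the list above, here is a minimal sketch of greedy Maximal Marginal Relevance over precomputed similarity scores. The trade-off weight &lt;code&gt;lam=0.7&lt;/code&gt; is an assumed default, and the toy scores model two near-paraphrase chunks plus one distinct chunk.&lt;/p&gt;

```python
def mmr_select(query_sims, pairwise_sims, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    query_sims[i]       similarity of chunk i to the query
    pairwise_sims[i][j] similarity between chunks i and j
    lam                 1.0 = pure relevance, 0.0 = pure diversity (assumed 0.7)
    """
    selected = []
    candidates = list(range(len(query_sims)))

    def mmr_score(i):
        # Penalize a candidate by its closest match among already-picked chunks.
        redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
        return lam * query_sims[i] - (1 - lam) * redundancy

    for _ in range(min(k, len(candidates))):
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_sims = [0.90, 0.88, 0.70]  # chunks 0 and 1 are near-paraphrases
pairwise = [
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.25],
    [0.20, 0.25, 1.00],
]
print(mmr_select(query_sims, pairwise, k=2))           # [0, 2]
print(mmr_select(query_sims, pairwise, k=2, lam=1.0))  # [0, 1]
```

&lt;p&gt;With &lt;code&gt;lam=1.0&lt;/code&gt; MMR degenerates to plain top-k and picks both paraphrases; with some weight on diversity, the second slot goes to the distinct chunk.&lt;/p&gt;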

&lt;p&gt;For news queries, &lt;strong&gt;freshness penalties&lt;/strong&gt; apply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content &amp;gt;7 days old: −1 confidence&lt;/li&gt;
&lt;li&gt;Content &amp;gt;30 days old: −2 confidence&lt;/li&gt;
&lt;li&gt;Outdated sources flagged in the prompt with exact age&lt;/li&gt;
&lt;/ul&gt;
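&lt;p&gt;The freshness rules can be expressed as a small function. One assumption to flag: this sketch treats the two thresholds as tiers, so a 40-day-old source gets -2 rather than -3; the article doesn't say whether the penalties stack. The note text is also hypothetical.&lt;/p&gt;

```python
from datetime import date


def freshness_penalty(published, today):
    """Return (penalty, note) for a news source.

    Assumption: tiers are not cumulative (older than 30 days -> -2 total).
    """
    age = (today - published).days
    if age > 30:
        penalty = -2
    elif age > 7:
        penalty = -1
    else:
        penalty = 0
    # Outdated sources get their exact age spelled out for the prompt.
    note = f"outdated: {age} days old" if penalty else ""
    return penalty, note


today = date(2026, 3, 20)
for published in (date(2026, 3, 19), date(2026, 3, 10), date(2026, 1, 1)):
    print(published, freshness_penalty(published, today))
```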

&lt;h3&gt;
  
  
  Stage 5: Prompt Assembly with Citation Binding
&lt;/h3&gt;

&lt;p&gt;The pipeline builds a structured prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_llm_context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_llm_prompt&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_mapping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grouped_sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_llm_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fetched_urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fetched_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;renumber_sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← fixes phantom citation numbers
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Citation numbers are renumbered after every filtering step. If three sources survive, they're numbered [1], [2], [3] — never [1], [3], [7] with phantom gaps.&lt;/p&gt;
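&lt;p&gt;The renumbering idea can be illustrated on any text carrying &lt;code&gt;[n]&lt;/code&gt; markers. This is a hypothetical helper, not the library's &lt;code&gt;renumber_sources&lt;/code&gt; implementation: surviving source numbers are mapped to 1..N in display order, and markers that point at dropped sources are removed.&lt;/p&gt;

```python
import re


def renumber_citations(text, surviving_ids):
    """Rewrite [n] markers so surviving sources are numbered 1..N without gaps.

    surviving_ids: original source numbers in display order, e.g. [1, 3, 7].
    Returns (rewritten_text, old_to_new_mapping).
    """
    mapping = {old: new for new, old in enumerate(surviving_ids, start=1)}

    def replace(match):
        old = int(match.group(1))
        # Markers for filtered-out sources are dropped instead of left dangling.
        return f"[{mapping[old]}]" if old in mapping else ""

    return re.sub(r"\[(\d+)\]", replace, text), mapping


text, mapping = renumber_citations("A [1]. B [3]. C [7]. D [5].", [1, 3, 7])
print(text)
print(mapping)
```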

&lt;p&gt;Current date and time are injected into the prompt so the LLM can reason about source freshness.&lt;/p&gt;
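&lt;p&gt;A minimal sketch of prompt assembly with date injection, mirroring the layout of the example output in Step 4. The source-dict keys (&lt;code&gt;title&lt;/code&gt;, &lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;published&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;) and the function name are assumptions; passing &lt;code&gt;now&lt;/code&gt; explicitly keeps the output reproducible.&lt;/p&gt;

```python
from datetime import datetime


def assemble_prompt(question, sources, now=None):
    """Build a date-stamped, citation-numbered prompt (hypothetical helper)."""
    now = now or datetime.now()
    lines = [
        f"Current date: {now:%Y-%m-%d}",
        "",
        "Answer the user's question using ONLY the provided sources.",
        "Cite sources using [1], [2], etc. Do not make claims without a citation.",
        "",
        "=== SOURCES ===",
        "",
    ]
    # Numbers are assigned in list order, so they are gap-free by construction.
    for number, src in enumerate(sources, start=1):
        lines += [
            f"[{number}] {src['title']}",
            f"Source: {src['domain']} | Published: {src['published']}",
            src["text"],
            "",
        ]
    lines += ["=== QUESTION ===", "", question]
    return "\n".join(lines)


demo_sources = [{"title": "Example headline", "domain": "example.com",
                 "published": "2026-03-19", "text": "Example body text..."}]
print(assemble_prompt("latest AI news", demo_sources, now=datetime(2026, 3, 20)))
```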




&lt;h2&gt;
  
  
  Step 4: What the Output Looks Like
&lt;/h2&gt;

&lt;p&gt;The final prompt looks roughly like this (abbreviated):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current date: 2026-03-20

Answer the user's question using ONLY the provided sources.
Cite sources using [1], [2], etc. Do not make claims without a citation.

=== SOURCES ===

[1] OpenAI announces GPT-5 turbo with 1M context window
Source: techcrunch.com | Published: 2026-03-19
OpenAI today released GPT-5 Turbo, featuring a 1 million token
context window and improved reasoning capabilities...

[2] Google DeepMind publishes Gemini 2.5 technical report
Source: blog.google | Published: 2026-03-18
The technical report details architectural changes including
mixture-of-experts scaling to 3.2 trillion parameters...

[3] Anthropic raises $5B Series E at $90B valuation
Source: reuters.com | Published: 2026-03-17
Anthropic closed a $5 billion funding round, bringing its
total raised to over $15 billion...

=== QUESTION ===

latest AI news
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop this into Ollama, LM Studio, or any API. The model sees curated, relevant, cited content — not raw web pages.&lt;/p&gt;
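&lt;p&gt;For Ollama specifically, the prompt can be sent to its local REST endpoint (&lt;code&gt;POST /api/generate&lt;/code&gt;, port 11434 by default) using only the standard library. The model name &lt;code&gt;llama3.2&lt;/code&gt; is a placeholder for whichever model you've pulled, and &lt;code&gt;ask_ollama&lt;/code&gt; needs a running Ollama server.&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def make_request(prompt, model="llama3.2"):
    """Build a non-streaming generate request; the model name is a placeholder."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="llama3.2", timeout=120):
    """Send the prompt and return the generated text from the 'response' field."""
    with urllib.request.urlopen(make_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, with a prompt from build_llm_prompt and Ollama running locally:
# answer = ask_ollama(prompt)
```

&lt;p&gt;Setting &lt;code&gt;"stream": False&lt;/code&gt; returns a single JSON object instead of a stream of partial responses, which keeps the client trivial.&lt;/p&gt;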




&lt;h2&gt;
  
  
  Step 5: Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dataclass
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;build_llm_prompt&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RAGConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_per_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# results per search engine
&lt;/span&gt;    &lt;span class="n"&gt;top_n_fetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# max pages to fetch
&lt;/span&gt;    &lt;span class="n"&gt;fetch_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# seconds per page
&lt;/span&gt;    &lt;span class="n"&gt;total_context_chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# chunks in final prompt
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_llm_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest AI news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  YAML
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;production-rag-pipeline &lt;span class="s2"&gt;"latest AI news"&lt;/span&gt; &lt;span class="nt"&gt;--config&lt;/span&gt; config.example.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment variables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;RAG_TOP_N_FETCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;RAG_FETCH_TIMEOUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
production-rag-pipeline &lt;span class="s2"&gt;"latest AI news"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
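&lt;p&gt;To make the environment-variable mapping concrete, here is a hypothetical sketch of how &lt;code&gt;RAG_*&lt;/code&gt; variables could override dataclass fields. The field names mirror the &lt;code&gt;RAGConfig&lt;/code&gt; example above and the defaults are that example's values, not necessarily the library's; the upper-cased naming rule and the loader itself are assumptions.&lt;/p&gt;

```python
import os
from dataclasses import dataclass, fields


@dataclass
class PipelineSettings:
    """Hypothetical stand-in for RAGConfig, using the values from the example."""
    num_per_engine: int = 12
    top_n_fetch: int = 8
    fetch_timeout: int = 10
    total_context_chunks: int = 12


def settings_from_env(env=os.environ, prefix="RAG_"):
    """Override defaults from variables like RAG_TOP_N_FETCH (assumed naming rule)."""
    overrides = {}
    for fld in fields(PipelineSettings):
        raw = env.get(prefix + fld.name.upper())
        if raw is not None:
            overrides[fld.name] = int(raw)  # every field here is an int
    return PipelineSettings(**overrides)


print(settings_from_env({"RAG_TOP_N_FETCH": "5", "RAG_FETCH_TIMEOUT": "20"}))
```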






&lt;h2&gt;
  
  
  Step 6: The 50-Line Version
&lt;/h2&gt;

&lt;p&gt;Here's the entire pipeline, from query to LLM-ready prompt, using the module-level API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.search&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.fetch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fetch_pages_parallel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.rerank&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rerank_chunks&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_llm_context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;production_rag_pipeline.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_llm_prompt&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Search
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest AI news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_per_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Fetch
&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_pages_parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Extract + Chunk
&lt;/span&gt;&lt;span class="n"&gt;all_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Rerank
&lt;/span&gt;&lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rerank_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Build prompt
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_llm_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;renumber_sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_llm_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



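&lt;p&gt;This is roughly what &lt;code&gt;build_llm_prompt("latest AI news")&lt;/code&gt; does internally, broken into visible steps. One simplification: this version fetches the top 8 results directly, skipping the semantic pre-filter from Stage 2.&lt;/p&gt;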




&lt;h2&gt;
  
  
  Graceful Degradation
&lt;/h2&gt;

&lt;p&gt;The pipeline works at every install level:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Install&lt;/th&gt;
&lt;th&gt;Ranking&lt;/th&gt;
&lt;th&gt;Extraction&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pip install .&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25 only&lt;/td&gt;
&lt;td&gt;BeautifulSoup&lt;/td&gt;
&lt;td&gt;Fastest, least accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pip install .[extraction]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25 only&lt;/td&gt;
&lt;td&gt;Trafilatura&lt;/td&gt;
&lt;td&gt;Better content quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pip install .[semantic]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25 + semantic + MMR&lt;/td&gt;
&lt;td&gt;BeautifulSoup&lt;/td&gt;
&lt;td&gt;Much better ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pip install .[full]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25 + semantic + cross-encoder + MMR&lt;/td&gt;
&lt;td&gt;Trafilatura&lt;/td&gt;
&lt;td&gt;Best quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No GPU required. Semantic models run on CPU — slower, but functional.&lt;/p&gt;
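&lt;p&gt;Tiering like this is commonly implemented by probing for optional packages at runtime. Here is a minimal sketch using &lt;code&gt;importlib.util.find_spec&lt;/code&gt;; the tier lists are illustrative (the cross-encoder tier is omitted), and this is not how the library is confirmed to decide.&lt;/p&gt;

```python
import importlib.util


def available(module):
    """True if the module can be imported, without actually importing it."""
    return importlib.util.find_spec(module) is not None


def ranking_tier():
    """Pick ranking signals based on which optional packages are installed."""
    if available("sentence_transformers"):
        return ["bm25", "semantic", "mmr"]
    return ["bm25"]


def extraction_backend():
    return "trafilatura" if available("trafilatura") else "beautifulsoup"


print(ranking_tier(), extraction_backend())
```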




&lt;h2&gt;
  
  
  How It Compares to Perplexity
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Perplexity&lt;/th&gt;
&lt;th&gt;production-rag-pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index&lt;/td&gt;
&lt;td&gt;200B+ pre-indexed URLs&lt;/td&gt;
&lt;td&gt;Real-time Bing + DDG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;358ms median&lt;/td&gt;
&lt;td&gt;8–15s on a MacBook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;20+ with dynamic routing&lt;/td&gt;
&lt;td&gt;You choose (Ollama, LM Studio, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Cerebras CS-3, 1,200 tok/s&lt;/td&gt;
&lt;td&gt;Your hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$20/mo Pro&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;Open source, MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap is real — especially on latency and index size. But for a tool that runs on your laptop, feeds any local model, and costs nothing, the tradeoff is worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multilingual Support
&lt;/h2&gt;

&lt;p&gt;The pipeline auto-detects language by Cyrillic character ratio (10% threshold):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English&lt;/strong&gt; → &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; (fast, English-optimized)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Russian&lt;/strong&gt; → &lt;code&gt;paraphrase-multilingual-MiniLM-L12-v2&lt;/code&gt; (50+ languages)&lt;/li&gt;
&lt;/ul&gt;
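&lt;p&gt;The detection rule fits in a few lines. Assumptions: the ratio is computed over alphabetic characters only, "Cyrillic" means the basic Unicode Cyrillic block (U+0400 to U+04FF), and a ratio exactly at 10% counts as Russian.&lt;/p&gt;

```python
CYRILLIC = range(0x0400, 0x0500)  # basic Unicode Cyrillic block

MODELS = {
    "en": "all-MiniLM-L6-v2",
    "ru": "paraphrase-multilingual-MiniLM-L12-v2",
}


def detect_lang(text, threshold=0.10):
    """Return "ru" if Cyrillic letters make up at least `threshold` of all letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en"
    cyrillic = sum(1 for c in letters if ord(c) in CYRILLIC)
    return "ru" if cyrillic / len(letters) >= threshold else "en"


for query in ("latest AI news", "новости ИИ"):
    lang = detect_lang(query)
    print(query, "->", lang, MODELS[lang])
```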

&lt;p&gt;Cross-encoder reranking also switches models per language, so no manual configuration is needed. You can still set the language explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;production-rag-pipeline &lt;span class="s2"&gt;"новости ИИ"&lt;/span&gt; &lt;span class="nt"&gt;--lang&lt;/span&gt; ru
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdza4pi269yswcmh6q0wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdza4pi269yswcmh6q0wt.png" alt="Dark abstract 3D render with the word BUILD in large translucent letters" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is Part 2 of a series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://medium.com/@kazkozdev/how-perplexity-actually-searches-the-internet-ae4b50dd9837" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;&lt;/strong&gt; — How Perplexity Actually Searches the Internet (architecture teardown)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt; — You're reading it (build the local equivalent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this is useful: &lt;a href="https://github.com/KazKozDev/production_rag_pipeline" rel="noopener noreferrer"&gt;github.com/KazKozDev/production_rag_pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Issues and contributions welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>python</category>
    </item>
  </channel>
</rss>
