<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: eyanpen</title>
    <description>The latest articles on Forem by eyanpen (@eyanpen).</description>
    <link>https://forem.com/eyanpen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3893228%2F3dc88537-5bc9-4c8b-acbb-8dcc4932177d.png</url>
      <title>Forem: eyanpen</title>
      <link>https://forem.com/eyanpen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/eyanpen"/>
    <language>en</language>
    <item>
      <title>Why Does Semantic Chunking Need an Embedding API?</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Mon, 04 May 2026 05:54:39 +0000</pubDate>
      <link>https://forem.com/eyanpen/why-does-semantic-chunking-need-an-embedding-api-4dei</link>
      <guid>https://forem.com/eyanpen/why-does-semantic-chunking-need-an-embedding-api-4dei</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Fixed-length chunking requires no external services, yet semantic chunking absolutely needs an Embedding API — why?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Short Answer
&lt;/h2&gt;

&lt;p&gt;The core idea of semantic chunking is to &lt;strong&gt;split text at semantic boundaries&lt;/strong&gt;. Determining whether "two pieces of text belong to the same topic" requires converting text into vectors and computing similarity — that's exactly what the Embedding API does.&lt;/p&gt;
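&lt;p&gt;In code, "computing similarity" is just cosine similarity between embedding vectors. A minimal sketch, with made-up three-dimensional vectors standing in for real API output:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for what an Embedding API would return
emb_apple_earnings = [0.9, 0.1, 0.0]
emb_apple_products = [0.8, 0.2, 0.1]
emb_weather = [0.0, 0.1, 0.9]

same_topic = cosine_similarity(emb_apple_earnings, emb_apple_products)
topic_jump = cosine_similarity(emb_apple_earnings, emb_weather)
assert same_topic > topic_jump  # similar topics sit closer in vector space
```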

&lt;h2&gt;
  
  
  Traditional Chunking vs Semantic Chunking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Fixed-Length / Recursive&lt;/th&gt;
&lt;th&gt;Semantic Chunking&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Split criteria&lt;/td&gt;
&lt;td&gt;Character count, token count, delimiters&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; between adjacent sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Requires Embedding&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split quality&lt;/td&gt;
&lt;td&gt;May break in the middle of a topic&lt;/td&gt;
&lt;td&gt;Splits at topic transitions, preserving semantic coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fixed-length chunking is like measuring paper with a ruler — regardless of content, it cuts every 500 characters. Semantic chunking is like a reader who, after finishing a paragraph, asks "is the next part still about the same thing?" If not, that's where the cut goes.&lt;/p&gt;
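&lt;p&gt;The ruler is literal. A minimal fixed-length chunker (an illustrative sketch, not any particular library's implementation) needs nothing beyond string slicing and never calls an external service:&lt;/p&gt;

```python
def fixed_length_chunks(text, chunk_size=500):
    """Cut every chunk_size characters, blind to content."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

doc = "A" * 1200  # stands in for any document
chunks = fixed_length_chunks(doc)
# 1200 chars -> chunks of 500, 500, 200, regardless of where topics change
assert [len(c) for c in chunks] == [500, 500, 200]
```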

&lt;h2&gt;
  
  
  Two Mainstream Semantic Chunking Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strategy 1: Adjacent Similarity (Kamradt Method)
&lt;/h3&gt;

&lt;p&gt;Core idea: Compute semantic distances between adjacent sentences and split where distances spike.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process:
1. Split text into small sentences
2. For each sentence, concatenate buffer_size sentences on each side as context
3. Call Embedding API to get vectors for each combined sentence
4. Compute cosine distances between adjacent combined sentences
5. Binary-search for a threshold, then split wherever the distance exceeds it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Build context windows
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;combined_texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Get embeddings for all combined sentences (one batch call)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined_texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Compute cosine distances only between adjacent sentences
&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Higher distance = greater topic difference
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 4: Binary search for threshold targeting total_size / avg_chunk_size cuts
&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;binary_search_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_cuts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Split where distance exceeds threshold
&lt;/span&gt;&lt;span class="n"&gt;breakpoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Intuition: Imagine reading an article sentence by sentence, asking yourself after each one: "Is the next sentence still about the same thing?" When you feel the topic has jumped, you cut there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key characteristic: Only looks at adjacent relationships.&lt;/strong&gt; It only computes the distance between sentence[i] and sentence[i+1] — a &lt;strong&gt;local greedy&lt;/strong&gt; strategy.&lt;/p&gt;
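&lt;p&gt;The &lt;code&gt;binary_search_threshold&lt;/code&gt; step left abstract above can be sketched as follows. This is an illustrative implementation, not the exact routine of any specific library; it relies on the cut count being monotone in the threshold:&lt;/p&gt;

```python
def binary_search_threshold(distances, target_cuts, iterations=40):
    """Find a distance threshold that yields roughly target_cuts breakpoints.

    Lowering the threshold can only add cuts, so the cut count is monotone
    in the threshold and we can bisect on it.
    """
    lo, hi = min(distances), max(distances)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        cuts = sum(1 for d in distances if d > mid)
        if cuts > target_cuts:
            lo = mid  # too many cuts: raise the threshold
        else:
            hi = mid  # few enough cuts: try lowering the threshold
    return hi

# Adjacent distances with two clear spikes (topic jumps)
distances = [0.08, 0.10, 0.55, 0.12, 0.09, 0.60, 0.11]
threshold = binary_search_threshold(distances, target_cuts=2)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
assert breakpoints == [2, 5]  # cuts land exactly on the two spikes
```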

&lt;h3&gt;
  
  
  Strategy 2: Cluster Optimal Segmentation (Dynamic Programming Method)
&lt;/h3&gt;

&lt;p&gt;Core idea: Build a similarity matrix between all sentence pairs and use dynamic programming to find the segmentation that maximizes intra-cluster similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process:
1. Split text into small sentences
2. Call Embedding API to get vectors for all sentences
3. Build an N×N similarity matrix
4. Normalize the matrix by subtracting the mean (prevents degeneration into one giant cluster)
5. Use dynamic programming to find the optimal segmentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Get embeddings for all sentences (note: no buffer concatenation)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embedding_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Build N×N similarity matrix
&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_matrix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Mean normalization to prevent DP from putting everything in one cluster
&lt;/span&gt;&lt;span class="n"&gt;mean_sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;upper_triangle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;similarity_matrix&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;mean_sim&lt;/span&gt;
&lt;span class="nf"&gt;fill_diagonal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Dynamic programming for optimal segmentation
# dp[i] = maximum intra-cluster similarity sum for the first i+1 sentences
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;cluster_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_chunk_size&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Backtrack to get optimal segmentation
&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;segmentation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key characteristic: Globally optimal.&lt;/strong&gt; It considers relationships between all sentence pairs and uses DP to find the overall best segmentation.&lt;/p&gt;
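&lt;p&gt;The DP above can be made concrete. The sketch below is a minimal illustration with a hand-built mean-normalized matrix, not production code; it returns contiguous segments as (start, end) index pairs:&lt;/p&gt;

```python
def cluster_segments(sim, max_cluster_size=4):
    """Pick contiguous segments maximizing total intra-segment similarity.

    sim is a square list-of-lists with a zeroed diagonal, mean-normalized
    so that cross-topic pairs score negative.
    """
    n = len(sim)
    dp = [float("-inf")] * n
    seg_start = [0] * n
    for i in range(n):
        for size in range(1, min(i + 1, max_cluster_size) + 1):
            start = i - size + 1
            # Reward = sum of the square submatrix covering this segment
            reward = sum(sum(row[start:i + 1]) for row in sim[start:i + 1])
            if start > 0:
                reward += dp[start - 1]
            if reward > dp[i]:
                dp[i] = reward
                seg_start[i] = start  # remember where the best last segment begins
    segments, end = [], n - 1
    while end >= 0:  # backtrack from the final sentence
        segments.append((seg_start[end], end))
        end = seg_start[end] - 1
    return segments[::-1]

# Toy mean-normalized matrix: sentences 0-2 share a topic, 3-5 another
sim = [
    [0.0, 0.5, 0.4, -0.3, -0.4, -0.4],
    [0.5, 0.0, 0.5, -0.3, -0.3, -0.4],
    [0.4, 0.5, 0.0, -0.2, -0.3, -0.3],
    [-0.3, -0.3, -0.2, 0.0, 0.5, 0.4],
    [-0.4, -0.3, -0.3, 0.5, 0.0, 0.5],
    [-0.4, -0.4, -0.3, 0.4, 0.5, 0.0],
]
segments = cluster_segments(sim)
assert segments == [(0, 2), (3, 5)]  # the split falls on the topic boundary
```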

&lt;h2&gt;
  
  
  Deep Comparison of the Two Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fundamental Algorithmic Differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Kamradt (Adjacent Similarity)&lt;/th&gt;
&lt;th&gt;Cluster (Dynamic Programming)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local&lt;/strong&gt; — only adjacent sentences&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Global&lt;/strong&gt; — all sentence pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision method&lt;/td&gt;
&lt;td&gt;Greedy: cut when distance exceeds threshold&lt;/td&gt;
&lt;td&gt;Optimization: maximize intra-cluster similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threshold&lt;/td&gt;
&lt;td&gt;Binary search for target cut count&lt;/td&gt;
&lt;td&gt;No threshold needed, DP decides automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context enhancement&lt;/td&gt;
&lt;td&gt;✅ buffer_size concatenation&lt;/td&gt;
&lt;td&gt;❌ Uses raw sentences directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size constraints&lt;/td&gt;
&lt;td&gt;avg_chunk_size + max_chunk_size dual constraint&lt;/td&gt;
&lt;td&gt;max_chunk_size hard constraint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The core difference in one sentence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kamradt asks: "Is there a topic transition between these two adjacent sentences?"&lt;/li&gt;
&lt;li&gt;Cluster asks: "Which grouping makes sentences within each group most similar to each other?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  An Intuitive Example
&lt;/h3&gt;

&lt;p&gt;Consider 6 sentences with the following topic distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sentence 1: Discussing Apple's earnings report
Sentence 2: Discussing Apple's new products
Sentence 3: Discussing the weather forecast
Sentence 4: Discussing tomorrow's temperature
Sentence 5: Discussing Apple's stock price
Sentence 6: Discussing Apple's competitors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kamradt's approach:&lt;/strong&gt; Compare adjacent pairs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentence 2→3: Topic jump (Apple → weather), cut!&lt;/li&gt;
&lt;li&gt;Sentence 4→5: Topic jump (weather → Apple), cut!&lt;/li&gt;
&lt;li&gt;Result: [1,2] [3,4] [5,6]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cluster's approach:&lt;/strong&gt; The global similarity matrix shows sentences 1,2,5,6 are highly similar to each other&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;But since DP requires contiguous segmentation (can't skip around), it can only cut contiguous spans&lt;/li&gt;
&lt;li&gt;Result is likely also [1,2] [3,4] [5,6], but the reasoning is different&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key difference emerges when boundaries are fuzzy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider an article that gradually transitions from "EV technology" to "energy policy":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sentence 1: Tesla released a new generation of battery technology
Sentence 2: The new battery's energy density improved by 50%
Sentence 3: Higher energy density means longer driving range
Sentence 4: Range anxiety has been a barrier for consumers buying EVs
Sentence 5: The government introduced charging station subsidies to address this
Sentence 6: Subsidies cover both residential and commercial charging facilities
Sentence 7: Commercial charging uses time-of-use electricity pricing
Sentence 8: Time-of-use pricing is a key component of electricity market reform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What Kamradt sees (adjacent distances):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1→2: 0.08  (both about batteries)
2→3: 0.10  (battery → range, very close)
3→4: 0.12  (range → range anxiety, very close)
4→5: 0.15  (consumers → government policy, slightly far but not outstanding)
5→6: 0.09  (both about subsidies)
6→7: 0.13  (subsidies → pricing, somewhat far)
7→8: 0.11  (both about pricing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No single distance clearly "spikes": the topic slides gradually. Kamradt's binary search struggles to settle on a meaningful threshold, so the boundary it picks is somewhat arbitrary, e.g. a split like [1-3][4-6][7-8] that severs the closely related subsidy and pricing sentences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Cluster sees (global similarity matrix summary):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        S1    S2    S3    S4    S5    S6    S7    S8
S1      --   0.9   0.7   0.4   0.2   0.1   0.1   0.05
S2           --    0.8   0.5   0.2   0.15  0.1   0.05
S3                 --    0.6   0.3   0.2   0.15  0.1
S4                       --    0.5   0.4   0.3   0.2
S5                             --    0.8   0.6   0.4
S6                                   --    0.7   0.5
S7                                         --    0.8
S8                                               --
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The global view clearly shows: sentences 1-3 are highly similar to each other (battery/range technology), sentences 5-8 are highly similar to each other (policy/pricing), and sentence 4 is a transition. DP optimization discovers that [1-3][4-8] or [1-4][5-8] maximizes intra-cluster similarity, producing a more reasonable split.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The essential difference:&lt;/strong&gt; Kamradt only looks at "the gap between adjacent sentences" — in a gradual transition, each step's gap is small, like the boiling frog metaphor. Cluster looks at "the overall similarity within each group" — even when the transition is smooth, it can still detect that sentence 1 and sentence 8 are essentially unrelated.&lt;/p&gt;
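&lt;p&gt;The boiling-frog effect is easy to reproduce with toy unit vectors that drift a few degrees per step: every adjacent cosine distance stays tiny, yet the endpoints end up far apart.&lt;/p&gt;

```python
import math

def cos_sim(a, b):
    return sum(x * y for x, y in zip(a, b))  # unit vectors, so dot = cosine

# Eight unit vectors drifting 15 degrees per step: a gradual topic slide
vectors = [(math.cos(math.radians(15 * i)), math.sin(math.radians(15 * i)))
           for i in range(8)]

adjacent = [1 - cos_sim(vectors[i], vectors[i + 1]) for i in range(7)]
endpoint = 1 - cos_sim(vectors[0], vectors[7])

assert 0.05 > max(adjacent)  # each single step looks harmless (~0.034)
assert endpoint > 1.0        # yet sentence 1 and sentence 8 are 105 degrees apart
```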

&lt;h3&gt;
  
  
  Embedding Cost Comparison
&lt;/h3&gt;

&lt;p&gt;This is one of the most important practical differences between the two strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Kamradt&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding input&lt;/td&gt;
&lt;td&gt;combined_sentence (with buffer context)&lt;/td&gt;
&lt;td&gt;Raw sentences (no buffer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding call count&lt;/td&gt;
&lt;td&gt;N texts, 1 batch call&lt;/td&gt;
&lt;td&gt;N texts, 1 batch call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average text length&lt;/td&gt;
&lt;td&gt;Longer (~7 sentences, buffer_size=3)&lt;/td&gt;
&lt;td&gt;Shorter (1 sentence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total token consumption&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher&lt;/strong&gt; (buffer causes input inflation)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Lower&lt;/strong&gt; (no redundancy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-embedding computation&lt;/td&gt;
&lt;td&gt;O(N) — only adjacent distances&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;O(N²)&lt;/strong&gt; — full similarity matrix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DP computation&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;O(N × max_cluster_size)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Concrete Numbers (1000 sentences, ~30 tokens each)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Kamradt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding input: 1000 combined_sentences, each ~7×30 = 210 tokens&lt;/li&gt;
&lt;li&gt;Total token consumption: 1000 × 210 = &lt;strong&gt;210,000 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Distance computation: 999 dot products → negligible&lt;/li&gt;
&lt;li&gt;Memory: 1000 × embedding_dim matrix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding input: 1000 raw sentences, each ~30 tokens&lt;/li&gt;
&lt;li&gt;Total token consumption: 1000 × 30 = &lt;strong&gt;30,000 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Similarity matrix: 1000 × 1000 = &lt;strong&gt;1 million floats&lt;/strong&gt; (~8MB)&lt;/li&gt;
&lt;li&gt;DP computation: O(1000 × max_cluster_size) iterations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding API cost&lt;/strong&gt;: Kamradt consumes ~7x more tokens (due to buffer concatenation), higher API cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute resources&lt;/strong&gt;: Cluster's O(N²) matrix and DP are more expensive on local CPU/memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network latency&lt;/strong&gt;: Same for both (both use 1 batch call, or multiple calls based on batch_size)&lt;/li&gt;
&lt;/ul&gt;
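&lt;p&gt;The token arithmetic behind these conclusions, using the assumptions from the example above (1000 sentences, ~30 tokens each, &lt;code&gt;buffer_size&lt;/code&gt; = 3, so each combined text spans ~7 sentences):&lt;/p&gt;

```python
# Back-of-envelope cost model for the numbers above
n_sentences = 1000
tokens_per_sentence = 30
buffer_size = 3

# Kamradt embeds combined texts: buffer_size before + current + buffer_size after
kamradt_tokens = n_sentences * (2 * buffer_size + 1) * tokens_per_sentence
# Cluster embeds raw sentences with no context inflation
cluster_tokens = n_sentences * tokens_per_sentence

assert kamradt_tokens == 210_000
assert cluster_tokens == 30_000
assert kamradt_tokens // cluster_tokens == 7  # the ~7x API-cost gap
```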

&lt;h4&gt;
  
  
  Large-Scale Scenario (100,000 sentences)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Kamradt&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total embedding tokens&lt;/td&gt;
&lt;td&gt;~21 million tokens&lt;/td&gt;
&lt;td&gt;~3 million tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API calls (batch_size=500)&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Similarity computation&lt;/td&gt;
&lt;td&gt;99,999 dot products&lt;/td&gt;
&lt;td&gt;10 billion dot products (N² matrix)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;~400MB (embedding matrix)&lt;/td&gt;
&lt;td&gt;~80GB (N² float64 similarity matrix) ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;At 100K sentences, Cluster's N² matrix will blow up memory&lt;/strong&gt; — this is its hard limitation. In practice, Cluster is better suited for medium-length documents (hundreds to thousands of sentences), while Kamradt can handle any length.&lt;/p&gt;
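&lt;p&gt;The memory wall is simple arithmetic. Assuming 8-byte float64 entries (the same assumption behind the ~8MB figure for 1000 sentences), the N×N matrix alone costs:&lt;/p&gt;

```python
def sim_matrix_bytes(n_sentences, bytes_per_float=8):
    """Memory footprint of a dense N x N float similarity matrix."""
    return n_sentences ** 2 * bytes_per_float

assert sim_matrix_bytes(1_000) == 8_000_000          # ~8 MB: fine
assert sim_matrix_bytes(100_000) == 80_000_000_000   # ~80 GB: not fine
```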

&lt;h3&gt;
  
  
  Split Quality Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Kamradt&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clear topic boundaries&lt;/td&gt;
&lt;td&gt;✅ Excellent, obvious distance spikes&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradual topic transitions&lt;/td&gt;
&lt;td&gt;⚠️ May fail to find split points&lt;/td&gt;
&lt;td&gt;✅ Global optimization still finds best split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short documents (&amp;lt;50 sentences)&lt;/td&gt;
&lt;td&gt;✅ Fast&lt;/td&gt;
&lt;td&gt;✅ Higher quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents (&amp;gt;10K sentences)&lt;/td&gt;
&lt;td&gt;✅ Linear scaling&lt;/td&gt;
&lt;td&gt;❌ Memory explosion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very short sentences&lt;/td&gt;
&lt;td&gt;⚠️ Needs buffer for context&lt;/td&gt;
&lt;td&gt;⚠️ Short sentence embeddings are low quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Choose?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Scenario&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unknown document length, need general solution&lt;/td&gt;
&lt;td&gt;Kamradt&lt;/td&gt;
&lt;td&gt;Linear complexity, won't blow memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short documents (&amp;lt;2000 sentences), want optimal splits&lt;/td&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;Globally optimal, higher quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding API charges per token&lt;/td&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;No buffer inflation, 7x fewer tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Limited local compute resources&lt;/td&gt;
&lt;td&gt;Kamradt&lt;/td&gt;
&lt;td&gt;O(N) computation, memory-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fuzzy topic boundaries, need precise splits&lt;/td&gt;
&lt;td&gt;Cluster&lt;/td&gt;
&lt;td&gt;DP global optimization is more robust&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why Can't Other Methods Replace Embedding?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alternative&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Keyword overlap / TF-IDF&lt;/td&gt;
&lt;td&gt;Cannot capture synonyms or contextual semantics ("automobile" and "vehicle" would be considered unrelated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rule-based delimiters (paragraphs, periods)&lt;/td&gt;
&lt;td&gt;One paragraph may contain multiple topics; different paragraphs may discuss the same topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM direct judgment&lt;/td&gt;
&lt;td&gt;Too expensive, high latency, unsuitable for batch processing tens of thousands of sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Embedding maps text into a high-dimensional semantic space where semantically similar texts have small vector distances and dissimilar texts have large distances. Among current approaches to measuring semantic similarity, it offers the best practical balance of &lt;strong&gt;cost, speed, and quality&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  buffer_size: The Role of the Context Window
&lt;/h2&gt;

&lt;p&gt;Semantic chunking has a key parameter &lt;code&gt;buffer_size&lt;/code&gt; (default: 3) that determines how much context is concatenated when generating embeddings for each sentence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Concatenation logic
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 3 before
&lt;/span&gt;              &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                                     &lt;span class="c1"&gt;# current
&lt;/span&gt;              &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 3 after
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key point: buffer_size does not affect the number of Embedding calls — only the length of each input text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With 10 sentences, whether buffer_size is 1 or 10, you still embed 10 combined_sentences. The difference is how much context each text contains:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;buffer_size&lt;/th&gt;
&lt;th&gt;Avg sentences per text&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;~3&lt;/td&gt;
&lt;td&gt;Less context, may misjudge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (default)&lt;/td&gt;
&lt;td&gt;~7&lt;/td&gt;
&lt;td&gt;Balance point&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;~21&lt;/td&gt;
&lt;td&gt;Rich context, but may exceed model token limit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: Embedding models have input length limits (e.g., BGE-M3 max 8192 tokens). If buffer_size is too large, texts get truncated, potentially losing the current sentence's information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance at Scale
&lt;/h2&gt;

&lt;p&gt;Suppose a long document is split into 100,000 sentences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Texts to embed = &lt;strong&gt;100,000&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;With batch_size of 500, actual API calls = 100,000 ÷ 500 = &lt;strong&gt;200 HTTP requests&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The performance bottleneck is API call count (determined by total sentences and batch_size), independent of buffer_size.&lt;/p&gt;
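The call-count arithmetic can be sketched directly. Here `embed_batch` is a hypothetical stand-in for whatever client function your Embedding API provides:

```python
import math

def embed_all(texts, embed_batch, batch_size=500):
    # One HTTP request per batch: ceil(len(texts) / batch_size) calls total,
    # independent of buffer_size (which only lengthens each input text).
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors

# 100,000 texts at batch_size=500:
print(math.ceil(100_000 / 500))  # 200 HTTP requests
```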

&lt;h2&gt;
  
  
  Fallback Strategy: What If Embedding Is Unavailable?
&lt;/h2&gt;

&lt;p&gt;Good system design should account for Embedding service unavailability. The common approach: when Embedding calls fail, automatically fall back to recursive chunking (pure rule-based splitting, no Embedding needed).&lt;/p&gt;

&lt;p&gt;This means semantic chunking is an &lt;strong&gt;enhancement&lt;/strong&gt;, not a &lt;strong&gt;dependency&lt;/strong&gt; — the system still works without the Embedding service, just with lower split quality.&lt;/p&gt;
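A minimal sketch of that fallback, assuming hypothetical `semantic_chunk` and `recursive_chunk` callables (not any specific library's API):

```python
def chunk_with_fallback(text, semantic_chunk, recursive_chunk):
    # Semantic chunking is an enhancement, not a dependency: if the
    # Embedding service fails, degrade to rule-based recursive chunking.
    try:
        return semantic_chunk(text)
    except Exception:
        return recursive_chunk(text)
```

In production you would catch a narrower exception type (connection errors, timeouts) and log the degradation, but the shape is the same.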

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Why is Embedding needed?&lt;/td&gt;
&lt;td&gt;Judging semantic similarity requires vector representations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can rules replace it?&lt;/td&gt;
&lt;td&gt;No, rules cannot capture semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can LLM replace it?&lt;/td&gt;
&lt;td&gt;Theoretically yes, but cost and latency are unacceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kamradt vs Cluster core difference?&lt;/td&gt;
&lt;td&gt;Local adjacent comparison vs global optimal segmentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which has higher Embedding cost?&lt;/td&gt;
&lt;td&gt;Kamradt: higher token consumption (buffer inflation); Cluster: higher compute cost (N² matrix)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which for large documents?&lt;/td&gt;
&lt;td&gt;Kamradt — linear complexity, won't blow memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which for optimal splits?&lt;/td&gt;
&lt;td&gt;Cluster — global DP optimization, but limited to medium-length documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What if service is unavailable?&lt;/td&gt;
&lt;td&gt;Both fall back to rule-based chunking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Embedding API is the "eyes" of semantic chunking — without it, the chunking algorithm is a blind person cutting a cake. The two strategies "see" text differently: Kamradt is like a line-by-line scanner, Cluster is like an editor with a bird's-eye view. Which to choose depends on your document scale and split quality requirements.&lt;/p&gt;

</description>
      <category>semanticchunking</category>
      <category>embedding</category>
      <category>rag</category>
      <category>textsplitting</category>
    </item>
    <item>
      <title>Multiple Independent Questions: Batch Into One Request or Split Into Many? — An Analysis of LLM Concurrent Processing</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Sun, 03 May 2026 00:19:18 +0000</pubDate>
      <link>https://forem.com/eyanpen/multiple-independent-questions-batch-into-one-request-or-split-into-many-an-analysis-of-llm-1h6m</link>
      <guid>https://forem.com/eyanpen/multiple-independent-questions-batch-into-one-request-or-split-into-many-an-analysis-of-llm-1h6m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;When you have 5 unrelated questions, should you pack them into one message to the LLM, or send 5 requests simultaneously? Which is faster?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Short Answer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Splitting into multiple independent parallel requests is almost always faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a gut feeling — it's determined by the underlying inference mechanism of LLMs. Let's walk through the reasoning from first principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. How LLMs Generate Text: Autoregressive Decoding
&lt;/h2&gt;

&lt;p&gt;To understand this problem, you first need to know how LLMs "write."&lt;/p&gt;

&lt;p&gt;LLMs (GPT-4, Claude, etc.) use &lt;strong&gt;autoregressive generation&lt;/strong&gt;: they produce one token at a time, append that token back to the input, then generate the next token. This repeats until generation is complete.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Generating N tokens requires N forward passes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 100-token answer requires 100 inference steps&lt;/li&gt;
&lt;li&gt;A 500-token answer requires 500 inference steps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total output length directly determines total latency&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
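The loop behind this is simple to sketch. `model_step` below is a toy stand-in for one forward pass of a real model:

```python
def generate(model_step, prompt_tokens, max_new_tokens):
    # Autoregressive decoding: every new token needs one forward pass
    # over the sequence so far, so N output tokens cost N sequential steps.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_step(tokens)  # one forward pass
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]
```

No matter how fast each `model_step` is, the steps cannot be reordered or skipped: token i+1 depends on token i.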

&lt;h2&gt;
  
  
  2. Batched Request: Output Volumes Stack, Latency Grows Linearly
&lt;/h2&gt;

&lt;p&gt;Suppose you have 5 independent questions, each requiring ~200 tokens to answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach A: Combine into one request&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You stuff all 5 questions into a single message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please answer the following questions separately:
1. xxx
2. xxx
3. xxx
4. xxx
5. xxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM needs to generate total output ≈ 5 × 200 = 1000 tokens. Due to autoregressive decoding, these 1000 tokens are generated &lt;strong&gt;sequentially&lt;/strong&gt; — token #201 must wait for the first 200 to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total latency ≈ 1000 × per-token generation time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plus additional overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM must maintain context switches between answers ("now answering question 3")&lt;/li&gt;
&lt;li&gt;Longer KV Cache means increasing attention computation at each step&lt;/li&gt;
&lt;li&gt;Actual output often exceeds 1000 tokens (formatting, transition phrases, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Split Requests: Parallel Inference, Latency Equals the Slowest One
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Approach B: Split 5 questions into 5 independent requests, sent simultaneously&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each request independently generates ~200 tokens. If the server has sufficient concurrent processing capacity (as modern LLM services generally do), these 5 requests are &lt;strong&gt;processed in parallel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total latency ≈ max(individual request latencies) ≈ 200 × per-token generation time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Total output tokens&lt;/th&gt;
&lt;th&gt;Actual latency (relative)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Combined request&lt;/td&gt;
&lt;td&gt;~1000+&lt;/td&gt;
&lt;td&gt;~1000 steps (sequential)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split into 5 requests&lt;/td&gt;
&lt;td&gt;~200 each&lt;/td&gt;
&lt;td&gt;~200 steps (parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Theoretical speedup ≈ 5x&lt;/strong&gt; (equals the number of questions).&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Why Does Parallelism Work? — Server-Side Continuous Batching
&lt;/h2&gt;

&lt;p&gt;You might ask: doesn't the LLM server have capacity limits? Won't 5 simultaneous requests queue up?&lt;/p&gt;

&lt;p&gt;Modern LLM inference engines (vLLM, TensorRT-LLM, TGI, etc.) all implement &lt;strong&gt;Continuous Batching&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multiple requests share the same GPU matrix operation&lt;/strong&gt;: GPUs excel at parallel computation. Combining tokens from 5 requests into one batch allows a single forward pass to generate one token for each request simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic scheduling&lt;/strong&gt;: Different requests have different output lengths. Shorter ones finish first, and their slots are immediately given to new requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput vs. latency decoupling&lt;/strong&gt;: Larger batches mean higher GPU utilization and more total tokens processed per unit time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the server's perspective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 short parallel requests → GPU does 5-way batched inference, producing 5 tokens per step&lt;/li&gt;
&lt;li&gt;1 long request → GPU does single-sequence inference, producing 1 token per step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The GPU's parallel computing power is wasted when requests are combined.&lt;/strong&gt;&lt;/p&gt;
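The step-count difference can be made concrete with toy arithmetic, assuming one token per live sequence per forward pass and ignoring prefill:

```python
def decode_steps_combined(output_lengths):
    # One serial stream: all answers' tokens are generated back to back.
    return sum(output_lengths)

def decode_steps_parallel(output_lengths):
    # Continuous batching: each step emits one token per live request,
    # so total steps equal the length of the longest single answer.
    return max(output_lengths)

answers = [200] * 5
print(decode_steps_combined(answers))  # 1000
print(decode_steps_parallel(answers))  # 200
```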

&lt;h2&gt;
  
  
  5. The Prefill Phase Difference
&lt;/h2&gt;

&lt;p&gt;LLM inference has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt;: Process the input prompt, computing KV Cache for all input tokens. This step can process all input tokens in parallel, with latency roughly linear to input length.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode&lt;/strong&gt;: Generate output token by token. This step is sequential.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With combined requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefill phase: Longer input (all 5 questions concatenated), longer prefill time&lt;/li&gt;
&lt;li&gt;Decode phase: Longer output, longer decode time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With split requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each request's prefill is shorter, and all 5 prefills can run in parallel or pipelined&lt;/li&gt;
&lt;li&gt;Each request's decode is shorter, and they run in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both phases favor splitting.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. An Often-Overlooked Factor: Quality
&lt;/h2&gt;

&lt;p&gt;Beyond speed, combining requests carries quality risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention dilution&lt;/strong&gt;: When an LLM processes multiple unrelated tasks in one generation, its "focus" on each task decreases. Long-context research shows that content surrounded by irrelevant material is used less reliably, degrading answer quality (cf. the "Lost in the Middle" findings).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format confusion&lt;/strong&gt;: Answers to 5 questions easily suffer from numbering errors, omissions, or mismatched responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error propagation&lt;/strong&gt;: If the answer to question 2 goes wrong, the LLM may be influenced in subsequent answers (autoregressive "inertia").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Split requests completely isolate context, giving each question the LLM's "full attention."&lt;/p&gt;

&lt;h2&gt;
  
  
  7. When Is Combining Actually Better?
&lt;/h2&gt;

&lt;p&gt;To be fair, there are a few scenarios where combining may be more appropriate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hidden correlations between questions&lt;/strong&gt;: Even if you think they're independent, the LLM might give more consistent answers seeing the full picture (e.g., different sections of the same report).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict API rate limits&lt;/strong&gt;: If your API quota is 3 requests per minute, you have no choice but to combine 5 questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network latency far exceeds generation time&lt;/strong&gt;: If each API call carries 2 seconds of fixed round-trip overhead but generation takes only 0.5 seconds, per-request overhead dominates. Note that truly parallel requests pay the round trip concurrently; the overhead only accumulates (5 × 2s = 10s) when connection or rate limits force the requests to go out one after another. In practice this is rare — modern API network latency is typically 100-300ms, far less than generation time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extremely short answers&lt;/strong&gt;: If each question only needs a word or two, prefill overhead dominates, and combining can reduce redundant prefill costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  8. How to Verify This Yourself
&lt;/h2&gt;

&lt;p&gt;If you want to test this empirically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Call LLM API
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Approach A: Combined
&lt;/span&gt;        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please answer separately:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ask_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time_combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

        &lt;span class="c1"&gt;# Approach B: Parallel
&lt;/span&gt;        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;ask_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;time_parallel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Combined: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_combined&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parallel: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_parallel&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Speedup: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_combined&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_parallel&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, 5 moderately complex independent questions typically achieve 3-5x speedup with parallel requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Combined request&lt;/th&gt;
&lt;th&gt;Split parallel requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation speed&lt;/td&gt;
&lt;td&gt;Slow (sequential output of all answers)&lt;/td&gt;
&lt;td&gt;Fast (parallel generation, latency = slowest)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU utilization&lt;/td&gt;
&lt;td&gt;Low (single-sequence inference)&lt;/td&gt;
&lt;td&gt;High (batched parallel inference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;May degrade (attention dilution)&lt;/td&gt;
&lt;td&gt;Better (isolated context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API calls&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Rate-limited / extremely short answers&lt;/td&gt;
&lt;td&gt;Independent questions needing detailed answers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Core principle in one sentence: LLM's autoregressive mechanism means output is sequential; combining requests = forcing all outputs into a single serial stream; splitting requests = leveraging server-side parallelism to generate multiple outputs simultaneously. Splitting independent questions is the classic strategy of trading space (concurrent slots) for time.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llminference</category>
      <category>autoregressivegeneration</category>
      <category>parallelrequests</category>
      <category>continuousbatching</category>
    </item>
    <item>
      <title>What Is GraphRAG Really Doing? — A Deep Dive into Microsoft's Blog Post</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:57:01 +0000</pubDate>
      <link>https://forem.com/eyanpen/what-is-graphrag-really-doing-a-deep-dive-into-microsofts-blog-post-17m5</link>
      <guid>https://forem.com/eyanpen/what-is-graphrag-really-doing-a-deep-dive-into-microsofts-blog-post-17m5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Original: &lt;a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/" rel="noopener noreferrer"&gt;GraphRAG: Unlocking LLM discovery on narrative private data - Microsoft Research&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In early 2024, Microsoft published a technical blog post. The core message boils down to one sentence: &lt;strong&gt;Traditional RAG falls short with complex data, and GraphRAG fills the gap using knowledge graphs + graph clustering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't an academic paper — it reads more like a "tech pitch" aimed at technical decision-makers and engineers. Let me break it down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Does Traditional RAG Fall Short?
&lt;/h2&gt;

&lt;p&gt;To understand what GraphRAG solves, we need to start with the pain points of traditional RAG. The article highlights two scenarios where traditional RAG struggles:&lt;/p&gt;

&lt;h3&gt;
  
  
  Information That Can't Be Connected
&lt;/h3&gt;

&lt;p&gt;Imagine asking an AI: "What has Novorossiya done?"&lt;/p&gt;

&lt;p&gt;Traditional RAG takes the word "Novorossiya" and runs a vector search. But among the 10 text chunks retrieved, none directly mentions that name — the answer is scattered across different documents, connected only through indirect relationships between entities. Vector search only finds text that "looks similar"; it can't handle this kind of reasoning that requires "jumping" between connections.&lt;/p&gt;

&lt;p&gt;GraphRAG works differently: it locates the Novorossiya node in the knowledge graph, then traverses along relationship edges — actions, goals, related organizations — and assembles the complete answer.&lt;/p&gt;

&lt;p&gt;Put simply, vector retrieval is "local matching," while real-world knowledge is often connected indirectly through chains of entity relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can't Answer "Big Questions"
&lt;/h3&gt;

&lt;p&gt;Another example: "What are the top 5 themes in this dataset?"&lt;/p&gt;

&lt;p&gt;Traditional RAG is stumped — the word "themes" is too broad. Vector search doesn't know which direction to look, and ends up matching some irrelevant text that happens to contain the word "theme." The answer naturally goes off track.&lt;/p&gt;

&lt;p&gt;This is fundamentally a granularity problem: vector RAG retrieves at the text chunk level, but "overall themes" require a macro-level understanding of the entire dataset. No single chunk can support that kind of answer.&lt;/p&gt;

&lt;p&gt;GraphRAG handles this easily with pre-built community clusters and community summaries, extracting themes directly from the macro structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Does GraphRAG Work?
&lt;/h2&gt;

&lt;p&gt;The entire process has two phases: offline indexing, then online question answering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Offline Indexing: Three Steps
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Documents
    │
    ▼
┌─────────────────────────────┐
│ Step 1: Entity &amp;amp; Relationship│  LLM processes documents chunk
│ Extraction                   │  by chunk, extracting all
│                              │  entities (people, places,
│                              │  organizations, etc.) and
│                              │  their relationships
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│ Step 2: Knowledge Graph      │  Assemble extracted entities
│ Construction                 │  and relationships into a
│                              │  complete graph structure
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│ Step 3: Community Detection  │  Perform bottom-up hierarchical
│ &amp;amp; Summarization              │  clustering on the graph (e.g.,
│                              │  Leiden algorithm), generate
│                              │  LLM summary reports for each
│                              │  community
└─────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short: first let the LLM extract all the people, events, things, and their relationships from the documents, assemble them into a large graph, then cluster the graph into groups and write a summary for each group.&lt;/p&gt;
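The three steps can be sketched as a pipeline. Every name here is a hypothetical placeholder (the real GraphRAG library has its own APIs); the sketch only shows the shape of the data flow:

```python
def build_index(chunks, extract_triples, detect_communities, summarize):
    # Step 1: LLM extracts (source, relation, target) triples per chunk.
    triples = [t for chunk in chunks for t in extract_triples(chunk)]

    # Step 2: assemble triples into a graph (adjacency lists of entities).
    graph = {}
    for src, rel, dst in triples:
        graph.setdefault(src, []).append((rel, dst))
        graph.setdefault(dst, [])

    # Step 3: cluster the graph (e.g., with the Leiden algorithm) and
    # pre-write an LLM summary report for each community.
    communities = detect_communities(graph)
    reports = {cid: summarize(members) for cid, members in communities.items()}
    return graph, reports
```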

&lt;h3&gt;
  
  
  Online Answering: Choose Strategy by Question Type
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question Type&lt;/th&gt;
&lt;th&gt;How to Find the Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Specific questions (e.g., "What has Novorossiya done?")&lt;/td&gt;
&lt;td&gt;Locate entity in graph → traverse relationships → collect related text → generate answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Macro questions (e.g., "Top 5 themes")&lt;/td&gt;
&lt;td&gt;Use community summaries directly → aggregate layer by layer → generate global answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Technical Points Worth Digging Into
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Use LLM for Graph Construction Instead of Traditional NLP?
&lt;/h3&gt;

&lt;p&gt;The traditional approach uses NER (Named Entity Recognition) + relation extraction models, but these have hard limitations: you need to predefine entity types and relation types, they break when you switch domains, and they can't capture implicit relationships.&lt;/p&gt;

&lt;p&gt;LLM advantages are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot capability&lt;/strong&gt; — no need to train separately for each domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can read between the lines&lt;/strong&gt; — for example, extracting the implicit "government attention" relationship from "the Attorney General's office reported the creation of Novorossiya"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not constrained by schema&lt;/strong&gt; — let the LLM discover entity and relationship types on its own&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is straightforward: LLM calls are expensive, and the indexing phase needs to process the entire dataset, so computational costs are significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Community Detection — GraphRAG's Killer Feature
&lt;/h3&gt;

&lt;p&gt;Many approaches use knowledge graphs to enhance RAG, but what truly sets GraphRAG apart is community detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses algorithms like Leiden to partition the knowledge graph into multi-level communities (think of them as "topic clusters")&lt;/li&gt;
&lt;li&gt;Pre-generates an LLM summary report for each community&lt;/li&gt;
&lt;li&gt;Different community levels correspond to different levels of abstraction; choose the right granularity when answering questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the secret behind its ability to answer "big questions" — no need to traverse the entire graph on the fly, just look up the pre-written summaries.&lt;/p&gt;

&lt;p&gt;When generating community reports, the LLM receives CSV tables of entities and relationships within that community: an Entities table (entity ID, name, description), a Relationships table (source, target, description, combined_degree), and an optional Claims table. Relationships are sorted by &lt;code&gt;combined_degree&lt;/code&gt; in descending order, prioritizing the most important ones, with truncation when the token limit is exceeded.&lt;/p&gt;
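&lt;p&gt;The rank-and-truncate step described above can be sketched in a few lines. Field names mirror the CSV columns mentioned in the text; the 4-characters-per-token estimate is a rough assumption of mine, not GraphRAG's actual tokenizer:&lt;/p&gt;

```python
# Sketch of community-report context assembly: relationships are ranked by
# combined_degree (descending) and greedily packed until a token budget is
# exhausted, at which point the rest are truncated.
def build_context(relationships: list[dict], max_tokens: int = 8000) -> list[dict]:
    ranked = sorted(relationships, key=lambda r: r["combined_degree"], reverse=True)
    picked, used = [], 0
    for rel in ranked:
        cost = len(rel["description"]) // 4 + 1  # crude token estimate
        if used + cost > max_tokens:
            break  # truncate once the budget is exceeded
        picked.append(rel)
        used += cost
    return picked
```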

&lt;h3&gt;
  
  
  Provenance — Every Statement Is Traceable
&lt;/h3&gt;

&lt;p&gt;GraphRAG places special emphasis on provenance. The complete evidence chain looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    → GraphRAG Answer + [Data: Entities (ID), Relationships (ID)]
        → Relationship IDs point to specific edges in the knowledge graph
            → Edges link back to specific passages in the original source documents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Answer → entities/relationships in the graph → original documents — fully traceable end to end. For enterprise applications, this capability is critical — you can verify every claim the AI makes.&lt;/p&gt;
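&lt;p&gt;In data-structure terms, the chain is just two lookups. The records and IDs below are invented for illustration; real GraphRAG output tables have more fields:&lt;/p&gt;

```python
# Minimal illustration of the provenance chain: a cited relationship ID
# resolves to an edge, and the edge's text_unit_ids point back to the
# source passages it was extracted from.
EDGES = {
    "R12": {"source": "NOVOROSSIYA", "target": "ATTORNEY GENERAL",
            "text_unit_ids": ["tu_0042"]},
}
TEXT_UNITS = {
    "tu_0042": "The Attorney General's office reported the creation of Novorossiya...",
}

def trace(relationship_id: str) -> list[str]:
    """Follow a [Data: Relationships (ID)] citation back to source passages."""
    edge = EDGES[relationship_id]
    return [TEXT_UNITS[tid] for tid in edge["text_unit_ids"]]
```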




&lt;h2&gt;
  
  
  How Were the Experiments Conducted?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;They used the VIINA dataset (violence information from news articles), chosen deliberately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Involves multi-party conflict with fragmented information — complex enough&lt;/li&gt;
&lt;li&gt;Includes news sources from both Russian and Ukrainian sides with opposing viewpoints and contradictory information&lt;/li&gt;
&lt;li&gt;Data from June 2023, ensuring it's not in the LLM's training set&lt;/li&gt;
&lt;li&gt;Thousands of articles, far exceeding context window limits — can't be handled without RAG&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluation Results
&lt;/h3&gt;

&lt;p&gt;Four metrics were used for scoring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;How It's Evaluated&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Comprehensiveness&lt;/td&gt;
&lt;td&gt;How complete is the answer&lt;/td&gt;
&lt;td&gt;LLM scorer pairwise comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human Empowerment&lt;/td&gt;
&lt;td&gt;Does it provide sources for verification&lt;/td&gt;
&lt;td&gt;LLM scorer pairwise comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diversity&lt;/td&gt;
&lt;td&gt;Does it answer from multiple perspectives&lt;/td&gt;
&lt;td&gt;LLM scorer pairwise comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;Does it hallucinate&lt;/td&gt;
&lt;td&gt;SelfCheckGPT absolute measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
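&lt;p&gt;For the first three metrics, pairwise LLM judgments reduce to a win-rate tally. A toy sketch (the judgment labels are invented, and tie handling is my own convention):&lt;/p&gt;

```python
# Tally pairwise LLM verdicts into a win rate for one system.
def win_rate(judgments: list[str], system: str) -> float:
    """Fraction of pairwise comparisons won by `system` (ties count half)."""
    score = sum(1.0 if j == system else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

judgments = ["graphrag", "graphrag", "tie", "naive"]  # hypothetical verdicts
# graphrag's win rate here: (1 + 1 + 0.5 + 0) / 4 = 0.625
```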

&lt;p&gt;The results are interesting: GraphRAG significantly outperforms traditional RAG on the first three metrics, but they're roughly equal on faithfulness. In other words, GraphRAG's improvement is mainly in "finding more comprehensively," not in "hallucinating less."&lt;/p&gt;




&lt;h2&gt;
  
  
  Don't Just Look at the Strengths — Know the Limitations Too
&lt;/h2&gt;

&lt;p&gt;This is a pitch piece after all, so it naturally emphasizes the positives. A few caveats to keep in mind:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High indexing cost&lt;/strong&gt; — Every document chunk requires an LLM call to extract entities and relationships. For large datasets, this could take hours or even days. With GPT-4 level models, API costs are considerable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental updates are a hard problem&lt;/strong&gt; — The article doesn't mention what happens when data changes. In practice, new documents require re-extraction and merging, and community structures may shift as a result, forcing re-clustering and regenerated summaries. There's no good engineering solution for this yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction quality depends on the LLM&lt;/strong&gt; — LLM entity and relationship extraction isn't 100% accurate. It may miss implicit entities or get relationships wrong, and different models produce extraction of varying, inconsistent quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queries will be slower&lt;/strong&gt; — Graph traversal + LLM generation has a longer pipeline than simple vector retrieval + LLM generation, so latency is naturally higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not every question needs it&lt;/strong&gt; — The article itself acknowledges that for simple factual queries (like "What is Novorossiya?"), traditional RAG is sufficient. GraphRAG's advantages are concentrated in multi-hop reasoning and global summarization scenarios.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Analogy to Build Your Intuition
&lt;/h2&gt;

&lt;p&gt;Imagine you're a new employee at a company, and you want to understand "the most important project developments in the last three months."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional RAG is like searching through a filing cabinet&lt;/strong&gt;: You walk into the archive room and search using "project developments" as a keyword. You find dozens of files scattered across different drawers — meeting minutes, emails, reports. You have to piece the fragments together yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG is like asking a colleague who knows everything&lt;/strong&gt;: They've not only read every document but also remember that "Zhang San's Project A and Li Si's Project B are actually related," and know that "last month's budget adjustment affected three departments." They can give you an organized, complete answer right away.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional RAG&lt;/th&gt;
&lt;th&gt;GraphRAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How it works&lt;/td&gt;
&lt;td&gt;Search keywords, find relevant passages&lt;/td&gt;
&lt;td&gt;Build a relationship network first, then answer along relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good at&lt;/td&gt;
&lt;td&gt;"What is X?" "How to do X?"&lt;/td&gt;
&lt;td&gt;"What's the relationship between X and Y?" "What's the overall picture?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analogy&lt;/td&gt;
&lt;td&gt;A librarian helping you find books&lt;/td&gt;
&lt;td&gt;A detective connecting clues into a complete story&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weakness&lt;/td&gt;
&lt;td&gt;Fragmented, lacks global perspective&lt;/td&gt;
&lt;td&gt;Building the relationship network takes time and compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphRAG doesn't solve the "search more accurately" problem — it solves the "search dimension" problem&lt;/strong&gt; — expanding from text similarity to entity relationships and global structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The knowledge graph is the means; community clustering is the real innovation&lt;/strong&gt; — Many approaches use graphs to enhance RAG, but community detection + pre-summarization is GraphRAG's unique weapon for global queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provenance is the foundation of trust&lt;/strong&gt; — Every assertion can be traced back to the original document. Enterprise applications can't do without this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The trade-off is indexing cost&lt;/strong&gt; — Using LLMs to process all data for graph construction is much more expensive than simple vectorization. This must be weighed when deploying in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not a replacement, but a complement&lt;/strong&gt; — Use GraphRAG for complex reasoning and global analysis, traditional RAG for simple factual queries. In real systems, combining both is the right approach.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>graphrag</category>
      <category>rag</category>
      <category>knowledgegraph</category>
      <category>communitydetection</category>
    </item>
    <item>
      <title>The Biggest Pitfall in GraphRAG: One Entity, Seven Identities</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:54:16 +0000</pubDate>
      <link>https://forem.com/eyanpen/the-biggest-pitfall-in-graphrag-one-entity-seven-identities-5d8d</link>
      <guid>https://forem.com/eyanpen/the-biggest-pitfall-in-graphrag-one-entity-seven-identities-5d8d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;You thought the hardest part of GraphRAG was "building the graph." In reality, the hardest part is "assigning entity types" — even when you've predefined a strict type schema.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. A Real-World Dataset
&lt;/h2&gt;

&lt;p&gt;We ran GraphRAG entity extraction on 3GPP TS 23.502 (the 5G Core Network signaling procedure specification). The document runs to more than 700 pages and is one of the most critical standards in the telecom domain.&lt;/p&gt;

&lt;p&gt;The results were painful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A total of &lt;strong&gt;8,873 distinct entities&lt;/strong&gt; were extracted (deduplicated by title)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,123 entities were assigned 2 or more types&lt;/strong&gt; — 12.7% of the total&lt;/li&gt;
&lt;li&gt;The most extreme case, &lt;code&gt;PMIC&lt;/code&gt;, was classified into &lt;strong&gt;7 different types&lt;/strong&gt;: &lt;code&gt;ARCHITECTURE_CONCEPT&lt;/code&gt;, &lt;code&gt;DATA_TYPE&lt;/code&gt;, &lt;code&gt;INFORMATION_ELEMENT&lt;/code&gt;, &lt;code&gt;MANAGEMENT_ENTITY&lt;/code&gt;, &lt;code&gt;NETWORK_ELEMENT&lt;/code&gt;, &lt;code&gt;PROCEDURE&lt;/code&gt;, &lt;code&gt;PROTOCOL&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that this experiment &lt;strong&gt;already used a strictly predefined entity type schema&lt;/strong&gt;, with the prompt explicitly constraining the LLM to only use the specified type set. In other words, this isn't chaos caused by "no constraints" — it's &lt;strong&gt;chaos that persists even after constraints are applied&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's worse, these "type conflicts" don't occur across different documents — they happen &lt;strong&gt;within the same document&lt;/strong&gt; and even &lt;strong&gt;within the same chunk&lt;/strong&gt;. When the LLM reads a minimal text segment, even with explicit type constraints, it still assigns different types to the same entity.&lt;/p&gt;

&lt;p&gt;We found &lt;strong&gt;63 text_unit-level overlapping conflicts&lt;/strong&gt; — the same entity annotated with two different types within the same text block. For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Labeled as&lt;/th&gt;
&lt;th&gt;Also labeled as&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AF&lt;/td&gt;
&lt;td&gt;ORGANIZATION&lt;/td&gt;
&lt;td&gt;NETWORK_FUNCTION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NRF&lt;/td&gt;
&lt;td&gt;INTERFACE&lt;/td&gt;
&lt;td&gt;NETWORK_FUNCTION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5G SECURITY CONTEXT&lt;/td&gt;
&lt;td&gt;SECURITY_ELEMENT&lt;/td&gt;
&lt;td&gt;ARCHITECTURE_CONCEPT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HPLMN&lt;/td&gt;
&lt;td&gt;NETWORK_FUNCTION&lt;/td&gt;
&lt;td&gt;ORGANIZATION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SERVICE REQUEST&lt;/td&gt;
&lt;td&gt;INFORMATION_ELEMENT&lt;/td&gt;
&lt;td&gt;PROCEDURE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't the LLM making rookie mistakes, nor is the schema poorly designed. Think about it: &lt;code&gt;AF&lt;/code&gt; (Application Function) genuinely is both a "network function" and an "organizational role"; &lt;code&gt;NRF&lt;/code&gt; is both a "network function" and exposes "interfaces." These types are all in our predefined schema, and the LLM picks a "legal" type every time — it just picks different legal types for the same entity. &lt;strong&gt;The problem isn't that the LLM judged wrong, nor that the schema isn't strict enough — it's that real-world entities are inherently not single-typed.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Why Is This Problem So Hard?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Entities Are Inherently Multi-Faceted
&lt;/h3&gt;

&lt;p&gt;In 3GPP specifications, the term &lt;code&gt;AMF&lt;/code&gt; (Access and Mobility Management Function):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In architecture diagrams, it's a &lt;strong&gt;NETWORK_FUNCTION&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In signaling procedures, it's a participant in a &lt;strong&gt;PROCEDURE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In deployment descriptions, it's a &lt;strong&gt;NETWORK_ELEMENT&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In interface definitions, it's an endpoint of an &lt;strong&gt;INTERFACE&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same entity plays different roles in different contexts. This isn't a bug — it's reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 LLM Type Judgment Depends on the Context Window
&lt;/h3&gt;

&lt;p&gt;GraphRAG entity extraction is performed chunk by chunk. Each text_unit is only a few hundred tokens, and the LLM sees nothing beyond that small segment.&lt;/p&gt;

&lt;p&gt;The same entity &lt;code&gt;PDU SESSION ESTABLISHMENT&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a chunk describing signaling procedures, the LLM classifies it as &lt;strong&gt;PROCEDURE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In a chunk describing message formats, the LLM classifies it as &lt;strong&gt;INFORMATION_ELEMENT&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both judgments are correct, but they conflict when merged into the knowledge graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 No Matter How Good the Schema, Type Boundaries Are Inherently Fuzzy
&lt;/h3&gt;

&lt;p&gt;We already predefined a type schema, but who defines the boundary between &lt;code&gt;ARCHITECTURE_CONCEPT&lt;/code&gt; and &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt;? In the 3GPP context, many concepts naturally span multiple categories. &lt;code&gt;POLICY CONTROL&lt;/code&gt; is both a "procedure" (PROCEDURE) and an "architectural concept" (ARCHITECTURE_CONCEPT) — both types are in our schema, and the LLM isn't wrong to pick either one.&lt;/p&gt;

&lt;p&gt;This isn't a problem of poorly written prompts or imprecise schema definitions — it's &lt;strong&gt;a fundamental tension between the granularity of type systems and the complexity of the real world&lt;/strong&gt;. You can make the schema more fine-grained, but a finer schema only creates more boundary issues, not fewer.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Scale Amplifies the Problem
&lt;/h3&gt;

&lt;p&gt;Our data shows that among entities with multiple types, the top 20 carry 4–7 types each and are associated with anywhere from 10 to 200 descriptions. A core entity like &lt;code&gt;AF&lt;/code&gt; has 209 descriptions, 192 text_unit references, and 4 types.&lt;/p&gt;

&lt;p&gt;When a knowledge graph contains thousands of such "multi-faceted entities," downstream community detection, relationship reasoning, and summary generation are all affected — because the graph structure is polluted by type noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. How Does the Industry Currently Address This?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: Predefined Strict Type System (Schema-First) ⚠️ We Already Tried This
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Before extraction, manually define a strict entity type schema and explicitly constrain the LLM in the prompt to only use these types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Microsoft GraphRAG's default configuration, most enterprise knowledge graph projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our actual results&lt;/strong&gt;: All the data at the beginning of this article was produced under Schema-First mode. We predefined the type set and explicitly constrained it in the prompt — yet 1,123 entities still had multi-type conflicts, and 63 text_unit-level overlapping conflicts persisted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's not enough&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema can constrain the LLM to "only pick from these types," but &lt;strong&gt;can't constrain it to "pick only one for the same entity"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Domain concepts are inherently multi-faceted; &lt;code&gt;AF&lt;/code&gt; in the 3GPP context genuinely is both NETWORK_FUNCTION and ORGANIZATION — no schema, however strict, changes this fact&lt;/li&gt;
&lt;li&gt;Requires domain experts to design the schema — high cost, and you need to redesign for each new domain&lt;/li&gt;
&lt;li&gt;Being too strict loses information — forcing &lt;code&gt;AF&lt;/code&gt; to be &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt; discards its semantics as &lt;code&gt;ORGANIZATION&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;: Schema-First is a necessary condition but not a sufficient one. It reduces the "random naming" problem but doesn't solve the fundamental contradiction of "one entity, multiple identities."&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Allow Multi-Types, Post-Processing Merge (Multi-Label + Post-Processing)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Don't limit the number of types during extraction; allow an entity to have multiple types, then merge, deduplicate, and select a primary type through rules or models in post-processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: LlamaIndex's PropertyGraphIndex, some academic research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preserves multi-faceted entity information&lt;/li&gt;
&lt;li&gt;No information loss during extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Post-processing logic is complex; rules are hard to enumerate exhaustively&lt;/li&gt;
&lt;li&gt;"Selecting a primary type" itself requires domain knowledge&lt;/li&gt;
&lt;li&gt;Graph complexity increases; query performance degrades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: Exploratory analysis, early stages where domain boundaries are uncertain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Hierarchical Typing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Build a hierarchical type system where, for example, &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt; is a subtype of &lt;code&gt;ARCHITECTURE_CONCEPT&lt;/code&gt;. Extract at the finest granularity; aggregate by hierarchy during queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Wikidata's type system, YAGO knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Balances precision and flexibility&lt;/li&gt;
&lt;li&gt;Supports queries at different granularities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing the hierarchy itself is a major undertaking&lt;/li&gt;
&lt;li&gt;LLMs struggle to accurately determine hierarchical relationships during extraction&lt;/li&gt;
&lt;li&gt;Cross-domain hierarchies are hard to unify&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: Large-scale, long-term knowledge graph projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 4: Abandon Explicit Types, Use Embeddings (Type-Free + Embedding)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Don't assign discrete type labels to entities; instead, use vector embeddings to represent semantic features. Similar entities naturally cluster in vector space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Some recent research, such as GNN-based entity representation learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completely avoids the type conflict problem&lt;/li&gt;
&lt;li&gt;Captures subtle semantic differences between entities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loses interpretability — you can't tell users "this is a network function"&lt;/li&gt;
&lt;li&gt;Downstream community detection and summary generation need redesign&lt;/li&gt;
&lt;li&gt;Difficult to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: Research projects, scenarios with low interpretability requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 5: Context-Aware Dynamic Typing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: Don't fix types during extraction; instead, dynamically determine entity types based on query context. For example, when a user asks about architecture, &lt;code&gt;AF&lt;/code&gt; is treated as &lt;code&gt;NETWORK_FUNCTION&lt;/code&gt;; when asking about organization, it's treated as &lt;code&gt;ORGANIZATION&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representatives&lt;/strong&gt;: Currently mostly in the academic exploration stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most aligned with reality — an entity's "identity" truly depends on context&lt;/li&gt;
&lt;li&gt;No difficult type decisions needed during extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely high engineering complexity&lt;/li&gt;
&lt;li&gt;Graph structure can't be determined during offline graph building; community detection algorithms are hard to apply&lt;/li&gt;
&lt;li&gt;Increased query latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suitable for&lt;/strong&gt;: A research direction for next-generation GraphRAG systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. My Recommendation: Schema-First Foundation + Layered Types + Primary Type Voting + Context Preservation
&lt;/h2&gt;

&lt;p&gt;Our experiments have proven that Schema-First is a necessary starting point — without it, types become even more chaotic. But it alone isn't enough. Based on our hands-on experience with 3GPP documents, I recommend layering a &lt;strong&gt;pragmatic post-processing approach&lt;/strong&gt; on top of Schema-First:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 0: Keep Schema-First (Already in Place)
&lt;/h3&gt;

&lt;p&gt;Continue using the predefined type schema to constrain the LLM. This step is already done; its value lies in keeping types within a finite set, preventing the LLM from freely inventing meaningless types like &lt;code&gt;THINGY&lt;/code&gt; or &lt;code&gt;STUFF&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Preserve All Types During Extraction
&lt;/h3&gt;

&lt;p&gt;On top of Schema-First, don't force a single type during extraction. If the LLM picks multiple types from the predefined set, keep them all. Preserve every (entity, type, text_unit) triple. This is the raw signal — once lost, it can't be recovered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Statistical Voting for Primary Type
&lt;/h3&gt;

&lt;p&gt;For each entity, count how many times it's annotated as each type across all text_units, and select the most frequent as the &lt;strong&gt;primary type&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Taking &lt;code&gt;AF&lt;/code&gt; as an example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NETWORK_FUNCTION: 150 occurrences → &lt;strong&gt;primary type&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ORGANIZATION: 30 occurrences&lt;/li&gt;
&lt;li&gt;ARCHITECTURE_CONCEPT: 20 occurrences&lt;/li&gt;
&lt;li&gt;NETWORK_ELEMENT: 9 occurrences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The primary type is used for the knowledge graph's main structure, community detection, and default queries.&lt;/p&gt;
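&lt;p&gt;The voting step is a one-liner with a counter. A minimal sketch, using the &lt;code&gt;AF&lt;/code&gt; counts from the list above (the function name is mine, not part of GraphRAG):&lt;/p&gt;

```python
from collections import Counter

# Layer 2 sketch: tally every (entity, type) annotation across text_units
# and pick the most frequent type as the primary type; the full
# distribution is kept for Layer 3.
def vote_primary_type(annotations: list[str]) -> tuple[str, dict]:
    """Return (primary_type, type_distribution) for one entity."""
    dist = Counter(annotations)
    primary, _ = dist.most_common(1)[0]
    return primary, dict(dist)

af_annotations = (["NETWORK_FUNCTION"] * 150 + ["ORGANIZATION"] * 30
                  + ["ARCHITECTURE_CONCEPT"] * 20 + ["NETWORK_ELEMENT"] * 9)
primary, dist = vote_primary_type(af_annotations)
# primary == "NETWORK_FUNCTION"; dist preserves the alternatives
```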

&lt;h3&gt;
  
  
  Layer 3: Preserve Alternative Types as Properties
&lt;/h3&gt;

&lt;p&gt;Other types aren't discarded — they're stored as the entity's &lt;code&gt;alternative_types&lt;/code&gt; property, available for use during queries as needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AF"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"primary_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NETWORK_FUNCTION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"alternative_types"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ORGANIZATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ARCHITECTURE_CONCEPT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NETWORK_ELEMENT"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type_distribution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NETWORK_FUNCTION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ORGANIZATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ARCHITECTURE_CONCEPT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"NETWORK_ELEMENT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 4: Type Conflict Detection and Manual Review
&lt;/h3&gt;

&lt;p&gt;For text_unit-level overlapping conflicts (same entity labeled as different types within the same chunk), flag them as candidates for review. These 63 conflicts are the most worth manually checking — they often reveal blind spots in the type system design.&lt;/p&gt;
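&lt;p&gt;Detecting these conflicts is straightforward once the (entity, type, text_unit) triples from Layer 1 are preserved. A sketch with invented sample rows:&lt;/p&gt;

```python
from collections import defaultdict

# Layer 4 sketch: flag any (text_unit, entity) pair that was annotated
# with two or more distinct types within the same chunk.
def find_overlapping_conflicts(triples):
    seen = defaultdict(set)               # (text_unit, entity) -> {types}
    for entity, etype, text_unit in triples:
        seen[(text_unit, entity)].add(etype)
    return {key: types for key, types in seen.items() if len(types) > 1}

triples = [
    ("AF", "ORGANIZATION", "tu_17"),
    ("AF", "NETWORK_FUNCTION", "tu_17"),  # same chunk, different type
    ("NRF", "NETWORK_FUNCTION", "tu_17"),
]
conflicts = find_overlapping_conflicts(triples)
# flags ("tu_17", "AF") as a candidate for manual review
```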

&lt;h3&gt;
  
  
  What's the Cost?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Increased storage&lt;/strong&gt;: Each entity stores multiple types and distribution info; graph data volume increases by roughly 20–30%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No change to extraction&lt;/strong&gt;: No need to modify prompts or extraction pipelines; no additional cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing development needed&lt;/strong&gt;: The voting, merging, and conflict detection pipeline requires additional development — roughly 2–3 days of engineering effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slightly more complex queries&lt;/strong&gt;: The query layer needs to decide whether to use the primary type or all types, but this logic can be encapsulated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can't be fully automated&lt;/strong&gt;: Text_unit-level conflicts still require human judgment, but the volume is manageable (only 63 in our case).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  5. Final Thoughts
&lt;/h2&gt;

&lt;p&gt;GraphRAG papers and blog posts always focus on the flashy capabilities like "community detection" and "global queries," but when it comes to real-world deployment, &lt;strong&gt;entity type chaos is the first roadblock&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One TS 23.502 document, 8,873 entities, 1,123 with multi-type conflicts — and this is &lt;strong&gt;after applying Schema-First constraints&lt;/strong&gt;. This isn't an edge case; it's the norm for all complex domain documents. Predefined type schemas are necessary but far from sufficient.&lt;/p&gt;

&lt;p&gt;There's no silver bullet for this problem. But at least we can: &lt;strong&gt;build on Schema-First, avoid losing information during post-processing, use statistical methods to select primary types, preserve multi-faceted nature for downstream use, and keep the conflicts that truly need human judgment within a manageable scope.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the gap between "running a demo" and "going to production" in GraphRAG — and it's the most important one to fill.&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>entitytyping</category>
      <category>knowledgegraph</category>
      <category>rag</category>
    </item>
    <item>
      <title>Why Do We Need GraphRAG? — The Evolution from "Search" to "Understanding"</title>
      <dc:creator>eyanpen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:49:37 +0000</pubDate>
      <link>https://forem.com/eyanpen/why-do-we-need-graphrag-the-evolution-from-search-to-understanding-4die</link>
      <guid>https://forem.com/eyanpen/why-do-we-need-graphrag-the-evolution-from-search-to-understanding-4die</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;When AI stops just "looking things up" and starts truly "understanding" your question.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Let's Start with an Everyday Scenario
&lt;/h2&gt;

&lt;p&gt;Imagine you're a new employee at a company. On your first day, you want to know "the most important project updates from the past three months."&lt;/p&gt;

&lt;p&gt;You have two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Dig through the filing cabinet&lt;/strong&gt;&lt;br&gt;
You walk to the archive room, open the filing cabinet, and search by the keyword "project updates." You find dozens of documents, but they're scattered across different drawers — some are meeting minutes, some are emails, some are reports. You have to piece these fragments together yourself to get a complete answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B: Ask a colleague who "knows everything"&lt;/strong&gt;&lt;br&gt;
This colleague has not only read every document but also remembers that "Project A led by Zhang San and Project B led by Li Si are actually related," and knows that "last month's budget adjustment affected three departments' plans." They can give you an organized, complete answer right away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A is traditional RAG (Retrieval-Augmented Generation).&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Option B is what GraphRAG aims to achieve.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  2. What Is RAG? It's Already Impressive — So Why Isn't It Enough?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What Is RAG
&lt;/h3&gt;

&lt;p&gt;RAG stands for Retrieval-Augmented Generation. Simply put, it lets the AI search a pile of documents for relevant content before answering your question, then generate a response based on what it found.&lt;/p&gt;

&lt;p&gt;It's like an open-book exam — AI can flip through references to find answers instead of relying purely on memory.&lt;/p&gt;
&lt;h3&gt;
  
  
  RAG's Limitations
&lt;/h3&gt;

&lt;p&gt;RAG is genuinely useful, but it has a fundamental weakness: &lt;strong&gt;it can "find" but it can't "connect."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, suppose you ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What impact has the company's business expansion in Asia-Pacific had on the supply chain?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Traditional RAG would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search for documents containing keywords like "Asia-Pacific," "business expansion," "supply chain"&lt;/li&gt;
&lt;li&gt;Find several relevant passages&lt;/li&gt;
&lt;li&gt;Hand these passages to the AI to generate an answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Where's the problem?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information about "Asia-Pacific business expansion" might be in a strategic report&lt;/li&gt;
&lt;li&gt;Information about "supply chain adjustments" might be in an operations report&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;connection&lt;/strong&gt; between these two reports — such as "because of Asia-Pacific expansion, a new Vietnamese supplier was added, causing logistics cost changes" — might &lt;strong&gt;not be explicitly stated in any single document&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What traditional RAG finds are isolated "fragments." It's not good at connecting the &lt;strong&gt;implicit relationships&lt;/strong&gt; between fragments.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. How Does GraphRAG Solve This Problem?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Core Idea: Build a "Relationship Network" First
&lt;/h3&gt;

&lt;p&gt;GraphRAG's key innovation is that before answering questions, it does something extra: &lt;strong&gt;it organizes all the information from documents into a "relationship network" (knowledge graph).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What does this relationship network look like? Think of it as a character relationship map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt; (circles): Represent individual "things" — people, companies, projects, locations, concepts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt; (arrows): Represent relationships between them — "responsible for," "belongs to," "affects," "collaborates with"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Zhang San] --responsible for--&amp;gt; [Project A]
[Project A] --depends on--&amp;gt; [Project B]
[Project B] --led by--&amp;gt; [Li Si]
[Project A] --budget from--&amp;gt; [Asia-Pacific Department]
[Asia-Pacific Department] --partners with--&amp;gt; [Vietnamese Supplier]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this network, when you ask "What's the relationship between Zhang San's project and the Vietnamese supplier?", the AI can "walk" through the network and discover:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Zhang San → Project A → Asia-Pacific Department → Vietnamese Supplier&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even if no single document ever directly mentions "the relationship between Zhang San and the Vietnamese supplier," the AI can reason out the answer through this path.&lt;/p&gt;
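The "walking" step above can be sketched in a few lines of plain Python. This is a minimal illustration using the toy triples from the example graph, not how a production GraphRAG system works (those extract entities with an LLM and query a graph database); the entity and relation names are just the hypothetical ones from the diagram.

```python
# A minimal sketch of "walking" the relationship network:
# breadth-first search over (subject, relation, object) triples,
# ignoring edge direction, to find how two entities are connected.
from collections import deque

# The toy triples from the example graph above
triples = [
    ("Zhang San", "responsible for", "Project A"),
    ("Project A", "depends on", "Project B"),
    ("Project B", "led by", "Li Si"),
    ("Project A", "budget from", "Asia-Pacific Department"),
    ("Asia-Pacific Department", "partners with", "Vietnamese Supplier"),
]

def find_path(triples, start, goal):
    """Return the chain of entities connecting start to goal, or None."""
    # Build an undirected adjacency map so we can hop in either direction
    neighbors = {}
    for s, rel, o in triples:
        neighbors.setdefault(s, []).append(o)
        neighbors.setdefault(o, []).append(s)
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

print(" -> ".join(find_path(triples, "Zhang San", "Vietnamese Supplier")))
# Zhang San -> Project A -> Asia-Pacific Department -> Vietnamese Supplier
```

No document states the Zhang San–Vietnamese Supplier link directly; the path falls out of chaining individual relationships, which is exactly the multi-hop reasoning the network enables.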

&lt;h3&gt;
  
  
  Plain-Language Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional RAG&lt;/th&gt;
&lt;th&gt;GraphRAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How it works&lt;/td&gt;
&lt;td&gt;Searches keywords, finds relevant passages&lt;/td&gt;
&lt;td&gt;Builds a relationship network first, then follows relationships to answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good at&lt;/td&gt;
&lt;td&gt;"What is X?" "How do I do X?"&lt;/td&gt;
&lt;td&gt;"What's the relationship between X and Y?" "What's the big picture?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analogy&lt;/td&gt;
&lt;td&gt;A librarian helping you find books&lt;/td&gt;
&lt;td&gt;A detective connecting clues into a complete story&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weakness&lt;/td&gt;
&lt;td&gt;Fragmented, lacks global perspective&lt;/td&gt;
&lt;td&gt;Building the relationship network takes time and compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. What Can GraphRAG Do for Us?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: Enterprise Knowledge Management
&lt;/h3&gt;

&lt;p&gt;A large company has thousands of internal documents: policies, procedures, meeting minutes, technical docs...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional approach&lt;/strong&gt;: Employees search by keywords, browse through many documents, summarize on their own&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG approach&lt;/strong&gt;: AI has already "understood" the relationships between all documents. Employees can directly ask "What was the root cause of increased customer complaints last quarter?" and the AI can provide a connected analysis across product changes, customer service records, supplier issues, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenario 2: Healthcare
&lt;/h3&gt;

&lt;p&gt;A patient's medical records, test reports, and medication history are scattered across different systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional approach&lt;/strong&gt;: Doctors review each one individually, relying on experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG approach&lt;/strong&gt;: AI builds a network connecting patient information, medications, diseases, and test results. It can flag that "the Drug A the patient is currently taking may interact with the newly prescribed Drug B, because both act on the same metabolic pathway"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenario 3: Financial Risk Control
&lt;/h3&gt;

&lt;p&gt;A bank needs to assess the risk of a loan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional approach&lt;/strong&gt;: Review the borrower's credit report and financial data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG approach&lt;/strong&gt;: AI discovers that the borrower's company and another company that has already defaulted share the same ultimate beneficial owner, and this connection is hidden within multiple layers of equity structures — uncovering these "hidden relationships" is exactly where GraphRAG excels&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenario 4: Everyday Q&amp;amp;A Assistant
&lt;/h3&gt;

&lt;p&gt;You're using an AI assistant to learn about a complex topic like "climate change."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional approach&lt;/strong&gt;: AI gives you a general overview of climate change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG approach&lt;/strong&gt;: AI can tell you "climate change affects agricultural yields, which in turn affects food prices, which ultimately affects social stability in developing countries" — this kind of &lt;strong&gt;multi-hop reasoning&lt;/strong&gt; (from A to B to C to D) is GraphRAG's core advantage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. GraphRAG Isn't a Silver Bullet
&lt;/h2&gt;

&lt;p&gt;After all these benefits, let's be honest about its limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building the relationship network has costs&lt;/strong&gt;: Converting large volumes of documents into a knowledge graph requires time and compute resources. For small-scale, simple Q&amp;amp;A scenarios, traditional RAG may be sufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The quality of the relationship network is critical&lt;/strong&gt;: If the AI misunderstands a relationship during graph construction, subsequent reasoning will also be wrong. Just like a detective who connects clues incorrectly will reach the wrong conclusion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not every question needs it&lt;/strong&gt;: If you just want to look up "What's the company's expense reimbursement process?", traditional search can answer that perfectly well — no need to deploy GraphRAG.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  6. Summary
&lt;/h2&gt;

&lt;p&gt;The essence of GraphRAG is evolving AI from "keyword search" to "relationship reasoning."&lt;/p&gt;

&lt;p&gt;It's not meant to replace traditional RAG but to add a layer of "understanding relationships" on top of it. It's like upgrading from "looking up a dictionary" to "reading an encyclopedia" — a dictionary tells you what each word means; an encyclopedia also tells you how those words are connected.&lt;/p&gt;

&lt;p&gt;For scenarios that involve processing large amounts of complex information, discovering hidden connections, and requiring a global perspective, GraphRAG is a direction worth paying attention to.&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>rag</category>
      <category>knowledgegraph</category>
      <category>retrievalaugmentedgeneration</category>
    </item>
  </channel>
</rss>
