Forem: eyanpen

The "Ghost Clone" of Community Reports in GraphRAG: Why the Same Report Gets Created Twice

eyanpen — Tue, 26 May 2026 01:52:57 +0000

Symptom

When querying the Top 10 nodes by HAS_REPORT edge count in FalkorDB, we found 4 community_report nodes each with 4 HAS_REPORT edges pointing to them. By design, each community should map to exactly one report — so why the one-to-many relationship?

Edge type: HAS_REPORT
Rank  Title                                                          Count
1     Tech Dept Core Team: Backend Architecture & System Design        4
2     Product Dept: User Growth & Monetization Strategy                4
3     Ops Dept: Service Stability & Monitoring System                  4
4     QA Dept: Quality Assurance & Test Automation                     4

In theory each community has one report, each report belongs to one community, and HAS_REPORT should be a 1:1 relationship.

An Intuitive Example

Imagine You're Managing a Company's Org Chart

Suppose your company has this department structure:

Tech Dept (278 people)
  └── Backend Team (253 people)

"Backend Team" is a sub-department of "Tech Dept." Now HR needs to write a department brief for each.

HR discovers that the core members of "Backend Team" heavily overlap with "Tech Dept" (the backend team IS the main force of the tech department), so the AI generates nearly identical briefs for both:

Department	Brief Title	Headcount
Tech Dept (community 1491)	"Core Tech Team: Backend Architecture & System Design"	278
Backend Team (community 2790)	"Core Tech Team: Backend Architecture & System Design"	253

The two briefs have identical titles and content (because they essentially describe the same group of people), differing only in "headcount" (size).

Because the content is identical, the system computes the same ID for both (content-based hash).

Mapping to the 4 actual problem groups we found:

Dept Analogy	Actual community	Brief Title	Size
Tech Dept	community 1491	"Tech Dept Core Team: Backend Architecture & System Design"	278
└── Backend Team	community 2790	"Tech Dept Core Team: Backend Architecture & System Design"	253
Product Dept	community 200	"Product Dept: User Growth & Monetization Strategy"	796
└── Product Team 1	community 1100	"Product Dept: User Growth & Monetization Strategy"	631
Ops Dept	community 1909	"Ops Dept: Service Stability & Monitoring System"	180
└── Ops Team 1	community 3073	"Ops Dept: Service Stability & Monitoring System"	178
QA Dept	community 953	"QA Dept: Quality Assurance & Test Automation"	21
└── QA Team 1	community 2343	"QA Dept: Quality Assurance & Test Automation"	19

Where's the Problem?

When importing this data into the graph database:

Step 1: Create report nodes

Taking "Tech Dept" and "Backend Team" as an example. The system sees two rows in the parquet with the same ID but different communities, and blindly creates two nodes:

Report Node A: {id: "abc123", community: 1491, size: 278}  -- Tech Dept's brief
Report Node B: {id: "abc123", community: 2790, size: 253}  -- Backend Team's brief

Step 2: Create HAS_REPORT edges

The system iterates over each report record and matches report nodes by id:

-- Processing Tech Dept (community 1491)
MATCH (c:communities {community: 1491})
MATCH (r:community_reports {id: "abc123"})  -- Matches 2 nodes (A and B)!
CREATE (c)-[:HAS_REPORT]->(r)
-- Result: Tech Dept → Node A, Tech Dept → Node B (2 edges)

-- Processing Backend Team (community 2790)
MATCH (c:communities {community: 2790})
MATCH (r:community_reports {id: "abc123"})  -- Also matches 2 nodes!
CREATE (c)-[:HAS_REPORT]->(r)
-- Result: Backend Team → Node A, Backend Team → Node B (2 edges)

Final result: This report title has 4 HAS_REPORT edges (2 departments × 2 same-ID nodes = 4).

The correct result should be: Tech Dept → Tech Dept's brief (1 edge), Backend Team → Backend Team's brief (1 edge), totaling 2 edges.

Root Cause Analysis

The problem is caused by two factors compounding:

1. Leiden Hierarchical Clustering Produces Identical Reports

GraphRAG uses the Leiden algorithm for hierarchical community detection. When a sub-community's members heavily overlap with its parent community, the LLM generates nearly identical reports for both. Since report IDs are content-based hashes, identical content → identical IDs.

Actual data verification:

report id	communities	sizes	Hierarchy
6516e2f4...	2790, 1491	253, 278	2790 is a sub-community of 1491
feda9fa0...	1100, 200	631, 796	1100 is a sub-community of 200
d8f25d09...	2343, 953	19, 21	2343 is a sub-community of 953
223c76c6...	3073, 1909	178, 180	3073 is a sub-community of 1909

2. Import Logic Lacks Deduplication and Precise Matching

In the import code:

# Node creation: unconditional CREATE, no deduplication
"UNWIND $batch AS p CREATE (n:community_reports) SET n = p"

# Edge creation: matches only by id, no community condition
"MATCH (r:community_reports {id: p.rid})"  # Matches multiple same-ID nodes → Cartesian product

Solution

Precise Matching When Creating HAS_REPORT

When creating HAS_REPORT edges, match on both id and community to avoid the Cartesian product:

# Before (buggy)
"MATCH (r:community_reports {id: p.rid}) "

# After (fixed)
"MATCH (r:community_reports {id: p.rid, community: p.cnum}) "

This way each community only matches the report node that belongs to it, creating exactly 1 edge.

Lesson learned: When using the MATCH + CREATE pattern to create relationships in a graph database, if the match condition isn't precise enough (target nodes have duplicates), you'll get unexpected Cartesian products. Always ensure MATCH conditions can uniquely locate the target node.

Known Pitfall in DeepEval Faithfulness Metric: "idk" Verdicts Don't Penalize the Score

eyanpen — Fri, 22 May 2026 02:29:23 +0000

Background

While using DeepEval to evaluate a GraphRAG system in a no-reference setting, we discovered that FaithfulnessMetric can produce misleading perfect scores under certain conditions.

Observed Behavior

We asked GraphRAG a complex question about the 5GC PDU Session establishment procedure. The system returned a detailed technical answer (covering specific responsibilities of AMF, SMF, UPF, PCF, etc.), but the retrieved context contained only the table of contents from 3GPP documents, such as:

The document contains a section '5.6 Session Management' with several sub-subsections.
The document contains a section '5.2 Network Access Control' with several sub-subsections.

The context contained no substantive technical content, yet the Faithfulness score was 1.00 (perfect).

Root Cause Analysis

The Faithfulness metric evaluation consists of 4 steps:

Step	Purpose
1. Truths extraction	Extract factual statements from retrieval_context
2. Claims extraction	Extract claims from actual_output
3. Verdicts	Compare each claim against context, assign `yes`/`no`/`idk`
4. Score calculation	Compute final score from verdicts

The key lies in Step 3's verdict rules:

yes — claim is consistent with context
no — claim directly contradicts context
idk — context contains no relevant information to judge

And Step 4's default scoring formula:

score = (total - no_count) / total

idk does not count as a penalty. Only explicit contradictions (no) reduce the score.

Real-World Example

In our evaluation, the LLM judge (after switching to a stricter model) assigned idk to all 20 claims:

{
  "verdicts": [
    {"verdict": "idk"},
    {"verdict": "idk"},
    ...  // 20 total, all idk
  ]
}

Score calculation: score = (20 - 0) / 20 = 1.00

The final reason output:

"The score is 1.00 because there are no contradictions; the actual output fully aligns with the retrieval context."

This is clearly misleading — none of the claims in the answer are supported by the context, but since none are "contradicted" either, the score is perfect.

The Fundamental Issue

Faithfulness measures "is there a contradiction with the context", not "is the answer supported by the context".

These are entirely different dimensions:

Scenario	Faithfulness	Groundedness
Answer fully based on context	High	High
Answer correct but context irrelevant	High (no contradiction)	Low (no support)
Answer contradicts context	Low	Low

When retrieval context contains only table-of-contents or summary-level information, it's nearly impossible for any specific claim to "directly contradict" it, so Faithfulness will always be perfect.

Solutions

Solution 1: Enable `penalize_ambiguous_claims`

DeepEval provides a built-in parameter:

FaithfulnessMetric(model=model, threshold=0.5, penalize_ambiguous_claims=True)

With this enabled, the scoring formula becomes:

score = (total - no_count - idk_count) / total

Now 20 claims all judged idk yields: (20 - 0 - 20) / 20 = 0.00, which more accurately reflects how well the context supports the answer.

Solution 2: Add a Groundedness Metric

Use GEval to define a custom Groundedness metric that directly evaluates whether the answer is supported by context:

GEval(
    name="Groundedness",
    criteria="Determine whether the actual output is fully supported and grounded by the retrieval context. "
             "Penalize claims in the output that cannot be traced back to specific information in the retrieval context.",
    evaluation_params=[SingleTurnParams.INPUT, SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.RETRIEVAL_CONTEXT],
    model=model,
    threshold=0.5,
)

Recommendation

Use both solutions together:

Keep Faithfulness (with penalize_ambiguous_claims enabled) to detect contradictions and unsupported claims
Add Groundedness to positively evaluate support coverage
Note Faithfulness limitations in reports to avoid misinterpretation

Additional Pitfall: Summary Claims Misjudged as "idk"

Even when the context contains specific detailed information, if the actual output summarizes those details, the judge may still assign idk.

Real-World Example

The context contained specific procedural details about PDU Session establishment (AMF handling registration, SMF selecting UPF, N4 session setup, etc.), while the actual output included a summary claim:

"From the UE attempting to access a specific DNN to achieving effective user plane forwarding, the entire process involves close cooperation among multiple core network elements, each playing an indispensable role."

The judge's verdict:

{
  "verdict": "idk",
  "reason": "The claim is a summary statement; the context provides specific procedural details but does not directly confirm this overall description."
}

Cause

The Faithfulness prompt imposes strict constraints on the judge:

"Only use 'no' if retrieval context DIRECTLY CONTRADICTS the claim — never use prior knowledge."

"Use 'idk' for claims not backed up by context — do not assume your knowledge."

The judge is required to perform literal-level matching, not semantic-level reasoning. Even though the context details fully support the summary through logical inference, since the context doesn't "directly confirm" the statement, the judge can only assign idk.

Impact

For RAG systems, answers are expected to synthesize and summarize context — this is normal and desired behavior. However, Faithfulness's literal-level matching treats such reasonable summaries as "unsupported," causing scores to drop when penalize_ambiguous_claims is enabled.

Possible Improvement

DeepEval's FaithfulnessMetric supports an evaluation_template parameter. You can inherit from FaithfulnessTemplate and modify the verdict guidelines to include "summaries that can be reasonably inferred from context details" in the yes category. However, this changes the semantics of the evaluation criteria and should be used cautiously.

Conclusion

The Faithfulness metric was designed to detect hallucination — whether the model fabricates information that contradicts the context. However, it has limitations on two levels:

"idk" doesn't penalize by default — always perfect when context is irrelevant (solved with penalize_ambiguous_claims=True)
Literal-level matching is too strict — reasonable summaries are judged as unsupported (requires custom templates or supplementary Groundedness metrics)

When evaluating RAG systems, both Faithfulness and Groundedness dimensions must be considered to comprehensively assess answer quality.

How FalkorDB Stores Edges: Why Neighbor Lookup Is O(degree)

eyanpen — Wed, 20 May 2026 07:35:07 +0000

Many people have a question when they first see FalkorDB's architecture:

It doesn't use traditional adjacency lists but maintains edges with sparse matrices — how does it efficiently find all edges of a given node?

And a follow-up question:

If neighbor data is already stored contiguously, why is the query complexity still O(degree) instead of O(1)?

1. How Traditional Graph Databases Store Edges

Traditional graph databases (like Neo4j) typically use:

Adjacency List

For example:

A -> B
A -> C
A -> D

Internally it looks more like:

A:
  edge1 -> edge2 -> edge3

That is:

Each node maintains its own edge linked list
To find all edges of a node:
- Simply traverse the linked list

Therefore the complexity is:

O(degree)

Where:

degree = number of edges

For example:

out_degree
Number of outgoing edges
in_degree
Number of incoming edges

2. FalkorDB Is Completely Different: Sparse Matrix

FalkorDB's core design is not an adjacency list.

It is based on:

Sparse Matrix
GraphBLAS

to maintain the entire graph.

For example:

A(id=0) -> B(id=1)

Internal representation:

M[0,1] = edge_id

Meaning:

source=0
target=1

An edge exists.

3. One Matrix Per Edge Type

For example:

(:User)-[:FRIEND]->(:User)
(:User)-[:LIKES]->(:Post)

FalkorDB maintains:

FRIEND matrix
LIKES matrix

This way during traversal:

No need to scan the entire graph.

4. How Multi-edges Are Maintained

FalkorDB supports:

A -[:CALL]-> B
A -[:CALL]-> B
A -[:CALL]-> B

Therefore a matrix cell cannot simply be:

M[0,1] = 1

It's more like:

M[0,1] = [3,8,15]

That is:

edge ids

Essentially similar to:

sparse tensor
compressed adjacency structure

5. How to Efficiently Find Edges?

Many people mistakenly think:

0 0 0 1 0 0 1 1 0

Means:

You must scan the entire row to find the 1s.

That's completely wrong.

Because:

Sparse Matrix Doesn't Store Zeros At All

6. What Does a Sparse Matrix Actually Store?

For example:

[0,0,0,1,0,0,1,1,0]

The actual storage looks more like:

[3,6,7]

Meaning:

index 3 has an edge
index 6 has an edge
index 7 has an edge

Zeros don't exist at all.

Therefore:

Finding neighbors of node A:

neighbors(A) = [3,6,7]

Return directly.

7. CSR / CSC: Industrial-Grade Sparse Matrix Structures

Real implementations typically use:

CSR (Compressed Sparse Row)
CSC (Compressed Sparse Column)

For example:

Matrix:

A: 0 0 0 1 0 0 1 1
B: 1 0 0 0 0 0 0 0
C: 0 1 0 0 1 0 0 0

CSR might store it as:

indices = [3,6,7,0,1,4]
row_ptr = [0,3,4,6]

Explanation:

A's data is at indices[0:3]
B's data is at indices[3:4]
C's data is at indices[4:6]

So:

Finding all edges of A:

indices[0:3]

Gives us:

[3,6,7]

8. Why Is the Complexity Still O(degree)?

This is the most commonly misunderstood point.

Many people ask:

Since [3,6,7] is already in contiguous memory,
isn't a direct memcpy just O(1)?

The answer:

Locating the array is O(1)

But:

Traversing the array is still O(k)

Where:

k = degree

9. What Does Algorithmic Complexity Actually Measure?

For example:

MATCH (a)-[e]->()
RETURN e

The database doesn't just return:

an array pointer

It must:

Traverse each edge
Decode the edge object
Construct the result set
Return to the client

Therefore:

for edge in neighbors:
    emit(edge)

Must execute:

degree times

So the overall complexity is:

O(degree)

10. Output-sensitive Complexity

This is a classic concept:

The size of the output itself counts toward complexity

For example:

If:

A has 1 million edges

Even if:

finding the array start

Only takes:

O(1)

But:

Returning 1 million edges:

Cannot be:

O(1)

Because:

You must at least "look at" each element.

11. Why Is FalkorDB Still Fast?

Because:

[3,6,7]

Is:

Contiguous memory
Cache-friendly
SIMD-friendly

The CPU can:

Prefetch
Vector load
Branch prediction

While traditional adjacency lists:

edge1 -> edge2 -> edge3

Involve:

Pointer chasing

Which causes:

Cache misses
Memory stalls
Branch mispredictions

Therefore:

FalkorDB has clear advantages in:

High fan-out traversal
Multi-hop pattern matching
Graph analytics
GraphRAG

scenarios.

12. Neo4j vs FalkorDB: The Essential Difference

Neo4j is more like:

nodes + edge linked lists

Suited for:

OLTP
Single-hop queries
High-frequency edge updates

FalkorDB is more like:

a graph computation engine

Suited for:

Multi-hop traversal
Pattern matching
Graph analytics
Vectorized computation

For example:

(A)-[:F]->(B)-[:F]->(C)

Neo4j:

pointer traversal

FalkorDB:

matrix multiply

That is:

F × F

This is its biggest architectural difference.

13. Final Summary

FalkorDB's core philosophy:

Don't store "empty"
Only store "existing edges"

Therefore:

0 0 0 1 0 0 1

Actually becomes:

[3,6]

Querying all edges of a node:

Locating adjacency data:
- O(1)
Returning all edges:
- O(degree)

Where:

degree = number of edges for the current node

Not:

total number of edges in the entire graph

This is the core performance model of a Sparse Matrix graph database.

14. Does Splitting Edges Into Multiple Types vs. a Single Type Affect Query Speed?

A common question:

Since locating edges is O(1) and returning edges is O(degree),
does categorizing edges into one type vs. multiple types affect query speed?

The answer: It depends on whether the query specifies an edge type.

When the Query Specifies an Edge Type

For example:

MATCH (a)-[:FRIEND]->(b) RETURN b

FalkorDB only scans the FRIEND matrix.

If all edges are categorized as a single type (e.g., :REL), the matrix contains all edges, making the degree larger.

Multiple types = smaller matrices = less traversal = faster.

When the Query Does Not Specify an Edge Type

For example:

MATCH (a)-[]->(b) RETURN b

FalkorDB needs to merge results from multiple matrices.

In this case:

Total traversal volume is the same (total degree)
Multiple types have slight merge overhead
Single type traverses one matrix directly

The difference is minimal, approximately no impact.

Summary

Scenario	Single Type vs. Multiple Types	Impact
Query specifies edge type	Multiple types faster	Only scans the corresponding matrix, smaller degree
Query does not specify edge type	Nearly no difference	Same total degree, slight merge overhead with multiple types

Practical modeling recommendation:

Splitting into multiple types is the better practice.
Most real-world queries specify a relationship type, and splitting types significantly reduces the number of edges that need to be traversed.

Orphan Communities in GraphRAG Hierarchical Clustering: Why Some Communities Have No PARENT_OF Edges

eyanpen — Wed, 20 May 2026 04:33:03 +0000

The Phenomenon

After building a knowledge graph with GraphRAG, you query a community node and discover it has no PARENT_OF relationships — neither a parent nor any children. Yet the graph clearly contains many PARENT_OF edges. Why was this community "forgotten"?

Background: GraphRAG's Hierarchical Community Structure

GraphRAG uses the Leiden algorithm to perform hierarchical clustering on the entity graph. To make this intuitive, let's use a "world map" analogy to explain the entire process.

Imagine You're Grouping Everyone in the World

Suppose you have a massive social network graph where each node is a person and edges represent "these two people are connected." Now you need to group them:

Level 0 (coarsest granularity): First divide by the largest circles — equivalent to splitting everyone into "continents." People within the same continent are closely connected; connections between continents are sparse.
Level 1: Further divide within each continent — equivalent to splitting into "countries."
Level 2: Divide within each country — equivalent to "provinces/states."
Level 3, 4, ...: Continue dividing into "cities," "neighborhoods"...

The higher the level, the finer the granularity.

Each layer connects to the next through PARENT_OF edges (coarse → fine):

Continent ──PARENT_OF──> Country ──PARENT_OF──> Province ──PARENT_OF──> City
(level 0)              (level 1)             (level 2)              (level 3)

A Complete Example

Suppose we run GraphRAG hierarchical clustering on a "Global Cuisine Knowledge Graph." The entities are various ingredients, dishes, and cooking techniques, with edges representing their associations.

First Round of Clustering (Level 0): 5 Major Groups

Community	Representative Entities	Size
Continent A "Asian Cuisine"	Rice, soy sauce, wok, tofu, miso...	800
Continent B "European Cuisine"	Olive oil, cheese, bread, red wine, butter...	600
Continent C "American Cuisine"	Corn, chili peppers, avocado, BBQ...	400
Continent D "African Cuisine"	Cassava, peanut sauce, couscous...	200
Continent E "Antarctic Research Station Cafeteria"	Canned food, hardtack, instant coffee	3

Second Round of Clustering (Level 1): Subdividing Within Groups

Continent A "Asian Cuisine" (800 entities) has complex internal structure and can be further divided:

Continent A "Asian Cuisine" (level 0, size=800)
  ├── PARENT_OF → Country A1 "Chinese Cuisine" (level 1, size=300)
  │     ├── PARENT_OF → Province A1a "Sichuan Cuisine" (level 2, size=80)
  │     ├── PARENT_OF → Province A1b "Cantonese Cuisine" (level 2, size=70)
  │     └── PARENT_OF → Province A1c "Shandong Cuisine" (level 2, size=50)
  ├── PARENT_OF → Country A2 "Japanese Cuisine" (level 1, size=200)
  ├── PARENT_OF → Country A3 "Southeast Asian Cuisine" (level 1, size=150)
  └── PARENT_OF → Country A4 "Korean Cuisine" (level 1, size=100)

What about Continent E "Antarctic Research Station Cafeteria" (3 entities)?

Continent E "Antarctic Research Station Cafeteria" (level 0, size=3)
  ├── Canned food
  ├── Hardtack
  └── Instant coffee

  (That's it — no outgoing PARENT_OF edges)

The relationships among these 3 entities:

Canned food ↔ Hardtack (both are long-shelf-life foods)
Canned food ↔ Instant coffee (both are ready-to-eat items)
Hardtack ↔ Instant coffee (both are research station staples)

They're closely related, so they're grouped together. But with only 3 members — you can't split 3 people into "departments" and "teams." That would be absurd.

Meanwhile, Continent E's external connections are extremely sparse — only "canned food" has one weak link to Continent B's "canned olive oil." This connection is too weak for the algorithm to merge Continent E into Continent B.

Result: Continent E becomes an orphan — it can neither be subdivided downward nor merged into another group.

Why Do Orphans Occur? Two Conditions Must Be Met Simultaneously

                    ┌─────────────────────────┐
                    │  Community too small     │
                    │  (2~9 entities)          │
                    │  Cannot subdivide further│
                    └───────────┬─────────────┘
                                │
                                ▼
                    ┌─────────────────────────┐
                    │  Becomes an orphan       │
                    │  Community               │
                    │  No PARENT_OF edges      │
                    └───────────┬─────────────┘
                                │
                    ┌───────────┴─────────────┐
                    │  Extremely weak external │
                    │  connections             │
                    │  (1~2 cross-group edges) │
                    │  Not worth merging into  │
                    │  another group           │
                    └─────────────────────────┘

The Leiden algorithm's criterion is modularity:

Subdivide downward: Split 3 people into 2 groups? Each group would have 1-2 people — no statistical significance, modularity won't improve. Abandoned.
Merge into others: Only 1 weak connection to the nearest large group; forcing a merge would reduce that group's cohesion. Abandoned.

The Data Speaks

Returning to real GraphRAG data, the statistics perfectly confirm this pattern:

Orphan communities (no PARENT_OF edges):

Community	Size (entity count)
Orphan 1	9
Orphan 2	7
Orphan 3	5
Orphan 4	3
Orphan 5	2

Normal communities (have PARENT_OF edges, participate in hierarchical subdivision):

Community	Size (entity count)
Normal 1	2,511
Normal 2	2,330
Normal 3	1,571
Normal 4	688
Normal 5	685

The pattern is crystal clear: the larger the size, the more likely it participates in the hierarchy; the smaller the size, the more likely it becomes an orphan.

In one real knowledge graph, level 0 had 41 communities total — 23 participated normally in hierarchical subdivision, while 18 became orphans. All orphans had sizes between 2 and 9.

Impact on GraphRAG Queries

Global Search

Global Search traverses community reports at a certain level to answer questions. If it chooses to traverse level 1 reports:

✅ Normal communities' information appears in level 1 sub-community reports
❌ Orphan communities have no level 1 sub-communities; their information won't appear in any level 1+ reports

Analogy: If you only read "country-level" reports, the Antarctic research station cafeteria's information won't appear in any country's report — because it doesn't belong to any country.

Local Search

Local Search finds relevant entities directly through entity vector matching, independent of the hierarchical structure. So entities within orphan communities can still be retrieved by Local Search.

Practical Impact

Since orphan communities are very small (2-9 entities) and contain limited information, their impact on most queries is minimal. But if your query happens to involve this "edge knowledge," you should be aware of this blind spot.

Summary

Feature	Normal Community	Orphan Community
Size	Tens to thousands	2~9
Analogy	Continents/Countries/Provinces (large populations)	Antarctic research station (3 people)
Internal structure	Complex, can be subdivided layer by layer	Too simple, cannot be subdivided
External connections	Extensive interactions with other groups	Almost isolated from the outside
PARENT_OF edges	Yes (pointing to finer sub-communities)	None
Global Search visibility	Information propagates through reports at all levels	Only visible in level 0 reports

The Leiden hierarchical clustering algorithm's behavior is just like the real world, where the Antarctic research station truly doesn't belong to any country's administrative division — it's too small and too isolated; forcing it into some country would be unreasonable. The algorithm makes the same judgment: communities too small cannot be further subdivided, and communities with connections too weak to the outside won't be forcibly merged.

GraphRAG Local Search Text Unit Selection Strategy: Design Trade-offs and Improvement Directions

eyanpen — Fri, 15 May 2026 00:49:08 +0000

Introduction

GraphRAG's Local Search needs to select the most relevant raw text fragments (Text Units) associated with the knowledge graph to fill the LLM context window during query time. This selection strategy seems simple — sort by entity similarity, fill one by one — but in real-world scenarios it exposes a significant limitation: popular entities can monopolize the entire Text Unit budget, causing key text from other entities to be truncated.

This article provides an in-depth analysis of the root cause of this problem, the core problem it was designed to solve, and possible improvement directions.

What Is the Current Strategy

Local Search's Text Unit selection has four steps:

Iterate through selected entities (ranked by vector similarity), collecting each entity's associated text_unit_ids
Deduplication: each TU is attributed only to the first entity encountered
Sorting: by (entity_index, -num_relationships) — entity order takes priority, within the same entity sorted by relationship density in descending order
Fill into context one by one until reaching the token limit (default 50% of total budget, approximately 6000 tokens)

Core code:

for index, entity in enumerate(selected_entities):
    entity_relationships = [rel for rel in relationships if rel.source == entity.title or rel.target == entity.title]
    for text_id in entity.text_unit_ids or []:
        if text_id not in text_unit_ids_set and text_id in self.text_units:
            num_relationships = count_relationships(entity_relationships, self.text_units[text_id])
            text_unit_ids_set.add(text_id)
            unit_info_list.append((self.text_units[text_id], index, num_relationships))

unit_info_list.sort(key=lambda x: (x[1], -x[2]))

Problem Scenario: Popular Entities Monopolize the Budget

Concrete Example

Suppose the user asks: "What is the anti-inflammatory mechanism of chamazulene?"

Entities returned by vector search:

Rank	Entity	Associated TU Count	Notes
0	Chamomile	50	High-frequency entity, mentioned in almost all herbal documents
1	Chamazulene	4	Active component of chamomile, fewer specialized references
2	NF-κB pathway	2	Specific anti-inflammatory molecular mechanism

TU attribution after deduplication:

index 0 "Chamomile": TU1, TU2, TU3, ..., TU50  (50 items)
index 1 "Chamazulene": TU51, TU52              (TU1, TU5 already claimed by Chamomile)
index 2 "NF-κB":  TU53                    (only 1 unclaimed)

Sorting result:

TU1(index=0, rel=5) → TU2(index=0, rel=4) → ... → TU50(index=0, rel=0)
→ TU51(index=1, rel=2) → TU52(index=1, rel=1)
→ TU53(index=2, rel=1)

Assuming a token budget of 6000 tokens and each TU averaging 300 tokens, only about 20 TUs can fit.

Result: All top 20 positions are occupied by "Chamomile" TUs. The text about "chamazulene's anti-inflammatory mechanism" that the user actually cares about (TU51, TU52, TU53) is entirely truncated. The context fed to the LLM is filled with generic introductions about "Chamomile" but contains no original text supporting chamazulene's specific molecular mechanisms.

Why It Was Designed This Way: What Problem It Solves

This strategy was not designed arbitrarily — it solves a more fundamental problem: ensuring that the most semantically relevant entities receive the most comprehensive original text support.

The Scenario It Addresses

Suppose the user asks: "What is the status of chamomile in European traditional medicine?"

Vector search returns:

Rank	Entity	Associated TU Count
0	Chamomile	50
1	European Herbalism	8
2	Lavender	30

In this scenario, "Chamomile" is indeed the most core entity — the user is asking about it. If a round-robin strategy were used (taking 1 TU from each entity in turn), then "Lavender's" 30 TUs would split the budget equally with "Chamomile" — but the user never asked about lavender.

The advantages of the current strategy:

Respects semantic ranking: The entity with the highest vector similarity gets the most original text support, which is correct in most cases
Relationship density sorting ensures quality: Among multiple TUs for the same entity, the most information-dense ones come first
Deduplication avoids redundancy: The same TU won't appear repeatedly because it's associated with multiple entities

Core Trade-off

This is a classic relevance depth vs. coverage breadth trade-off:

The current strategy chooses depth: ensuring the most relevant entity has sufficient original text evidence
The cost is breadth: secondary entities may have no original text support at all

For most "questions about a specific entity" (the design target of Local Search), depth-first is reasonable. The problem emerges when queries involve cross-entity relationships.

The Essence of the Problem: A Single Sorting Dimension Cannot Express Multi-Objective Optimization

Text Unit selection is fundamentally a multi-objective optimization problem:

Relevance: The semantic relevance of a TU to the query (expressed indirectly through entity ranking)
Information density: The number of relationships contained in a TU
Coverage: Ensuring every selected entity has original text support
Diversity: Avoiding homogeneous content flooding the context

The current strategy uses a single tuple (entity_index, -num_relationships) attempting to optimize the first two objectives simultaneously, but completely ignores the latter two.

Improvement Directions

Approach 1: Per-Entity Cap

The simplest improvement — set a TU contribution cap for each entity:

MAX_TU_PER_ENTITY = 5

for index, entity in enumerate(selected_entities):
    count = 0
    for text_id in entity.text_unit_ids or []:
        if count >= MAX_TU_PER_ENTITY:
            break
        if text_id not in text_unit_ids_set and text_id in self.text_units:
            # ... addition logic unchanged
            count += 1

Pros: Simple to implement, guarantees each entity at least has a chance to contribute TUs
Cons: Cap value is hard to determine; if an entity genuinely needs extensive original text support, it gets artificially limited

Approach 2: Round-Robin

Each round takes 1 TU from each entity (selecting the best by relationship density), cycling until the budget is exhausted:

entity_queues = {i: sorted_tus_for_entity_i for i in range(len(selected_entities))}
result = []
while budget > 0 and any(entity_queues.values()):
    for i in range(len(selected_entities)):
        if entity_queues[i]:
            tu = entity_queues[i].pop(0)
            result.append(tu)
            budget -= token_count(tu)

Pros: Guarantees coverage, every entity has original text support
Cons: Depth of the most relevant entity is diluted; lower-ranked irrelevant entities also receive equal budget

Approach 3: Weighted Quota Allocation

Allocate TU quotas based on entity vector similarity scores:

# Assuming similarity scores: [0.95, 0.82, 0.71]
scores = [0.95, 0.82, 0.71]
total = sum(scores)
quotas = [int(max_tus * s / total) for s in scores]
# quotas ≈ [15, 13, 11] (assuming max_tus=39)

Pros: Balances depth and breadth; higher-relevance entities get more quota without monopolizing
Cons: Increased implementation complexity; requires preserving similarity scores from vector search results (not retained in current code)

Approach 4: Minimum Guarantee + Remaining Competition

Guarantee each entity at least N TUs (e.g., 2), with remaining budget competed for using the current strategy:

# Phase 1: Guarantee 2 best TUs per entity
for entity in selected_entities:
    guaranteed_tus = top_2_by_relationship_density(entity)
    result.extend(guaranteed_tus)

# Phase 2: Fill remaining budget using original sorting strategy
remaining = all_tus - guaranteed_tus
remaining.sort(key=lambda x: (x.entity_index, -x.num_relationships))
fill_until_budget(remaining)

Pros: Guarantees coverage while preserving the depth advantage of the original strategy
Cons: If many entities are selected, the guarantee phase may consume significant budget

Summary

Dimension	Current Strategy	Issue
Relevance depth	✅ Excellent	—
Information density	✅ Excellent	—
Coverage breadth	❌ Missing	Popular entities monopolize budget
Content diversity	❌ Missing	Homogenization risk

GraphRAG's current Text Unit selection strategy is a "depth-first" design that performs well for "questions about a single entity" scenarios, but exposes insufficient coverage when queries involve multi-entity cross-relationships.

The most pragmatic improvement is Approach 4 (Minimum Guarantee + Remaining Competition) — it guarantees that every selected entity has at least some original text support with minimal code changes, without breaking the original strategy's advantages in mainstream scenarios.

Why Gold Answers Are Becoming Less Important in GraphRAG Systems

eyanpen — Tue, 12 May 2026 08:10:41 +0000

Traditional RAG evaluation relies on human-annotated "standard answers," but in the GraphRAG era, this approach is losing its relevance.

What Is a Gold Answer?

A Gold Answer is a human-annotated "standard correct answer." In traditional NLP and RAG system evaluation, the process typically goes like this:

Prepare a batch of test questions
Have humans write the "correct answer" for each question
Let the system answer the same questions
Compare system answers against Gold Answers, calculating F1, BLEU, ROUGE, and other scores

This approach has worked for years in search engines and simple Q&A systems. But in complex systems like GraphRAG, the value of Gold Answers is declining rapidly.

Knowledge Graphs Evolve Continuously — Gold Answers Can't Keep Up

The Core Problem

The heart of GraphRAG is the knowledge graph. Graphs aren't static — every document update, every re-extraction of entities and relationships changes the graph. Today's "correct answer" might be outdated tomorrow.

Example

Suppose your company has an internal technical architecture document:

January version: The document states "the order service uses MySQL"
March version: After an architecture upgrade, it now reads "the order service uses PostgreSQL + Redis cache"

The Gold Answer you annotated in January is:

Q: What database does the order service use?

A: MySQL

By March, the GraphRAG system has re-indexed the new documents and correctly answers "PostgreSQL + Redis." But if you still evaluate against the January Gold Answer, the system gets marked as "wrong."

A More Realistic Scenario

In enterprise environments, document update frequency is much higher than most people imagine:

API documentation changes weekly
Organizational structures are adjusted quarterly
Technology choices may be overhauled every six months

After each document update, you need to re-annotate Gold Answers. For an evaluation set with 500 test questions, each update might require modifying 30% of the answers — that means re-reviewing 150 answers every time.

Human Annotation of Gold Answers Is Extremely Costly and Unreliable

The Core Problem

The questions GraphRAG handles often involve multi-hop reasoning and cross-document correlation. For such questions, even human experts struggle to provide a single "uniquely correct" answer.

Example

Suppose the question is:

"Among the projects Zhang San is responsible for, which ones use EOL (End of Life) technology stacks?"

To answer this, annotators need to:

Find which projects Zhang San is responsible for (possibly scattered across 5 documents)
Find the technology stack for each project (yet more documents)
Determine which stacks are EOL (requires external information)
Synthesize all the above into an answer

Suppose the ground truth is that Zhang San is responsible for 4 projects, 3 of which use EOL tech stacks. After an hour of document review, the annotator writes this Gold Answer:

Project A (Spring Boot 2.5), Project B (Log4j 1.x), Project C (Python 2.7)

But the annotator missed Project D — because Zhang San's responsibility for Project D was documented in meeting minutes, not in the official project assignment sheet.

Now look at the evaluation results:

System	Answer	Score Against Gold Answer
Traditional RAG	Found Projects A, B (missed C)	Recall 2/3 = 0.67
GraphRAG	Found Projects A, B, C, D (discovered D through relationship reasoning in meeting minutes)	Recall 3/3 = 1.0, but Precision 3/4 = 0.75 (D judged as "extraneous")

The irony: GraphRAG gets penalized for being more correct than the Gold Answer. It discovered information through the graph's relationship chain (Zhang San → attended meeting → meeting resolution → responsible for Project D) that even the annotator missed, but in the evaluation framework, this "extra correct answer" is treated as an error.

Final F1 scores:

Traditional RAG: F1 = 0.80
GraphRAG: F1 = 0.86

GraphRAG clearly found more complete and accurate results, yet its score advantage is negligible — and in some evaluation settings (like strict exact matching), it might even score lower than traditional RAG. The Gold Answer ceiling limits the ability to identify superior systems.

The Cost Calculation

Annotating a single complex GraphRAG test question might take a domain expert 30-60 minutes (requiring cross-referencing multiple documents). If you need 200 test questions, that's 100-200 hours of expert time. And these answers might only remain valid for a few months (see the first point above).

GraphRAG Answers Are Inherently Diverse in Form

The Core Problem

Traditional RAG typically answers factual questions ("What is X?"), where answers are relatively fixed. But GraphRAG excels at relationship reasoning and comprehensive analysis — questions where the "correct answer" naturally has multiple valid expressions.

Example

Question:

"In our microservices architecture, which services have circular dependencies?"

GraphRAG might answer:

Answer A: Service A → Service B → Service C → Service A forms a cycle; Service D and Service E call each other.

Answer B: Two groups of circular dependencies exist: (1) A-B-C triangular cycle (2) D-E bidirectional dependency. Recommend prioritizing decoupling the A-B-C cycle as it involves the core transaction path.

Answer C: Circular dependency path detected: A→B→C→A. Additionally, D↔E has bidirectional calls, but since they use asynchronous messaging, the actual impact is minimal.

All three answers are "correct," but with different emphases. Using any single one as the Gold Answer would unfairly penalize other equally correct responses.

Traditional Metrics Fail

Comparing the three answers above using ROUGE scores:

Answer A vs Answer B: ROUGE-L might be only 0.3 (completely different wording)
Answer A vs Answer C: ROUGE-L might be 0.5 (some overlap)

But from an information correctness perspective, all three should receive full marks. The Gold Answer + text similarity metric combination completely fails here.

LLM-as-Judge Is Replacing Gold Answers

The Core Problem

Given all these issues with Gold Answers, the industry is shifting toward a new evaluation paradigm: using LLMs as judges (LLM-as-Judge), directly evaluating answer quality rather than comparing against "standard answers."

Example

Traditional approach:

System answer: "PostgreSQL + Redis"
Gold Answer: "MySQL"
ROUGE score: 0.0  → Judged as incorrect ❌

LLM-as-Judge approach:

Question: "What database does the order service use?"
System answer: "PostgreSQL + Redis"
Reference document: [Latest architecture doc, clearly states PostgreSQL + Redis]

LLM judgment: Answer is consistent with documentation, information is accurate, score 5/5 ✅

Advantages of LLM-as-Judge:

Dimension	Gold Answer	LLM-as-Judge
Requires human annotation	Extensive manual work	Not needed
Adapts to document updates	Requires re-annotation	Automatically adapts (references latest docs)
Handles multiple valid expressions	Cannot	Can (understands semantic equivalence)
Evaluation cost	High (manual)	Low (API calls)
Evaluation speed	Slow (days/weeks)	Fast (minutes)

GraphRAG Evaluation Dimensions Far Exceed "Answer Correctness"

The Core Problem

Gold Answers can only evaluate one dimension: whether the answer content is correct. But GraphRAG system quality depends on many other factors that Gold Answers simply cannot measure.

Example

For the same question, two GraphRAG systems both give correct answers, but the quality differs dramatically:

System A's response:

Zhang San is responsible for Project X, which uses Spring Boot 2.5 (EOL).

System B's response:

Zhang San is responsible for Project X, which uses Spring Boot 2.5 (maintenance ended November 2023). Additionally, the project depends on Log4j 1.x (EOL since 2015, with known security vulnerability CVE-2019-17571). Recommend referring to the internal migration guide [link] for upgrading.

Both answers might score identically against the Gold Answer, but System B is clearly more valuable — it provides more complete information, security risk alerts, and actionable recommendations.

The Dimensions That Actually Matter

For GraphRAG systems, we should focus on:

Graph coverage: Are entities and relationships being completely extracted?
Reasoning path explainability: Which nodes and edges did the system traverse to reach its conclusion?
Information completeness: Are important related details being missed?
Timeliness: Is the referenced information current?
Actionability: Does the answer provide executable recommendations?

None of these dimensions can be evaluated by Gold Answers.

How Should We Evaluate GraphRAG Then?

Since Gold Answers are no longer a silver bullet, here are evaluation strategies better suited for GraphRAG:

LLM-as-Judge + dimension decomposition: Have LLMs score separately on accuracy, completeness, relevance, and other dimensions
Source document fact-checking: Verify whether each fact in the answer can be traced back to source documents
Graph quality metrics: Directly evaluate knowledge graph entity coverage and relationship accuracy
End-to-end user satisfaction: Have real users evaluate whether answers solved their problems
Regression testing over absolute scoring: Focus on quality changes before and after system updates, rather than pursuing absolute scores

Final Thoughts

Gold Answers aren't entirely worthless — for simple factual Q&A and system cold-start phases, they remain a useful baseline. But in complex systems like GraphRAG, over-reliance on Gold Answers introduces three risks:

False sense of security: High Gold Answer scores don't mean the system is actually useful
Maintenance burden: The cost of continuously updating Gold Answers may exceed the value they provide
Evaluation blind spots: Gold Answers cannot cover GraphRAG's most important quality dimensions

Rather than spending enormous effort maintaining a set of "standard answers" destined to become outdated, invest that energy into more modern, comprehensive evaluation systems. GraphRAG evaluation should be like GraphRAG itself — dynamic, multi-dimensional, and based on understanding rather than rote memorization.

Why Does Semantic Chunking Need an Embedding API?

eyanpen — Mon, 04 May 2026 05:54:39 +0000

Fixed-length chunking requires no external services, yet semantic chunking absolutely needs an Embedding API — why?

The Short Answer

The core idea of semantic chunking is to split text at semantic boundaries. Determining whether "two pieces of text belong to the same topic" requires converting text into vectors and computing similarity — that's exactly what the Embedding API does.

Traditional Chunking vs Semantic Chunking

Dimension	Fixed-Length / Recursive	Semantic Chunking
Split criteria	Character count, token count, delimiters	Semantic similarity between adjacent sentences
Requires Embedding	❌ No	✅ Yes
Split quality	May break in the middle of a topic	Splits at topic transitions, preserving semantic coherence

Fixed-length chunking is like measuring paper with a ruler — regardless of content, it cuts every 500 characters. Semantic chunking is like a reader who, after finishing a paragraph, asks "is the next part still about the same thing?" If not, that's where the cut goes.

Two Mainstream Semantic Chunking Strategies

Strategy 1: Adjacent Similarity (Kamradt Method)

Core idea: Compute semantic distances between adjacent sentences and split where distances spike.

Process:
1. Split text into small sentences
2. For each sentence, concatenate buffer_size adjacent sentences as context
3. Call Embedding API to get vectors for each combined sentence
4. Compute cosine distances between adjacent combined sentences
5. Use binary search to find a threshold, split where distance exceeds it

Pseudocode:

# Step 1: Build context windows
for i, sentence in enumerate(sentences):
    combined = ""
    for j in range(max(0, i - buffer_size), i):
        combined += sentences[j] + " "
    combined += sentence
    for j in range(i + 1, min(n, i + 1 + buffer_size)):
        combined += " " + sentences[j]
    combined_texts.append(combined)

# Step 2: Get embeddings for all combined sentences (one batch call)
embeddings = embedding_client.embed_texts(combined_texts)
embedding_matrix = np.array(embeddings)

# Step 3: Compute cosine distances only between adjacent sentences
distances = []
for i in range(len(sentences) - 1):
    similarity = dot(embedding_matrix[i], embedding_matrix[i + 1])
    distances.append(1 - similarity)  # Higher distance = greater topic difference

# Step 4: Binary search for threshold targeting total_size / avg_chunk_size cuts
threshold = binary_search_threshold(distances, target_cuts)

# Step 5: Split where distance exceeds threshold
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

Intuition: Imagine reading an article sentence by sentence, asking yourself after each one: "Is the next sentence still about the same thing?" When you feel the topic has jumped, you cut there.

Key characteristic: Only looks at adjacent relationships. It only computes the distance between sentence[i] and sentence[i+1] — a local greedy strategy.

Strategy 2: Cluster Optimal Segmentation (Dynamic Programming Method)

Core idea: Build a similarity matrix between all sentence pairs and use dynamic programming to find the segmentation that maximizes intra-cluster similarity.

Process:
1. Split text into small sentences
2. Call Embedding API to get vectors for all sentences
3. Build an N×N similarity matrix
4. Normalize the matrix by subtracting the mean (prevents degeneration into one giant cluster)
5. Use dynamic programming to find the optimal segmentation

Pseudocode:

# Step 1: Get embeddings for all sentences (note: no buffer concatenation)
embeddings = embedding_client.embed_texts(sentences)
embedding_matrix = np.array(embeddings)

# Step 2: Build N×N similarity matrix
similarity_matrix = dot(embedding_matrix, embedding_matrix.T)

# Step 3: Mean normalization to prevent DP from putting everything in one cluster
mean_sim = mean(upper_triangle(similarity_matrix))
similarity_matrix -= mean_sim
fill_diagonal(similarity_matrix, 0)

# Step 4: Dynamic programming for optimal segmentation
# dp[i] = maximum intra-cluster similarity sum for the first i+1 sentences
for i in range(n):
    for size in range(1, i + 2):
        start = i - size + 1
        if cluster_size(start, i) > max_chunk_size and size > 1:
            break
        reward = sum(similarity_matrix[start:i+1, start:i+1])
        if start > 0:
            reward += dp[start - 1]
        dp[i] = max(dp[i], reward)

# Step 5: Backtrack to get optimal segmentation
clusters = backtrack(segmentation)

Key characteristic: Globally optimal. It considers relationships between all sentence pairs and uses DP to find the overall best segmentation.

Deep Comparison of the Two Strategies

Fundamental Algorithmic Differences

Dimension	Kamradt (Adjacent Similarity)	Cluster (Dynamic Programming)
Scope	Local — only adjacent sentences	Global — all sentence pairs
Decision method	Greedy: cut when distance exceeds threshold	Optimization: maximize intra-cluster similarity
Threshold	Binary search for target cut count	No threshold needed, DP decides automatically
Context enhancement	✅ buffer_size concatenation	❌ Uses raw sentences directly
Size constraints	avg_chunk_size + max_chunk_size dual constraint	max_chunk_size hard constraint

The core difference in one sentence:

Kamradt asks: "Is there a topic transition between these two adjacent sentences?"
Cluster asks: "Which grouping makes sentences within each group most similar to each other?"

An Intuitive Example

Consider 6 sentences with the following topic distribution:

Sentence 1: Discussing Apple's earnings report
Sentence 2: Discussing Apple's new products
Sentence 3: Discussing the weather forecast
Sentence 4: Discussing tomorrow's temperature
Sentence 5: Discussing Apple's stock price
Sentence 6: Discussing Apple's competitors

Kamradt's approach: Compare adjacent pairs

Sentence 2→3: Topic jump (Apple → weather), cut!
Sentence 4→5: Topic jump (weather → Apple), cut!
Result: [1,2] [3,4] [5,6]

Cluster's approach: The global similarity matrix shows sentences 1,2,5,6 are highly similar to each other

But since DP requires contiguous segmentation (can't skip around), it can only cut contiguous spans
Result is likely also [1,2] [3,4] [5,6], but the reasoning is different

The key difference emerges when boundaries are fuzzy:

Consider an article that gradually transitions from "EV technology" to "energy policy":

Sentence 1: Tesla released a new generation of battery technology
Sentence 2: The new battery's energy density improved by 50%
Sentence 3: Higher energy density means longer driving range
Sentence 4: Range anxiety has been a barrier for consumers buying EVs
Sentence 5: The government introduced charging station subsidies to address this
Sentence 6: Subsidies cover both residential and commercial charging facilities
Sentence 7: Commercial charging uses time-of-use electricity pricing
Sentence 8: Time-of-use pricing is a key component of electricity market reform

What Kamradt sees (adjacent distances):

1→2: 0.08  (both about batteries)
2→3: 0.10  (battery → range, very close)
3→4: 0.12  (range → range anxiety, very close)
4→5: 0.15  (consumers → government policy, slightly far but not outstanding)
5→6: 0.09  (both about subsidies)
6→7: 0.13  (subsidies → pricing, somewhat far)
7→8: 0.11  (both about pricing)

No single distance clearly "spikes" — the topic slides gradually. Kamradt's binary search struggles to find a reasonable threshold, potentially producing suboptimal splits like [1-4][5-8] or [1-3][4-6][7-8].

What Cluster sees (global similarity matrix summary):

        S1    S2    S3    S4    S5    S6    S7    S8
S1      --   0.9   0.7   0.4   0.2   0.1   0.1   0.05
S2           --    0.8   0.5   0.2   0.15  0.1   0.05
S3                 --    0.6   0.3   0.2   0.15  0.1
S4                       --    0.5   0.4   0.3   0.2
S5                             --    0.8   0.6   0.4
S6                                   --    0.7   0.5
S7                                         --    0.8
S8                                               --

The global view clearly shows: sentences 1-3 are highly similar to each other (battery/range technology), sentences 5-8 are highly similar to each other (policy/pricing), and sentence 4 is a transition. DP optimization discovers that [1-3][4-8] or [1-4][5-8] maximizes intra-cluster similarity, producing a more reasonable split.

The essential difference: Kamradt only looks at "the gap between adjacent sentences" — in a gradual transition, each step's gap is small, like the boiling frog metaphor. Cluster looks at "the overall similarity within each group" — even when the transition is smooth, it can still detect that sentence 1 and sentence 8 are essentially unrelated.

Embedding Cost Comparison

This is one of the most important practical differences between the two strategies:

Dimension	Kamradt	Cluster
Embedding input	combined_sentence (with buffer context)	Raw sentences (no buffer)
Embedding call count	N texts, 1 batch call	N texts, 1 batch call
Average text length	Longer (~7 sentences, buffer_size=3)	Shorter (1 sentence)
Total token consumption	Higher (buffer causes input inflation)	Lower (no redundancy)
Post-embedding computation	O(N) — only adjacent distances	O(N²) — full similarity matrix
DP computation	None	O(N × max_cluster_size)

Concrete Numbers (1000 sentences, ~30 tokens each)

Kamradt:

Embedding input: 1000 combined_sentences, each ~7×30 = 210 tokens
Total token consumption: 1000 × 210 = 210,000 tokens
Distance computation: 999 dot products → negligible
Memory: 1000 × embedding_dim matrix

Cluster:

Embedding input: 1000 raw sentences, each ~30 tokens
Total token consumption: 1000 × 30 = 30,000 tokens
Similarity matrix: 1000 × 1000 = 1 million floats (~8MB)
DP computation: O(1000 × max_cluster_size) iterations

Conclusions:

Embedding API cost: Kamradt consumes ~7x more tokens (due to buffer concatenation), higher API cost
Compute resources: Cluster's O(N²) matrix and DP are more expensive on local CPU/memory
Network latency: Same for both (both use 1 batch call, or multiple calls based on batch_size)

Large-Scale Scenario (100,000 sentences)

Metric	Kamradt	Cluster
Total embedding tokens	~21 million tokens	~3 million tokens
API calls (batch_size=500)	200	200
Similarity computation	99,999 dot products	10 billion dot products (N² matrix)
Memory usage	~400MB (embedding matrix)	~40GB (N² similarity matrix) ⚠️

At 100K sentences, Cluster's N² matrix will blow up memory — this is its hard limitation. In practice, Cluster is better suited for medium-length documents (hundreds to thousands of sentences), while Kamradt can handle any length.

Split Quality Comparison

Scenario	Kamradt	Cluster
Clear topic boundaries	✅ Excellent, obvious distance spikes	✅ Excellent
Gradual topic transitions	⚠️ May fail to find split points	✅ Global optimization still finds best split
Short documents (<50 sentences)	✅ Fast	✅ Higher quality
Long documents (>10K sentences)	✅ Linear scaling	❌ Memory explosion
Very short sentences	⚠️ Needs buffer for context	⚠️ Short sentence embeddings are low quality

How to Choose?

Your Scenario	Recommended	Reason
Unknown document length, need general solution	Kamradt	Linear complexity, won't blow memory
Short documents (<2000 sentences), want optimal splits	Cluster	Globally optimal, higher quality
Embedding API charges per token	Cluster	No buffer inflation, 7x fewer tokens
Limited local compute resources	Kamradt	O(N) computation, memory-friendly
Fuzzy topic boundaries, need precise splits	Cluster	DP global optimization is more robust

Why Can't Other Methods Replace Embedding?

Alternative	Problem
Keyword overlap / TF-IDF	Cannot capture synonyms or contextual semantics ("automobile" and "vehicle" would be considered unrelated)
Rule-based delimiters (paragraphs, periods)	One paragraph may contain multiple topics; different paragraphs may discuss the same topic
LLM direct judgment	Too expensive, high latency, unsuitable for batch processing tens of thousands of sentences

Embedding maps text into a high-dimensional semantic space where semantically similar texts have small vector distances and dissimilar texts have large distances. This is currently the optimal approach for semantic similarity measurement, balancing cost, speed, and quality.

buffer_size: The Role of the Context Window

Semantic chunking has a key parameter buffer_size (default: 3) that determines how much context is concatenated when generating embeddings for each sentence.

# Concatenation logic
for each sentence[i]:
    combined = sentence[i-3] + sentence[i-2] + sentence[i-1]  # 3 before
              + sentence[i]                                     # current
              + sentence[i+1] + sentence[i+2] + sentence[i+3]  # 3 after

Key point: buffer_size does not affect the number of Embedding calls — only the length of each input text.

With 10 sentences, whether buffer_size is 1 or 10, you still embed 10 combined_sentences. The difference is how much context each text contains:

buffer_size	Avg sentences per text	Effect
1	~3	Less context, may misjudge
3 (default)	~7	Balance point
10	~21	Rich context, but may exceed model token limit

Note: Embedding models have input length limits (e.g., BGE-M3 max 8192 tokens). If buffer_size is too large, texts get truncated, potentially losing the current sentence's information.

Performance at Scale

Suppose a long document is split into 100,000 sentences:

Texts to embed = 100,000
With batch_size of 500, actual API calls = 100,000 ÷ 500 = 200 HTTP requests

The performance bottleneck is API call count (determined by total sentences and batch_size), independent of buffer_size.

Fallback Strategy: What If Embedding Is Unavailable?

Good system design should account for Embedding service unavailability. The common approach: when Embedding calls fail, automatically fall back to recursive chunking (pure rule-based splitting, no Embedding needed).

This means semantic chunking is an enhancement, not a dependency — the system still works without the Embedding service, just with lower split quality.

Summary

Question	Answer
Why is Embedding needed?	Judging semantic similarity requires vector representations
Can rules replace it?	No, rules cannot capture semantics
Can LLM replace it?	Theoretically yes, but cost and latency are unacceptable
Kamradt vs Cluster core difference?	Local adjacent comparison vs global optimal segmentation
Which has higher Embedding cost?	Kamradt: higher token consumption (buffer inflation); Cluster: higher compute cost (N² matrix)
Which for large documents?	Kamradt — linear complexity, won't blow memory
Which for optimal splits?	Cluster — global DP optimization, but limited to medium-length documents
What if service is unavailable?	Both fall back to rule-based chunking

The Embedding API is the "eyes" of semantic chunking — without it, the chunking algorithm is a blind person cutting a cake. The two strategies "see" text differently: Kamradt is like a line-by-line scanner, Cluster is like an editor with a bird's-eye view. Which to choose depends on your document scale and split quality requirements.

Multiple Independent Questions: Batch Into One Request or Split Into Many? — An Analysis of LLM Concurrent Processing

eyanpen — Sun, 03 May 2026 00:19:18 +0000

When you have 5 unrelated questions, should you pack them into one message to the LLM, or send 5 requests simultaneously? Which is faster?

The Short Answer

Splitting into multiple independent parallel requests is almost always faster.

This isn't a gut feeling — it's determined by the underlying inference mechanism of LLMs. Let's walk through the reasoning from first principles.

1. How LLMs Generate Text: Autoregressive Decoding

To understand this problem, you first need to know how LLMs "write."

LLMs (GPT-4, Claude, etc.) use autoregressive generation: they produce one token at a time, append that token back to the input, then generate the next token. This repeats until generation is complete.

The key insight: Generating N tokens requires N forward passes.

This means:

A 100-token answer requires 100 inference steps
A 500-token answer requires 500 inference steps
Total output length directly determines total latency

2. Batched Request: Output Volumes Stack, Latency Grows Linearly

Suppose you have 5 independent questions, each requiring ~200 tokens to answer.

Approach A: Combine into one request

You stuff all 5 questions into a single message:

Please answer the following questions separately:
1. xxx
2. xxx
3. xxx
4. xxx
5. xxx

The LLM needs to generate total output ≈ 5 × 200 = 1000 tokens. Due to autoregressive decoding, these 1000 tokens are generated sequentially — token #201 must wait for the first 200 to finish.

Total latency ≈ 1000 × per-token generation time

Plus additional overhead:

The LLM must maintain context switches between answers ("now answering question 3")
Longer KV Cache means increasing attention computation at each step
Actual output often exceeds 1000 tokens (formatting, transition phrases, etc.)

3. Split Requests: Parallel Inference, Latency Equals the Slowest One

Approach B: Split 5 questions into 5 independent requests, sent simultaneously

Each request independently generates ~200 tokens. If the server has sufficient concurrent processing capacity (all modern LLM services do), these 5 requests are processed in parallel.

Total latency ≈ max(individual request latencies) ≈ 200 × per-token generation time

Comparison:

Approach	Total output tokens	Actual latency (relative)
Combined request	~1000+	~1000 steps (sequential)
Split into 5 requests	~200 each	~200 steps (parallel)

Theoretical speedup ≈ 5x (equals the number of questions).

4. Why Does Parallelism Work? — Server-Side Continuous Batching

You might ask: doesn't the LLM server have capacity limits? Won't 5 simultaneous requests queue up?

Modern LLM inference engines (vLLM, TensorRT-LLM, TGI, etc.) all implement Continuous Batching:

Multiple requests share the same GPU matrix operation: GPUs excel at parallel computation. Combining tokens from 5 requests into one batch allows a single forward pass to generate one token for each request simultaneously.
Dynamic scheduling: Different requests have different output lengths. Shorter ones finish first, and their slots are immediately given to new requests.
Throughput vs. latency decoupling: Larger batches mean higher GPU utilization and more total tokens processed per unit time.

From the server's perspective:

5 short parallel requests → GPU does 5-way batched inference, producing 5 tokens per step
1 long request → GPU does single-sequence inference, producing 1 token per step

The GPU's parallel computing power is wasted when requests are combined.

5. The Prefill Phase Difference

LLM inference has two phases:

Prefill: Process the input prompt, computing KV Cache for all input tokens. This step can process all input tokens in parallel, with latency roughly linear to input length.
Decode: Generate output token by token. This step is sequential.

With combined requests:

Prefill phase: Longer input (all 5 questions concatenated), longer prefill time
Decode phase: Longer output, longer decode time

With split requests:

Each request's prefill is shorter, and all 5 prefills can run in parallel or pipelined
Each request's decode is shorter, and they run in parallel

Both phases favor splitting.

6. An Often-Overlooked Factor: Quality

Beyond speed, combining requests carries quality risks:

Attention dilution: When an LLM processes multiple unrelated tasks in one generation, its "focus" on each task decreases. Research shows that more irrelevant information in the prompt leads to lower answer quality (the "Lost in the Middle" phenomenon).
Format confusion: Answers to 5 questions easily suffer from numbering errors, omissions, or mismatched responses.
Error propagation: If the answer to question 2 goes wrong, the LLM may be influenced in subsequent answers (autoregressive "inertia").

Split requests completely isolate context, giving each question the LLM's "full attention."

7. When Is Combining Actually Better?

To be fair, there are a few scenarios where combining may be more appropriate:

Hidden correlations between questions: Even if you think they're independent, the LLM might give more consistent answers seeing the full picture (e.g., different sections of the same report).
Strict API rate limits: If your API quota is 3 requests per minute, you have no choice but to combine 5 questions.
Network latency far exceeds generation time: If each API call has 2 seconds of network round-trip but generation only takes 0.5 seconds, splitting 5 times (5 × 2s = 10s network overhead) might exceed the combined generation time. But this is rare in practice — modern API network latency is typically 100-300ms, far less than generation time.
Extremely short answers: If each question only needs a word or two, prefill overhead dominates, and combining can reduce redundant prefill costs.

8. How to Verify This Yourself

If you want to test this empirically:

import asyncio
import time
import aiohttp

async def ask_single(session, question):
    start = time.time()
    # Call LLM API
    resp = await session.post(API_URL, json={"prompt": question})
    result = await resp.json()
    return time.time() - start

async def benchmark():
    questions = ["Question 1", "Question 2", "Question 3", "Question 4", "Question 5"]

    async with aiohttp.ClientSession() as session:
        # Approach A: Combined
        start = time.time()
        combined = "Please answer separately:\n" + "\n".join(questions)
        await ask_single(session, combined)
        time_combined = time.time() - start

        # Approach B: Parallel
        start = time.time()
        await asyncio.gather(*[ask_single(session, q) for q in questions])
        time_parallel = time.time() - start

    print(f"Combined: {time_combined:.2f}s")
    print(f"Parallel: {time_parallel:.2f}s")
    print(f"Speedup: {time_combined / time_parallel:.1f}x")

In practice, 5 moderately complex independent questions typically achieve 3-5x speedup with parallel requests.

Summary

Dimension	Combined request	Split parallel requests
Generation speed	Slow (sequential output of all answers)	Fast (parallel generation, latency = slowest)
GPU utilization	Low (single-sequence inference)	High (batched parallel inference)
Answer quality	May degrade (attention dilution)	Better (isolated context)
API calls	1	N
Best for	Rate-limited / extremely short answers	Independent questions needing detailed answers

Core principle in one sentence: LLM's autoregressive mechanism means output is sequential; combining requests = forcing all outputs into a single serial stream; splitting requests = leveraging server-side parallelism to generate multiple outputs simultaneously. Splitting independent questions is the classic strategy of trading space (concurrent slots) for time.

What Is GraphRAG Really Doing? — A Deep Dive into Microsoft's Blog Post

eyanpen — Fri, 24 Apr 2026 11:57:01 +0000

Original: GraphRAG: Unlocking LLM discovery on narrative private data - Microsoft Research

In early 2024, Microsoft published a technical blog post. The core message boils down to one sentence: Traditional RAG falls short with complex data, and GraphRAG fills the gap using knowledge graphs + graph clustering.

This isn't an academic paper — it reads more like a "tech pitch" aimed at technical decision-makers and engineers. Let me break it down.

Where Does Traditional RAG Fall Short?

To understand what GraphRAG solves, we need to start with the pain points of traditional RAG. The article highlights two scenarios where traditional RAG struggles:

Information That Can't Be Connected

Imagine asking an AI: "What has Novorossiya done?"

Traditional RAG takes the word "Novorossiya" and runs a vector search. But among the 10 text chunks retrieved, none directly mentions that name — the answer is scattered across different documents, connected only through indirect relationships between entities. Vector search only finds text that "looks similar"; it can't handle this kind of reasoning that requires "jumping" between connections.

GraphRAG works differently: it locates the Novorossiya node in the knowledge graph, then traverses along relationship edges — actions, goals, related organizations — and assembles the complete answer.

Put simply, vector retrieval is "local matching," while real-world knowledge is often connected indirectly through chains of entity relationships.

Can't Answer "Big Questions"

Another example: "What are the top 5 themes in this dataset?"

Traditional RAG is stumped — the word "themes" is too broad. Vector search doesn't know which direction to look, and ends up matching some irrelevant text that happens to contain the word "theme." The answer naturally goes off track.

This is fundamentally a granularity problem: vector RAG retrieves at the text chunk level, but "overall themes" require a macro-level understanding of the entire dataset. No single chunk can support that kind of answer.

GraphRAG handles this easily with pre-built community clusters and community summaries, extracting themes directly from the macro structure.

How Does GraphRAG Work?

The entire process has two phases: offline indexing, then online question answering.

Offline Indexing: Three Steps

Raw Documents
    │
    ▼
┌─────────────────────────────┐
│ Step 1: Entity & Relationship│  LLM processes documents chunk
│ Extraction                   │  by chunk, extracting all
│                              │  entities (people, places,
│                              │  organizations, etc.) and
│                              │  their relationships
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│ Step 2: Knowledge Graph      │  Assemble extracted entities
│ Construction                 │  and relationships into a
│                              │  complete graph structure
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│ Step 3: Community Detection  │  Perform bottom-up hierarchical
│ & Summarization              │  clustering on the graph (e.g.,
│                              │  Leiden algorithm), generate
│                              │  LLM summary reports for each
│                              │  community
└─────────────────────────────┘

In short: first let the LLM extract all the people, events, things, and their relationships from the documents, assemble them into a large graph, then cluster the graph into groups and write a summary for each group.

Online Answering: Choose Strategy by Question Type

Question Type	How to Find the Answer
Specific questions (e.g., "What has Novorossiya done?")	Locate entity in graph → traverse relationships → collect related text → generate answer
Macro questions (e.g., "Top 5 themes")	Use community summaries directly → aggregate layer by layer → generate global answer

Technical Points Worth Digging Into

Why Use LLM for Graph Construction Instead of Traditional NLP?

The traditional approach uses NER (Named Entity Recognition) + relation extraction models, but these have hard limitations: you need to predefine entity types and relation types, they break when you switch domains, and they can't capture implicit relationships.

LLM advantages are clear:

Zero-shot capability — no need to train separately for each domain
Can read between the lines — for example, extracting the implicit "government attention" relationship from "the Attorney General's office reported the creation of Novorossiya"
Not constrained by schema — let the LLM discover entity and relationship types on its own

The trade-off is straightforward: LLM calls are expensive, and the indexing phase needs to process the entire dataset, so computational costs are significant.

Community Detection — GraphRAG's Killer Feature

Many approaches use knowledge graphs to enhance RAG, but what truly sets GraphRAG apart is community detection:

Uses algorithms like Leiden to partition the knowledge graph into multi-level communities (think of them as "topic clusters")
Pre-generates an LLM summary report for each community
Different community levels correspond to different levels of abstraction; choose the right granularity when answering questions

This is the secret behind its ability to answer "big questions" — no need to traverse the entire graph on the fly, just look up the pre-written summaries.

When generating community reports, the LLM receives CSV tables of entities and relationships within that community: an Entities table (entity ID, name, description), a Relationships table (source, target, description, combined_degree), and an optional Claims table. Relationships are sorted by combined_degree in descending order, prioritizing the most important ones, with truncation when the token limit is exceeded.

Provenance — Every Statement Is Traceable

GraphRAG places special emphasis on provenance. The complete evidence chain looks like this:

User Query
    → GraphRAG Answer + [Data: Entities (ID), Relationships (ID)]
        → Relationship IDs point to specific edges in the knowledge graph
            → Edges link back to specific passages in the original source documents

Answer → entities/relationships in the graph → original documents — fully traceable end to end. For enterprise applications, this capability is critical — you can verify every claim the AI makes.

How Were the Experiments Conducted?

Dataset

They used the VIINA dataset (violence information from news articles), chosen deliberately:

Involves multi-party conflict with fragmented information — complex enough
Includes news sources from both Russian and Ukrainian sides with opposing viewpoints and contradictory information
Data from June 2023, ensuring it's not in the LLM's training set
Thousands of articles, far exceeding context window limits — can't be handled without RAG

Evaluation Results

Four metrics were used for scoring:

Metric	What It Measures	How It's Evaluated
Comprehensiveness	How complete is the answer	LLM scorer pairwise comparison
Human Empowerment	Does it provide sources for verification	LLM scorer pairwise comparison
Diversity	Does it answer from multiple perspectives	LLM scorer pairwise comparison
Faithfulness	Does it hallucinate	SelfCheckGPT absolute measurement

The results are interesting: GraphRAG significantly outperforms traditional RAG on the first three metrics, but they're roughly equal on faithfulness. In other words, GraphRAG's improvement is mainly in "finding more comprehensively," not in "hallucinating less."

Don't Just Look at the Strengths — Know the Limitations Too

This is a pitch piece after all, so it naturally emphasizes the positives. A few caveats to keep in mind:

High indexing cost — Every document chunk requires an LLM call to extract entities and relationships. For large datasets, this could take hours or even days. With GPT-4 level models, API costs are considerable.

Incremental updates are a hard problem — The article doesn't mention what happens when data changes. In practice, new documents require re-extraction and merging, community structures may change as a result, requiring re-clustering and re-generating summaries. There's no good engineering solution for this yet.

Extraction quality depends on the LLM — LLM entity and relationship extraction isn't 100% accurate. It may miss implicit entities, get relationships wrong, and different models produce varying extraction quality with inconsistent results.

Queries will be slower — Graph traversal + LLM generation has a longer pipeline than simple vector retrieval + LLM generation, so latency is naturally higher.

Not every question needs it — The article itself acknowledges that for simple factual queries (like "What is Novorossiya?"), traditional RAG is sufficient. GraphRAG's advantages are concentrated in multi-hop reasoning and global summarization scenarios.

An Analogy to Build Your Intuition

Imagine you're a new employee at a company, and you want to understand "the most important project developments in the last three months."

Traditional RAG is like searching through a filing cabinet: You walk into the archive room and search using "project developments" as a keyword. You find dozens of files scattered across different drawers — meeting minutes, emails, reports. You have to piece the fragments together yourself.

GraphRAG is like asking a colleague who knows everything: They've not only read every document but also remember that "Zhang San's Project A and Li Si's Project B are actually related," and know that "last month's budget adjustment affected three departments." They can give you an organized, complete answer right away.

	Traditional RAG	GraphRAG
How it works	Search keywords, find relevant passages	Build a relationship network first, then answer along relationships
Good at	"What is X?" "How to do X?"	"What's the relationship between X and Y?" "What's the overall picture?"
Analogy	A librarian helping you find books	A detective connecting clues into a complete story
Weakness	Fragmented, lacks global perspective	Building the relationship network takes time and compute

Key Takeaways

GraphRAG doesn't solve the "search more accurately" problem — it solves the "search dimension" problem — expanding from text similarity to entity relationships and global structure.
The knowledge graph is the means; community clustering is the real innovation — Many approaches use graphs to enhance RAG, but community detection + pre-summarization is GraphRAG's unique weapon for global queries.
Provenance is the foundation of trust — Every assertion can be traced back to the original document. Enterprise applications can't do without this.
The trade-off is indexing cost — Using LLMs to process all data for graph construction is much more expensive than simple vectorization. This must be weighed when deploying in production.
Not a replacement, but a complement — Use GraphRAG for complex reasoning and global analysis, traditional RAG for simple factual queries. In real systems, combining both is the right approach.

The Biggest Pitfall in GraphRAG: One Entity, Seven Identities

eyanpen — Fri, 24 Apr 2026 11:54:16 +0000

You thought the hardest part of GraphRAG was "building the graph." In reality, the hardest part is "assigning entity types" — even when you've predefined a strict type schema.

1. A Real-World Dataset

We ran GraphRAG entity extraction on 3GPP TS 23.502 (5G Core Network signaling procedure specification). This document is about 700+ pages and one of the most critical standards in the telecom domain.

The results were painful:

A total of 8,873 distinct entities were extracted (deduplicated by title)
1,123 entities were assigned 2 or more types — 12.7% of the total
The most extreme case, PMIC, was classified into 7 different types: ARCHITECTURE_CONCEPT, DATA_TYPE, INFORMATION_ELEMENT, MANAGEMENT_ENTITY, NETWORK_ELEMENT, PROCEDURE, PROTOCOL

Note that this experiment already used a strictly predefined entity type schema, with the prompt explicitly constraining the LLM to only use the specified type set. In other words, this isn't chaos caused by "no constraints" — it's chaos that persists even after constraints are applied.

What's worse, these "type conflicts" don't occur across different documents — they happen within the same document and even within the same chunk. When the LLM reads a minimal text segment, even with explicit type constraints, it still assigns different types to the same entity.

We found 63 text_unit-level overlapping conflicts — the same entity annotated with two different types within the same text block. For example:

Entity	Labeled as	Also labeled as
AF	ORGANIZATION	NETWORK_FUNCTION
NRF	INTERFACE	NETWORK_FUNCTION
5G SECURITY CONTEXT	SECURITY_ELEMENT	ARCHITECTURE_CONCEPT
HPLMN	NETWORK_FUNCTION	ORGANIZATION
SERVICE REQUEST	INFORMATION_ELEMENT	PROCEDURE

This isn't the LLM making rookie mistakes, nor is the schema poorly designed. Think about it: AF (Application Function) genuinely is both a "network function" and an "organizational role"; NRF is both a "network function" and exposes "interfaces." These types are all in our predefined schema, and the LLM picks a "legal" type every time — it just picks different legal types for the same entity. The problem isn't that the LLM judged wrong, nor that the schema isn't strict enough — it's that real-world entities are inherently not single-typed.

2. Why Is This Problem So Hard?

2.1 Entities Are Inherently Multi-Faceted

In 3GPP specifications, the term AMF (Access and Mobility Management Function):

In architecture diagrams, it's a NETWORK_FUNCTION
In signaling procedures, it's a participant in a PROCEDURE
In deployment descriptions, it's a NETWORK_ELEMENT
In interface definitions, it's an endpoint of an INTERFACE

The same entity plays different roles in different contexts. This isn't a bug — it's reality.

2.2 LLM Type Judgment Depends on the Context Window

GraphRAG entity extraction is performed chunk by chunk. Each text_unit is roughly a few hundred tokens, and the LLM can only see that small segment.

The same entity PDU SESSION ESTABLISHMENT:

In a chunk describing signaling procedures, the LLM classifies it as PROCEDURE
In a chunk describing message formats, the LLM classifies it as INFORMATION_ELEMENT

Both judgments are correct, but they conflict when merged into the knowledge graph.

2.3 No Matter How Good the Schema, Type Boundaries Are Inherently Fuzzy

We already predefined a type schema, but who defines the boundary between ARCHITECTURE_CONCEPT and NETWORK_FUNCTION? In the 3GPP context, many concepts naturally span multiple categories. POLICY CONTROL is both a "procedure" (PROCEDURE) and an "architectural concept" (ARCHITECTURE_CONCEPT) — both types are in our schema, and the LLM isn't wrong to pick either one.

This isn't a problem of poorly written prompts or imprecise schema definitions — it's a fundamental tension between the granularity of type systems and the complexity of the real world. You can make the schema more fine-grained, but a finer schema only creates more boundary issues, not fewer.

2.4 Scale Amplifies the Problem

Our data shows that among entities with multiple types, the top 20 entities average 4–7 types and are associated with 10–200 descriptions. A core entity like AF has 209 descriptions, 192 text_unit references, and 4 types.

When a knowledge graph contains thousands of such "multi-faceted entities," downstream community detection, relationship reasoning, and summary generation are all affected — because the graph structure is polluted by type noise.

3. How Does the Industry Currently Address This?

Approach 1: Predefined Strict Type System (Schema-First) ⚠️ We Already Tried This

Method: Before extraction, manually define a strict entity type schema and explicitly constrain the LLM in the prompt to only use these types.

Representatives: Microsoft GraphRAG's default configuration, most enterprise knowledge graph projects.

Our actual results: All the data at the beginning of this article was produced under Schema-First mode. We predefined the type set and explicitly constrained it in the prompt — yet 1,123 entities still had multi-type conflicts, and 63 text_unit-level overlapping conflicts persisted.

Why it's not enough:

Schema can constrain the LLM to "only pick from these types," but can't constrain it to "pick only one for the same entity"
Domain concepts are inherently multi-faceted; AF in the 3GPP context genuinely is both NETWORK_FUNCTION and ORGANIZATION — no schema, however strict, changes this fact
Requires domain experts to design the schema — high cost, and you need to redesign for each new domain
Being too strict loses information — forcing AF to be NETWORK_FUNCTION discards its semantics as ORGANIZATION

Conclusion: Schema-First is a necessary condition but not a sufficient one. It reduces the "random naming" problem but doesn't solve the fundamental contradiction of "one entity, multiple identities."

Approach 2: Allow Multi-Types, Post-Processing Merge (Multi-Label + Post-Processing)

Method: Don't limit the number of types during extraction; allow an entity to have multiple types, then merge, deduplicate, and select a primary type through rules or models in post-processing.

Representatives: LlamaIndex's PropertyGraphIndex, some academic research.

Pros:

Preserves multi-faceted entity information
No information loss during extraction

Cons:

Post-processing logic is complex; rules are hard to enumerate exhaustively
"Selecting a primary type" itself requires domain knowledge
Graph complexity increases; query performance degrades

Suitable for: Exploratory analysis, early stages where domain boundaries are uncertain.

Approach 3: Hierarchical Typing

Method: Build a hierarchical type system where, for example, NETWORK_FUNCTION is a subtype of ARCHITECTURE_CONCEPT. Extract at the finest granularity; aggregate by hierarchy during queries.

Representatives: Wikidata's type system, YAGO knowledge base.

Pros:

Balances precision and flexibility
Supports queries at different granularities

Cons:

Designing the hierarchy itself is a major undertaking
LLMs struggle to accurately determine hierarchical relationships during extraction
Cross-domain hierarchies are hard to unify

Suitable for: Large-scale, long-term knowledge graph projects.

Approach 4: Abandon Explicit Types, Use Embeddings (Type-Free + Embedding)

Method: Don't assign discrete type labels to entities; instead, use vector embeddings to represent semantic features. Similar entities naturally cluster in vector space.

Representatives: Some recent research, such as GNN-based entity representation learning.

Pros:

Completely avoids the type conflict problem
Captures subtle semantic differences between entities

Cons:

Loses interpretability — you can't tell users "this is a network function"
Downstream community detection and summary generation need redesign
Difficult to debug

Suitable for: Research projects, scenarios with low interpretability requirements.

Approach 5: Context-Aware Dynamic Typing

Method: Don't fix types during extraction; instead, dynamically determine entity types based on query context. For example, when a user asks about architecture, AF is treated as NETWORK_FUNCTION; when asking about organization, it's treated as ORGANIZATION.

Representatives: Currently mostly in the academic exploration stage.

Pros:

Most aligned with reality — an entity's "identity" truly depends on context
No difficult type decisions needed during extraction

Cons:

Extremely high engineering complexity
Graph structure can't be determined during offline graph building; community detection algorithms are hard to apply
Increased query latency

Suitable for: A research direction for next-generation GraphRAG systems.

4. My Recommendation: Schema-First Foundation + Layered Types + Primary Type Voting + Context Preservation

Our experiments have proven that Schema-First is a necessary starting point — without it, types become even more chaotic. But it alone isn't enough. Based on our hands-on experience with 3GPP documents, I recommend layering a pragmatic post-processing approach on top of Schema-First:

Layer 0: Keep Schema-First (Already in Place)

Continue using the predefined type schema to constrain the LLM. This step is already done; its value lies in keeping types within a finite set, preventing the LLM from freely inventing meaningless types like THINGY or STUFF.

Layer 1: Preserve All Types During Extraction

On top of Schema-First, don't force a single type during extraction. If the LLM picks multiple types from the predefined set, keep them all. Preserve every (entity, type, text_unit) triple. This is the raw signal — once lost, it can't be recovered.

Layer 2: Statistical Voting for Primary Type

For each entity, count how many times it's annotated as each type across all text_units, and select the most frequent as the primary type.

Taking AF as an example:

NETWORK_FUNCTION: 150 occurrences → primary type
ORGANIZATION: 30 occurrences
ARCHITECTURE_CONCEPT: 20 occurrences
NETWORK_ELEMENT: 9 occurrences

The primary type is used for the knowledge graph's main structure, community detection, and default queries.

Layer 3: Preserve Alternative Types as Properties

Other types aren't discarded — they're stored as the entity's alternative_types property, available for use during queries as needed.

{
  "title": "AF",
  "primary_type": "NETWORK_FUNCTION",
  "alternative_types": ["ORGANIZATION", "ARCHITECTURE_CONCEPT", "NETWORK_ELEMENT"],
  "type_distribution": {
    "NETWORK_FUNCTION": 150,
    "ORGANIZATION": 30,
    "ARCHITECTURE_CONCEPT": 20,
    "NETWORK_ELEMENT": 9
  }
}

Layer 4: Type Conflict Detection and Manual Review

For text_unit-level overlapping conflicts (same entity labeled as different types within the same chunk), flag them as candidates for review. These 63 conflicts are the most worth manually checking — they often reveal blind spots in the type system design.

What's the Cost?

Increased storage: Each entity stores multiple types and distribution info; graph data volume increases by roughly 20–30%.
No change to extraction: No need to modify prompts or extraction pipelines; no additional cost.
Post-processing development needed: The voting, merging, and conflict detection pipeline requires additional development — roughly 2–3 days of engineering effort.
Slightly more complex queries: The query layer needs to decide whether to use the primary type or all types, but this logic can be encapsulated.
Can't be fully automated: Text_unit-level conflicts still require human judgment, but the volume is manageable (only 63 in our case).

5. Final Thoughts

GraphRAG papers and blog posts always focus on the flashy capabilities like "community detection" and "global queries," but when it comes to real-world deployment, entity type chaos is the first roadblock.

One TS 23.502 document, 8,873 entities, 1,123 with multi-type conflicts — and this is after applying Schema-First constraints. This isn't an edge case; it's the norm for all complex domain documents. Predefined type schemas are necessary but far from sufficient.

There's no silver bullet for this problem. But at least we can: build on Schema-First, avoid losing information during post-processing, use statistical methods to select primary types, preserve multi-faceted nature for downstream use, and keep the conflicts that truly need human judgment within a manageable scope.

This is the gap between "running a demo" and "going to production" in GraphRAG — and it's the most important one to fill.

Why Do We Need GraphRAG? — The Evolution from "Search" to "Understanding"

eyanpen — Fri, 24 Apr 2026 11:49:37 +0000

When AI stops just "looking things up" and starts truly "understanding" your question.

1. Let's Start with an Everyday Scenario

Imagine you're a new employee at a company. On your first day, you want to know "the most important project updates from the past three months."

You have two options:

Option A: Dig through the filing cabinet
You walk to the archive room, open the filing cabinet, and search by the keyword "project updates." You find dozens of documents, but they're scattered across different drawers — some are meeting minutes, some are emails, some are reports. You have to piece these fragments together yourself to get a complete answer.

Option B: Ask a colleague who "knows everything"
This colleague has not only read every document but also remembers that "Project A led by Zhang San and Project B led by Li Si are actually related," and knows that "last month's budget adjustment affected three departments' plans." They can give you an organized, complete answer right away.

Option A is traditional RAG (Retrieval-Augmented Generation).
Option B is what GraphRAG aims to achieve.

2. What Is RAG? It's Already Impressive — So Why Isn't It Enough?

What Is RAG

RAG stands for Retrieval-Augmented Generation. Simply put, it lets AI search through a pile of documents for relevant content before answering your question, then generates a response based on what it found.

It's like an open-book exam — AI can flip through references to find answers instead of relying purely on memory.

RAG's Limitations

RAG is genuinely useful, but it has a fundamental weakness: it can "find" but it can't "connect."

For example, suppose you ask:

"What impact has the company's business expansion in Asia-Pacific had on the supply chain?"

Traditional RAG would:

Search for documents containing keywords like "Asia-Pacific," "business expansion," "supply chain"
Find several relevant passages
Hand these passages to the AI to generate an answer

Where's the problem?

Information about "Asia-Pacific business expansion" might be in a strategic report
Information about "supply chain adjustments" might be in an operations report
The connection between these two reports — such as "because of Asia-Pacific expansion, a new Vietnamese supplier was added, causing logistics cost changes" — might not be explicitly stated in any single document

What traditional RAG finds are isolated "fragments." It's not good at connecting the implicit relationships between fragments.

3. How Does GraphRAG Solve This Problem?

Core Idea: Build a "Relationship Network" First

GraphRAG's key innovation is that before answering questions, it does something extra: it organizes all the information from documents into a "relationship network" (knowledge graph).

What does this relationship network look like? Think of it as a character relationship map:

Nodes (circles): Represent individual "things" — people, companies, projects, locations, concepts
Edges (arrows): Represent relationships between them — "responsible for," "belongs to," "affects," "collaborates with"

A simple example:

[Zhang San] --responsible for--> [Project A]
[Project A] --depends on--> [Project B]
[Project B] --led by--> [Li Si]
[Project A] --budget from--> [Asia-Pacific Department]
[Asia-Pacific Department] --partners with--> [Vietnamese Supplier]

With this network, when you ask "What's the relationship between Zhang San's project and the Vietnamese supplier?", the AI can "walk" through the network and discover:

Zhang San → Project A → Asia-Pacific Department → Vietnamese Supplier

Even if no single document ever directly mentions "the relationship between Zhang San and the Vietnamese supplier," the AI can reason out the answer through this path.

Plain-Language Summary

	Traditional RAG	GraphRAG
How it works	Searches keywords, finds relevant passages	Builds a relationship network first, then follows relationships to answer
Good at	"What is X?" "How do I do X?"	"What's the relationship between X and Y?" "What's the big picture?"
Analogy	A librarian helping you find books	A detective connecting clues into a complete story
Weakness	Fragmented, lacks global perspective	Building the relationship network takes time and compute

4. What Can GraphRAG Do for Us?

Scenario 1: Enterprise Knowledge Management

A large company has thousands of internal documents: policies, procedures, meeting minutes, technical docs...

Traditional approach: Employees search by keywords, browse through many documents, summarize on their own
GraphRAG approach: AI has already "understood" the relationships between all documents. Employees can directly ask "What was the root cause of increased customer complaints last quarter?" and the AI can provide a connected analysis across product changes, customer service records, supplier issues, and more

Scenario 2: Healthcare

A patient's medical records, test reports, and medication history are scattered across different systems.

Traditional approach: Doctors review each one individually, relying on experience
GraphRAG approach: AI builds a network connecting patient information, medications, diseases, and test results. It can flag that "Drug A the patient is currently taking and newly prescribed Drug B may interact because they both act on the same metabolic pathway"

Scenario 3: Financial Risk Control

A bank needs to assess the risk of a loan.

Traditional approach: Review the borrower's credit report and financial data
GraphRAG approach: AI discovers that the borrower's company and another company that has already defaulted share the same ultimate beneficial owner, and this connection is hidden within multiple layers of equity structures — uncovering these "hidden relationships" is exactly where GraphRAG excels

Scenario 4: Everyday Q&A Assistant

You're using an AI assistant to learn about a complex topic like "climate change."

Traditional approach: AI gives you a general overview of climate change
GraphRAG approach: AI can tell you "climate change affects agricultural yields, which in turn affects food prices, which ultimately affects social stability in developing countries" — this kind of multi-hop reasoning (from A to B to C to D) is GraphRAG's core advantage

5. GraphRAG Isn't a Silver Bullet

After all these benefits, let's be honest about its limitations:

Building the relationship network has costs: Converting large volumes of documents into a knowledge graph requires time and compute resources. For small-scale, simple Q&A scenarios, traditional RAG may be sufficient.
The quality of the relationship network is critical: If the AI misunderstands a relationship during graph construction, subsequent reasoning will also be wrong. Just like a detective who connects clues incorrectly will reach the wrong conclusion.
Not every question needs it: If you just want to look up "What's the company's expense reimbursement process?", traditional search can answer that perfectly well — no need to deploy GraphRAG.

6. Summary

The essence of GraphRAG is evolving AI from "keyword search" to "relationship reasoning."

It's not meant to replace traditional RAG but to add a layer of "understanding relationships" on top of it. It's like upgrading from "looking up a dictionary" to "reading an encyclopedia" — a dictionary tells you what each word means; an encyclopedia also tells you how those words are connected.

For scenarios that involve processing large amounts of complex information, discovering hidden connections, and requiring a global perspective, GraphRAG is a direction worth paying attention to.

Forem: eyanpen

The "Ghost Clone" of Community Reports in GraphRAG: Why the Same Report Gets Created Twice

Symptom

An Intuitive Example

Imagine You're Managing a Company's Org Chart

Where's the Problem?

Root Cause Analysis

1. Leiden Hierarchical Clustering Produces Identical Reports

2. Import Logic Lacks Deduplication and Precise Matching

Solution

Precise Matching When Creating HAS_REPORT

Known Pitfall in DeepEval Faithfulness Metric: "idk" Verdicts Don't Penalize the Score

Background

Observed Behavior

Root Cause Analysis

Real-World Example

The Fundamental Issue

Solutions

Solution 1: Enable penalize_ambiguous_claims

Solution 2: Add a Groundedness Metric

Recommendation

Additional Pitfall: Summary Claims Misjudged as "idk"

Real-World Example

Cause

Impact

Possible Improvement

Conclusion

How FalkorDB Stores Edges: Why Neighbor Lookup Is O(degree)

1. How Traditional Graph Databases Store Edges

Adjacency List

2. FalkorDB Is Completely Different: Sparse Matrix

3. One Matrix Per Edge Type

4. How Multi-edges Are Maintained

5. How to Efficiently Find Edges?

Sparse Matrix Doesn't Store Zeros At All

6. What Does a Sparse Matrix Actually Store?

7. CSR / CSC: Industrial-Grade Sparse Matrix Structures

8. Why Is the Complexity Still O(degree)?

Locating the array is O(1)

Traversing the array is still O(k)

9. What Does Algorithmic Complexity Actually Measure?

10. Output-sensitive Complexity

The size of the output itself counts toward complexity

11. Why Is FalkorDB Still Fast?

Pointer chasing

12. Neo4j vs FalkorDB: The Essential Difference

13. Final Summary

14. Does Splitting Edges Into Multiple Types vs. a Single Type Affect Query Speed?

When the Query Specifies an Edge Type

When the Query Does Not Specify an Edge Type

Summary

Orphan Communities in GraphRAG Hierarchical Clustering: Why Some Communities Have No PARENT_OF Edges

The Phenomenon

Background: GraphRAG's Hierarchical Community Structure

Imagine You're Grouping Everyone in the World

A Complete Example

First Round of Clustering (Level 0): 5 Major Groups

Second Round of Clustering (Level 1): Subdividing Within Groups

Why Do Orphans Occur? Two Conditions Must Be Met Simultaneously

The Data Speaks

Impact on GraphRAG Queries

Global Search

Local Search

Practical Impact

Summary

GraphRAG Local Search Text Unit Selection Strategy: Design Trade-offs and Improvement Directions

Introduction

What Is the Current Strategy

Problem Scenario: Popular Entities Monopolize the Budget

Concrete Example

Why It Was Designed This Way: What Problem It Solves

The Scenario It Addresses

Core Trade-off

The Essence of the Problem: A Single Sorting Dimension Cannot Express Multi-Objective Optimization

Improvement Directions

Approach 1: Per-Entity Cap

Approach 2: Round-Robin

Approach 3: Weighted Quota Allocation

Approach 4: Minimum Guarantee + Remaining Competition

Summary

Solution 1: Enable `penalize_ambiguous_claims`