<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: pyalwin</title>
    <description>The latest articles on Forem by pyalwin (@pyalwin).</description>
    <link>https://forem.com/pyalwin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F433840%2F11ea7c9f-d201-4e14-a9ff-6eff8eb34ef6.png</url>
      <title>Forem: pyalwin</title>
      <link>https://forem.com/pyalwin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pyalwin"/>
    <language>en</language>
    <item>
      <title>Beyond RAG: Building Graph-Aware Retrieval for Contract Reasoning</title>
      <dc:creator>pyalwin</dc:creator>
      <pubDate>Tue, 31 Mar 2026 17:50:07 +0000</pubDate>
      <link>https://forem.com/pyalwin/beyond-rag-building-graph-aware-retrieval-for-contract-reasoning-3o05</link>
      <guid>https://forem.com/pyalwin/beyond-rag-building-graph-aware-retrieval-for-contract-reasoning-3o05</guid>
      <description>&lt;p&gt;Why We Moved Beyond Vector Search for Contract QA&lt;/p&gt;

&lt;p&gt;When we started building AgreedPro, one of the core technical questions seemed almost ordinary: how do you answer questions over contracts using modern AI retrieval systems?&lt;/p&gt;

&lt;p&gt;At first, the answer felt obvious.&lt;/p&gt;

&lt;p&gt;Use RAG.&lt;/p&gt;

&lt;p&gt;Chunk the contract. Embed the chunks. Retrieve the top matches. Pass them to a language model. Let the model generate the answer.&lt;/p&gt;

&lt;p&gt;That pipeline is familiar because, in many domains, it works. It works well on documentation, FAQs, product manuals, internal wikis, and other corpora where relevant information is typically localized. A question points to a paragraph, a section, or a small cluster of nearby passages. Retrieval is largely a similarity problem.&lt;/p&gt;

&lt;p&gt;Contracts are different.&lt;/p&gt;

&lt;p&gt;That difference was not immediately obvious when we were building early versions of AgreedPro. The model often produced answers that looked right. They were well-phrased, coherent, and aligned with the wording of the contract section that had been retrieved.&lt;/p&gt;

&lt;p&gt;But once we started checking those answers closely against the full document, a recurring pattern emerged.&lt;/p&gt;

&lt;p&gt;The model was not hallucinating in the usual sense.&lt;/p&gt;

&lt;p&gt;It was answering from incomplete context.&lt;/p&gt;

&lt;p&gt;That failure mode turned out to be much more important than it sounds. In legal documents, incomplete context is often the difference between a correct answer and a misleading one. That realization is what led to EngramDB.&lt;/p&gt;

&lt;p&gt;This article is a technical deep dive into the retrieval problem we encountered while building AgreedPro, why vector-only search was mismatched to contract reasoning, and how EngramDB emerged as a hybrid graph-plus-vector retrieval system designed for multi-hop legal reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Looked Solved Until We Tried to Solve It
&lt;/h2&gt;

&lt;p&gt;The initial retrieval pipeline was standard.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the contract into chunks.&lt;/li&gt;
&lt;li&gt;Embed each chunk.&lt;/li&gt;
&lt;li&gt;Embed the user query.&lt;/li&gt;
&lt;li&gt;Retrieve the most similar chunks.&lt;/li&gt;
&lt;li&gt;Ask the language model to answer from the retrieved context.&lt;/li&gt;
&lt;/ol&gt;
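
&lt;p&gt;As a rough sketch, that whole pipeline fits in a few lines. The embedding below is a toy bag-of-words vector standing in for a real model, and every name is illustrative rather than AgreedPro's actual code:&lt;/p&gt;

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: a bag-of-words count vector.
    # A real pipeline would call a sentence-embedding model here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_retrieve(contract_chunks, query, k=3):
    # Steps 1-2: chunk and embed the contract (chunks assumed pre-split).
    chunk_vecs = [(c, embed(c)) for c in contract_chunks]
    # Step 3: embed the query.
    qv = embed(query)
    # Step 4: rank chunks by similarity and keep the top k.
    ranked = sorted(chunk_vecs, key=lambda cv: cosine(qv, cv[1]), reverse=True)
    # Step 5: these chunks become the context handed to the model.
    return [c for c, vec in ranked[:k]]

chunks = [
    "Section 9. Termination. Either party may terminate this Agreement for Cause.",
    "Section 1. Definitions. 'Cause' means a material breach.",
    "Section 3. Fees. Fees are due within 30 days.",
]
print(naive_rag_retrieve(chunks, "Under what conditions can this agreement be terminated?", k=1))
```

&lt;p&gt;Even in this toy form, the retriever returns the termination clause and nothing else. The definition of "Cause" never makes it into the context.&lt;/p&gt;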

&lt;p&gt;For a while, this looked fine.&lt;/p&gt;

&lt;p&gt;Then we started asking realistic legal questions.&lt;/p&gt;

&lt;p&gt;A representative example was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Under what conditions can this agreement be terminated?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A vector-based retriever would often return the section titled something like &lt;strong&gt;Termination&lt;/strong&gt;. That seemed perfectly reasonable. The wording of the section was directly relevant to the query.&lt;/p&gt;

&lt;p&gt;But when we inspected the full contract, the real answer was often distributed across several places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the termination clause itself&lt;/li&gt;
&lt;li&gt;a definitions section that defines a term like "Cause"&lt;/li&gt;
&lt;li&gt;a different section containing exceptions, limitations, or notice requirements&lt;/li&gt;
&lt;li&gt;one or more cross-referenced provisions that affect interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The retriever found the clause that looked most relevant, but not the sections that made the clause fully interpretable.&lt;/p&gt;

&lt;p&gt;The language model then generated an answer that sounded clean and plausible, but it had only seen one part of the reasoning chain.&lt;/p&gt;

&lt;p&gt;This is one of the most dangerous failure modes in contract intelligence: not obviously wrong, not obviously invented, just incomplete in a way that can change the legal meaning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mismatch: Similarity Does Not Equal Completeness
&lt;/h2&gt;

&lt;p&gt;To understand why this keeps happening, it helps to be precise about what dense retrieval is optimizing for.&lt;/p&gt;

&lt;p&gt;In a standard vector retrieval pipeline, you embed the query and the chunks, compute similarity scores, and return the top results. In simple terms, the retriever is asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which passages look most semantically similar to the question?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That objective is useful, but it is not the same as the one we actually care about in contracts.&lt;/p&gt;

&lt;p&gt;In contract QA, the real question is often closer to this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which sections, taken together, are necessary to answer this question correctly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are not the same objective.&lt;/p&gt;

&lt;p&gt;A clause that defines a key term may share very little wording with the query. A referenced section may be critical to the answer and still rank poorly in a pure embedding space. A parent section may scope the meaning of a child subsection without repeating the child’s language at all.&lt;/p&gt;

&lt;p&gt;So the retrieval problem is not simply one of semantic relevance. It is a problem of reconstructing a reasoning path across document structure.&lt;/p&gt;

&lt;p&gt;That is where vector-only retrieval begins to fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Contracts Behave More Like Graphs Than Like Flat Text
&lt;/h2&gt;

&lt;p&gt;The biggest conceptual shift while building AgreedPro was realizing that contracts should not be modeled primarily as flat text.&lt;/p&gt;

&lt;p&gt;They behave much more like structured graphs.&lt;/p&gt;

&lt;p&gt;A contract has at least three kinds of first-class structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hierarchical structure
&lt;/h3&gt;

&lt;p&gt;Contracts are organized into articles, sections, subsections, schedules, appendices, and nested numbered clauses. A child clause often depends on the scope introduced by its parent. If you detach a subsection from its surrounding article, you can easily lose the context that gives it meaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Definitional structure
&lt;/h3&gt;

&lt;p&gt;Contracts rely heavily on defined terms. Words like "Cause," "Confidential Information," "Affiliate," or "Change of Control" are often defined once and reused throughout the document. The place where a term is used is rarely the place where its meaning is established.&lt;/p&gt;

&lt;h3&gt;
  
  
  Referential structure
&lt;/h3&gt;

&lt;p&gt;Contracts constantly point to themselves. They say things like "subject to Section 8.2," "as provided in Article III," or "except as otherwise set forth in Section 5.4." These are explicit navigational edges inside the document.&lt;/p&gt;

&lt;p&gt;Once you start treating those signals as part of the representation, a contract naturally becomes a graph of connected sections rather than a bag of isolated chunks.&lt;/p&gt;

&lt;p&gt;That framing is the foundation of EngramDB.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uhu0dgplibrj5mi8jt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uhu0dgplibrj5mi8jt2.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What EngramDB Is
&lt;/h2&gt;

&lt;p&gt;EngramDB is a schema-aware hybrid retrieval system for structured documents, built around the kinds of retrieval failures that showed up while building AgreedPro.&lt;/p&gt;

&lt;p&gt;At a high level, it combines two mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Vector retrieval to find semantically relevant starting points&lt;/li&gt;
&lt;li&gt;Graph traversal to expand from those starting points into structurally related sections&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key idea is not that embeddings are bad. It is that embeddings are being asked to do too much when the answer depends on multiple linked sections.&lt;/p&gt;

&lt;p&gt;Embeddings are very good at finding where the answer might start.&lt;/p&gt;

&lt;p&gt;They are not, by themselves, a reliable way to find everything the answer depends on.&lt;/p&gt;

&lt;p&gt;EngramDB treats each document section as a node, called an &lt;strong&gt;Engram&lt;/strong&gt;, and each structural relationship as an edge, called a &lt;strong&gt;Synapse&lt;/strong&gt;. Those edges are extracted from the document itself during ingestion.&lt;/p&gt;

&lt;p&gt;So instead of hoping the embedding model implicitly captures the document’s legal structure, the system makes that structure explicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deterministic Ingestion: Structure Extraction Without LLM Calls
&lt;/h2&gt;

&lt;p&gt;One of the most important design decisions in EngramDB is that ingestion is rule-based.&lt;/p&gt;

&lt;p&gt;That decision matters for both engineering and research reasons.&lt;/p&gt;

&lt;p&gt;From an engineering perspective, rule-based extraction is predictable, reproducible, cheaper to run, and easier to debug. If a section boundary is wrong or a reference fails to resolve, you can inspect the pattern and fix the pipeline.&lt;/p&gt;

&lt;p&gt;From a research perspective, the hypothesis is that contracts already expose a large amount of high-quality structure for free. Section numbering, heading patterns, quotation conventions, and cross-references are not weak hints. They are central features of the document.&lt;/p&gt;

&lt;p&gt;EngramDB takes advantage of that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Section parsing
&lt;/h3&gt;

&lt;p&gt;The first stage identifies headings and section boundaries using regular expressions and structural cues. The goal is to segment the contract into units that are legally meaningful while preserving hierarchy.&lt;/p&gt;

&lt;p&gt;For example, the parser should understand that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Article IV is a parent scope&lt;/li&gt;
&lt;li&gt;Section 4.2 belongs under that article&lt;/li&gt;
&lt;li&gt;nested clauses inherit context from the enclosing section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is essential because a subsection often cannot be interpreted correctly without its parent context.&lt;/p&gt;
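
&lt;p&gt;A stripped-down version of such a parser might look like the following. The patterns are illustrative; EngramDB's real rules are necessarily more extensive:&lt;/p&gt;

```python
import re

# Illustrative heading patterns: articles (Roman numerals) and
# dotted section numbers like 4.2 or 4.2.1.
ARTICLE_RE = re.compile(r"^ARTICLE\s+([IVXLC]+)", re.IGNORECASE)
SECTION_RE = re.compile(r"^Section\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

def parse_sections(lines):
    # Returns (section_id, parent_id) pairs, inferring hierarchy:
    # Section 4.2's parent is Section 4, and a top-level section's
    # parent is the most recent Article.
    nodes = []
    current_article = None
    for line in lines:
        m = ARTICLE_RE.match(line.strip())
        if m:
            current_article = "Article " + m.group(1).upper()
            nodes.append((current_article, None))
            continue
        m = SECTION_RE.match(line.strip())
        if m:
            num = m.group(1)
            if "." in num:
                parent = "Section " + num.rsplit(".", 1)[0]
            else:
                parent = current_article
            nodes.append(("Section " + num, parent))
    return nodes

doc = [
    "ARTICLE IV TERMINATION",
    "Section 4 General.",
    "Section 4.2 Termination for Cause.",
]
print(parse_sections(doc))
```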

&lt;h3&gt;
  
  
  Definition extraction
&lt;/h3&gt;

&lt;p&gt;The next stage detects defined terms using patterns such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Term" means ...&lt;/li&gt;
&lt;li&gt;"Term" shall mean ...&lt;/li&gt;
&lt;li&gt;The term "X" means ...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the system to map terms back to their defining sections and to link uses of those terms elsewhere in the document.&lt;/p&gt;

&lt;p&gt;That means a query about termination for Cause does not have to rely on embeddings alone to discover the definition of Cause.&lt;/p&gt;
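
&lt;p&gt;A minimal sketch of that extraction step, with a single illustrative pattern covering the styles listed above:&lt;/p&gt;

```python
import re

# One illustrative pattern covering: "Term" means, "Term" shall mean,
# and The term "X" means. Real extraction rules would be broader.
DEF_PATTERNS = [
    re.compile(r'(?:The term\s+)?"([^"]+)"\s+(?:shall\s+)?means?\b', re.IGNORECASE),
]

def extract_definitions(sections):
    # sections: {section_id: text}. Returns {term: defining_section_id};
    # the first definition of a term wins.
    defined_in = {}
    for sec_id, text in sections.items():
        for pat in DEF_PATTERNS:
            for m in pat.finditer(text):
                defined_in.setdefault(m.group(1), sec_id)
    return defined_in

sections = {
    "Section 1.1": '"Cause" means a material breach of this Agreement.',
    "Section 1.2": '"Confidential Information" shall mean all non-public data.',
}
print(extract_definitions(sections))
```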

&lt;h3&gt;
  
  
  Cross-reference extraction
&lt;/h3&gt;

&lt;p&gt;The ingestion pipeline also extracts explicit references such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 4.2&lt;/li&gt;
&lt;li&gt;Section 8.1(b)&lt;/li&gt;
&lt;li&gt;Article III&lt;/li&gt;
&lt;li&gt;Schedule A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When those references can be resolved, EngramDB creates edges between the source and target sections.&lt;/p&gt;

&lt;p&gt;This is one of the most valuable signals in legal documents because the contract itself is telling you which sections are meant to be read together.&lt;/p&gt;
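
&lt;p&gt;In sketch form, reference extraction is pattern matching plus a resolution check against the known section IDs. The pattern and helper names here are illustrative:&lt;/p&gt;

```python
import re

# Illustrative pattern for explicit internal references.
REF_RE = re.compile(
    r"\b(Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?|Article\s+[IVXLC]+|Schedule\s+[A-Z])\b"
)

def extract_references(sections):
    # sections: {section_id: text}. Emits (source, "REFERENCES", target)
    # edges, keeping only targets that resolve to a known section.
    edges = []
    for sec_id, text in sections.items():
        for m in REF_RE.finditer(text):
            target = re.sub(r"\s+", " ", m.group(1))
            if target in sections and target != sec_id:
                edges.append((sec_id, "REFERENCES", target))
    return edges

sections = {
    "Section 9.1": "Either party may terminate subject to Section 8.2.",
    "Section 8.2": "Notice must be given as provided in this section.",
}
print(extract_references(sections))
```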

&lt;h3&gt;
  
  
  Graph construction
&lt;/h3&gt;

&lt;p&gt;After extraction, the document is represented as a graph consisting of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engrams&lt;/strong&gt;: nodes representing sections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synapses&lt;/strong&gt;: typed edges representing relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The primary relationship types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PARENT_OF&lt;/code&gt; for hierarchy&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DEFINES&lt;/code&gt; for term-definition links&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REFERENCES&lt;/code&gt; for explicit cross-references&lt;/li&gt;
&lt;/ul&gt;
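
&lt;p&gt;In code, the resulting model can be as small as two record types plus an adjacency helper. The field names below are illustrative, not EngramDB's actual schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Engram:
    # A node: one section of the document.
    section_id: str
    text: str

@dataclass
class Synapse:
    # A typed edge: one of PARENT_OF, DEFINES, REFERENCES.
    source: str
    edge_type: str
    target: str

@dataclass
class DocumentGraph:
    engrams: dict = field(default_factory=dict)
    synapses: list = field(default_factory=list)

    def add_section(self, section_id, text):
        self.engrams[section_id] = Engram(section_id, text)

    def link(self, source, edge_type, target):
        self.synapses.append(Synapse(source, edge_type, target))

    def neighbors(self, section_id):
        # Outgoing edges from a section, as (edge_type, target) pairs.
        return [(s.edge_type, s.target) for s in self.synapses if s.source == section_id]

g = DocumentGraph()
g.add_section("Section 9.1", "Termination for Cause ...")
g.add_section("Section 1.1", '"Cause" means a material breach.')
g.link("Section 1.1", "DEFINES", "Section 9.1")
print(g.neighbors("Section 1.1"))
```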

&lt;h3&gt;
  
  
  Embedding generation
&lt;/h3&gt;

&lt;p&gt;Each section can also be embedded using a pluggable backend. In the current implementation, the supported backends include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mock&lt;/code&gt; for deterministic tests and development&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openai&lt;/code&gt; for hosted embeddings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;local&lt;/code&gt; for sentence-transformers based local embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is deliberate. Structure extraction is deterministic. Embeddings are an additional signal, not the source of the graph.&lt;/p&gt;
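
&lt;p&gt;The backend boundary can be sketched as a registry of classes sharing one &lt;code&gt;embed&lt;/code&gt; method. Only a mock backend is shown here; a real &lt;code&gt;openai&lt;/code&gt; or &lt;code&gt;local&lt;/code&gt; backend would wrap the corresponding client, and all names are this sketch's assumptions:&lt;/p&gt;

```python
import hashlib
import math

class MockBackend:
    # Deterministic stand-in: hash tokens into a small fixed-size
    # vector, so tests never depend on a model or a network call.
    def __init__(self, dim=16):
        self.dim = dim

    def embed(self, texts):
        vectors = []
        for text in texts:
            v = [0.0] * self.dim
            for tok in text.lower().split():
                h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
                v[h % self.dim] += 1.0
            norm = math.sqrt(sum(x * x for x in v)) or 1.0
            vectors.append([x / norm for x in v])
        return vectors

BACKENDS = {"mock": MockBackend}

def get_backend(name, **kwargs):
    return BACKENDS[name](**kwargs)

backend = get_backend("mock")
v1, v2 = backend.embed(["terminate for cause", "terminate for cause"])
print(v1 == v2)  # deterministic: identical inputs give identical vectors
```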




&lt;h2&gt;
  
  
  Why DuckDB Is a Good Fit
&lt;/h2&gt;

&lt;p&gt;A lot of retrieval systems get operationally heavy before the retrieval logic is even stable. Separate vector stores, graph databases, metadata stores, and orchestration layers appear early, often making iteration harder rather than easier.&lt;/p&gt;

&lt;p&gt;EngramDB takes a simpler route.&lt;/p&gt;

&lt;p&gt;Everything is stored in a single DuckDB file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;section metadata&lt;/li&gt;
&lt;li&gt;graph edges&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;retrieval artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This has several practical advantages.&lt;/p&gt;

&lt;p&gt;First, it keeps the system local-first and easy to reproduce. Second, it allows structured querying over nodes and edges without introducing another database layer. Third, it makes benchmarking, debugging, and inspection much easier when the system is still evolving.&lt;/p&gt;

&lt;p&gt;For an experimental hybrid retrieval engine, that tradeoff is extremely appealing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retrieval Pipeline: Vector Search for Anchors, Graph Walk for Context
&lt;/h2&gt;

&lt;p&gt;The retrieval pipeline in EngramDB has two stages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Anchor retrieval
&lt;/h3&gt;

&lt;p&gt;Given a natural-language query, the system embeds the query and retrieves the top-k semantically similar sections.&lt;/p&gt;

&lt;p&gt;These top-k results are treated as &lt;strong&gt;anchor nodes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is where vector search does what it is good at. It finds the parts of the document that look closest to the user’s question.&lt;/p&gt;

&lt;p&gt;If the query is about termination, the anchor set often includes a termination clause. If the query is about confidentiality, the anchor set often includes confidentiality-related sections.&lt;/p&gt;

&lt;p&gt;But anchors alone are not enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Graph expansion
&lt;/h3&gt;

&lt;p&gt;From each anchor, EngramDB expands outward through the graph for a bounded number of hops, usually one to three.&lt;/p&gt;

&lt;p&gt;That graph walk can recover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a definition linked to a term used in the anchor&lt;/li&gt;
&lt;li&gt;a referenced provision that adds an exception or condition&lt;/li&gt;
&lt;li&gt;a parent section that scopes the meaning of the anchor&lt;/li&gt;
&lt;li&gt;neighboring clauses that complete the obligation or permission being discussed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the step that turns retrieval from a semantic lookup problem into a multi-hop reasoning support system.&lt;/p&gt;

&lt;p&gt;Vector retrieval finds entry points.&lt;/p&gt;

&lt;p&gt;Graph traversal finds connected evidence.&lt;/p&gt;

&lt;p&gt;That distinction is the core of the system.&lt;/p&gt;
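
&lt;p&gt;Stage 2 is essentially a bounded breadth-first walk over the edge graph. A simplified sketch follows; edge storage is illustrative, and in practice hierarchy edges are walked in both directions:&lt;/p&gt;

```python
from collections import deque

def expand_from_anchors(anchors, edges, max_hops=2):
    # edges: {section_id: [(edge_type, target), ...]} adjacency used
    # for traversal. Breadth-first, bounded by hop count; records how
    # each node was reached (edge type and hop distance).
    reached = {a: ("ANCHOR", 0) for a in anchors}
    queue = deque((a, 0) for a in anchors)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for edge_type, target in edges.get(node, []):
            if target not in reached:
                reached[target] = (edge_type, hops + 1)
                queue.append((target, hops + 1))
    return reached

edges = {
    "Section 9.1": [("REFERENCES", "Section 8.2"), ("PARENT_OF", "Section 9")],
    "Section 8.2": [("DEFINES", "Section 1.1")],
}
# Stage 1 (vector search) produced Section 9.1 as the only anchor;
# stage 2 recovers the sections connected to it.
print(expand_from_anchors(["Section 9.1"], edges, max_hops=2))
```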




&lt;h2&gt;
  
  
  Scoring: Blending Semantic and Structural Relevance
&lt;/h2&gt;

&lt;p&gt;Once the candidate set is assembled, the system still has to rank it.&lt;/p&gt;

&lt;p&gt;EngramDB scores each candidate using a blend of semantic similarity and structural relevance.&lt;/p&gt;

&lt;p&gt;In the current design, the score is computed as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;score = 0.5 * semantic_similarity + 0.5 * structural_score&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The equal weighting matters conceptually. It says that structural importance is not just a small reranking feature. It is a first-class retrieval signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structural score
&lt;/h3&gt;

&lt;p&gt;Structural relevance is based on two main factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the type of edge through which the node was reached&lt;/li&gt;
&lt;li&gt;the hop distance from the anchor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The score decays with distance, using a hop-decay factor of about &lt;code&gt;0.75&lt;/code&gt; per hop.&lt;/p&gt;

&lt;p&gt;Different edge types receive different weights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;REFERENCES&lt;/code&gt; = &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DEFINES&lt;/code&gt; = &lt;code&gt;0.9&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PARENT_OF&lt;/code&gt; = &lt;code&gt;0.55&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anchors themselves are treated specially and receive a structural score of &lt;code&gt;1.0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This ranking design captures an important practical truth. A section reached via an explicit legal reference may be more valuable than a section that merely looks semantically similar. Likewise, a definition section may deserve to outrank a loosely related paragraph because it provides the interpretation layer needed for the answer.&lt;/p&gt;
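
&lt;p&gt;Putting those numbers together, the scoring function is only a few lines. The fallback weight for an unrecognized edge type is this sketch's assumption, not a documented value:&lt;/p&gt;

```python
# Weights and decay as described above.
EDGE_WEIGHTS = {"REFERENCES": 1.0, "DEFINES": 0.9, "PARENT_OF": 0.55}
HOP_DECAY = 0.75

def structural_score(edge_type, hops):
    # Anchors get the full structural score; traversed nodes are
    # weighted by edge type and decayed per hop.
    if edge_type == "ANCHOR":
        return 1.0
    return EDGE_WEIGHTS.get(edge_type, 0.5) * (HOP_DECAY ** hops)

def blended_score(semantic_similarity, edge_type, hops):
    # Equal weighting: structure is a first-class retrieval signal.
    return 0.5 * semantic_similarity + 0.5 * structural_score(edge_type, hops)

# A definition reached two hops away scores respectably even with
# modest semantic similarity to the query:
print(round(blended_score(0.35, "DEFINES", 2), 6))
print(round(blended_score(0.60, "ANCHOR", 0), 6))
```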




&lt;h2&gt;
  
  
  The Real Engineering Trick: Preventing Graph Nodes from Getting Ranked Out
&lt;/h2&gt;

&lt;p&gt;A hybrid retrieval system can still fail even if the graph is correct.&lt;/p&gt;

&lt;p&gt;The reason is ranking pressure.&lt;/p&gt;

&lt;p&gt;Graph expansion often produces a large candidate set. A handful of anchor nodes can fan out into dozens of structurally related sections. But the final context budget is limited. The model may only receive ten or twelve pieces of evidence.&lt;/p&gt;

&lt;p&gt;If you simply rank everything by the combined score, semantically strong anchors can still dominate the final set. The graph-walk successfully discovers the right nodes, but the ranking layer drops them.&lt;/p&gt;

&lt;p&gt;This is a subtle but crucial failure mode.&lt;/p&gt;

&lt;p&gt;The graph worked.&lt;/p&gt;

&lt;p&gt;The retrieval result still failed.&lt;/p&gt;

&lt;p&gt;EngramDB addresses this using &lt;strong&gt;reserved traversal slots&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reserve part of the result budget for anchor nodes&lt;/li&gt;
&lt;li&gt;reserve another part for graph-discovered nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within the graph-reserved portion, candidates are prioritized by edge-type tier and then by semantic similarity.&lt;/p&gt;

&lt;p&gt;This prevents a large set of high-similarity local sections from completely displacing the structurally important evidence discovered through traversal.&lt;/p&gt;

&lt;p&gt;That design choice is one of the most practically important parts of the system. Without it, many hybrid retrievers quietly collapse back into vector-only behavior at the final ranking stage.&lt;/p&gt;
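
&lt;p&gt;A simplified version of slot reservation, assuming each candidate carries its similarity and the edge type through which it was reached. The tier ordering mirrors the edge weights above; the exact split shown here is arbitrary:&lt;/p&gt;

```python
EDGE_TIER = {"REFERENCES": 0, "DEFINES": 1, "PARENT_OF": 2}

def select_context(candidates, budget=10, graph_reserved=4):
    # candidates: list of dicts with keys
    #   id, similarity, edge_type ("ANCHOR" for stage-1 hits).
    # Reserve part of the budget for graph-discovered nodes so that
    # high-similarity anchors cannot crowd them out entirely.
    anchors = [c for c in candidates if c["edge_type"] == "ANCHOR"]
    traversed = [c for c in candidates if c["edge_type"] != "ANCHOR"]

    # Graph-reserved portion: edge-type tier first, then similarity.
    traversed.sort(key=lambda c: (EDGE_TIER.get(c["edge_type"], 9), -c["similarity"]))
    picked = traversed[:graph_reserved]

    # Fill the remaining slots with the best anchors.
    anchors.sort(key=lambda c: -c["similarity"])
    picked = anchors[: budget - len(picked)] + picked
    return [c["id"] for c in picked]

candidates = [
    {"id": "Section 9.1", "similarity": 0.91, "edge_type": "ANCHOR"},
    {"id": "Section 9.2", "similarity": 0.88, "edge_type": "ANCHOR"},
    {"id": "Section 1.1", "similarity": 0.30, "edge_type": "DEFINES"},
    {"id": "Section 8.2", "similarity": 0.25, "edge_type": "REFERENCES"},
]
print(select_context(candidates, budget=3, graph_reserved=2))
```

&lt;p&gt;With a budget of three and two reserved slots, the low-similarity definition and cross-reference survive the cut instead of losing to the second anchor.&lt;/p&gt;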




&lt;h2&gt;
  
  
  Failure Modes and What They Teach Us
&lt;/h2&gt;

&lt;p&gt;A retrieval system becomes much more useful once its failure modes are explicit.&lt;/p&gt;

&lt;p&gt;Two categories mattered most in EngramDB.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Traversed but ranked out
&lt;/h3&gt;

&lt;p&gt;In this case, the graph expansion discovers the required section, but the final ranking stage excludes it from the top context window.&lt;/p&gt;

&lt;p&gt;This tends to happen in dense graph neighborhoods where many candidates compete for a small number of slots.&lt;/p&gt;

&lt;p&gt;Mitigations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reserved traversal slots&lt;/li&gt;
&lt;li&gt;stronger edge-aware ranking&lt;/li&gt;
&lt;li&gt;reranking within structural tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Not reachable from anchors
&lt;/h3&gt;

&lt;p&gt;A different failure happens when the initial vector anchors are poor.&lt;/p&gt;

&lt;p&gt;This is especially common for questions that reference section numbers directly, such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does Section 6 reference?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That kind of query does not necessarily produce useful semantic anchors, because the important signal is not meaning in the embedding sense. It is explicit document metadata.&lt;/p&gt;

&lt;p&gt;A natural mitigation is metadata-aware anchor injection. If the query contains a recognizable section pattern, the corresponding section can be inserted directly into the anchor set before graph expansion starts.&lt;/p&gt;

&lt;p&gt;This is a good reminder that in structured domains, metadata is not a workaround. It is part of the retrieval representation.&lt;/p&gt;
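
&lt;p&gt;A sketch of that mitigation: if the query names a section that exists in the document, inject it at the front of the anchor set before expansion begins. The pattern and function names are illustrative:&lt;/p&gt;

```python
import re

SECTION_QUERY_RE = re.compile(
    r"\b(Section\s+\d+(?:\.\d+)*|Article\s+[IVXLC]+)\b", re.IGNORECASE
)

def inject_metadata_anchors(query, known_sections, vector_anchors):
    # If the query names a section explicitly, put that section into
    # the anchor set regardless of its embedding similarity.
    anchors = list(vector_anchors)
    for m in SECTION_QUERY_RE.finditer(query):
        sec = re.sub(r"\s+", " ", m.group(1)).title()
        if sec in known_sections and sec not in anchors:
            anchors.insert(0, sec)
    return anchors

known = {"Section 6", "Section 8.2"}
print(inject_metadata_anchors("What does Section 6 reference?", known, ["Section 8.2"]))
```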




&lt;h2&gt;
  
  
  Benchmark Results: Where the Hybrid Approach Actually Wins
&lt;/h2&gt;

&lt;p&gt;The central hypothesis behind EngramDB is that document-native structure provides retrieval signal that vector similarity alone does not fully capture, especially for multi-hop questions.&lt;/p&gt;

&lt;p&gt;The reported benchmark evaluates this hypothesis on contract QA using CUAD documents.&lt;/p&gt;

&lt;p&gt;The setup described in the project materials includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;183 multi-hop questions&lt;/li&gt;
&lt;li&gt;35 contracts&lt;/li&gt;
&lt;li&gt;questions requiring retrieval of two to three structurally linked sections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reported results for hybrid retrieval versus vector-only retrieval are substantial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overall recall
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid: &lt;code&gt;92.8%&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Vector-only: &lt;code&gt;68.2%&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Two-hop recall
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid: &lt;code&gt;97.6%&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Vector-only: &lt;code&gt;66.8%&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Three-hop recall
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid: &lt;code&gt;86.5%&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Vector-only: &lt;code&gt;70.0%&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The largest gains appear in exactly the places you would expect if structure is the missing signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cross-reference queries&lt;/li&gt;
&lt;li&gt;termination chains&lt;/li&gt;
&lt;li&gt;definition-linked questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the scenarios where the answer depends on sections that are explicitly connected but not necessarily semantically local.&lt;/p&gt;

&lt;p&gt;That is important because it suggests the system is not just benefiting from more retrieval. It is benefiting from the right retrieval bias.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Beyond Legal Tech
&lt;/h2&gt;

&lt;p&gt;Although EngramDB was motivated by contract QA, the deeper lesson is broader.&lt;/p&gt;

&lt;p&gt;A lot of retrieval research is centered on semantic representation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better embeddings&lt;/li&gt;
&lt;li&gt;better rerankers&lt;/li&gt;
&lt;li&gt;larger context windows&lt;/li&gt;
&lt;li&gt;fusion strategies across retrievers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of those matter.&lt;/p&gt;

&lt;p&gt;But EngramDB highlights another axis of improvement that is often underused: the internal structure of the document itself.&lt;/p&gt;

&lt;p&gt;In domains where meaning is distributed across linked sections, structure is not auxiliary. It is part of relevance.&lt;/p&gt;

&lt;p&gt;That makes this style of retrieval interesting beyond legal documents.&lt;/p&gt;

&lt;p&gt;The same pattern appears in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compliance manuals&lt;/li&gt;
&lt;li&gt;regulatory filings&lt;/li&gt;
&lt;li&gt;financial disclosures&lt;/li&gt;
&lt;li&gt;technical specifications&lt;/li&gt;
&lt;li&gt;API documents with defined terms and references&lt;/li&gt;
&lt;li&gt;scientific papers with structured section dependencies&lt;/li&gt;
&lt;li&gt;policy frameworks and governance documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any domain where answers live in chains rather than paragraphs is a candidate for graph-aware retrieval.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Building AgreedPro Made Clear
&lt;/h2&gt;

&lt;p&gt;The most important lesson from this work is not that vector retrieval is wrong.&lt;/p&gt;

&lt;p&gt;Vector retrieval is extremely effective at finding where the answer starts.&lt;/p&gt;

&lt;p&gt;The lesson is that, in structured domains, that is often not enough.&lt;/p&gt;

&lt;p&gt;Contract meaning is compositional. Definitions scope terms. Exceptions alter clauses. Parent sections constrain children. Cross-references relocate meaning. A retrieval system that only optimizes for semantic proximity will often return text that is relevant, but not sufficient.&lt;/p&gt;

&lt;p&gt;EngramDB is an attempt to align the retrieval system with the way contracts actually encode meaning.&lt;/p&gt;

&lt;p&gt;Instead of assuming the answer is a chunk, it assumes the answer is a connected set of sections.&lt;/p&gt;

&lt;p&gt;That difference changes everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Broader Principle for Retrieval System Design
&lt;/h2&gt;

&lt;p&gt;If there is one general principle I would take from this work, it is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Retrieval should optimize for reasoning completeness, not just semantic similarity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That principle has practical consequences.&lt;/p&gt;

&lt;p&gt;It means you should think seriously about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how documents are segmented&lt;/li&gt;
&lt;li&gt;what structural signals are preserved during ingestion&lt;/li&gt;
&lt;li&gt;whether explicit links are represented as edges&lt;/li&gt;
&lt;li&gt;how ranking protects graph-discovered evidence&lt;/li&gt;
&lt;li&gt;when metadata should override embedding-only assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you acknowledge that many real-world questions are multi-hop, it becomes much harder to justify retrieval systems that behave as if the answer were always a single best chunk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;What began as a practical problem inside AgreedPro became a deeper retrieval question.&lt;/p&gt;

&lt;p&gt;The limitation was not that language models could not answer contract questions. The limitation was that we were often handing them incomplete evidence and asking them to reconstruct relationships the document had already made explicit.&lt;/p&gt;

&lt;p&gt;EngramDB came out of trying to make those relationships first-class.&lt;/p&gt;

&lt;p&gt;By combining vector search with graph traversal, by using deterministic structure extraction during ingestion, and by scoring candidates with both semantic and structural signals, the system moves closer to the real shape of contract reasoning.&lt;/p&gt;

&lt;p&gt;Once you see contracts that way, the original failure mode becomes hard to ignore.&lt;/p&gt;

&lt;p&gt;The answer was never in one chunk.&lt;/p&gt;

&lt;p&gt;It was in the links between them.&lt;/p&gt;

&lt;p&gt;The source code can be found &lt;a href="https://github.com/pyalwin/engramdb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>Beyond Simple Prompts: Architecting an AI Agent</title>
      <dc:creator>pyalwin</dc:creator>
      <pubDate>Wed, 17 Dec 2025 15:40:32 +0000</pubDate>
      <link>https://forem.com/pyalwin/beyond-simple-prompts-architecting-an-ai-agent-1i8c</link>
      <guid>https://forem.com/pyalwin/beyond-simple-prompts-architecting-an-ai-agent-1i8c</guid>
      <description>&lt;p&gt;Building a chatbot is easy. Building an AI agent that can review a 50-page Master Services Agreement and suggest redlines &lt;em&gt;without breaking the document formatting&lt;/em&gt; is a different problem entirely.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through how I dug into this problem as a weekend project and ended up designing a system that automates contract review for legal teams. The challenge wasn't calling an LLM API; it was everything else: maintaining document structure, handling mutable paragraphs, and generating valid Microsoft Word tracked changes programmatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Contract review follows a predictable pattern. A legal team receives a counterparty's redlined contract, reviews each change against the organization's risk tolerance, and accepts, rejects, or modifies each suggestion. This process takes hours for a single contract.&lt;/p&gt;

&lt;p&gt;So when I set out to automate this process, I realized it involves several requirements, including but not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Analyze contracts against specific guidelines&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate specific text suggestions&lt;/strong&gt; with rationale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply changes as Word tracked changes&lt;/strong&gt;—not as plain text replacements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survive document mutations&lt;/strong&gt;—users edit contracts while analysis runs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Requirements #3 and #4 are where most tools struggle. They output suggestions in a chat interface, so users still have to copy, paste, and reformat by hand. That's not automation; it's a fancier Ctrl+F.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                     Architecture                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│   │   Web App    │     │  Word Add-in │     │   Backend    │    │
│   │  (Next.js)   │     │ (Office.js)  │     │  (FastAPI)   │    │
│   └──────┬───────┘     └──────┬───────┘     └──────┬───────┘    │
│          │                    │                    │            │
│          │    REST + SSE      │    REST + SSE      │            │
│          └────────────────────┴────────────────────┘            │
│                               │                                 │
│                    ┌──────────┴──────────┐                      │
│                    │   Analysis Engine   │                      │
│                    │  ┌───────────────┐  │                      │
│                    │  │  DSPy + LLM   │  │                      │
│                    │  │  (OpenAI /    │  │                      │
│                    │  │   Mistral)    │  │                      │
│                    │  └───────────────┘  │                      │
│                    └─────────────────────┘                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Dashboard&lt;/strong&gt;: Next.js application for rule management, analytics, and administration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word Add-in&lt;/strong&gt;: Microsoft Office plugin (React + Office.js) where users actually review contracts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend API&lt;/strong&gt;: FastAPI service handling analysis, LLM orchestration, and document processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting engineering lives in the Word Add-in (document manipulation) and Backend API (analysis pipeline).&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge #1: The Mutable Document Problem
&lt;/h2&gt;

&lt;p&gt;Here's a scenario that breaks naive implementations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User uploads 50-page contract&lt;/li&gt;
&lt;li&gt;System analyzes paragraphs 1-50, stores suggestions keyed by paragraph index&lt;/li&gt;
&lt;li&gt;User deletes paragraph 12 while waiting&lt;/li&gt;
&lt;li&gt;System returns: "Paragraph 47 needs revision"&lt;/li&gt;
&lt;li&gt;Paragraph 47 is now paragraph 46. Suggestion applied to wrong location.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Solution: Paragraph Anchoring
&lt;/h3&gt;

&lt;p&gt;I built logic that assigns a persistent ID to each paragraph during preprocessing.&lt;/p&gt;

&lt;p&gt;The ID is stored in the OOXML itself (via a content control), so it persists even when users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete adjacent paragraphs&lt;/li&gt;
&lt;li&gt;Cut and paste sections&lt;/li&gt;
&lt;li&gt;Accept/reject other tracked changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the frontend, a Zustand store maintains bidirectional mappings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ParagraphStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;indexToPersistentIdMap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// index → UUID&lt;/span&gt;
  &lt;span class="nl"&gt;persistentIdToIndexMap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// UUID → index&lt;/span&gt;

  &lt;span class="nf"&gt;findAnchorByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// fallback matching&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When analysis results return, they reference UUIDs. The store resolves current paragraph indices at application time—not at analysis time.&lt;/p&gt;
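
&lt;p&gt;The resolution step can be sketched as follows. This is a minimal illustration rather than the actual store: the class name, the 0.8 similarity threshold, and the difflib-based fuzzy fallback are all assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: resolve a suggestion's UUID to the paragraph's *current* index,
# falling back to fuzzy text matching when the anchor has been destroyed.
from difflib import SequenceMatcher

class ParagraphAnchors:
    def __init__(self):
        self.id_to_index = {}    # UUID -&amp;gt; current paragraph index
        self.index_to_text = {}  # current index -&amp;gt; paragraph text

    def register(self, index, uid, text):
        self.id_to_index[uid] = index
        self.index_to_text[index] = text

    def resolve(self, uid, cited_text=None):
        if uid in self.id_to_index:
            return self.id_to_index[uid]
        if cited_text:  # anchor lost: fall back to best text match
            best, best_ratio = None, 0.0
            for idx, text in self.index_to_text.items():
                ratio = SequenceMatcher(None, cited_text, text).ratio()
                if ratio &amp;gt; best_ratio:
                    best, best_ratio = idx, ratio
            if best_ratio &amp;gt;= 0.8:  # tunable threshold (an assumption)
                return best
        return None  # refuse rather than apply to the wrong paragraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;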




&lt;h2&gt;
  
  
  Challenge #2: Generating Word Tracked Changes
&lt;/h2&gt;

&lt;p&gt;This is the hard part.&lt;/p&gt;

&lt;p&gt;Office.js provides no API for creating tracked changes. The &lt;code&gt;paragraph.insertText()&lt;/code&gt; method just replaces text. To create actual redlines (strikethrough deletions, colored insertions), you must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a diff between original and suggested text&lt;/li&gt;
&lt;li&gt;Convert that diff to OOXML elements&lt;/li&gt;
&lt;li&gt;Apply these elements to the actual content using Office.js&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the diff generation, I implemented token-based diffing.&lt;/p&gt;

&lt;p&gt;Character-level diffs create garbage in Word. "The quick brown fox" → "A quick red fox" would show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T̶h̶e̶A quick b̶r̶o̶w̶n̶red fox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unusable. Token-level diffs are cleaner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The → A quick brown → red fox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
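
&lt;p&gt;A token-level diff like the one above can be produced with a standard longest-common-subsequence matcher run over word tokens instead of characters. A minimal sketch using Python's difflib; the tokenizer and the op format are simplifications of the real implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal token-level diff: tokenize into words and whitespace, then run
# difflib's matching-block algorithm over tokens, not characters.
import re
from difflib import SequenceMatcher

def token_diff(original, revised):
    a = re.findall(r"\S+|\s+", original)
    b = re.findall(r"\S+|\s+", revised)
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            ops.append(("equal", "".join(a[i1:i2])))
        else:  # "replace" becomes a delete + insert pair
            if i2 &amp;gt; i1:
                ops.append(("delete", "".join(a[i1:i2])))
            if j2 &amp;gt; j1:
                ops.append(("insert", "".join(b[j1:j2])))
    return ops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;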



&lt;p&gt;Another important aspect is preserving the original paragraph properties. Contracts have formatting: numbering, indentation, styles. Naive replacement destroys this.&lt;/p&gt;
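
&lt;p&gt;Concretely, the diff operations become &lt;code&gt;w:ins&lt;/code&gt; and &lt;code&gt;w:del&lt;/code&gt; runs (the standard OOXML tracked-change elements from ECMA-376) that replace the paragraph's existing runs while leaving its &lt;code&gt;w:pPr&lt;/code&gt; properties untouched. A hedged sketch; the op format and function name are illustrative, and real output also needs &lt;code&gt;w:date&lt;/code&gt; attributes and run properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: map diff ops [(kind, text)] to runs with tracked-change markup.
from xml.sax.saxutils import escape

def ops_to_ooxml(ops, author="AgreedPro"):
    parts = []
    for i, (kind, text) in enumerate(ops):
        t = escape(text)
        if kind == "equal":      # unchanged text: plain run
            parts.append(f'&amp;lt;w:r&amp;gt;&amp;lt;w:t xml:space="preserve"&amp;gt;{t}&amp;lt;/w:t&amp;gt;&amp;lt;/w:r&amp;gt;')
        elif kind == "delete":   # strikethrough deletion
            parts.append(f'&amp;lt;w:del w:id="{i}" w:author="{author}"&amp;gt;'
                         f'&amp;lt;w:r&amp;gt;&amp;lt;w:delText xml:space="preserve"&amp;gt;{t}&amp;lt;/w:delText&amp;gt;&amp;lt;/w:r&amp;gt;&amp;lt;/w:del&amp;gt;')
        elif kind == "insert":   # colored insertion
            parts.append(f'&amp;lt;w:ins w:id="{i}" w:author="{author}"&amp;gt;'
                         f'&amp;lt;w:r&amp;gt;&amp;lt;w:t xml:space="preserve"&amp;gt;{t}&amp;lt;/w:t&amp;gt;&amp;lt;/w:r&amp;gt;&amp;lt;/w:ins&amp;gt;')
    return "".join(parts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;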


&lt;h2&gt;
  
  
  Challenge #3: Long-Running Analysis
&lt;/h2&gt;

&lt;p&gt;A 50-page contract with 30 playbook rules can take 2-3 minutes to analyze. HTTP requests shouldn't block that long.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session-Based Async Processing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client                          Server
  │                               │
  │ POST                          │
  │ ─────────────────────────────&amp;gt;│
  │                               │ Create session
  │ { session_id: "abc123" }      │ Start background task
  │ &amp;lt;─────────────────────────────│
  │                               │
  │ GET /sessions/abc123          │
  │ ─────────────────────────────&amp;gt;│
  │ { status: "processing",       │
  │   progress: 45% }             │
  │ &amp;lt;─────────────────────────────│
  │                               │
  │ ... poll every 3 seconds ...  │
  │                               │
  │ GET /sessions/abc123          │
  │ ─────────────────────────────&amp;gt;│
  │ { status: "complete",         │
  │   results: [...] }            │
  │ &amp;lt;─────────────────────────────│
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
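
&lt;p&gt;The server side of this pattern reduces to a session store plus a background worker. A minimal sketch with an in-memory dict and a thread; in the real system this sits behind FastAPI endpoints, and the names here are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: session-based async processing with an in-memory store.
import threading
import uuid

SESSIONS = {}  # session_id -&amp;gt; {"status", "progress", "results"}

def start_analysis(paragraphs, analyze_one):
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"status": "processing", "progress": 0, "results": []}

    def worker():
        s = SESSIONS[session_id]
        for n, para in enumerate(paragraphs, start=1):
            s["results"].append(analyze_one(para))      # the slow LLM call
            s["progress"] = int(100 * n / len(paragraphs))
        s["status"] = "complete"

    threading.Thread(target=worker, daemon=True).start()
    return session_id  # client polls the session endpoint with this ID

def poll(session_id):
    return SESSIONS[session_id]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;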



&lt;h3&gt;
  
  
  Cache Validation with Content Hashing
&lt;/h3&gt;

&lt;p&gt;Users often analyze the same contract multiple times—different guidelines, or checking after minor edits. Re-analyzing unchanged content wastes time and API costs.&lt;/p&gt;

&lt;p&gt;The hash comparison catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-uploads of identical files&lt;/li&gt;
&lt;li&gt;"Analyze again" clicks without actual changes&lt;/li&gt;
&lt;li&gt;Multiple users analyzing the same template&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cache hit rate in production: ~40% for typical contract review workflows.&lt;/p&gt;
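
&lt;p&gt;The hashing itself is a few lines. A minimal sketch of the idea; the function names and the exact normalization are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: key analysis results by SHA-256 of normalized text + rule set,
# so re-uploads and no-op "analyze again" clicks become cache hits.
import hashlib
import json

_CACHE = {}  # content hash -&amp;gt; analysis results

def cache_key(paragraphs, rules):
    normalized = "\n".join(p.strip() for p in paragraphs)
    payload = json.dumps({"text": normalized, "rules": sorted(rules)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def analyze_with_cache(paragraphs, rules, run_analysis):
    key = cache_key(paragraphs, rules)
    if key not in _CACHE:  # miss: pay the LLM cost exactly once
        _CACHE[key] = run_analysis(paragraphs, rules)
    return _CACHE[key]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;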




&lt;h2&gt;
  
  
  Challenge #4: Grounding and Hallucination Prevention
&lt;/h2&gt;

&lt;p&gt;Legal documents require precision. An AI suggesting "Vendor liability is capped at $1M" when the contract says "$500K" is worse than no suggestion at all.&lt;/p&gt;

&lt;p&gt;The best way I found to solve this is structured output with explicit citations.&lt;/p&gt;

&lt;p&gt;Every suggestion must reference the exact source text.&lt;/p&gt;

&lt;p&gt;In practice, this catches suggestions where the model paraphrases slightly instead of quoting exactly.&lt;/p&gt;
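
&lt;p&gt;The validation layer is then mechanical: any suggestion whose cited text is not a verbatim substring of its source paragraph gets rejected before it reaches the user. A sketch with an assumed suggestion schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: reject any suggestion whose cited text is not a verbatim
# substring of the paragraph it claims to quote. Schema is illustrative.
def validate_suggestions(suggestions, paragraphs_by_id):
    grounded, rejected = [], []
    for s in suggestions:
        para = paragraphs_by_id.get(s["paragraph_id"], "")
        if s["source_text"] and s["source_text"] in para:
            grounded.append(s)
        else:  # paraphrased or fabricated quote: never shown to users
            rejected.append(s)
    return grounded, rejected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;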




&lt;h2&gt;
  
  
  The Analysis Pipeline
&lt;/h2&gt;

&lt;p&gt;Putting it together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────┐
│                     Redline Analysis Pipeline                  │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. DOCUMENT INGESTION                                         │
│     ┌─────────┐     ┌─────────────┐     ┌──────────────┐       │
│     │  DOCX   │────&amp;gt;│ Extract     │────&amp;gt;│ Paragraph    │       │
│     │  File   │     │ OOXML       │     │ Anchoring    │       │
│     └─────────┘     └─────────────┘     └──────────────┘       │
│                                                                │
│  2. CONTENT NORMALIZATION                                      │
│     ┌─────────────┐     ┌─────────────────┐                    │
│     │ OOXML with  │────&amp;gt;│ Unified         │                    │
│     │ Tracked     │     │ Markdown        │                    │
│     │ Changes     │     │ (Original +     │                    │
│     │             │     │  Revised views) │                    │
│     └─────────────┘     └─────────────────┘                    │
│                                                                │
│  3. LLM ANALYSIS                                               │
│     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐   │
│     │             │────&amp;gt;│ DSPy        │────&amp;gt;│ Structured   │   │
│     │ Rules       │     │ Signatures  │     │ Suggestions  │   │
│     └─────────────┘     └─────────────┘     └──────────────┘   │
│                                                                │
│  4. OUTPUT GENERATION                                          │
│     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐   │
│     │ Suggestions │────&amp;gt;│ Token Diff  │────&amp;gt;│ OOXML        │   │
│     │ + Rationale │     │ Algorithm   │     │              │   │
│     └─────────────┘     └─────────────┘     └──────────────┘   │
│                                                                │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OOXML-to-Markdown conversion deserves special mention. Incoming contracts often already have tracked changes from counterparty negotiations. The converter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses the &lt;code&gt;w:ins&lt;/code&gt; and &lt;code&gt;w:del&lt;/code&gt; elements&lt;/li&gt;
&lt;li&gt;Generates two synchronized views: &lt;strong&gt;Original&lt;/strong&gt; (with deletions, without insertions) and &lt;strong&gt;Revised&lt;/strong&gt; (with insertions, without deletions)&lt;/li&gt;
&lt;li&gt;Preserves paragraph IDs from content controls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This abstraction means the LLM analyzes clean markdown, not raw XML. The complexity stays in the conversion layer.&lt;/p&gt;
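
&lt;p&gt;The dual-view generation in step 2 is the core trick. A sketch, assuming the parser has already reduced the OOXML to a flat list of runs tagged as plain, inserted, or deleted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: build the two synchronized views from a flat run list.
# kind is "text" (untracked), "ins" (insertion), or "del" (deletion).
def dual_views(runs):
    original, revised = [], []
    for kind, text in runs:
        if kind in ("text", "del"):   # Original view keeps deletions
            original.append(text)
        if kind in ("text", "ins"):   # Revised view keeps insertions
            revised.append(text)
    return "".join(original), "".join(revised)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;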




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;The system processes a 20-page contract in approximately 30-45 seconds, depending on rule complexity. Key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate&lt;/strong&gt;: ~40% (saves re-analysis on unchanged content)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate&lt;/strong&gt;: &amp;lt;5% (caught by validation, not surfaced to users)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format preservation&lt;/strong&gt;: 95% (paragraph properties maintained)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracked change accuracy&lt;/strong&gt;: Token-level precision&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Office.js is powerful but underdocumented.&lt;/strong&gt; The OOXML manipulation pattern isn't in any official guide. I reverse-engineered it by exporting documents and reading the XML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Character-level diffs are wrong for documents.&lt;/strong&gt; Always tokenize first. General-purpose diff libraries don't know about words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Async patterns matter more than you think.&lt;/strong&gt; The session-based polling approach sounds simple, but handling edge cases (browser refresh, network drops, server restarts) required careful state management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ground everything.&lt;/strong&gt; LLMs will confidently cite text that doesn't exist. Validation layers catch this, but only if you design the output schema to require explicit source references.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content hashing is cheap insurance.&lt;/strong&gt; The SHA-256 computation is negligible compared to LLM costs. Cache validation paid for itself in the first week.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Tech Stack Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend API&lt;/td&gt;
&lt;td&gt;FastAPI (Python)&lt;/td&gt;
&lt;td&gt;Async-native, great for long-running tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Orchestration&lt;/td&gt;
&lt;td&gt;DSPy&lt;/td&gt;
&lt;td&gt;Structured outputs, provider-agnostic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Providers&lt;/td&gt;
&lt;td&gt;OpenAI, Mistral&lt;/td&gt;
&lt;td&gt;Redundancy, cost optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;Supabase (PostgreSQL)&lt;/td&gt;
&lt;td&gt;Real-time subscriptions, hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Frontend&lt;/td&gt;
&lt;td&gt;Next.js&lt;/td&gt;
&lt;td&gt;SSR for dashboard, API routes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word Add-in&lt;/td&gt;
&lt;td&gt;React + Office.js&lt;/td&gt;
&lt;td&gt;Only option for Word integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Processing&lt;/td&gt;
&lt;td&gt;python-docx, custom OOXML&lt;/td&gt;
&lt;td&gt;No library handles tracked changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;The interesting engineering in "AI for X" products is rarely the AI part. Calling an LLM API is straightforward. The challenge is everything around it: maintaining document fidelity, handling state across long-running operations, and building validation layers that catch model failures before users see them.&lt;/p&gt;

&lt;p&gt;Legal redlining pushed me to solve problems I didn't anticipate—paragraph anchoring, OOXML manipulation, token-based diffing. Each solution came from understanding the domain deeply, not from finding a better prompt.&lt;/p&gt;

&lt;p&gt;If you're building in this space, I'd be interested to hear about your approach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Arun Venkataramanan is a Senior Software Engineer at Ottimate, where he works on architecting solutions for accounts payable automation. With a background spanning core banking systems (TCS), fintech platforms, and enterprise automation, he focuses on building solutions and tools that help users automate repetitive parts of their day-to-day work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connect on &lt;a href="https://www.linkedin.com/in/arun-venkataramanan/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tooling</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
