<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Recep Çiftçi</title>
    <description>The latest articles on Forem by Recep Çiftçi (@recep_ciftci).</description>
    <link>https://forem.com/recep_ciftci</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942968%2F4b79ee28-ecd7-4358-996b-9049668d78c3.png</url>
      <title>Forem: Recep Çiftçi</title>
      <link>https://forem.com/recep_ciftci</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/recep_ciftci"/>
    <language>en</language>
    <item>
      <title>Graph RAG vs Vector RAG: When to Use Each</title>
      <dc:creator>Recep Çiftçi</dc:creator>
      <pubDate>Fri, 22 May 2026 07:21:31 +0000</pubDate>
      <link>https://forem.com/recep_ciftci/graph-rag-vs-vector-rag-when-to-use-each-3628</link>
      <guid>https://forem.com/recep_ciftci/graph-rag-vs-vector-rag-when-to-use-each-3628</guid>
      <description>&lt;h1&gt;
  
  
  Graph RAG vs Vector RAG: When to Use Each
&lt;/h1&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) helps LLMs use external knowledge more reliably. In practice, two patterns show up often: &lt;strong&gt;Vector RAG&lt;/strong&gt; and &lt;strong&gt;Graph RAG&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both try to solve the same problem: bring relevant context to the model. They just do it with different data models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector RAG: similarity-based retrieval&lt;/li&gt;
&lt;li&gt;Graph RAG: relationship-based retrieval&lt;/li&gt;
&lt;li&gt;Hybrid search: combining both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses on architecture patterns, chunking strategies, storage choices, and when each option makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick definitions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Vector RAG
&lt;/h3&gt;

&lt;p&gt;Documents are split into chunks, embeddings are generated, and the chunks are stored in a vector database. When a query arrives, its embedding is computed and the nearest chunks are retrieved.&lt;/p&gt;

&lt;p&gt;Its main strengths are simplicity and low operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graph RAG
&lt;/h3&gt;

&lt;p&gt;Knowledge is modeled as nodes and relationships. Nodes can represent documents, entities, events, concepts, or claims. Edges capture relationships such as "depends on", "references", "part of", or "causes".&lt;/p&gt;

&lt;p&gt;At query time, retrieval can return not only similar chunks but also a related subgraph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural differences
&lt;/h2&gt;

&lt;p&gt;The diagram below summarizes the basic flow of both approaches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1aguyvoek1e6pa1t5ai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1aguyvoek1e6pa1t5ai.png" alt="Graph RAG and Vector RAG architecture comparison" width="800" height="886"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector RAG flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Split documents into chunks&lt;/li&gt;
&lt;li&gt;Generate chunk embeddings&lt;/li&gt;
&lt;li&gt;Store them in a vector database&lt;/li&gt;
&lt;li&gt;Retrieve nearest neighbors for the query embedding&lt;/li&gt;
&lt;li&gt;Add the retrieved context to the prompt&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This flow is usually straightforward, fast, and well understood.&lt;/p&gt;
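
&lt;p&gt;The five steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production setup. &lt;code&gt;embed&lt;/code&gt; is a hypothetical stand-in that hashes words into a 256-dimensional bag-of-words vector; a real system would call an embedding model. A Python list plays the role of the vector database.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import math
import re
import zlib

DIM = 256  # assumed size of the toy embedding space


def embed(text):
    """Hypothetical stand-in for an embedding model: a hashed bag of words."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so a dot product is cosine similarity


def chunk(doc, size=40):
    """Step 1: naive fixed-size word chunks."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(docs):
    """Steps 2 and 3: embed every chunk; a list stands in for the vector database."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]


def retrieve(index, query, k=3):
    """Step 4: return the k chunks nearest to the query embedding."""
    q = embed(query)
    scored = [(c, sum(a * b for a, b in zip(q, v))) for c, v in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in scored[:k]]


def build_prompt(query, chunks):
    """Step 5: add the retrieved context to the prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


index = build_index([
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes three to five business days.",
])
top = retrieve(index, "When are refunds issued?", k=1)
print(build_prompt("When are refunds issued?", top))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Swapping in a real embedding model and a real vector index keeps the same shape: everything before prompt assembly is nearest-neighbor search.&lt;/p&gt;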

&lt;h3&gt;
  
  
  Graph RAG flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Extract entities and relationships from documents&lt;/li&gt;
&lt;li&gt;Build and store the graph&lt;/li&gt;
&lt;li&gt;Identify seed nodes for the query&lt;/li&gt;
&lt;li&gt;Expand the subgraph&lt;/li&gt;
&lt;li&gt;Generate context from the relevant nodes and edges&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key difference is that retrieval uses not only similarity, but also &lt;strong&gt;structural context&lt;/strong&gt;.&lt;/p&gt;
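
&lt;p&gt;Steps 3 to 5 can be illustrated on a tiny, hand-built graph. In this sketch the graph is a Python dict of labeled edges, and all node names are hypothetical. Seed nodes are found by matching node names in the query; real systems typically use entity linking or a vector search for this step. Expansion is a breadth-first walk limited to two hops.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from collections import deque

# Hypothetical service graph: node -> list of (relation, target) edges.
GRAPH = {
    "checkout-service": [("depends on", "payment-api"), ("depends on", "inventory-db")],
    "payment-api": [("depends on", "fraud-model")],
    "inventory-db": [],
    "fraud-model": [],
    "search-service": [("depends on", "search-index")],
    "search-index": [],
}


def find_seeds(query, graph):
    """Step 3: naive entity linking by matching node names in the query."""
    text = query.lower()
    return [node for node in graph if node in text]


def expand(graph, seeds, max_hops=2):
    """Step 4: breadth-first expansion, collecting (source, relation, target) triples."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    triples = []
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:  # do not follow edges beyond the hop limit
            continue
        for relation, target in graph.get(node, []):
            triples.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                frontier.append((target, hops + 1))
    return triples


def to_context(triples):
    """Step 5: serialize the subgraph as one plain-text fact per line."""
    return "\n".join(f"{s} {r} {t}" for s, r, t in triples)


seeds = find_seeds("What does checkout-service depend on?", GRAPH)
triples = expand(GRAPH, seeds)
print(to_context(triples))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The expansion also returns &lt;code&gt;fraud-model&lt;/code&gt;. That node is two hops away and shares no words with the question, so similarity search alone would be unlikely to surface it. The hop limit is the main knob, because each extra hop can add many nodes to the context.&lt;/p&gt;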

&lt;h2&gt;
  
  
  Chunking strategies
&lt;/h2&gt;

&lt;p&gt;Chunking is one of the most important quality levers in any RAG system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunking for Vector RAG
&lt;/h3&gt;

&lt;p&gt;Good chunking for Vector RAG usually has these properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;meaningful semantic boundaries&lt;/li&gt;
&lt;li&gt;chunks small enough to stay focused on one topic&lt;/li&gt;
&lt;li&gt;overlap that preserves enough context&lt;/li&gt;
&lt;li&gt;retention of headings, subheadings, and references&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chunks that are too small fragment the context. Chunks that are too large dilute the retrieval signal, because a single embedding has to represent several topics at once.&lt;/p&gt;
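
&lt;p&gt;One way to apply these properties is to split at headings first, then cut long sections into overlapping windows. The sketch assumes Markdown-style &lt;code&gt;#&lt;/code&gt; headings and measures size in words rather than tokens. The defaults of 120 words with a 20-word overlap are arbitrary placeholders to tune per corpus.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import math


def chunk_by_heading(text, max_words=120, overlap=20):
    """Split text at '#' headings (assumed Markdown style), then window long sections.

    Each chunk is prefixed with its section heading so the embedding keeps
    that context, and consecutive windows share `overlap` words.
    """
    sections, heading, body = [], "", []
    for line in text.splitlines():
        if line.startswith("#"):
            if body:  # close the previous section
                sections.append((heading, body))
            heading, body = line.lstrip("#").strip(), []
        elif line.strip():
            body.extend(line.split())
    if body:
        sections.append((heading, body))

    step = max(1, max_words - overlap)
    chunks = []
    for heading, words in sections:
        # Enough windows to cover every word without a trailing overlap-only chunk.
        n_windows = 1 + max(0, math.ceil((len(words) - max_words) / step))
        for i in range(n_windows):
            piece = " ".join(words[i * step:i * step + max_words])
            chunks.append(f"{heading}: {piece}" if heading else piece)
    return chunks


doc = "# Refunds\nRefunds are issued within 14 days.\n# Shipping\nOrders ship in 3 days."
for c in chunk_by_heading(doc):
    print(c)
```
&lt;/code&gt;&lt;/pre&gt;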

&lt;h3&gt;
  
  
  Chunking for Graph RAG
&lt;/h3&gt;

&lt;p&gt;In Graph RAG, chunking alone is not enough, because the goal is often not sentence similarity but relation extraction.&lt;/p&gt;

&lt;p&gt;A stronger pipeline usually combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document chunking&lt;/li&gt;
&lt;li&gt;entity extraction&lt;/li&gt;
&lt;li&gt;relation extraction&lt;/li&gt;
&lt;li&gt;separation of claims and evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the data is first split as text, then transformed into structured knowledge.&lt;/p&gt;
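
&lt;p&gt;Here is a minimal sketch of that second stage. Each extracted relation becomes a claim that keeps the ID of the chunk it came from, so claims and their evidence stay separate but linked. The regular expressions are a deliberately naive, hypothetical stand-in. In practice this step is usually done with an entity-recognition model or an LLM extraction prompt.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import re
from dataclasses import dataclass

# Hypothetical relation patterns. A real pipeline would typically use an
# entity-recognition model or an LLM extraction prompt instead of regexes.
PATTERNS = {
    "depends on": re.compile(r"(\w[\w-]*) depends on (\w[\w-]*)"),
    "references": re.compile(r"(\w[\w-]*) references (\w[\w-]*)"),
}


@dataclass(frozen=True)
class Claim:
    subject: str
    relation: str
    obj: str
    evidence: str  # id of the chunk the claim was extracted from


def extract_claims(chunks):
    """Turn {chunk_id: text} into claims that point back to their evidence chunk."""
    claims = []
    for chunk_id, text in chunks.items():
        for relation, pattern in PATTERNS.items():
            for subject, obj in pattern.findall(text):
                claims.append(Claim(subject, relation, obj, chunk_id))
    return claims


chunks = {
    "doc1#0": "checkout-service depends on payment-api.",
    "doc2#3": "The runbook references payment-api and says payment-api depends on fraud-model.",
}
for claim in extract_claims(chunks):
    print(claim)
```
&lt;/code&gt;&lt;/pre&gt;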

&lt;h2&gt;
  
  
  Storage model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When a vector database is enough
&lt;/h3&gt;

&lt;p&gt;A vector database is often enough when the workload looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enterprise document search&lt;/li&gt;
&lt;li&gt;semantic FAQ&lt;/li&gt;
&lt;li&gt;similar content discovery&lt;/li&gt;
&lt;li&gt;low to medium complexity Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its main advantage is that indexing and querying are relatively standard.&lt;/p&gt;

&lt;h3&gt;
  
  
  When graph storage becomes useful
&lt;/h3&gt;

&lt;p&gt;Graph storage starts to matter when you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-hop questions&lt;/li&gt;
&lt;li&gt;entity-centric queries&lt;/li&gt;
&lt;li&gt;domains where abstract relationships matter&lt;/li&gt;
&lt;li&gt;provenance and traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Which policies does this decision depend on?"&lt;/li&gt;
&lt;li&gt;"What dependencies affect this service?"&lt;/li&gt;
&lt;li&gt;"Which components are related to this incident?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions need more than semantic proximity; they need the relationship network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros and cons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Vector RAG pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy to set up&lt;/li&gt;
&lt;li&gt;Fast path to a useful first version&lt;/li&gt;
&lt;li&gt;Strong for semantic search&lt;/li&gt;
&lt;li&gt;Mature vector database ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Vector RAG cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Weak on relationship-heavy questions&lt;/li&gt;
&lt;li&gt;Sensitive to chunk boundaries&lt;/li&gt;
&lt;li&gt;Retrieval may return context that is close but not correct&lt;/li&gt;
&lt;li&gt;Source traceability can be hard to explain&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Graph RAG pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better at representing relationships&lt;/li&gt;
&lt;li&gt;Useful for multi-hop reasoning&lt;/li&gt;
&lt;li&gt;Strong for source, dependency, and impact analysis&lt;/li&gt;
&lt;li&gt;Can be more explainable for structured queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Graph RAG cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Higher data modeling cost&lt;/li&gt;
&lt;li&gt;Entity/relation extraction errors can cascade&lt;/li&gt;
&lt;li&gt;More complex to operate and maintain&lt;/li&gt;
&lt;li&gt;More dependent on domain-specific graph design&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Which one should you use?
&lt;/h2&gt;

&lt;p&gt;A practical rule of thumb is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the question is mostly "find similar content", use Vector RAG&lt;/li&gt;
&lt;li&gt;If the question is mostly "follow the relationship", use Graph RAG&lt;/li&gt;
&lt;li&gt;If you need both semantic and structural signals, use hybrid search&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Vector RAG if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;the domain is mostly plain text&lt;/li&gt;
&lt;li&gt;questions can be answered directly from documents&lt;/li&gt;
&lt;li&gt;latency and simplicity are priorities&lt;/li&gt;
&lt;li&gt;you are building a fast MVP&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Graph RAG if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;the domain revolves around entities and relationships&lt;/li&gt;
&lt;li&gt;provenance is critical&lt;/li&gt;
&lt;li&gt;multi-step reasoning is needed&lt;/li&gt;
&lt;li&gt;explainability of search results matters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The hybrid search pattern
&lt;/h2&gt;

&lt;p&gt;For many real systems, the best answer is not "either/or" but &lt;strong&gt;both&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A common hybrid pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use vector search to find candidates&lt;/li&gt;
&lt;li&gt;Expand relationships with graph traversal&lt;/li&gt;
&lt;li&gt;Re-rank the combined results&lt;/li&gt;
&lt;li&gt;Keep only the most relevant context in the prompt&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pattern is especially useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;software architecture documentation&lt;/li&gt;
&lt;li&gt;compliance and policy search&lt;/li&gt;
&lt;li&gt;incident analysis and root-cause exploration&lt;/li&gt;
&lt;li&gt;product knowledge bases&lt;/li&gt;
&lt;/ul&gt;
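
&lt;p&gt;The four steps of this hybrid pattern can be sketched as one function, under three assumptions. First, vector search has already produced a similarity score per node. Second, the graph is a plain adjacency dict. Third, a graph neighbor inherits a decayed share of the score of the hit it was reached from; the 0.5 decay is an arbitrary choice. The re-ranking step is a simple sort, where a production system would typically plug in a dedicated reranker.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def hybrid_retrieve(vector_hits, graph, k=3, hop_decay=0.5):
    """Sketch of the hybrid pattern.

    vector_hits: {node_id: similarity} returned by vector search (step 1).
    graph: {node_id: [neighbor_id, ...]} adjacency lists (assumed shape).
    """
    scores = dict(vector_hits)
    # Step 2: expand one hop; a neighbor inherits a decayed share of its seed's score.
    for node, similarity in vector_hits.items():
        for neighbor in graph.get(node, []):
            scores[neighbor] = max(scores.get(neighbor, 0.0), similarity * hop_decay)
    # Step 3: re-rank the combined candidates (a real reranker would go here).
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Step 4: keep only the top k for the prompt.
    return [node for node, _ in ranked[:k]]


vector_hits = {"incident-42": 0.9, "runbook-db": 0.4}
graph = {"incident-42": ["payment-api", "deploy-1187"], "payment-api": ["fraud-model"]}
print(hybrid_retrieve(vector_hits, graph))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example the two graph neighbors of &lt;code&gt;incident-42&lt;/code&gt; push the weaker vector hit &lt;code&gt;runbook-db&lt;/code&gt; out of the top three. That is the behavior the pattern is meant to produce for incident analysis.&lt;/p&gt;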

&lt;h2&gt;
  
  
  Design notes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Define the retrieval target clearly
&lt;/h3&gt;

&lt;p&gt;"Correct answer" and "correct context" are not the same thing. First decide what signal you are optimizing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Do not treat chunking as separate from the data model
&lt;/h3&gt;

&lt;p&gt;Chunk size and segmentation should be designed together with the storage model you choose.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Do not turn everything into a graph
&lt;/h3&gt;

&lt;p&gt;Graph RAG is powerful, but not every problem needs a graph. Unnecessary modeling increases maintenance cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Add observability
&lt;/h3&gt;

&lt;p&gt;You cannot improve retrieval if you cannot inspect it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which chunk was retrieved&lt;/li&gt;
&lt;li&gt;which node was expanded&lt;/li&gt;
&lt;li&gt;which relation influenced the decision&lt;/li&gt;
&lt;li&gt;why this result was selected&lt;/li&gt;
&lt;/ul&gt;
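
&lt;p&gt;A lightweight way to capture these four questions is one structured trace record per query, written as a JSON log line. The field names are illustrative, not a standard; map them onto whatever tracing or logging system you already use.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievalTrace:
    """One inspectable record per query (field names are illustrative)."""
    query: str
    retrieved_chunks: list = field(default_factory=list)   # which chunk was retrieved
    expanded_nodes: list = field(default_factory=list)     # which node was expanded
    used_relations: list = field(default_factory=list)     # which relation influenced the result
    selection_reasons: dict = field(default_factory=dict)  # why each result was selected
    timestamp: float = field(default_factory=time.time)

    def to_log_line(self):
        """Serialize the whole trace as a single JSON log line."""
        return json.dumps(asdict(self), sort_keys=True)


trace = RetrievalTrace(query="Which components are related to incident-42?")
trace.retrieved_chunks.append("postmortem-42#2")
trace.expanded_nodes.append("incident-42")
trace.used_relations.append(("incident-42", "affects", "payment-api"))
trace.selection_reasons["postmortem-42#2"] = "top vector hit; links to incident-42"
print(trace.to_log_line())
```
&lt;/code&gt;&lt;/pre&gt;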

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Vector RAG and Graph RAG are not really competitors. They are tools for different constraints.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector RAG: fast, simple, semantic-first&lt;/li&gt;
&lt;li&gt;Graph RAG: structure, relationships, and traceability&lt;/li&gt;
&lt;li&gt;Hybrid search: often the most balanced production choice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When choosing an architecture, start with the question type, explainability needs, and maintenance cost before you choose the data model.&lt;/p&gt;

&lt;p&gt;The right approach is not the most complex one. It is the one that fits the workload.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Context Engineering: Building More Reliable LLM Systems in Production</title>
      <dc:creator>Recep Çiftçi</dc:creator>
      <pubDate>Wed, 20 May 2026 23:24:04 +0000</pubDate>
      <link>https://forem.com/recep_ciftci/context-engineering-building-more-reliable-llm-systems-in-production-m3f</link>
      <guid>https://forem.com/recep_ciftci/context-engineering-building-more-reliable-llm-systems-in-production-m3f</guid>
      <description>&lt;h1&gt;
  
  
  Context Engineering: Building More Reliable LLM Systems in Production
&lt;/h1&gt;

&lt;p&gt;In LLM-based systems, performance is often driven less by model size and more by &lt;strong&gt;what context&lt;/strong&gt; is provided, &lt;strong&gt;in what order&lt;/strong&gt;, and &lt;strong&gt;under which constraints&lt;/strong&gt;. That is why many teams now talk about &lt;strong&gt;context engineering&lt;/strong&gt; instead of prompt engineering alone.&lt;/p&gt;

&lt;p&gt;In short, context engineering is the discipline of turning user intent, tool output, system instructions, conversation history, knowledge base content, and business rules into a context package that the model can use effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it matters
&lt;/h2&gt;

&lt;p&gt;Production LLM systems usually fail in familiar ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model seems to know the answer but drifts because of the wrong context.&lt;/li&gt;
&lt;li&gt;Long chat history buries important facts.&lt;/li&gt;
&lt;li&gt;RAG retrieves relevant documents, but ranking and truncation are weak.&lt;/li&gt;
&lt;li&gt;Tool calls exist, but the output format is unstable.&lt;/li&gt;
&lt;li&gt;The same request produces different results across sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common issue is not the model’s “intelligence.” It is &lt;strong&gt;context quality&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is context engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jd2kq7aftxe0z7mcaib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jd2kq7aftxe0z7mcaib.png" alt="Context Engineering Layers Diagram" width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Context engineering is not just writing a prompt. It usually means designing several layers together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;System instructions&lt;/strong&gt;: role, boundaries, priorities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task definition&lt;/strong&gt;: what the user wants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieved knowledge&lt;/strong&gt;: RAG, databases, tool outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation history&lt;/strong&gt;: only the necessary summaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output schema&lt;/strong&gt;: JSON, Markdown, tables, or another format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety and compliance rules&lt;/strong&gt;: forbidden content, data leakage, permission boundaries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key idea is simple: everything the model sees is context, but not everything available as context should be passed to the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical lessons from production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. More context is not always better
&lt;/h3&gt;

&lt;p&gt;Filling a longer context window looks like giving the model more information, but in practice it can add distraction and raise cost. Models often struggle when too many irrelevant documents compete for attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select information by priority.&lt;/li&gt;
&lt;li&gt;Remove duplication.&lt;/li&gt;
&lt;li&gt;Use summaries plus supporting evidence.&lt;/li&gt;
&lt;/ul&gt;
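
&lt;p&gt;The three practices can be combined in a small selection step. The sketch makes three assumptions. Each candidate already has a numeric priority, for example from retrieval scores or business rules. Tokens are approximated by word counts. Two items count as duplicates when they match after lowercasing and whitespace normalization. A real system would use the model's tokenizer and possibly near-duplicate detection.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def select_context(items, budget_tokens):
    """Pick items by priority under a token budget, skipping duplicates.

    items: dicts with "text" and "priority" (higher means more important).
    Tokens are approximated by words; a real system would use the model's tokenizer.
    """
    selected, seen, used = [], set(), 0
    for item in sorted(items, key=lambda it: it["priority"], reverse=True):
        key = " ".join(item["text"].lower().split())  # normalize case and whitespace
        cost = len(item["text"].split())
        if key in seen or used + cost > budget_tokens:
            continue  # duplicate, or does not fit in the remaining budget
        seen.add(key)
        selected.append(item["text"])
        used += cost
    return selected


items = [
    {"text": "Summary: refunds are issued within 14 days.", "priority": 3},
    {"text": "summary:  refunds are issued within 14 days.", "priority": 2},
    {"text": "Evidence: policy section 4.2 sets the 14-day limit.", "priority": 2},
    {"text": " ".join(["appendix"] * 500), "priority": 1},
]
print(select_context(items, budget_tokens=20))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An item that does not fit is skipped rather than ending the loop. A smaller, lower-priority piece of evidence can therefore still use the remaining budget.&lt;/p&gt;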

&lt;h3&gt;
  
  
  2. Separate context into layers
&lt;/h3&gt;

&lt;p&gt;Instead of stuffing every instruction into one prompt, separate the context into layers. This usually produces more stable behavior.&lt;/p&gt;

&lt;p&gt;A useful structure is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System level: behavior rules&lt;/li&gt;
&lt;li&gt;Application level: workflow logic&lt;/li&gt;
&lt;li&gt;Request level: user problem&lt;/li&gt;
&lt;li&gt;Data level: documents and tool outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation also makes failures easier to debug.&lt;/p&gt;
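
&lt;p&gt;One way to keep the layers separate in code is to store them as distinct fields and merge them only at render time. This is a sketch. The class name, the section labels, and the choice to place the data layer before the request are assumptions, not a required convention.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from dataclasses import dataclass, field


@dataclass
class ContextPackage:
    """Keeps the four layers separate until the final render (labels are illustrative)."""
    system: str        # system level: behavior rules
    application: str   # application level: workflow logic
    request: str       # request level: the user's problem
    data: list = field(default_factory=list)  # data level: documents and tool outputs

    def render(self):
        """Merge the layers into one prompt, each under its own label."""
        parts = [
            ("SYSTEM", self.system),
            ("WORKFLOW", self.application),
            ("DATA", "\n\n".join(self.data) or "(none)"),
            ("REQUEST", self.request),
        ]
        return "\n\n".join(f"### {label}\n{text}" for label, text in parts)


package = ContextPackage(
    system="Answer in English. Never reveal internal IDs.",
    application="Step 2 of 3: draft a reply for a support agent to review.",
    request="Why was my refund delayed?",
    data=["Ticket note: refund approved, bank transfer pending."],
)
print(package.render())
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because each layer is a separate field, you can debug a failing request by logging or swapping one layer at a time. There is no need to diff a single monolithic prompt string.&lt;/p&gt;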

&lt;h3&gt;
  
  
  3. Source selection matters more than prompt wording
&lt;/h3&gt;

&lt;p&gt;In RAG systems, the main issue is often not how you write the prompt, but &lt;strong&gt;which chunks you retrieve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Questions to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this document actually relevant?&lt;/li&gt;
&lt;li&gt;Is the chunk size appropriate?&lt;/li&gt;
&lt;li&gt;Is ranking semantic or just lexical?&lt;/li&gt;
&lt;li&gt;Is stale information outranking recent information?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many production issues begin at retrieval time.&lt;/p&gt;
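
&lt;p&gt;For the stale-information question, one common mitigation is to blend the relevance score with a freshness signal. The following sketch uses exponential decay with an assumed 180-day half-life and a 30% freshness weight. Both numbers are placeholders; the right values depend on how fast your content goes out of date.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from datetime import date


def recency_weighted(results, today, half_life_days=180.0, weight=0.3):
    """Blend relevance with exponential recency decay, then re-sort.

    results: (doc_id, relevance, last_updated) tuples, relevance in [0, 1].
    half_life_days and weight are placeholder tuning knobs, not recommendations.
    """
    rescored = []
    for doc_id, relevance, updated in results:
        age_days = (today - updated).days
        freshness = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
        rescored.append((doc_id, (1 - weight) * relevance + weight * freshness))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


results = [
    ("pricing-2023", 0.82, date(2023, 1, 10)),  # slightly more relevant, but stale
    ("pricing-2026", 0.78, date(2026, 4, 1)),
]
print(recency_weighted(results, today=date(2026, 5, 20)))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these example values, the 2023 document loses its small relevance lead to the document updated last month.&lt;/p&gt;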

&lt;h3&gt;
  
  
  4. Lock down the output format early
&lt;/h3&gt;

&lt;p&gt;Free-form text is flexible for humans, but brittle for machines. In production, prefer structured outputs whenever possible.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON schema&lt;/li&gt;
&lt;li&gt;Markdown heading hierarchy&lt;/li&gt;
&lt;li&gt;Fixed field lists&lt;/li&gt;
&lt;li&gt;Stable error codes for failure cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces parsing failures later in the pipeline.&lt;/p&gt;
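
&lt;p&gt;Here is a minimal sketch of validating structured output, assuming the model was asked for a JSON object with three fields. Every failure maps to a stable error code, so callers can branch on the code instead of parsing error messages. The field contract and the codes are hypothetical. A dedicated JSON Schema validator is a natural next step for nested structures.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import json

# Assumed output contract: field name -> accepted Python type(s) after json.loads.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": (int, float)}


def parse_model_output(raw):
    """Validate model output; return (data, None) or (None, stable_error_code)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, "E_NOT_JSON"
    if not isinstance(data, dict):
        return None, "E_NOT_OBJECT"
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            return None, "E_MISSING_FIELD"
        if not isinstance(data[name], expected_type):
            return None, "E_BAD_TYPE"
    return data, None


print(parse_model_output('{"answer": "14 days", "sources": ["doc-7"], "confidence": 0.9}'))
print(parse_model_output("Sure! Refunds take 14 days."))
```
&lt;/code&gt;&lt;/pre&gt;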

&lt;h3&gt;
  
  
  5. Long sessions break without a summarization strategy
&lt;/h3&gt;

&lt;p&gt;As conversation history grows, the model will eventually miss important details. The answer is not to carry everything forward, but to maintain a &lt;strong&gt;good state summary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A good summary preserves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user’s goal&lt;/li&gt;
&lt;li&gt;Decisions already made&lt;/li&gt;
&lt;li&gt;Open questions&lt;/li&gt;
&lt;li&gt;Important constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A bad summary only shortens the chat and loses meaning.&lt;/p&gt;
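
&lt;p&gt;One way to make those four items explicit is to keep them in a small state object. The object is updated each turn and rendered in place of the raw transcript. This is an illustrative sketch. It leaves open how the state gets updated, whether by application logic or by a separate summarization call to the model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Running state summary carried forward instead of the full chat history."""
    goal: str = ""
    decisions: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)
    constraints: list = field(default_factory=list)

    def record_decision(self, decision, resolves=None):
        """Store a decision and close the open question it answers, if any."""
        self.decisions.append(decision)
        if resolves in self.open_questions:
            self.open_questions.remove(resolves)

    def render(self):
        """Compact text block for the conversation-history layer of the prompt."""
        def bullets(items):
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        return (
            f"Goal: {self.goal}\n"
            f"Decisions:\n{bullets(self.decisions)}\n"
            f"Open questions:\n{bullets(self.open_questions)}\n"
            f"Constraints:\n{bullets(self.constraints)}"
        )


state = SessionState(goal="Migrate billing to the new API",
                     constraints=["No downtime during business hours"])
state.open_questions.append("Which API version?")
state.record_decision("Use API v2", resolves="Which API version?")
print(state.render())
```
&lt;/code&gt;&lt;/pre&gt;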

&lt;h2&gt;
  
  
  A simple production checklist
&lt;/h2&gt;

&lt;p&gt;When working on context engineering, it helps to check the following regularly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the task clear in one sentence?&lt;/li&gt;
&lt;li&gt;Do system instructions conflict with user intent?&lt;/li&gt;
&lt;li&gt;Does every added document have a reason to exist?&lt;/li&gt;
&lt;li&gt;Is the token budget reserved for the most important information?&lt;/li&gt;
&lt;li&gt;Can the output format be validated?&lt;/li&gt;
&lt;li&gt;Is old context hurting new decisions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This checklist measures system quality more than prompt quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple mental model
&lt;/h2&gt;

&lt;p&gt;You can think of context engineering as this equation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right information + right timing + right format + right boundaries = more reliable output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model’s power shows up through how well you manage the context around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to pay extra attention
&lt;/h2&gt;

&lt;p&gt;Context engineering becomes even more important in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step tasks&lt;/li&gt;
&lt;li&gt;Regulated or compliance-heavy workflows&lt;/li&gt;
&lt;li&gt;Systems using internal or sensitive data&lt;/li&gt;
&lt;li&gt;Tool-using agents&lt;/li&gt;
&lt;li&gt;Long-lived sessions&lt;/li&gt;
&lt;li&gt;Multilingual products&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, small context errors can become large product failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Context engineering is the practical discipline that makes LLM products more deterministic, traceable, and maintainable. Good prompting still matters, but in production the real difference often comes from selecting, organizing, and constraining the context.&lt;/p&gt;

&lt;p&gt;If your LLM application is less stable than expected, inspect the context before you blame the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context engineering is broader than prompt writing.&lt;/li&gt;
&lt;li&gt;Better selected context matters more than more context.&lt;/li&gt;
&lt;li&gt;Retrieval, summarization, and output schemas are critical in production.&lt;/li&gt;
&lt;li&gt;Stable systems need layered design and verifiable formats.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Originally published on &lt;a href="https://recep-ciftci.prep-test.com/en/blog/context-engineering-practical-production-lessons" rel="noopener noreferrer"&gt;Recep Ciftci's portfolio&lt;/a&gt;. I write about production AI systems, LLM, and full-stack architecture.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>contextengineering</category>
      <category>prompengineering</category>
      <category>productionai</category>
    </item>
    <item>
      <title>Building Production RAG Pipelines: Practical Lessons</title>
      <dc:creator>Recep Çiftçi</dc:creator>
      <pubDate>Wed, 20 May 2026 21:23:58 +0000</pubDate>
      <link>https://forem.com/recep_ciftci/building-production-rag-pipelines-practical-lessons-3pem</link>
      <guid>https://forem.com/recep_ciftci/building-production-rag-pipelines-practical-lessons-3pem</guid>
      <description>&lt;h1&gt;
  
  
  Building Production RAG Pipelines: Practical Lessons
&lt;/h1&gt;

&lt;p&gt;A RAG pipeline can make LLM applications more current, more traceable, and more controllable when it is designed well. When it is not, it becomes another layer of complexity. In production, the real difference comes from retrieval quality, latency budget, evaluation discipline, and operational visibility—not from demo performance alone.&lt;/p&gt;

&lt;p&gt;In this post, I’ll summarize the practical decisions and lessons that matter when you build a production-oriented RAG pipeline for AI engineering use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RAG solves, and what it does not
&lt;/h2&gt;

&lt;p&gt;RAG adds external knowledge to the answer generation process without retraining the model. That makes it useful for changing documentation, product knowledge, internal knowledge bases, and support workflows.&lt;/p&gt;

&lt;p&gt;But RAG is not a replacement for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;poor information architecture&lt;/li&gt;
&lt;li&gt;weak data quality processes&lt;/li&gt;
&lt;li&gt;unclear product scope&lt;/li&gt;
&lt;li&gt;fundamental model limitations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, RAG is not an automatic accuracy engine. It still needs a strong information retrieval system and a disciplined evaluation framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical production flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b0m8yujunujbe60mn0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b0m8yujunujbe60mn0j.png" alt="RAG Flow Diagram" width="799" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A basic production RAG pipeline usually includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: Collect documents, logs, or data sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Split content into retrieval-friendly pieces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: Convert chunks into vector representations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt;: Build vector and metadata indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: Fetch the most relevant chunks for the query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt;: Reorder initial results with a stronger ranker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt assembly&lt;/strong&gt;: Add context to the prompt in a safe, bounded way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;: Produce the response with the LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing&lt;/strong&gt;: Add citations, filters, and policy checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Collect traces, metrics, and feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A common mistake is focusing almost entirely on generation. In production, retrieval often drives the final quality more than the model itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunking decisions directly affect model quality
&lt;/h2&gt;

&lt;p&gt;Chunking looks mechanical at first, but in production it is a critical design choice. If chunks are too small, context gets fragmented. If they are too large, retrieval precision drops.&lt;/p&gt;

&lt;p&gt;Useful practical rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preserve headings and subheadings&lt;/li&gt;
&lt;li&gt;avoid breaking semantic units&lt;/li&gt;
&lt;li&gt;treat tables, code, and lists carefully&lt;/li&gt;
&lt;li&gt;tune chunk size by data type&lt;/li&gt;
&lt;li&gt;use overlap, but avoid excessive repetition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splitting a document by page is often worse than splitting it into meaningful sub-sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embeddings alone are not enough for good retrieval
&lt;/h2&gt;

&lt;p&gt;The embedding model matters, but it is not sufficient by itself. In production, retrieval quality usually depends on a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dense retrieval&lt;/li&gt;
&lt;li&gt;lexical retrieval&lt;/li&gt;
&lt;li&gt;hybrid search&lt;/li&gt;
&lt;li&gt;metadata filters&lt;/li&gt;
&lt;li&gt;reranking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Metadata filters are especially valuable in enterprise settings. Fields such as date, language, product version, access level, and source type can significantly narrow the search space.&lt;/p&gt;
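
&lt;p&gt;A common way to combine lexical and dense results is reciprocal rank fusion (RRF). RRF merges ranked lists using only rank positions, so the two retrievers' incompatible score scales never have to be compared. The sketch assumes both rankings were already computed. It applies metadata filters as a post-filter for simplicity; in production they are usually pushed into each search query. It uses k=60, a commonly used default.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def hybrid_search(dense_ranking, lexical_ranking, metadata, filters, top_k=3, k=60):
    """Fuse dense and lexical rankings with reciprocal rank fusion (RRF),
    keeping only documents whose metadata matches every filter."""
    allowed = {
        doc_id for doc_id, meta in metadata.items()
        if all(meta.get(name) == value for name, value in filters.items())
    }
    scores = {}
    for ranking in (dense_ranking, lexical_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in allowed:
                # Each list contributes 1 / (k + rank); k dampens the weight of top ranks.
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


metadata = {
    "d1": {"lang": "en", "version": "2.0"},
    "d2": {"lang": "de", "version": "2.0"},
    "d3": {"lang": "en", "version": "2.0"},
    "d4": {"lang": "en", "version": "1.0"},
}
results = hybrid_search(
    dense_ranking=["d1", "d2", "d3"],
    lexical_ranking=["d3", "d1", "d4"],
    metadata=metadata,
    filters={"lang": "en", "version": "2.0"},
)
print(results)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here &lt;code&gt;d2&lt;/code&gt; is dropped by the language filter and &lt;code&gt;d4&lt;/code&gt; by the version filter. &lt;code&gt;d1&lt;/code&gt; wins because it ranks well in both lists.&lt;/p&gt;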

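&lt;p&gt;Query rewriting is another important technique. User queries are often short, incomplete, or conversational; a follow-up such as "what about version 2?" means little without the earlier turns. Rewriting such a query into a standalone one can materially improve retrieval quality.&lt;/p&gt;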

&lt;h2&gt;
  
  
  Reranking is often a low-cost, high-impact upgrade
&lt;/h2&gt;

&lt;p&gt;Initial retrieval results are often relevant enough, but poorly ordered. A reranker can improve the quality of the top-k context significantly.&lt;/p&gt;

&lt;p&gt;This should be viewed as a production optimization, not a luxury, because it can deliver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better top-k context&lt;/li&gt;
&lt;li&gt;less noise&lt;/li&gt;
&lt;li&gt;lower hallucination risk&lt;/li&gt;
&lt;li&gt;more consistent answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reranking adds cost and latency, but for many applications the tradeoff is worth it.&lt;/p&gt;
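
&lt;p&gt;Structurally, reranking is a second pass over a small candidate set with a more expensive scorer. In this sketch &lt;code&gt;score_pair&lt;/code&gt; is a hypothetical callable standing in for a cross-encoder or similar model. The toy &lt;code&gt;overlap_score&lt;/code&gt; exists only to make the example runnable.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def rerank(query, candidates, score_pair, top_k=3):
    """Re-order first-stage candidates with a stronger, usually slower, scorer.

    score_pair(query, text) is a hypothetical stand-in for a cross-encoder.
    """
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in scored[:top_k]]


def overlap_score(query, text):
    """Toy scorer for the example only: share of query terms found in the text."""
    query_terms = set(query.lower().split())
    return len(query_terms.intersection(text.lower().split())) / max(len(query_terms), 1)


first_stage = [  # as returned by the initial retrieval: relevant but poorly ordered
    "Shipping times vary by region",
    "To reset your password open settings",
    "Password reset link expires after 24 hours",
]
print(rerank("reset password link", first_stage, overlap_score, top_k=2))
```
&lt;/code&gt;&lt;/pre&gt;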

&lt;h2&gt;
  
  
  Prompt design is more than writing instructions
&lt;/h2&gt;

&lt;p&gt;In a RAG system, the prompt determines how the model should use the retrieved context. A good prompt should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;present context within clear boundaries&lt;/li&gt;
&lt;li&gt;discourage unsupported claims&lt;/li&gt;
&lt;li&gt;define the response format clearly&lt;/li&gt;
&lt;li&gt;specify behavior when information is missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, it is important to tell the model to state explicitly when the answer is not in the provided context. Otherwise, the model may fill in gaps.&lt;/p&gt;

&lt;p&gt;Also, stuffing too many documents into the prompt is usually a bad idea. More context does not automatically mean better results. Unnecessary context distracts the model and increases token cost.&lt;/p&gt;
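
&lt;p&gt;Here is a minimal prompt-assembly sketch that applies all four rules. Explicit markers bound the context, which is capped at a fixed number of chunks. Each chunk gets a citable label, and the prompt spells out the missing-information behavior. The exact wording and the four-chunk cap are assumptions to tune, not a tested template.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def build_rag_prompt(question, chunks, max_chunks=4):
    """Assemble a bounded RAG prompt with labeled sources and a missing-answer rule."""
    context = "\n\n".join(
        f"[source {i}] {text}" for i, text in enumerate(chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context between the markers.\n"
        "Cite sources as [source N]. If the context does not contain the answer, "
        'reply exactly: "Not found in the provided context."\n\n'
        "=== CONTEXT START ===\n"
        f"{context}\n"
        "=== CONTEXT END ===\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_rag_prompt("When are refunds issued?",
                       ["Refunds are issued within 14 days of a return request."]))
```
&lt;/code&gt;&lt;/pre&gt;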

&lt;h2&gt;
  
  
  Shipping without evaluation is risky
&lt;/h2&gt;

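&lt;p&gt;In RAG systems, offline evaluation results and real user behavior can diverge, so evaluation should not stop at launch. Retrieval and generation should also be evaluated separately, so that a bad answer can be traced to the stage that caused it.&lt;/p&gt;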

&lt;p&gt;Useful metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieval hit rate&lt;/li&gt;
&lt;li&gt;context precision&lt;/li&gt;
&lt;li&gt;context recall&lt;/li&gt;
&lt;li&gt;answer faithfulness&lt;/li&gt;
&lt;li&gt;answer relevance&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;token usage&lt;/li&gt;
&lt;li&gt;fallback rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When gold labels are limited, human review and sample-based analysis become very valuable. Re-running the same query set regularly also helps catch regressions.&lt;/p&gt;
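
&lt;p&gt;The retrieval-side metrics are straightforward to compute once you have a labeled query set. This sketch uses simple set-based definitions. Precision is the share of retrieved chunks that are relevant, and recall is the share of relevant chunks that were retrieved. Evaluation frameworks sometimes define context precision and recall differently, for example with rank weighting, so treat these as a baseline. Generation-side metrics such as faithfulness usually need human review or an LLM judge and are not shown.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
def retrieval_metrics(retrieved, relevant):
    """Per-query metrics: retrieved is a ranked list of chunk ids, relevant a labeled set."""
    hits = [c for c in retrieved if c in relevant]
    return {
        "hit": bool(hits),  # at least one relevant chunk was retrieved
        "context_precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "context_recall": len(set(hits)) / len(relevant) if relevant else 0.0,
    }


def evaluate(dataset, retrieve):
    """Average metrics over a fixed query set; rerun it on every change to catch regressions."""
    rows = [retrieval_metrics(retrieve(query), set(relevant)) for query, relevant in dataset]
    n = len(rows) or 1
    return {
        "hit_rate": sum(r["hit"] for r in rows) / n,
        "context_precision": sum(r["context_precision"] for r in rows) / n,
        "context_recall": sum(r["context_recall"] for r in rows) / n,
    }


fake_results = {"q1": ["c1", "c2", "c3"], "q2": ["c9", "c8"]}  # hypothetical retriever output
labeled = [("q1", ["c1", "c4"]), ("q2", ["c7"])]
print(evaluate(labeled, fake_results.get))
```
&lt;/code&gt;&lt;/pre&gt;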

&lt;h2&gt;
  
  
  Observability makes production debugging possible
&lt;/h2&gt;

&lt;p&gt;Logging only the final answer is not enough. In a RAG pipeline, you should track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user query&lt;/li&gt;
&lt;li&gt;normalized query&lt;/li&gt;
&lt;li&gt;retrieved chunks&lt;/li&gt;
&lt;li&gt;rerank scores&lt;/li&gt;
&lt;li&gt;prompt length&lt;/li&gt;
&lt;li&gt;model response&lt;/li&gt;
&lt;li&gt;source references&lt;/li&gt;
&lt;li&gt;errors and timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these signals, it is hard to tell where a failure happened. Did retrieval degrade? Did chunking get worse? Did the model behave inconsistently? Traces make that difference visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency budget should be designed early
&lt;/h2&gt;

&lt;p&gt;One of the most overlooked aspects of production RAG is latency. Retrieval, reranking, and generation all affect the user experience.&lt;/p&gt;

&lt;p&gt;Ask these questions early:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the target response time?&lt;/li&gt;
&lt;li&gt;How long can retrieval take?&lt;/li&gt;
&lt;li&gt;Should reranking run synchronously or asynchronously?&lt;/li&gt;
&lt;li&gt;Which layers should be cached?&lt;/li&gt;
&lt;li&gt;Is there a faster fallback for simple queries?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some systems, a simpler and faster pipeline is better than a more elaborate one. A technically richer architecture is not automatically a better product.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and data leakage must be taken seriously
&lt;/h2&gt;

&lt;p&gt;RAG can make it easier to expose sensitive data to the model. Access control should therefore be enforced at the retrieval layer, not only in the prompt.&lt;/p&gt;

&lt;p&gt;Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unauthorized document access&lt;/li&gt;
&lt;li&gt;prompt injection&lt;/li&gt;
&lt;li&gt;malicious instructions inside source content&lt;/li&gt;
&lt;li&gt;PII and secret leakage&lt;/li&gt;
&lt;li&gt;tenant isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In multi-tenant systems especially, retrieved results should be filtered strictly according to user permissions.&lt;/p&gt;
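
&lt;p&gt;Here is a minimal sketch of enforcing permissions in the retrieval layer. It assumes every chunk carries &lt;code&gt;tenant_id&lt;/code&gt; and &lt;code&gt;allowed_groups&lt;/code&gt; metadata copied from its source document at ingestion time. The filter runs before ranking and prompt assembly, so an unauthorized chunk never reaches the model, whatever the prompt says.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    tenant_id: str
    groups: frozenset


def authorized_only(chunks, user):
    """Drop chunks the user may not read, before ranking or prompt assembly."""
    return [
        chunk for chunk in chunks
        if chunk["tenant_id"] == user.tenant_id                 # tenant isolation
        and not user.groups.isdisjoint(chunk["allowed_groups"])  # document-level access
    ]


candidates = [
    {"id": "a", "tenant_id": "acme", "allowed_groups": {"support"}},
    {"id": "b", "tenant_id": "acme", "allowed_groups": {"finance"}},
    {"id": "c", "tenant_id": "globex", "allowed_groups": {"support"}},
]
agent = User(tenant_id="acme", groups=frozenset({"support"}))
print([chunk["id"] for chunk in authorized_only(candidates, agent)])
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the vector store supports metadata filters, it is usually better to push the same condition into the search query itself. Unauthorized chunks are then never returned at all.&lt;/p&gt;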

&lt;h2&gt;
  
  
  Simplicity is often the best starting point
&lt;/h2&gt;

&lt;p&gt;A good production starting point is often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clearly defined data scope&lt;/li&gt;
&lt;li&gt;a simple chunking strategy&lt;/li&gt;
&lt;li&gt;hybrid retrieval&lt;/li&gt;
&lt;li&gt;lightweight reranking&lt;/li&gt;
&lt;li&gt;a clear prompt template&lt;/li&gt;
&lt;li&gt;a solid evaluation set&lt;/li&gt;
&lt;li&gt;detailed logging and tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than adding a separate model for every problem, it is usually more sustainable to measure and improve the existing pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a RAG pipeline is not just about connecting a vector database. A production-ready system requires a good balance between data quality, retrieval design, prompt control, security, evaluation, and operational visibility.&lt;/p&gt;

&lt;p&gt;The most important practical lesson is this: prove that retrieval works well before optimizing generation. In many cases, the root cause of a RAG failure is not the model—it is the wrong context being selected.&lt;/p&gt;

&lt;p&gt;If useful, I can follow this with a concrete production RAG architecture, technology choices, and an evaluation checklist.&lt;/p&gt;




&lt;p&gt;Originally published on &lt;a href="https://recep-ciftci.prep-test.com/en/blog/building-production-rag-pipelines" rel="noopener noreferrer"&gt;Recep Ciftci's portfolio&lt;/a&gt;. I write about production AI systems, RAG, and full-stack architecture.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
