<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hamid Omarov</title>
    <description>The latest articles on Forem by Hamid Omarov (@hamidomarov).</description>
    <link>https://forem.com/hamidomarov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3439209%2F23f100f0-f5b4-4e4c-964e-9dcce5e6f4ad.jpeg</url>
      <title>Forem: Hamid Omarov</title>
      <link>https://forem.com/hamidomarov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hamidomarov"/>
    <language>en</language>
    <item>
      <title>Building Production RAG in 2025: Lessons from 50+ Deployments</title>
      <dc:creator>Hamid Omarov</dc:creator>
      <pubDate>Sat, 16 Aug 2025 16:11:05 +0000</pubDate>
      <link>https://forem.com/hamidomarov/building-production-rag-in-2024-lessons-from-50-deployments-5fh9</link>
      <guid>https://forem.com/hamidomarov/building-production-rag-in-2024-lessons-from-50-deployments-5fh9</guid>
      <description>&lt;h1&gt;Building Production RAG in 2025: Lessons from 50+ Deployments&lt;/h1&gt;

&lt;p&gt;Retrieval Augmented Generation (RAG) has become one of the most practical ways to make large language models reliable. After building more than fifty RAG systems in production, I want to share what consistently works and what doesn’t.&lt;/p&gt;

&lt;h2&gt;The Stack That Actually Works&lt;/h2&gt;

&lt;h3&gt;Backend&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; over Flask. Async support makes a big difference once you scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt; over ChromaDB, at least for workloads under one million documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MiniLM&lt;/strong&gt; over Ada-002. The balance of cost and performance is hard to beat.&lt;/li&gt;
&lt;/ul&gt;
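&lt;p&gt;The async point is easy to demonstrate without any framework: most RAG latency is I/O (embedding APIs, vector stores), so concurrent requests can overlap instead of queuing. A stdlib-only toy sketch, where &lt;code&gt;fake_retrieve&lt;/code&gt; is an illustrative stand-in for a real I/O-bound call:&lt;/p&gt;

```python
import asyncio
import time

async def fake_retrieve(query: str) -> str:
    # Stand-in for an I/O-bound call (vector store lookup, embedding API).
    await asyncio.sleep(0.05)
    return f"result for {query}"

async def main():
    start = time.perf_counter()
    # Ten 50 ms calls run concurrently, finishing in roughly one
    # round-trip rather than ten sequential ones.
    results = await asyncio.gather(*(fake_retrieve(f"q{i}") for i in range(10)))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
```

&lt;p&gt;In an async framework like FastAPI, each request handler gets this overlap for free; a sync framework would hold a worker for the full duration of every call.&lt;/p&gt;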

&lt;h2&gt;Critical Optimizations&lt;/h2&gt;

&lt;h3&gt;1. L2 Normalization&lt;/h3&gt;

&lt;p&gt;Many teams ignore this, but it is a small tweak with a big impact on retrieval quality. Normalizing embeddings to unit length makes inner-product search equivalent to cosine similarity, so scores are directly comparable across vectors.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without normalization, vectors with larger magnitudes dominate inner-product scores, so retrieval surfaces matches that are not actually the semantically closest.&lt;/p&gt;
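&lt;p&gt;To see the equivalence concretely: once rows are unit length, a plain inner product (which is what an inner-product index such as FAISS's IndexFlatIP computes) is exactly cosine similarity. A minimal NumPy sketch with random toy vectors:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8)).astype("float32")

# L2-normalize each row to unit length.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Inner product of unit vectors against the first row...
inner = normed @ normed[0]

# ...matches cosine similarity computed on the raw vectors.
cosine = (embeddings @ embeddings[0]) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(embeddings[0])
)
```

&lt;p&gt;The two score arrays agree to floating-point precision, which is why normalizing once at ingestion time lets you use the cheaper inner-product index.&lt;/p&gt;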

&lt;h3&gt;2. Chunking Strategy&lt;/h3&gt;

&lt;p&gt;Do not overcomplicate this. For most documents, 500–800 tokens per chunk with a 100–200 token overlap works best. It balances recall and precision while keeping index sizes manageable.&lt;/p&gt;
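&lt;p&gt;A minimal sketch of this strategy. The helper name is illustrative, and &lt;code&gt;tokens&lt;/code&gt; is any pre-tokenized list; a production pipeline would count tokens with the embedding model's own tokenizer rather than, say, a whitespace split:&lt;/p&gt;

```python
def chunk_tokens(tokens, size=600, overlap=150):
    """Split a token sequence into overlapping chunks.

    Defaults sit in the 500-800 token / 100-200 overlap range
    that works well for most RAG ingestion jobs.
    """
    assert size > overlap, "chunk size must exceed overlap"
    step = size - overlap
    # Stride by (size - overlap) so consecutive chunks share `overlap` tokens.
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

&lt;p&gt;The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is what protects recall.&lt;/p&gt;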

&lt;h3&gt;3. Metadata-First Search&lt;/h3&gt;

&lt;p&gt;Filtering by metadata before doing a vector similarity search reduces noise and latency. For example, if you know the document type or date range, apply that filter first.&lt;/p&gt;
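&lt;p&gt;A toy illustration of the idea, with in-memory metadata and random unit vectors; in a real deployment the filter would be pushed down into the vector store itself:&lt;/p&gt;

```python
import numpy as np

# Toy corpus: metadata rows plus one unit-norm embedding per document.
docs = [
    {"id": 0, "type": "invoice", "year": 2025},
    {"id": 1, "type": "report", "year": 2024},
    {"id": 2, "type": "invoice", "year": 2024},
    {"id": 3, "type": "invoice", "year": 2025},
]
rng = np.random.default_rng(1)
vectors = rng.normal(size=(4, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query_vec, doc_type=None, year=None, k=2):
    # 1. Cheap metadata filter first shrinks the candidate set.
    ids = [d["id"] for d in docs
           if (doc_type is None or d["type"] == doc_type)
           and (year is None or d["year"] == year)]
    # 2. Vector similarity only over the survivors.
    scores = vectors[ids] @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [ids[i] for i in top]
```

&lt;p&gt;Because the similarity search runs over the filtered subset only, both the noise from out-of-scope documents and the latency of scoring them disappear.&lt;/p&gt;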

&lt;h3&gt;4. Keep It Observable&lt;/h3&gt;

&lt;p&gt;Production systems fail in subtle ways. Add metrics for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding generation errors&lt;/li&gt;
&lt;li&gt;Retrieval latency&lt;/li&gt;
&lt;li&gt;Ratio of retrieved docs to final answer tokens&lt;/li&gt;
&lt;/ul&gt;
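&lt;p&gt;A bare-bones sketch of such instrumentation using only the standard library; the metric names are illustrative, and a production system would export these to Prometheus or a similar backend rather than keep them in process memory:&lt;/p&gt;

```python
import time
from collections import Counter

metrics = Counter()
latencies = []  # per-call retrieval latency samples, in seconds

def observed(retrieve, query):
    """Wrap a retrieval call with error counting and latency tracking."""
    start = time.perf_counter()
    try:
        docs = retrieve(query)
    except Exception:
        # Count failures before re-raising so error rate is visible.
        metrics["retrieval_errors"] += 1
        raise
    latencies.append(time.perf_counter() - start)
    metrics["retrievals"] += 1
    metrics["docs_retrieved"] += len(docs)
    return docs
```

&lt;p&gt;With counts of retrievals, retrieved documents, and errors alongside latency samples, regressions in any stage show up as a metric shift rather than as silent answer-quality decay.&lt;/p&gt;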

&lt;p&gt;A RAG system is only as good as its weakest link, and observability helps you catch problems early.&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple beats clever.&lt;/strong&gt; Overly complex pipelines often fail silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate on your data, not benchmarks.&lt;/strong&gt; Many retrieval tricks look good in papers but add little value for your domain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy fast, optimize later.&lt;/strong&gt; The biggest risk is never shipping.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;Building RAG in 2025 is less about cutting-edge tricks and more about disciplined engineering. With the right stack and a few critical optimizations, you can deliver production-grade systems in days, not months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/HamidOmarov/RAG-Dashboard" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/HamidOmarov/RAG-Dashboard&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/hamidomarov/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/hamidomarov/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.upwork.com/freelancers/%7E01340982df23f6bc11" rel="noopener noreferrer"&gt;https://www.upwork.com/freelancers/~01340982df23f6bc11&lt;/a&gt;&lt;br&gt;
&lt;a href="https://hamidomarov.github.io/portfolio/" rel="noopener noreferrer"&gt;https://hamidomarov.github.io/portfolio/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>l2</category>
    </item>
  </channel>
</rss>
