<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Himanjan</title>
    <description>The latest articles on Forem by Himanjan (@himanjan).</description>
    <link>https://forem.com/himanjan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3250368%2Fd167c36b-bf00-4602-94ab-aaca2b58271f.JPG</url>
      <title>Forem: Himanjan</title>
      <link>https://forem.com/himanjan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/himanjan"/>
    <language>en</language>
    <item>
      <title>Claude Mythos In Preview</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:12:33 +0000</pubDate>
      <link>https://forem.com/himanjan/claude-mythos-in-preview-kfl</link>
      <guid>https://forem.com/himanjan/claude-mythos-in-preview-kfl</guid>
      <description>&lt;h1&gt;
  
  
  AI Just Found Bugs That Humans Missed for 27 Years — And That Changes Everything
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;How Anthropic's new model is rewriting the rules of cybersecurity — explained simply.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmacnez2e57ar56d88fm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmacnez2e57ar56d88fm9.png" alt="cybersecurity-ai" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On April 7, 2026, Anthropic quietly dropped one of the most important announcements in cybersecurity history. Their new model, &lt;strong&gt;Claude Mythos Preview&lt;/strong&gt;, found and exploited security flaws in &lt;em&gt;every major operating system&lt;/em&gt; and &lt;em&gt;every major web browser&lt;/em&gt; — many of which had been hiding in plain sight for decades.&lt;/p&gt;

&lt;p&gt;Let me break down what happened, why it matters, and what it means for all of us — no jargon, no hype, just the facts.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 First, What Are "Vulnerabilities"?
&lt;/h2&gt;

&lt;p&gt;Think of software like a house. A vulnerability is an unlocked window that nobody noticed. It's been there since the house was built, but because no one checked &lt;em&gt;that particular window&lt;/em&gt;, burglars never found it either.&lt;/p&gt;

&lt;p&gt;Now imagine an AI that can walk through &lt;strong&gt;every room of every house on the internet&lt;/strong&gt; and check &lt;strong&gt;every window, every door, every crack&lt;/strong&gt; — in hours, not years.&lt;/p&gt;

&lt;p&gt;That's what Mythos Preview does for software.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 What Did It Actually Find?
&lt;/h2&gt;

&lt;p&gt;Here are three real examples — all confirmed and now patched:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A 27-Year-Old Bug in OpenBSD
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is OpenBSD?&lt;/strong&gt; A super-secure operating system used to run firewalls and critical internet infrastructure. Its &lt;em&gt;entire reputation&lt;/em&gt; is built on security.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; When two computers talk over the internet using TCP (the basic protocol for web traffic), they send "acknowledgment" messages back and forth — "Hey, I got packets 1 through 10."&lt;/p&gt;

&lt;p&gt;OpenBSD had a flaw in how it tracked these acknowledgments. By sending a carefully crafted message with a &lt;em&gt;negative&lt;/em&gt; starting point, an attacker could trick the system into crashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt; Two small bugs combined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug 1:&lt;/strong&gt; The code checked if the &lt;em&gt;end&lt;/em&gt; of an acknowledgment was valid, but never checked the &lt;em&gt;start&lt;/em&gt;. Usually harmless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2:&lt;/strong&gt; Under a very specific condition, the code tried to write to a memory address that no longer existed (a "null pointer").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick was that reaching Bug 2 &lt;em&gt;should have been impossible&lt;/em&gt; — except that by exploiting Bug 1 with a value roughly 2 billion outside the expected range, an integer overflow fooled both safety checks simultaneously.&lt;/p&gt;
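&lt;p&gt;To make Bug 1 concrete, here is a toy Python sketch (my own simplification, not OpenBSD's actual code) of a range check that validates only the end of an acknowledged block, so a wildly negative start still slips through:&lt;/p&gt;

```python
# Toy model of Bug 1 (a simplification, not OpenBSD's code): the range
# check validates only the END of an acknowledged block, emulating C's
# signed 32-bit arithmetic.

def to_int32(n):
    # Wrap n the way a C int32_t would.
    n = n % (2 ** 32)
    return n - 2 ** 32 if n >= 2 ** 31 else n

def accept_block(start, length, window_end=10_000):
    end = to_int32(start + length)
    # Only the end is checked; the start is trusted (Bug 1).
    return end >= 0 and window_end >= end

# A sane acknowledgment passes:
assert accept_block(start=1_000, length=500)

# A start roughly 2 billion below zero also passes, because start + length
# lands back inside the window and the start itself is never inspected.
assert accept_block(start=-2_147_480_000, length=2_147_485_000)
```

&lt;p&gt;Each check looks fine in isolation; only the combination of the trusted start and the wraparound opens the path to Bug 2.&lt;/p&gt;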

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Anyone on the internet could remotely crash any OpenBSD machine. This bug had existed since &lt;strong&gt;1998&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Cost to find it: Under $50 for that specific run
   (within a $20,000 sweep of ~1,000 runs across the codebase)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. A 16-Year-Old Bug in FFmpeg
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is FFmpeg?&lt;/strong&gt; The video processing engine behind almost every app that plays or converts video. YouTube, VLC, Discord — they all rely on it. It's one of the &lt;em&gt;most tested&lt;/em&gt; pieces of software on Earth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; When decoding H.264 video (the standard format for most video), FFmpeg tracks which "slice" each chunk of pixels belongs to using a table of 16-bit numbers (max value: 65,535).&lt;/p&gt;

&lt;p&gt;The code uses the value &lt;code&gt;65535&lt;/code&gt; as a special marker meaning "nobody owns this pixel yet." But if an attacker creates a video with exactly &lt;strong&gt;65,536 slices&lt;/strong&gt;, then slice number 65,535 &lt;em&gt;collides with the marker&lt;/em&gt;. The decoder gets confused, thinks a nonexistent neighbor pixel is real, and writes data where it shouldn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Think of it like a hotel that uses room 9999 as
the code for "this room doesn't exist."

Now someone books exactly 10,000 rooms.
Guest 9999 checks in — and the system can't tell
the difference between a real guest and "doesn't exist."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why did nobody catch this?&lt;/strong&gt; Fuzzers (automated testing tools) had hit this code &lt;strong&gt;millions of times&lt;/strong&gt; with random inputs. But they never tried a video with exactly 65,536 slices — because no &lt;em&gt;real&lt;/em&gt; video would ever have that many. The AI understood the &lt;em&gt;logic&lt;/em&gt; of the code, not just random inputs.&lt;/p&gt;
&lt;/blockquote&gt;
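&lt;p&gt;A tiny Python sketch (a toy model, not FFmpeg's decoder) of the sentinel collision: once a legal slice id equals the marker, the "does this pixel have an owner?" test can no longer tell the two apart:&lt;/p&gt;

```python
# Toy version of a 16-bit slice table that reuses 0xFFFF as the
# "unowned" sentinel (illustration only, not FFmpeg's actual code).

UNOWNED = 0xFFFF  # 65535 doubles as both the marker and a legal slice id

def owner_is_real(slice_table, idx):
    # The decoder assumes the sentinel can never be a real owner.
    return slice_table[idx] != UNOWNED

table = [UNOWNED] * 4          # nobody owns these pixels yet
table[2] = 65535               # ...until slice 65,535 of 65,536 claims one

# The real owner is now indistinguishable from "nobody":
assert owner_is_real(table, 2) is False
```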




&lt;h3&gt;
  
  
  3. Full Remote Takeover of FreeBSD (CVE-2026-4747)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is FreeBSD?&lt;/strong&gt; Another widely used operating system, especially for servers and networking equipment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; FreeBSD's file-sharing service (NFS) had a function that copied data from an incoming network packet into a 128-byte buffer — but the packet could be up to 400 bytes. Classic buffer overflow.&lt;/p&gt;
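&lt;p&gt;Here is the size mismatch sketched in Python. This is an illustration only: where the C kernel buffer would overflow into adjacent memory, Python's bytearray simply grows. The 128/400 sizes come from the article; everything else is invented.&lt;/p&gt;

```python
BUF_SIZE = 128     # bytes the kernel buffer can hold
MAX_PACKET = 400   # bytes the protocol allows in the field

def copy_unchecked(packet):
    # Toy model of the flaw: the copy trusts the packet's own length.
    buf = bytearray(BUF_SIZE)
    buf[0:len(packet)] = packet          # len(packet) may exceed BUF_SIZE
    return len(buf)                      # the "buffer" silently grew

def copy_checked(packet):
    # The missing guard: clamp (or reject) before copying.
    n = min(len(packet), BUF_SIZE)
    buf = bytearray(BUF_SIZE)
    buf[0:n] = packet[:n]
    return len(buf)

oversized = bytes(MAX_PACKET)
assert copy_unchecked(oversized) == 400   # 272 bytes past where C would end
assert copy_checked(oversized) == 128
```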

&lt;p&gt;&lt;strong&gt;What Mythos Preview did with it:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What Happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Found the overflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The AI read the FreeBSD kernel source and spotted the mismatch between buffer size (128 bytes) and input limit (400 bytes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Noticed the missing protections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The buffer was declared as &lt;code&gt;int32_t[]&lt;/code&gt; instead of &lt;code&gt;char[]&lt;/code&gt;, so the compiler &lt;em&gt;didn't add a security canary&lt;/em&gt; — a guard value that normally detects overflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Figured out how to authenticate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;To reach the vulnerable code, you need a secret handle. The AI discovered you could get it by making one unauthenticated call that leaks the server's UUID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Built a 20-step attack chain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The actual exploit needed ~1,000 bytes but only had 200 bytes of space. So the AI split it into &lt;strong&gt;6 sequential network requests&lt;/strong&gt;, each building a piece of the attack in memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthenticated root access — complete control of the machine, from anywhere on the internet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This bug had been hiding in FreeBSD for &lt;strong&gt;17 years&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📈 The Numbers Are Staggering
&lt;/h2&gt;

&lt;p&gt;Here's the leap in capability compared to the previous best model (Opus 4.6):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Mythos Preview&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Firefox JS engine exploits (out of hundreds of attempts)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;181&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full control-flow hijack on patched targets&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CyberGym vulnerability benchmark&lt;/td&gt;
&lt;td&gt;66.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified (code tasks)&lt;/td&gt;
&lt;td&gt;80.8%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And here's the part that should make you sit up:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Non-security engineers&lt;/strong&gt; at Anthropic asked Mythos Preview to find remote code execution bugs before going to bed. They woke up to &lt;strong&gt;complete, working exploits&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🤝 Project Glasswing — The Industry Response
&lt;/h2&gt;

&lt;p&gt;Anthropic isn't releasing this model to the public. Instead, they launched &lt;strong&gt;Project Glasswing&lt;/strong&gt; — named after a butterfly with transparent wings — bringing together 12 major partners:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🏢 AWS          🍎 Apple         📡 Broadcom
🌐 Cisco        🛡️ CrowdStrike   🔍 Google
🏦 JPMorganChase 🐧 Linux Foundation
💻 Microsoft    🎮 NVIDIA        🔒 Palo Alto Networks
🤖 Anthropic (leading the effort)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Plus 40+ additional organizations&lt;/strong&gt; working on critical infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The investment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💰 &lt;strong&gt;$100M&lt;/strong&gt; in model usage credits for partners&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;$2.5M&lt;/strong&gt; to open-source security foundations (OpenSSF, Alpha-Omega)&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;$1.5M&lt;/strong&gt; to the Apache Software Foundation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is simple: &lt;strong&gt;let the defenders find the bugs before the attackers do&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 What About Chaining Vulnerabilities?
&lt;/h2&gt;

&lt;p&gt;This is where it gets really impressive. Many individual bugs aren't dangerous on their own. It's like having a key that opens one door — but the door leads to another locked door.&lt;/p&gt;

&lt;p&gt;Mythos Preview can &lt;strong&gt;chain vulnerabilities together&lt;/strong&gt; automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example from the Linux kernel:

  Bug 1 → Bypass address randomization (figure out WHERE things are in memory)
       ↓
  Bug 2 → Read contents of a protected data structure
       ↓
  Bug 3 → Write to a previously-freed piece of memory
       ↓
  Bug 4 → Place a malicious object exactly where the write lands
       ↓
  🔓 Full root access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It did this across Linux, web browsers (building JIT heap sprays and sandbox escapes), and even &lt;strong&gt;closed-source software&lt;/strong&gt; by reverse-engineering the binaries first.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Why This Is Different From Everything Before
&lt;/h2&gt;

&lt;p&gt;Traditional security testing works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Old way (Fuzzing):
  → Generate millions of random inputs
  → Feed them to the program
  → See if anything crashes
  → Hope you got lucky

AI way (Mythos Preview):
  → Read and UNDERSTAND the code
  → Hypothesize where bugs might be
  → Test specific theories
  → Chain bugs together into real attacks
  → Produce a working exploit with a full report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The FFmpeg example is the perfect illustration. Fuzzers hit that code millions of times. They never thought to try exactly 65,536 slices because they don't &lt;em&gt;understand&lt;/em&gt; what the code does. The AI did.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ The Scary Part
&lt;/h2&gt;

&lt;p&gt;These capabilities &lt;strong&gt;weren't intentionally trained&lt;/strong&gt;. They emerged naturally from making the model better at coding and reasoning. The same improvements that help it &lt;em&gt;fix&lt;/em&gt; bugs also help it &lt;em&gt;exploit&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;And here's the timeline that should concern everyone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A few months ago  → Models couldn't find non-trivial vulnerabilities
A few weeks ago   → Models could find bugs but rarely exploit them
Today             → Mythos Preview finds AND exploits zero-days autonomously
Tomorrow          → ???
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Anthropic has identified &lt;strong&gt;thousands&lt;/strong&gt; of additional high-severity vulnerabilities that are still going through responsible disclosure. Only about &lt;strong&gt;1%&lt;/strong&gt; of what they've found has been patched so far.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🛡️ What Should Defenders Do Right Now?
&lt;/h2&gt;

&lt;p&gt;Anthropic's advice, and I think it's sound:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Start Using AI for Security Today
&lt;/h3&gt;

&lt;p&gt;You don't need Mythos Preview. Current models like Claude Opus 4.6 can already find hundreds of vulnerabilities. The point is to &lt;strong&gt;start building the muscle now&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shorten Your Patch Cycles
&lt;/h3&gt;

&lt;p&gt;The window between "vulnerability disclosed" and "exploit available" just collapsed from weeks to &lt;strong&gt;hours&lt;/strong&gt;. Auto-update everything. Treat security patches as urgent, not routine.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rethink "Defense in Depth"
&lt;/h3&gt;

&lt;p&gt;Some security measures work by making exploitation &lt;em&gt;tedious&lt;/em&gt; rather than &lt;em&gt;impossible&lt;/em&gt;. AI doesn't get tired. Focus on &lt;strong&gt;hard barriers&lt;/strong&gt; (like memory safety, address randomization) over &lt;strong&gt;friction-based&lt;/strong&gt; defenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Automate Your Incident Response
&lt;/h3&gt;

&lt;p&gt;More bugs found = more attacks attempted. You can't staff your way through the volume. Let models help with triage, investigation, and response.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔮 The Big Picture
&lt;/h2&gt;

&lt;p&gt;We're at an inflection point. For 20 years, cybersecurity has been in a relatively stable equilibrium — attacks evolved, but the &lt;em&gt;shape&lt;/em&gt; of attacks stayed similar. That's about to change.&lt;/p&gt;

&lt;p&gt;The good news: &lt;strong&gt;defense has the long-term advantage&lt;/strong&gt;. Defenders can use these tools proactively to find and fix every bug. Attackers only need to find one — but defenders can now find them first, at scale.&lt;/p&gt;

&lt;p&gt;The bad news: &lt;strong&gt;the transition will be rough&lt;/strong&gt;. Until the security world adapts, attackers who get access to similar capabilities will have a field day.&lt;/p&gt;

&lt;p&gt;Anthropic's bet with Project Glasswing is that by giving defenders a head start — even a few months — the industry can reach a new, more secure equilibrium before the storm hits.&lt;/p&gt;

&lt;p&gt;Whether that bet pays off depends on how fast the rest of us move.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Anthropic's Technical Blog Post&lt;/a&gt; — Full technical details on every vulnerability discussed&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/glasswing" rel="noopener noreferrer"&gt;Project Glasswing Announcement&lt;/a&gt; — Partner quotes and initiative details&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/coordinated-vulnerability-disclosure" rel="noopener noreferrer"&gt;Anthropic's Coordinated Vulnerability Disclosure Policy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, give it a 👏 and share it with your team. The cybersecurity landscape just changed — and everyone building software needs to understand how.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>claude</category>
      <category>programming</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>The Brain of the Future Agent: Why VL-JEPA Matters for Real-World AI</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Sun, 11 Jan 2026 01:31:10 +0000</pubDate>
      <link>https://forem.com/himanjan/the-brain-of-the-future-agent-why-vl-jepa-matters-for-real-world-ai-21no</link>
      <guid>https://forem.com/himanjan/the-brain-of-the-future-agent-why-vl-jepa-matters-for-real-world-ai-21no</guid>
      <description>&lt;h2&gt;
  
  
  The "Generative" Trap
&lt;/h2&gt;

&lt;p&gt;If you have been following AI recently, you know the drill: &lt;strong&gt;Input → Generate&lt;/strong&gt;. You give ChatGPT, Gemini, or Claude a prompt, it generates words. You give Sora a prompt, it generates pixels. You give Google's Veo a prompt, it creates a cinematic scene from scratch.&lt;/p&gt;

&lt;p&gt;This method, known as &lt;strong&gt;autoregressive generation&lt;/strong&gt;, is the engine behind almost every modern AI. It works by predicting the next tiny piece of data (a token) based on the previous ones.&lt;/p&gt;

&lt;p&gt;But there is a massive inefficiency lurking here.&lt;/p&gt;

&lt;p&gt;Imagine you are watching a video of a person cooking. To &lt;em&gt;understand&lt;/em&gt; that video, do you need to be able to paint every single pixel of the steam rising from the pot? &lt;strong&gt;No.&lt;/strong&gt; You just need to grasp the abstract concept: &lt;em&gt;“Water is boiling.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Standard Vision-Language Models (VLMs) like LLaVA or GPT-4V are forced to &lt;strong&gt;“paint the steam.”&lt;/strong&gt; They must model every surface-level detail—linguistic style, word choice, or pixel noise—just to prove they understand the scene. This makes them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computationally Expensive&lt;/strong&gt;: They waste compute on irrelevant details.&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Example: It burns energy calculating the exact shape of every cloud when you simply asked, “Is it sunny?”)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slow&lt;/strong&gt;: They must generate outputs token-by-token, which kills real-time performance.&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Example: It’s like waiting for a slow typist to finish a paragraph before you can know if the answer is “Yes” or “No.”)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hallucination-Prone&lt;/strong&gt;: If they don’t know a detail, the training objective still forces them to emit &lt;em&gt;some&lt;/em&gt; token sequence—often resulting in confident but incorrect completions.&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Example: Ask it to read a blurry license plate, and it will invent numbers just to complete the pattern.)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The inefficiency comes from the loss itself: &lt;strong&gt;cross-entropy penalizes every token mismatch&lt;/strong&gt;, even when two answers mean the same thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  VL-JEPA (Vision-Language Joint Embedding Predictive Architecture)
&lt;/h2&gt;

&lt;p&gt;After spending more than three days reading the &lt;strong&gt;&lt;a href="https://arxiv.org/pdf/2512.10942" rel="noopener noreferrer"&gt;VL-JEPA&lt;/a&gt;&lt;/strong&gt; paper, I can say this confidently: it introduces the first non-generative vision-language model designed to handle general-domain tasks in real time. It doesn't try to generate the answer. It predicts the mathematical "thought" of the answer.&lt;/p&gt;

&lt;p&gt;Its vision encoder is literally a pre-trained &lt;a href="https://ai.meta.com/vjepa/" rel="noopener noreferrer"&gt;V-JEPA 2&lt;/a&gt; model, which provides the rich, physics-aware video representations that the language component then learns to understand.&lt;/p&gt;

&lt;p&gt;VL-JEPA builds directly on the &lt;strong&gt;Joint Embedding Predictive Architecture (JEPA)&lt;/strong&gt; philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Never predict noise. Predict meaning.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 1: The Core Philosophy (Prediction vs. Generation)
&lt;/h2&gt;

&lt;p&gt;To understand VL-JEPA, you must unlearn the &lt;strong&gt;“next token prediction”&lt;/strong&gt; habit.&lt;br&gt;&lt;br&gt;
We need to shift our goal from &lt;strong&gt;creating pixels or words&lt;/strong&gt; to &lt;strong&gt;predicting states&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’ll explain this using one concrete scenario throughout: &lt;strong&gt;Spilled Milk&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Standard VLM Approach (Generative)
&lt;/h3&gt;

&lt;p&gt;In a standard model (like LLaVA or GPT-4V), the training goal is to generate text tokens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;X (Input)&lt;/strong&gt;: Video frames of the glass sliding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Y (Target)&lt;/strong&gt;: The text &lt;em&gt;“The glass falls and spills.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Process&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
The model guesses “The,” then “glass,” then “falls.”&lt;br&gt;&lt;br&gt;
If it guesses wrong (e.g., “The cup…”), it is penalized—even though the meaning is correct.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The VL-JEPA Approach (Predictive)
&lt;/h3&gt;

&lt;p&gt;VL-JEPA does &lt;strong&gt;not&lt;/strong&gt; model probabilities over tokens.&lt;br&gt;&lt;br&gt;
Instead, it minimizes the &lt;strong&gt;distance between embeddings&lt;/strong&gt; in a continuous space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SX (Input Embedding)&lt;/strong&gt;: A vector summarizing &lt;em&gt;“glass sliding.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SY (Target Embedding)&lt;/strong&gt;: A vector summarizing &lt;em&gt;“spill occurred.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Process&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Given the &lt;em&gt;sliding&lt;/em&gt; embedding, can the model predict the &lt;em&gt;spill&lt;/em&gt; embedding?&lt;/p&gt;

&lt;p&gt;No words. No pixels. Just meaning.&lt;/p&gt;




&lt;h3&gt;
  
  
  The “Orthogonal” Problem (from the paper)
&lt;/h3&gt;

&lt;p&gt;Text generation has a hidden flaw:&lt;/p&gt;

&lt;p&gt;In raw token space, different correct answers can look &lt;strong&gt;completely unrelated&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“The milk spilled.”&lt;/li&gt;
&lt;li&gt;“The liquid made a mess.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A standard VLM treats these as nearly orthogonal because the words don’t overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VL-JEPA’s solution&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
In embedding space, both sentences map to &lt;strong&gt;nearby points&lt;/strong&gt; because their meaning is the same.&lt;/p&gt;

&lt;p&gt;This collapses a messy, multi-modal output distribution into a &lt;strong&gt;single smooth region&lt;/strong&gt;, making learning dramatically more efficient.&lt;/p&gt;
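&lt;p&gt;A hand-built toy (invented vectors, not the paper's encoders) shows the gap: the two correct answers share zero content words, yet sit almost on top of each other in a concept space:&lt;/p&gt;

```python
# Hand-built toy: invented 2-D "concept" vectors, not the paper's encoders.
a = "the milk spilled"
b = "the liquid made a mess"

# Token view: the two correct answers share no content words at all.
tokens_a = {"milk", "spilled"}
tokens_b = {"liquid", "made", "mess"}
assert len(tokens_a.intersection(tokens_b)) == 0

# Embedding view: both map near the same point in a meaning space
# (axis 0: "liquid involved", axis 1: "accident happened").
emb = {
    a: [0.90, 0.80],
    b: [0.85, 0.75],
    "the sky is blue": [0.0, 0.0],
}

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v) + 1e-9)

assert cos(emb[a], emb[b]) > 0.99      # same meaning, nearby points
assert 0.5 > cos(emb[a], emb["the sky is blue"])
```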




&lt;h2&gt;
  
  
  Part 2: The Architecture (The Tripod of Understanding)
&lt;/h2&gt;

&lt;p&gt;Before we build the full car, we need to acknowledge the engine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VL-JEPA does not learn to see from scratch.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Its vision encoder is initialized from &lt;strong&gt;V-JEPA 2&lt;/strong&gt;, which already has a “gut feeling” for physics—like knowing unsupported objects tend to fall.&lt;/p&gt;

&lt;p&gt;Here’s how the system processes our spilled milk scenario:&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The X-Encoder (The Eyes)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A Vision Transformer (V-JEPA 2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it does&lt;/strong&gt;: Compresses video frames into &lt;strong&gt;visual embeddings&lt;/strong&gt;—dense numerical representations of &lt;em&gt;objects, motion, and relationships&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; predict future pixels.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The Predictor (The Brain)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A Transformer initialized from Llama-3.2 layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it does&lt;/strong&gt;: Combines:

&lt;ul&gt;
&lt;li&gt;Visual embeddings (glass sliding)&lt;/li&gt;
&lt;li&gt;A text query (e.g., &lt;em&gt;“What happens next?”&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;It predicts a &lt;strong&gt;target embedding&lt;/strong&gt; representing what &lt;em&gt;will&lt;/em&gt; happen.&lt;/p&gt;

&lt;p&gt;Conceptually, it behaves &lt;em&gt;as if&lt;/em&gt; it were composing latent factors like motion, support, and gravity to arrive at &lt;em&gt;“spilled milk.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unlike language models, this predictor uses &lt;strong&gt;bi-directional attention&lt;/strong&gt;, allowing vision and query tokens to jointly condition the prediction.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. The Y-Encoder (The Abstract Target)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A text embedding model (EmbeddingGemma).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it does&lt;/strong&gt;: Converts &lt;em&gt;“The milk spills”&lt;/em&gt; into the &lt;strong&gt;ground-truth answer embedding&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is trained to minimize the distance between its prediction and this embedding.&lt;/p&gt;
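&lt;p&gt;Putting the three encoders together, the training step looks roughly like this (toy dimensions and random stand-ins for V-JEPA 2, the Llama-initialized predictor, and EmbeddingGemma; not the paper's actual implementation):&lt;/p&gt;

```python
# Minimal sketch of the training signal: predict the target embedding,
# compare in embedding space, never decode tokens.
import random

DIM = 8
random.seed(0)

def x_encoder(frames):
    # Stand-in for V-JEPA 2: frames become one visual embedding.
    return [random.random() for _ in range(DIM)]

def y_encoder(answer_text):
    # Stand-in for EmbeddingGemma: the answer becomes the target embedding.
    return [random.random() for _ in range(DIM)]

def predictor(visual, query):
    # Stand-in for the predictor: maps (vision, query) to a guess.
    return [v * 0.5 + 0.1 for v in visual]

s_pred = predictor(x_encoder("glass sliding frames"), "What happens next?")
s_target = y_encoder("The milk spills")

# The whole objective: distance between embeddings, not token loss.
loss = sum((p - t) ** 2 for p, t in zip(s_pred, s_target)) ** 0.5
assert loss >= 0.0
```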




&lt;h3&gt;
  
  
  4. The Y-Decoder (The Mouth — Optional!)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A lightweight text decoder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key idea&lt;/strong&gt;: It is &lt;strong&gt;not used during main training&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model can &lt;em&gt;think&lt;/em&gt; about the milk spilling &lt;strong&gt;without talking about it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Text is generated &lt;strong&gt;only when a human needs it&lt;/strong&gt;, which is critical for efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The Superpower — Selective Decoding
&lt;/h2&gt;

&lt;p&gt;This is what makes VL-JEPA different.&lt;/p&gt;

&lt;p&gt;Imagine a robot watching the glass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard VLM (The Chatty Observer)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Frame 1: “The glass is on the table.”&lt;/li&gt;
&lt;li&gt;Frame 10: “The glass is moving.”&lt;/li&gt;
&lt;li&gt;Frame 20: “The glass is still moving.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It wastes compute describing moments where &lt;strong&gt;nothing meaningful changes&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  VL-JEPA (The Silent Observer)
&lt;/h3&gt;

&lt;p&gt;VL-JEPA produces a &lt;strong&gt;continuous stream of embeddings&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frames 1–50: Embeddings remain stable (&lt;em&gt;situation unchanged&lt;/em&gt;).
Decoder stays off. Silence.&lt;/li&gt;
&lt;li&gt;Frame 51: The glass tips.
The &lt;strong&gt;variance of the embedding stream increases&lt;/strong&gt;, signaling a semantic transition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Only then&lt;/strong&gt; does the decoder activate:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The glass has fallen.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This reduces decoding operations by &lt;strong&gt;~2.85×&lt;/strong&gt; while maintaining the same accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: The Verdict (Is It Actually Better?)
&lt;/h2&gt;

&lt;p&gt;Meta didn’t just theorize this—they ran a &lt;strong&gt;strictly controlled comparison&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can refer to Figure 3 in the paper:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lxnf1o0bikm5uyqqjvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lxnf1o0bikm5uyqqjvh.png" alt="cage-diagram" width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
source - &lt;a href="https://arxiv.org/pdf/2512.10942" rel="noopener noreferrer"&gt;VL-JEPA&lt;/a&gt; paper &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun1dholf5337q0tv56p4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun1dholf5337q0tv56p4.png" alt="comparison" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;
source - &lt;a href="https://arxiv.org/pdf/2512.10942" rel="noopener noreferrer"&gt;VL-JEPA&lt;/a&gt; paper &lt;/p&gt;

&lt;p&gt;Both models used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same vision encoder&lt;/li&gt;
&lt;li&gt;The same data&lt;/li&gt;
&lt;li&gt;The same batch size&lt;/li&gt;
&lt;li&gt;The same training steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;only&lt;/em&gt; difference was the objective:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Predict embeddings vs generate tokens.&lt;/strong&gt;&lt;/p&gt;
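&lt;p&gt;To make the contrast concrete, the two objectives can be sketched as toy loss functions: the generative model is scored on predicting the next token, while a JEPA-style model is scored on landing close to a target embedding. This is an illustrative simplification; the paper's actual losses differ in detail.&lt;/p&gt;

```python
import math

def token_cross_entropy(predicted_probs, target_ids):
    # Generative objective: average negative log-likelihood of each
    # target token under the model's predicted distribution.
    nll = -sum(math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids))
    return nll / len(target_ids)

def embedding_prediction_loss(predicted, target):
    # JEPA-style objective: mean squared distance between the predicted
    # embedding and the target embedding, in latent space.
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

&lt;p&gt;The training loops are otherwise identical; only which of these two quantities gets minimized differs.&lt;/p&gt;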

&lt;h3&gt;
  
  
  The Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learns Faster (Sample Efficiency)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
After 5M samples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VL-JEPA: &lt;strong&gt;14.7 CIDEr&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Generative VLM: &lt;strong&gt;7.1 CIDEr&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Requires Less Brain Power (Parameter Efficiency)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
VL-JEPA used &lt;strong&gt;50% fewer trainable parameters&lt;/strong&gt; (0.5B vs 1B).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Understands World Dynamics Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
On the WorldPrediction benchmark (state transition reasoning):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VL-JEPA: &lt;strong&gt;65.7%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-4o / Gemini-2.0: ~53%&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Importantly, this benchmark tests &lt;strong&gt;understanding how the world changes&lt;/strong&gt;, not symbolic reasoning or tool use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VL-JEPA proves that Thinking ≠ Talking.&lt;/p&gt;

&lt;p&gt;By separating the understanding process (Predictor) from the generation process (Decoder), Meta has built a model that is quieter, faster, and fundamentally more grounded in physical reality.&lt;/p&gt;

&lt;p&gt;If we want AI agents that can watch a toddler and catch a falling glass of milk in real time, we don't need models that can write a poem about the splash. We need models that can predict the spill before it happens. In my view, VL-JEPA is the first step toward that future.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Hidden Cost of LangChain: Why My Simple RAG System Cost 2.7x More Than Expected</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Tue, 22 Jul 2025 23:05:20 +0000</pubDate>
      <link>https://forem.com/himanjan/the-hidden-cost-of-langchain-why-my-simple-rag-system-cost-27x-more-than-expected-4hk9</link>
      <guid>https://forem.com/himanjan/the-hidden-cost-of-langchain-why-my-simple-rag-system-cost-27x-more-than-expected-4hk9</guid>
      <description>&lt;p&gt;&lt;em&gt;A developer's journey from excitement to shock—and what I learned about LangChain's true cost.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment I Realized Something Was Wrong
&lt;/h2&gt;

&lt;p&gt;Recently, I started deep diving into agentic AI, experimenting with &lt;strong&gt;LangChain&lt;/strong&gt; for building a simple &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; system. Everything seemed fine—until I noticed something strange.&lt;/p&gt;

&lt;p&gt;Just &lt;strong&gt;two runs&lt;/strong&gt; of my LangChain-based Python script consumed more than &lt;strong&gt;$0.038&lt;/strong&gt; each in OpenAI API costs, and my credit balance dropped from around $5 to &lt;strong&gt;$4.93&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;I'm on the &lt;em&gt;Pay-As-You-Go&lt;/em&gt; plan — so I &lt;em&gt;feel&lt;/em&gt; every API call.&lt;/p&gt;

&lt;p&gt;That got me thinking: &lt;em&gt;Is LangChain doing more under the hood than I realize?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I decided to compare it with a &lt;strong&gt;manual GPT-4 API call using OpenAI's SDK&lt;/strong&gt;, and what I found might surprise you. It wasn't easy at first: I couldn't find many resources on tracing the calls directly, nor any IDE-based extensions that do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Investigation: LangChain vs Manual Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Task
&lt;/h3&gt;

&lt;p&gt;Build a simple RAG system that performs the steps shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0b98b4t6pgslacxlrxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0b98b4t6pgslacxlrxr.png" alt="RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seems straightforward, right? Let me show you what happened when I built this two different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: The LangChain Way
&lt;/h3&gt;

&lt;p&gt;Here's my initial implementation using LangChain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.callbacks.manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_openai_callback&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.callbacks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseCallbackHandler&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 LLM call counter
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CountingHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_llm_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 LLM call #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load and split document
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myfile.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use OpenAI's embedding model (same for both examples)
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Setup GPT-4 LLM with counter
&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CountingHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Change to "stuff" or "map_reduce" for testing
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_openai_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the main idea of the document?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📌 Final Response:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 LangChain Usage:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total LLM Calls: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Looks clean and simple, right?&lt;/em&gt;&lt;/p&gt;
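&lt;p&gt;One detail worth flagging before the results: &lt;code&gt;chain_type="refine"&lt;/code&gt; makes one LLM call per retrieved chunk, re-sending a growing draft answer each time, while the manual version makes a single chat call. A back-of-the-envelope call-count model (my simplification, not LangChain's internals):&lt;/p&gt;

```python
def estimate_llm_calls(n_chunks, chain_type):
    # Rough call-count model for the QA chain types used above
    # (illustrative; actual prompt templates add further overhead tokens).
    if chain_type == "stuff":
        return 1             # all chunks packed into a single prompt
    if chain_type == "refine":
        return n_chunks      # initial answer, then one refine call per extra chunk
    if chain_type == "map_reduce":
        return n_chunks + 1  # one map call per chunk plus a final reduce call
    raise ValueError(chain_type)
```

&lt;p&gt;With a handful of retrieved chunks, "refine" multiplies your GPT-4 calls by the chunk count, which is exactly the kind of hidden cost the callback output below exposes.&lt;/p&gt;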

&lt;h3&gt;
  
  
  Approach 2: The Manual Way
&lt;/h3&gt;

&lt;p&gt;Here's the same functionality using direct OpenAI SDK calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Load and split text
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myfile.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Get embeddings from OpenAI
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_openai_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;

&lt;span class="n"&gt;chunk_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_openai_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Store in FAISS
&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_embeddings&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Embed user query
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the main idea of the document?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Search top 3 chunks
&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Build prompt and ask GPT-4
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a helpful assistant. Use the context below to answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Output
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📌 Final Response:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Token usage
&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Manual GPT-4 Usage:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;  &lt;span class="c1"&gt;# Est. GPT-4 input/output token cost
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Shocking Results
&lt;/h2&gt;

&lt;p&gt;I ran both implementations on the same test document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;my blog post published on dev.to comparing RAG vs. prompt engineering vs. fine-tuning, available here:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dev.to/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod"&gt;https://dev.to/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I saved the contents of that &lt;code&gt;.md&lt;/code&gt; file directly into a &lt;code&gt;.txt&lt;/code&gt; file (&lt;code&gt;myfile.txt&lt;/code&gt; in the code) and ran both versions.&lt;/p&gt;

&lt;p&gt;You can see the response comparison below. Both versions use the same embedding model from OpenAI: &lt;code&gt;text-embedding-ada-002&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response from the LangChain version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj9ummxfep95iq6tmbgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj9ummxfep95iq6tmbgm.png" alt="Lang-chain-version" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response from the manual OpenAI invocation version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uouzhtw5eyl4z568ktc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uouzhtw5eyl4z568ktc.png" alt="Manual-version" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have already read or skimmed my blog, you can see how neat and precise a summary the manual version produced, using just 342 prompt tokens, about half of what the LangChain version consumed. You might assume that prompt tokens alone determine the bill, but there is another hidden game in LangChain: the &lt;strong&gt;refine&lt;/strong&gt; chain type, which many production systems use, breaks the work into multiple calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Call 1: Initial answer with first chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call 2: Refine answer with second chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call 3: Refine again with third chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call 4: Final refinement&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each call includes the &lt;strong&gt;full prompt + previous context&lt;/strong&gt;, so tokens accumulate with every call.&lt;br&gt;
The other chain types available for RAG/QA are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stuff&lt;/strong&gt; - Puts all retrieved docs into a single prompt (most efficient)&lt;br&gt;
&lt;strong&gt;refine&lt;/strong&gt; - Iteratively refines answer with each document (what we used)&lt;br&gt;
&lt;strong&gt;map_reduce&lt;/strong&gt; - Processes each doc separately, then combines results&lt;br&gt;
&lt;strong&gt;map_rerank&lt;/strong&gt; - Scores each doc's answer and returns the best one&lt;/p&gt;
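&lt;p&gt;To see why &lt;strong&gt;refine&lt;/strong&gt; compounds tokens, here is a back-of-the-envelope simulation. The token counts are illustrative assumptions, not measured values; the point is the shape of the growth, not the exact numbers.&lt;/p&gt;

```python
# Illustrative simulation of how the "refine" chain accumulates prompt tokens.
# These per-call counts are assumptions for demonstration, not measured values.

PROMPT_OVERHEAD = 60     # refine instructions re-sent on every call
CHUNK_TOKENS = 150       # tokens per document chunk
ANSWER_TOKENS = 80       # previous answer carried into the next call

def refine_prompt_tokens(num_chunks: int) -> int:
    """Total prompt tokens across all refine calls."""
    total = PROMPT_OVERHEAD + CHUNK_TOKENS            # call 1: first chunk only
    for _ in range(num_chunks - 1):                   # calls 2..n carry the answer
        total += PROMPT_OVERHEAD + CHUNK_TOKENS + ANSWER_TOKENS
    return total

def stuff_prompt_tokens(num_chunks: int) -> int:
    """'stuff' sends everything in one request, so overhead is paid once."""
    return PROMPT_OVERHEAD + CHUNK_TOKENS * num_chunks

for n in (2, 4, 8):
    print(n, refine_prompt_tokens(n), stuff_prompt_tokens(n))
```

With these assumed numbers, refine already costs roughly 1.6x the stuff chain at four chunks, and the gap widens as documents grow.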

&lt;h3&gt;
  
  
  Cost Comparison Summary:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual approach&lt;/strong&gt;: 487 tokens, &lt;strong&gt;$0.0146&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain approach&lt;/strong&gt;: 1,017 tokens, &lt;strong&gt;$0.0388&lt;/strong&gt; (&lt;strong&gt;2.7x more expensive!&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me break this down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Manual Implementation&lt;/th&gt;
&lt;th&gt;LangChain Implementation&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;487&lt;/td&gt;
&lt;td&gt;1,017&lt;/td&gt;
&lt;td&gt;+108%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0146&lt;/td&gt;
&lt;td&gt;$0.0388&lt;/td&gt;
&lt;td&gt;+166%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Calls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 (trackable)&lt;/td&gt;
&lt;td&gt;??? (hidden)&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging Difficulty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Nightmare&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
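&lt;p&gt;The percentages in the table can be reproduced from the raw figures reported above:&lt;/p&gt;

```python
# Reproduce the comparison figures from the raw numbers reported in this post.
manual_tokens, langchain_tokens = 487, 1_017
manual_cost, langchain_cost = 0.0146, 0.0388

token_increase = (langchain_tokens - manual_tokens) / manual_tokens
cost_increase = (langchain_cost - manual_cost) / manual_cost

print(f"Token increase: {token_increase:+.0%}")           # roughly +109%
print(f"Cost increase:  {cost_increase:+.0%}")            # roughly +166%
print(f"Cost multiple:  {langchain_cost / manual_cost:.1f}x")  # 2.7x
```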

&lt;h2&gt;
  
  
  Why Is LangChain So Much More Expensive?
&lt;/h2&gt;

&lt;p&gt;After digging deeper, I discovered several hidden costs in LangChain:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Suboptimal Batching&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain's &lt;code&gt;OpenAIEmbeddings&lt;/code&gt; defaults to batching 1,000 texts per API call, but OpenAI supports up to 2,048 inputs per request. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;You're making ~2x more API calls than necessary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;More calls = more latency + more rate limit exposure&lt;/li&gt;
&lt;/ul&gt;
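&lt;p&gt;When calling the embeddings endpoint directly, you can batch at the documented maximum yourself. A minimal sketch, assuming the current &lt;code&gt;openai&lt;/code&gt; Python SDK's &lt;code&gt;client.embeddings.create&lt;/code&gt; interface and the 2,048-input cap mentioned above:&lt;/p&gt;

```python
from typing import Iterator

OPENAI_MAX_EMBED_INPUTS = 2_048  # documented per-request cap for the embeddings API

def batched(texts: list[str], size: int = OPENAI_MAX_EMBED_INPUTS) -> Iterator[list[str]]:
    """Yield successive batches at the API's maximum size."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]

def embed_all(client, texts: list[str], model: str = "text-embedding-ada-002"):
    """Embed every text in as few requests as possible."""
    vectors = []
    for batch in batched(texts):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return vectors

# 5,000 chunks need only 3 requests, versus 5 at a 1,000-text batch size
print(len(list(batched(["x"] * 5_000))))  # 3
```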

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Hidden Internal Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain makes API calls you can't see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal prompt formatting calls&lt;/li&gt;
&lt;li&gt;Retry logic that may duplicate requests&lt;/li&gt;
&lt;li&gt;Chain validation calls&lt;/li&gt;
&lt;li&gt;Memory management overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Inefficient Context Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The framework often includes unnecessary context or makes redundant calls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document metadata processing&lt;/li&gt;
&lt;li&gt;Chain state management&lt;/li&gt;
&lt;li&gt;Output parsing validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Broken Cost Tracking&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Perhaps most troubling: &lt;code&gt;get_openai_callback()&lt;/code&gt; often shows &lt;strong&gt;$0.00&lt;/strong&gt; when you're actually being charged. I experienced this firsthand—the callback reported no costs while my OpenAI balance clearly decreased.&lt;/p&gt;

&lt;p&gt;Digging further, I found multiple blog posts and GitHub issues reporting the same problem.&lt;/p&gt;
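&lt;p&gt;Since the callback can't be trusted, I read usage straight off each API response instead. A minimal sketch of such a tracker; the per-1K rates below are illustrative GPT-4-era numbers, so substitute your model's current pricing:&lt;/p&gt;

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class UsageTracker:
    """Accumulate token usage reported by each OpenAI API response."""
    # Illustrative GPT-4-era rates per 1K tokens; substitute current pricing.
    input_rate: float = 0.03
    output_rate: float = 0.06
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, usage) -> None:
        """Call with response.usage after every chat/completions request."""
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    @property
    def cost(self) -> float:
        return (self.prompt_tokens / 1000 * self.input_rate
                + self.completion_tokens / 1000 * self.output_rate)

# Usage example with a stand-in usage object (a real one comes from the API).
tracker = UsageTracker()
tracker.record(SimpleNamespace(prompt_tokens=342, completion_tokens=145))
print(f"${tracker.cost:.4f}")
```

Because the numbers come from the API's own response objects, the total matches what OpenAI actually bills, with no framework in between.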

&lt;h2&gt;
  
  
  The Broader Pattern: Companies Are Moving Away
&lt;/h2&gt;

&lt;p&gt;My experience isn't isolated. Research reveals a troubling pattern:&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Company Migrations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Octomind&lt;/strong&gt; used LangChain for a year to power AI agents that create and fix software tests. After growing frustrations with debugging and inflexibility, they &lt;strong&gt;removed LangChain entirely in 2024&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Once we removed it… we could just code. No longer being constrained by LangChain made our team far more productive."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Multiple development teams&lt;/strong&gt; have documented similar experiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10+ months of LangChain code replaced with direct OpenAI implementations in just weeks&lt;/li&gt;
&lt;li&gt;Elimination of dependency conflicts and version incompatibilities&lt;/li&gt;
&lt;li&gt;Significant performance improvements and cost reductions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical Evidence from the Community
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Issues&lt;/strong&gt; document systematic problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issue #12994: &lt;code&gt;get_openai_callback()&lt;/code&gt; showing $0.00 instead of actual $18.24 costs&lt;/li&gt;
&lt;li&gt;Issue #14952: Broken debug logs that merge messages incorrectly&lt;/li&gt;
&lt;li&gt;Widespread reports of &lt;code&gt;AttributeError: module 'langchain' has no attribute 'debug'&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Developer testimonials&lt;/strong&gt; consistently report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple tasks requiring complex workarounds&lt;/li&gt;
&lt;li&gt;More time debugging LangChain than building features&lt;/li&gt;
&lt;li&gt;Inability to optimize for specific use cases due to abstraction layers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Means for Your Projects
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When LangChain Might Be Worth It:
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Rapid prototyping&lt;/strong&gt; where cost isn't a concern&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Learning RAG concepts&lt;/strong&gt; and experimentation&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Demos and tutorials&lt;/strong&gt; that need quick setup&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Multi-provider scenarios&lt;/strong&gt; requiring provider abstraction  &lt;/p&gt;

&lt;h3&gt;
  
  
  When to Skip LangChain:
&lt;/h3&gt;

&lt;p&gt;❌ &lt;strong&gt;Production systems&lt;/strong&gt; where cost and performance matter&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Budget-conscious projects&lt;/strong&gt; on pay-as-you-go plans&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Applications requiring precise cost tracking&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Performance-critical systems&lt;/strong&gt; needing optimization&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Projects requiring detailed debugging capabilities&lt;/strong&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line: Transparency Wins
&lt;/h2&gt;

&lt;p&gt;My investigation revealed that &lt;strong&gt;what you can't see can hurt you&lt;/strong&gt;. LangChain's abstractions, while convenient for learning, often hide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2-3x higher token usage&lt;/strong&gt; than optimal implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple hidden API calls&lt;/strong&gt; that compound costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suboptimal batching&lt;/strong&gt; that wastes money and time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broken cost tracking&lt;/strong&gt; that leaves you blind to expenses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my pay-as-you-go budget, these hidden costs add up quickly. What should have been a $0.015 experiment became a $0.038 surprise—and that's just for two simple runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actionable Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Learning and Prototyping:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use LangChain to understand RAG concepts quickly&lt;/li&gt;
&lt;li&gt;Expect 2-3x higher costs during development&lt;/li&gt;
&lt;li&gt;Don't rely on &lt;code&gt;get_openai_callback()&lt;/code&gt; for accurate tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For Production Systems:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with direct API implementations&lt;/strong&gt; for transparency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch embeddings optimally&lt;/strong&gt; (up to 2,048 inputs per OpenAI call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track every token&lt;/strong&gt; with precise cost calculation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile your usage patterns&lt;/strong&gt; before optimizing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Optimization Strategy:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Implement precise token tracking&lt;/strong&gt; from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch operations efficiently&lt;/strong&gt; to minimize API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache embeddings&lt;/strong&gt; to avoid repeated calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor costs continuously&lt;/strong&gt; with direct API usage metrics&lt;/li&gt;
&lt;/ol&gt;
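&lt;p&gt;Step 3 is often the cheapest win: identical text should never be embedded twice. A minimal sketch of a content-addressed cache, where &lt;code&gt;embed_fn&lt;/code&gt; stands in for whatever calls your embedding API:&lt;/p&gt;

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so repeated text costs nothing."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # your real embedding API call
        self.store = {}
        self.misses = 0           # number of actual API calls made

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Usage with a stand-in embedding function.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("hello"); cache.get("hello"); cache.get("world")
print(cache.misses)  # 2
```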

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;LangChain serves an important purpose in the AI ecosystem—it helps developers learn and prototype quickly. But for production systems where every dollar counts, &lt;strong&gt;transparency and control&lt;/strong&gt; are worth the extra development effort.&lt;/p&gt;

&lt;p&gt;The 2.7x cost difference I discovered isn't just about money—it's about &lt;strong&gt;understanding what your code actually does&lt;/strong&gt;. When you're building AI applications that could scale to thousands of users, those hidden costs become hidden disasters.&lt;/p&gt;

&lt;p&gt;My advice? &lt;strong&gt;Learn with LangChain, deploy with direct APIs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your wallet will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you experienced similar cost surprises with LangChain? Share your experience in the comments below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: LangChain, OpenAI API, RAG, Cost Optimization, AI Development, Token Usage, Production AI&lt;/p&gt;




&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;A software developer exploring the practical challenges of building production AI systems. Currently investigating the gap between AI framework promises and real-world performance.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticai</category>
      <category>langchain</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>RAG vs Fine-tuning vs Prompt Engineering: The Complete Enterprise Guide</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Sat, 28 Jun 2025 00:09:33 +0000</pubDate>
      <link>https://forem.com/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod</link>
      <guid>https://forem.com/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod</guid>
      <description>&lt;p&gt;&lt;em&gt;How to choose the right AI approach for your business needs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When building AI applications for your business, you'll face a critical decision: Should you use Retrieval-Augmented Generation (RAG), fine-tune a model, or rely on prompt engineering? Each approach has distinct advantages, costs, and use cases. This guide will help you make the right choice with real-world examples and practical frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Three Approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt Engineering: The Art of Communication
&lt;/h3&gt;

&lt;p&gt;Prompt engineering is like having a conversation with a highly knowledgeable assistant. You craft specific instructions, provide context, and guide the AI's responses through carefully designed prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; You provide instructions, examples, and context directly in your input to guide the model's behavior without changing the underlying model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Example&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weak Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Summarize the latest AI trends."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strong Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Act as a tech analyst and write a 300-word summary of the top 3
generative AI trends for enterprise adoption in 2025. 
For each trend, briefly explain its impact on the software
development industry. The tone should be professional and informative."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  RAG (Retrieval-Augmented Generation): Dynamic Knowledge Integration
&lt;/h3&gt;

&lt;p&gt;RAG combines the power of search with generation. It retrieves relevant information from your knowledge base in real-time and uses that context to generate accurate, up-to-date responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; When a user asks a question, the RAG system first retrieves relevant documents or data snippets from a specified knowledge base (like your company's internal wiki, product documentation, or a database). This retrieved information is then passed to the LLM along with the original prompt, giving the model the necessary context to generate a factually grounded and accurate response.&lt;/p&gt;
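&lt;p&gt;The retrieve-then-generate loop can be sketched in a few lines. This toy version uses hand-written embeddings and cosine similarity; a real system would embed with a model and use a vector store, and the &lt;code&gt;knowledge_base&lt;/code&gt; entries here are purely hypothetical:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy knowledge base: (text, embedding) pairs. Real systems embed with a model.
knowledge_base = [
    ("Returns are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 business days.",    [0.1, 0.9, 0.0]),
    ("Support is available 24/7.",           [0.0, 0.1, 0.9]),
]

def retrieve(query_embedding, k=2):
    """Rank documents by similarity to the query and keep the top k."""
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine(query_embedding, doc[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_embedding):
    """Ground the LLM in retrieved context instead of letting it guess."""
    context = "\n".join(retrieve(query_embedding))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the return window?", [0.8, 0.2, 0.1]))
```

The resulting prompt, context plus question, is what actually gets sent to the LLM, which is why RAG answers stay grounded in your documents.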

&lt;h3&gt;
  
  
  Fine-tuning: Specialized Model Training
&lt;/h3&gt;

&lt;p&gt;Fine-tuning involves training a pre-existing model on your specific data to create a customized version that understands your domain, terminology, and patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; You take a base model and continue training it on your specific dataset, adjusting the model's weights to perform better on your particular tasks.&lt;/p&gt;
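&lt;p&gt;In practice, most of the effort goes into preparing training examples. A sketch of building a JSONL file in the chat format OpenAI's fine-tuning endpoint expects; the tickets and system prompt below are hypothetical:&lt;/p&gt;

```python
import json

# Hypothetical domain examples: (user input, desired output) pairs.
examples = [
    ("Summarize ticket #4521", "Customer reports login failure after password reset."),
    ("Summarize ticket #4522", "Refund requested for duplicate charge on invoice 881."),
]

SYSTEM = "You are a support-ticket summarizer. Reply in one sentence."

def to_jsonl(pairs) -> str:
    """One chat-formatted training example per line, as the API expects."""
    lines = []
    for user, assistant in pairs:
        lines.append(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]}))
    return "\n".join(lines)

# Write the training file that gets uploaded to the fine-tuning job.
with open("training_data.jsonl", "w") as f:
    f.write(to_jsonl(examples))
```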

&lt;h2&gt;
  
  
  Detailed Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Prompt Engineering
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero setup cost&lt;/strong&gt;: Start immediately with existing models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum flexibility&lt;/strong&gt;: Easily adjust behavior with prompt changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No technical infrastructure&lt;/strong&gt;: Works with any API-based model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid iteration&lt;/strong&gt;: Test different approaches in minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No data preparation&lt;/strong&gt;: Use natural language instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control friendly&lt;/strong&gt;: Prompts are just text files&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited context window&lt;/strong&gt;: Constrained by model's token limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent results&lt;/strong&gt;: Performance varies with prompt quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No persistent learning&lt;/strong&gt;: Can't learn from new information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection risks&lt;/strong&gt;: Vulnerable to malicious inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual optimization&lt;/strong&gt;: Requires human expertise to craft effective prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token costs&lt;/strong&gt;: Long prompts increase API usage costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Best Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Quick prototypes and MVPs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;General-purpose applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When you need immediate results&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small-scale applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tasks with clear, simple instructions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Example: Customer Service Chatbot
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Company:&lt;/strong&gt; Mid-sized e-commerce startup&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Handle basic customer inquiries without extensive setup&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Used prompt engineering with clear instructions about company policies, tone, and escalation procedures&lt;br&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Deployed in 2 days, handled 60% of basic inquiries effectively&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example Prompt:
"You are a helpful customer service representative for TechStore. 
Be friendly, professional, and concise. If asked about returns, 
our policy is 30 days with receipt. For technical issues, 
escalate to human support. Always end with 'Is there anything 
else I can help you with?'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. RAG (Retrieval-Augmented Generation)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always current&lt;/strong&gt;: Accesses real-time information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable knowledge&lt;/strong&gt;: Handle millions of documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainable&lt;/strong&gt;: Can show source documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective&lt;/strong&gt;: No model retraining needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic updates&lt;/strong&gt;: Add new information instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced hallucinations&lt;/strong&gt;: Grounded in actual documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible data sources&lt;/strong&gt;: PDFs, databases, websites, APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex architecture&lt;/strong&gt;: Requires vector databases and search infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval quality dependency&lt;/strong&gt;: Poor search = poor responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead&lt;/strong&gt;: Additional retrieval step adds delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking challenges&lt;/strong&gt;: Document segmentation affects quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher operational costs&lt;/strong&gt;: Multiple systems to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data preprocessing&lt;/strong&gt;: Documents need cleaning and structuring&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Best Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Knowledge bases and documentation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customer support with evolving information&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Research and analysis applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance and regulatory queries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enterprise search and Q&amp;amp;A&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Example: Legal Research Platform
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Company:&lt;/strong&gt; Large law firm (500+ attorneys)&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Quickly find relevant case law and regulations across thousands of documents&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; RAG system indexing legal databases, case files, and regulatory documents&lt;br&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector database with 2M+ legal documents&lt;/li&gt;
&lt;li&gt;Semantic search for case similarity&lt;/li&gt;
&lt;li&gt;Real-time updates when new cases are filed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Reduced research time from hours to minutes, 40% increase in billable efficiency&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding model: Specialized legal text embeddings&lt;/li&gt;
&lt;li&gt;Vector store: Pinecone with legal document metadata&lt;/li&gt;
&lt;li&gt;Retrieval: Hybrid search (semantic + keyword)&lt;/li&gt;
&lt;li&gt;Generation: GPT-4 with legal prompt templates&lt;/li&gt;
&lt;/ul&gt;
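&lt;p&gt;The hybrid retrieval step can be sketched as a weighted blend of a semantic score and a keyword score. The 0.7/0.3 weights and the documents below are illustrative; production systems tune the weights against real queries:&lt;/p&gt;

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    terms = query.lower().split()
    return sum(t in doc.lower() for t in terms) / len(terms)

def hybrid_score(semantic: float, keyword: float,
                 w_semantic: float = 0.7, w_keyword: float = 0.3) -> float:
    """Blend both signals; the weights are illustrative and should be tuned."""
    return w_semantic * semantic + w_keyword * keyword

# Stand-in semantic scores (a real system gets these from the vector store).
docs = {
    "Smith v. Jones, breach of contract": 0.82,
    "Doe v. Acme, product liability":     0.64,
}
query = "breach of contract damages"
ranked = sorted(docs,
                key=lambda d: hybrid_score(docs[d], keyword_score(query, d)),
                reverse=True)
print(ranked[0])
```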

&lt;h3&gt;
  
  
  3. Fine-tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain expertise&lt;/strong&gt;: Learns your specific language and patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent performance&lt;/strong&gt;: Stable, predictable outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compact responses&lt;/strong&gt;: No need to include context in prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom behavior&lt;/strong&gt;: Learns unique workflows and decision patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: Smaller, specialized models can outperform larger general ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intellectual property&lt;/strong&gt;: Your customized model becomes a business asset&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High upfront costs&lt;/strong&gt;: Requires significant data preparation and training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data requirements&lt;/strong&gt;: Needs thousands of high-quality examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-intensive&lt;/strong&gt;: Weeks or months to develop properly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance overhead&lt;/strong&gt;: Must retrain for updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical expertise&lt;/strong&gt;: Requires ML engineering skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflexible&lt;/strong&gt;: Hard to modify behavior after training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catastrophic forgetting&lt;/strong&gt;: May lose general capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Best Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Highly specialized domains&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent, repetitive tasks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When you have abundant training data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Applications requiring specific output formats&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When general models consistently fail&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Example: Medical Diagnosis Assistant
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Company:&lt;/strong&gt; Regional hospital network&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Create an AI assistant that understands medical terminology and follows clinical protocols&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Fine-tuned model on medical records, clinical guidelines, and diagnostic procedures&lt;br&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data: 100K+ anonymized medical cases&lt;/li&gt;
&lt;li&gt;Base model: BioBERT specialized for medical text&lt;/li&gt;
&lt;li&gt;Fine-tuning: 3 months with medical experts&lt;/li&gt;
&lt;li&gt;Validation: Tested against clinical gold standards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 85% accuracy in preliminary diagnoses, reduced diagnosis time by 30%&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Which Approach to Choose?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with These Questions:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Data and Knowledge Requirements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Do you need access to frequently changing information?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you have thousands of examples of desired behavior?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can you describe your requirements clearly?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Technical Resources
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited technical team?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong engineering but limited ML expertise?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedicated ML team and infrastructure?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Time and Budget Constraints
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need results this week?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait 2-4 weeks for better results?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Have 2-6 months for optimal solution?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Scale and Performance Requirements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prototype or small-scale?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-scale with evolving content?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-volume, consistent performance needed?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enterprise Examples by Industry
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Financial Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Investment research platform&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Quick market analysis templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG:&lt;/strong&gt; Real-time financial news and earnings reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning:&lt;/strong&gt; Specialized financial language and regulatory compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chosen Approach:&lt;/strong&gt; RAG + Prompt Engineering hybrid&lt;br&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Need current market data (RAG) with consistent analysis format (prompts)&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Clinical decision support system&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Basic symptom checkers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG:&lt;/strong&gt; Latest medical research and drug interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning:&lt;/strong&gt; Specialized medical reasoning and terminology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chosen Approach:&lt;/strong&gt; Fine-tuning with RAG augmentation&lt;br&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Medical accuracy requires specialized training, but needs current research&lt;/p&gt;

&lt;h3&gt;
  
  
  E-commerce
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Product recommendation engine&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Simple recommendation rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG:&lt;/strong&gt; Current product catalogs and reviews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning:&lt;/strong&gt; Customer behavior patterns and preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chosen Approach:&lt;/strong&gt; Fine-tuning for personalization&lt;br&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Rich customer data enables personalized behavior learning&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Approaches: Best of All Worlds
&lt;/h2&gt;

&lt;p&gt;Many successful enterprise applications combine multiple approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG + Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;Perfect for customer support systems that need both current information and consistent tone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Software company help desk&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG retrieves relevant documentation&lt;/li&gt;
&lt;li&gt;Prompt engineering ensures helpful, branded responses&lt;/li&gt;
&lt;li&gt;Result: Accurate, current, and consistently helpful support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fine-tuning + RAG
&lt;/h3&gt;

&lt;p&gt;Ideal for specialized domains requiring both expertise and current information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Legal research platform&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuned model understands legal reasoning&lt;/li&gt;
&lt;li&gt;RAG provides access to latest cases and regulations&lt;/li&gt;
&lt;li&gt;Result: Expert-level legal analysis with current information&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  All Three Combined
&lt;/h3&gt;

&lt;p&gt;Enterprise-grade solutions often use a layered approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Enterprise knowledge management&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt; for domain understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; for accessing company knowledge base&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt engineering&lt;/strong&gt; for role-specific responses&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Start with Prompt Engineering (Week 1-2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Validate your use case quickly&lt;/li&gt;
&lt;li&gt;Understand user needs and edge cases&lt;/li&gt;
&lt;li&gt;Build initial user feedback loop&lt;/li&gt;
&lt;li&gt;Estimate performance requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Implement RAG if Needed (Week 3-6)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you need access to large knowledge bases&lt;/li&gt;
&lt;li&gt;When information changes frequently&lt;/li&gt;
&lt;li&gt;For explainable AI requirements&lt;/li&gt;
&lt;li&gt;To reduce hallucinations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Consider Fine-tuning (Month 2-6)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When you have sufficient training data&lt;/li&gt;
&lt;li&gt;For highly specialized domains&lt;/li&gt;
&lt;li&gt;When consistency is critical&lt;/li&gt;
&lt;li&gt;To optimize for performance and cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt Engineering
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development:&lt;/strong&gt; $5K-$20K (mainly developer time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; API costs ($0.01-$0.06 per 1K tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Low (prompt updates)&lt;/li&gt;
&lt;/ul&gt;
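&lt;p&gt;To make the per-token pricing concrete, here is a quick back-of-the-envelope calculation in Python. The price and volume are illustrative assumptions, not figures from any specific provider:&lt;/p&gt;

```python
# Illustrative assumptions: $0.03 per 1K tokens (mid-range of $0.01-$0.06)
# and 500K tokens processed per day.
price_per_1k_tokens = 0.03
tokens_per_day = 500_000

daily_cost = tokens_per_day / 1000 * price_per_1k_tokens
monthly_cost = daily_cost * 30
print(f"Daily: ${daily_cost:.2f}, monthly: ${monthly_cost:.2f}")
```

&lt;p&gt;At this volume the ongoing cost of prompt engineering is dominated by the API bill, which scales linearly with token throughput.&lt;/p&gt;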

&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development:&lt;/strong&gt; $50K-$200K (infrastructure + development)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; $1K-$10K/month (vector DB + compute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Medium (data pipeline management)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fine-tuning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development:&lt;/strong&gt; $100K-$500K (data prep + training + validation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; $2K-$20K/month (model hosting + retraining)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; High (continuous data collection + retraining)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt Engineering Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Over-engineering prompts:&lt;/strong&gt; Keep them simple and clear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not testing edge cases:&lt;/strong&gt; Use diverse test scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring prompt injection:&lt;/strong&gt; Validate and sanitize inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  RAG Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Poor chunking strategy:&lt;/strong&gt; Test different chunk sizes and overlap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevant retrieval:&lt;/strong&gt; Improve embedding quality and search logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information overload:&lt;/strong&gt; Limit retrieved context to most relevant&lt;/li&gt;
&lt;/ul&gt;
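&lt;p&gt;To make the chunking pitfall concrete, here is a minimal fixed-size chunker with overlap, written in Python. Real pipelines usually split on sentence or section boundaries, but the size/overlap trade-off it demonstrates is the same:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=100, overlap=20):
    # Slide a window of chunk_size characters; consecutive chunks share
    # `overlap` characters so a fact straddling a boundary stays retrievable.
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "RAG quality depends heavily on how documents are split before embedding. " * 4
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

&lt;p&gt;Too-small chunks lose context; too-large chunks dilute the embedding and crowd the prompt. Testing a few size/overlap combinations against real queries is usually worth the effort.&lt;/p&gt;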

&lt;h3&gt;
  
  
  Fine-tuning Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Insufficient training data:&lt;/strong&gt; Ensure data quality over quantity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overfitting:&lt;/strong&gt; Use proper validation and regularization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting base capabilities:&lt;/strong&gt; Monitor general performance degradation&lt;/li&gt;
&lt;/ul&gt;
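&lt;p&gt;The overfitting pitfall is usually caught by watching validation loss rather than training loss. A minimal early-stopping check in Python, with a made-up loss history for illustration:&lt;/p&gt;

```python
def should_stop(val_losses, patience=3):
    # Early stopping: stop fine-tuning when validation loss has not
    # improved for `patience` consecutive evaluations.
    best = min(val_losses)
    best_idx = val_losses.index(best)
    return len(val_losses) - 1 - best_idx >= patience

# Hypothetical validation losses: improving, then drifting up (overfitting).
history = [0.91, 0.72, 0.60, 0.58, 0.59, 0.61, 0.64]
print(should_stop(history))  # True: no improvement in the last 3 evals
```

&lt;p&gt;Pairing a check like this with a held-out evaluation set of general-purpose prompts also helps catch the third pitfall, degradation of base capabilities.&lt;/p&gt;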

&lt;h2&gt;
  
  
  Future-Proofing Your Decision
&lt;/h2&gt;

&lt;p&gt;Technology evolves rapidly. Consider these factors for long-term success:&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Trends
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Larger context windows&lt;/strong&gt; may reduce RAG complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better base models&lt;/strong&gt; may reduce fine-tuning needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal capabilities&lt;/strong&gt; will expand all approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flexibility Planning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start with simpler approaches (prompt engineering/RAG)&lt;/li&gt;
&lt;li&gt;Design systems that can incorporate fine-tuned models later&lt;/li&gt;
&lt;li&gt;Maintain data collection for future fine-tuning opportunities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between RAG, fine-tuning, and prompt engineering isn't always either/or. The best enterprise AI solutions often combine multiple approaches strategically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with prompt engineering&lt;/strong&gt; for rapid prototyping and validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add RAG&lt;/strong&gt; when you need access to large, changing knowledge bases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider fine-tuning&lt;/strong&gt; for specialized domains with abundant data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember: the "best" approach is the one that solves your specific problem effectively within your constraints. Start simple, measure results, and evolve your approach as your needs and capabilities grow.&lt;/p&gt;

&lt;p&gt;The future belongs to organizations that can adapt their AI strategy as technology evolves. By understanding the strengths and limitations of each approach, you'll be equipped to make informed decisions that drive real business value.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>techstrategy</category>
      <category>enterprise</category>
    </item>
    <item>
      <title>Running AWS Model Context Protocol (MCP) Servers on Docker containers with DeepSeek LLM</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Tue, 10 Jun 2025 21:00:42 +0000</pubDate>
      <link>https://forem.com/himanjan/running-aws-model-context-protocol-mcp-servers-on-docker-containers-with-deepseek-llm-4bp9</link>
      <guid>https://forem.com/himanjan/running-aws-model-context-protocol-mcp-servers-on-docker-containers-with-deepseek-llm-4bp9</guid>
      <description>&lt;p&gt;MCP (Model Context Protocol) has gained popularity due to its ease of use, standardization, and efficiency in integrating AI models with external systems. MCP is particularly useful in AI-driven automation, agent-based systems, and LLM-powered applications, making it a go-to choice for developers looking to enhance AI interactions. &lt;/p&gt;

&lt;p&gt;In this post, we'll take an in-depth look at running an AWS Lab MCP server inside a Docker container and leveraging a large language model (LLM)—DeepSeek in this case—to send prompts for executing actions efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9s7hdk9wo8a3rc7a3xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9s7hdk9wo8a3rc7a3xd.png" alt="AWS MCP" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're already familiar with AWS Cloud and want to explore how MCP operates in real time to execute specific actions, &lt;a href="https://github.com/awslabs/mcp" rel="noopener noreferrer"&gt;AWS Labs MCP&lt;/a&gt; is a great starting point for hands-on experimentation.&lt;/p&gt;

&lt;p&gt;I will cover MCP architecture and how the AWS MCP server interacts with an LLM in a separate post. Here, the goal is to get things working quickly and see MCP in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the AWS MCP git repo&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/awslabs/mcp.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to the MCP server you want to run as a Docker container. We will generate an AWS architecture diagram in this example, so navigate to &lt;code&gt;aws-diagram-mcp-server&lt;/code&gt; to see it in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cd aws-mcp/mcp/src/aws-diagram-mcp-server/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Build the Docker image&lt;/strong&gt;&lt;br&gt;
 Build a Docker image from the server's Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker build -t awslabs/aws-diagram-mcp-server .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the image is built successfully, we will connect to this MCP server from an LLM agent. I will use the Cline extension here; I found it easy to configure and convenient for running MCP servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LLM API Configuration from VSCode&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install the Cline extension in VS Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjlzpwigc2age48kktzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjlzpwigc2age48kktzo.png" alt="Cline Extension" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add the LLM API&lt;/strong&gt;&lt;br&gt;
Open the Cline extension and configure the LLM API you will use with the MCP server. By default, Cline uses Anthropic's Claude Sonnet 4 model; you can change this from the API Provider option.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqlzd0wj2fg65z6stkii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqlzd0wj2fg65z6stkii.png" alt="anthropic" width="672" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case I selected DeepSeek, my personal favourite. Enter the API key and click Done. Refer to the screenshots below to change the API provider and add the key.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LLM API provider&lt;/th&gt;
&lt;th&gt;Update API Key&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsxyd46m5789gqvuuffx.png" width="616" height="1478"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxf5dj1zh2gjhaywkhj8.png" width="616" height="1478"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once the LLM API configuration is successful, send a quick "hello" prompt to verify that the model responds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Add MCP server in Cline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we can add the MCP server to Cline. Select the MCP servers icon shown below and click &lt;code&gt;manage MCP server&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jemb1l0jm32rxa18osz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jemb1l0jm32rxa18osz.png" alt="MCP server" width="676" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Configure MCP Servers&lt;/code&gt;, which opens the &lt;code&gt;cline_mcp_settings.json&lt;/code&gt; file in VS Code.&lt;/p&gt;

&lt;p&gt;Add the block below for the AWS diagram MCP server. The Docker-based MCP JSON for each server can be found in that server's documentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"awslabs.aws-diagram-mcp-server": {
        "command": "docker",
        "args": [
          "run",
          "--rm",
          "--interactive",
          "--env",
          "FASTMCP_LOG_LEVEL=ERROR",
          "awslabs/aws-diagram-mcp-server"
        ],
        "env": {},
        "disabled": false,
        "autoApprove": []
      }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding the block above to &lt;code&gt;cline_mcp_settings.json&lt;/code&gt;, save the file with Cmd+S/Ctrl+S.&lt;br&gt;
This will spin up a container for the MCP server from the Docker image we built earlier. Make sure you used the right image name in &lt;code&gt;cline_mcp_settings.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In my case it is &lt;code&gt;awslabs/aws-diagram-mcp-server&lt;/code&gt;&lt;/p&gt;
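&lt;p&gt;For reference, a complete &lt;code&gt;cline_mcp_settings.json&lt;/code&gt; with just this one server would look roughly like the sketch below. Cline nests server entries under a top-level &lt;code&gt;mcpServers&lt;/code&gt; key; the server name and image are the ones used above:&lt;/p&gt;

```json
{
  "mcpServers": {
    "awslabs.aws-diagram-mcp-server": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "--interactive",
        "--env",
        "FASTMCP_LOG_LEVEL=ERROR",
        "awslabs/aws-diagram-mcp-server"
      ],
      "env": {},
      "disabled": false,
      "autoApprove": []
    }
  }
}
```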

&lt;p&gt;&lt;strong&gt;5. Run and Test your MCP server to see it in action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now let's generate an AWS diagram to see our MCP server in action.&lt;/p&gt;

&lt;p&gt;Here is the prompt I will send to the LLM:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;generate an AWS diagram &lt;br&gt;
an ASG and an RDS instance in a private subnet. An ALB in front of the ASG as its target group, and Route53 DNS pointing to the ALB. Route53 with https://example.com&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can track the total spend per prompt, token usage, and cache reads/writes as the LLM issues API requests. That is one of the cool things about Cline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0lbb7v479r4tihwet9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0lbb7v479r4tihwet9q.png" alt="Image description" width="676" height="1564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might prompt you multiple times for approval depending on the operations it performs. Click &lt;code&gt;Approve&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once the task is completed, you will see a message like the one below with the details. As you can see, I spent a total of $0.0056 to generate the diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcvkooxnkd9wxdwop9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcvkooxnkd9wxdwop9.png" alt="Image description" width="658" height="1414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;br&gt;
A Docker-based MCP server writes generated files to the container's filesystem, so we need to copy them to the local machine to view them. In this case the diagram is generated in &lt;code&gt;/tmp/generated-diagrams/&lt;/code&gt;, which we can copy with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker cp &amp;lt;container_name&amp;gt;:/tmp/generated-diagrams generated-diagrams/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnc3xqn9g3wzbzfwa1qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnc3xqn9g3wzbzfwa1qw.png" alt="Image description" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>aws</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
