<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Himanjan</title>
    <description>The latest articles on Forem by Himanjan (@himanjan).</description>
    <link>https://forem.com/himanjan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3250368%2Fd167c36b-bf00-4602-94ab-aaca2b58271f.JPG</url>
      <title>Forem: Himanjan</title>
      <link>https://forem.com/himanjan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/himanjan"/>
    <language>en</language>
    <item>
      <title>Claude Mythos In Preview</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:12:33 +0000</pubDate>
      <link>https://forem.com/himanjan/claude-mythos-in-preview-kfl</link>
      <guid>https://forem.com/himanjan/claude-mythos-in-preview-kfl</guid>
      <description>&lt;h1&gt;
  
  
  AI Just Found Bugs That Humans Missed for 27 Years — And That Changes Everything
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;How Anthropic's new model is rewriting the rules of cybersecurity — explained simply.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmacnez2e57ar56d88fm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmacnez2e57ar56d88fm9.png" alt="cybersecurity-ai" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On April 7, 2026, Anthropic quietly dropped one of the most important announcements in cybersecurity history. Their new model, &lt;strong&gt;Claude Mythos Preview&lt;/strong&gt;, found and exploited security flaws in &lt;em&gt;every major operating system&lt;/em&gt; and &lt;em&gt;every major web browser&lt;/em&gt; — many of which had been hiding in plain sight for decades.&lt;/p&gt;

&lt;p&gt;Let me break down what happened, why it matters, and what it means for all of us — no jargon, no hype, just the facts.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 First, What Are "Vulnerabilities"?
&lt;/h2&gt;

&lt;p&gt;Think of software like a house. A vulnerability is an unlocked window that nobody noticed. It's been there since the house was built, but because no one checked &lt;em&gt;that particular window&lt;/em&gt;, burglars never found it either.&lt;/p&gt;

&lt;p&gt;Now imagine an AI that can walk through &lt;strong&gt;every room of every house on the internet&lt;/strong&gt; and check &lt;strong&gt;every window, every door, every crack&lt;/strong&gt; — in hours, not years.&lt;/p&gt;

&lt;p&gt;That's what Mythos Preview does for software.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 What Did It Actually Find?
&lt;/h2&gt;

&lt;p&gt;Here are three real examples — all confirmed and now patched:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A 27-Year-Old Bug in OpenBSD
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is OpenBSD?&lt;/strong&gt; A super-secure operating system used to run firewalls and critical internet infrastructure. Its &lt;em&gt;entire reputation&lt;/em&gt; is built on security.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; When two computers talk over the internet using TCP (the basic protocol for web traffic), they send "acknowledgment" messages back and forth — "Hey, I got packets 1 through 10."&lt;/p&gt;

&lt;p&gt;OpenBSD had a flaw in how it tracked these acknowledgments. By sending a carefully crafted message with a &lt;em&gt;negative&lt;/em&gt; starting point, an attacker could trick the system into crashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt; Two small bugs combined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug 1:&lt;/strong&gt; The code checked if the &lt;em&gt;end&lt;/em&gt; of an acknowledgment was valid, but never checked the &lt;em&gt;start&lt;/em&gt;. Usually harmless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2:&lt;/strong&gt; Under a very specific condition, the code tried to write to a memory address that no longer existed (a "null pointer").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick was that reaching Bug 2 &lt;em&gt;should have been impossible&lt;/em&gt; — except that by exploiting Bug 1 with a value roughly 2 billion outside the expected range, an integer overflow fooled both safety checks simultaneously.&lt;/p&gt;
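&lt;p&gt;To make Bug 1 concrete, here is a toy Python sketch (my own simplification, not OpenBSD's actual code) of a range check that validates only the end of an acknowledged block, so a wildly negative start still slips through:&lt;/p&gt;

```python
# Toy model of Bug 1 (a simplification, not OpenBSD's code): the range
# check validates only the END of an acknowledged block, emulating C's
# signed 32-bit arithmetic.

def to_int32(n):
    # Wrap n the way a C int32_t would.
    n = n % (2 ** 32)
    return n - 2 ** 32 if n >= 2 ** 31 else n

def accept_block(start, length, window_end=10_000):
    end = to_int32(start + length)
    # Only the end is checked; the start is trusted (Bug 1).
    return end >= 0 and window_end >= end

# A sane acknowledgment passes:
assert accept_block(start=1_000, length=500)

# A start roughly 2 billion below zero also passes, because start + length
# lands back inside the window and the start itself is never inspected.
assert accept_block(start=-2_147_480_000, length=2_147_485_000)
```

&lt;p&gt;Each check looks fine in isolation; only the combination of the trusted start and the wraparound opens the path to Bug 2.&lt;/p&gt;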

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Anyone on the internet could remotely crash any OpenBSD machine. This bug had existed since &lt;strong&gt;1998&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Cost to find it: Under $50 for that specific run
   (within a $20,000 sweep of ~1,000 runs across the codebase)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. A 16-Year-Old Bug in FFmpeg
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is FFmpeg?&lt;/strong&gt; The video processing engine behind almost every app that plays or converts video. YouTube, VLC, Discord — they all rely on it. It's one of the &lt;em&gt;most tested&lt;/em&gt; pieces of software on Earth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; When decoding H.264 video (the standard format for most video), FFmpeg tracks which "slice" each chunk of pixels belongs to using a table of 16-bit numbers (max value: 65,535).&lt;/p&gt;

&lt;p&gt;The code uses the value &lt;code&gt;65535&lt;/code&gt; as a special marker meaning "nobody owns this pixel yet." But if an attacker creates a video with exactly &lt;strong&gt;65,536 slices&lt;/strong&gt;, then slice number 65,535 &lt;em&gt;collides with the marker&lt;/em&gt;. The decoder gets confused, thinks a nonexistent neighbor pixel is real, and writes data where it shouldn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Think of it like a hotel that uses room 9999 as
the code for "this room doesn't exist."

Now someone books exactly 10,000 rooms.
Guest 9999 checks in — and the system can't tell
the difference between a real guest and "doesn't exist."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why did nobody catch this?&lt;/strong&gt; Fuzzers (automated testing tools) had hit this code &lt;strong&gt;millions of times&lt;/strong&gt; with random inputs. But they never tried a video with exactly 65,536 slices — because no &lt;em&gt;real&lt;/em&gt; video would ever have that many. The AI understood the &lt;em&gt;logic&lt;/em&gt; of the code, not just random inputs.&lt;/p&gt;
&lt;/blockquote&gt;
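&lt;p&gt;A tiny Python sketch (a toy model, not FFmpeg's decoder) of the sentinel collision: once a legal slice id equals the marker, the "does this pixel have an owner?" test can no longer tell the two apart:&lt;/p&gt;

```python
# Toy version of a 16-bit slice table that reuses 0xFFFF as the
# "unowned" sentinel (illustration only, not FFmpeg's actual code).

UNOWNED = 0xFFFF  # 65535 doubles as both the marker and a legal slice id

def owner_is_real(slice_table, idx):
    # The decoder assumes the sentinel can never be a real owner.
    return slice_table[idx] != UNOWNED

table = [UNOWNED] * 4          # nobody owns these pixels yet
table[2] = 65535               # ...until slice 65,535 of 65,536 claims one

# The real owner is now indistinguishable from "nobody":
assert owner_is_real(table, 2) is False
```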




&lt;h3&gt;
  
  
  3. Full Remote Takeover of FreeBSD (CVE-2026-4747)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is FreeBSD?&lt;/strong&gt; Another widely used operating system, especially for servers and networking equipment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; FreeBSD's file-sharing service (NFS) had a function that copied data from an incoming network packet into a 128-byte buffer — but the packet could be up to 400 bytes. Classic buffer overflow.&lt;/p&gt;
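&lt;p&gt;Here is the size mismatch sketched in Python. This is an illustration only: where the C kernel buffer would overflow into adjacent memory, Python's bytearray simply grows. The 128/400 sizes come from the article; everything else is invented.&lt;/p&gt;

```python
BUF_SIZE = 128     # bytes the kernel buffer can hold
MAX_PACKET = 400   # bytes the protocol allows in the field

def copy_unchecked(packet):
    # Toy model of the flaw: the copy trusts the packet's own length.
    buf = bytearray(BUF_SIZE)
    buf[0:len(packet)] = packet          # len(packet) may exceed BUF_SIZE
    return len(buf)                      # the "buffer" silently grew

def copy_checked(packet):
    # The missing guard: clamp (or reject) before copying.
    n = min(len(packet), BUF_SIZE)
    buf = bytearray(BUF_SIZE)
    buf[0:n] = packet[:n]
    return len(buf)

oversized = bytes(MAX_PACKET)
assert copy_unchecked(oversized) == 400   # 272 bytes past where C would end
assert copy_checked(oversized) == 128
```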

&lt;p&gt;&lt;strong&gt;What Mythos Preview did with it:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What Happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Found the overflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The AI read the FreeBSD kernel source and spotted the mismatch between buffer size (128 bytes) and input limit (400 bytes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Noticed the missing protections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The buffer was declared as &lt;code&gt;int32_t[]&lt;/code&gt; instead of &lt;code&gt;char[]&lt;/code&gt;, so the compiler &lt;em&gt;didn't add a security canary&lt;/em&gt; — a guard value that normally detects overflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Figured out how to authenticate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;To reach the vulnerable code, you need a secret handle. The AI discovered you could get it by making one unauthenticated call that leaks the server's UUID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Built a 20-step attack chain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The actual exploit needed ~1,000 bytes but only had 200 bytes of space. So the AI split it into &lt;strong&gt;6 sequential network requests&lt;/strong&gt;, each building a piece of the attack in memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthenticated root access — complete control of the machine, from anywhere on the internet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This bug had been hiding in FreeBSD for &lt;strong&gt;17 years&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📈 The Numbers Are Staggering
&lt;/h2&gt;

&lt;p&gt;Here's the leap in capability compared to the previous best model (Opus 4.6):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Mythos Preview&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Firefox JS engine exploits (out of hundreds of attempts)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;181&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full control-flow hijack on patched targets&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CyberGym vulnerability benchmark&lt;/td&gt;
&lt;td&gt;66.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified (code tasks)&lt;/td&gt;
&lt;td&gt;80.8%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And here's the part that should make you sit up:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Non-security engineers&lt;/strong&gt; at Anthropic asked Mythos Preview to find remote code execution bugs before going to bed. They woke up to &lt;strong&gt;complete, working exploits&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🤝 Project Glasswing — The Industry Response
&lt;/h2&gt;

&lt;p&gt;Anthropic isn't releasing this model to the public. Instead, they launched &lt;strong&gt;Project Glasswing&lt;/strong&gt; — named after a butterfly with transparent wings — bringing together 12 major partners:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🏢 AWS          🍎 Apple         📡 Broadcom
🌐 Cisco        🛡️ CrowdStrike   🔍 Google
🏦 JPMorganChase 🐧 Linux Foundation
💻 Microsoft    🎮 NVIDIA        🔒 Palo Alto Networks
🤖 Anthropic (leading the effort)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Plus 40+ additional organizations&lt;/strong&gt; working on critical infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The investment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💰 &lt;strong&gt;$100M&lt;/strong&gt; in model usage credits for partners&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;$2.5M&lt;/strong&gt; to open-source security foundations (OpenSSF, Alpha-Omega)&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;$1.5M&lt;/strong&gt; to the Apache Software Foundation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is simple: &lt;strong&gt;let the defenders find the bugs before the attackers do&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 What About Chaining Vulnerabilities?
&lt;/h2&gt;

&lt;p&gt;This is where it gets really impressive. Many individual bugs aren't dangerous on their own. It's like having a key that opens one door — but the door leads to another locked door.&lt;/p&gt;

&lt;p&gt;Mythos Preview can &lt;strong&gt;chain vulnerabilities together&lt;/strong&gt; automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example from the Linux kernel:

  Bug 1 → Bypass address randomization (figure out WHERE things are in memory)
       ↓
  Bug 2 → Read contents of a protected data structure
       ↓
  Bug 3 → Write to a previously-freed piece of memory
       ↓
  Bug 4 → Place a malicious object exactly where the write lands
       ↓
  🔓 Full root access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It did this across Linux, web browsers (building JIT heap sprays and sandbox escapes), and even &lt;strong&gt;closed-source software&lt;/strong&gt; by reverse-engineering the binaries first.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Why This Is Different From Everything Before
&lt;/h2&gt;

&lt;p&gt;Traditional security testing works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Old way (Fuzzing):
  → Generate millions of random inputs
  → Feed them to the program
  → See if anything crashes
  → Hope you got lucky

AI way (Mythos Preview):
  → Read and UNDERSTAND the code
  → Hypothesize where bugs might be
  → Test specific theories
  → Chain bugs together into real attacks
  → Produce a working exploit with a full report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The FFmpeg example is the perfect illustration. Fuzzers hit that code millions of times. They never thought to try exactly 65,536 slices because they don't &lt;em&gt;understand&lt;/em&gt; what the code does. The AI did.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ The Scary Part
&lt;/h2&gt;

&lt;p&gt;These capabilities &lt;strong&gt;weren't intentionally trained&lt;/strong&gt;. They emerged naturally from making the model better at coding and reasoning. The same improvements that help it &lt;em&gt;fix&lt;/em&gt; bugs also help it &lt;em&gt;exploit&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;And here's the timeline that should concern everyone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A few months ago  → Models couldn't find non-trivial vulnerabilities
A few weeks ago   → Models could find bugs but rarely exploit them
Today             → Mythos Preview finds AND exploits zero-days autonomously
Tomorrow          → ???
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Anthropic has identified &lt;strong&gt;thousands&lt;/strong&gt; of additional high-severity vulnerabilities that are still going through responsible disclosure. Only about &lt;strong&gt;1%&lt;/strong&gt; of what they've found has been patched so far.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🛡️ What Should Defenders Do Right Now?
&lt;/h2&gt;

&lt;p&gt;Anthropic's advice, and I think it's sound:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Start Using AI for Security Today
&lt;/h3&gt;

&lt;p&gt;You don't need Mythos Preview. Current models like Claude Opus 4.6 can already find hundreds of vulnerabilities. The point is to &lt;strong&gt;start building the muscle now&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shorten Your Patch Cycles
&lt;/h3&gt;

&lt;p&gt;The window between "vulnerability disclosed" and "exploit available" just collapsed from weeks to &lt;strong&gt;hours&lt;/strong&gt;. Auto-update everything. Treat security patches as urgent, not routine.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rethink "Defense in Depth"
&lt;/h3&gt;

&lt;p&gt;Some security measures work by making exploitation &lt;em&gt;tedious&lt;/em&gt; rather than &lt;em&gt;impossible&lt;/em&gt;. AI doesn't get tired. Focus on &lt;strong&gt;hard barriers&lt;/strong&gt; (like memory safety, address randomization) over &lt;strong&gt;friction-based&lt;/strong&gt; defenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Automate Your Incident Response
&lt;/h3&gt;

&lt;p&gt;More bugs found = more attacks attempted. You can't staff your way through the volume. Let models help with triage, investigation, and response.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔮 The Big Picture
&lt;/h2&gt;

&lt;p&gt;We're at an inflection point. For 20 years, cybersecurity has been in a relatively stable equilibrium — attacks evolved, but the &lt;em&gt;shape&lt;/em&gt; of attacks stayed similar. That's about to change.&lt;/p&gt;

&lt;p&gt;The good news: &lt;strong&gt;defense has the long-term advantage&lt;/strong&gt;. Defenders can use these tools proactively to find and fix every bug. Attackers only need to find one — but defenders can now find them first, at scale.&lt;/p&gt;

&lt;p&gt;The bad news: &lt;strong&gt;the transition will be rough&lt;/strong&gt;. Until the security world adapts, attackers who get access to similar capabilities will have a field day.&lt;/p&gt;

&lt;p&gt;Anthropic's bet with Project Glasswing is that by giving defenders a head start — even a few months — the industry can reach a new, more secure equilibrium before the storm hits.&lt;/p&gt;

&lt;p&gt;Whether that bet pays off depends on how fast the rest of us move.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Anthropic's Technical Blog Post&lt;/a&gt; — Full technical details on every vulnerability discussed&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/glasswing" rel="noopener noreferrer"&gt;Project Glasswing Announcement&lt;/a&gt; — Partner quotes and initiative details&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/coordinated-vulnerability-disclosure" rel="noopener noreferrer"&gt;Anthropic's Coordinated Vulnerability Disclosure Policy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, give it a 👏 and share it with your team. The cybersecurity landscape just changed — and everyone building software needs to understand how.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>claude</category>
      <category>programming</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>The Brain of the Future Agent: Why VL-JEPA Matters for Real-World AI</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Sun, 11 Jan 2026 01:31:10 +0000</pubDate>
      <link>https://forem.com/himanjan/the-brain-of-the-future-agent-why-vl-jepa-matters-for-real-world-ai-21no</link>
      <guid>https://forem.com/himanjan/the-brain-of-the-future-agent-why-vl-jepa-matters-for-real-world-ai-21no</guid>
      <description>&lt;h2&gt;
  
  
  The "Generative" Trap
&lt;/h2&gt;

&lt;p&gt;If you have been following AI recently, you know the drill: &lt;strong&gt;Input → Generate&lt;/strong&gt;. You give ChatGPT, Gemini, or Claude a prompt, it generates words. You give Sora a prompt, it generates pixels. You give Google's Veo a prompt, it creates a cinematic scene from scratch.&lt;/p&gt;

&lt;p&gt;This method, known as &lt;strong&gt;autoregressive generation&lt;/strong&gt;, is the engine behind almost every modern AI. It works by predicting the next tiny piece of data (a token) based on the previous ones.&lt;/p&gt;

&lt;p&gt;But there is a massive inefficiency lurking here.&lt;/p&gt;

&lt;p&gt;Imagine you are watching a video of a person cooking. To &lt;em&gt;understand&lt;/em&gt; that video, do you need to be able to paint every single pixel of the steam rising from the pot? &lt;strong&gt;No.&lt;/strong&gt; You just need to grasp the abstract concept: &lt;em&gt;“Water is boiling.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Standard Vision-Language Models (VLMs) like LLaVA or GPT-4V are forced to &lt;strong&gt;“paint the steam.”&lt;/strong&gt; They must model every surface-level detail—linguistic style, word choice, or pixel noise—just to prove they understand the scene. This makes them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computationally Expensive&lt;/strong&gt;: They waste compute on irrelevant details.&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Example: It burns energy calculating the exact shape of every cloud when you simply asked, “Is it sunny?”)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slow&lt;/strong&gt;: They must generate outputs token-by-token, which kills real-time performance.&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Example: It’s like waiting for a slow typist to finish a paragraph before you can know if the answer is “Yes” or “No.”)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hallucination-Prone&lt;/strong&gt;: If they don’t know a detail, the training objective still forces them to emit &lt;em&gt;some&lt;/em&gt; token sequence—often resulting in confident but incorrect completions.&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Example: Ask it to read a blurry license plate, and it will invent numbers just to complete the pattern.)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The inefficiency comes from the loss itself: &lt;strong&gt;cross-entropy penalizes every token mismatch&lt;/strong&gt;, even when two answers mean the same thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  VL-JEPA (Vision-Language Joint Embedding Predictive Architecture)
&lt;/h2&gt;

&lt;p&gt;After spending more than three days reading the &lt;strong&gt;&lt;a href="https://arxiv.org/pdf/2512.10942" rel="noopener noreferrer"&gt;VL-JEPA&lt;/a&gt;&lt;/strong&gt; paper, I can say this confidently: it introduces the first non-generative vision-language model designed to handle general-domain tasks in real time. It doesn't try to generate the answer. It predicts the mathematical "thought" of the answer.&lt;/p&gt;

&lt;p&gt;Its vision encoder is literally a pre-trained &lt;a href="https://ai.meta.com/vjepa/" rel="noopener noreferrer"&gt;V-JEPA 2&lt;/a&gt; model, which provides the rich, physics-aware video representations that the language component then learns to understand.&lt;/p&gt;

&lt;p&gt;VL-JEPA builds directly on the &lt;strong&gt;Joint Embedding Predictive Architecture (JEPA)&lt;/strong&gt; philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Never predict noise. Predict meaning.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 1: The Core Philosophy (Prediction vs. Generation)
&lt;/h2&gt;

&lt;p&gt;To understand VL-JEPA, you must unlearn the &lt;strong&gt;“next token prediction”&lt;/strong&gt; habit.&lt;br&gt;&lt;br&gt;
We need to shift our goal from &lt;strong&gt;creating pixels or words&lt;/strong&gt; to &lt;strong&gt;predicting states&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’ll explain this using one concrete scenario throughout: &lt;strong&gt;Spilled Milk&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Standard VLM Approach (Generative)
&lt;/h3&gt;

&lt;p&gt;In a standard model (like LLaVA or GPT-4V), the training goal is to generate text tokens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;X (Input)&lt;/strong&gt;: Video frames of the glass sliding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Y (Target)&lt;/strong&gt;: The text &lt;em&gt;“The glass falls and spills.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Process&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
The model guesses “The,” then “glass,” then “falls.”&lt;br&gt;&lt;br&gt;
If it guesses wrong (e.g., “The cup…”), it is penalized—even though the meaning is correct.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The VL-JEPA Approach (Predictive)
&lt;/h3&gt;

&lt;p&gt;VL-JEPA does &lt;strong&gt;not&lt;/strong&gt; model probabilities over tokens.&lt;br&gt;&lt;br&gt;
Instead, it minimizes the &lt;strong&gt;distance between embeddings&lt;/strong&gt; in a continuous space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SX (Input Embedding)&lt;/strong&gt;: A vector summarizing &lt;em&gt;“glass sliding.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SY (Target Embedding)&lt;/strong&gt;: A vector summarizing &lt;em&gt;“spill occurred.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Process&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Given the &lt;em&gt;sliding&lt;/em&gt; embedding, can the model predict the &lt;em&gt;spill&lt;/em&gt; embedding?&lt;/p&gt;

&lt;p&gt;No words. No pixels. Just meaning.&lt;/p&gt;




&lt;h3&gt;
  
  
  The “Orthogonal” Problem (from the paper)
&lt;/h3&gt;

&lt;p&gt;Text generation has a hidden flaw:&lt;/p&gt;

&lt;p&gt;In raw token space, different correct answers can look &lt;strong&gt;completely unrelated&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“The milk spilled.”&lt;/li&gt;
&lt;li&gt;“The liquid made a mess.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A standard VLM treats these as nearly orthogonal because the words don’t overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VL-JEPA’s solution&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
In embedding space, both sentences map to &lt;strong&gt;nearby points&lt;/strong&gt; because their meaning is the same.&lt;/p&gt;

&lt;p&gt;This collapses a messy, multi-modal output distribution into a &lt;strong&gt;single smooth region&lt;/strong&gt;, making learning dramatically more efficient.&lt;/p&gt;
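&lt;p&gt;A hand-built toy (invented vectors, not the paper's encoders) shows the gap: the two correct answers share zero content words, yet sit almost on top of each other in a concept space:&lt;/p&gt;

```python
# Hand-built toy: invented 2-D "concept" vectors, not the paper's encoders.
a = "the milk spilled"
b = "the liquid made a mess"

# Token view: the two correct answers share no content words at all.
tokens_a = {"milk", "spilled"}
tokens_b = {"liquid", "made", "mess"}
assert len(tokens_a.intersection(tokens_b)) == 0

# Embedding view: both map near the same point in a meaning space
# (axis 0: "liquid involved", axis 1: "accident happened").
emb = {
    a: [0.90, 0.80],
    b: [0.85, 0.75],
    "the sky is blue": [0.0, 0.0],
}

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v) + 1e-9)

assert cos(emb[a], emb[b]) > 0.99      # same meaning, nearby points
assert 0.5 > cos(emb[a], emb["the sky is blue"])
```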




&lt;h2&gt;
  
  
  Part 2: The Architecture (The Tripod of Understanding)
&lt;/h2&gt;

&lt;p&gt;Before we build the full car, we need to acknowledge the engine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VL-JEPA does not learn to see from scratch.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Its vision encoder is initialized from &lt;strong&gt;V-JEPA 2&lt;/strong&gt;, which already has a “gut feeling” for physics—like knowing unsupported objects tend to fall.&lt;/p&gt;

&lt;p&gt;Here’s how the system processes our spilled milk scenario:&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The X-Encoder (The Eyes)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A Vision Transformer (V-JEPA 2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it does&lt;/strong&gt;: Compresses video frames into &lt;strong&gt;visual embeddings&lt;/strong&gt;—dense numerical representations of &lt;em&gt;objects, motion, and relationships&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; predict future pixels.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The Predictor (The Brain)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A Transformer initialized from Llama-3.2 layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it does&lt;/strong&gt;: Combines:

&lt;ul&gt;
&lt;li&gt;Visual embeddings (glass sliding)&lt;/li&gt;
&lt;li&gt;A text query (e.g., &lt;em&gt;“What happens next?”&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;It predicts a &lt;strong&gt;target embedding&lt;/strong&gt; representing what &lt;em&gt;will&lt;/em&gt; happen.&lt;/p&gt;

&lt;p&gt;Conceptually, it behaves &lt;em&gt;as if&lt;/em&gt; it were composing latent factors like motion, support, and gravity to arrive at &lt;em&gt;“spilled milk.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unlike language models, this predictor uses &lt;strong&gt;bi-directional attention&lt;/strong&gt;, allowing vision and query tokens to jointly condition the prediction.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. The Y-Encoder (The Abstract Target)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A text embedding model (EmbeddingGemma).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it does&lt;/strong&gt;: Converts &lt;em&gt;“The milk spills”&lt;/em&gt; into the &lt;strong&gt;ground-truth answer embedding&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is trained to minimize the distance between its prediction and this embedding.&lt;/p&gt;
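&lt;p&gt;Putting the three encoders together, the training step looks roughly like this (toy dimensions and random stand-ins for V-JEPA 2, the Llama-initialized predictor, and EmbeddingGemma; not the paper's actual implementation):&lt;/p&gt;

```python
# Minimal sketch of the training signal: predict the target embedding,
# compare in embedding space, never decode tokens.
import random

DIM = 8
random.seed(0)

def x_encoder(frames):
    # Stand-in for V-JEPA 2: frames become one visual embedding.
    return [random.random() for _ in range(DIM)]

def y_encoder(answer_text):
    # Stand-in for EmbeddingGemma: the answer becomes the target embedding.
    return [random.random() for _ in range(DIM)]

def predictor(visual, query):
    # Stand-in for the predictor: maps (vision, query) to a guess.
    return [v * 0.5 + 0.1 for v in visual]

s_pred = predictor(x_encoder("glass sliding frames"), "What happens next?")
s_target = y_encoder("The milk spills")

# The whole objective: distance between embeddings, not token loss.
loss = sum((p - t) ** 2 for p, t in zip(s_pred, s_target)) ** 0.5
assert loss >= 0.0
```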




&lt;h3&gt;
  
  
  4. The Y-Decoder (The Mouth — Optional!)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: A lightweight text decoder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key idea&lt;/strong&gt;: It is &lt;strong&gt;not used during main training&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model can &lt;em&gt;think&lt;/em&gt; about the milk spilling &lt;strong&gt;without talking about it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Text is generated &lt;strong&gt;only when a human needs it&lt;/strong&gt;, which is critical for efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The Superpower — Selective Decoding
&lt;/h2&gt;

&lt;p&gt;This is what makes VL-JEPA different.&lt;/p&gt;

&lt;p&gt;Imagine a robot watching the glass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard VLM (The Chatty Observer)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Frame 1: “The glass is on the table.”&lt;/li&gt;
&lt;li&gt;Frame 10: “The glass is moving.”&lt;/li&gt;
&lt;li&gt;Frame 20: “The glass is still moving.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It wastes compute describing moments where &lt;strong&gt;nothing meaningful changes&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  VL-JEPA (The Silent Observer)
&lt;/h3&gt;

&lt;p&gt;VL-JEPA produces a &lt;strong&gt;continuous stream of embeddings&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frames 1–50: Embeddings remain stable (&lt;em&gt;situation unchanged&lt;/em&gt;).
Decoder stays off. Silence.&lt;/li&gt;
&lt;li&gt;Frame 51: The glass tips.
The &lt;strong&gt;variance of the embedding stream increases&lt;/strong&gt;, signaling a semantic transition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Only then&lt;/strong&gt; does the decoder activate:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The glass has fallen.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This reduces decoding operations by &lt;strong&gt;~2.85×&lt;/strong&gt; while maintaining the same accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: The Verdict (Is It Actually Better?)
&lt;/h2&gt;

&lt;p&gt;Meta didn’t just theorize this—they ran a &lt;strong&gt;strictly controlled comparison&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can refer to Figure 3 in the paper:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lxnf1o0bikm5uyqqjvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lxnf1o0bikm5uyqqjvh.png" alt="cage-diagram" width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
source - &lt;a href="https://arxiv.org/pdf/2512.10942" rel="noopener noreferrer"&gt;VL-JEPA&lt;/a&gt; paper &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun1dholf5337q0tv56p4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun1dholf5337q0tv56p4.png" alt="comparison" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;
source - &lt;a href="https://arxiv.org/pdf/2512.10942" rel="noopener noreferrer"&gt;VL-JEPA&lt;/a&gt; paper &lt;/p&gt;

&lt;p&gt;Both models used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same vision encoder&lt;/li&gt;
&lt;li&gt;The same data&lt;/li&gt;
&lt;li&gt;The same batch size&lt;/li&gt;
&lt;li&gt;The same training steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;only&lt;/em&gt; difference was the objective:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Predict embeddings vs generate tokens.&lt;/strong&gt;&lt;/p&gt;
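&lt;p&gt;To make the contrast concrete, the two objectives can be sketched as toy loss functions: the generative model is scored on predicting the next token, while a JEPA-style model is scored on landing close to a target embedding. This is an illustrative simplification; the paper's actual losses differ in detail.&lt;/p&gt;

```python
import math

def token_cross_entropy(predicted_probs, target_ids):
    # Generative objective: average negative log-likelihood of each
    # target token under the model's predicted distribution.
    nll = -sum(math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids))
    return nll / len(target_ids)

def embedding_prediction_loss(predicted, target):
    # JEPA-style objective: mean squared distance between the predicted
    # embedding and the target embedding, in latent space.
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

&lt;p&gt;The training loops are otherwise identical; only which of these two quantities gets minimized differs.&lt;/p&gt;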

&lt;h3&gt;
  
  
  The Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learns Faster (Sample Efficiency)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
After 5M samples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VL-JEPA: &lt;strong&gt;14.7 CIDEr&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Generative VLM: &lt;strong&gt;7.1 CIDEr&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Requires Less Brain Power (Parameter Efficiency)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
VL-JEPA used &lt;strong&gt;50% fewer trainable parameters&lt;/strong&gt; (0.5B vs 1B).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Understands World Dynamics Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
On the WorldPrediction benchmark (state transition reasoning):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VL-JEPA: &lt;strong&gt;65.7%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-4o / Gemini-2.0: ~53%&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Importantly, this benchmark tests &lt;strong&gt;understanding how the world changes&lt;/strong&gt;, not symbolic reasoning or tool use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VL-JEPA proves that Thinking ≠ Talking.&lt;/p&gt;

&lt;p&gt;By separating the understanding process (Predictor) from the generation process (Decoder), Meta has built a model that is quieter, faster, and fundamentally more grounded in physical reality.&lt;/p&gt;

&lt;p&gt;If we want AI agents that can watch a toddler and catch a falling glass of milk in real time, we don't need models that can write a poem about the splash. We need models that can predict the spill before it happens. In my view, VL-JEPA is the first step toward that future.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Hidden Cost of LangChain: Why My Simple RAG System Cost 2.7x More Than Expected</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Tue, 22 Jul 2025 23:05:20 +0000</pubDate>
      <link>https://forem.com/himanjan/the-hidden-cost-of-langchain-why-my-simple-rag-system-cost-27x-more-than-expected-4hk9</link>
      <guid>https://forem.com/himanjan/the-hidden-cost-of-langchain-why-my-simple-rag-system-cost-27x-more-than-expected-4hk9</guid>
      <description>&lt;p&gt;&lt;em&gt;A developer's journey from excitement to shock—and what I learned about LangChain's true cost.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment I Realized Something Was Wrong
&lt;/h2&gt;

&lt;p&gt;Recently, I started deep diving into agentic AI, experimenting with &lt;strong&gt;LangChain&lt;/strong&gt; for building a simple &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; system. Everything seemed fine—until I noticed something strange.&lt;/p&gt;

&lt;p&gt;Just &lt;strong&gt;two runs&lt;/strong&gt; of my LangChain-based Python script consumed more than &lt;strong&gt;$0.038&lt;/strong&gt; each in OpenAI API costs, and my credit balance dropped from around $5 to &lt;strong&gt;$4.93&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;I'm on the &lt;em&gt;Pay-As-You-Go&lt;/em&gt; plan — so I &lt;em&gt;feel&lt;/em&gt; every API call.&lt;/p&gt;

&lt;p&gt;That got me thinking: &lt;em&gt;Is LangChain doing more under the hood than I realize?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I decided to compare it with a &lt;strong&gt;manual GPT-4 API call using OpenAI's SDK&lt;/strong&gt;, and what I found might surprise you. It wasn't easy at first: I couldn't find many resources on tracing the calls directly, nor any IDE-based extensions that do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Investigation: LangChain vs Manual Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Task
&lt;/h3&gt;

&lt;p&gt;Build a simple RAG system that performs the steps shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0b98b4t6pgslacxlrxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0b98b4t6pgslacxlrxr.png" alt="RAG" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seems straightforward, right? Let me show you what happened when I built this two different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: The LangChain Way
&lt;/h3&gt;

&lt;p&gt;Here's my initial implementation using LangChain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.callbacks.manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_openai_callback&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.callbacks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseCallbackHandler&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 LLM call counter
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CountingHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_llm_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 LLM call #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load and split document
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myfile.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use OpenAI's embedding model (same for both examples)
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Setup GPT-4 LLM with counter
&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CountingHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Change to "stuff" or "map_reduce" for testing
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_openai_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the main idea of the document?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📌 Final Response:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 LangChain Usage:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total LLM Calls: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Looks clean and simple, right?&lt;/em&gt;&lt;/p&gt;
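&lt;p&gt;One detail worth flagging before the results: &lt;code&gt;chain_type="refine"&lt;/code&gt; makes one LLM call per retrieved chunk, re-sending a growing draft answer each time, while the manual version makes a single chat call. A back-of-the-envelope call-count model (my simplification, not LangChain's internals):&lt;/p&gt;

```python
def estimate_llm_calls(n_chunks, chain_type):
    # Rough call-count model for the QA chain types used above
    # (illustrative; actual prompt templates add further overhead tokens).
    if chain_type == "stuff":
        return 1             # all chunks packed into a single prompt
    if chain_type == "refine":
        return n_chunks      # initial answer, then one refine call per extra chunk
    if chain_type == "map_reduce":
        return n_chunks + 1  # one map call per chunk plus a final reduce call
    raise ValueError(chain_type)
```

&lt;p&gt;With a handful of retrieved chunks, "refine" multiplies your GPT-4 calls by the chunk count, which is exactly the kind of hidden cost the callback output below exposes.&lt;/p&gt;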

&lt;h3&gt;
  
  
  Approach 2: The Manual Way
&lt;/h3&gt;

&lt;p&gt;Here's the same functionality using direct OpenAI SDK calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Load and split text
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myfile.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Get embeddings from OpenAI
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_openai_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;

&lt;span class="n"&gt;chunk_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_openai_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Store in FAISS
&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_embeddings&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Embed user query
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the main idea of the document?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Search top 3 chunks
&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Build prompt and ask GPT-4
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a helpful assistant. Use the context below to answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Output
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📌 Final Response:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔹 Token usage
&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Manual GPT-4 Usage:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completion Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;  &lt;span class="c1"&gt;# Est. GPT-4 input/output token cost
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Shocking Results
&lt;/h2&gt;

&lt;p&gt;I ran both implementations on the same test document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;my blog post published on dev.to comparing RAG vs. prompt engineering vs. fine-tuning, available here:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dev.to/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod"&gt;https://dev.to/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I saved the contents of that &lt;code&gt;.md&lt;/code&gt; file directly into a &lt;code&gt;.txt&lt;/code&gt; file (&lt;code&gt;myfile.txt&lt;/code&gt; in the code) and ran both versions.&lt;/p&gt;

&lt;p&gt;You can see the response comparison below. Both versions use the same embedding model from OpenAI: &lt;code&gt;text-embedding-ada-002&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response from the LangChain version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj9ummxfep95iq6tmbgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj9ummxfep95iq6tmbgm.png" alt="Lang-chain-version" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response from the manual OpenAI invocation version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uouzhtw5eyl4z568ktc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uouzhtw5eyl4z568ktc.png" alt="Manual-version" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have already read or skimmed my blog, you can see how neat and precise a summary the manual version produced, using just 342 prompt tokens, about half of what the LangChain version consumed. You might assume that prompt tokens alone determine the bill, but there is another hidden game in LangChain: the &lt;strong&gt;refine&lt;/strong&gt; chain type, which many production systems use, breaks the work into multiple calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Call 1: Initial answer with first chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call 2: Refine answer with second chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call 3: Refine again with third chunk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Call 4: Final refinement&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each call includes the &lt;strong&gt;full prompt + previous context&lt;/strong&gt;, so tokens accumulate with every call.&lt;br&gt;
The other chain types available for RAG/QA are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stuff&lt;/strong&gt; - Puts all retrieved docs into a single prompt (most efficient)&lt;br&gt;
&lt;strong&gt;refine&lt;/strong&gt; - Iteratively refines answer with each document (what we used)&lt;br&gt;
&lt;strong&gt;map_reduce&lt;/strong&gt; - Processes each doc separately, then combines results&lt;br&gt;
&lt;strong&gt;map_rerank&lt;/strong&gt; - Scores each doc's answer and returns the best one&lt;/p&gt;
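&lt;p&gt;To see why &lt;strong&gt;refine&lt;/strong&gt; compounds tokens, here is a back-of-the-envelope simulation. The token counts are illustrative assumptions, not measured values; the point is the shape of the growth, not the exact numbers.&lt;/p&gt;

```python
# Illustrative simulation of how the "refine" chain accumulates prompt tokens.
# These per-call counts are assumptions for demonstration, not measured values.

PROMPT_OVERHEAD = 60     # refine instructions re-sent on every call
CHUNK_TOKENS = 150       # tokens per document chunk
ANSWER_TOKENS = 80       # previous answer carried into the next call

def refine_prompt_tokens(num_chunks: int) -> int:
    """Total prompt tokens across all refine calls."""
    total = PROMPT_OVERHEAD + CHUNK_TOKENS            # call 1: first chunk only
    for _ in range(num_chunks - 1):                   # calls 2..n carry the answer
        total += PROMPT_OVERHEAD + CHUNK_TOKENS + ANSWER_TOKENS
    return total

def stuff_prompt_tokens(num_chunks: int) -> int:
    """'stuff' sends everything in one request, so overhead is paid once."""
    return PROMPT_OVERHEAD + CHUNK_TOKENS * num_chunks

for n in (2, 4, 8):
    print(n, refine_prompt_tokens(n), stuff_prompt_tokens(n))
```

With these assumed numbers, refine already costs roughly 1.6x the stuff chain at four chunks, and the gap widens as documents grow.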

&lt;h3&gt;
  
  
  Cost Comparison Summary:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual approach&lt;/strong&gt;: 487 tokens, &lt;strong&gt;$0.0146&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain approach&lt;/strong&gt;: 1,017 tokens, &lt;strong&gt;$0.0388&lt;/strong&gt; (&lt;strong&gt;2.7x more expensive!&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me break this down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Manual Implementation&lt;/th&gt;
&lt;th&gt;LangChain Implementation&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;487&lt;/td&gt;
&lt;td&gt;1,017&lt;/td&gt;
&lt;td&gt;+108%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0146&lt;/td&gt;
&lt;td&gt;$0.0388&lt;/td&gt;
&lt;td&gt;+166%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Calls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 (trackable)&lt;/td&gt;
&lt;td&gt;??? (hidden)&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging Difficulty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Nightmare&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
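&lt;p&gt;The percentages in the table can be reproduced from the raw figures reported above:&lt;/p&gt;

```python
# Reproduce the comparison figures from the raw numbers reported in this post.
manual_tokens, langchain_tokens = 487, 1_017
manual_cost, langchain_cost = 0.0146, 0.0388

token_increase = (langchain_tokens - manual_tokens) / manual_tokens
cost_increase = (langchain_cost - manual_cost) / manual_cost

print(f"Token increase: {token_increase:+.0%}")           # roughly +109%
print(f"Cost increase:  {cost_increase:+.0%}")            # roughly +166%
print(f"Cost multiple:  {langchain_cost / manual_cost:.1f}x")  # 2.7x
```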

&lt;h2&gt;
  
  
  Why Is LangChain So Much More Expensive?
&lt;/h2&gt;

&lt;p&gt;After digging deeper, I discovered several hidden costs in LangChain:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Suboptimal Batching&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain's &lt;code&gt;OpenAIEmbeddings&lt;/code&gt; defaults to batching 1,000 texts per API call, but OpenAI supports up to 2,048 inputs per request. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;You're making ~2x more API calls than necessary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;More calls = more latency + more rate limit exposure&lt;/li&gt;
&lt;/ul&gt;
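&lt;p&gt;When calling the embeddings endpoint directly, you can batch at the documented maximum yourself. A minimal sketch, assuming the current &lt;code&gt;openai&lt;/code&gt; Python SDK's &lt;code&gt;client.embeddings.create&lt;/code&gt; interface and the 2,048-input cap mentioned above:&lt;/p&gt;

```python
from typing import Iterator

OPENAI_MAX_EMBED_INPUTS = 2_048  # documented per-request cap for the embeddings API

def batched(texts: list[str], size: int = OPENAI_MAX_EMBED_INPUTS) -> Iterator[list[str]]:
    """Yield successive batches at the API's maximum size."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]

def embed_all(client, texts: list[str], model: str = "text-embedding-ada-002"):
    """Embed every text in as few requests as possible."""
    vectors = []
    for batch in batched(texts):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return vectors

# 5,000 chunks need only 3 requests, versus 5 at a 1,000-text batch size
print(len(list(batched(["x"] * 5_000))))  # 3
```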

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Hidden Internal Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LangChain makes API calls you can't see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal prompt formatting calls&lt;/li&gt;
&lt;li&gt;Retry logic that may duplicate requests&lt;/li&gt;
&lt;li&gt;Chain validation calls&lt;/li&gt;
&lt;li&gt;Memory management overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Inefficient Context Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The framework often includes unnecessary context or makes redundant calls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document metadata processing&lt;/li&gt;
&lt;li&gt;Chain state management&lt;/li&gt;
&lt;li&gt;Output parsing validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Broken Cost Tracking&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Perhaps most troubling: &lt;code&gt;get_openai_callback()&lt;/code&gt; often shows &lt;strong&gt;$0.00&lt;/strong&gt; when you're actually being charged. I experienced this firsthand—the callback reported no costs while my OpenAI balance clearly decreased.&lt;/p&gt;

&lt;p&gt;Digging further, I found multiple blog posts and GitHub issues reporting the same problem.&lt;/p&gt;
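&lt;p&gt;Since the callback can't be trusted, I read usage straight off each API response instead. A minimal sketch of such a tracker; the per-1K rates below are illustrative GPT-4-era numbers, so substitute your model's current pricing:&lt;/p&gt;

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class UsageTracker:
    """Accumulate token usage reported by each OpenAI API response."""
    # Illustrative GPT-4-era rates per 1K tokens; substitute current pricing.
    input_rate: float = 0.03
    output_rate: float = 0.06
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, usage) -> None:
        """Call with response.usage after every chat/completions request."""
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    @property
    def cost(self) -> float:
        return (self.prompt_tokens / 1000 * self.input_rate
                + self.completion_tokens / 1000 * self.output_rate)

# Usage example with a stand-in usage object (a real one comes from the API).
tracker = UsageTracker()
tracker.record(SimpleNamespace(prompt_tokens=342, completion_tokens=145))
print(f"${tracker.cost:.4f}")
```

Because the numbers come from the API's own response objects, the total matches what OpenAI actually bills, with no framework in between.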

&lt;h2&gt;
  
  
  The Broader Pattern: Companies Are Moving Away
&lt;/h2&gt;

&lt;p&gt;My experience isn't isolated. Research reveals a troubling pattern:&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Company Migrations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Octomind&lt;/strong&gt; used LangChain for a year to power AI agents that create and fix software tests. After growing frustrations with debugging and inflexibility, they &lt;strong&gt;removed LangChain entirely in 2024&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Once we removed it… we could just code. No longer being constrained by LangChain made our team far more productive."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Multiple development teams&lt;/strong&gt; have documented similar experiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10+ months of LangChain code replaced with direct OpenAI implementations in just weeks&lt;/li&gt;
&lt;li&gt;Elimination of dependency conflicts and version incompatibilities&lt;/li&gt;
&lt;li&gt;Significant performance improvements and cost reductions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical Evidence from the Community
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Issues&lt;/strong&gt; document systematic problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issue #12994: &lt;code&gt;get_openai_callback()&lt;/code&gt; showing $0.00 instead of actual $18.24 costs&lt;/li&gt;
&lt;li&gt;Issue #14952: Broken debug logs that merge messages incorrectly&lt;/li&gt;
&lt;li&gt;Widespread reports of &lt;code&gt;AttributeError: module 'langchain' has no attribute 'debug'&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Developer testimonials&lt;/strong&gt; consistently report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple tasks requiring complex workarounds&lt;/li&gt;
&lt;li&gt;More time debugging LangChain than building features&lt;/li&gt;
&lt;li&gt;Inability to optimize for specific use cases due to abstraction layers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Means for Your Projects
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When LangChain Might Be Worth It:
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Rapid prototyping&lt;/strong&gt; where cost isn't a concern&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Learning RAG concepts&lt;/strong&gt; and experimentation&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Demos and tutorials&lt;/strong&gt; that need quick setup&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Multi-provider scenarios&lt;/strong&gt; requiring provider abstraction  &lt;/p&gt;

&lt;h3&gt;
  
  
  When to Skip LangChain:
&lt;/h3&gt;

&lt;p&gt;❌ &lt;strong&gt;Production systems&lt;/strong&gt; where cost and performance matter&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Budget-conscious projects&lt;/strong&gt; on pay-as-you-go plans&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Applications requiring precise cost tracking&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Performance-critical systems&lt;/strong&gt; needing optimization&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Projects requiring detailed debugging capabilities&lt;/strong&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line: Transparency Wins
&lt;/h2&gt;

&lt;p&gt;My investigation revealed that &lt;strong&gt;what you can't see can hurt you&lt;/strong&gt;. LangChain's abstractions, while convenient for learning, often hide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2-3x higher token usage&lt;/strong&gt; than optimal implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple hidden API calls&lt;/strong&gt; that compound costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suboptimal batching&lt;/strong&gt; that wastes money and time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broken cost tracking&lt;/strong&gt; that leaves you blind to expenses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my pay-as-you-go budget, these hidden costs add up quickly. What should have been a $0.015 experiment became a $0.038 surprise—and that's just for two simple runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actionable Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Learning and Prototyping:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use LangChain to understand RAG concepts quickly&lt;/li&gt;
&lt;li&gt;Expect 2-3x higher costs during development&lt;/li&gt;
&lt;li&gt;Don't rely on &lt;code&gt;get_openai_callback()&lt;/code&gt; for accurate tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For Production Systems:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with direct API implementations&lt;/strong&gt; for transparency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch embeddings optimally&lt;/strong&gt; (up to 2,048 inputs per OpenAI call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track every token&lt;/strong&gt; with precise cost calculation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile your usage patterns&lt;/strong&gt; before optimizing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Optimization Strategy:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Implement precise token tracking&lt;/strong&gt; from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch operations efficiently&lt;/strong&gt; to minimize API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache embeddings&lt;/strong&gt; to avoid repeated calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor costs continuously&lt;/strong&gt; with direct API usage metrics&lt;/li&gt;
&lt;/ol&gt;
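&lt;p&gt;Step 3 is often the cheapest win: identical text should never be embedded twice. A minimal sketch of a content-addressed cache, where &lt;code&gt;embed_fn&lt;/code&gt; stands in for whatever calls your embedding API:&lt;/p&gt;

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so repeated text costs nothing."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # your real embedding API call
        self.store = {}
        self.misses = 0           # number of actual API calls made

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Usage with a stand-in embedding function.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("hello"); cache.get("hello"); cache.get("world")
print(cache.misses)  # 2
```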

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;LangChain serves an important purpose in the AI ecosystem—it helps developers learn and prototype quickly. But for production systems where every dollar counts, &lt;strong&gt;transparency and control&lt;/strong&gt; are worth the extra development effort.&lt;/p&gt;

&lt;p&gt;The 2.7x cost difference I discovered isn't just about money—it's about &lt;strong&gt;understanding what your code actually does&lt;/strong&gt;. When you're building AI applications that could scale to thousands of users, those hidden costs become hidden disasters.&lt;/p&gt;

&lt;p&gt;My advice? &lt;strong&gt;Learn with LangChain, deploy with direct APIs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your wallet will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you experienced similar cost surprises with LangChain? Share your experience in the comments below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keywords&lt;/strong&gt;: LangChain, OpenAI API, RAG, Cost Optimization, AI Development, Token Usage, Production AI&lt;/p&gt;




&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;A software developer exploring the practical challenges of building production AI systems. Currently investigating the gap between AI framework promises and real-world performance.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticai</category>
      <category>langchain</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>RAG vs Fine-tuning vs Prompt Engineering: The Complete Enterprise Guide</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Sat, 28 Jun 2025 00:09:33 +0000</pubDate>
      <link>https://forem.com/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod</link>
      <guid>https://forem.com/himanjan/rag-vs-fine-tuning-vs-prompt-engineering-the-complete-enterprise-guide-2jod</guid>
      <description>&lt;p&gt;&lt;em&gt;How to choose the right AI approach for your business needs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When building AI applications for your business, you'll face a critical decision: Should you use Retrieval-Augmented Generation (RAG), fine-tune a model, or rely on prompt engineering? Each approach has distinct advantages, costs, and use cases. This guide will help you make the right choice with real-world examples and practical frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Three Approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt Engineering: The Art of Communication
&lt;/h3&gt;

&lt;p&gt;Prompt engineering is like having a conversation with a highly knowledgeable assistant. You craft specific instructions, provide context, and guide the AI's responses through carefully designed prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; You provide instructions, examples, and context directly in your input to guide the model's behavior without changing the underlying model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Example&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weak Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Summarize the latest AI trends."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strong Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Act as a tech analyst and write a 300-word summary of the top 3
generative AI trends for enterprise adoption in 2025. 
For each trend, briefly explain its impact on the software
development industry. The tone should be professional and informative."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  RAG (Retrieval-Augmented Generation): Dynamic Knowledge Integration
&lt;/h3&gt;

&lt;p&gt;RAG combines the power of search with generation. It retrieves relevant information from your knowledge base in real-time and uses that context to generate accurate, up-to-date responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; When a user asks a question, the RAG system first retrieves relevant documents or data snippets from a specified knowledge base (like your company's internal wiki, product documentation, or a database). This retrieved information is then passed to the LLM along with the original prompt, giving the model the necessary context to generate a factually grounded and accurate response.&lt;/p&gt;
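&lt;p&gt;The retrieve-then-generate loop can be sketched in a few lines. This toy version uses hand-written embeddings and cosine similarity; a real system would embed with a model and use a vector store, and the &lt;code&gt;knowledge_base&lt;/code&gt; entries here are purely hypothetical:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy knowledge base: (text, embedding) pairs. Real systems embed with a model.
knowledge_base = [
    ("Returns are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 business days.",    [0.1, 0.9, 0.0]),
    ("Support is available 24/7.",           [0.0, 0.1, 0.9]),
]

def retrieve(query_embedding, k=2):
    """Rank documents by similarity to the query and keep the top k."""
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine(query_embedding, doc[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_embedding):
    """Ground the LLM in retrieved context instead of letting it guess."""
    context = "\n".join(retrieve(query_embedding))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the return window?", [0.8, 0.2, 0.1]))
```

The resulting prompt, context plus question, is what actually gets sent to the LLM, which is why RAG answers stay grounded in your documents.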

&lt;h3&gt;
  
  
  Fine-tuning: Specialized Model Training
&lt;/h3&gt;

&lt;p&gt;Fine-tuning involves training a pre-existing model on your specific data to create a customized version that understands your domain, terminology, and patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; You take a base model and continue training it on your specific dataset, adjusting the model's weights to perform better on your particular tasks.&lt;/p&gt;
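&lt;p&gt;In practice, most of the effort goes into preparing training examples. A sketch of building a JSONL file in the chat format OpenAI's fine-tuning endpoint expects; the tickets and system prompt below are hypothetical:&lt;/p&gt;

```python
import json

# Hypothetical domain examples: (user input, desired output) pairs.
examples = [
    ("Summarize ticket #4521", "Customer reports login failure after password reset."),
    ("Summarize ticket #4522", "Refund requested for duplicate charge on invoice 881."),
]

SYSTEM = "You are a support-ticket summarizer. Reply in one sentence."

def to_jsonl(pairs) -> str:
    """One chat-formatted training example per line, as the API expects."""
    lines = []
    for user, assistant in pairs:
        lines.append(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]}))
    return "\n".join(lines)

# Write the training file that gets uploaded to the fine-tuning job.
with open("training_data.jsonl", "w") as f:
    f.write(to_jsonl(examples))
```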

&lt;h2&gt;
  
  
  Detailed Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Prompt Engineering
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero setup cost&lt;/strong&gt;: Start immediately with existing models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum flexibility&lt;/strong&gt;: Easily adjust behavior with prompt changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No technical infrastructure&lt;/strong&gt;: Works with any API-based model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid iteration&lt;/strong&gt;: Test different approaches in minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No data preparation&lt;/strong&gt;: Use natural language instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control friendly&lt;/strong&gt;: Prompts are just text files&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited context window&lt;/strong&gt;: Constrained by model's token limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent results&lt;/strong&gt;: Performance varies with prompt quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No persistent learning&lt;/strong&gt;: Can't learn from new information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection risks&lt;/strong&gt;: Vulnerable to malicious inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual optimization&lt;/strong&gt;: Requires human expertise to craft effective prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token costs&lt;/strong&gt;: Long prompts increase API usage costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Best Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Quick prototypes and MVPs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;General-purpose applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When you need immediate results&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small-scale applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tasks with clear, simple instructions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Example: Customer Service Chatbot
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Company:&lt;/strong&gt; Mid-sized e-commerce startup&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Handle basic customer inquiries without extensive setup&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Used prompt engineering with clear instructions about company policies, tone, and escalation procedures&lt;br&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Deployed in 2 days, handled 60% of basic inquiries effectively&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example Prompt:
"You are a helpful customer service representative for TechStore. 
Be friendly, professional, and concise. If asked about returns, 
our policy is 30 days with receipt. For technical issues, 
escalate to human support. Always end with 'Is there anything 
else I can help you with?'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. RAG (Retrieval-Augmented Generation)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always current&lt;/strong&gt;: Accesses real-time information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable knowledge&lt;/strong&gt;: Handle millions of documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainable&lt;/strong&gt;: Can show source documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective&lt;/strong&gt;: No model retraining needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic updates&lt;/strong&gt;: Add new information instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced hallucinations&lt;/strong&gt;: Grounded in actual documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible data sources&lt;/strong&gt;: PDFs, databases, websites, APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex architecture&lt;/strong&gt;: Requires vector databases and search infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval quality dependency&lt;/strong&gt;: Poor search = poor responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead&lt;/strong&gt;: Additional retrieval step adds delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking challenges&lt;/strong&gt;: Document segmentation affects quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher operational costs&lt;/strong&gt;: Multiple systems to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data preprocessing&lt;/strong&gt;: Documents need cleaning and structuring&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Best Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Knowledge bases and documentation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customer support with evolving information&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Research and analysis applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance and regulatory queries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enterprise search and Q&amp;amp;A&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Example: Legal Research Platform
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Company:&lt;/strong&gt; Large law firm (500+ attorneys)&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Quickly find relevant case law and regulations across thousands of documents&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; RAG system indexing legal databases, case files, and regulatory documents&lt;br&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector database with 2M+ legal documents&lt;/li&gt;
&lt;li&gt;Semantic search for case similarity&lt;/li&gt;
&lt;li&gt;Real-time updates when new cases are filed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Reduced research time from hours to minutes, 40% increase in billable efficiency&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding model: Specialized legal text embeddings&lt;/li&gt;
&lt;li&gt;Vector store: Pinecone with legal document metadata&lt;/li&gt;
&lt;li&gt;Retrieval: Hybrid search (semantic + keyword)&lt;/li&gt;
&lt;li&gt;Generation: GPT-4 with legal prompt templates&lt;/li&gt;
&lt;/ul&gt;
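&lt;p&gt;The hybrid retrieval step can be sketched as a weighted blend of a semantic score and a keyword score. The 0.7/0.3 weights and the documents below are illustrative; production systems tune the weights against real queries:&lt;/p&gt;

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    terms = query.lower().split()
    return sum(t in doc.lower() for t in terms) / len(terms)

def hybrid_score(semantic: float, keyword: float,
                 w_semantic: float = 0.7, w_keyword: float = 0.3) -> float:
    """Blend both signals; the weights are illustrative and should be tuned."""
    return w_semantic * semantic + w_keyword * keyword

# Stand-in semantic scores (a real system gets these from the vector store).
docs = {
    "Smith v. Jones, breach of contract": 0.82,
    "Doe v. Acme, product liability":     0.64,
}
query = "breach of contract damages"
ranked = sorted(docs,
                key=lambda d: hybrid_score(docs[d], keyword_score(query, d)),
                reverse=True)
print(ranked[0])
```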

&lt;h3&gt;
  
  
  3. Fine-tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain expertise&lt;/strong&gt;: Learns your specific language and patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent performance&lt;/strong&gt;: Stable, predictable outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compact responses&lt;/strong&gt;: No need to include context in prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom behavior&lt;/strong&gt;: Learns unique workflows and decision patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: Smaller, specialized models can outperform larger general ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intellectual property&lt;/strong&gt;: Your customized model becomes a business asset&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High upfront costs&lt;/strong&gt;: Requires significant data preparation and training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data requirements&lt;/strong&gt;: Needs thousands of high-quality examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-intensive&lt;/strong&gt;: Weeks or months to develop properly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance overhead&lt;/strong&gt;: Must retrain for updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical expertise&lt;/strong&gt;: Requires ML engineering skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflexible&lt;/strong&gt;: Hard to modify behavior after training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catastrophic forgetting&lt;/strong&gt;: May lose general capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Best Use Cases
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Highly specialized domains&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent, repetitive tasks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When you have abundant training data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Applications requiring specific output formats&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When general models consistently fail&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Real-World Example: Medical Diagnosis Assistant
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Company:&lt;/strong&gt; Regional hospital network&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Create an AI assistant that understands medical terminology and follows clinical protocols&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Fine-tuned model on medical records, clinical guidelines, and diagnostic procedures&lt;br&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data: 100K+ anonymized medical cases&lt;/li&gt;
&lt;li&gt;Base model: BioBERT specialized for medical text&lt;/li&gt;
&lt;li&gt;Fine-tuning: 3 months with medical experts&lt;/li&gt;
&lt;li&gt;Validation: Tested against clinical gold standards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 85% accuracy in preliminary diagnoses, reduced diagnosis time by 30%&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Which Approach to Choose?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with These Questions:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Data and Knowledge Requirements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Do you need access to frequently changing information?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you have thousands of examples of desired behavior?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can you describe your requirements clearly?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Technical Resources
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited technical team?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong engineering but limited ML expertise?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedicated ML team and infrastructure?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Time and Budget Constraints
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need results this week?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can wait 2-4 weeks for better results?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Have 2-6 months for optimal solution?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Scale and Performance Requirements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prototype or small-scale?&lt;/strong&gt; → Prompt Engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise-scale with evolving content?&lt;/strong&gt; → RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-volume, consistent performance needed?&lt;/strong&gt; → Fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enterprise Examples by Industry
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Financial Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Investment research platform&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Quick market analysis templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG:&lt;/strong&gt; Real-time financial news and earnings reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning:&lt;/strong&gt; Specialized financial language and regulatory compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chosen Approach:&lt;/strong&gt; RAG + Prompt Engineering hybrid&lt;br&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Need current market data (RAG) with consistent analysis format (prompts)&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Clinical decision support system&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Basic symptom checkers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG:&lt;/strong&gt; Latest medical research and drug interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning:&lt;/strong&gt; Specialized medical reasoning and terminology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chosen Approach:&lt;/strong&gt; Fine-tuning with RAG augmentation&lt;br&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Medical accuracy requires specialized training, but needs current research&lt;/p&gt;

&lt;h3&gt;
  
  
  E-commerce
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Product recommendation engine&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Simple recommendation rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG:&lt;/strong&gt; Current product catalogs and reviews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning:&lt;/strong&gt; Customer behavior patterns and preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chosen Approach:&lt;/strong&gt; Fine-tuning for personalization&lt;br&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Rich customer data enables personalized behavior learning&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Approaches: Best of All Worlds
&lt;/h2&gt;

&lt;p&gt;Many successful enterprise applications combine multiple approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG + Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;Perfect for customer support systems that need both current information and consistent tone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Software company help desk&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG retrieves relevant documentation&lt;/li&gt;
&lt;li&gt;Prompt engineering ensures helpful, branded responses&lt;/li&gt;
&lt;li&gt;Result: Accurate, current, and consistently helpful support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fine-tuning + RAG
&lt;/h3&gt;

&lt;p&gt;Ideal for specialized domains requiring both expertise and current information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Legal research platform&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuned model understands legal reasoning&lt;/li&gt;
&lt;li&gt;RAG provides access to latest cases and regulations&lt;/li&gt;
&lt;li&gt;Result: Expert-level legal analysis with current information&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  All Three Combined
&lt;/h3&gt;

&lt;p&gt;Enterprise-grade solutions often use a layered approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Enterprise knowledge management&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model&lt;/strong&gt; for domain understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; for accessing company knowledge base&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt engineering&lt;/strong&gt; for role-specific responses&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Start with Prompt Engineering (Week 1-2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Validate your use case quickly&lt;/li&gt;
&lt;li&gt;Understand user needs and edge cases&lt;/li&gt;
&lt;li&gt;Build initial user feedback loop&lt;/li&gt;
&lt;li&gt;Estimate performance requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Implement RAG if Needed (Week 3-6)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you need access to large knowledge bases&lt;/li&gt;
&lt;li&gt;When information changes frequently&lt;/li&gt;
&lt;li&gt;For explainable AI requirements&lt;/li&gt;
&lt;li&gt;To reduce hallucinations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Consider Fine-tuning (Month 2-6)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When you have sufficient training data&lt;/li&gt;
&lt;li&gt;For highly specialized domains&lt;/li&gt;
&lt;li&gt;When consistency is critical&lt;/li&gt;
&lt;li&gt;To optimize for performance and cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt Engineering
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development:&lt;/strong&gt; $5K-$20K (mainly developer time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; API costs ($0.01-$0.06 per 1K tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Low (prompt updates)&lt;/li&gt;
&lt;/ul&gt;
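&lt;p&gt;To make the per-token pricing concrete, here is a quick back-of-the-envelope calculation in Python. The price and volume are illustrative assumptions, not figures from any specific provider:&lt;/p&gt;

```python
# Illustrative assumptions: $0.03 per 1K tokens (mid-range of $0.01-$0.06)
# and 500K tokens processed per day.
price_per_1k_tokens = 0.03
tokens_per_day = 500_000

daily_cost = tokens_per_day / 1000 * price_per_1k_tokens
monthly_cost = daily_cost * 30
print(f"Daily: ${daily_cost:.2f}, monthly: ${monthly_cost:.2f}")
```

&lt;p&gt;At this volume the ongoing cost of prompt engineering is dominated by the API bill, which scales linearly with token throughput.&lt;/p&gt;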

&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development:&lt;/strong&gt; $50K-$200K (infrastructure + development)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; $1K-$10K/month (vector DB + compute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Medium (data pipeline management)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fine-tuning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development:&lt;/strong&gt; $100K-$500K (data prep + training + validation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; $2K-$20K/month (model hosting + retraining)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; High (continuous data collection + retraining)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt Engineering Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Over-engineering prompts:&lt;/strong&gt; Keep them simple and clear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not testing edge cases:&lt;/strong&gt; Use diverse test scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring prompt injection:&lt;/strong&gt; Validate and sanitize inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  RAG Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Poor chunking strategy:&lt;/strong&gt; Test different chunk sizes and overlap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevant retrieval:&lt;/strong&gt; Improve embedding quality and search logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information overload:&lt;/strong&gt; Limit retrieved context to most relevant&lt;/li&gt;
&lt;/ul&gt;
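&lt;p&gt;To make the chunking pitfall concrete, here is a minimal fixed-size chunker with overlap, written in Python. Real pipelines usually split on sentence or section boundaries, but the size/overlap trade-off it demonstrates is the same:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=100, overlap=20):
    # Slide a window of chunk_size characters; consecutive chunks share
    # `overlap` characters so a fact straddling a boundary stays retrievable.
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "RAG quality depends heavily on how documents are split before embedding. " * 4
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

&lt;p&gt;Too-small chunks lose context; too-large chunks dilute the embedding and crowd the prompt. Testing a few size/overlap combinations against real queries is usually worth the effort.&lt;/p&gt;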

&lt;h3&gt;
  
  
  Fine-tuning Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Insufficient training data:&lt;/strong&gt; Ensure data quality over quantity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overfitting:&lt;/strong&gt; Use proper validation and regularization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting base capabilities:&lt;/strong&gt; Monitor general performance degradation&lt;/li&gt;
&lt;/ul&gt;
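&lt;p&gt;The overfitting pitfall is usually caught by watching validation loss rather than training loss. A minimal early-stopping check in Python, with a made-up loss history for illustration:&lt;/p&gt;

```python
def should_stop(val_losses, patience=3):
    # Early stopping: stop fine-tuning when validation loss has not
    # improved for `patience` consecutive evaluations.
    best = min(val_losses)
    best_idx = val_losses.index(best)
    return len(val_losses) - 1 - best_idx >= patience

# Hypothetical validation losses: improving, then drifting up (overfitting).
history = [0.91, 0.72, 0.60, 0.58, 0.59, 0.61, 0.64]
print(should_stop(history))  # True: no improvement in the last 3 evals
```

&lt;p&gt;Pairing a check like this with a held-out evaluation set of general-purpose prompts also helps catch the third pitfall, degradation of base capabilities.&lt;/p&gt;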

&lt;h2&gt;
  
  
  Future-Proofing Your Decision
&lt;/h2&gt;

&lt;p&gt;Technology evolves rapidly. Consider these factors for long-term success:&lt;/p&gt;

&lt;h3&gt;
  
  
  Emerging Trends
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Larger context windows&lt;/strong&gt; may reduce RAG complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better base models&lt;/strong&gt; may reduce fine-tuning needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal capabilities&lt;/strong&gt; will expand all approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flexibility Planning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start with simpler approaches (prompt engineering/RAG)&lt;/li&gt;
&lt;li&gt;Design systems that can incorporate fine-tuned models later&lt;/li&gt;
&lt;li&gt;Maintain data collection for future fine-tuning opportunities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between RAG, fine-tuning, and prompt engineering isn't always either/or. The best enterprise AI solutions often combine multiple approaches strategically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with prompt engineering&lt;/strong&gt; for rapid prototyping and validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add RAG&lt;/strong&gt; when you need access to large, changing knowledge bases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider fine-tuning&lt;/strong&gt; for specialized domains with abundant data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember: the "best" approach is the one that solves your specific problem effectively within your constraints. Start simple, measure results, and evolve your approach as your needs and capabilities grow.&lt;/p&gt;

&lt;p&gt;The future belongs to organizations that can adapt their AI strategy as technology evolves. By understanding the strengths and limitations of each approach, you'll be equipped to make informed decisions that drive real business value.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>techstrategy</category>
      <category>enterprise</category>
    </item>
    <item>
      <title>Running AWS Model Context Protocol (MCP) Servers on Docker containers with DeepSeek LLM</title>
      <dc:creator>Himanjan</dc:creator>
      <pubDate>Tue, 10 Jun 2025 21:00:42 +0000</pubDate>
      <link>https://forem.com/himanjan/running-aws-model-context-protocol-mcp-servers-on-docker-containers-with-deepseek-llm-4bp9</link>
      <guid>https://forem.com/himanjan/running-aws-model-context-protocol-mcp-servers-on-docker-containers-with-deepseek-llm-4bp9</guid>
      <description>&lt;p&gt;MCP (Model Context Protocol) has gained popularity due to its ease of use, standardization, and efficiency in integrating AI models with external systems. MCP is particularly useful in AI-driven automation, agent-based systems, and LLM-powered applications, making it a go-to choice for developers looking to enhance AI interactions. &lt;/p&gt;

&lt;p&gt;In this post, we'll take an in-depth look at running an AWS Lab MCP server inside a Docker container and leveraging a large language model (LLM)—DeepSeek in this case—to send prompts for executing actions efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9s7hdk9wo8a3rc7a3xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9s7hdk9wo8a3rc7a3xd.png" alt="AWS MCP" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're already familiar with AWS Cloud and want to explore how MCP operates in real time to execute specific actions, &lt;a href="https://github.com/awslabs/mcp" rel="noopener noreferrer"&gt;AWS Labs MCP&lt;/a&gt; is a great starting point for hands-on experimentation.&lt;/p&gt;

&lt;p&gt;I will cover MCP architecture and how the AWS MCP server interacts with an LLM in a separate post. Here, the goal is to get things working quickly and see MCP in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the AWS MCP git repo&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; git clone https://github.com/awslabs/mcp.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to the MCP server you want to run as a Docker container. We will generate an AWS architecture diagram in this example, so navigate to &lt;code&gt;aws-diagram-mcp-server&lt;/code&gt; to see it in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; cd aws-mcp/mcp/src/aws-diagram-mcp-server/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Build the Docker image&lt;/strong&gt;&lt;br&gt;
 Build a Docker image from the server's Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker build -t awslabs/aws-diagram-mcp-server .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the image is built successfully, we will connect to this MCP server from an LLM agent. I will use the Cline extension here; I found it easy to configure and convenient for running MCP servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LLM API Configuration from VSCode&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install the Cline extension in VS Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjlzpwigc2age48kktzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjlzpwigc2age48kktzo.png" alt="Cline Extension" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add the LLM API&lt;/strong&gt;&lt;br&gt;
Open the Cline extension and configure the LLM API you will use with the MCP server. By default, Cline uses Anthropic's Claude Sonnet 4 model; you can change this from the API Provider option.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqlzd0wj2fg65z6stkii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqlzd0wj2fg65z6stkii.png" alt="anthropic" width="672" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case I selected DeepSeek, my personal favourite. Enter the API key and click Done. Refer to the screenshots below to change the API provider and add the key.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LLM API provider&lt;/th&gt;
&lt;th&gt;Update API Key&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsxyd46m5789gqvuuffx.png" width="616" height="1478"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxf5dj1zh2gjhaywkhj8.png" width="616" height="1478"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once the LLM API configuration is successful, send a quick "hello" prompt to verify that the model responds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Add MCP server in Cline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we can add the MCP server to Cline. Select the MCP servers icon shown below and click &lt;code&gt;manage MCP server&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jemb1l0jm32rxa18osz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jemb1l0jm32rxa18osz.png" alt="MCP server" width="676" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Configure MCP Servers&lt;/code&gt;, which opens the &lt;code&gt;cline_mcp_settings.json&lt;/code&gt; file in VS Code.&lt;/p&gt;

&lt;p&gt;Add the block below for the AWS diagram MCP server. The Docker-based MCP JSON for each server can be found in that server's documentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"awslabs.aws-diagram-mcp-server": {
        "command": "docker",
        "args": [
          "run",
          "--rm",
          "--interactive",
          "--env",
          "FASTMCP_LOG_LEVEL=ERROR",
          "awslabs/aws-diagram-mcp-server"
        ],
        "env": {},
        "disabled": false,
        "autoApprove": []
      }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding the block above to &lt;code&gt;cline_mcp_settings.json&lt;/code&gt;, save the file with Cmd+S/Ctrl+S.&lt;br&gt;
This will spin up a container for the MCP server from the Docker image we built earlier. Make sure you used the right image name in &lt;code&gt;cline_mcp_settings.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In my case it is &lt;code&gt;awslabs/aws-diagram-mcp-server&lt;/code&gt;&lt;/p&gt;
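&lt;p&gt;For reference, a complete &lt;code&gt;cline_mcp_settings.json&lt;/code&gt; with just this one server would look roughly like the sketch below. Cline nests server entries under a top-level &lt;code&gt;mcpServers&lt;/code&gt; key; the server name and image are the ones used above:&lt;/p&gt;

```json
{
  "mcpServers": {
    "awslabs.aws-diagram-mcp-server": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "--interactive",
        "--env",
        "FASTMCP_LOG_LEVEL=ERROR",
        "awslabs/aws-diagram-mcp-server"
      ],
      "env": {},
      "disabled": false,
      "autoApprove": []
    }
  }
}
```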

&lt;p&gt;&lt;strong&gt;5. Run and Test your MCP server to see it in action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now let's generate an AWS diagram to see our MCP server in action.&lt;/p&gt;

&lt;p&gt;Here is the prompt I will send to the LLM:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;generate an AWS diagram &lt;br&gt;
an ASG and an RDS instance in a private subnet. An ALB in front of the ASG as its target group, and Route53 DNS pointing to the ALB. Route53 with https://example.com&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can track the total spend per prompt, token usage, and cache reads/writes as the LLM issues API requests. That is one of the cool things about Cline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0lbb7v479r4tihwet9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0lbb7v479r4tihwet9q.png" alt="Image description" width="676" height="1564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might prompt you multiple times for approval depending on the operations it performs. Click &lt;code&gt;Approve&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once the task is completed, you will see a message like the one below with the details. As you can see, I spent a total of $0.0056 to generate the diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcvkooxnkd9wxdwop9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcvkooxnkd9wxdwop9.png" alt="Image description" width="658" height="1414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;br&gt;
A Docker-based MCP server writes generated files to the container's filesystem, so we need to copy them to the local machine to view them. In this case the diagram is generated in &lt;code&gt;/tmp/generated-diagrams/&lt;/code&gt;, which we can copy with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker cp &amp;lt;container_name&amp;gt;:/tmp/generated-diagrams generated-diagrams/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnc3xqn9g3wzbzfwa1qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnc3xqn9g3wzbzfwa1qw.png" alt="Image description" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>aws</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
