<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shivnath Tathe</title>
    <description>The latest articles on Forem by Shivnath Tathe (@shivnathtathe).</description>
    <link>https://forem.com/shivnathtathe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855910%2F92a9b9bf-4309-41f6-89eb-0aa72677047d.jpg</url>
      <title>Forem: Shivnath Tathe</title>
      <link>https://forem.com/shivnathtathe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shivnathtathe"/>
    <language>en</language>
    <item>
      <title>Run a 397B AI Model for Free Using Claude Code (3 Commands)</title>
      <dc:creator>Shivnath Tathe</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:18:10 +0000</pubDate>
      <link>https://forem.com/shivnathtathe/run-a-397b-ai-model-for-free-using-claude-code-3-commands-1lpj</link>
      <guid>https://forem.com/shivnathtathe/run-a-397b-ai-model-for-free-using-claude-code-3-commands-1lpj</guid>
      <description>&lt;p&gt;Most people think you need expensive APIs or a powerful GPU to run large AI models.&lt;/p&gt;

&lt;p&gt;You don't. Here's how I ran a 397B parameter model for free in under 5 minutes on Windows.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Actually Need
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A Windows machine&lt;/li&gt;
&lt;li&gt;An Ollama account (free)&lt;/li&gt;
&lt;li&gt;Internet connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No GPU. No API key. No Anthropic billing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's happening under the hood
&lt;/h2&gt;

&lt;p&gt;Two things working together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code CLI&lt;/strong&gt; is just the terminal interface and agent shell. Nothing goes through Anthropic's servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; hosts and runs Qwen3.5 397B on its own cloud infrastructure, completely free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude Code supports Ollama as a backend. So you get a familiar, powerful agent interface while Ollama handles 100% of the actual inference. Anthropic is not involved beyond providing the CLI shell.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Install Claude Code
&lt;/h2&gt;

&lt;p&gt;Open PowerShell and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;irm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://claude.ai/install.ps1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iex&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Install Ollama
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;irm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://ollama.com/install.ps1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iex&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Launch Claude Code with the 397B model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;launch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;claude&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;qwen3.5:397b-cloud&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll be asked to authenticate with your Ollama account once. After that it just works.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Full Claude Code agent interface&lt;/li&gt;
&lt;li&gt;397B parameter model (Qwen3.5)&lt;/li&gt;
&lt;li&gt;~256K token context window&lt;/li&gt;
&lt;li&gt;Cloud hosted, so no local GPU needed&lt;/li&gt;
&lt;li&gt;Free as long as you have an Ollama account&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Important notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This routes through Ollama's cloud, so you need an internet connection&lt;/li&gt;
&lt;li&gt;Don't use it for sensitive or private data&lt;/li&gt;
&lt;li&gt;Free access may change in the future, so use it while it lasts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;We're entering a phase where the agent layer and the model layer are completely separate.&lt;/p&gt;

&lt;p&gt;Claude Code is just the interface. The model underneath can be swapped. Local or cloud, open or closed, free or paid.&lt;/p&gt;

&lt;p&gt;This setup is a simple example of that shift. The barrier to running frontier-scale models is no longer hardware or money. It's just knowing how to connect the right tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Three commands. Less than 5 minutes. No credit card.&lt;/p&gt;

&lt;p&gt;If you run into issues, drop them in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I research LLM training and continual learning. Follow for more no-fluff AI content.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Your AI Is Not Thinking. It's Multiplying Numbers. Let Me Show You Exactly How.</title>
      <dc:creator>Shivnath Tathe</dc:creator>
      <pubDate>Mon, 06 Apr 2026 06:25:24 +0000</pubDate>
      <link>https://forem.com/shivnathtathe/your-ai-is-not-thinking-its-multiplying-numbers-let-me-show-you-exactly-how-d1g</link>
      <guid>https://forem.com/shivnathtathe/your-ai-is-not-thinking-its-multiplying-numbers-let-me-show-you-exactly-how-d1g</guid>
      <description>&lt;p&gt;&lt;em&gt;Everyone's talking about AI like it's magic. I work with it daily. It's not. Here's what's actually happening inside.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've fine-tuned LLMs. I've published research on them. I've built systems around them.&lt;/p&gt;

&lt;p&gt;And the single most honest thing I can tell you about large language models is this:&lt;/p&gt;

&lt;p&gt;At the bottom, it's matrix multiplication. That's it.&lt;/p&gt;

&lt;p&gt;Not intelligence. Not reasoning. Not understanding. Matrices of floating point numbers being multiplied together, billions of times per second.&lt;/p&gt;

&lt;p&gt;But here's the uncomfortable part. That doesn't mean nothing interesting is happening.&lt;/p&gt;

&lt;p&gt;Let me break this down without the hype, without the doomsaying, and without the marketing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a "Model" Actually Is
&lt;/h2&gt;

&lt;p&gt;Forget the word "model." It carries too much baggage.&lt;/p&gt;

&lt;p&gt;What you're actually dealing with is a file. A very large file full of numbers, floats arranged in matrices. GPT-2 has 117 million of them. GPT-3 has 175 billion. These numbers are called weights.&lt;/p&gt;

&lt;p&gt;That's the model. Numbers in a file.&lt;/p&gt;

&lt;p&gt;When you send a message to an LLM, here's what happens mechanically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your text gets converted to tokens (integers from a vocabulary)&lt;/li&gt;
&lt;li&gt;Each token gets looked up in an embedding table (a matrix) and becomes a vector&lt;/li&gt;
&lt;li&gt;That vector passes through N identical blocks, each doing attention (matmul) and feedforward (matmul + nonlinearity)&lt;/li&gt;
&lt;li&gt;Final layer produces logits over the vocabulary&lt;/li&gt;
&lt;li&gt;Sample from that distribution and get the next token&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repeat until done.&lt;/p&gt;

&lt;p&gt;No memory. No state. No "thinking." Pure function application.&lt;/p&gt;
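
&lt;p&gt;The five steps above can be sketched end to end. This is a toy single-block model with random weights, purely to make the mechanics concrete; the sizes, names, and single block are illustrative, not any real architecture:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16                        # hypothetical tiny vocabulary and width
E = rng.normal(size=(vocab, d))          # step 2: the embedding table
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_ff = rng.normal(size=(d, d))           # feedforward weights
W_out = rng.normal(size=(d, vocab))      # step 4: projection to logits

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def next_token(tokens):
    x = E[tokens]                           # step 2: token ids become vectors
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # step 3: attention is matmul
    h = softmax(q @ k.T / np.sqrt(d)) @ v
    h = np.maximum(0.0, h @ W_ff)           # step 3: feedforward + nonlinearity
    logits = h[-1] @ W_out                  # step 4: logits over the vocabulary
    return int(np.argmax(softmax(logits)))  # step 5: pick the next token

seq = [3, 17, 42]                        # step 1: token ids from a tokenizer
seq.append(next_token(seq))              # repeat until done
```

&lt;p&gt;Note there is no hidden state between calls: the whole sequence goes in, one token comes out.&lt;/p&gt;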




&lt;h2&gt;
  
  
  What "Training" Actually Is
&lt;/h2&gt;

&lt;p&gt;This is where people get philosophical, so let me be precise.&lt;/p&gt;

&lt;p&gt;Training is not teaching. There's no curriculum, no explanation, no understanding being transferred.&lt;/p&gt;

&lt;p&gt;Here's the actual process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Show the model some text&lt;/li&gt;
&lt;li&gt;It predicts the next token&lt;/li&gt;
&lt;li&gt;Compare prediction to actual next token, compute loss (a single number)&lt;/li&gt;
&lt;li&gt;Backpropagate gradients through every matrix&lt;/li&gt;
&lt;li&gt;Nudge every weight by a tiny amount in the direction that reduces loss&lt;/li&gt;
&lt;li&gt;Repeat roughly a trillion times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The only signal the model ever receives is: your probability distribution over the next token was wrong, adjust.&lt;/p&gt;

&lt;p&gt;No grammar lessons. No semantic explanations. No world knowledge explicitly provided. Just: you were wrong, here's by how much, here's which direction to shift.&lt;/p&gt;

&lt;p&gt;And yet grammar emerges. Semantics emerges. World knowledge emerges.&lt;/p&gt;

&lt;p&gt;That's not magic. That's what happens when you apply a single optimization pressure billions of times across the entire written record of human thought.&lt;/p&gt;
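
&lt;p&gt;Here is that loop on a deliberately tiny stand-in: a bigram table of logits trained with a hand-written cross-entropy gradient. A real LLM backpropagates through billions of weights across many layers, but the only signal is the same one described above:&lt;/p&gt;

```python
import numpy as np

vocab, lr = 5, 0.5
W = np.zeros((vocab, vocab))          # all the "weights" of this toy model
data = [0, 1, 2, 3, 4] * 40           # step 1: some "text" as token ids

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(200):               # step 6: repeat (ideally a trillion times)
    i = step % (len(data) - 1)
    prev, target = data[i], data[i + 1]
    p = softmax(W[prev])              # step 2: predicted next-token distribution
    loss = -np.log(p[target])         # step 3: the loss, a single number
    grad = p.copy()
    grad[target] -= 1.0               # step 4: gradient of cross-entropy
    W[prev] -= lr * grad              # step 5: nudge weights to reduce loss
```

&lt;p&gt;After a few hundred nudges the table predicts the pattern almost perfectly, and nobody ever told it the rule.&lt;/p&gt;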




&lt;h2&gt;
  
  
  Why Next-Token Prediction Even Works
&lt;/h2&gt;

&lt;p&gt;This is the question nobody asks clearly enough.&lt;/p&gt;

&lt;p&gt;The implicit assumption is that predicting the next word is a shallow task. A parlor trick.&lt;/p&gt;

&lt;p&gt;It's not. Here's why.&lt;/p&gt;

&lt;p&gt;Language is not random. Language is massively structured at every level simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surface statistics&lt;/strong&gt;: "New" is followed by "York" with near-deterministic frequency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntax&lt;/strong&gt;: "The ___" expects a noun or adjective, not a verb. Every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantics&lt;/strong&gt;: "eat" expects a food object. "drink" expects liquid. Violations sound wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;World knowledge&lt;/strong&gt;: "Paris is the capital of ___" has essentially one answer across millions of documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discourse&lt;/strong&gt;: A medical article doesn't randomly switch to cooking. Topics are coherent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal structure&lt;/strong&gt;: "He dropped the glass. It ___" has a predictable continuation, because text describing the physical world is consistent, so physics gets implicitly encoded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To predict the next token accurately across all of these simultaneously, the model is forced to learn all of these regularities. Not because anyone labeled them. Because they're all load-bearing for loss reduction.&lt;/p&gt;

&lt;p&gt;Shannon estimated English has roughly 1 to 1.5 bits of true entropy per character. Language is not a high-entropy signal. It's a highly compressible, deeply structured one.&lt;/p&gt;

&lt;p&gt;Next-token prediction works because language itself is learnable. The model just exploits that ruthlessly.&lt;/p&gt;
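
&lt;p&gt;You can see the compressibility directly. A crude unigram character estimate (an upper bound; Shannon's 1 to 1.5 bits comes from exploiting context, which a unigram model ignores) already lands far below the 8 bits a raw byte would need:&lt;/p&gt;

```python
import math
from collections import Counter

text = "the cat sat on the mat and the dog sat on the cat"
counts = Counter(text)
n = len(text)
# Entropy of the character frequency distribution, in bits per character
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
print(round(entropy, 2), "bits per character (unigram upper bound)")
```
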




&lt;h2&gt;
  
  
  What the Weights Actually Encode
&lt;/h2&gt;

&lt;p&gt;Here's the honest answer: nobody fully knows.&lt;/p&gt;

&lt;p&gt;What we do know from interpretability research is directional. Early layers tend to capture surface patterns, later layers tend to capture more abstract task-relevant signals. But it's distributed, messy, and not cleanly decomposable.&lt;/p&gt;

&lt;p&gt;You cannot point to a weight and say "this one handles subject-verb agreement."&lt;/p&gt;

&lt;p&gt;What the weights collectively store is a giant tangled function that behaves as if it knows grammar, semantics, world facts, and reasoning patterns. Because that behavior is what minimizes loss on human-generated text.&lt;/p&gt;

&lt;p&gt;It's not a list of rules. It's a compressed statistical model of human language and thought, discovered purely by gradient pressure.&lt;/p&gt;




&lt;h2&gt;
  
  
  "Just Matrix Multiplication" Is Not the Same as "Trivial"
&lt;/h2&gt;

&lt;p&gt;This is where I'll push back on the reductive take, including my own initial instinct.&lt;/p&gt;

&lt;p&gt;Yes, it's matmul. But DNA is just chemical interactions. The brain is just electrical signals. Both produce systems of staggering complexity.&lt;/p&gt;

&lt;p&gt;The Universal Approximation Theorem tells us that stacked layers with nonlinearities can approximate any function given enough capacity. The architecture isn't doing something magical. But the composition of many simple operations produces a function of extraordinary complexity.&lt;/p&gt;

&lt;p&gt;"Just matmul" at the mechanistic level does not imply "nothing meaningful" at the behavioral level.&lt;/p&gt;

&lt;p&gt;What it does imply is that the meaningful behavior is not designed, it's emergent. And that's a genuinely different and more honest framing than either "it's just statistics" or "it's basically thinking."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Transformer Is Not The Only Answer
&lt;/h2&gt;

&lt;p&gt;Here's something the hype cycle obscures: the transformer architecture is not special in principle. It's dominant in practice.&lt;/p&gt;

&lt;p&gt;What you actually need to exploit language regularities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A way to consume sequential input&lt;/li&gt;
&lt;li&gt;A way to condition predictions on context&lt;/li&gt;
&lt;li&gt;Enough capacity to store learned regularities&lt;/li&gt;
&lt;li&gt;A next-token prediction objective on enough data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RNNs satisfied all four. So did CNNs on text. So do State Space Models like Mamba, which use no attention at all, run in O(n) instead of O(n²), and are competitive with transformers on many benchmarks today.&lt;/p&gt;

&lt;p&gt;The transformer won because attention handles long-range context without compression, and because it parallelizes perfectly on GPUs. The hardware fit mattered as much as the architecture itself.&lt;/p&gt;

&lt;p&gt;Mamba is a serious contender. Hybrid architectures mixing SSM and attention layers are emerging. Five years from now, the dominant architecture is probably neither pure transformer nor pure SSM.&lt;/p&gt;

&lt;p&gt;And honestly? It's probably something nobody is currently working on.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Still Don't Know
&lt;/h2&gt;

&lt;p&gt;I want to be precise here because this matters.&lt;/p&gt;

&lt;p&gt;We don't know why scaling works.&lt;/p&gt;

&lt;p&gt;We know that scaling works. More parameters plus more data plus more compute equals better performance, with surprising consistency. But the mechanism by which scaling produces emergent capabilities (tasks the model couldn't do at smaller scale suddenly working at larger scale) is genuinely unresolved.&lt;/p&gt;

&lt;p&gt;"Combinatorial pattern composition" is a label for the mystery, not an explanation of it.&lt;/p&gt;

&lt;p&gt;We built something that behaves intelligently. We know the training procedure in full detail. We do not know why that procedure, at scale, produces the behavior it does.&lt;/p&gt;

&lt;p&gt;That's not a gap that marketing will fill. That's an open research problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Truth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;It's just matrix multiplication&lt;/td&gt;
&lt;td&gt;True, mechanistically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nothing meaningful is happening&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It understands language&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;It learns statistical structure&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;We fully understand why it works&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The transformer is the final architecture&lt;/td&gt;
&lt;td&gt;Almost certainly false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Why This Framing Matters
&lt;/h2&gt;

&lt;p&gt;If you think LLMs are magic, you'll use them wrong. You'll trust outputs that are confident but wrong, anthropomorphize failure modes, and expect capabilities that aren't there.&lt;/p&gt;

&lt;p&gt;If you think LLMs are "just statistics" and therefore uninteresting, you'll also use them wrong. You'll dismiss genuine capabilities and fail to understand where they're reliable versus brittle.&lt;/p&gt;

&lt;p&gt;The accurate framing is: a large function approximator trained via optimization that exhibits emergent structured behavior, whose full mechanism we don't yet understand.&lt;/p&gt;

&lt;p&gt;Not magic. Not trivial. Something genuinely new that deserves precise thinking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I research LLM training, continual learning, and quantization. If this sparked something, let's discuss in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Attention Is All You Need — Explained Like You’re Building It From Scratch</title>
      <dc:creator>Shivnath Tathe</dc:creator>
      <pubDate>Thu, 02 Apr 2026 05:32:13 +0000</pubDate>
      <link>https://forem.com/shivnathtathe/attention-is-all-you-need-explained-like-youre-building-it-from-scratch-1o00</link>
      <guid>https://forem.com/shivnathtathe/attention-is-all-you-need-explained-like-youre-building-it-from-scratch-1o00</guid>
      <description>&lt;p&gt;Everyone has seen this diagram.&lt;/p&gt;

&lt;p&gt;And almost everyone says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Transformers use attention instead of recurrence.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But that doesn’t actually explain anything.&lt;/p&gt;

&lt;p&gt;So let’s rebuild this from scratch — the way you would understand it if you were designing it yourself.&lt;/p&gt;

&lt;h1&gt;
  
  
  🧠 The Real Problem Transformers Solve
&lt;/h1&gt;

&lt;p&gt;Before transformers, models like RNNs and LSTMs processed text like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One word at a time → sequentially&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This caused two big problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow training (no parallelism)&lt;/li&gt;
&lt;li&gt;Long-range dependencies break&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The cat, which was sitting near the window for hours, suddenly jumped.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To understand &lt;em&gt;“jumped”&lt;/em&gt;, the model needs context from far back.&lt;/p&gt;

&lt;p&gt;RNNs struggle here.&lt;/p&gt;

&lt;h1&gt;
  
  
  ⚡ The Core Idea
&lt;/h1&gt;

&lt;p&gt;Instead of processing words one by one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if every word could look at every other word &lt;em&gt;at the same time&lt;/em&gt;?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s attention.&lt;/p&gt;

&lt;h1&gt;
  
  
  🔍 What “Attention” Actually Means
&lt;/h1&gt;

&lt;p&gt;Forget formulas for a second.&lt;/p&gt;

&lt;p&gt;Think like this:&lt;/p&gt;

&lt;p&gt;Each word asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which other words are important for me?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Example:
&lt;/h2&gt;

&lt;p&gt;Sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The animal didn’t cross the street because &lt;strong&gt;it&lt;/strong&gt; was tired.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does “it” refer to?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;animal?&lt;/li&gt;
&lt;li&gt;street?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attention helps the model decide.&lt;/p&gt;

&lt;h1&gt;
  
  
  ⚙️ How It Works (Intuition First)
&lt;/h1&gt;

&lt;p&gt;Each word is converted into a vector.&lt;/p&gt;

&lt;p&gt;From that vector, we create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query (Q) → what I’m looking for&lt;/li&gt;
&lt;li&gt;Key (K) → what I offer&lt;/li&gt;
&lt;li&gt;Value (V) → actual information&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 Matching Process
&lt;/h2&gt;

&lt;p&gt;Every word does:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Compare its Query with all Keys&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;similarity scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;normalize (softmax)&lt;/li&gt;
&lt;li&gt;use scores to combine Values&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Translation:
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Take information from important words, ignore the rest”&lt;/p&gt;
&lt;/blockquote&gt;
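
&lt;p&gt;The matching process above is, concretely, scaled dot-product attention. A minimal numpy sketch, with illustrative sizes standing in for real embeddings:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 4, 8                       # illustrative: 4 "words", 8-dim vectors

Q = rng.normal(size=(seq, d))       # what each word is looking for
K = rng.normal(size=(seq, d))       # what each word offers
V = rng.normal(size=(seq, d))       # the actual information carried

scores = Q @ K.T / np.sqrt(d)       # compare every Query with every Key
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)   # normalize (softmax)
out = weights @ V                   # combine Values by importance
```

&lt;p&gt;Each row of &lt;code&gt;weights&lt;/code&gt; sums to 1: it is one word's answer to "which other words are important for me?"&lt;/p&gt;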

&lt;h1&gt;
  
  
  🔥 Multi-Head Attention (Why multiple?)
&lt;/h1&gt;

&lt;p&gt;One attention head might learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;grammar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;positional meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of one view:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We use multiple perspectives in parallel&lt;/p&gt;
&lt;/blockquote&gt;
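
&lt;p&gt;A simplified sketch of that idea: split the vector dimensions into heads, let each head attend over its own slice, and concatenate the results. (Real multi-head attention adds learned projection matrices per head; they are omitted here to keep the shape logic visible.)&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, heads = 4, 8, 2             # illustrative sizes
x = rng.normal(size=(seq, d))

def attend(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

# Each head sees its own slice of the dimensions and attends independently;
# the per-head outputs are concatenated back into one vector per word.
parts = np.split(x, heads, axis=-1)
out = np.concatenate([attend(p, p, p) for p in parts], axis=-1)
```
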

&lt;h1&gt;
  
  
  🧱 Transformer Block (Now the Diagram Makes Sense)
&lt;/h1&gt;

&lt;p&gt;Each block has:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Attention
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Words interact with each other&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Add &amp;amp; Norm
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stabilizes training&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Feed Forward
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Processes each token independently&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  🔁 Why “Add &amp;amp; Norm”?
&lt;/h1&gt;

&lt;p&gt;This is often ignored but critical.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It keeps gradients stable and prevents information loss&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deep transformers won’t train well&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  ⚡ Encoder vs Decoder
&lt;/h1&gt;

&lt;p&gt;From the diagram:&lt;/p&gt;

&lt;h2&gt;
  
  
  Left side → Encoder
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reads input&lt;/li&gt;
&lt;li&gt;Builds representation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Right side → Decoder
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Generates output&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;masked attention (can’t see future)&lt;/li&gt;
&lt;li&gt;encoder output&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h1&gt;
  
  
  🔒 Masked Attention (Important for LLMs)
&lt;/h1&gt;

&lt;p&gt;When generating:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Model should not see future words&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we mask them.&lt;/p&gt;
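
&lt;p&gt;Mechanically, masking means adding a huge negative number to the attention scores of future positions before the softmax, so their weights collapse to zero. A small numpy illustration (the all-equal scores are a placeholder for real Query-Key scores):&lt;/p&gt;

```python
import numpy as np

seq = 4
scores = np.zeros((seq, seq))               # pretend Query-Key scores, all equal
mask = np.triu(np.ones((seq, seq)), k=1)    # 1s above the diagonal = the future
scores = scores - 1e9 * mask                # future positions get a huge penalty
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                 # row i covers only positions 0..i
```
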

&lt;h1&gt;
  
  
  🚀 Why Transformers Changed Everything
&lt;/h1&gt;

&lt;p&gt;Because they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow &lt;strong&gt;parallel computation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Handle &lt;strong&gt;long-range dependencies&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Scale extremely well&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  💥 The Real Insight
&lt;/h1&gt;

&lt;p&gt;Transformers don’t “understand language”.&lt;/p&gt;

&lt;p&gt;They do something simpler but powerful:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They learn &lt;strong&gt;relationships between tokens&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  🧠 Connecting to LLMs
&lt;/h1&gt;

&lt;p&gt;When you do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each token attends to previous tokens&lt;/li&gt;
&lt;li&gt;Builds context dynamically&lt;/li&gt;
&lt;li&gt;Predicts next token&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  🔥 Final Thought
&lt;/h1&gt;

&lt;p&gt;The paper says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Attention Is All You Need”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the deeper idea is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You don’t need sequence —&lt;br&gt;
You need relationships&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you understand that…&lt;/p&gt;

&lt;p&gt;Transformers stop looking complex&lt;br&gt;
and start looking inevitable.&lt;/p&gt;

&lt;p&gt;If you're building models or exploring low-bit training like I am, this perspective changes everything.&lt;/p&gt;

&lt;p&gt;Because now the question becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How efficiently can we compute these relationships?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
