Forem: Soumya Ranjan Nanda

How I would design observability for an LLM-powered workflow

Soumya Ranjan Nanda — Fri, 24 Apr 2026 07:04:20 +0000

Most LLM observability discussions stay too shallow for production work.

They stop at:

log the prompt
log the response
maybe add tracing

That helps, but it is not enough once your system includes retrieval, tool calls, guardrails, fallbacks, and evaluation loops.

This article is my attempt to describe observability for LLM systems the way I’d design it as a software engineer working on production workflows: as a debugging and systems-design problem, not a monitoring buzzword.

I cover:

what observability really means in an LLM-powered workflow
traces vs logs vs metrics, and why all three matter
what to capture at each step: request, retrieval, prompt build, model, tools, validation, fallback, response
latency decomposition across workflow stages
token usage and cost visibility
tool-call tracing and agent execution visibility
retrieval/context debugging
prompt/version/model lineage
session, thread, and user correlation
guardrail and fallback instrumentation
evaluation signals and feedback loops
privacy, redaction, and sensitive-data concerns
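
To make "capture something at each step" concrete, here is a minimal Python sketch of per-stage instrumentation. It is not code from the article: `WorkflowTrace`, `StageRecord`, the stage names, and the attribute keys (`prompt_tokens`, `prompt_version`, and so on) are all hypothetical, and a production system would more likely emit spans through a tracing library than collect them in a list.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class StageRecord:
    name: str
    duration_ms: float
    attributes: dict


@dataclass
class WorkflowTrace:
    # One trace per request; the id is what correlates logs and metrics.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    stages: list = field(default_factory=list)

    @contextmanager
    def stage(self, name, **attributes):
        """Time one workflow stage and record its attributes."""
        attrs = dict(attributes)
        start = time.perf_counter()
        try:
            yield attrs  # the stage body may add attributes, e.g. token counts
        except Exception as exc:
            attrs["error"] = type(exc).__name__
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.stages.append(StageRecord(name, elapsed_ms, attrs))

    def latency_breakdown(self):
        """Latency per stage, so 'why was this slow?' has a per-step answer."""
        return {s.name: round(s.duration_ms, 2) for s in self.stages}

    def total_tokens(self):
        """Sum token counts over every stage that reported them."""
        return sum(
            s.attributes.get("prompt_tokens", 0)
            + s.attributes.get("completion_tokens", 0)
            for s in self.stages
        )


# Hypothetical request: the values below are made up for illustration.
trace = WorkflowTrace()
with trace.stage("retrieval", index="docs-v3") as attrs:
    attrs["documents_returned"] = 4
with trace.stage("prompt_build", prompt_version="v12"):
    pass
with trace.stage("model", model="example-model") as attrs:
    attrs["prompt_tokens"] = 812
    attrs["completion_tokens"] = 164

print(trace.trace_id, trace.latency_breakdown(), trace.total_tokens())
```

The point of the design is that latency, token usage, and lineage (prompt version, model name) hang off the same stage record, so one trace id can answer the "slow", "expensive", and "which path" questions together.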

The core idea is simple:

A lot of teams are logging the conversation.
Very few are instrumenting the workflow.

That difference matters when you need to answer questions like:

Why was this request slow?
Why was it expensive?
Why did retrieval fail?
Why did the agent take this path?
Why did a fallback trigger?
Did the answer actually help the user?

If you’re building LLM-powered features, RAG systems, or agent workflows, I’d love to hear how you’re approaching observability in practice.

Original article: https://medium.com/p/ad3326b31ddd

Prompt Engineering is Not Enough: Where Software Architecture Takes Over

Soumya Ranjan Nanda — Mon, 20 Apr 2026 06:18:07 +0000


A better prompt can improve a demo.

It cannot give you:

typed contracts
retries with sane failure handling
fallback paths
grounded retrieval
observable workflows
safe tool boundaries
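
Three of the items above (typed contracts, retries, fallback paths) can be sketched in a few lines of Python. This is an illustration under assumptions, not the article's implementation: `TicketSummary`, `parse_summary`, `summarize`, the allowed priorities, and the `call_model` callable are all hypothetical names, and the model is stubbed out so the example runs on its own.

```python
import json
from dataclasses import dataclass

ALLOWED_PRIORITIES = {"low", "medium", "high"}  # assumed contract values


@dataclass(frozen=True)
class TicketSummary:
    title: str
    priority: str


def parse_summary(raw: str) -> TicketSummary:
    """Turn raw model text into a typed value, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title = data.get("title")
    priority = data.get("priority")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unexpected priority: {priority!r}")
    return TicketSummary(title.strip(), priority)


def summarize(call_model, text, max_attempts=2):
    """Retry on contract violations, then take a deterministic fallback."""
    for _ in range(max_attempts):
        try:
            return parse_summary(call_model(text)), "model"
        except ValueError:
            continue  # malformed output: try again within the attempt budget
    # Fallback path: degrade to something predictable instead of failing.
    return TicketSummary(title=text[:60], priority="medium"), "fallback"


# Stubbed model: the first reply violates the contract, the second is valid.
replies = iter(["not json", '{"title": "Login fails", "priority": "high"}'])
result, source = summarize(lambda _text: next(replies), "Users cannot log in")
print(result, source)
```

Returning the source alongside the value ("model" or "fallback") keeps the fallback visible to callers and to whatever observability layer sits around this function.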

That’s the shift I wrote about in this article.

Prompt engineering still matters, but once a feature moves into production, software architecture usually decides whether it actually works.

Read here:
https://medium.com/p/f3371a973de7

Curious where others have seen this shift happen in real AI systems.

Why AI Features Fail in Production Even When The Demo Works

Soumya Ranjan Nanda — Wed, 15 Apr 2026 19:44:58 +0000

The demo is usually the easy part.

Production is where the real engineering starts:

latency budgets
degraded modes
validation
observability
trust boundaries
retrieval quality
cost control
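
As one way to picture a latency budget with a degraded mode, here is a small Python sketch. The function names (`answer_with_budget`, `slow_model`, `cached_answer`) and the budget values are assumptions made for illustration; the linked article may handle this differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def answer_with_budget(primary, degraded, budget_seconds):
    """Run the primary path, but switch to a degraded answer past the budget."""
    future = _pool.submit(primary)
    try:
        return future.result(timeout=budget_seconds), "primary"
    except FutureTimeout:
        # Best effort only: cancel() cannot interrupt a call already running.
        future.cancel()
        return degraded(), "degraded"
    except Exception:
        # The primary path failed outright; degrade rather than error out.
        return degraded(), "degraded"


def slow_model():
    time.sleep(0.3)  # stands in for a slow model or retrieval call
    return "full answer"


def cached_answer():
    return "short cached answer"


slow_result = answer_with_budget(slow_model, cached_answer, budget_seconds=0.05)
fast_result = answer_with_budget(lambda: "full answer", cached_answer, budget_seconds=1.0)
print(slow_result, fast_result)
```

Note the limitation called out in the comment: the slow call keeps running in its thread after the budget expires, so a real system also needs timeouts on the underlying client.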

I wrote a practical breakdown from a software engineering angle here:

https://medium.com/p/3929c4263952

Curious which failure mode teams underestimate most.

The debugging story behind PrematureCloseException in a high-volume bulk workflow

Soumya Ranjan Nanda — Wed, 15 Apr 2026 06:30:05 +0000

When more concurrency broke my bulk workflow

I increased concurrency to speed up a high-volume bulk workflow.

At first, it looked like the right move. Smaller runs got faster, throughput improved, and the pipeline seemed healthier.

Then larger runs started failing with PrematureCloseException.

That was the moment I realized the problem was no longer just performance. It had become a system pressure problem.

A few lessons from the debugging journey:

more parallelism does not always mean more throughput
chunk size is not just a batch setting — it becomes a stability boundary
retries only help after the concurrency model is sane
connection pool behavior matters a lot more under load
partial-failure handling makes bulk workflows much more trustworthy

What finally helped was not one magic fix. It was a combination of:

reducing unsafe parallelism
tuning chunk size more carefully
adding retry with backoff
stabilizing connection pool behavior
treating concurrency as a budget instead of a goal
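
The combination above (a concurrency budget plus retry with backoff) can be sketched in Python with `asyncio`. The original workflow is a JVM one, so this is a language-neutral illustration, not the article's fix: `run_bulk`, `with_retry`, `flaky_fetch`, and every numeric value are hypothetical, and `ConnectionError` stands in for a dropped connection.

```python
import asyncio
import random


async def with_retry(operation, attempts=3, base_delay=0.01):
    """Retry a coroutine factory with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # attempt budget exhausted: surface the failure
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))


async def run_bulk(items, fetch, max_in_flight=4):
    """Process all items, never holding more than max_in_flight slots."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        async with semaphore:  # the slot is held across retries too
            return await with_retry(lambda: fetch(item))

    # return_exceptions=True keeps one failed item from sinking the batch.
    return await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )


async def demo():
    in_flight = 0
    peak = 0
    seen = set()

    async def flaky_fetch(item):
        # Fails the first time each item is requested, then succeeds.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            await asyncio.sleep(0.001)
            if item not in seen:
                seen.add(item)
                raise ConnectionError("simulated dropped connection")
            return item * 2
        finally:
            in_flight -= 1

    results = await run_bulk(range(10), flaky_fetch, max_in_flight=3)
    return results, peak


results, peak = asyncio.run(demo())
print(results, "peak in flight:", peak)
```

Because the semaphore wraps the retry loop, retries consume the same budget as first attempts, which is one way to keep a retry storm from recreating the original pressure.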

I wrote the full debugging story here:

https://medium.com/p/758f87e312d5

Curious how others handle this kind of issue in bulk or async workflows.

How I Scaled Bulk Search in Spring Boot with Parallel Batch Jobs and Controlled Concurrency

Soumya Ranjan Nanda — Mon, 13 Apr 2026 10:57:38 +0000

I recently wrote about a backend problem that looked simple at first but became a real architecture and reliability challenge under load:

scaling bulk search in Spring Boot with parallel batch jobs and controlled concurrency

A few lessons stood out for me:

parallelism helps, but only until it starts hurting upstream systems
chunking is not just a batch setting, it becomes a stability boundary
partial failure handling matters as much as throughput
caching repeated enrichment work can remove a surprising amount of unnecessary load
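
A rough Python sketch of three of these lessons (chunking, partial-failure handling, cached enrichment) follows. It runs chunks sequentially to stay short, so it does not show the parallel part, and it is not the Spring Boot code from the article: `bulk_search`, `search_chunk`, `enrich`, and the `category:id` term format are invented for the example.

```python
from functools import lru_cache

enrich_calls = 0  # counts real enrichment work, to show the cache's effect


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


@lru_cache(maxsize=1024)
def enrich(category):
    """Stand-in for an expensive lookup that repeats across records."""
    global enrich_calls
    enrich_calls += 1
    return category.upper()


def search_chunk(chunk):
    # Hypothetical rule: a chunk with an empty term is rejected as a whole.
    if any(not term for term in chunk):
        raise ValueError("chunk contains an empty term")
    return [{"term": t, "label": enrich(t.split(":")[0])} for t in chunk]


def bulk_search(terms, chunk_size):
    """Process chunk by chunk; a failed chunk is reported, not fatal."""
    succeeded, failed = [], []
    for index, chunk in enumerate(chunked(terms, chunk_size)):
        try:
            succeeded.extend(search_chunk(chunk))
        except ValueError as exc:
            failed.append({"chunk": index, "terms": chunk, "reason": str(exc)})
    return {"succeeded": succeeded, "failed": failed}


report = bulk_search(["a:1", "a:2", "b:1", "", "a:3", "b:2"], chunk_size=2)
print(len(report["succeeded"]), report["failed"], enrich_calls)
```

The chunk is the unit of failure here, which is the sense in which chunk size acts as a stability boundary: a smaller chunk loses fewer good records when one bad record sinks it.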

One of the biggest shifts for me was moving from a more limited blocking flow to parallel batch jobs while still keeping the pressure on downstream systems under control.

I wrote the full breakdown here:
https://medium.com/p/6a742ad7af9d

What I learned building bulk search for large datasets in React + Spring Boot

Soumya Ranjan Nanda — Mon, 13 Apr 2026 05:43:30 +0000

Bulk search sounds easy until real users start pasting spreadsheet data, uploading messy CSVs, and expecting clear results for thousands of records.

I recently built a bulk search workflow in React + Spring Boot, and this article is a practical breakdown of what actually mattered: normalization, validation, chunking, frontend performance, and partial-failure reporting.
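
For a sense of what normalization and validation of pasted input can look like, here is a minimal Python sketch. The delimiter set, the uppercase rule, and `ID_PATTERN` are assumptions made for the example, not details taken from the article.

```python
import re

# Hypothetical identifier format: 3 to 20 uppercase letters, digits, or hyphens.
ID_PATTERN = re.compile(r"[A-Z0-9-]{3,20}")


def normalize(raw_text):
    """Split pasted text into clean, de-duplicated values, keeping order."""
    tokens = re.split(r"[,\t\n\r;]+", raw_text)
    seen, ordered = set(), []
    for token in tokens:
        value = token.strip().strip('"').upper()
        if value and value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered


def validate(values):
    """Separate usable values from rejects so both can be reported."""
    valid = [v for v in values if ID_PATTERN.fullmatch(v)]
    invalid = [v for v in values if not ID_PATTERN.fullmatch(v)]
    return valid, invalid


# Spreadsheet-style paste: mixed case, quotes, tabs, a duplicate, a bad value.
pasted = 'ab-101, AB-101\n"cd-202"\t\n x!\n'
valid, invalid = validate(normalize(pasted))
print(valid, invalid)
```

Keeping the rejects in a separate list, instead of silently dropping them, is what makes partial-failure reporting possible further down the pipeline.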

Read the full article here: https://medium.com/p/ea69f155054a