<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pavan Kumar Appannagari</title>
    <description>The latest articles on Forem by Pavan Kumar Appannagari (@pavan-kumar-appannagari).</description>
    <link>https://forem.com/pavan-kumar-appannagari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785630%2Fa2589035-702c-4e8d-a1da-a26c24475d40.png</url>
      <title>Forem: Pavan Kumar Appannagari</title>
      <link>https://forem.com/pavan-kumar-appannagari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pavan-kumar-appannagari"/>
    <language>en</language>
    <item>
      <title>From Research Paper to Prototype: Using Generative AI to Automatically Generate Test Cases</title>
      <dc:creator>Pavan Kumar Appannagari</dc:creator>
      <pubDate>Wed, 18 Mar 2026 01:34:08 +0000</pubDate>
      <link>https://forem.com/pavan-kumar-appannagari/from-research-paper-to-prototype-using-generative-ai-to-automatically-generate-test-cases-418m</link>
      <guid>https://forem.com/pavan-kumar-appannagari/from-research-paper-to-prototype-using-generative-ai-to-automatically-generate-test-cases-418m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;About five years ago, I came across an IEEE research paper on &lt;strong&gt;Search-Based Software Testing (SBST)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea was fascinating: instead of writing test cases manually, software testing could be treated as an optimization problem. Algorithms could explore the space of possible inputs and automatically discover test cases that maximize coverage and expose hidden defects.&lt;/p&gt;

&lt;p&gt;Conceptually, it felt like a glimpse into the future of testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But there was a problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the theory was elegant, turning it into something practical was difficult. Implementing SBST systems required complex tooling, specialized algorithms, and infrastructure that most development teams simply did not have access to.&lt;/p&gt;

&lt;p&gt;At the time, the idea stayed in the back of my mind as an interesting possibility that felt just out of reach.&lt;/p&gt;

&lt;p&gt;Fast forward several years, and the landscape of software engineering has changed dramatically. With the rise of modern generative AI and large language models (LLMs), machines can now interpret natural language requirements, reason about system behavior, and generate structured outputs. While my recent work has focused on mobile architecture and behavioral consistency, this experiment explores how generative AI can improve software testing workflows.&lt;/p&gt;

&lt;p&gt;Suddenly, that old research idea started to feel much more practical.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Shift:&lt;/strong&gt; Traditional SBST relied on evolutionary algorithms and mathematical optimization to explore a system's input space. Modern LLMs approach the problem via &lt;strong&gt;semantic reasoning&lt;/strong&gt;—interpreting natural language specifications to infer behaviors. Rather than searching blindly for edge cases, AI can now reason about them directly from the requirements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When an internal innovation summit provided an opportunity to experiment, I decided to revisit that curiosity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Could generative AI finally make automated test case generation practical?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Manual Test Case Generation
&lt;/h2&gt;

&lt;p&gt;In many software teams, writing manual test cases remains a time-consuming and repetitive activity. Test engineers often begin with requirements or user stories written in formats such as &lt;strong&gt;Given–When–Then&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Given&lt;/strong&gt; a policyholder submits a claim
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When&lt;/strong&gt; the claim amount exceeds the policy limit
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Then&lt;/strong&gt; the claim should be rejected
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this format improves readability, it often leaves several important questions unanswered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What edge cases exist?&lt;/li&gt;
&lt;li&gt;Are boundary conditions clearly defined?&lt;/li&gt;
&lt;li&gt;What negative scenarios should be tested?&lt;/li&gt;
&lt;li&gt;Are there missing requirements or ambiguous behaviors?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, QA engineers must manually expand each user story into a comprehensive set of test scenarios.&lt;/p&gt;

&lt;p&gt;This process is valuable but slow—and it is exactly the type of structured reasoning that modern AI models excel at.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Idea: AI-Assisted Test Case Generation
&lt;/h2&gt;

&lt;p&gt;The core idea behind my prototype was simple: &lt;strong&gt;use generative AI to analyze user stories and automatically produce detailed manual test cases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system takes user stories written in &lt;strong&gt;Given–When–Then&lt;/strong&gt; format and generates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Detailed test scenarios&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step-by-step execution instructions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expected results&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edge cases&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Potential gaps or anomalies in the specification&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
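&lt;p&gt;To keep these outputs machine-consumable, each generated case can be represented as a small structured record. A minimal Python sketch, assuming a hypothetical schema (the field names are illustrative, not the prototype's actual format):&lt;/p&gt;

```python
# Hypothetical schema for one AI-generated test case (illustrative only).
def make_test_case(title, steps, expected, notes=None):
    """Bundle a generated scenario into a structured record."""
    return {
        "title": title,
        "steps": list(steps),
        "expected_result": expected,
        "analysis_notes": notes or [],
    }

case = make_test_case(
    title="Claim exceeds policy limit",
    steps=[
        "Submit a claim greater than the maximum policy coverage.",
        "Process the claim through the validation system.",
    ],
    expected="The system rejects the claim with a clear error message.",
    notes=["Spec gap: partial approvals are not defined."],
)
```

&lt;p&gt;A fixed shape like this makes it straightforward to render results in a UI or export them to a test management tool.&lt;/p&gt;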




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;To implement the prototype quickly and avoid infrastructure overhead, I built the system using a &lt;strong&gt;serverless architecture on AWS&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Client Application ] ──▶ [ API Gateway ] ──▶ [ AWS Lambda ] ──▶ [ Amazon Bedrock ] ──▶ [ Jurassic-2 Ultra ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Lambda: The Core Processing Engine
&lt;/h3&gt;

&lt;p&gt;The backbone of the system is an AWS Lambda function responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receiving API requests&lt;/li&gt;
&lt;li&gt;Formatting prompts for the AI model&lt;/li&gt;
&lt;li&gt;Sending requests to Amazon Bedrock&lt;/li&gt;
&lt;li&gt;Processing and returning generated test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The serverless model allowed me to focus on application logic rather than infrastructure management.&lt;/p&gt;
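&lt;p&gt;A handler covering those four responsibilities can stay very small. A minimal Python sketch, assuming a hypothetical API Gateway event shape; the Bedrock invocation itself appears only as a comment, and the real prototype's model parameters may have differed:&lt;/p&gt;

```python
import json

MODEL_ID = "ai21.j2-ultra-v1"  # illustrative Bedrock model identifier

def build_request_body(user_story):
    """Format the prompt and model parameters for Amazon Bedrock."""
    prompt = (
        "You are a QA engineer. Expand the following user story into "
        "structured test cases with steps, expected results, edge cases, "
        "and any requirement gaps.\n\n" + user_story
    )
    return json.dumps({"prompt": prompt, "maxTokens": 1024, "temperature": 0.3})

def handler(event, context):
    """Lambda entry point: parse the API request, build the model request."""
    story = json.loads(event["body"])["user_story"]
    body = build_request_body(story)
    # In the deployed function, this body would be sent to Bedrock, e.g. via
    # boto3.client("bedrock-runtime").invoke_model(modelId=MODEL_ID, body=body),
    # and the model's completion parsed before returning.
    return {"statusCode": 200, "body": body}
```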




&lt;h3&gt;
  
  
  Amazon Bedrock: Managed Generative AI
&lt;/h3&gt;

&lt;p&gt;Amazon Bedrock provides a &lt;strong&gt;managed interface for accessing foundation models&lt;/strong&gt; without the need to manage model infrastructure.&lt;/p&gt;

&lt;p&gt;For this prototype, I selected &lt;strong&gt;AI21 Jurassic-2 Ultra&lt;/strong&gt;, which at the time of the proof-of-concept demonstrated strong &lt;strong&gt;instruction-following capabilities&lt;/strong&gt; and produced consistently structured outputs suitable for generating test scenarios.&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompt Design: The Key to Results
&lt;/h3&gt;

&lt;p&gt;Rather than simply asking the model to generate tests, the prompt provided structured instructions describing the expected output format.&lt;/p&gt;

&lt;p&gt;The model was guided to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify requirement gaps&lt;/li&gt;
&lt;li&gt;generate boundary scenarios&lt;/li&gt;
&lt;li&gt;produce structured test steps and expected outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, the prompt acted as a &lt;strong&gt;lightweight specification&lt;/strong&gt; that steered the model toward more structured reasoning.&lt;/p&gt;
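&lt;p&gt;Such a prompt can be assembled from a fixed instruction template plus the incoming user story. A hedged sketch in Python (the wording here is illustrative; the prototype's actual prompt differed):&lt;/p&gt;

```python
# Illustrative instruction template; acts as a lightweight output spec.
INSTRUCTIONS = """You are a senior QA engineer.
For the user story below:
1. List any requirement gaps or ambiguous behaviors.
2. Generate boundary and negative scenarios.
3. For each scenario, produce numbered test steps and an expected result.
Return the answer as clearly labeled sections."""

def build_prompt(user_story):
    """Combine the fixed instruction block with one Given-When-Then story."""
    return INSTRUCTIONS + "\n\nUser story:\n" + user_story

prompt = build_prompt(
    "Given a policyholder submits a claim, "
    "When the claim amount exceeds the policy limit, "
    "Then the claim should be rejected."
)
```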




&lt;h2&gt;
  
  
  Example Output
&lt;/h2&gt;

&lt;p&gt;Given the earlier user story regarding policy claims, the AI generated the following scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Case — Claim Exceeds Policy Limit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Submit a claim greater than the maximum policy coverage.&lt;/li&gt;
&lt;li&gt;Process claim through validation system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system rejects the claim and displays an appropriate error message.&lt;/p&gt;




&lt;h3&gt;
  
  
  Test Case — Boundary Condition
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Submit a claim exactly equal to the policy limit.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The claim should be approved.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🤖 AI Observation: Requirement Gap Detection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; The specification does not clarify whether partial approvals are allowed if the claim exceeds the policy limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Clarification:&lt;/strong&gt; Define whether the system should automatically adjust the payout to the maximum allowed value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This type of analysis can help identify &lt;strong&gt;requirement gaps early in the development cycle&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Opportunity for QA
&lt;/h2&gt;

&lt;p&gt;The goal of this prototype is not to replace QA engineers.&lt;/p&gt;

&lt;p&gt;Instead, it demonstrates how generative AI can augment the testing process by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accelerating&lt;/strong&gt; test case generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identifying&lt;/strong&gt; requirement gaps earlier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improving&lt;/strong&gt; coverage of edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reducing&lt;/strong&gt; repetitive manual work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aligns with the &lt;strong&gt;shift-left testing&lt;/strong&gt; movement, where quality assurance begins earlier in the development lifecycle.&lt;/p&gt;

&lt;p&gt;However, like any AI-assisted workflow, generated test cases should be treated as &lt;strong&gt;suggestions rather than authoritative outputs&lt;/strong&gt;. A &lt;strong&gt;human-in-the-loop&lt;/strong&gt; remains essential to validate scenarios and ensure they align with the intended system behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Directions
&lt;/h2&gt;

&lt;p&gt;While this prototype focused on manual test cases, several extensions are possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration with issue tracking systems such as &lt;strong&gt;Jira&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Automatic generation of &lt;strong&gt;automated test scripts&lt;/strong&gt; (Selenium, XCTest)&lt;/li&gt;
&lt;li&gt;Continuous analysis of specifications for inconsistencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond manual scenarios, a natural evolution is generating &lt;strong&gt;unit tests directly from source code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By analyzing execution paths and boundary conditions, generative AI can help bridge the gap between high-level requirements and low-level code coverage—a topic I plan to explore in a future article.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Sometimes ideas arrive before the tools needed to realize them.&lt;/p&gt;

&lt;p&gt;What began as a curiosity sparked by a research paper eventually became a practical experiment made possible by the convergence of &lt;strong&gt;serverless computing and foundation models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For me, this project was a reminder that the most interesting engineering experiments often begin with a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What becomes possible when yesterday’s research finally meets the tools capable of bringing it to life?&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🚀 Explore More
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;This article is also published on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal Blog:&lt;/strong&gt; &lt;a href="https://pavan-kumar-appannagari.github.io/posts/genai-to-generate-test-cases/" rel="noopener noreferrer"&gt;Pavan’s Engineering Notes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@pavan.kumar.appannagari/from-research-paper-to-prototype-using-generative-ai-to-automatically-generate-test-cases-bd587d894ae5" rel="noopener noreferrer"&gt;From Research Paper to Prototype&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Written by Pavan Kumar Appannagari — Software Engineer — Mobile Systems &amp;amp; Applied AI&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>aws</category>
      <category>testing</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Feature Parity Bugs Aren’t Testing Failures — They’re Architectural</title>
      <dc:creator>Pavan Kumar Appannagari</dc:creator>
      <pubDate>Wed, 11 Mar 2026 02:58:41 +0000</pubDate>
      <link>https://forem.com/pavan-kumar-appannagari/feature-parity-bugs-arent-testing-failures-theyre-architectural-3ge8</link>
      <guid>https://forem.com/pavan-kumar-appannagari/feature-parity-bugs-arent-testing-failures-theyre-architectural-3ge8</guid>
      <description>&lt;p&gt;&lt;em&gt;Part of the Behavioral Consistency Series&lt;/em&gt;&lt;br&gt;&lt;br&gt;
Previously: &lt;a href="https://dev.to/posts/mobile-architecture-doppelganger-dilemma/"&gt;The Doppelgänger Dilemma — Why Apps Drift&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Every mobile team has seen the bug report:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Android works. iOS fails.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Backend logs show success.&lt;br&gt;&lt;br&gt;
The payloads look identical.&lt;br&gt;&lt;br&gt;
Nothing crashes.&lt;/p&gt;

&lt;p&gt;Yet the system behaves differently across platforms.&lt;/p&gt;

&lt;p&gt;The instinctive reaction is procedural.&lt;/p&gt;

&lt;p&gt;Teams respond with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expanded regression coverage
&lt;/li&gt;
&lt;li&gt;Cross-platform test matrices
&lt;/li&gt;
&lt;li&gt;More release coordination
&lt;/li&gt;
&lt;li&gt;Tighter QA cycles
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These responses feel responsible.&lt;/p&gt;

&lt;p&gt;But they treat the &lt;strong&gt;symptom&lt;/strong&gt;, not the cause.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Testing detected the divergence.&lt;br&gt;&lt;br&gt;
Architecture allowed it.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Feature parity bugs are rarely QA failures.&lt;br&gt;&lt;br&gt;
They are structural consequences of &lt;strong&gt;duplicated decision-making.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Misplaced Blame
&lt;/h2&gt;

&lt;p&gt;When two platforms implement the same business rule independently, both implementations may be correct when they ship.&lt;/p&gt;

&lt;p&gt;Over time, however, they evolve.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One team refactors validation logic
&lt;/li&gt;
&lt;li&gt;Another optimizes performance
&lt;/li&gt;
&lt;li&gt;A backend contract changes subtly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually a small difference appears.&lt;/p&gt;

&lt;p&gt;A backend returns &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One platform interprets it as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Use default.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The other interprets it as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Error state.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both interpretations are reasonable.&lt;/p&gt;

&lt;p&gt;Only one is consistent.&lt;/p&gt;

&lt;p&gt;Every individual change is rational.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;system-level outcome is divergence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Behavioral drift does not emerge because engineers are careless.&lt;/p&gt;

&lt;p&gt;It emerges because &lt;strong&gt;duplication creates multiple sources of truth&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Quality processes can measure inconsistency.&lt;/p&gt;

&lt;p&gt;They cannot manufacture consistency.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mathematical Inevitability of Drift
&lt;/h2&gt;

&lt;p&gt;Engineers often treat parity bugs as process failures.&lt;/p&gt;

&lt;p&gt;The thinking goes like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve coordination → fewer bugs
&lt;/li&gt;
&lt;li&gt;Improve testing → fewer divergences
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet mature systems often show the opposite pattern.&lt;/p&gt;

&lt;p&gt;Parity bugs &lt;strong&gt;increase&lt;/strong&gt; over time.&lt;/p&gt;

&lt;p&gt;What changed is not discipline.&lt;/p&gt;

&lt;p&gt;What changed is &lt;strong&gt;time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the same decision exists in two places, every change introduces interpretation.&lt;/p&gt;

&lt;p&gt;A validation rule evolves.&lt;/p&gt;

&lt;p&gt;An edge case gets optimized.&lt;/p&gt;

&lt;p&gt;An assumption gets refactored.&lt;/p&gt;

&lt;p&gt;Each modification is correct in isolation.&lt;/p&gt;

&lt;p&gt;Collectively, they create divergence.&lt;/p&gt;

&lt;p&gt;A useful way to think about it is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Drift ∝ Duplication × Time&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If duplication is zero, drift cannot accumulate.&lt;br&gt;&lt;br&gt;
If time is zero, divergence cannot emerge.&lt;/p&gt;

&lt;p&gt;In real systems, neither is zero.&lt;/p&gt;
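&lt;p&gt;A toy model makes the proportionality concrete (my own construction, not a published result): assume each change to a duplicated rule has a small, independent chance of shifting its observable behavior.&lt;/p&gt;

```python
# Toy model: several copies of one rule, each modified release after release.
# Any behavioral shift in any copy breaks cross-platform agreement.
def agreement_probability(p_shift, changes_per_copy, copies=2):
    """Probability that all copies still behave identically."""
    if copies == 1:
        return 1.0  # a single source of truth cannot disagree with itself
    return (1 - p_shift) ** (changes_per_copy * copies)
```

&lt;p&gt;With a 5% shift chance per change, two copies agree only about 36% of the time after ten changes each, while a single shared implementation stays at 100% by construction. The numbers are invented; the shape of the curve is the point.&lt;/p&gt;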

&lt;p&gt;Testing can observe drift.&lt;br&gt;&lt;br&gt;
Process can slow drift.&lt;/p&gt;

&lt;p&gt;Only &lt;strong&gt;architecture removes the conditions required for drift&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Testing Cannot Solve It
&lt;/h2&gt;

&lt;p&gt;Testing answers the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Does the implementation match expectations today?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Architecture answers the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Can implementations disagree tomorrow?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A test suite verifies behavior &lt;strong&gt;after decisions are implemented&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Architecture determines &lt;strong&gt;how many places decisions can exist&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Adding tests increases detection speed.&lt;/p&gt;

&lt;p&gt;It does not reduce divergence probability.&lt;/p&gt;

&lt;p&gt;Quality assurance is reactive by design.&lt;/p&gt;

&lt;p&gt;Consistency is preventative by design.&lt;/p&gt;

&lt;p&gt;You cannot test independent implementations into permanent agreement.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an Architectural Fix Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;If drift grows with duplication over time, the architectural solution is simple in principle:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduce duplication at the decision layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This does &lt;strong&gt;not&lt;/strong&gt; require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sharing UI
&lt;/li&gt;
&lt;li&gt;Abandoning native development
&lt;/li&gt;
&lt;li&gt;Moving to a single mobile codebase
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it requires consolidating the &lt;strong&gt;source of truth for behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In mobile systems, the most critical decisions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation rules
&lt;/li&gt;
&lt;li&gt;State transitions
&lt;/li&gt;
&lt;li&gt;Business invariants
&lt;/li&gt;
&lt;li&gt;Contract interpretation
&lt;/li&gt;
&lt;li&gt;Edge case handling
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When these decisions live in &lt;strong&gt;two repositories&lt;/strong&gt;, divergence is inevitable.&lt;/p&gt;

&lt;p&gt;When they live in &lt;strong&gt;one shared module&lt;/strong&gt;, divergence becomes structurally impossible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Kotlin Multiplatform Fits
&lt;/h2&gt;

&lt;p&gt;This is where &lt;strong&gt;Kotlin Multiplatform (KMP)&lt;/strong&gt; becomes architecturally interesting.&lt;/p&gt;

&lt;p&gt;KMP does not unify rendering layers.&lt;/p&gt;

&lt;p&gt;It does not abstract platform UX.&lt;/p&gt;

&lt;p&gt;Instead, it provides a narrower but more powerful guarantee:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The same decision is compiled into both platforms.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Validation logic written once.&lt;/p&gt;

&lt;p&gt;State transitions defined once.&lt;/p&gt;

&lt;p&gt;Error interpretation defined once.&lt;/p&gt;

&lt;p&gt;Android renders it natively.&lt;br&gt;&lt;br&gt;
iOS renders it natively.&lt;/p&gt;

&lt;p&gt;The architecture shifts from:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two implementations attempting to stay aligned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;to&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One implementation rendered twice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing still matters.&lt;/p&gt;

&lt;p&gt;But now tests verify the correctness of &lt;strong&gt;shared behavior&lt;/strong&gt;, rather than alignment between independent implementations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Determines the Shape of Bugs
&lt;/h2&gt;

&lt;p&gt;Feature parity bugs are expected in duplicated systems.&lt;/p&gt;

&lt;p&gt;Testing can surface them.&lt;/p&gt;

&lt;p&gt;Coordination can slow them.&lt;/p&gt;

&lt;p&gt;Process can mitigate them.&lt;/p&gt;

&lt;p&gt;But only &lt;strong&gt;architecture can prevent them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The question teams often ask is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we catch parity bugs earlier?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why are we designing systems where parity bugs are structurally possible?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When decisions are shared and rendering is native, the category of cross-platform divergence shrinks dramatically.&lt;/p&gt;

&lt;p&gt;That is not a testing improvement.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;design correction&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;This article is also published on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal Blog:&lt;/strong&gt; &lt;a href="https://pavan-kumar-appannagari.github.io/posts/feature-parity-architectural-not-testing/" rel="noopener noreferrer"&gt;Pavan’s Engineering Notes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@pavan.kumar.appannagari/why-feature-parity-bugs-are-architectural-not-testing-failures-0e51c8c3ab9e" rel="noopener noreferrer"&gt;Feature Parity Bugs Aren’t Testing Failures&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Written by &lt;strong&gt;Pavan Kumar Appannagari&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Software Engineer — Mobile Systems &amp;amp; Applied AI&lt;/p&gt;




&lt;h2&gt;
  
  
  Behavioral Consistency Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/mobile-architecture-doppelganger-dilemma/"&gt;Part 1 — The Doppelgänger Dilemma&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2 — Feature Parity Bugs Are Architectural, Not Testing Failures&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Part 3 — Sharing Domain Logic Across Platforms &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>mobile</category>
      <category>kotlin</category>
      <category>kotlinmultiplatform</category>
    </item>
    <item>
      <title>The Doppelgänger Dilemma: Why Your Mobile Apps Look Alike but Act Like Strangers</title>
      <dc:creator>Pavan Kumar Appannagari</dc:creator>
      <pubDate>Mon, 23 Feb 2026 01:12:30 +0000</pubDate>
      <link>https://forem.com/pavan-kumar-appannagari/the-doppelganger-dilemma-why-your-mobile-apps-look-alike-but-act-like-strangers-1jkk</link>
      <guid>https://forem.com/pavan-kumar-appannagari/the-doppelganger-dilemma-why-your-mobile-apps-look-alike-but-act-like-strangers-1jkk</guid>
      <description>&lt;p&gt;Most mobile teams don’t ship one app.&lt;/p&gt;

&lt;p&gt;They ship &lt;strong&gt;two apps that slowly disagree&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A validation rule changes on Android.&lt;br&gt;&lt;br&gt;
iOS ships it two sprints later.&lt;br&gt;&lt;br&gt;
Weeks afterward, users report &lt;em&gt;“random failures”&lt;/em&gt; but nothing is actually broken.&lt;/p&gt;

&lt;p&gt;The platforms simply made different decisions.&lt;/p&gt;

&lt;p&gt;I call this the &lt;strong&gt;Doppelgänger Dilemma&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
apps that look identical in the store, yet behave like strangers in production.&lt;/p&gt;

&lt;p&gt;In mobile engineering, the hardest problem is not performance or UI.&lt;/p&gt;

&lt;p&gt;It’s keeping &lt;strong&gt;behavior consistent across independently evolving codebases&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Feature parity is not a testing problem.&lt;br&gt;&lt;br&gt;
It is an architecture problem.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  1. The Identity Crisis in Your App Drawer
&lt;/h2&gt;

&lt;p&gt;In today’s mobile ecosystem, we are quietly haunted by the Doppelgänger Dilemma.&lt;/p&gt;

&lt;p&gt;In biology, doppelgängers are unrelated individuals who merely resemble one another.&lt;br&gt;&lt;br&gt;
In mobile engineering, this describes the fractured relationship between iOS and Android applications.&lt;/p&gt;

&lt;p&gt;Users expect a seamless, consistent experience regardless of device. Yet beneath the glass, these apps are often complete strangers — built on separate stacks, architectural patterns, and independently evolving codebases.&lt;/p&gt;

&lt;p&gt;In practice this surfaces as something familiar to every mobile team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;two pull requests for every feature
&lt;/li&gt;
&lt;li&gt;two different bug tickets weeks later
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When codebases behave like unrelated twins rather than a unified system, we aren't just building apps; we are duplicating technical debt.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. The Hidden Beast: Synchronization Costs
&lt;/h2&gt;

&lt;p&gt;Product planning usually assumes development cost scales linearly with platforms.&lt;/p&gt;

&lt;p&gt;In reality, there is a hidden multiplier: &lt;strong&gt;Synchronization Cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A simple example:&lt;/p&gt;

&lt;p&gt;A password policy update required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimum length change
&lt;/li&gt;
&lt;li&gt;special character validation
&lt;/li&gt;
&lt;li&gt;backend enforcement
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Android shipped immediately.&lt;br&gt;&lt;br&gt;
iOS shipped two sprints later.&lt;/p&gt;

&lt;p&gt;For weeks, login failures appeared random to users, but the real cause was behavioral divergence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Synchronization cost grows faster than feature complexity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every new capability introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicated validation logic
&lt;/li&gt;
&lt;li&gt;mismatched edge cases
&lt;/li&gt;
&lt;li&gt;inconsistent release timing
&lt;/li&gt;
&lt;li&gt;multiplied testing permutations
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As Robert C. Martin observed, duplication compounds software failures.&lt;br&gt;&lt;br&gt;
In mobile, it compounds across platforms.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. The Cross-Platform Compromise
&lt;/h2&gt;

&lt;p&gt;To fight duplication, the industry embraced cross-platform frameworks.&lt;/p&gt;

&lt;p&gt;They optimize reach — but platform vendors optimize evolution speed.&lt;/p&gt;

&lt;p&gt;Apple and Google continuously introduce new interaction models and hardware integrations.&lt;br&gt;&lt;br&gt;
Abstraction layers inevitably trail platform innovation.&lt;/p&gt;

&lt;p&gt;The result is not broken apps — but subtly incorrect ones.&lt;br&gt;&lt;br&gt;
Users feel it as friction rather than bugs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Not everything should be shared.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  4. From Doppelgängers to Symbiotic Cousins
&lt;/h2&gt;

&lt;p&gt;A sustainable strategy is to treat platforms as &lt;strong&gt;Symbiotic Cousins&lt;/strong&gt;, not identical twins.&lt;/p&gt;

&lt;p&gt;Modern native languages — Swift and Kotlin — converged philosophically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Swift&lt;/th&gt;
&lt;th&gt;Kotlin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immutability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;let&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;val&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optionals&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Optional&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Nullable&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;async/await&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Coroutines&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI Model&lt;/td&gt;
&lt;td&gt;SwiftUI&lt;/td&gt;
&lt;td&gt;Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The alignment is not syntax — it is architectural thinking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;immutability, explicit state, deterministic concurrency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This allows shared intent without shared UI layers.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Sharing the Brain, Not the Face: Kotlin Multiplatform
&lt;/h2&gt;

&lt;p&gt;Kotlin Multiplatform (KMP) enables a surgical solution:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Share behavior — keep presentation native.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of duplicating domain rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// shared/commonMain&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EmailValidator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"@"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;EmailValidator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;KMP compiles shared logic into native artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM modules for Android&lt;/li&gt;
&lt;li&gt;Native frameworks for iOS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No runtime bridge.&lt;br&gt;
No UI abstraction.&lt;br&gt;
Just one behavioral source of truth.&lt;/p&gt;

&lt;p&gt;Adoption can be incremental: share validation, networking, or business rules first.&lt;/p&gt;
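
&lt;p&gt;Incremental adoption is largely a build-configuration decision. The sketch below shows what a minimal shared module might declare in &lt;code&gt;build.gradle.kts&lt;/code&gt;; the target functions come from the Kotlin Multiplatform Gradle plugin, while the comment about module scope reflects this article's suggestion, not a required layout:&lt;/p&gt;

```kotlin
// build.gradle.kts of a hypothetical shared module (sketch, not a full build file)
kotlin {
    androidTarget()        // shared logic compiled as a JVM/Android artifact
    iosArm64()             // shared logic compiled as a native iOS framework
    iosSimulatorArm64()

    sourceSets {
        commonMain.dependencies {
            // Start small: only validation / business rules live here at first.
            // Networking, persistence, etc. can migrate in later increments.
        }
    }
}
```

&lt;p&gt;Nothing about the existing UI code changes at this step; each platform simply gains a new dependency on the shared artifact.&lt;/p&gt;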




&lt;h2&gt;
  
  
  6. Declarative UI: Architectural Alignment
&lt;/h2&gt;

&lt;p&gt;SwiftUI and Jetpack Compose changed mobile architecture.&lt;/p&gt;

&lt;p&gt;UI is no longer a mutable object tree; it is a function of state.&lt;/p&gt;

&lt;p&gt;This removes the impedance mismatch older MVC/MVP layers created.&lt;/p&gt;

&lt;p&gt;Now the shared layer produces state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KMP owns behavior&lt;/li&gt;
&lt;li&gt;Native UI owns expression&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consistency without uniformity.&lt;/p&gt;
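
&lt;p&gt;A minimal sketch of that split, building on the &lt;code&gt;EmailValidator&lt;/code&gt; idea from earlier. The names (&lt;code&gt;SignUpState&lt;/code&gt;, &lt;code&gt;SignUpViewModel&lt;/code&gt;) are illustrative, and the plain callback stands in for what would usually be a &lt;code&gt;StateFlow&lt;/code&gt; from kotlinx.coroutines, just to keep the example dependency-free:&lt;/p&gt;

```kotlin
// Hypothetical shared code in commonMain: the shared layer decides,
// each platform's declarative UI merely renders the resulting state.

data class SignUpState(val email: String = "", val isEmailValid: Boolean = false)

class EmailValidator {
    fun isValid(email: String): Boolean {
        if (!email.contains("@")) return false
        return email.length > 5
    }
}

class SignUpViewModel(
    private val validator: EmailValidator = EmailValidator(),
    private val onState: (SignUpState) -> Unit   // observed by the native UI
) {
    fun onEmailChanged(email: String) {
        // Validity is computed exactly once, in shared code.
        onState(SignUpState(email, validator.isValid(email)))
    }
}
```

&lt;p&gt;Jetpack Compose would collect this state directly; SwiftUI would observe it through a thin &lt;code&gt;ObservableObject&lt;/code&gt; wrapper. Neither platform re-implements the rule.&lt;/p&gt;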




&lt;h2&gt;
  
  
  7. Observed Industry Pattern
&lt;/h2&gt;

&lt;p&gt;Large mobile organizations increasingly converge on the same strategy:&lt;/p&gt;

&lt;p&gt;shared domain logic + native UI&lt;/p&gt;

&lt;p&gt;Not because of tooling preference, but because behavioral consistency matters more than code reuse.&lt;/p&gt;

&lt;p&gt;The winning architecture is not write-once-run-everywhere.&lt;/p&gt;

&lt;p&gt;It is decide-once-render-natively.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. A Practical Observation
&lt;/h2&gt;

&lt;p&gt;In multiple mobile initiatives I’ve observed, both directly and through peer teams, feature parity issues often emerge not because of poor engineering, but because domain rules evolve independently across platforms.&lt;/p&gt;

&lt;p&gt;When validation or business logic lives in separate codebases, small differences accumulate quietly. These differences typically surface during integration testing or post-release analysis, where behavior appears inconsistent despite both implementations being “correct” in isolation.&lt;/p&gt;

&lt;p&gt;Introducing a shared domain validation layer changes the failure pattern.&lt;/p&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;behavioral differences surfaced unpredictably during release cycles&lt;/li&gt;
&lt;li&gt;parity verification required manual cross-platform comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;platform behavior aligned by default&lt;/li&gt;
&lt;li&gt;discrepancies were traceable primarily to backend contract changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The measurable gain was not raw performance.&lt;/p&gt;

&lt;p&gt;It was architectural predictability.&lt;/p&gt;
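
&lt;p&gt;One concrete form of that predictability: parity verification becomes an ordinary unit test in &lt;code&gt;commonTest&lt;/code&gt;, run once against the single shared implementation, instead of a manual comparison of two codebases. A sketch using the standard &lt;code&gt;kotlin.test&lt;/code&gt; API (the validator is repeated here so the example is self-contained; its rule mirrors the earlier snippet):&lt;/p&gt;

```kotlin
import kotlin.test.Test
import kotlin.test.assertFalse
import kotlin.test.assertTrue

// Repeated from the shared module so this sketch stands alone.
class EmailValidator {
    fun isValid(email: String): Boolean {
        if (!email.contains("@")) return false
        return email.length > 5
    }
}

class EmailValidatorTest {
    private val validator = EmailValidator()

    @Test
    fun acceptsWellFormedAddress() = assertTrue(validator.isValid("user@example.com"))

    @Test
    fun rejectsMissingAtSign() = assertFalse(validator.isValid("userexample.com"))
}
```

&lt;p&gt;Because both platforms consume the same compiled artifact, a green &lt;code&gt;commonTest&lt;/code&gt; run verifies the behavior for iOS and Android at once.&lt;/p&gt;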




&lt;h2&gt;
  
  
  9. Beyond the Divide
&lt;/h2&gt;

&lt;p&gt;The Doppelgänger Dilemma is not a tooling problem.&lt;br&gt;
It is an architectural choice.&lt;/p&gt;

&lt;p&gt;Modern mobile architecture no longer optimizes for platform independence.&lt;/p&gt;

&lt;p&gt;It optimizes for behavioral consistency.&lt;/p&gt;

&lt;p&gt;Kotlin Multiplatform enables teams to unify decision-making while preserving native experience.&lt;/p&gt;

&lt;p&gt;The goal is not writing less code.&lt;/p&gt;

&lt;p&gt;It is removing disagreement from the system.&lt;/p&gt;

&lt;p&gt;Written by Pavan Kumar Appannagari — Software Engineer — Mobile Systems &amp;amp; Applied AI&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;This article is also published on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal Blog:&lt;/strong&gt; &lt;a href="https://pavan-kumar-appannagari.github.io/posts/mobile-architecture-doppelganger-dilemma/" rel="noopener noreferrer"&gt;Pavan’s Engineering Notes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@pavan.kumar.appannagari/the-doppelg%C3%A4nger-dilemma-why-your-mobile-apps-look-alike-but-act-like-strangers-a0d01e0e6388" rel="noopener noreferrer"&gt;The Doppelgänger Dilemma&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;This is Part 1 of a series on behavioral consistency in mobile architecture.&lt;/p&gt;

&lt;p&gt;Upcoming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why Feature Parity Bugs Are Architectural, Not QA Issues&lt;/li&gt;
&lt;li&gt;Sharing Validation Logic Across iOS and Android with KMP&lt;/li&gt;
&lt;li&gt;Swift Concurrency vs Kotlin Coroutines: A Mental Model Mapping&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ios</category>
      <category>android</category>
      <category>kotlin</category>
      <category>multiplatform</category>
    </item>
  </channel>
</rss>
