Forem: Adamo Software

Tool-use API design for LLMs: 5 patterns that prevent agent loops and silent failures

Adamo Software — Tue, 05 May 2026 02:53:33 +0000

In July 2025, a developer's Claude Code instance hit a recursion loop and burned through 1.67 billion tokens in 5 hours, generating an estimated $16,000 to $50,000 in API charges before anyone noticed. The agent did not crash. It did not throw an error. It just kept calling tools, getting confused, calling more tools, and silently accumulating cost. Old software crashes. LLM agents spend.

This is the failure mode most teams discover the hard way. You design a clean tool interface, the agent works in your test environment, you ship it to production, and three weeks later something hits an edge case that sends the agent into a loop. The patterns below are what we have used to prevent agent loops and silent failures across production LLM systems handling thousands of tool calls per day. None of them are about better prompts. They are about better tool design.

Why agent loops happen in the first place

Before the patterns, it helps to understand the failure mode precisely.

An LLM agent receives a user request, reasons about which tool to call, calls it, gets a result, reasons about whether the goal is achieved, and either responds or calls another tool. This loop is supposed to terminate when the goal is achieved or the model decides no further action is needed.

It does not terminate when:

The tool result is ambiguous. The model cannot tell whether the call succeeded, so it tries again with slightly different parameters.
The tool fails silently. The model receives a non-error response that does not actually contain the data it needed, so it interprets this as "I should retry."
The tool returns conflicting information. Two consecutive calls return different results, the model loses confidence in either, and tries to "verify" by calling more tools.
The model misreads its own previous output. With long context windows, the model sees a previous tool call's result, forgets it already processed that result, and re-processes it as new information.

Every one of these is preventable with tool design. The model is not the problem. The interface is.

Pattern 1: Make every tool result self-describing

The single most common cause of agent loops is tool results that the model cannot interpret without making assumptions.

A bad tool result:

{
  "results": [
    {"id": "h_1234", "name": "Hotel Granbell", "price": 128},
    {"id": "h_5678", "name": "Shibuya Stream", "price": 142}
  ]
}

The model now has to assume what this means. Are these the only matches? Are there more? Is the search complete? What was searched for? When the model gets confused (and at scale, it always gets confused eventually), it calls the tool again "to verify."

A self-describing tool result:

{
  "status": "success",
  "search_id": "srch_abc123",
  "query_summary": {
    "destination": "Shibuya, Tokyo",
    "check_in": "2026-07-12",
    "check_out": "2026-07-15",
    "guests": 1,
    "max_price": 150
  },
  "results": [
    {"id": "h_1234", "name": "Hotel Granbell", "price": 128, "currency": "USD"},
    {"id": "h_5678", "name": "Shibuya Stream", "price": 142, "currency": "USD"}
  ],
  "total_matches": 2,
  "is_complete": true,
  "next_action_hint": "User has 2 valid options. Present both with prices. Do not search again unless user changes parameters."
}

The is_complete: true and next_action_hint fields are the critical additions. The model can now read this result, understand that the search is finished, and know what to do next without re-querying. The query_summary echo lets the model verify it called the tool with the right parameters.

The next_action_hint is unconventional but extremely effective. It is a short instruction included in the tool response that tells the model what state the conversation is in. Think of it as the tool nudging the model toward correct loop termination.

# Wrapping tools to inject next_action_hint
import functools

def with_action_hint(tool_func):
    @functools.wraps(tool_func)  # keep the wrapped tool's name and docstring
    def wrapper(*args, **kwargs):
        result = tool_func(*args, **kwargs)
        result['next_action_hint'] = derive_hint(result)
        return result
    return wrapper

def derive_hint(result):
    status = result.get('status')
    total = result.get('total_matches')  # None for tools that do not report a count
    if status == 'success' and total == 0:
        return "No matches. Inform user and ask for relaxed criteria. Do not retry."
    if status == 'success' and total:
        return f"Found {total} matches. Present to user. Do not search again unless parameters change."
    if status == 'error':
        # Error results carry 'error_message' (see Pattern 2), not 'error'
        message = result.get('error_message', 'unknown error')
        return f"Tool failed: {message}. Inform user. Do not retry without user input."
    return "Process result and decide next step."

Implementing this across our tool surface reduced retry-driven loops by approximately 60% in production.
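
Fields such as is_complete and query_summary are easier to keep consistent when one helper builds the whole envelope. The following is a minimal sketch of that idea, not our production code: build_search_envelope and its total_available parameter are hypothetical names, and it assumes the supplier reports how many matches exist in total.

```python
from typing import Any


def build_search_envelope(query: dict[str, Any],
                          hits: list[dict[str, Any]],
                          total_available: int) -> dict[str, Any]:
    """Wrap raw search hits in a self-describing result (illustrative helper).

    total_available is assumed to be the supplier's own count of matches;
    is_complete is true only when every match was returned.
    """
    is_complete = len(hits) >= total_available
    if total_available == 0:
        hint = "No matches. Inform user and ask for relaxed criteria. Do not retry."
    elif is_complete:
        hint = (f"User has {len(hits)} valid options. Present them. "
                "Do not search again unless user changes parameters.")
    else:
        hint = (f"Showing {len(hits)} of {total_available} matches. "
                "Present these and offer to narrow the search.")
    return {
        "status": "success",
        "query_summary": dict(query),  # echo so the model can check its own call
        "results": hits,
        "total_matches": total_available,
        "is_complete": is_complete,
        "next_action_hint": hint,
    }
```

Building the hint in the same place as is_complete keeps the two from contradicting each other, which would reintroduce exactly the ambiguity this pattern is meant to remove.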

Pattern 2: Distinguish between "no results" and "tool failure"

The second most common cause of agent loops: ambiguous failure states.

A search that returns zero matches is a successful tool call. A search that timed out is a failed tool call. To the LLM, both can look identical if the tool just returns an empty results array.

# Bad: indistinguishable from no results
def search_hotels(query):
    try:
        results = supplier_api.search(query)
        return {"results": results}
    except Exception:
        return {"results": []}  # silent failure

# Good: explicit status with retry guidance
def search_hotels(query):
    try:
        results = supplier_api.search(query, timeout=5)
        return {
            "status": "success",
            "results": results,
            "total_matches": len(results),
            "retryable": False,
        }
    except SupplierTimeout:
        return {
            "status": "error",
            "error_type": "timeout",
            "error_message": "Supplier API did not respond within 5 seconds.",
            "retryable": True,
            "retry_after_ms": 2000,
            "max_retries_remaining": get_retry_budget(query),
        }
    except SupplierAuthError:
        return {
            "status": "error",
            "error_type": "auth",
            "error_message": "API authentication failed.",
            "retryable": False,
            "user_facing_message": "We're having trouble accessing hotel data. Please try again later.",
        }
    except RateLimitError as e:
        return {
            "status": "error",
            "error_type": "rate_limit",
            "error_message": f"Rate limit hit. Reset in {e.reset_seconds}s.",
            "retryable": True,
            "retry_after_ms": e.reset_seconds * 1000,
        }

The retryable flag is doing real work. When it is false, the LLM knows there is no point retrying and will inform the user instead. When it is true, the LLM has a structured retry path with explicit limits.

Without this pattern, an authentication failure that looks like an empty result set causes the model to try increasingly creative parameter combinations to "find results," consuming tokens and producing nothing.
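
The timeout branch above reports max_retries_remaining through a get_retry_budget helper that is not shown. One way such a budget could work is sketched below. The RetryBudget class and timeout_error function are hypothetical, and the sketch assumes the budget is tracked per distinct query and that the first failure is the original attempt rather than a retry.

```python
import json


class RetryBudget:
    """Tracks retries left for each distinct query (hypothetical helper)."""

    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries
        self._failures: dict[str, int] = {}

    def _key(self, query: dict) -> str:
        # Sort keys so the same query always maps to the same entry
        return json.dumps(query, sort_keys=True, default=str)

    def record_failure(self, query: dict) -> int:
        """Record one failed attempt and return how many retries remain."""
        key = self._key(query)
        self._failures[key] = self._failures.get(key, 0) + 1
        # The first failure is the original attempt, so it uses no retry
        retries_used = self._failures[key] - 1
        return max(0, self.max_retries - retries_used)


def timeout_error(query: dict, budget: RetryBudget) -> dict:
    """Build a timeout result whose retryable flag reflects the budget."""
    remaining = budget.record_failure(query)
    return {
        "status": "error",
        "error_type": "timeout",
        "retryable": remaining > 0,  # flips to False once the budget is spent
        "retry_after_ms": 2000,
        "max_retries_remaining": remaining,
    }
```

Tying retryable to the remaining budget means the tool itself tells the model when to stop, instead of relying on the model to count its own attempts.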

Pattern 3: Enforce a hard call budget at the orchestrator level

No matter how well-designed your tools are, the model will occasionally enter a loop. The orchestrator must enforce a hard ceiling.

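import json

class AgentOrchestrator:
    # Use one instance per conversation: the counters below are never reset.
    def __init__(self, tools, max_tool_calls=15, max_total_cost_usd=0.50):
        self.tools = tools
        self.max_tool_calls = max_tool_calls
        self.max_total_cost_usd = max_total_cost_usd
        self.calls_made = 0
        self.total_cost_usd = 0.0

    async def run_agent_turn(self, user_message, conversation_history):
        history = conversation_history + [{"role": "user", "content": user_message}]

        while self.calls_made < self.max_tool_calls:
            if self.total_cost_usd >= self.max_total_cost_usd:
                return self._cost_limit_response()

            response = await call_llm(history, tools=self.tools)
            self.total_cost_usd += response.cost_usd

            if not response.tool_calls:
                # Model produced final response, exit loop
                return response.content

            # Keep the model's own tool request in the history so the tool
            # results that follow have something to attach to.
            history.append({"role": "assistant", "content": response.content,
                            "tool_calls": response.tool_calls})

            for tool_call in response.tool_calls:
                if self.calls_made < self.max_tool_calls:
                    self.calls_made += 1
                    tool_result = await self.execute_tool(tool_call)
                else:
                    # Budget ran out mid-batch: answer without executing
                    tool_result = {"status": "call_limit_reached", "retryable": False}
                # Serialize so the model receives text. Exact message fields
                # (for example a tool call id) depend on your LLM provider.
                history.append({"role": "tool", "content": json.dumps(tool_result, default=str)})

        # Hit call limit. Force a final response.
        return await self._force_final_response(history)

    def _cost_limit_response(self):
        # Fixed reply with no further LLM call, so no further spend.
        # Placeholder wording; adapt it to your product.
        return ("I've reached the processing limit for this conversation. "
                "Please start a new request to continue.")

    async def _force_final_response(self, history):
        # Add explicit instruction and call LLM with tools=None.
        # Some APIs take the system prompt separately; if yours does,
        # send this instruction as a user message instead.
        history.append({
            "role": "system",
            "content": "Tool call limit reached. Produce a final response to the user "
                       "based on information already gathered. Do not request more tools."
        })
        response = await call_llm(history, tools=None)
        return response.content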

Two safeguards here. First, max_tool_calls prevents infinite loops by capping the number of tool calls. Fifteen is our default for booking workflows. Anything more than that is almost always a sign the agent is confused, not productive. Second, max_total_cost_usd is a financial circuit breaker. Because the cost check runs before each LLM call, the agent can overshoot the per-conversation budget by at most the cost of one LLM call, no matter how many tool calls it tries to make.

When the limit is hit, the orchestrator does not just return an error. It calls the LLM one more time with tools=None, forcing it to produce a final response from whatever it has gathered. This is much better UX than "Sorry, agent failed."

For high-volume systems, also implement per-tenant rate limiting. The single-developer Claude Code incident burned $16-50K because there was no per-account ceiling. Production systems need both per-conversation and per-tenant limits.
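
A per-tenant ceiling can sit in front of the orchestrator and refuse new agent turns once a tenant has spent its allowance. The sketch below is illustrative only: TenantSpendLimiter is a hypothetical class, the daily window is an assumption, and the in-memory dictionary would need to be replaced by a shared store in any multi-process deployment.

```python
from datetime import date
from typing import Optional


class TenantSpendLimiter:
    """In-memory daily spend ceiling per tenant (illustrative sketch)."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit_usd = daily_limit_usd
        self._spend: dict[tuple[str, date], float] = {}

    def record(self, tenant_id: str, cost_usd: float,
               day: Optional[date] = None) -> None:
        """Add the cost of one LLM call to the tenant's total for that day."""
        key = (tenant_id, day or date.today())
        self._spend[key] = self._spend.get(key, 0.0) + cost_usd

    def allow(self, tenant_id: str, day: Optional[date] = None) -> bool:
        """Return True while the tenant is still under its daily ceiling."""
        key = (tenant_id, day or date.today())
        return self._spend.get(key, 0.0) < self.daily_limit_usd
```

The orchestrator would call allow() before starting a turn and record() after each LLM call, so a runaway conversation is stopped by the per-conversation budget and a runaway account by this one.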

Pattern 4: Detect repeated calls and short-circuit them

Even with call budgets, agents waste budget by repeating the same call with minor variations. The fix is a deduplication layer at the orchestrator.

import hashlib
import json

class ToolCallDeduplicator:
    def __init__(self, window_size=5):
        self.recent_calls = []
        self.window_size = window_size

    def is_duplicate(self, tool_name, arguments):
        signature = self._signature(tool_name, arguments)
        is_dup = any(call == signature for call in self.recent_calls)
        self.recent_calls.append(signature)
        if len(self.recent_calls) > self.window_size:
            self.recent_calls.pop(0)
        return is_dup

    def _signature(self, tool_name, arguments):
        # Normalize arguments for comparison
        normalized = json.dumps(arguments, sort_keys=True, default=str)
        return f"{tool_name}:{hashlib.sha256(normalized.encode()).hexdigest()[:16]}"


# In the orchestrator
async def execute_tool(self, tool_call):
    if self.deduplicator.is_duplicate(tool_call.name, tool_call.arguments):
        return {
            "status": "duplicate_call_blocked",
            "message": (
                f"This exact {tool_call.name} call was made earlier in this conversation "
                f"with the same arguments. The previous result is already in your context. "
                f"Use it instead of calling again."
            ),
            "retryable": False,
        }

    return await self._actually_execute(tool_call)

When the model calls the same tool with the same arguments twice in a 5-call window, the orchestrator returns a structured "this is a duplicate" message instead of executing again. The model sees this and almost always recovers, often by referring back to the earlier result.

This pattern caught about 8% of calls in our production systems. Eight percent of total tool calls were unnecessary repeats. Blocking them saved both cost and latency.

A subtle detail: ideally the deduplication signature is lossy enough to catch near-duplicates. The code above uses exact argument matching (after sorting keys), so it misses calls that differ only trivially. For some tools, such as search queries that differ only in word order, a normalization step before hashing would catch more.
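
A normalization step of that kind could look like the sketch below. It is an assumption-laden example rather than what we run: lossy_signature and its free_text_fields parameter are hypothetical, and word sorting is applied only to fields you explicitly name, because for many arguments (a route from one city to another, for instance) word order carries meaning.

```python
import hashlib
import json


def lossy_signature(tool_name, arguments, free_text_fields=("query",)):
    """Hash a tool call so that trivially reworded free-text queries collide."""
    normalized = {}
    for key, value in arguments.items():
        if key in free_text_fields and isinstance(value, str):
            # Lowercase, collapse whitespace, and sort words so that
            # "Tokyo cheap hotels" and "cheap hotels tokyo" match
            value = " ".join(sorted(value.lower().split()))
        normalized[key] = value
    payload = json.dumps(normalized, sort_keys=True, default=str)
    return f"{tool_name}:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"
```

Fields outside free_text_fields are still compared exactly, so structured arguments such as dates or IDs keep their strict matching.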

Pattern 5: Parameter validation at the boundary, not inside the LLM

The slowest path to detecting a bad tool call is letting the LLM make it, the tool execute it, and the failure propagate back. The fastest path is validating parameters before the tool runs.

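# validator is the Pydantic v1-style decorator (deprecated in v2 in favor of field_validator)
from pydantic import BaseModel, Field, ValidationError, validator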
from datetime import date, timedelta


class SearchHotelsArgs(BaseModel):
    destination: str = Field(min_length=2, max_length=100)
    check_in: date
    check_out: date
    guests: int = Field(ge=1, le=20)
    max_price: float = Field(gt=0, le=10000)

    @validator('check_in')
    def check_in_not_in_past(cls, v):
        if v < date.today():
            raise ValueError(f"check_in date {v} is in the past")
        return v

    @validator('check_out')
    def check_out_after_check_in(cls, v, values):
        if 'check_in' in values and v <= values['check_in']:
            raise ValueError("check_out must be after check_in")
        if 'check_in' in values and (v - values['check_in']) > timedelta(days=90):
            raise ValueError("Stay length cannot exceed 90 days")
        return v


async def execute_tool(self, tool_call):
    if tool_call.name == "search_hotels":
        try:
            args = SearchHotelsArgs(**tool_call.arguments)
        except ValidationError as e:
            return {
                "status": "validation_error",
                "errors": e.errors(),
                "user_facing_hint": (
                    "Some search parameters were invalid. Confirm with the user before retrying."
                ),
                "retryable_after_correction": True,
            }
        return await self._search_hotels(args)

This catches three classes of bad calls:

Type errors: the LLM passes a string where the tool expects an integer.
Range errors: the LLM tries to search for 50 guests in one room.
Logical errors: check-out before check-in, dates in the past.

By rejecting these at the boundary with structured error responses, we prevent the deeper failure mode where the supplier API gets called with bad data, returns a cryptic error, the LLM cannot interpret the error, and a loop starts.
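
The list returned by e.errors() can be verbose, so it may help to flatten it into one short line per field before handing it to the model. The helper below is a hypothetical sketch that relies only on the loc and msg keys Pydantic includes in each error dictionary; the sample input in the usage note is hand-written, not captured output.

```python
def summarize_validation_errors(errors):
    """Turn Pydantic-style error dicts into short 'field: message' lines."""
    lines = []
    for err in errors:
        # loc is a tuple path to the offending field, e.g. ("guests",)
        field = ".".join(str(part) for part in err.get("loc", ())) or "(root)"
        lines.append(f"{field}: {err.get('msg', 'invalid value')}")
    return lines
```

For example, passing [{"loc": ("guests",), "msg": "value must be at most 20"}] yields a single line naming the guests field, which is usually all the model needs to correct the call or ask the user.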

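The Pydantic-based approach also gives you JSON schema generation for free, which feeds directly into the tool definitions you send to the LLM. The Field constraints (ranges, lengths) appear in that schema, but the logic inside custom validators does not, so state those rules in the tool description as well. The result is schema-aligned validation across both ends.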

What I would not recommend

A few approaches we tried and abandoned:

Asking the LLM "are you done?" prompts mid-loop. Slows everything down and only works inconsistently. The orchestrator-level call budget is more reliable.
Letting the LLM see the full call history in every iteration. Increases context cost dramatically and provides little benefit. Pattern 4 (deduplication with structured feedback) is more efficient.
Streaming tool execution with partial results. Looks attractive but creates new failure modes where the LLM acts on incomplete data. Stick with atomic tool calls that either complete or fail cleanly.
Auto-generating tool definitions from API specs. Tempting because it sounds DRY, but auto-generated descriptions are usually not what the LLM needs. Hand-written tool descriptions, with explicit guidance about when to use the tool and when not to, work better.

Production results

After implementing these five patterns across our LLM-powered booking systems:

Agent loop incidents: dropped from between 3 and 5 per week to fewer than 1 per month.
Average tool calls per conversation: dropped 22%, mostly by eliminating duplicates and unnecessary retries.
Time-to-final-response: improved 18%, primarily from earlier short-circuiting of bad parameter calls.
Cost per conversation: dropped 31%, combination of fewer tool calls and tighter budget enforcement.

The patterns are not glamorous. They are mostly defensive engineering. But the alternative is the Claude Code incident: a $16,000 to $50,000 loss from an agent that "did not crash" but kept spending. In production LLM systems, the difference between cost-stable and not is exactly this kind of unsexy infrastructure.

If you are designing tools for an LLM agent, treat the tool interface as a contract that must hold even when the model is confused. The model will be confused. The contract will be tested. Patterns 1 through 5 are how it survives.

These patterns were developed across production builds at Adamo Software, including AI travel assistant deployments and agentic AI systems where tool-use reliability is non-negotiable.

How we reduced LLM hallucination to under 1% in a production booking system

Adamo Software — Tue, 28 Apr 2026 08:27:45 +0000

LLM hallucinations are not a research curiosity. They are a production problem. A 2026 Stanford AI Index reported hallucination rates across 26 leading LLMs ranging from 22% to 94%, and even the best-performing model on Vectara's leaderboard (Gemini-2.0-Flash-001) still hallucinates 0.7% of the time on grounded summarization tasks. In a booking system that handles thousands of transactions daily, even a 1% hallucination rate means dozens of bookings made with wrong prices, wrong dates, or wrong policies every single day.

This article walks through the validation pipeline we built that brought our production hallucination rate from approximately 4% (out of the box) to under 1%, and the architectural decisions that mattered most.

What "hallucination" means in a booking context

Before we get into the solution, it helps to be precise about what we are fixing. In a booking system, hallucinations break down into four categories:

Numerical hallucinations. The LLM states a price, a duration, a distance, or a count that does not match the source data. Example: the API returns $189/night, but the LLM tells the user "$179 per night." This is the most common and most damaging category.

Temporal hallucinations. Wrong times, wrong dates, wrong durations. Example: the API returns "departure 22:30 local time" and the LLM says "departure at 10:30 PM" (correct in 12-hour format) or "departure at 2:30 AM" (incorrect). The first is fine. The second is a hallucination.

Attribute hallucinations. The LLM invents amenities, policies, or features that are not in the source data. Example: claiming a hotel has a rooftop pool when the API returned no pool data, or stating a flight allows free cabin baggage when the fare class actually charges.

Citation hallucinations. The LLM references reviews, ratings, or sources that do not exist. Example: "According to recent guest reviews, the hotel scored 9.2 for cleanliness" when no review API was queried.

These categories matter because each requires a different detection strategy. A single validation pipeline that treats them uniformly will miss most of the failures.

The architecture: separating reasoning from facts

The core principle: the LLM never generates facts. It only formats them.

                ┌─────────────────────┐
   User query   │   LLM (reasoning)   │
   ───────────▶ │   Intent + Tool     │
                │   selection only    │
                └──────────┬──────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │  Tool execution     │
                │  (real APIs)        │
                │  ───────────────    │
                │  • search_hotels    │
                │  • get_pricing      │
                │  • check_avail      │
                └──────────┬──────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │  Structured data    │
                │  (canonical facts)  │
                └──────────┬──────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │   LLM (formatter)   │
                │   Generates prose   │
                │   from facts only   │
                └──────────┬──────────┘
                           │
                           ▼
                ┌─────────────────────┐
                │ Validation pipeline │
                │ Checks LLM output   │
                │ against facts       │
                └──────────┬──────────┘
                           │
                           ▼
                       User reply

The LLM has two roles, executed in two separate calls:

Reasoning call. Given the user's message, decide which tools to invoke and with what parameters. Output: a list of tool calls. No facts generated yet.
Formatting call. Given the tool results (structured JSON), generate a natural language response. The system prompt explicitly forbids introducing any numerical, temporal, or factual claim not present in the tool output.

This separation alone reduced our hallucination rate from ~4% to ~2%. The remaining 2% came from the LLM still occasionally introducing claims during the formatting step. That is what the validation pipeline catches.

The validation pipeline

The pipeline runs every LLM-formatted response through three checks before sending it to the user.

Check 1: Numerical claim extraction and verification

Every number in the LLM's response must trace back to the structured tool output. We extract numbers using regex (with currency, time, and unit awareness), then verify each one against the canonical fact set.

import re
from decimal import Decimal

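# Patterns for different numerical claim types.
# Each price alternative accepts either comma-grouped digits ("1,200") or a
# plain run ("1200"), so an ungrouped amount is not truncated to "120".
PRICE_PATTERN = re.compile(
    r'\$\s?((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?)'
    r'|((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?)\s?(?:USD|dollars?)',
    re.IGNORECASE
)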
TIME_PATTERN = re.compile(
    r'\b(\d{1,2}):(\d{2})\s?(AM|PM|am|pm)?\b'
)
PERCENT_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s?%')

def extract_numerical_claims(text: str) -> dict:
    """Extract all numerical claims from LLM-generated text."""
    claims = {
        'prices': [],
        'times': [],
        'percentages': [],
    }

    for match in PRICE_PATTERN.finditer(text):
        value = match.group(1) or match.group(2)
        claims['prices'].append(Decimal(value.replace(',', '')))

    for match in TIME_PATTERN.finditer(text):
        hour, minute, meridiem = match.groups()
        claims['times'].append({
            'raw': match.group(0),
            'hour': int(hour),
            'minute': int(minute),
            'meridiem': meridiem,
        })

    for match in PERCENT_PATTERN.finditer(text):
        claims['percentages'].append(Decimal(match.group(1)))

    return claims


def validate_claims(claims: dict, canonical_facts: dict) -> list:
    """Return list of unsupported claims."""
    violations = []

    # Build lookup sets from canonical facts
    valid_prices = {Decimal(str(p)) for p in canonical_facts.get('prices', [])}
    valid_times = canonical_facts.get('times', [])

    for price in claims['prices']:
        if price not in valid_prices:
            # Allow small rounding (±$0.01)
            if not any(abs(price - vp) <= Decimal('0.01') for vp in valid_prices):
                violations.append(('price', price, valid_prices))

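    for time_claim in claims['times']:
        # is_time_in_canonical() compares an extracted claim with the canonical times
        if not is_time_in_canonical(time_claim, valid_times):
            violations.append(('time', time_claim, valid_times))

    # Percentages are extracted too, so verify them the same way
    # (assumes canonical_facts carries a 'percentages' list)
    valid_percentages = {Decimal(str(p)) for p in canonical_facts.get('percentages', [])}
    for pct in claims['percentages']:
        if pct not in valid_percentages:
            violations.append(('percentage', pct, valid_percentages))

    return violations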
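The `is_time_in_canonical` helper used above is not shown in the listing. A minimal sketch of what it might look like, assuming canonical times are stored as 24-hour "HH:MM" strings and that a claim without AM/PM is already in 24-hour form (both are assumptions; the real fact format may differ):

```python
from typing import Optional


def to_minutes(hour: int, minute: int, meridiem: Optional[str]) -> int:
    """Convert a clock time to minutes after midnight."""
    if meridiem:
        # 12 AM maps to hour 0 and 12 PM stays at hour 12
        hour = hour % 12 + (12 if meridiem.upper() == 'PM' else 0)
    return hour * 60 + minute


def is_time_in_canonical(time_claim: dict, valid_times: list) -> bool:
    """True if the claimed time equals any canonical 'HH:MM' (24-hour) time."""
    claimed = to_minutes(
        time_claim['hour'], time_claim['minute'], time_claim.get('meridiem')
    )
    for raw in valid_times:
        hour, minute = raw.split(':')
        if to_minutes(int(hour), int(minute), None) == claimed:
            return True
    return False
```

Normalizing both sides to minutes after midnight keeps "2:30 PM" and "14:30" from being treated as different facts.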

If any claim fails verification, the response is rejected and the LLM is asked to regenerate with stricter instructions. After two failed attempts, the system falls back to a templated response built directly from the structured data.
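The reject, regenerate, and fall-back flow can be sketched as a small gate. This is an illustration rather than our production code: `respond_with_gate` and its three callables are hypothetical names, and it assumes "two failed attempts" counts the initial response as the first attempt.

```python
from typing import Callable

MAX_ATTEMPTS = 2  # assumption: the initial response plus one stricter retry


def respond_with_gate(
    facts: dict,
    generate: Callable[[dict, bool], str],
    find_violations: Callable[[str, dict], list],
    render_template: Callable[[dict], str],
) -> tuple:
    """Return (reply, source), where source is 'llm' or 'template'."""
    for attempt in range(MAX_ATTEMPTS):
        # Ask for stricter instructions on every attempt after the first
        strict = attempt > 0
        candidate = generate(facts, strict)
        if not find_violations(candidate, facts):
            return candidate, 'llm'
    # Every attempt contained an unsupported claim: build the reply from data only
    return render_template(facts), 'template'
```

Passing the generator, validator, and template renderer in as callables keeps the gate itself free of any model or API dependency, which makes it easy to unit test.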

Check 2: Attribute grounding via embedding similarity

Numerical extraction does not catch attribute hallucinations ("the hotel has a rooftop pool"). For these, we use a different approach: embedding-based grounding.

For every property the LLM mentions in its response, we have a canonical attribute list from the API. We compute embeddings for both the LLM's claims (extracted as noun phrases) and the canonical attributes, then check that every claim has a high-similarity match in the canonical set.

from sentence_transformers import SentenceTransformer
import numpy as np
import spacy

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Load the spaCy model once at import time; reloading it on every call is slow
nlp = spacy.load('en_core_web_sm')

def extract_noun_phrases(text: str) -> list:
    """Extract candidate attribute claims as spaCy noun chunks."""
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]


def validate_attributes(
    llm_text: str,
    canonical_attributes: list[str],
    threshold: float = 0.7
) -> list:
    """Flag attributes claimed in LLM text that don't match canonical data."""
    claimed_phrases = extract_noun_phrases(llm_text)

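    if not claimed_phrases:
        return []
    if not canonical_attributes:
        # Nothing to ground against, so every claimed attribute is unsupported
        return [
            {'claim': p, 'best_match_score': 0.0, 'best_match': None}
            for p in claimed_phrases
        ]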

    claim_embeddings = embedder.encode(claimed_phrases)
    canonical_embeddings = embedder.encode(canonical_attributes)

    violations = []
    for phrase, claim_emb in zip(claimed_phrases, claim_embeddings):
        similarities = np.dot(canonical_embeddings, claim_emb) / (
            np.linalg.norm(canonical_embeddings, axis=1) *
            np.linalg.norm(claim_emb)
        )
        if similarities.max() < threshold:
            violations.append({
                'claim': phrase,
                'best_match_score': float(similarities.max()),
                'best_match': canonical_attributes[similarities.argmax()],
            })

    return violations

This catches statements like "the hotel offers a fitness center and spa" when the API only returned "fitness center" (no spa). The threshold of 0.7 was tuned empirically. Lower thresholds let too many attribute hallucinations through. Higher thresholds caused false positives on legitimate paraphrases.

Check 3: Self-consistency via second LLM pass

For complex multi-sentence responses, regex and embeddings miss subtle errors. The third check uses a second LLM call (with a smaller, cheaper model) configured as a fact-checker.

import json

# Literal braces in the JSON example are doubled so str.format() leaves them intact
FACT_CHECK_PROMPT = """You are a strict fact-checker. Given:
1. A SOURCE DATA object (the only ground truth)
2. A CANDIDATE RESPONSE (what we want to send to a user)

Your job: identify any claim in the CANDIDATE that is NOT supported by the SOURCE.

Output JSON only:
{{
  "supported": true/false,
  "violations": [
    {{"claim": "exact text from candidate", "reason": "why not supported"}}
  ]
}}

Do not flag stylistic differences or paraphrasing. Only flag factual claims that contradict or are not present in SOURCE.

SOURCE DATA:
{source_json}

CANDIDATE RESPONSE:
{response_text}
"""


async def fact_check(response_text: str, source_data: dict) -> dict:
    result = await call_llm(
        model='gpt-4o-mini',  # cheaper model for fact-checking
        prompt=FACT_CHECK_PROMPT.format(
            source_json=json.dumps(source_data, indent=2),
            response_text=response_text,
        ),
        response_format={'type': 'json_object'},
        temperature=0,  # deterministic fact-checking
    )
    return json.loads(result)

This catches hallucinations that are linguistically subtle. Example: the source data says "free cancellation until 24 hours before check-in" and the LLM writes "you can cancel free of charge anytime before your stay." The numerical extractor sees nothing wrong (no numbers conflict). The embedding check thinks it is similar enough. But the fact-checker catches that "anytime" contradicts "until 24 hours before."

The fact-checker call adds 200 to 400ms of latency. We run it asynchronously and only block the user-facing response when the fact-checker flags violations.

Results from production

After 60 days of running this pipeline in production:

Numerical hallucinations: dropped from ~2.4% to 0.3%
Temporal hallucinations: dropped from ~0.8% to 0.1%
Attribute hallucinations: dropped from ~1.2% to 0.4%
Citation hallucinations: dropped to 0% (we eliminated review/rating citations from the formatter prompt entirely)

Combined hallucination rate: under 1% across all conversation turns. The 0.8% to 0.9% that remains is dominated by edge cases in attribute paraphrasing (e.g., the LLM saying "ocean view" when the source said "sea view"). These are technically inaccurate but rarely cause booking errors.

Things that did not work

A few approaches we tried and abandoned:

Higher temperature with self-correction prompts. Asking the LLM "are you sure?" or "double-check your facts" before responding reduced hallucinations slightly but was inconsistent. Google's December 2024 research found that asking "are you hallucinating right now?" reduces hallucination rates by 17%, but the effect diminishes after 5-7 interactions. We confirmed this empirically.
Fine-tuning on our domain data. Improved phrasing and tone but did not reliably reduce hallucinations. The model still invented prices and times confidently. Fine-tuning addresses style, not factual grounding.
Constraining output to JSON-only. Useful for tool calls but unsuitable for user-facing responses. Travelers want natural language, not structured data.
Reasoning models for everything. Paradoxically, OpenAI's o3 reasoning model hallucinated 33% of the time on PersonQA, double the rate of its predecessor o1, and the smaller o4-mini performed even worse at 48%. Reasoning models excel at analysis but introduce more hallucinations on factual tasks. We use reasoning mode only for intent parsing, never for response generation.

Practical takeaways

If you are building any LLM-powered system where factual accuracy matters (booking, customer service, financial advisory, healthcare assistance), the architectural rules that mattered most for us:

Two LLM calls, not one. Reasoning and formatting are different jobs. Mixing them in a single call lets the LLM hallucinate facts while reasoning.
Structured ground truth as the only fact source. If a claim is not in the structured tool output, it cannot be in the response. Period.
Validation as a hard gate. Logging hallucinations is not enough. Block them from reaching the user, even at the cost of latency and occasional templated responses.
Layer your detection. No single check (regex, embeddings, fact-checker) catches everything. Layered checks with different signal types catch different failure modes.
Measure rates by category. "Hallucination rate" as a single number is not actionable. Numerical, temporal, attribute, and citation hallucinations have different costs and different solutions.

The fundamental insight is that hallucination is not a model problem. It is a system design problem. The Stanford AI Index 2026 calls for engineers to "treat hallucination rates as design inputs rather than bugs to be ignored". That has been our experience exactly. Pick a baseline model with reasonable performance, then build the pipeline around it that enforces grounding.

This pattern was developed across several production builds at Adamo Software, including AI travel assistants and AI chatbots for travel booking where booking accuracy is non-negotiable.

How we handle LLM context window limits without losing conversation quality

Adamo Software — Tue, 21 Apr 2026 08:17:52 +0000

Every developer building on LLMs hits the same wall eventually. Your chatbot works beautifully for the first 10 turns, then starts forgetting things. Your agent runs a 30-step workflow and loses track of the original goal halfway through. Your RAG system stuffs so much context into the prompt that response quality drops.

This is the context window problem, and it does not go away by switching to a model with a bigger window. We learned this the hard way while building an AI assistant for a travel booking platform. This post covers the strategies we actually use in production, with the trade-offs we hit.

Why bigger context windows are not the answer

Claude 3.5 Sonnet has a 200K token window. GPT-4o has 128K. Gemini 1.5 Pro has up to 2M. The temptation is to just throw everything in.

Three problems with that approach.

First, cost. Input tokens are not free. At 2M tokens per call, you are spending significant money on every request even before the model generates anything.

Second, latency. Processing a 200K-token prompt takes meaningfully longer than a 10K-token one. For a chat interface, this is the difference between instant and sluggish.

Third, and most importantly, quality degrades with length. Research, including the "Lost in the Middle" paper (Liu et al., 2023), has repeatedly shown that models pay less attention to content in the middle of very long contexts. This is called the "lost in the middle" problem. A fact placed at token 80,000 of a 150,000-token context has a real chance of being ignored.

So the question is not "how do we fit everything," it is "what actually needs to be in the prompt right now."

The four strategies we use

We combine four techniques depending on the use case. None of these are novel individually. The value is in knowing when to use which.

1. Sliding window with summarization

For chatbots and conversational agents, we keep the last N turns verbatim and summarize everything older. The key design decision is how often to summarize.

from typing import List
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    tokens: int

RECENT_TURNS = 6
SUMMARIZE_THRESHOLD = 20

def manage_context(messages: List[Message], summary: str) -> tuple[List[Message], str]:
    if len(messages) <= SUMMARIZE_THRESHOLD:
        return messages, summary

    # Keep the last N turns raw
    recent = messages[-RECENT_TURNS:]
    to_summarize = messages[:-RECENT_TURNS]

    # Incremental summarization: feed old summary + new messages
    new_summary = summarize(
        existing_summary=summary,
        new_messages=to_summarize
    )
    return recent, new_summary

We trigger summarization when the conversation exceeds 20 turns, not on every turn. Summarizing every turn is wasteful and introduces quality drift because you are summarizing summaries of summaries.

The trade-off: summaries lose specificity. If a user mentioned "I prefer aisle seats near the front" on turn 3 and you compressed that into "user discussed seat preferences" on turn 25, the agent may forget the actual preference. We mitigate this with strategy #3 below.

2. Relevance-based retrieval instead of full history

For long-running agents that make many tool calls, we do not send the entire tool call history back on every step. Instead, we embed each prior action and its result, and retrieve only the top-k most relevant to the current step.

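import numpy as np

def build_agent_context(current_goal: str, all_steps: List[Step], k: int = 5):
    if not all_steps:
        return []

    # Embed the current goal (embed() stands for whatever embedding call you use)
    query = np.asarray(embed(current_goal))

    # Embed each step's summary (consider caching these instead of re-embedding per call)
    steps = np.asarray([embed(f"{s.action}: {s.result}") for s in all_steps])

    # Cosine similarity between the goal and every prior step
    scores = steps @ query / (np.linalg.norm(steps, axis=1) * np.linalg.norm(query))

    # Retrieve top-k most relevant prior steps, restored to chronological order
    top_k_indices = np.argsort(scores)[-k:]
    relevant_steps = [all_steps[i] for i in sorted(top_k_indices)]

    return relevant_steps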

This works well when agent steps are semantically diverse. It works poorly when every step is similar, because the embeddings cluster too tightly. For those cases we fall back to the sliding window.

3. Structured memory for facts that must not be lost

Some information cannot be lost to summarization. User preferences, confirmed bookings, authentication context, critical constraints. We extract these into a structured memory object that travels with every prompt.

structured_memory = {
    "user_profile": {
        "name": "extracted_from_conversation",
        "preferences": ["aisle seat", "non-smoking", "high floor"],
    },
    "session_state": {
        "current_booking": {"destination": "Tokyo", "dates": "2026-06-12 to 2026-06-20"},
        "confirmed_steps": ["flight_selected", "hotel_searched"],
    },
    "hard_constraints": ["budget: $3000 max", "must arrive before June 14"]
}

The LLM does not write to this object freely. We use a dedicated extraction step after each turn, with a structured output schema, to pull out facts. This gives us deterministic memory instead of relying on the model to remember.
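To make the "dedicated extraction step" concrete, here is a minimal sketch of validating an extraction result before merging it into memory. The `apply_extraction` function, the `ALLOWED_KEYS` set, and the shape of the extraction payload are hypothetical, and plain Python checks stand in for a schema library to keep the example dependency-free.

```python
import copy

ALLOWED_KEYS = {'preferences', 'hard_constraints'}  # hypothetical extraction schema


def apply_extraction(memory: dict, extraction: dict) -> dict:
    """Merge a validated extraction into a copy of structured memory."""
    unknown = set(extraction) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f'unexpected keys: {sorted(unknown)}')
    for key, values in extraction.items():
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f'{key} must be a list of strings')

    # Work on a copy so a failed or partial merge never corrupts live memory
    updated = copy.deepcopy(memory)
    prefs = updated['user_profile']['preferences']
    for pref in extraction.get('preferences', []):
        if pref not in prefs:  # keep memory free of duplicates
            prefs.append(pref)
    constraints = updated['hard_constraints']
    for item in extraction.get('hard_constraints', []):
        if item not in constraints:
            constraints.append(item)
    return updated
```

Rejecting unexpected keys outright is what keeps the model from writing to the object freely: anything outside the schema raises instead of being stored.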

The Anthropic prompt caching documentation is worth reading if you go this route, because a stable memory block at the start of your prompt is an ideal cache target.

4. Context compression for large retrieved documents

For RAG systems retrieving long documents, we compress before injection. Instead of pasting a 5000-word document into the context, we run a fast model (Haiku or GPT-4o-mini) to extract only the passages relevant to the user's query.

This is a two-model pipeline:

Retrieval returns top-k documents (often 3-5 long docs)
A fast, cheap model extracts relevant sections from each
The main model sees only the compressed, relevant content

The extra inference call adds ~200ms of latency but typically reduces main prompt size by 70-85%. Net cost is lower and quality is usually higher because the main model is not distracted by irrelevant content.
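The shape of that pipeline can be sketched as follows. The `extract` callable stands in for the fast model call, and `keyword_extract` is only a toy substitute so the example runs without an API; both names are hypothetical.

```python
from typing import Callable


def compress_documents(
    query: str,
    documents: list,
    extract: Callable[[str, str], str],
) -> str:
    """Run the cheap extractor over each document and join the non-empty results."""
    passages = []
    for doc in documents:
        passage = extract(query, doc).strip()
        if passage:  # drop documents with nothing relevant
            passages.append(passage)
    return '\n\n'.join(passages)


def keyword_extract(query: str, doc: str) -> str:
    """Toy stand-in for the fast model: keep sentences sharing a word with the query."""
    terms = {w.lower() for w in query.split()}
    kept = [
        s for s in doc.split('. ')
        if terms & {w.lower().strip('.,') for w in s.split()}
    ]
    return '. '.join(kept)
```

In a real system the extractor would be the Haiku or GPT-4o-mini call described above; only the compressed string reaches the main model's prompt.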

When each strategy fails

Being specific about failure modes, because this is where blog posts usually wave their hands:

Sliding window fails when users reference something from far back in the conversation ("like that restaurant I mentioned earlier"). Always pair with structured memory.
Relevance retrieval fails when the current step has no good semantic overlap with prior relevant steps. For example, if step 30 needs information from step 2 but they use completely different vocabulary, retrieval misses it.
Structured memory fails when the extraction step produces low-quality outputs. Garbage in, garbage out. We validate extractions against a Pydantic schema and retry with a stricter prompt on validation failure.
Context compression fails when the query is ambiguous. If the user asks "tell me more about that," the compression model has no way to know what "that" refers to. We rewrite the query using recent conversation context before passing it to compression.

What changed when we combined all four

Before we had a structured context strategy, a 50-turn conversation in our travel agent would produce noticeably worse responses by turn 40. Users would need to re-state preferences. The agent would propose options the user had already rejected.

After combining sliding window + relevance retrieval + structured memory:

Average tokens per request dropped from ~18,000 to ~6,500, a 64% reduction
User-reported "the AI forgot what I said" complaints dropped significantly in internal testing
Response latency p95 improved from 4.2s to 2.1s

One thing we did not improve: cost per successful conversation. The reduction in tokens was offset by the extra inference calls for summarization and extraction. What we got was better quality at roughly the same cost, which for a production agent is the right trade.

Wrapping up

The context window is a constraint to design around, not a capacity to fill. A model with 2M tokens gives you more runway, but if you depend on stuffing everything in, your quality will still degrade and your costs will still climb.

Start with a sliding window for recent turns, structured memory for facts that matter, and retrieval for everything in between. Compression is the advanced move once the basics are in place.

If you are working on production AI systems and want deeper context on multi-step agent design, we have written previously about AI agent fallback chains and human-in-the-loop patterns, which pair well with this post. For background reading, Greg Kamradt's Needle in a Haystack benchmarks are a good way to see context window degradation empirically.

I work on AI platform engineering at Adamo Software, where we build custom AI systems for travel, healthcare, and enterprise clients.

Why travel search is harder than eCommerce search: The technical differences most developers miss

Adamo Software — Tue, 14 Apr 2026 02:54:30 +0000

If you have built search for an e-commerce product and then moved to a travel project, you know the feeling. Everything you assumed about search breaks within the first week. The data model is different. The freshness requirements are different. The query patterns are different. Even the definition of "in stock" is different.

This is not a difficulty ranking. E-commerce search has its own hard problems. But the two domains diverge in ways that catch experienced developers off guard, and understanding those differences before you start building saves weeks of rework.

Inventory is perishable, not stockable

In e-commerce, a product either exists in the warehouse or it does not. A pair of shoes in size 42 is available until someone buys the last pair. The inventory state changes when a purchase happens. Between purchases, the state is stable. You can cache product availability for minutes or even hours without causing problems.

In travel, inventory expires whether anyone buys it or not. A hotel room on April 15th ceases to exist on April 16th. A flight seat for the 9am departure is worthless at 9:01am. This means every search result has a built-in expiration timestamp that has nothing to do with purchase activity.

The technical consequence: your caching strategy needs to account for time-based expiration, not just event-based invalidation. A hotel room that was available 30 minutes ago might still be available (nobody booked it) or might be gone (someone booked it on another channel). You cannot know without checking the source system. E-commerce developers used to comfortable cache TTLs discover that travel search requires either very short TTLs or real-time availability checks on every result, both of which have cost and latency implications.
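One way to express time-based expiration, as a minimal sketch: a cached result stays valid until either a short TTL elapses or the inventory itself expires, whichever comes first. The `MAX_TTL` value and both function names are illustrative assumptions.

```python
from datetime import datetime, timedelta

MAX_TTL = timedelta(minutes=2)  # hypothetical short TTL for availability data


def cache_expiry(fetched_at: datetime, inventory_expires_at: datetime) -> datetime:
    """A cached result dies at the TTL or when the inventory itself expires."""
    return min(fetched_at + MAX_TTL, inventory_expires_at)


def is_fresh(now: datetime, fetched_at: datetime, inventory_expires_at: datetime) -> bool:
    """True while the cached availability may still be served."""
    return now < cache_expiry(fetched_at, inventory_expires_at)
```

A result fetched 30 seconds before a 9:00 departure stops being served at 9:00, even though the TTL alone would have kept it alive longer. This covers only the time dimension; a booking on another channel still requires a check against the source system.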

One product, many sources of truth

In e-commerce, your product catalog is your source of truth. You control it. When you update a price or mark something out of stock, the change propagates through your system. There is one database, one version of the truth.

In travel, the same hotel room is sold simultaneously through the hotel's own website, Booking.com, Expedia, Agoda, and three other OTAs. Each channel has its own cached version of availability. The hotel's Property Management System (PMS) is the theoretical source of truth, but updates to the PMS propagate to each channel at different speeds through different APIs.

When a developer queries availability for "hotels in Da Nang, July 15-18," the response depends on which system they queried, when they queried it, and how recently that system synced with the PMS. Two queries made 30 seconds apart can return different results, not because availability changed, but because the cache refresh cycle hit between them.

In e-commerce, if your search returns a product, the customer can almost certainly buy it. In travel, if your search returns a room, the customer might click through and discover it was booked on another channel 45 seconds ago. This is why travel platforms need a confirmation step that re-checks availability at the moment of booking, a pattern that e-commerce checkout rarely requires.

Price is a function, not a field

In e-commerce, a product has a price. It might change during a sale or promotion, but at any given moment, the price is a stored value in a database column. You query it, you display it.

In travel, price is computed at query time based on a combination of factors: the dates selected, the number of guests, the room type, the cancellation policy chosen, the customer's loyalty tier, the time of day, demand levels, competitor pricing, and sometimes even the customer's country of origin (due to regional pricing agreements).

A single hotel room does not have "a price." It has a pricing function that returns different values depending on the parameters. This means travel search cannot simply index prices in Elasticsearch the way e-commerce search indexes product prices. You either pre-compute prices for common date and occupancy combinations (expensive in storage, complex to keep fresh) or you compute prices at search time by calling the supplier's pricing API (expensive in latency, subject to rate limits).

Most travel search systems use a hybrid: pre-computed base rates for display in search results, with real-time pricing API calls when the user clicks through to a specific property. The mismatch between these two numbers is a constant source of user frustration ("the price changed when I clicked") and engineering headaches.
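To illustrate "price is a function," here is a toy pricing function. Every rule in it (the weekend uplift, the extra-guest fee, the refundable surcharge) is invented for illustration and does not describe any real supplier's pricing; the point is only that the same room yields different totals for different parameters.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal


@dataclass(frozen=True)
class RateRequest:
    check_in: date
    check_out: date
    guests: int
    refundable: bool


def quote(base_rate: Decimal, req: RateRequest) -> Decimal:
    """Toy pricing function: base rate adjusted by weekend, occupancy, and policy."""
    total = Decimal('0')
    night = req.check_in
    while night < req.check_out:
        nightly = base_rate
        if night.weekday() in (4, 5):  # Friday and Saturday nights (made-up uplift)
            nightly *= Decimal('1.20')
        total += nightly
        night += timedelta(days=1)
    nights = (req.check_out - req.check_in).days
    if req.guests > 2:  # made-up fee per extra guest per night
        total += Decimal('15') * (req.guests - 2) * nights
    if req.refundable:  # made-up surcharge for a flexible cancellation policy
        total *= Decimal('1.10')
    return total.quantize(Decimal('0.01'))
```

Because the result depends on four inputs at once, indexing a single price per room cannot represent it, which is why the pre-compute versus compute-at-search-time trade-off exists.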

Queries are multi-dimensional by default

An e-commerce search query is typically a keyword with optional filters. "Running shoes, size 42, black, under $100." The search engine matches keywords against product attributes and applies filters. The core operation is text matching plus filtering.

A travel search query always involves at least three dimensions that interact with each other: location, dates, and occupancy. "Hotels in Hoi An, July 15-18, 2 adults 1 child." These three dimensions are not independent filters. The location determines which properties exist. The dates determine which of those properties have availability. The occupancy determines which room types within available properties can accommodate the guests.

In e-commerce, you can filter sequentially: find all shoes, then filter by size, then by color. In travel, you need to evaluate all three dimensions simultaneously because a property might have availability for 2 adults on July 15-17 but not July 17-18, or it might have a room for 2 adults but not 2 adults plus 1 child on those specific dates.

This multi-dimensional constraint satisfaction is why travel search is computationally heavier than e-commerce search at the query level, and why GDS (Global Distribution System) APIs are notoriously slow. They are doing constraint satisfaction across millions of inventory records, not keyword matching.
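A minimal sketch of the simultaneous check, assuming availability is stored per room type per night (the `RoomNight` structure and `stay_is_bookable` are hypothetical names, not a GDS API):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RoomNight:
    night: date
    rooms_left: int
    max_adults: int
    max_children: int


def stay_is_bookable(nights: list, check_in: date, check_out: date,
                     adults: int, children: int) -> bool:
    """True only if every night of the stay has a room that fits the party."""
    by_night = {n.night: n for n in nights}
    current = check_in
    while current < check_out:
        slot = by_night.get(current)
        if (slot is None or slot.rooms_left < 1
                or slot.max_adults < adults or slot.max_children < children):
            return False
        current += timedelta(days=1)
    return True
```

Dates and occupancy are evaluated together for each night, so a room that fits the party on July 15 and 16 but is sold out on July 17 fails a July 15-18 search while passing a July 15-17 one.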

Sorting is subjective and context-dependent

E-commerce search has well-understood sorting defaults: relevance, price low to high, price high to low, bestselling, newest. These are straightforward to implement because the sorting criteria are attributes stored on the product.

Travel search sorting is more ambiguous. "Best" for a business traveler means close to the meeting location, with fast WiFi and a desk. "Best" for a family means close to the beach, with a pool and a kids' club. "Best" for a budget backpacker means cheapest per night with decent reviews. The same inventory, the same dates, completely different optimal orderings.

This is why personalization has a larger impact on conversion in travel search than in e-commerce search. In e-commerce, sorting by "bestselling" is a reasonable default that serves most users. In travel, there is no universal default that works. The platform either invests in personalized ranking or accepts that search results will feel generic to most users.

What this means if you are building travel search

If you are moving from e-commerce to travel development, these are the adjustments that will save you the most time:

Rethink your caching strategy. E-commerce caching patterns (cache for 5-15 minutes, invalidate on purchase events) do not transfer. Travel search needs either much shorter TTLs or a two-phase approach: cached results for browsing, real-time confirmation at booking time. Budget for higher infrastructure costs.

Separate "display price" from "booking price." Accept that the price shown in search results will sometimes differ from the price at checkout. Build the UX to handle this gracefully (price change notifications, rate locks) rather than trying to eliminate the mismatch entirely. Eliminating it is prohibitively expensive at scale.

Index availability windows, not availability states. Instead of a boolean "available: true/false," store date ranges with room type and occupancy constraints. Your search index needs to answer "is this property available for these specific dates and this specific guest configuration?" not just "is this property available?"

Plan for multi-source data reconciliation from day one. If you are aggregating inventory from multiple suppliers or OTAs, build a normalization layer that maps different data formats into a unified schema before it hits your search index. Do not let supplier-specific data structures leak into your search logic.

Build the confirmation step into your booking flow. Unlike e-commerce where "add to cart" is low-stakes, travel search results go stale fast. The availability check at booking time is not optional. Design your UX and your API around the assumption that some results will be unavailable by the time the user clicks "book."

Built by the engineering team at Adamo Software. We build custom platforms for travel, healthcare, and enterprise applications.

Building an AI fallback system: when to use GPT-4o, when to fall back to Haiku, when to skip the LLM entirely

Adamo Software — Tue, 07 Apr 2026 03:40:23 +0000

Not every query deserves a frontier model. A user asking "what is your cancellation policy?" does not need GPT-4o to generate the answer. A rules engine or a simple database lookup handles it in 5 milliseconds at zero token cost.

We learned this the hard way. Our first production deployment sent everything through GPT-4o. The quality was great. The bill was $7,200/month for a feature that should have cost $2,000. Worse, 60% of those queries were simple enough that a smaller model (or no model at all) would have produced identical output.

This article covers the three-tier fallback system we built: a rules engine for deterministic queries, a cheap model (Claude Haiku) for simple generation, and a frontier model (GPT-4o) for complex reasoning. Stack: Node.js 20, TypeScript.

The three tiers

Here is the routing logic:

Incoming query
    ↓
┌─────────────────────┐
│  Tier 0: Rules      │  → deterministic lookup, no LLM
│  (FAQ, status, data)│     cost: $0, latency: <10ms
└─────────┬───────────┘
          ↓ not matched
┌─────────────────────┐
│  Tier 1: Haiku      │  → simple generation
│  (summaries, format)│     cost: $1/$5 per 1M tokens
└─────────┬───────────┘
          ↓ quality check fails
┌─────────────────────┐
│  Tier 2: GPT-4o     │  → complex reasoning
│  (analysis, compare)│     cost: $2.50/$10 per 1M tokens
└─────────────────────┘

The classifier that decides the tier is itself a cheap LLM call. We use GPT-4o-mini with a short system prompt. The classification costs roughly $0.0001 per request, which is negligible.

Tier 0: skip the LLM entirely

This is the highest-ROI tier because it costs nothing. Before any query hits an LLM, we check if it matches a deterministic pattern:

interface TierZeroRule {
  patterns: RegExp[];
  handler: (query: string, context: any) => string;
}

const tierZeroRules: TierZeroRule[] = [
  {
    // FAQ lookups
    patterns: [
      /cancellation\s*policy/i,
      /refund\s*policy/i,
      /check.?in\s*time/i,
      /check.?out\s*time/i,
    ],
    handler: (query, context) => {
      return faqDatabase.findBestMatch(query);
    },
  },
  {
    // Booking status (pure data lookup)
    patterns: [
      /booking\s*(status|confirmation)/i,
      /order\s*#?\d+/i,
    ],
    handler: (query, context) => {
      const bookingId = extractBookingId(query);
      return bookingService.getStatus(bookingId);
    },
  },
  {
    // Price checks (structured data, no generation needed)
    patterns: [
      /how much.*(cost|price)/i,
      /price\s*(for|of)/i,
    ],
    handler: (query, context) => {
      return pricingService.lookup(query, context);
    },
  },
];

function tryTierZero(query: string, context: any): string | null {
  for (const rule of tierZeroRules) {
    if (rule.patterns.some(p => p.test(query))) {
      return rule.handler(query, context);
    }
  }
  return null; // no match, proceed to LLM tiers
}

In our system, Tier 0 catches 22% of all queries. Those are 22% of queries that never touch an LLM, never cost a token, and return in under 10 milliseconds.

The key insight: do not overthink the pattern matching. Simple regex works. If a query contains "cancellation policy", you know the answer. You do not need embeddings or a classifier for this.

The classifier: deciding between Tier 1 and Tier 2

For queries that Tier 0 does not catch, we need to decide: cheap model or expensive model? We use a lightweight classifier:

async function classifyComplexity(query: string): Promise<'simple' | 'complex'> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Classify the user query as SIMPLE or COMPLEX.
SIMPLE: factual questions, summaries, formatting, translation, single-step tasks.
COMPLEX: comparisons, multi-step reasoning, analysis, recommendations with tradeoffs, ambiguous intent.
Reply with one word only.`,
      },
      { role: 'user', content: query },
    ],
    max_tokens: 5, // room for the whole label; a 1-token cap can truncate "COMPLEX"
    temperature: 0,
  });

  const classification = response.choices[0].message.content?.trim().toUpperCase();
  return classification === 'COMPLEX' ? 'complex' : 'simple';
}

This adds about 200ms latency and costs $0.0001 per call. At 10,000 queries/day, the classifier costs $1/day total. The savings from routing 60% of queries to Haiku instead of GPT-4o are roughly $150/day.
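The classifier cost is easy to check by hand. Here is a minimal sketch in Python using only the figures quoted in this article. The function name is illustrative, and the second call assumes the classifier only sees the 78% of traffic that Tier 0 does not catch.

```python
def daily_classifier_cost(queries_per_day: int, cost_per_call: float) -> float:
    """Daily spend on the classification call alone."""
    return queries_per_day * cost_per_call


# Upper bound: every one of the 10,000 daily queries is classified.
print(f"${daily_classifier_cost(10_000, 0.0001):.2f}/day")

# Assumption: Tier 0 removes 22% of queries before the classifier runs,
# so only 7,800 of 10,000 queries reach it.
print(f"${daily_classifier_cost(7_800, 0.0001):.2f}/day")
```

Either way the classifier stays around a dollar a day, which is why its cost is treated as negligible next to the routing savings.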

Tier 1 and Tier 2: the fallback chain

Here is where the fallback logic lives. For queries classified as simple, we call Tier 1 first. If the response fails a quality check, we escalate to Tier 2:

async function processQuery(query: string, context: any) {
  // Tier 0: deterministic
  const tierZeroResult = tryTierZero(query, context);
  if (tierZeroResult) {
    return { response: tierZeroResult, tier: 0, cost: 0 };
  }

  // Classify
  const complexity = await classifyComplexity(query);

  if (complexity === 'simple') {
    // Tier 1: Haiku
    const haiku = await callModel('claude-haiku', query, context);

    if (passesQualityCheck(haiku, query)) {
      return { response: haiku, tier: 1, cost: haiku.tokenCost };
    }

    // Fallback to Tier 2 if Haiku output is low quality
    const gpt4o = await callModel('gpt-4o', query, context);
    return { response: gpt4o, tier: 2, cost: haiku.tokenCost + gpt4o.tokenCost };
  }

  // Complex queries go straight to Tier 2
  const gpt4o = await callModel('gpt-4o', query, context);
  return { response: gpt4o, tier: 2, cost: gpt4o.tokenCost };
}

The quality check is the most important function in the entire system. A bad quality check either wastes money (escalating when unnecessary) or serves bad responses (not escalating when needed).

Our quality check is simple:

function passesQualityCheck(response: ModelResponse, query: string): boolean {
  // Check 1: response is not empty or too short
  if (!response.text || response.text.length < 20) return false;

  // Check 2: model did not refuse or hedge excessively
  const hedgePatterns = [
    /i'm not sure/i,
    /i don't have.*information/i,
    /i cannot.*determine/i,
  ];
  if (hedgePatterns.some(p => p.test(response.text))) return false;

  // Check 3: response addresses the query topic
  // (simple keyword overlap check, not semantic)
  const queryKeywords = extractKeywords(query);
  const responseKeywords = extractKeywords(response.text);
  const overlap = queryKeywords.filter(k => responseKeywords.includes(k));
  if (overlap.length < queryKeywords.length * 0.3) return false;

  return true;
}

We intentionally kept this rule-based, not LLM-based. Using another LLM call to check quality would add latency and cost, which would defeat the purpose.
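The quality check above relies on an extractKeywords helper that is not shown. Here is a rough sketch of what such a helper and the overlap test could look like, written in Python for brevity. The stopword list, the minimum token length, and the function names are all assumptions for illustration, not the production implementation.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "and", "for", "are", "what", "how", "can", "you",
    "your", "with", "that", "this", "from", "does",
}


def extract_keywords(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


def keyword_overlap_ok(query: str, response: str, threshold: float = 0.3) -> bool:
    """True when at least `threshold` of the query keywords appear in the response."""
    query_kw = extract_keywords(query)
    if not query_kw:
        # Nothing to compare against; do not escalate on this check alone.
        return True
    overlap = query_kw & extract_keywords(response)
    return len(overlap) >= len(query_kw) * threshold
```

Set intersection keeps the check cheap and deterministic, which matches the goal of avoiding a second model call. It will not catch a response that reuses the query's words but answers the wrong question.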

Results after 6 weeks

Metric	Before (GPT-4o only)	After (3-tier)
Monthly LLM cost	$7,200	$2,100
Avg response latency	1.8s	0.6s
Tier 0 (no LLM)	0%	22%
Tier 1 (Haiku)	0%	51%
Tier 2 (GPT-4o)	100%	27%
Fallback rate (Tier 1→2)	n/a	4.2%
User satisfaction (CSAT)	4.3/5	4.2/5

The CSAT dropped by 0.1 points. We consider that acceptable for a 71% cost reduction. The drop appears to come from Tier 1 queries where Haiku's response passed the quality check and was served, but was slightly less thorough than what GPT-4o would have produced (the 4.2% fallback rate covers only the responses the check rejected). The quality check catches the obvious failures, but subtle quality differences slip through.

What we would do differently

Track cost per tier from day one. We only started logging which tier handled each query after week 2. Those first two weeks of data were lost, making it harder to benchmark improvements.

Start with a stricter quality check and loosen it. Our first quality check was too lenient. It let through some Haiku responses that should have been escalated. We tightened the keyword overlap threshold from 0.2 to 0.3, which increased the fallback rate from 2.1% to 4.2% but eliminated the worst-quality responses.

Consider a Tier 1.5. We now think there is room for a middle tier between Haiku ($1/$5 per 1M tokens) and GPT-4o ($2.50/$10). Something like GPT-4o-mini or Claude Sonnet for queries that are too complex for Haiku but do not need frontier-level reasoning. We are testing this now.

Wrapping up

The single biggest cost optimization in our LLM stack was not caching, not prompt compression, not fine-tuning. It was routing queries to the right tier. 22% of queries never needed an LLM. 51% needed only a cheap model. Only 27% actually required GPT-4o.

If you are running everything through a single frontier model, start by logging your queries for a week and categorizing them by complexity. You will likely find that the majority do not need frontier reasoning. Build Tier 0 first (it is free), add a classifier, and let the data tell you where the boundaries should be.

Built by the engineering team at Adamo Software. We build AI-powered platforms for travel, healthcare, and enterprise applications.

Building a real-time travel search engine: lessons from integrating with GDS APIs

Adamo Software — Mon, 30 Mar 2026 06:48:09 +0000

If you have ever tried to integrate with a Global Distribution System API (Amadeus, Sabre, or Travelport), you know it is not like calling a typical REST endpoint. GDS APIs were architected decades ago, carry SOAP/XML legacy patterns even in their modern REST wrappers, and enforce aggressive rate limits that can break your search experience during a traffic spike. This article shares what we learned building a travel search service that queries multiple GDS providers in parallel, normalizes their responses into a unified schema, and caches results intelligently to stay within quota limits.

The project context

We were building a custom booking platform for a mid-size tour operator. The core requirement: let travelers search flights, hotels, and packages from a single search bar, with results aggregated from Amadeus Self-Service APIs, a Sabre REST endpoint, and two direct hotel suppliers via proprietary APIs. The stack was Node.js (Fastify) for the search service, Redis for caching, and Elasticsearch for indexing normalized inventory.

The naive approach would be to call each supplier sequentially, merge results, and return them. That gives you 3 to 5 second response times. Travelers leave after 2 seconds. So the real engineering challenge was not "can we connect to a GDS" but "can we make multi-supplier search feel instant."

Authentication: the part nobody warns you about

Every GDS has its own auth flow, and they all have quirks.

Amadeus Self-Service uses OAuth 2.0 with client credentials. You get an access token that expires in 30 minutes. Straightforward, except the token refresh has a subtle gotcha: if you refresh too aggressively under load, you can hit a secondary rate limit on the auth endpoint itself. We solved this by implementing a singleton token manager that refreshes proactively at the 25-minute mark, not on expiry.

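// Token manager with proactive refresh
class AmadeusTokenManager {
  constructor(clientId, clientSecret) {
    this.clientId = clientId;
    this.clientSecret = clientSecret;
    this.token = null;
    this.expiresAt = 0;
    this.refreshing = null; // in-flight refresh shared by concurrent callers
  }

  async getToken() {
    // Refresh 5 minutes before expiry
    if (Date.now() < this.expiresAt - 300000) {
      return this.token;
    }

    // Under load, many callers arrive here at once. Reuse one request
    // instead of hitting the auth endpoint once per caller.
    if (!this.refreshing) {
      this.refreshing = this.fetchToken().finally(() => {
        this.refreshing = null;
      });
    }
    return this.refreshing;
  }

  async fetchToken() {
    const res = await fetch(
      'https://api.amadeus.com/v1/security/oauth2/token',
      {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: new URLSearchParams({
          grant_type: 'client_credentials',
          client_id: this.clientId,
          client_secret: this.clientSecret,
        }),
      }
    );

    if (!res.ok) {
      throw new Error(`Amadeus auth failed: HTTP ${res.status}`);
    }

    const data = await res.json();
    this.token = data.access_token;
    // Amadeus tokens last 1799 seconds (~30 min)
    this.expiresAt = Date.now() + data.expires_in * 1000;
    return this.token;
  }
}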

Sabre uses a similar OAuth flow but requires a base64-encoded composite of client ID and secret in the authorization header, not in the body. It is a small difference that costs you an hour of debugging if you miss it.

The lesson: do not assume GDS auth flows are interchangeable. Read each provider's docs carefully, and build a provider-specific auth module from the start.

Parallel search with timeout control

The search service fires requests to all suppliers simultaneously using Promise.allSettled. This is critical. Promise.all would reject the entire search as soon as one supplier call failed. With allSettled, combined with a per-supplier timeout, we return whatever results come back within the timeout window.

async function searchAllSuppliers(query, timeoutMs = 2000) {
  const suppliers = [
    searchAmadeus(query),
    searchSabre(query),
    searchDirectHotelA(query),
    searchDirectHotelB(query),
  ];

  // Wrap each with a timeout
  const withTimeout = suppliers.map((p, i) =>
    Promise.race([
      p.then(res => ({ supplier: i, status: 'ok', data: res })),
      new Promise(resolve =>
        setTimeout(
          () => resolve({ supplier: i, status: 'timeout', data: [] }),
          timeoutMs
        )
      ),
    ])
  );

  const results = await Promise.allSettled(withTimeout);

  return results
    .filter(r => r.status === 'fulfilled' && r.value.status === 'ok')
    .flatMap(r => r.value.data);
}

We set the timeout at 2000ms. If Amadeus responds in 800ms and Sabre takes 3 seconds, the user sees Amadeus results immediately. We log the Sabre timeout and investigate later. In practice, Amadeus Self-Service APIs responded in 400 to 1200ms for flight searches. Sabre was more variable, ranging from 300ms to 2500ms depending on route complexity.
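For teams whose search service runs on Python, the same fan-out pattern can be sketched with asyncio. This is an illustration under assumptions, not a port of our service: the supplier callables and the fake suppliers below are hypothetical stand-ins.

```python
import asyncio
from typing import Awaitable, Callable

Supplier = Callable[[], Awaitable[list]]


async def search_all_suppliers(
    suppliers: dict[str, Supplier], timeout_s: float = 2.0
) -> list:
    """Query every supplier concurrently and keep only those that finish in time."""

    async def guarded(name: str, call: Supplier) -> list:
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            print(f"{name}: timed out after {timeout_s}s")
        except Exception as exc:  # a failing supplier must not sink the search
            print(f"{name}: failed with {exc!r}")
        return []

    batches = await asyncio.gather(*(guarded(n, c) for n, c in suppliers.items()))
    return [offer for batch in batches for offer in batch]


# Hypothetical stand-ins for real supplier calls.
async def fake_fast_supplier() -> list:
    await asyncio.sleep(0.01)
    return ["A1", "A2"]


async def fake_slow_supplier() -> list:
    await asyncio.sleep(1.0)
    return ["S1"]


async def fake_broken_supplier() -> list:
    raise RuntimeError("upstream 500")
```

One difference from the Promise.race version: asyncio.wait_for cancels the slow call when the timeout fires, so a late supplier response is discarded rather than left running in the background.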

The caching strategy that saved our quota

This is where most GDS integrations either succeed or blow their budget. Amadeus Self-Service provides a free monthly quota per API, with per-call charges of roughly $0.001 to $0.025 once you exceed it (Amadeus Pricing). At 10,000 searches per day, you exhaust free quotas fast.

Our caching strategy has three layers:

Layer 1: Exact-match cache (Redis, TTL 5 minutes). Same origin, destination, dates, passengers. This catches repeated searches from the same user session and from multiple users searching popular routes.

function buildCacheKey(query) {
  const { origin, destination, departDate, returnDate, pax } = query;
  return `search:${origin}:${destination}:${departDate}:${returnDate || 'oneway'}:${pax}`;
}

async function cachedSearch(query) {
  const key = buildCacheKey(query);
  const cached = await redis.get(key);

  if (cached) {
    return JSON.parse(cached);
  }

  const results = await searchAllSuppliers(query);
  // Cache for 5 minutes. Flight prices change,
  // but not every 30 seconds.
  await redis.setex(key, 300, JSON.stringify(results));
  return results;
}

Layer 2: Fuzzy-match cache (Redis, TTL 15 minutes). If a user searches HAN to NRT on June 15, and another searches HAN to NRT on June 14 or 16, the price structure is usually similar. We serve the cached result with a "prices may vary" indicator while firing a background refresh. This cut our API calls by roughly 40%.
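The neighbor-date lookup in Layer 2 can be sketched as follows, in Python with a plain dict standing in for Redis. The key layout, the one-day window, and the "approximate" flag are assumptions for illustration. The background refresh is left to the caller.

```python
from datetime import date, timedelta
from typing import Optional


def cache_key(origin: str, destination: str, depart: date) -> str:
    return f"search:{origin}:{destination}:{depart.isoformat()}"


def fuzzy_lookup(
    cache: dict,
    origin: str,
    destination: str,
    depart: date,
    window_days: int = 1,
) -> Optional[dict]:
    """Return an exact hit if present, else the nearest neighbor date, flagged as approximate."""
    exact = cache.get(cache_key(origin, destination, depart))
    if exact is not None:
        return {"results": exact, "approximate": False}

    # Walk outward one day at a time so the closest date wins.
    for offset in range(1, window_days + 1):
        for delta in (-offset, offset):
            neighbor_date = depart + timedelta(days=delta)
            neighbor = cache.get(cache_key(origin, destination, neighbor_date))
            if neighbor is not None:
                return {"results": neighbor, "approximate": True}
    return None
```

The approximate flag is what drives the "prices may vary" indicator. A caller that receives an approximate hit would also schedule the real supplier search for the exact date.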

Layer 3: Prewarming popular routes (cron job, every 30 minutes). We identified the top 50 route pairs from booking history and pre-fetched their availability on a schedule. This means the first search of the day for a popular route hits cache, not the GDS.

Combined, these three layers reduced our actual GDS API calls by approximately 65%, keeping us well within free quotas during the first months of operation.

Data normalization: the unglamorous but essential part

Each GDS returns data in wildly different formats. Amadeus returns a nested JSON structure with dictionaries for airline and location codes. Sabre returns a flatter structure but uses different field names and embeds fare rules differently. Direct hotel APIs return whatever the hotel decided to put in their XML feed.

We built a normalization layer with a simple interface:

// Each supplier implements this interface
function normalizeAmadeusFlightOffer(raw) {
  return {
    id: `amadeus_${raw.id}`,
    supplier: 'amadeus',
    type: 'flight',
    origin: raw.itineraries[0].segments[0].departure.iataCode,
    destination: raw.itineraries[0].segments.at(-1).arrival.iataCode,
    departureTime: raw.itineraries[0].segments[0].departure.at,
    arrivalTime: raw.itineraries[0].segments.at(-1).arrival.at,
    stops: raw.itineraries[0].segments.length - 1,
    airline: raw.validatingAirlineCodes[0],
    price: {
      amount: parseFloat(raw.price.grandTotal),
      currency: raw.price.currency,
    },
    cabinClass: raw.travelerPricings[0]
      .fareDetailsBySegment[0].cabin,
    rawSupplierData: raw, // Keep original for booking step
  };
}

The rawSupplierData field is important. When the user selects a result and proceeds to booking, the booking service needs the original supplier payload, not our normalized version. Normalizing for display and keeping raw data for transactions is a pattern that saved us from dozens of edge case bugs.

NDC support: plan for it now

If your platform handles flights, you need to account for NDC (New Distribution Capability). Airlines like Lufthansa, American Airlines, and Singapore Airlines increasingly distribute their best fares through NDC channels rather than traditional GDS. Amadeus supports NDC through their Flight Offers Search API, but the response structure has subtle differences from GDS results. Our normalization layer handles both, but it required explicit branching logic.

The practical impact: if you only integrate with traditional GDS channels, you may miss 15 to 30% of available fares on NDC-forward airlines. Build your normalizer to handle both from the start.

Lessons learned

1. GDS integration is 25 to 35% of total development effort. We underestimated it on our first project. Authentication, rate limiting, error handling, data normalization, and edge cases (codeshare flights, multi-city itineraries, mixed-cabin fares) consume far more time than the booking flow itself. Our team at Adamo Software now plans GDS integration as a dedicated workstream, not a subtask.

2. Cache aggressively, but with clear invalidation rules. Stale pricing leads to booking failures. We learned to set TTLs based on the product type: 5 minutes for flights (prices change frequently), 30 minutes for hotels (rates are more stable), and 2 hours for activities (rarely change intraday).

3. Never trust a GDS response blindly. Validate prices, check that segments connect logically, and verify that the fare class is actually bookable before showing it to the user. We encountered phantom availability, where the GDS reports seats available but the airline rejects the booking, roughly 2 to 3% of the time on certain carriers.

4. Build supplier-agnostic from day one. Your search interface, result schema, and booking flow should not know or care which GDS provided the data. When we added a fourth supplier six months later, it took two days instead of two weeks because the architecture supported it.

5. Monitor your look-to-book ratio. GDS providers watch this metric. If you are making thousands of search calls but very few bookings, your pricing terms can change. Our ratio stabilized at around 25:1 (25 searches per booking) after implementing the caching layers.
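Lesson 3 mentions checking that segments connect logically. Here is a minimal sketch of that check in Python, assuming segments shaped like the Amadeus payload used in the normalizer (departure/arrival objects with iataCode and an ISO timestamp in at). The 30-minute minimum layover is an illustrative assumption, not a rule we are quoting from any provider.

```python
from datetime import datetime


def segments_connect(segments: list[dict], min_layover_minutes: int = 30) -> bool:
    """Each segment must depart from where the previous one arrived, after a minimum layover."""
    for prev, nxt in zip(segments, segments[1:]):
        # Airport continuity: arrive at X, depart from X.
        if prev["arrival"]["iataCode"] != nxt["departure"]["iataCode"]:
            return False
        # Time continuity: next departure must leave enough connection time.
        arrived = datetime.fromisoformat(prev["arrival"]["at"])
        departs = datetime.fromisoformat(nxt["departure"]["at"])
        if (departs - arrived).total_seconds() < min_layover_minutes * 60:
            return False
    return True
```

A single-segment itinerary passes trivially. Offers that fail the check can be dropped before display or logged for supplier follow-up.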

Wrapping up

Building a real-time travel search engine is less about connecting to an API and more about engineering around its constraints. Rate limits, response format inconsistencies, authentication quirks, and NDC fragmentation are the real challenges. The caching architecture alone saved us from blowing our API budget and delivered sub-second response times to users. If you are building an AI-powered travel booking platform and planning GDS integration, invest your time in the data layer and caching strategy first. The API calls themselves are the easy part.

How we designed an AI Agent workflow with fallback chains and human-in-the-loop

Adamo Software — Mon, 23 Mar 2026 03:40:48 +0000

If you've shipped an AI agent to production, you already know the uncomfortable truth: the demo works great, but real users find every edge case your prompt didn't anticipate. We ran into this exact problem when building an internal document processing agent for a healthcare client. The agent worked fine 85% of the time. The other 15% ranged from "slightly wrong" to "confidently hallucinated a patient ID that doesn't exist."

This post walks through the fallback architecture we built to handle those failures gracefully, without turning every request into a human review bottleneck.

The problem with linear agent workflows

Our first version was straightforward: user uploads a document, the LLM extracts structured fields, validates against a schema, and writes to the database. A single chain, no branching logic.

The failure math killed us. If each step in a 5-step workflow has 90% reliability, your end-to-end success rate drops to about 59%. Add more steps, and it gets worse fast.
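The compounding is easy to verify. A short sketch, assuming each step fails independently of the others (real pipelines often have correlated failures, so treat this as a rough model):

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


for n in (3, 5, 10):
    print(f"{n} steps at 90%: {end_to_end_success(0.9, n):.1%}")
# 3 steps at 90%: 72.9%
# 5 steps at 90%: 59.0%
# 10 steps at 90%: 34.9%
```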

We needed a system where failures at any step could be caught, rerouted, and resolved without restarting the entire pipeline.

The fallback chain pattern

Instead of a single LLM call per step, we implemented a tiered fallback chain. The concept is simple: try the primary approach, and if confidence drops below a threshold, cascade to the next option.

Here's the core logic in Python:

from dataclasses import dataclass
from typing import Any, Optional
import logging

logger = logging.getLogger(__name__)

@dataclass
class AgentResult:
    output: Any
    confidence: float
    model_used: str
    fallback_triggered: bool = False

class FallbackChain:
    def __init__(self, confidence_threshold: float = 0.7):
        self.threshold = confidence_threshold
        self.chain = []

    def add_handler(self, name: str, handler, min_confidence: float = 0.0):
        self.chain.append({
            "name": name,
            "handler": handler,
            "min_confidence": min_confidence
        })
        return self

    async def execute(self, input_data: dict) -> AgentResult:
        for i, step in enumerate(self.chain):
            # A step may demand more than the chain-wide threshold
            required = max(self.threshold, step["min_confidence"])
            try:
                result = await step["handler"](input_data)

                if result.confidence >= required:
                    result.fallback_triggered = (i > 0)
                    logger.info(
                        f"Step '{step['name']}' succeeded "
                        f"(confidence: {result.confidence:.2f})"
                    )
                    return result

                logger.warning(
                    f"Step '{step['name']}' below threshold "
                    f"({result.confidence:.2f} < {required})"
                )

            except Exception as e:
                logger.error(f"Step '{step['name']}' failed: {e}")
                continue

        # All automated steps failed, escalate to human
        return AgentResult(
            output=None,
            confidence=0.0,
            model_used="human_escalation",
            fallback_triggered=True
        )

We typically set up three tiers:

Primary model (e.g., GPT-4o or Claude) with a specialized prompt. Fast, cost-effective for straightforward cases.
Enhanced model with additional context injection. We pull in RAG-retrieved examples of similar documents and few-shot them into the prompt.
Human escalation. The request lands in a review queue with full context: the original input, what each model attempted, and where confidence dropped.

# Setting up the chain for document extraction
extraction_chain = FallbackChain(confidence_threshold=0.75)

extraction_chain.add_handler(
    name="primary_extraction",
    handler=primary_llm_extract
)
extraction_chain.add_handler(
    name="enhanced_extraction_with_rag",
    handler=rag_enhanced_extract
)
extraction_chain.add_handler(
    name="human_review",
    handler=queue_for_human_review
)

Confidence scoring: the hard part

The fallback chain is useless without reliable confidence signals. LLM token probabilities alone are not enough. A model can be confidently wrong. Anthropic published a practical guide on evaluating agent outputs that covers calibration in more depth.

We use a composite confidence score built from three signals:

def compute_confidence(
    llm_output: dict,
    schema: dict,
    historical_outputs: list[dict]
) -> float:
    # 1. Schema compliance: does the output match expected types/formats?
    schema_score = validate_against_schema(llm_output, schema)

    # 2. Self-consistency: agreement between this output and the other
    #    runs of the same input (passed in as historical_outputs)
    consistency_score = measure_output_consistency(
        llm_output, historical_outputs
    )

    # 3. Field-level heuristics: known patterns for dates, IDs, codes
    heuristic_score = run_field_heuristics(llm_output)

    # Weighted combination
    return (
        0.3 * schema_score 
        + 0.4 * consistency_score 
        + 0.3 * heuristic_score
    )

Schema compliance catches obvious failures like missing required fields or wrong data types. Self-consistency catches the subtler ones. If you run the same extraction three times and get three different patient names, something is off.
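One way to turn "three runs, how much do they agree" into a number is per-field majority agreement. The sketch below is an assumption about how such a score could be computed. The function names are illustrative, and it is not the measure_output_consistency helper referenced above.

```python
from collections import Counter


def field_agreement(outputs: list[dict], field: str) -> float:
    """Share of runs that agree with the most common value for one field."""
    # repr() makes unhashable values (lists, dicts) countable.
    values = [repr(o.get(field)) for o in outputs]
    if not values:
        return 0.0
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def consistency_score(outputs: list[dict]) -> float:
    """Average per-field agreement across all fields seen in any run."""
    fields = sorted({key for o in outputs for key in o})
    if not fields:
        return 0.0
    return sum(field_agreement(outputs, f) for f in fields) / len(fields)
```

With three runs, a field where all runs disagree scores 1/3 and a field where all agree scores 1.0. A missing field counts as a distinct value, so runs that drop a field also pull the score down.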

The heuristic layer handles domain-specific validation. In healthcare, that means checking date formats, verifying that ICD codes match known patterns, and flagging values that fall outside clinical ranges.
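A minimal sketch of what such field heuristics might look like. The field names, the ISO date format, and the heart-rate range are hypothetical. The ICD pattern only approximates the general shape of an ICD-10-CM code and does not verify that the code exists.

```python
import re
from datetime import datetime

# Approximate ICD-10-CM shape: letter, digit, alphanumeric,
# then an optional dot followed by 1 to 4 alphanumerics.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def heuristic_score(fields: dict) -> float:
    """Fraction of applicable checks that pass (1.0 when no check applies)."""
    checks = []
    if "visit_date" in fields:
        checks.append(is_iso_date(fields["visit_date"]))
    if "icd_code" in fields:
        checks.append(bool(ICD10_SHAPE.match(fields["icd_code"])))
    if "heart_rate_bpm" in fields:
        # Hypothetical plausibility range, not a clinical reference.
        checks.append(20 <= fields["heart_rate_bpm"] <= 250)
    return sum(checks) / len(checks) if checks else 1.0
```

Shape checks like these are cheap and catch malformed values. Confirming that a code or ID actually exists still requires a lookup against the authoritative source.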

Where human-in-the-loop actually fits

The biggest mistake we made early on was treating HITL as a binary switch: either the agent handles it or a human does. In practice, you need multiple levels of human involvement.

We settled on three escalation tiers:

Tier 1: Async review. The agent completed the task but confidence was borderline (0.6 to 0.75). A human reviewer sees the output alongside the original document and either approves or corrects it. This handles about 10% of requests and adds 2 to 4 hours of latency, which was acceptable for our use case.

Tier 2: Real-time intervention. Confidence dropped below 0.6, or the agent hit a known ambiguity pattern (e.g., handwritten notes, poor scan quality). The workflow pauses, and the request routes to an available specialist through a Slack notification. We used LangGraph's interrupt() pattern for this:

from langgraph.types import interrupt

async def extraction_node(state: dict) -> dict:
    result = await extraction_chain.execute(state["document"])

    if result.model_used == "human_escalation":
        # Pause the workflow and wait for human input
        human_response = interrupt({
            "reason": "Low confidence extraction",
            "document_id": state["document_id"],
            "attempted_output": result.output,
            "confidence": result.confidence
        })
        return {"extracted_data": human_response}

    return {"extracted_data": result.output}

For a step-by-step walkthrough of interrupts and commands in LangGraph, this tutorial covers the basics well.

Tier 3: Full manual processing. The document type is entirely outside the agent's training distribution. This happens maybe 2% of the time. The system logs the case as a training candidate for future model improvement.
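The three tiers can be expressed as a small routing function. This sketch uses the thresholds described above (0.75 and 0.6). The enum, the function name, and the handling of values exactly on a boundary are assumptions.

```python
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"
    ASYNC_REVIEW = "async_review"                    # Tier 1
    REALTIME_INTERVENTION = "realtime_intervention"  # Tier 2
    MANUAL_PROCESSING = "manual_processing"          # Tier 3


def route_result(
    confidence: float,
    known_ambiguity: bool = False,
    supported_doc_type: bool = True,
) -> Route:
    """Pick the level of human involvement for one extraction result."""
    if not supported_doc_type:
        return Route.MANUAL_PROCESSING
    if known_ambiguity or confidence < 0.6:
        return Route.REALTIME_INTERVENTION
    if confidence < 0.75:
        return Route.ASYNC_REVIEW
    return Route.AUTO_ACCEPT
```

Keeping the routing in one pure function makes the thresholds easy to tune and to unit test, separately from the queueing and notification code.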

Circuit breakers for cascading failures

One thing we learned the hard way: when an upstream model starts degrading (rate limits, API instability, model drift), it can poison every downstream step. A hallucinated field in step 1 becomes a corrupted database entry by step 4.

We added circuit breakers that monitor rolling error rates per step:

import time

class CircuitBreaker:
    # Simplified: counts failures since the last success, not a true rolling rate
    def __init__(self, failure_threshold: int = 5, reset_timeout: int = 60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed = normal, open = blocking
        self.last_failure_time = None

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "open"
            logger.critical("Circuit breaker OPEN. Routing to fallback.")

    def record_success(self):
        # A successful call (including a half-open trial) closes the circuit
        self.failure_count = 0
        self.state = "closed"

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        # Check if enough time has passed to retry
        if time.time() - self.last_failure_time > self.reset_timeout:
            self.state = "half-open"
            return True
        return False

When the circuit opens, all requests for that step skip directly to the next fallback tier. This prevents a degraded model from wasting tokens and time on requests it's going to fail anyway.

What we measured

After running this architecture for three months on the healthcare document processing pipeline:

End-to-end accuracy went from 85% to 97.3%
Average latency increased by 400ms for the primary path (acceptable)
Human review volume dropped from ~30% of all documents to ~12%, because the enhanced RAG fallback caught most borderline cases
Zero hallucinated patient IDs made it to the database (previously ~2 per week)

The biggest win was not the accuracy improvement itself. It was the fact that we could now quantify exactly where the system was failing and allocate human attention to the cases that actually needed it.

Key takeaways

Design for failure from day one. If your agent workflow has no fallback path, you're building a demo, not a production system.
Confidence scoring needs multiple signals. Token probabilities are not enough. Combine schema validation, self-consistency checks, and domain heuristics.
HITL is a spectrum, not a switch. Different confidence levels should trigger different levels of human involvement. Not every edge case needs real-time intervention.
Monitor the monitors. Circuit breakers and rolling error rates prevent cascading failures from eating your entire pipeline.

Wrapping up

Building reliable AI agent workflows is less about picking the right model and more about designing the right failure modes. The fallback chain pattern gave us a structured way to degrade gracefully, and the tiered HITL approach kept humans involved where they add value without turning them into full-time babysitters for the AI.

If you want to dive deeper into interrupt mechanics, the LangGraph team wrote a solid overview of the pattern.

If you're building something similar, start with the confidence scoring. Everything else follows from having a reliable signal for "how much should I trust this output."

I'm a software engineer at Adamo Software, where we build custom AI and healthcare platforms for enterprise clients.

How we reduced AI inference costs by 60% without sacrificing accuracy

Adamo Software — Tue, 17 Mar 2026 06:39:42 +0000

Running ML models in production is expensive. When we deployed a document classification pipeline for a fintech client last year, our inference costs hit $12,000/month within the first quarter. The models were accurate, but the economics did not scale. Over 4 months, we brought that number down to $4,500/month while keeping accuracy above 95%. Here is exactly how we did it.

The starting point

The client needed to classify and extract data from financial documents: invoices, bank statements, tax forms, and contracts. We built a pipeline using a fine-tuned BERT model for classification and a GPT-based model for entity extraction.

The stack:

Classification: Fine-tuned BERT-large (340M params) on AWS SageMaker
Extraction: GPT-4 API calls for structured data extraction
Volume: ~50,000 documents/month
Infra: SageMaker real-time endpoints, always-on

It worked well functionally. But the cost breakdown was brutal:

SageMaker endpoints (24/7):    $4,200/month
GPT-4 API calls:               $6,800/month
S3 + data transfer:            $1,000/month
Total:                         $12,000/month

Step 1: Model distillation for classification

BERT-large was overkill for our classification task. We had 12 document categories, and after analyzing confusion matrices, most categories were clearly separable.

We distilled BERT-large into a DistilBERT model (66M params) using the standard knowledge distillation approach:

from transformers import (
    DistilBertForSequenceClassification,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
import torch
import torch.nn.functional as F

class DistillationTrainer(Trainer):
    def __init__(self, teacher_model, temperature=4.0, alpha=0.5, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)
            teacher_logits = teacher_outputs.logits

        # Soft target loss
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        # Hard target loss
        hard_loss = F.cross_entropy(student_logits, inputs["labels"])

        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return (loss, outputs) if return_outputs else loss
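To see why the trainer divides the logits by a temperature, it helps to look at what that does to the teacher's distribution. This is a small dependency-free sketch. The logits are made up, and 4.0 matches the default temperature in the trainer above.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over logits scaled by 1/temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0, 0.0]  # illustrative values only
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

At temperature 1 almost all probability mass sits on the top class. At temperature 4 the runner-up classes get visibly more weight, and that relative ranking is the extra signal the student learns from the soft-target term.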

Results after distillation:

Accuracy drop: 97.2% → 95.8% (acceptable for our use case)
Inference speed: 3.2x faster
Model size: 5.1x smaller
SageMaker cost: Could now run on ml.c5.xlarge instead of ml.g4dn.xlarge

This single change cut SageMaker costs from $4,200 to $1,400/month.

Step 2: Replace GPT-4 with targeted smaller models

GPT-4 was our biggest cost driver. We were sending full document text to GPT-4 for entity extraction, which was like using a sledgehammer to hang a picture frame.

We analyzed our extraction tasks and found three categories:

Structured fields (invoice numbers, dates, amounts): These follow predictable patterns
Semi-structured fields (line items, payment terms): Some variation but bounded
Unstructured fields (contract clauses, special conditions): Actually needs LLM reasoning

For category 1, we replaced GPT-4 with regex + a small NER model:

import re
from typing import Optional

INVOICE_PATTERNS = [
    r"(?:invoice|inv)[\s#.:]*([A-Z0-9-]{4,20})",
    r"(?:bill|receipt)[\s#.:]*([A-Z0-9-]{4,20})",
]

DATE_PATTERNS = [
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b",
    r"\b(\d{4}[/-]\d{1,2}[/-]\d{1,2})\b",
    r"\b(\w+ \d{1,2},? \d{4})\b",
]

def extract_structured_fields(text: str) -> dict:
    results = {}

    for pattern in INVOICE_PATTERNS:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            results["invoice_number"] = match.group(1)
            break

    for pattern in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            results["date"] = match.group(1)
            break

    # Amount extraction with currency handling
    amount_match = re.search(
        r"(?:total|amount|due)[\s:]*[\$€£]?\s*([\d,]+\.?\d*)",
        text, re.IGNORECASE
    )
    if amount_match:
        results["amount"] = amount_match.group(1).replace(",", "")

    return results

For category 2, we fine-tuned a smaller model (Llama 3 8B quantized to 4-bit) hosted on a single GPU instance.

For category 3 only (about 15% of documents), we kept GPT-4 but switched to GPT-4o-mini where possible.

The cost shift:

Before:
  GPT-4 for all docs:          $6,800/month

After:
  Regex + NER (category 1):    ~$0 (runs on existing infra)
  Llama 3 8B on g5.xlarge:     $900/month
  GPT-4o-mini (category 3):    $400/month
  Total extraction:            $1,300/month

Step 3: Batch processing and auto-scaling

The original setup ran SageMaker endpoints 24/7, but document uploads were heavily concentrated during business hours (8 AM to 6 PM local time). Nights and weekends had near-zero traffic.

We switched to:

Async inference endpoints with auto-scaling (min instances: 0, scale to demand)
Batch transform jobs for bulk uploads (client uploaded batches of 500+ documents every Monday)
Spot instances for batch jobs (70% cheaper than on-demand)

import boto3

# SageMaker async endpoint config with auto-scaling.
# endpoint_name is assumed to hold the name of the deployed async endpoint.
scaling_client = boto3.client("application-autoscaling")

scaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

scaling_client.put_scaling_policy(
    PolicyName="scale-on-queue-depth",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
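        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            # The backlog metric is published per endpoint, so scope it by name
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },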
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 60,
    },
)

This dropped SageMaker costs another $600/month by eliminating idle compute during off-hours.

Final numbers

                        Before      After       Savings
Classification:         $4,200      $800        -81%
Entity extraction:      $6,800      $1,300      -81%
Infrastructure:         $1,000      $400        -60%
Batch processing:       $0          $200        (new)
Monitoring/logging:     $0          $100        (new)
                        -------     -------
Total:                  $12,000     $2,800      -77%
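
The percentages in the table can be re-derived from the dollar figures. The short script below does only that arithmetic; the numbers are copied from the table and nothing else is assumed.

```python
# Monthly costs in USD, copied from the table above: (before, after).
costs = {
    "Classification": (4200, 800),
    "Entity extraction": (6800, 1300),
    "Infrastructure": (1000, 400),
    "Batch processing": (0, 200),
    "Monitoring/logging": (0, 100),
}

def reduction_pct(before: int, after: int) -> int:
    """Percentage reduction from before to after, rounded to a whole percent."""
    return round(100 * (before - after) / before)

total_before = sum(before for before, _ in costs.values())
total_after = sum(after for _, after in costs.values())

for name, (before, after) in costs.items():
    if before:  # rows that are new spend have no meaningful percentage
        print(f"{name}: -{reduction_pct(before, after)}%")
print(f"Total: ${total_before:,} -> ${total_after:,} "
      f"(-{reduction_pct(total_before, total_after)}%)")
```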

We exceeded the 60% target and landed at a 77% reduction. Overall accuracy dropped from 97.2% to 95.4%, which the client considered a worthwhile tradeoff.

Key takeaways

Audit before optimizing. We spent two weeks just instrumenting costs per API call, per model, per document type. Without that data, we would have optimized the wrong things.

Not every task needs your biggest model. The single highest-impact change was pulling structured field extraction out of GPT-4. That regex script took a day to write and saved $5,000/month.

Distillation is underrated for production workloads. If your classification accuracy is already high (96%+), a distilled model will likely maintain acceptable performance at a fraction of the cost.

Auto-scaling to zero is powerful. If your workload is not truly 24/7, do not pay for 24/7 compute.

Wrapping up

The common instinct when AI costs spike is to negotiate API pricing or switch providers. In our experience, the bigger wins come from rethinking which model handles which task. Most production pipelines have a mix of simple and complex work, and matching model capability to task complexity is where the real savings are.

If you are running into similar cost issues with your AI-powered pipelines, start with the cost audit. You will almost certainly find that a large chunk of your spend goes to tasks that do not need your most expensive model.

I'm a software engineer at Adamo Software, where we build AI and data pipelines for clients in fintech and healthcare.

Building a Real-Time Hotel Booking Engine: How We Solved Double-Booking Across 6 OTAs

Adamo Software — Mon, 16 Mar 2026 04:15:35 +0000

Last year, our team built a centralized booking engine for a hotel chain operating 12 properties across Southeast Asia. The core challenge: their rooms were listed simultaneously on Booking.com, Expedia, Agoda, Traveloka, Trip.com, and their own direct booking site. Double-bookings were happening 3-4 times per week, each one costing the property an average of $180 in relocation fees and guest compensation.

This post breaks down the architecture we used to eliminate that problem.

Why Double-Booking Is Harder Than a Simple Database Lock

The naive solution sounds straightforward: lock the room row in the database before confirming a booking. But at scale across multiple OTA channels, this breaks down fast.

The root issue is distributed timing. When a guest clicks "Book Now" on Expedia, that request hits Expedia's servers first, then gets forwarded to the hotel's system via API. Meanwhile, another guest on Agoda books the same room type for the same dates. Both requests arrive at the booking engine within a 200-400ms window. A simple row-level lock in PostgreSQL handles sequential requests fine, but when two OTA webhooks fire near-simultaneously, the second request often reads stale availability data before the first transaction commits.

The problem gets worse with connection pooling. Under load, database connections queue up, and the gap between "read availability" and "write confirmation" widens. We measured this gap averaging 150ms in normal conditions, spiking to 800ms during peak booking hours (6-10 PM local time).

Our Architecture: Event-Driven Inventory with Optimistic Locking

We evaluated two approaches: pessimistic locking (SELECT FOR UPDATE) and optimistic locking with version control. We chose optimistic locking for one reason: pessimistic locks under high concurrency caused connection pool exhaustion in our load tests. With 6 OTAs sending concurrent requests, the lock wait times cascaded.

The architecture has three core components:

Inventory Service (Node.js): Owns the single source of truth for room availability. Every inventory record carries a version integer. When a booking request arrives, the service reads the current version, validates availability, then attempts an UPDATE with a WHERE clause matching both the room ID and the expected version number. If the version has changed between read and write, the UPDATE affects zero rows, and the service rejects the booking with a retry signal.
Message Queue (RabbitMQ): All incoming OTA booking requests land in a RabbitMQ queue before hitting the Inventory Service. This serializes concurrent requests per room-date combination using consistent hashing on the routing key (property_id.room_type.date). Two bookings for the same room on the same date always route to the same queue consumer.
Channel Sync Worker (Python): After every confirmed booking or cancellation, this worker pushes updated availability to all 6 OTA channels via their respective APIs.

-- Optimistic lock: only succeeds if version hasn't changed
UPDATE room_inventory
SET available_count = available_count - 1,
    version = version + 1
WHERE property_id = $1
  AND room_type = $2
  AND stay_date = $3
  AND version = $4
  AND available_count > 0;
-- If rows_affected = 0, another booking won the race
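
To see the zero-rows case happen, the same statement can be exercised end to end. The sketch below is an illustration only: it uses Python and an in-memory SQLite database rather than the Node.js service and PostgreSQL described here, and the helper names `read_inventory` and `confirm_booking` are hypothetical.

```python
import sqlite3

def read_inventory(conn, key):
    """Return (available_count, version) for one room-date, or None if unknown."""
    return conn.execute(
        "SELECT available_count, version FROM room_inventory "
        "WHERE property_id = ? AND room_type = ? AND stay_date = ?",
        key,
    ).fetchone()

def confirm_booking(conn, key, expected_version):
    """Decrement availability only if the version is still the one we read."""
    cur = conn.execute(
        "UPDATE room_inventory "
        "SET available_count = available_count - 1, version = version + 1 "
        "WHERE property_id = ? AND room_type = ? AND stay_date = ? "
        "AND version = ? AND available_count > 0",
        (*key, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows means another booking won the race

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE room_inventory (property_id TEXT, room_type TEXT, "
    "stay_date TEXT, available_count INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO room_inventory VALUES ('P1', 'deluxe', '2026-01-10', 2, 7)")
key = ("P1", "deluxe", "2026-01-10")

# Two requests read the same snapshot before either one writes.
_, version_a = read_inventory(conn, key)
_, version_b = read_inventory(conn, key)

first = confirm_booking(conn, key, version_a)   # version matches, booking confirmed
second = confirm_booking(conn, key, version_b)  # version is stale, zero rows updated
print(first, second, read_inventory(conn, key))
```

The second request still had a room available, but it was rejected because its version was stale. It has to re-read the inventory and try again, which is the retry signal mentioned above.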

The key design decision was using RabbitMQ's consistent hash exchange rather than a standard topic exchange. This guaranteed that competing requests for the same inventory naturally serialized through a single consumer, reducing the optimistic lock collision rate from ~12% (in our initial tests without the queue) to under 0.3%.
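
The property that matters is that one routing key always maps to one consumer. A minimal sketch of that property follows. It uses a plain hash-modulo assignment, which is a simplification of what a consistent-hash exchange does, and `routing_key` and `consumer_for` are hypothetical helper names.

```python
import hashlib

def routing_key(property_id: str, room_type: str, stay_date: str) -> str:
    """Build the routing key in the property_id.room_type.date shape."""
    return f"{property_id}.{room_type}.{stay_date}"

def consumer_for(key: str, consumer_count: int) -> int:
    """Map a routing key to a consumer index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % consumer_count

expedia_request = routing_key("P1", "deluxe", "2026-01-10")
agoda_request = routing_key("P1", "deluxe", "2026-01-10")
other_date = routing_key("P1", "deluxe", "2026-01-11")

# Competing requests for the same room-date land on the same consumer.
print(consumer_for(expedia_request, 4) == consumer_for(agoda_request, 4))
print(consumer_for(other_date, 4))
```

A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings and would route the same key differently on different workers.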

Syncing Availability Back to 6 OTAs

After a booking confirms, the Channel Sync Worker must update availability across all channels. Each OTA has a different API, different rate limits, and different latency profiles.

We learned two things the hard way:

Batch updates beat individual pushes. Booking.com's API accepts bulk availability updates, but Agoda's API at the time only supported single-date, single-room-type calls. Pushing updates one-by-one to Agoda added 4-6 seconds of total latency per booking. We switched to batching Agoda updates every 30 seconds, which was an acceptable trade-off: a 30-second window where availability might be slightly stale versus a guaranteed fast sync path for the other 5 channels.
Webhook-based sync is faster than polling, but you need a fallback. Expedia and Booking.com support outbound webhooks to notify the hotel system of new bookings. Trip.com and Traveloka did not (at the time of our integration). For channels without webhooks, we poll every 60 seconds. The polling fallback caught roughly 8% of bookings that would have otherwise created conflicts.

Handling Edge Cases: Cancellations and Partial Failures

The hardest part was not the booking itself but what happens after. When a guest cancels on Expedia, the system must release that inventory and push the updated count to the other 5 channels. If the push to Booking.com fails (network timeout, API downtime), the room stays marked as unavailable there. Multiply this across hundreds of rooms and dates, and ghost unavailability quietly eats into revenue.

We implemented a Saga pattern with compensating actions. Each availability update to an OTA channel is tracked as a step in a saga. If a step fails, the system retries with exponential backoff (3 attempts, 5s/15s/45s intervals). If all retries fail, the saga marks that channel as "dirty" and a reconciliation job runs every 15 minutes to force-sync dirty channels by pulling their current state and comparing it against the Inventory Service.
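
The retry step can be sketched as follows. This reads "3 attempts, 5s/15s/45s intervals" as one initial push followed by three retries after waits of 5, 15, and 45 seconds, which is an interpretation rather than something the text spells out. `push_with_retry` and the `push` callable are hypothetical, and `sleep` is injectable so the sketch can run without waiting.

```python
import time
from typing import Callable, List, Set

RETRY_DELAYS_S = (5, 15, 45)  # wait before each retry; each delay is 3x the previous

def push_with_retry(
    channel: str,
    push: Callable[[str], None],
    dirty_channels: Set[str],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Push an availability update; mark the channel dirty if every attempt fails."""
    for delay in (None, *RETRY_DELAYS_S):
        if delay is not None:
            sleep(delay)
        try:
            push(channel)
            return True
        except Exception:
            continue  # fall through to the next backoff step
    dirty_channels.add(channel)  # left for the reconciliation job to repair
    return False

# Demo with a push that always times out and a fake sleep that records the waits.
waits: List[float] = []
dirty: Set[str] = set()

def always_down(channel: str) -> None:
    raise TimeoutError(f"{channel} did not respond")

ok = push_with_retry("booking.com", always_down, dirty, sleep=waits.append)
print(ok, waits, dirty)
```

Marking the channel dirty instead of raising keeps one failing OTA from blocking updates to the others.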

This reconciliation loop caught an average of 23 stale records per day across all properties during the first month. After stabilizing the OTA integrations, that number dropped to 2-3 per day.
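
The comparison at the heart of such a reconciliation pass is a diff between two maps. A minimal sketch, assuming both sides can be represented as a dictionary keyed by room type and date; `find_stale_records` is a hypothetical name.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (room_type, stay_date)

def find_stale_records(
    inventory: Dict[Key, int], channel_state: Dict[Key, int]
) -> Dict[Key, int]:
    """Return the authoritative count for every record the channel has wrong or lacks."""
    return {
        key: count
        for key, count in inventory.items()
        if channel_state.get(key) != count
    }

inventory = {("deluxe", "2026-01-10"): 3, ("deluxe", "2026-01-11"): 0}
channel = {("deluxe", "2026-01-10"): 3, ("deluxe", "2026-01-11"): 1}
print(find_stale_records(inventory, channel))
```

The Inventory Service is treated as the source of truth, so the result is the set of corrections to push, not a two-way merge.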

What We Measured After 3 Months

Double-bookings dropped from 3-4 per week to zero in the first 90 days of production
Average sync latency across all 6 channels fell to 1.2 seconds for webhook-enabled OTAs and 34 seconds worst-case for polling-based channels
Optimistic lock collision rate held steady at 0.2-0.4%, meaning less than 1 in 200 concurrent booking attempts needed a retry
Reconciliation job "dirty" records stabilized at 2-3 per day, almost all from Traveloka API timeouts

What We Would Do Differently

If we started this project today, three things would change:

Redis for the hot inventory cache. We served all availability reads directly from PostgreSQL. Under peak load (Black Friday sale), query latency spiked. A Redis layer for real-time availability reads with PostgreSQL as the write-through backend would have smoothed this out.
Event sourcing from day one. We added event sourcing to the booking flow in month two after debugging a dispute where the hotel claimed a cancellation never came through. Having the full event log from the start would have saved a week of forensic work.
Contract testing for OTA APIs. Three of the six OTA APIs changed their response schemas during our 8-month engagement without prior notice. Pact-style contract tests running against sandbox environments would have caught these before production.

Written by the engineering team at Adamo Software, where we build booking engines, channel managers, and travel platforms for operators worldwide. Got questions? Drop a comment below.

How We Built a Multi-Module Healthcare Platform in 8 Months (15-Person Team, HIPAA-Compliant)

Adamo Software — Wed, 11 Mar 2026 08:54:20 +0000

Earlier this year, our team at Adamo Software shipped a full-scale healthcare platform for a European startup. The system covers telemedicine consultations, lab and imaging orders, e-prescriptions with a pharmacy bidding mechanism, and family member management across four interconnected portals: Patient, Provider, Pharmacy, and Admin.

This post is not a tutorial. It is an honest breakdown of the decisions we made, the problems we ran into, and what we would do differently. If you are building anything in healthcare, some of this might save you a few weeks of pain.

The Scope (Bigger Than It Looks)

What the client described initially sounded like "a telemedicine app." What we actually built was four separate applications sharing a common backend:

Patient portal allows users to manage family members, dependents, and caregivers under one account. Patients can book consultations (video, home visit, or in-office), order lab tests and imaging based on doctor referrals, and purchase medications through an integrated pharmacy system.
Provider portal serves clinics. Each clinic manages multiple facilities and doctors. Freelancer doctors can either join a facility or offer services independently. The portal handles consultation scheduling, clinical notes, file uploads, and prescription management.
Pharmacy portal receives prescriptions and manages a bidding system where pharmacies compete on pricing. Patients can also request prescription refills.
Admin portal ties everything together with role-based access control across all three user-facing applications.
On top of this, the entire platform had to meet HIPAA compliance standards.

Tech Stack and Why We Chose It

Frontend: ReactJS (web) + React Native (mobile)
Backend: PHP
Database: MySQL
Video: Agora
Payments: Stripe + PayPal
Clinical AI: Isabel AI (diagnostic decision support)
AI features: OpenAI integration
Project management: Jira

The client was a startup. Budget optimization was not a preference, it was a survival constraint. We chose open-source technologies and scalable cloud infrastructure with a serverless model where possible. PHP and MySQL are not trendy, but they are battle-tested, well-documented, and significantly cheaper to maintain than trendier stacks when you need to hire or scale a team quickly.

Agora was chosen for video consultations over building a custom WebRTC implementation. For a startup racing to market, spending three months on video infrastructure would have been reckless. Agora gave us stable, low-latency video with HIPAA-compliant encryption out of the box.

Isabel AI handles clinical decision support, helping doctors cross-reference symptoms against a medical knowledge base during consultations. OpenAI powers additional features like note summarization and patient-facing explanations.

Four Problems That Almost Derailed Us

1. Medical data is wide, deep, and unforgiving

Healthcare data is not like e-commerce data. A single patient record can reference lab results, imaging files, prescriptions, consultation notes, referral chains, insurance information, and family relationships. And every field has clinical implications if it is wrong.

We solved this by designing a modular architecture with clear separation between clinic, lab, imaging, and pharmacy modules. Each module owns its data domain but shares patient identity through a secure central layer. This made the system easier to reason about and significantly easier to extend when the client inevitably asked for new features mid-development.

2. Every clinical action needs expert confirmation, which slows everything down

In healthcare, a doctor writes a prescription, but a pharmacist must verify it. A doctor orders imaging, but a radiologist must confirm appropriateness. A lab result comes back, but a clinician must review it before the patient sees it. These confirmation chains are the core workflow, not an edge case.

We built an automated approval workflow with real-time notifications, audit trails, and role-based confirmation gates. Doctors, specialists, and patients interact within the same platform, and every action is logged and traceable. This reduced the approval bottleneck from days (in the client's previous manual process) to hours.
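
A role-based confirmation gate with an audit trail can be sketched in a few lines. The platform's backend is PHP, so the Python below is purely illustrative. The role names, the `REQUIRED_CONFIRMER` mapping, and the `ClinicalAction` class are assumptions drawn from the examples in the previous paragraph.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical mapping: clinical action -> role that must confirm it.
REQUIRED_CONFIRMER = {
    "prescription": "pharmacist",
    "imaging_order": "radiologist",
    "lab_result": "clinician",
}

@dataclass
class ClinicalAction:
    kind: str
    created_by: str
    status: str = "pending"
    audit_trail: List[Tuple[str, str, str]] = field(default_factory=list)

    def confirm(self, user: str, role: str) -> bool:
        """Confirm the action if the role is allowed; log the attempt either way."""
        allowed = self.status == "pending" and role == REQUIRED_CONFIRMER[self.kind]
        self.audit_trail.append((user, role, "confirmed" if allowed else "denied"))
        if allowed:
            self.status = "confirmed"
        return allowed

rx = ClinicalAction(kind="prescription", created_by="dr_nguyen")
print(rx.confirm("dr_nguyen", "doctor"))    # wrong role, attempt is logged and denied
print(rx.confirm("ph_tran", "pharmacist"))  # correct role, action moves to confirmed
print(rx.status, rx.audit_trail)
```

Denied attempts are recorded as well as successful ones, since an audit trail that only shows approvals cannot answer who tried to act without authority.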

3. A five-to-six-hour time zone gap between our team and the client

Our development team is based in Vietnam. The client is in Europe. That is a 5-6 hour gap, and the already narrow communication window shifts with daylight saving changes.

We could not rely on synchronous communication. Instead, we built a semi-automated workflow: daily async standup reports, structured Jira boards with clear acceptance criteria per ticket, and twice-weekly video syncs during the overlap window. The PM owned a daily progress digest that the client received every morning their time.

This is not glamorous, but it works. The key insight is that timezone gaps are not solved by more meetings. They are solved by better documentation and clearer task definitions so that both sides can make progress independently.

4. HIPAA compliance touches everything, not just the database

Most developers think HIPAA means "encrypt the database." It means much more. Audit logging for every data access. Role-based access control that prevents a pharmacy from seeing consultation notes. Encrypted video streams. Secure file uploads for medical images. Consent management. Data retention policies.

We addressed this at the architecture level, not as an afterthought. Every API endpoint has access control checks. Every data mutation is logged. Video sessions through Agora use end-to-end encryption. File storage uses encrypted-at-rest cloud buckets with signed URLs that expire.
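
The idea behind an expiring signed URL is that the link carries its own expiry time plus a signature over the path and that expiry, so neither can be altered. Cloud storage services provide this through their own SDKs. The sketch below is a generic HMAC illustration of the concept, not any provider's mechanism, and `SECRET`, `sign_url`, and `verify_url` are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-only-secret"  # assumption: a real key would come from a secrets manager

def _signature(path: str, expires: int) -> str:
    return hmac.new(SECRET, f"{path}:{expires}".encode("utf-8"), hashlib.sha256).hexdigest()

def sign_url(path: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Return the path with an expiry timestamp and a signature over both."""
    expires = int((time.time() if now is None else now) + ttl_s)
    return f"{path}?{urlencode({'expires': expires, 'sig': _signature(path, expires)})}"

def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it has not expired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if (time.time() if now is None else now) > expires:
        return False
    return hmac.compare_digest(sig, _signature(parsed.path, expires))

url = sign_url("/files/scan-001.png", ttl_s=300, now=1_000)
print(verify_url(url, now=1_200))  # inside the 5-minute window
print(verify_url(url, now=1_400))  # after expiry
```

Because the expiry is part of the signed payload, extending a link's lifetime by editing the query string invalidates the signature.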

If you are building healthcare software and your HIPAA plan is "we will add encryption later," stop and rethink. Retrofitting compliance into an existing system costs three to five times more than building it in from the start.

Team Structure and Timeline

Total team: 15 members

4 ReactJS developers (web)
3 React Native developers (mobile)
2 PHP backend developers
3 QA testers
1 Project Manager
1 Business Analyst
1 UI/UX Designer

Timeline: 8 months for the web application, followed by 5 months for the mobile application.

The BA was critical for this project. Healthcare workflows are not intuitive for developers. Having someone who could translate clinical requirements into technical specifications prevented dozens of misunderstandings that would have become expensive bugs later.

What We Would Do Differently

Start mobile earlier.

We built web first, then mobile. In retrospect, designing the API layer with mobile constraints in mind from day one would have saved rework. Mobile has different data fetching patterns, offline requirements, and UX constraints that affect backend design.

Invest more in the pharmacy bidding logic upfront.

The bidding system where pharmacies compete on prescription pricing was conceptually simple but operationally complex. Edge cases around partial fills, refill pricing, delivery logistics, and payment splits consumed more time than expected. We should have prototyped this module independently before integrating it.

Document clinical workflows visually before writing any code.

We did this partially, but not comprehensively enough. Healthcare has invisible dependencies everywhere. A prescription refill, for example, might require re-verification of the original diagnosis, insurance eligibility check, and pharmacy availability confirmation. Flowcharts for every clinical workflow would have caught integration issues earlier.

Takeaways

If you are building a healthcare platform, here is what I would tell you over coffee:

Modular architecture is not optional. Separate your clinical domains (consult, lab, imaging, pharmacy) cleanly. You will thank yourself when requirements change, and they will change.

HIPAA is an architecture decision, not a feature. Build compliance into your data layer, API layer, and infrastructure from the first commit.

Timezone gaps are a process problem, not a people problem. Invest in async workflows, clear documentation, and structured handoffs.

Choose boring technology for startups. PHP and MySQL are not exciting, but they ship, they scale, and your next hire already knows them.

If this kind of work interests you, we write about healthcare software development on our blog at Adamo Software.

I work at Adamo Software, a software development company that builds healthcare, travel, and AI solutions. This is a real project we delivered. Client details are anonymized per our NDA.