<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adamo Software</title>
    <description>The latest articles on Forem by Adamo Software (@adamo_software).</description>
    <link>https://forem.com/adamo_software</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818049%2F4a543a12-b6f2-4458-8a43-f6d299eeb761.jpg</url>
      <title>Forem: Adamo Software</title>
      <link>https://forem.com/adamo_software</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adamo_software"/>
    <language>en</language>
    <item>
      <title>Why travel search is harder than eCommerce search: The technical differences most developer miss</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Tue, 14 Apr 2026 02:54:30 +0000</pubDate>
      <link>https://forem.com/adamocompany/why-travel-search-is-harder-than-ecommerce-search-the-technical-differences-most-developer-miss-339f</link>
      <guid>https://forem.com/adamocompany/why-travel-search-is-harder-than-ecommerce-search-the-technical-differences-most-developer-miss-339f</guid>
      <description>&lt;p&gt;If you have built search for an e-commerce product and then moved to a travel project, you know the feeling. Everything you assumed about search breaks within the first week. The data model is different. The freshness requirements are different. The query patterns are different. Even the definition of "in stock" is different.&lt;/p&gt;

&lt;p&gt;This is not a difficulty ranking. E-commerce search has its own hard problems. But the two domains diverge in ways that catch experienced developers off guard, and understanding those differences before you start building saves weeks of rework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inventory is perishable, not stockable
&lt;/h2&gt;

&lt;p&gt;In e-commerce, a product either exists in the warehouse or it does not. A pair of shoes in size 42 is available until someone buys the last pair. The inventory state changes when a purchase happens. Between purchases, the state is stable. You can cache product availability for minutes or even hours without causing problems.&lt;/p&gt;

&lt;p&gt;In travel, inventory expires whether anyone buys it or not. A hotel room on April 15th ceases to exist on April 16th. A flight seat for the 9am departure is worthless at 9:01am. This means every search result has a built-in expiration timestamp that has nothing to do with purchase activity.&lt;/p&gt;

&lt;p&gt;The technical consequence: your caching strategy needs to account for time-based expiration, not just event-based invalidation. A hotel room that was available 30 minutes ago might still be available (nobody booked it) or might be gone (someone booked it on another channel). You cannot know without checking the source system. E-commerce developers used to comfortable cache TTLs discover that travel search requires either very short TTLs or real-time availability checks on every result, both of which have cost and latency implications.&lt;/p&gt;

&lt;h2&gt;
  
  
  One product, many sources of truth
&lt;/h2&gt;

&lt;p&gt;In e-commerce, your product catalog is your source of truth. You control it. When you update a price or mark something out of stock, the change propagates through your system. There is one database, one version of the truth.&lt;/p&gt;

&lt;p&gt;In travel, the same hotel room is sold simultaneously through the hotel's own website, Booking.com, Expedia, Agoda, and three other OTAs. Each channel has its own cached version of availability. The hotel's Property Management System (PMS) is the theoretical source of truth, but updates to the PMS propagate to each channel at different speeds through different APIs.&lt;/p&gt;

&lt;p&gt;When a developer queries availability for "hotels in Da Nang, July 15-18," the response depends on which system they queried, when they queried it, and how recently that system synced with the PMS. Two queries made 30 seconds apart can return different results, not because availability changed, but because the cache refresh cycle hit between them.&lt;/p&gt;

&lt;p&gt;In e-commerce, if your search returns a product, the customer can almost certainly buy it. In travel, if your search returns a room, the customer might click through and discover it was booked on another channel 45 seconds ago. This is why travel platforms need a confirmation step that re-checks availability at the moment of booking, a pattern that e-commerce checkout rarely requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  Price is a function, not a field
&lt;/h2&gt;

&lt;p&gt;In e-commerce, a product has a price. It might change during a sale or promotion, but at any given moment, the price is a stored value in a database column. You query it, you display it.&lt;/p&gt;

&lt;p&gt;In travel, price is computed at query time based on a combination of factors: the dates selected, the number of guests, the room type, the cancellation policy chosen, the customer's loyalty tier, the time of day, demand levels, competitor pricing, and sometimes even the customer's country of origin (due to regional pricing agreements).&lt;/p&gt;

&lt;p&gt;A single hotel room does not have "a price." It has a pricing function that returns different values depending on the parameters. This means travel search cannot simply index prices in Elasticsearch the way e-commerce search indexes product prices. You either pre-compute prices for common date and occupancy combinations (expensive in storage, complex to keep fresh) or you compute prices at search time by calling the supplier's pricing API (expensive in latency, subject to rate limits).&lt;/p&gt;

&lt;p&gt;Most travel search systems use a hybrid: pre-computed base rates for display in search results, with real-time pricing API calls when the user clicks through to a specific property. The mismatch between these two numbers is a constant source of user frustration ("the price changed when I clicked") and engineering headaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Queries are multi-dimensional by default
&lt;/h2&gt;

&lt;p&gt;An e-commerce search query is typically a keyword with optional filters. "Running shoes, size 42, black, under $100." The search engine matches keywords against product attributes and applies filters. The core operation is text matching plus filtering.&lt;/p&gt;

&lt;p&gt;A travel search query always involves at least three dimensions that interact with each other: location, dates, and occupancy. "Hotels in Hoi An, July 15-18, 2 adults 1 child." These three dimensions are not independent filters. The location determines which properties exist. The dates determine which of those properties have availability. The occupancy determines which room types within available properties can accommodate the guests.&lt;/p&gt;

&lt;p&gt;In e-commerce, you can filter sequentially: find all shoes, then filter by size, then by color. In travel, you need to evaluate all three dimensions simultaneously because a property might have availability for 2 adults on July 15-17 but not July 17-18, or it might have a room for 2 adults but not 2 adults plus 1 child on those specific dates.&lt;/p&gt;

&lt;p&gt;This multi-dimensional constraint satisfaction is why travel search is computationally heavier than e-commerce search at the query level, and why GDS (Global Distribution System) APIs are notoriously slow. They are doing constraint satisfaction across millions of inventory records, not keyword matching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sorting is subjective and context-dependent
&lt;/h2&gt;

&lt;p&gt;E-commerce search has well-understood sorting defaults: relevance, price low to high, price high to low, bestselling, newest. These are straightforward to implement because the sorting criteria are attributes stored on the product.&lt;/p&gt;

&lt;p&gt;Travel search sorting is more ambiguous. "Best" for a business traveler means close to the meeting location, with fast WiFi and a desk. "Best" for a family means close to the beach, with a pool and a kids' club. "Best" for a budget backpacker means cheapest per night with decent reviews. The same inventory, the same dates, completely different optimal orderings.&lt;/p&gt;

&lt;p&gt;This is why personalization has a larger impact on conversion in travel search than in e-commerce search. In e-commerce, sorting by "bestselling" is a reasonable default that serves most users. In travel, there is no universal default that works. The platform either invests in personalized ranking or accepts that search results will feel generic to most users.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means if you are building travel search
&lt;/h2&gt;

&lt;p&gt;If you are moving from e-commerce to travel development, these are the adjustments that will save you the most time:&lt;/p&gt;

&lt;p&gt;Rethink your caching strategy. E-commerce caching patterns (cache for 5-15 minutes, invalidate on purchase events) do not transfer. Travel search needs either much shorter TTLs or a two-phase approach: cached results for browsing, real-time confirmation at booking time. Budget for higher infrastructure costs.&lt;/p&gt;

&lt;p&gt;Separate "display price" from "booking price." Accept that the price shown in search results will sometimes differ from the price at checkout. Build the UX to handle this gracefully (price change notifications, rate locks) rather than trying to eliminate the mismatch entirely. Eliminating it is prohibitively expensive at scale.&lt;/p&gt;

&lt;p&gt;Index availability windows, not availability states. Instead of a boolean "available: true/false," store date ranges with room type and occupancy constraints. Your search index needs to answer "is this property available for these specific dates and this specific guest configuration?" not just "is this property available?"&lt;/p&gt;

&lt;p&gt;Plan for multi-source data reconciliation from day one. If you are aggregating inventory from multiple suppliers or OTAs, build a normalization layer that maps different data formats into a unified schema before it hits your search index. Do not let supplier-specific data structures leak into your search logic.&lt;/p&gt;

&lt;p&gt;Build the confirmation step into your booking flow. Unlike e-commerce where "add to cart" is low-stakes, travel search results go stale fast. The availability check at booking time is not optional. Design your UX and your API around the assumption that some results will be unavailable by the time the user clicks "book."&lt;/p&gt;




&lt;p&gt;Built by the engineering team at &lt;a href="https://adamosoft.com/" rel="noopener noreferrer"&gt;Adamo Software&lt;/a&gt;. We build custom platforms for travel, healthcare, and enterprise applications.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>architecture</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building an AI fallback system: when to use GPT-4o, when to fall back to Haiku, when to skip the LLM entirely</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Tue, 07 Apr 2026 03:40:23 +0000</pubDate>
      <link>https://forem.com/adamo_software/building-an-ai-fallback-system-when-to-use-gpt-4o-when-to-fall-back-to-haiku-when-to-skip-the-43j</link>
      <guid>https://forem.com/adamo_software/building-an-ai-fallback-system-when-to-use-gpt-4o-when-to-fall-back-to-haiku-when-to-skip-the-43j</guid>
      <description>&lt;p&gt;Not every query deserves a frontier model. A user asking "what is your cancellation policy?" does not need GPT-4o to generate the answer. A rules engine or a simple database lookup handles it in 5 milliseconds at zero token cost.&lt;/p&gt;

&lt;p&gt;We learned this the hard way. Our first production deployment sent everything through GPT-4o. The quality was great. The bill was $7,200/month for a feature that should have cost $2,000. Worse, 60% of those queries were simple enough that a smaller model (or no model at all) would have produced identical output.&lt;/p&gt;

&lt;p&gt;This article covers the three-tier fallback system we built: a rules engine for deterministic queries, a cheap model (Claude Haiku) for simple generation, and a frontier model (GPT-4o) for complex reasoning. Stack: Node.js 20, TypeScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three tiers
&lt;/h2&gt;

&lt;p&gt;Here is the routing logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incoming query
    ↓
┌─────────────────────┐
│  Tier 0: Rules      │  → deterministic lookup, no LLM
│  (FAQ, status, data)│     cost: $0, latency: &amp;lt;10ms
└─────────┬───────────┘
          ↓ not matched
┌─────────────────────┐
│  Tier 1: Haiku      │  → simple generation
│  (summaries, format)│     cost: $1/$5 per 1M tokens
└─────────┬───────────┘
          ↓ quality check fails
┌─────────────────────┐
│  Tier 2: GPT-4o     │  → complex reasoning
│  (analysis, compare)│     cost: $2.50/$10 per 1M tokens
└─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The classifier that decides the tier is itself a cheap LLM call. We use GPT-4o-mini with a one-line system prompt. The classification costs roughly $0.0001 per request, which is negligible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 0: skip the LLM entirely
&lt;/h2&gt;

&lt;p&gt;This is the highest-ROI tier because it costs nothing. Before any query hits an LLM, we check if it matches a deterministic pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;TierZeroRule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tierZeroRules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TierZeroRule&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// FAQ lookups&lt;/span&gt;
    &lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sr"&gt;/cancellation&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*policy/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/refund&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*policy/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/check.&lt;/span&gt;&lt;span class="se"&gt;?&lt;/span&gt;&lt;span class="sr"&gt;in&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*time/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/check.&lt;/span&gt;&lt;span class="se"&gt;?&lt;/span&gt;&lt;span class="sr"&gt;out&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*time/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;faqDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findBestMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Booking status (pure data lookup)&lt;/span&gt;
    &lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sr"&gt;/booking&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;status|confirmation&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/order&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*#&lt;/span&gt;&lt;span class="se"&gt;?\d&lt;/span&gt;&lt;span class="sr"&gt;+/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bookingId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractBookingId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;bookingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bookingId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Price checks (structured data, no generation needed)&lt;/span&gt;
    &lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sr"&gt;/how much.*&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;cost|price&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/price&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;for|of&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;pricingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;tryTierZero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;tierZeroRules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// no match, proceed to LLM tiers&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our system, Tier 0 catches 22% of all queries. Those are 22% of queries that never touch an LLM, never cost a token, and return in under 10 milliseconds.&lt;/p&gt;

&lt;p&gt;The key insight: do not overthink the pattern matching. Simple regex works. If a query contains "cancellation policy", you know the answer. You do not need embeddings or a classifier for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The classifier: deciding between Tier 1 and Tier 2
&lt;/h2&gt;

&lt;p&gt;For queries that pass Tier 0, we need to decide: cheap model or expensive model? We use a lightweight classifier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;classifyComplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify the user query as SIMPLE or COMPLEX.
SIMPLE: factual questions, summaries, formatting, translation, single-step tasks.
COMPLEX: comparisons, multi-step reasoning, analysis, recommendations with tradeoffs, ambiguous intent.
Reply with one word only.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toUpperCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;COMPLEX&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;complex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds about 200ms latency and costs $0.0001 per call. At 10,000 queries/day, the classifier costs $1/day total. The savings from routing 60% of queries to Haiku instead of GPT-4o are roughly $150/day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 1 and Tier 2: the fallback chain
&lt;/h2&gt;

&lt;p&gt;Here is where the fallback logic lives. We call Tier 1 first. If the response fails a quality check, we escalate to Tier 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Tier 0: deterministic&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tierZeroResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tryTierZero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tierZeroResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tierZeroResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Classify&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;classifyComplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Tier 1: Haiku&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haiku&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-haiku&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;passesQualityCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haiku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haiku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haiku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokenCost&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Fallback to Tier 2 if Haiku output is low quality&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;gpt4o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haiku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokenCost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokenCost&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Complex queries go straight to Tier 2&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;gpt4o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokenCost&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The quality check is the most important function in the entire system. A bad quality check either wastes money (escalating when unnecessary) or serves bad responses (not escalating when needed).&lt;/p&gt;

&lt;p&gt;Our quality check is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;passesQualityCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ModelResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Check 1: response is not empty or too short&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Check 2: model did not refuse or hedge excessively&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hedgePatterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sr"&gt;/i'm not sure/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sr"&gt;/i don't have.*information/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sr"&gt;/i cannot.*determine/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hedgePatterns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Check 3: response addresses the query topic&lt;/span&gt;
  &lt;span class="c1"&gt;// (simple keyword overlap check, not semantic)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queryKeywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractKeywords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseKeywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractKeywords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;queryKeywords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;responseKeywords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;queryKeywords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We intentionally kept this rule-based, not LLM-based. Using another LLM call to check quality would add latency and cost that defeats the purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results after 6 weeks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (GPT-4o only)&lt;/th&gt;
&lt;th&gt;After (3-tier)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly LLM cost&lt;/td&gt;
&lt;td&gt;$7,200&lt;/td&gt;
&lt;td&gt;$2,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg response latency&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;0.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 0 (no LLM)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 1 (Haiku)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier 2 (GPT-4o)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fallback rate (Tier 1→2)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User satisfaction (CSAT)&lt;/td&gt;
&lt;td&gt;4.3/5&lt;/td&gt;
&lt;td&gt;4.2/5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CSAT dropped by 0.1 points. We consider that acceptable for a 71% cost reduction. The drop came entirely from the 4.2% of queries where Haiku's response was served but was slightly less thorough than what GPT-4o would have produced. The quality check catches the obvious failures, but subtle quality differences slip through.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we would do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Track cost per tier from day one.&lt;/strong&gt; We only started logging which tier handled each query after week 2. Those first two weeks of data were lost, making it harder to benchmark improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with a stricter quality check and loosen it.&lt;/strong&gt; Our first quality check was too lenient. It let through some Haiku responses that should have been escalated. We tightened the keyword overlap threshold from 0.2 to 0.3, which increased the fallback rate from 2.1% to 4.2% but eliminated the worst-quality responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider a Tier 1.5.&lt;/strong&gt; We now think there is room for a middle tier between Haiku ($1/$5 per 1M tokens) and GPT-4o ($2.50/$10). Something like GPT-4o-mini or Claude Sonnet for queries that are too complex for Haiku but do not need frontier-level reasoning. We are testing this now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The single biggest cost optimization in our LLM stack was not caching, not prompt compression, not fine-tuning. It was routing queries to the right tier. 22% of queries never needed an LLM. 51% needed only a cheap model. Only 27% actually required GPT-4o.&lt;/p&gt;

&lt;p&gt;If you are running everything through a single frontier model, start by logging your queries for a week and categorizing them by complexity. You will likely find that the majority do not need frontier reasoning. Build Tier 0 first (it is free), add a classifier, and let the data tell you where the boundaries should be.&lt;/p&gt;




&lt;p&gt;Built by the engineering team at &lt;a href="https://adamosoft.com/blog/ai-development-services/how-to-build-an-ai-agent/" rel="noopener noreferrer"&gt;Adamo Software&lt;/a&gt;. We build AI-powered platforms for travel, healthcare, and enterprise applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a real-time travel search engine: lessons from integrating with GDS APIs</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Mon, 30 Mar 2026 06:48:09 +0000</pubDate>
      <link>https://forem.com/adamo_software/building-a-real-time-travel-search-engine-lessons-from-integrating-with-gds-apis-2c48</link>
      <guid>https://forem.com/adamo_software/building-a-real-time-travel-search-engine-lessons-from-integrating-with-gds-apis-2c48</guid>
      <description>&lt;p&gt;If you have ever tried to integrate with a Global Distribution System API (Amadeus, Sabre, or Travelport), you know it is not like calling a typical REST endpoint. GDS APIs were architected decades ago, carry SOAP/XML legacy patterns even in their modern REST wrappers, and enforce aggressive rate limits that can break your search experience during a traffic spike. This article shares what we learned building a travel search service that queries multiple GDS providers in parallel, normalizes their responses into a unified schema, and caches results intelligently to stay within quota limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project context
&lt;/h2&gt;

&lt;p&gt;We were building a custom booking platform for a mid-size tour operator. The core requirement: let travelers search flights, hotels, and packages from a single search bar, with results aggregated from Amadeus Self-Service APIs, a Sabre REST endpoint, and two direct hotel suppliers via proprietary APIs. The stack was Node.js (Fastify) for the search service, Redis for caching, and Elasticsearch for indexing normalized inventory.&lt;/p&gt;

&lt;p&gt;The naive approach would be to call each supplier sequentially, merge results, and return them. That gives you 3 to 5 second response times. Travelers leave after 2 seconds. So the real engineering challenge was not "can we connect to a GDS" but "can we make multi-supplier search feel instant."&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication: the part nobody warns you about
&lt;/h2&gt;

&lt;p&gt;Every GDS has its own auth flow, and they all have quirks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amadeus Self-Service&lt;/strong&gt; uses OAuth 2.0 with client credentials. You get an access token that expires in 30 minutes. Straightforward, except the token refresh has a subtle gotcha: if you refresh too aggressively under load, you can hit a secondary rate limit on the auth endpoint itself. We solved this by implementing a singleton token manager that refreshes proactively at the 25-minute mark, not on expiry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Token manager with proactive refresh&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AmadeusTokenManager&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientSecret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;getToken&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Refresh 5 minutes before expiry&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;300000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.amadeus.com/v1/security/oauth2/token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/x-www-form-urlencoded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;grant_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;client_credentials&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;client_secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clientSecret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Amadeus tokens last 1799 seconds (~30 min)&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expires_in&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sabre&lt;/strong&gt; uses a similar OAuth flow but requires a base64-encoded composite of client ID and secret in the authorization header, not in the body. It is a small difference that costs you an hour of debugging if you miss it.&lt;/p&gt;

&lt;p&gt;The lesson: do not assume GDS auth flows are interchangeable. Read each provider's docs carefully, and build a provider-specific auth module from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel search with timeout control
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k0iygwfr5snvt06t97r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9k0iygwfr5snvt06t97r.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The search service fires requests to all suppliers simultaneously using &lt;code&gt;Promise.allSettled&lt;/code&gt;. This is critical. &lt;code&gt;Promise.all&lt;/code&gt; would fail the entire search if one supplier times out. With &lt;code&gt;allSettled&lt;/code&gt;, we return whatever results come back within the timeout window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;searchAllSuppliers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timeoutMs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;suppliers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;searchAmadeus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;searchSabre&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;searchDirectHotelA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;searchDirectHotelB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="c1"&gt;// Wrap each with a timeout&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;withTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;suppliers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;race&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="p"&gt;})),&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;timeout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
          &lt;span class="nx"&gt;timeoutMs&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;withTimeout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fulfilled&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We set the timeout at 2000ms. If Amadeus responds in 800ms and Sabre takes 3 seconds, the user sees Amadeus results immediately. We log the Sabre timeout and investigate later. In practice, Amadeus Self-Service APIs responded in 400 to 1200ms for flight searches. Sabre was more variable, ranging from 300ms to 2500ms depending on route complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The caching strategy that saved our quota
&lt;/h2&gt;

&lt;p&gt;This is where most GDS integrations either succeed or blow their budget. Amadeus Self-Service provides a free monthly quota per API, with per-call charges of roughly $0.001 to $0.025 once you exceed it (&lt;a href="https://developers.amadeus.com/self-service/apis-docs/guides/developer-guides/pricing/" rel="noopener noreferrer"&gt;Amadeus Pricing&lt;/a&gt;). At 10,000 searches per day, you exhaust free quotas fast.&lt;/p&gt;

&lt;p&gt;Our caching strategy has three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Exact-match cache (Redis, TTL 5 minutes).&lt;/strong&gt; Same origin, destination, dates, passengers. This catches repeated searches from the same user session and from multiple users searching popular routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildCacheKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;departDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;returnDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pax&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`search:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;departDate&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;returnDate&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;oneway&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;pax&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;cachedSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildCacheKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;searchAllSuppliers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Cache for 5 minutes. Flight prices change,&lt;/span&gt;
  &lt;span class="c1"&gt;// but not every 30 seconds.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2: Fuzzy-match cache (Redis, TTL 15 minutes).&lt;/strong&gt; If a user searches HAN to NRT on June 15, and another searches HAN to NRT on June 14 or 16, the price structure is usually similar. We serve the cached result with a "prices may vary" indicator while firing a background refresh. This cut our API calls by roughly 40%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Prewarming popular routes (cron job, every 30 minutes).&lt;/strong&gt; We identified the top 50 route pairs from booking history and pre-fetched their availability on a schedule. This means the first search of the day for a popular route hits cache, not the GDS.&lt;/p&gt;

&lt;p&gt;Combined, these three layers reduced our actual GDS API calls by approximately 65%, keeping us well within free quotas during the first months of operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data normalization: the unglamorous but essential part
&lt;/h2&gt;

&lt;p&gt;Each GDS returns data in wildly different formats. Amadeus returns a nested JSON structure with &lt;code&gt;dictionaries&lt;/code&gt; for airline and location codes. Sabre returns a flatter structure but uses different field names and embeds fare rules differently. Direct hotel APIs return whatever the hotel decided to put in their XML feed.&lt;/p&gt;

&lt;p&gt;We built a normalization layer with a simple interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Each supplier implements this interface&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;normalizeAmadeusFlightOffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`amadeus_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;amadeus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flight&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;itineraries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;departure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iataCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;itineraries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;arrival&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iataCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;departureTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;itineraries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;departure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;arrivalTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;itineraries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;arrival&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;itineraries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;airline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;validatingAirlineCodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grandTotal&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;cabinClass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;travelerPricings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fareDetailsBySegment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;cabin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;rawSupplierData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Keep original for booking step&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;rawSupplierData&lt;/code&gt; field is important. When the user selects a result and proceeds to booking, the booking service needs the original supplier payload, not our normalized version. Normalizing for display and keeping raw data for transactions is a pattern that saved us from dozens of edge case bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  NDC support: plan for it now
&lt;/h2&gt;

&lt;p&gt;If your platform handles flights, you need to account for NDC (New Distribution Capability). Airlines like Lufthansa, American Airlines, and Singapore Airlines increasingly distribute their best fares through NDC channels rather than traditional GDS. Amadeus supports NDC through their &lt;a href="https://developers.amadeus.com/self-service/category/flights/api-doc/flight-offers-search" rel="noopener noreferrer"&gt;Flight Offers Search API&lt;/a&gt;, but the response structure has subtle differences from GDS results. Our normalization layer handles both, but it required explicit branching logic.&lt;/p&gt;

&lt;p&gt;The practical impact: if you only integrate with traditional GDS channels, you may miss 15 to 30% of available fares on NDC-forward airlines. Build your normalizer to handle both from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. GDS integration is 25 to 35% of total development effort.&lt;/strong&gt; We underestimated it on our first project. Authentication, rate limiting, error handling, data normalization, and edge cases (codeshare flights, multi-city itineraries, mixed-cabin fares) consume far more time than the booking flow itself. Our team at Adamo Software now plans &lt;a href="https://adamosoft.com/blog/travel-software-development/gds-integration/" rel="noopener noreferrer"&gt;GDS integration&lt;/a&gt; as a dedicated workstream, not a subtask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cache aggressively, but with clear invalidation rules.&lt;/strong&gt; Stale pricing leads to booking failures. We learned to set TTLs based on the product type: 5 minutes for flights (prices change frequently), 30 minutes for hotels (rates are more stable), and 2 hours for activities (rarely change intraday).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Never trust a GDS response blindly.&lt;/strong&gt; Validate prices, check that segments connect logically, and verify that the fare class is actually bookable before showing it to the user. We encountered phantom availability, where the GDS reports seats available but the airline rejects the booking, roughly 2 to 3% of the time on certain carriers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Build supplier-agnostic from day one.&lt;/strong&gt; Your search interface, result schema, and booking flow should not know or care which GDS provided the data. When we added a fourth supplier six months later, it took two days instead of two weeks because the architecture supported it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Monitor your look-to-book ratio.&lt;/strong&gt; GDS providers watch this metric. If you are making thousands of search calls but very few bookings, your pricing terms can change. Our ratio stabilized at around 1:25 (one booking per 25 searches) after implementing the caching layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Building a real-time travel search engine is less about connecting to an API and more about engineering around its constraints. Rate limits, response format inconsistencies, authentication quirks, and NDC fragmentation are the real challenges. The caching architecture alone saved us from blowing our API budget and delivered sub-second response times to users. If you are building an &lt;a href="https://adamosoft.com/blog/travel-software-development/ai-powered-travel-booking-platform/" rel="noopener noreferrer"&gt;AI-powered travel booking platform&lt;/a&gt; and planning GDS integration, invest your time in the data layer and caching strategy first. The API calls themselves are the easy part.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>tutorial</category>
      <category>api</category>
    </item>
    <item>
      <title>How we designed an AI Agent workflow with fallback chains and human-in-the-loop</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Mon, 23 Mar 2026 03:40:48 +0000</pubDate>
      <link>https://forem.com/adamo_software/how-we-designed-an-ai-agent-workflow-with-fallback-chains-and-human-in-the-loop-kdb</link>
      <guid>https://forem.com/adamo_software/how-we-designed-an-ai-agent-workflow-with-fallback-chains-and-human-in-the-loop-kdb</guid>
      <description>&lt;p&gt;If you've shipped an AI agent to production, you already know the uncomfortable truth: the demo works great, but real users find every edge case your prompt didn't anticipate. We ran into this exact problem when building an internal document processing &lt;a href="https://adamosoft.com/healthcare-software-development/" rel="noopener noreferrer"&gt;agent for a healthcare client&lt;/a&gt;. The agent worked fine 85% of the time. The other 15% ranged from "slightly wrong" to "confidently hallucinated a patient ID that doesn't exist."&lt;/p&gt;

&lt;p&gt;This post walks through the fallback architecture we built to handle those failures gracefully, without turning every request into a human review bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with linear agent workflows
&lt;/h2&gt;

&lt;p&gt;Our first version was straightforward: user uploads a document, the LLM extracts structured fields, validates against a schema, and writes to the database. A single chain, no branching logic.&lt;/p&gt;

&lt;p&gt;The failure math killed us. If each step in a 5-step workflow has 90% reliability, your end-to-end success rate drops to about 59%. Add more steps, and it gets worse fast.&lt;/p&gt;

&lt;p&gt;We needed a system where failures at any step could be caught, rerouted, and resolved without restarting the entire pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fallback chain pattern
&lt;/h2&gt;

&lt;p&gt;Instead of a single LLM call per step, we implemented a tiered fallback chain. The concept is simple: try the primary approach, and if confidence drops below a threshold, cascade to the next option.&lt;/p&gt;

&lt;p&gt;Here's the core logic in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;model_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;fallback_triggered&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FallbackChain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;confidence_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_confidence&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_triggered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; succeeded &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(confidence: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; below threshold &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;lt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# All automated steps failed, escalate to human
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_escalation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fallback_triggered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We typically set up three tiers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Primary model&lt;/strong&gt; (e.g., GPT-4o or Claude) with a specialized prompt. Fast, cost-effective for straightforward cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced model&lt;/strong&gt; with additional context injection. We pull in RAG-retrieved examples of similar documents and few-shot them into the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human escalation&lt;/strong&gt;. The request lands in a review queue with full context: the original input, what each model attempted, and where confidence dropped.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setting up the chain for document extraction
&lt;/span&gt;&lt;span class="n"&gt;extraction_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FallbackChain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confidence_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;extraction_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary_extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_llm_extract&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;extraction_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enhanced_extraction_with_rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rag_enhanced_extract&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;extraction_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue_for_human_review&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Confidence scoring: the hard part
&lt;/h2&gt;

&lt;p&gt;The fallback chain is useless without reliable confidence signals. LLM token probabilities alone are not enough. A model can be confidently wrong. Anthropic published &lt;a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents" rel="noopener noreferrer"&gt;a practical guide on evaluating agent outputs&lt;/a&gt; that covers calibration in more depth.&lt;/p&gt;

&lt;p&gt;We use a composite confidence score built from three signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_confidence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;historical_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Schema compliance: does the output match expected types/formats?
&lt;/span&gt;    &lt;span class="n"&gt;schema_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_against_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Self-consistency: run the same input 3 times, 
&lt;/span&gt;    &lt;span class="c1"&gt;#    measure agreement across outputs
&lt;/span&gt;    &lt;span class="n"&gt;consistency_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;measure_output_consistency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;historical_outputs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Field-level heuristics: known patterns for dates, IDs, codes
&lt;/span&gt;    &lt;span class="n"&gt;heuristic_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_field_heuristics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Weighted combination
&lt;/span&gt;    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;schema_score&lt;/span&gt; 
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;consistency_score&lt;/span&gt; 
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;heuristic_score&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schema compliance catches obvious failures like missing required fields or wrong data types. Self-consistency catches the subtler ones. If you run the same extraction three times and get three different patient names, something is off.&lt;/p&gt;

&lt;p&gt;The heuristic layer handles domain-specific validation. In healthcare, that means checking date formats, verifying that ICD codes match known patterns, and flagging values that fall outside clinical ranges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where human-in-the-loop actually fits
&lt;/h2&gt;

&lt;p&gt;The biggest mistake we made early on was treating HITL as a binary switch: either the agent handles it or a human does. In practice, you need multiple levels of human involvement.&lt;/p&gt;

&lt;p&gt;We settled on three escalation tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Async review.&lt;/strong&gt; The agent completed the task but confidence was borderline (0.6 to 0.75). A human reviewer sees the output alongside the original document and either approves or corrects it. This handles about 10% of requests and adds 2 to 4 hours of latency, which was acceptable for our use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2: Real-time intervention.&lt;/strong&gt; Confidence dropped below 0.6, or the agent hit a known ambiguity pattern (e.g., handwritten notes, poor scan quality). The workflow pauses, and the request routes to an available specialist through a Slack notification. We used &lt;a href="https://docs.langchain.com/oss/python/langgraph/interrupts" rel="noopener noreferrer"&gt;LangGraph's &lt;code&gt;interrupt()&lt;/code&gt; pattern&lt;/a&gt; for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interrupt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extraction_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;extraction_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_used&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_escalation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Pause the workflow and wait for human input
&lt;/span&gt;        &lt;span class="n"&gt;human_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low confidence extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempted_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;human_response&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a step-by-step walkthrough of interrupts and commands in LangGraph, &lt;a href="https://dev.to/jamesbmour/interrupts-and-commands-in-langgraph-building-human-in-the-loop-workflows-4ngl"&gt;this tutorial&lt;/a&gt; covers the basics well&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3: Full manual processing.&lt;/strong&gt; The document type is entirely outside the agent's training distribution. This happens maybe 2% of the time. The system logs the case as a training candidate for future model improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Circuit breakers for cascading failures
&lt;/h2&gt;

&lt;p&gt;One thing we learned the hard way: when an upstream model starts degrading (rate limits, API instability, model drift), it can poison every downstream step. A hallucinated field in step 1 becomes a corrupted database entry by step 4.&lt;/p&gt;

&lt;p&gt;We added circuit breakers that monitor rolling error rates per step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reset_timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;failure_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reset_timeout&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# closed = normal, open = blocking
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Circuit breaker OPEN. Routing to fallback.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;can_execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if enough time has passed to retry
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;half-open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the circuit opens, all requests for that step skip directly to the next fallback tier. This prevents a degraded model from wasting tokens and time on requests it's going to fail anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we measured
&lt;/h2&gt;

&lt;p&gt;After running this architecture for three months on the healthcare document processing pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end accuracy went from 85% to 97.3%&lt;/li&gt;
&lt;li&gt;Average latency increased by 400ms for the primary path (acceptable)&lt;/li&gt;
&lt;li&gt;Human review volume dropped from ~30% of all documents to ~12%, because the enhanced RAG fallback caught most borderline cases&lt;/li&gt;
&lt;li&gt;Zero hallucinated patient IDs made it to the database (previously ~2 per week)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest win was not the accuracy improvement itself. It was the fact that we could now quantify exactly where the system was failing and allocate human attention to the cases that actually needed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Design for failure from day one.&lt;/strong&gt; If your agent workflow has no fallback path, you're building a demo, not a production system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring needs multiple signals.&lt;/strong&gt; Token probabilities are not enough. Combine schema validation, self-consistency checks, and domain heuristics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HITL is a spectrum, not a switch.&lt;/strong&gt; Different confidence levels should trigger different levels of human involvement. Not every edge case needs real-time intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor the monitors.&lt;/strong&gt; Circuit breakers and rolling error rates prevent cascading failures from eating your entire pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Building reliable &lt;a href="https://adamosoft.com/ai-development-services/" rel="noopener noreferrer"&gt;AI agent workflows&lt;/a&gt; is less about picking the right model and more about designing the right failure modes. The fallback chain pattern gave us a structured way to degrade gracefully, and the tiered HITL approach kept humans involved where they add value without turning them into full-time babysitters for the AI.&lt;/p&gt;

&lt;p&gt;If you want to dive deeper into interrupt mechanics, the LangGraph team wrote &lt;a href="https://blog.langchain.com/making-it-easier-to-build-human-in-the-loop-agents-with-interrupt/" rel="noopener noreferrer"&gt;a solid overview of the pattern&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're building something similar, start with the confidence scoring. Everything else follows from having a reliable signal for "how much should I trust this output."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a software engineer at Adamo Software, where we build custom AI and healthcare platforms for enterprise clients.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>How we reduced AI inference costs by 60% without sacrificing accuracy</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Tue, 17 Mar 2026 06:39:42 +0000</pubDate>
      <link>https://forem.com/adamo_software/how-we-reduced-ai-inference-costs-by-60-without-sacrificing-accuracy-448h</link>
      <guid>https://forem.com/adamo_software/how-we-reduced-ai-inference-costs-by-60-without-sacrificing-accuracy-448h</guid>
      <description>&lt;p&gt;Running ML models in production is expensive. When we deployed a document classification pipeline for a &lt;a href="https://adamosoft.com/finance-software-development/" rel="noopener noreferrer"&gt;fintech client&lt;/a&gt; last year, our inference costs hit $12,000/month within the first quarter. The models were accurate, but the economics did not scale. Over 4 months, we brought that number down to $4,500/month while keeping accuracy above 95%. Here is exactly how we did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;The client needed to classify and extract data from financial documents: invoices, bank statements, tax forms, and contracts. We built a pipeline using a fine-tuned BERT model for classification and a GPT-based model for entity extraction.&lt;/p&gt;

&lt;p&gt;The stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt;: Fine-tuned BERT-large (340M params) on &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html" rel="noopener noreferrer"&gt;AWS SageMaker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt;: &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-4 API&lt;/a&gt; calls for structured data extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: ~50,000 documents/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infra&lt;/strong&gt;: SageMaker real-time endpoints, always-on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It worked well functionally. But the cost breakdown was brutal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SageMaker endpoints (24/7):    $4,200/month
GPT-4 API calls:               $6,800/month
S3 + data transfer:            $1,000/month
Total:                         $12,000/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Model distillation for classification
&lt;/h2&gt;

&lt;p&gt;BERT-large was overkill for our classification task. We had 12 document categories, and after analyzing confusion matrices, most categories were clearly separable.&lt;/p&gt;

&lt;p&gt;We distilled BERT-large into a DistilBERT model (66M params) using the standard knowledge distillation approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;DistilBertForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;BertForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DistillationTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;teacher_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;teacher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;teacher_model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;student_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;teacher_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;teacher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;teacher_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;teacher_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;

        &lt;span class="c1"&gt;# Soft target loss
&lt;/span&gt;        &lt;span class="n"&gt;soft_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kl_div&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_logits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;teacher_logits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;reduction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batchmean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Hard target loss
&lt;/span&gt;        &lt;span class="n"&gt;hard_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;soft_loss&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hard_loss&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;return_outputs&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results after distillation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy drop&lt;/strong&gt;: 97.2% → 95.8% (acceptable for our use case)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed&lt;/strong&gt;: 3.2x faster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt;: 5.1x smaller&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SageMaker cost&lt;/strong&gt;: Could now run on ml.c5.xlarge instead of ml.g4dn.xlarge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This single change cut SageMaker costs from $4,200 to $1,400/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Replace GPT-4 with targeted smaller models
&lt;/h2&gt;

&lt;p&gt;GPT-4 was our biggest cost driver. We were sending full document text to GPT-4 for entity extraction, which was like using a sledgehammer to hang a picture frame.&lt;/p&gt;

&lt;p&gt;We analyzed our extraction tasks and found three categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Structured fields&lt;/strong&gt; (invoice numbers, dates, amounts): These follow predictable patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-structured fields&lt;/strong&gt; (line items, payment terms): Some variation but bounded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured fields&lt;/strong&gt; (contract clauses, special conditions): Actually needs LLM reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For category 1, we replaced GPT-4 with regex + a small NER model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="n"&gt;INVOICE_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:invoice|inv)[\s#.:]*([A-Z0-9-]{4,20})&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:bill|receipt)[\s#.:]*([A-Z0-9-]{4,20})&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;DATE_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(\d{4}[/-]\d{1,2}[/-]\d{1,2})\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(\w+ \d{1,2},? \d{4})\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_structured_fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;INVOICE_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;DATE_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Amount extraction with currency handling
&lt;/span&gt;    &lt;span class="n"&gt;amount_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:total|amount|due)[\s:]*[\$€£]?\s*([\d,]+\.?\d*)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;amount_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;amount_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For category 2, we fine-tuned a smaller model (Llama 3 8B quantized to 4-bit) hosted on a single GPU instance.&lt;/p&gt;

&lt;p&gt;For category 3 only (about 15% of documents), we kept GPT-4 but switched to GPT-4o-mini where possible.&lt;/p&gt;

&lt;p&gt;The cost shift:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before:
  GPT-4 for all docs:          $6,800/month

After:
  Regex + NER (categories 1):  ~$0 (runs on existing infra)
  Llama 3 8B on g5.xlarge:     $900/month
  GPT-4o-mini (category 3):    $400/month
  Total extraction:            $1,300/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Batch processing and auto-scaling
&lt;/h2&gt;

&lt;p&gt;The original setup ran SageMaker endpoints 24/7, but document uploads were heavily concentrated during business hours (8 AM to 6 PM local time). Nights and weekends had near-zero traffic.&lt;/p&gt;

&lt;p&gt;We switched to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Async inference endpoints&lt;/strong&gt; with auto-scaling (min instances: 0, scale to demand)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch transform jobs&lt;/strong&gt; for bulk uploads (client uploaded batches of 500+ documents every Monday)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spot instances&lt;/strong&gt; for batch jobs (70% cheaper than on-demand)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SageMaker async endpoint config with auto-scaling
&lt;/span&gt;&lt;span class="n"&gt;scaling_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application-autoscaling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scaling_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_scalable_target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sagemaker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ResourceId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/variant/AllTraffic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScalableDimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sagemaker:variant:DesiredInstanceCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MinCapacity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MaxCapacity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scaling_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_scaling_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;PolicyName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scale-on-queue-depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sagemaker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ResourceId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/variant/AllTraffic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScalableDimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sagemaker:variant:DesiredInstanceCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PolicyType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TargetTrackingScaling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TargetTrackingScalingPolicyConfiguration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TargetValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomizedMetricSpecification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ApproximateBacklogSizePerInstance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS/SageMaker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Statistic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ScaleInCooldown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ScaleOutCooldown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dropped SageMaker costs another $600/month by eliminating idle compute during off-hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final numbers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        Before      After       Savings
Classification:         $4,200      $800        -81%
Entity extraction:      $6,800      $1,300      -81%
Infrastructure:         $1,000      $400        -60%
Batch processing:       $0          $200        (new)
Monitoring/logging:     $0          $100        (new)
                        -------     -------
Total:                  $12,000     $2,800      -77%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We actually exceeded the 60% target and landed at 77% reduction. Accuracy stayed at 95.4% overall (down from 97.2%), which the client considered a worthwhile tradeoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Audit before optimizing.&lt;/strong&gt; We spent two weeks just instrumenting costs per API call, per model, per document type. Without that data, we would have optimized the wrong things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not every task needs your biggest model.&lt;/strong&gt; The single highest-impact change was pulling structured field extraction out of GPT-4. That regex script took a day to write and saved $5,000/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distillation is underrated for production workloads.&lt;/strong&gt; If your classification accuracy is already high (96%+), a distilled model will likely maintain acceptable performance at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-scaling to zero is powerful.&lt;/strong&gt; If your workload is not truly 24/7, do not pay for 24/7 compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The common instinct when AI costs spike is to negotiate API pricing or switch providers. In our experience, the bigger wins come from rethinking which model handles which task. Most production pipelines have a mix of simple and complex work, and matching model capability to task complexity is where the real savings are.&lt;/p&gt;

&lt;p&gt;If you are running into similar cost issues with your &lt;strong&gt;&lt;a href="https://adamosoft.com/ai-development-services/" rel="noopener noreferrer"&gt;AI-powered pipelines&lt;/a&gt;&lt;/strong&gt;, start with the cost audit. You will almost certainly find that a large chunk of your spend goes to tasks that do not need your most expensive model.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a software engineer at Adamo Software, where we build AI and data pipelines for clients in fintech and healthcare.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Real-Time Hotel Booking Engine: How We Solved Double-Booking Across 6 OTAs</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Mon, 16 Mar 2026 04:15:35 +0000</pubDate>
      <link>https://forem.com/adamo_software/building-a-real-time-hotel-booking-engine-how-we-solved-double-booking-across-6-otas-3g7o</link>
      <guid>https://forem.com/adamo_software/building-a-real-time-hotel-booking-engine-how-we-solved-double-booking-across-6-otas-3g7o</guid>
      <description>&lt;p&gt;Last year, our team built a centralized booking engine for a hotel chain operating 12 properties across Southeast Asia. The core challenge: their rooms were listed simultaneously on Booking.com, Expedia, Agoda, Traveloka, Trip.com, and their own direct booking site. Double-bookings were happening 3-4 times per week, each one costing the property an average of $180 in relocation fees and guest compensation.&lt;/p&gt;

&lt;p&gt;This post breaks down the architecture we used to eliminate that problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Double-Booking Is Harder Than a Simple Database Lock
&lt;/h2&gt;

&lt;p&gt;The naive solution sounds straightforward: lock the room row in the database before confirming a booking. But at scale across multiple OTA channels, this breaks down fast.&lt;/p&gt;

&lt;p&gt;The root issue is distributed timing. When a guest clicks "Book Now" on Expedia, that request hits Expedia's servers first, then gets forwarded to the hotel's system via API. Meanwhile, another guest on Agoda books the same room type for the same dates. Both requests arrive at the booking engine within a 200-400ms window. A simple row-level lock in PostgreSQL handles sequential requests fine, but when two OTA webhooks fire near-simultaneously, the second request often reads stale availability data before the first transaction commits.&lt;/p&gt;

&lt;p&gt;The problem gets worse with connection pooling. Under load, database connections queue up, and the gap between "read availability" and "write confirmation" widens. We measured this gap averaging 150ms in normal conditions, spiking to 800ms during peak booking hours (6-10 PM local time).&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Architecture: Event-Driven Inventory with Optimistic Locking
&lt;/h2&gt;

&lt;p&gt;We evaluated two approaches: pessimistic locking (SELECT FOR UPDATE) and optimistic locking with version control. We chose optimistic locking for one reason: pessimistic locks under high concurrency caused connection pool exhaustion in our load tests. With 6 OTAs sending concurrent requests, the lock wait times cascaded.&lt;/p&gt;

&lt;p&gt;The architecture has three core components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inventory Service (Node.js)&lt;/strong&gt;: Owns the single source of truth for room availability. Every inventory record carries a version integer. When a booking request arrives, the service reads the current version, validates availability, then attempts an UPDATE with a WHERE clause matching both the room ID and the expected version number. If the version has changed between read and write, the UPDATE affects zero rows, and the service rejects the booking with a retry signal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Message Queue (RabbitMQ)&lt;/strong&gt;: All incoming OTA booking requests land in a RabbitMQ queue before hitting the Inventory Service. This serializes concurrent requests per room-date combination using consistent hashing on the routing key (property_id.room_type.date). Two bookings for the same room on the same date always route to the same queue consumer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Channel Sync Worker (Python):&lt;/strong&gt; After every confirmed booking or cancellation, this worker pushes updated availability to all 6 OTA channels via their respective APIs.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="c1"&gt;-- Optimistic lock: only succeeds if version hasn't changed&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;room_inventory&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;available_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;available_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;version&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;property_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;room_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;stay_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;available_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- If rows_affected = 0, another booking won the race&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decision was using RabbitMQ's consistent hash exchange rather than a standard topic exchange. This guaranteed that competing requests for the same inventory naturally serialized through a single consumer, reducing the optimistic lock collision rate from ~12% (in our initial tests without the queue) to under 0.3%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Syncing Availability Back to 6 OTAs
&lt;/h2&gt;

&lt;p&gt;After a booking confirms, the Channel Sync Worker must update availability across all channels. Each OTA has a different API, different rate limits, and different latency profiles.&lt;/p&gt;

&lt;p&gt;We learned two things the hard way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch updates beat individual pushes.&lt;/strong&gt; Booking.com's API accepts bulk availability updates, but Agoda's API at the time only supported single-date, single-room-type calls. Pushing updates one-by-one to Agoda added 4-6 seconds of total latency per booking. We switched to batching Agoda updates every 30 seconds, which was an acceptable trade-off: a 30-second window where availability might be slightly stale versus a guaranteed fast sync path for the other 5 channels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Webhook-based sync is faster than polling, but you need a fallback.&lt;/strong&gt; Expedia and Booking.com support outbound webhooks to notify the hotel system of new bookings. Trip.com and Traveloka did not (at the time of our integration). For channels without webhooks, we poll every 60 seconds. The polling fallback caught roughly 8% of bookings that would have otherwise created conflicts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Handling Edge Cases: Cancellations and Partial Failures
&lt;/h2&gt;

&lt;p&gt;The hardest part was not the booking itself but what happens after.&lt;br&gt;
When a guest cancels on Expedia, the system must release that inventory and push the updated count to the other 5 channels. If the push to Booking.com fails (network timeout, API downtime), the room stays marked as unavailable there. Multiply this across hundreds of rooms and dates, and ghost unavailability quietly eats into revenue.&lt;/p&gt;

&lt;p&gt;We implemented a &lt;strong&gt;Saga pattern&lt;/strong&gt; with compensating actions. Each availability update to an OTA channel is tracked as a step in a saga. If a step fails, the system retries with exponential backoff (3 attempts, 5s/15s/45s intervals). If all retries fail, the saga marks that channel as "dirty" and a reconciliation job runs every 15 minutes to force-sync dirty channels by pulling their current state and comparing it against the Inventory Service.&lt;/p&gt;

&lt;p&gt;This reconciliation loop caught an average of 23 stale records per day across all properties during the first month. After stabilizing the OTA integrations, that number dropped to 2-3 per day.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Measured After 3 Months
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Double-bookings dropped from 3-4 per week&lt;/strong&gt; to zero in the first 90 days of production&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Average sync latency across all 6 channels fell to 1.2 seconds&lt;/strong&gt; for webhook-enabled OTAs and 34 seconds worst-case for polling-based channels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimistic lock collision rate held steady at 0.2-0.4%&lt;/strong&gt;, meaning less than 1 in 200 concurrent booking attempts needed a retry&lt;br&gt;
Reconciliation job "dirty" records stabilized at 2-3 per day, almost all from Traveloka API timeouts&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What We Would Do Differently
&lt;/h2&gt;

&lt;p&gt;If we started this project today, three things would change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis for the hot inventory cache.&lt;/strong&gt; We served all availability reads directly from PostgreSQL. Under peak load (Black Friday sale), query latency spiked. A Redis layer for real-time availability reads with PostgreSQL as the write-through backend would have smoothed this out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event sourcing from day one.&lt;/strong&gt; We added event sourcing to the booking flow in month two after debugging a dispute where the hotel claimed a cancellation never came through. Having the full event log from the start would have saved a week of forensic work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contract testing for OTA APIs.&lt;/strong&gt; Three of the six OTA APIs changed their response schemas during our 8-month engagement without prior notice. Pact-style contract tests running against sandbox environments would have caught these before production.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The engineering team at &lt;strong&gt;&lt;a href="https://adamosoft.com/travel-and-hospitality-software-development/" rel="noopener noreferrer"&gt;Adamo Software&lt;/a&gt;&lt;/strong&gt;, where we build booking engines, channel managers, and travel platforms for operators worldwide. Got questions? Drop a comment below.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>webdev</category>
      <category>postgres</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>How We Built a Multi-Module Healthcare Platform in 8 Months (15-Person Team, HIPAA-Compliant)</title>
      <dc:creator>Adamo Software</dc:creator>
      <pubDate>Wed, 11 Mar 2026 08:54:20 +0000</pubDate>
      <link>https://forem.com/adamo_software/how-we-built-a-multi-module-healthcare-platform-in-8-months-15-person-team-hipaa-compliant-20l5</link>
      <guid>https://forem.com/adamo_software/how-we-built-a-multi-module-healthcare-platform-in-8-months-15-person-team-hipaa-compliant-20l5</guid>
      <description>&lt;p&gt;Earlier this year, our team at &lt;a href="https://adamosoft.com/healthcare-software-development/" rel="noopener noreferrer"&gt;Adamo Software&lt;/a&gt; shipped a full-scale healthcare platform for a European startup. The system covers telemedicine consultations, lab and imaging orders, e-prescriptions with a pharmacy bidding mechanism, and family member management across four interconnected portals: Patient, Provider, Pharmacy, and Admin.&lt;/p&gt;

&lt;p&gt;This post is not a tutorial. It is a honest breakdown of the decisions we made, the problems we ran into, and what we would do differently. If you are building anything in healthcare, some of this might save you a few weeks of pain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scope (Bigger Than It Looks)
&lt;/h2&gt;

&lt;p&gt;What the client described initially sounded like "a telemedicine app." What we actually built was four separate applications sharing a common backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Patient portal&lt;/strong&gt; allows users to manage family members, dependents, and caregivers under one account. Patients can book consultations (video, home visit, or in-office), order lab tests and imaging based on doctor referrals, and purchase medications through an integrated pharmacy system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provider portal&lt;/strong&gt; serves clinics. Each clinic manages multiple facilities and doctors. Freelancer doctors can either join a facility or offer services independently. The portal handles consultation scheduling, clinical notes, file uploads, and prescription management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pharmacy portal&lt;/strong&gt; receives prescriptions and manages a bidding system where pharmacies compete on pricing. Patients can also request prescription refills.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Admin portal&lt;/strong&gt; ties everything together with role-based access control across all three user-facing applications.&lt;br&gt;
On top of this, the entire platform had to meet HIPAA compliance standards.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tech Stack and Why We Chose It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend:&lt;/strong&gt; ReactJS (web) + React Native (mobile)&lt;br&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; PHP&lt;br&gt;
&lt;strong&gt;Database:&lt;/strong&gt; MySQL&lt;br&gt;
&lt;strong&gt;Video:&lt;/strong&gt; Agora&lt;br&gt;
&lt;strong&gt;Payments:&lt;/strong&gt; Stripe + PayPal&lt;br&gt;
&lt;strong&gt;Clinical AI:&lt;/strong&gt; Isabel AI (diagnostic decision support)&lt;br&gt;
&lt;strong&gt;AI features:&lt;/strong&gt; OpenAI integration&lt;br&gt;
&lt;strong&gt;Project management:&lt;/strong&gt; Jira&lt;/p&gt;

&lt;p&gt;The client was a startup. Budget optimization was not a preference, it was a survival constraint. We chose open-source technologies and scalable cloud infrastructure with a serverless model where possible. PHP and MySQL are not trendy, but they are battle-tested, well-documented, and significantly cheaper to maintain than trendier stacks when you need to hire or scale a team quickly.&lt;/p&gt;

&lt;p&gt;Agora was chosen for video consultations over building a custom WebRTC implementation. For a startup racing to market, spending three months on video infrastructure would have been reckless. Agora gave us stable, low-latency video with HIPAA-compliant encryption out of the box.&lt;/p&gt;

&lt;p&gt;Isabel AI handles clinical decision support, helping doctors cross-reference symptoms against a medical knowledge base during consultations. OpenAI powers additional features like note summarization and patient-facing explanations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Problems That Almost Derailed Us
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Medical data is wide, deep, and unforgiving&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Healthcare data is not like e-commerce data. A single patient record can reference lab results, imaging files, prescriptions, consultation notes, referral chains, insurance information, and family relationships. And every field has clinical implications if it is wrong.&lt;/p&gt;

&lt;p&gt;We solved this by designing a modular architecture with clear separation between clinic, lab, imaging, and pharmacy modules. Each module owns its data domain but shares patient identity through a secure central layer. This made the system easier to reason about and significantly easier to extend when the client inevitably asked for new features mid-development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Every clinical action needs expert confirmation, which slows everything down&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In healthcare, a doctor writes a prescription, but a pharmacist must verify it. A doctor orders imaging, but a radiologist must confirm appropriateness. A lab result comes back, but a clinician must review it before the patient sees it. These confirmation chains are the core workflow, not an edge case.&lt;/p&gt;

&lt;p&gt;We built an automated approval workflow with real-time notifications, audit trails, and role-based confirmation gates. Doctors, specialists, and patients interact within the same platform, and every action is logged and traceable. This reduced the approval bottleneck from days (in the client's previous manual process) to hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Twelve-hour time zone gap between our team and the client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our development team is based in Vietnam. The client is in Europe. That is a 5-6 hour gap in winter and the communication window shrinks further with daylight saving changes.&lt;/p&gt;

&lt;p&gt;We could not rely on synchronous communication. Instead, we built a semi-automated workflow: daily async standup reports, structured Jira boards with clear acceptance criteria per ticket, and twice-weekly video syncs during the overlap window. The PM owned a daily progress digest that the client received every morning their time.&lt;/p&gt;

&lt;p&gt;This is not glamorous, but it works. The key insight is that timezone gaps are not solved by more meetings. They are solved by better documentation and clearer task definitions so that both sides can make progress independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. HIPAA compliance touches everything, not just the database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most developers think HIPAA means "encrypt the database." It means much more. Audit logging for every data access. Role-based access control that prevents a pharmacy from seeing consultation notes. Encrypted video streams. Secure file uploads for medical images. Consent management. Data retention policies.&lt;/p&gt;

&lt;p&gt;We addressed this from architecture, not as an afterthought. Every API endpoint has access control checks. Every data mutation is logged. Video sessions through Agora use end-to-end encryption. File storage uses encrypted-at-rest cloud buckets with signed URLs that expire.&lt;/p&gt;

&lt;p&gt;If you are building healthcare software and your HIPAA plan is "we will add encryption later," stop and rethink. Retrofitting compliance into an existing system costs three to five times more than building it in from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Team Structure and Timeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Total team:&lt;/strong&gt; 15 members&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 ReactJS developers (web)&lt;/li&gt;
&lt;li&gt;3 React Native developers (mobile)&lt;/li&gt;
&lt;li&gt;2 PHP backend developers&lt;/li&gt;
&lt;li&gt;3 QA testers&lt;/li&gt;
&lt;li&gt;1 Project Manager&lt;/li&gt;
&lt;li&gt;1 Business Analyst&lt;/li&gt;
&lt;li&gt;1 UI/UX Designer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; 8 months for the web application, followed by 5 months for the mobile application.&lt;/p&gt;

&lt;p&gt;The BA was critical for this project. Healthcare workflows are not intuitive for developers. Having someone who could translate clinical requirements into technical specifications prevented dozens of misunderstandings that would have become expensive bugs later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Would Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start mobile earlier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built web first, then mobile. In retrospect, designing the API layer with mobile constraints in mind from day one would have saved rework. Mobile has different data fetching patterns, offline requirements, and UX constraints that affect backend design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest more in the pharmacy bidding logic upfront.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bidding system where pharmacies compete on prescription pricing was conceptually simple but operationally complex. Edge cases around partial fills, refill pricing, delivery logistics, and payment splits consumed more time than expected. We should have prototyped this module independently before integrating it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document clinical workflows visually before writing any code&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;We did this partially, but not comprehensively enough. Healthcare has invisible dependencies everywhere. A prescription refill, for example, might require re-verification of the original diagnosis, insurance eligibility check, and pharmacy availability confirmation. Flowcharts for every clinical workflow would have caught integration issues earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;If you are building a healthcare platform, here is what I would tell you over coffee:&lt;/p&gt;

&lt;p&gt;Modular architecture is not optional. Separate your clinical domains (consult, lab, imaging, pharmacy) cleanly. You will thank yourself when requirements change, and they will change.&lt;/p&gt;

&lt;p&gt;HIPAA is an architecture decision, not a feature. Build compliance into your data layer, API layer, and infrastructure from the first commit.&lt;/p&gt;

&lt;p&gt;Timezone gaps are a process problem, not a people problem. Invest in async workflows, clear documentation, and structured handoffs.&lt;/p&gt;

&lt;p&gt;Choose boring technology for startups. PHP and MySQL are not exciting, but they ship, they scale, and your next hire already knows them.&lt;/p&gt;

&lt;p&gt;If this kind of work interests you, we write about healthcare software development on our blog at &lt;a href="https://adamosoft.com/blog/healthcare-software-development/telemedicine-app-development-solutions/" rel="noopener noreferrer"&gt;Adamo Software&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I work at Adamo Software, a software development company that builds healthcare, travel, and AI solutions. This is a real project we delivered. Client details are anonymized per our NDA.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>healthcare</category>
      <category>healthcaresoftwaredevelopment</category>
      <category>digitalhealth</category>
      <category>casestudy</category>
    </item>
  </channel>
</rss>
